Key topics
- requests
- json
- re
- pprint
- Version: Anaconda 5.2.0 (Python 3.6.5)
- Editor: PyCharm
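- Define the requirement (what exactly do we want to scrape?)
  Scrape the videos that match a given keyword and save them as mp4 files.
- Use the browser's developer tools to capture and inspect the network traffic, and work out where the data really comes from (find the true data source).
  Statically loaded pages: the data is already in the page's HTML, as on Biquge (笔趣阁).
  Dynamically loaded pages: the data arrives in separate requests, which you locate by capturing packets in the developer tools.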
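One quick way to tell the two cases apart is to check whether text you can see in the rendered page also appears in the raw HTML returned by the server. The sketch below assumes the HTML has already been fetched as a string; `appears_in_static_html` and both sample pages are hypothetical and exist only for illustration.

```python
def appears_in_static_html(html: str, visible_text: str) -> bool:
    """Return True if text seen in the browser is present in the raw HTML."""
    return visible_text in html


# Static page: the content is already part of the HTML source.
static_html = "<html><body><h1>Chapter 1</h1><p>Story text</p></body></html>"

# Dynamic page: the HTML is only a shell and a script loads the data later.
dynamic_html = (
    '<html><body><div id="app"></div>'
    '<script src="app.js"></script></body></html>'
)

print(appears_in_static_html(static_html, "Story text"))   # True
print(appears_in_static_html(dynamic_html, "Story text"))  # False
```

If the text is missing from the raw HTML, the data is probably loaded by a separate request, and that request is the one to look for in the developer tools.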
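2. Code implementation steps
- Find the target URL
- Send the request (GET or POST)
- Parse the data (extract the video URL and the video title)
- Send a request to each video URL
- Save the video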
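The parsing step can be tried offline before any request is sent. The sketch below runs it against a mock response shaped like the `data` / `visionSearchPhoto` / `feeds` structure that the code in this article reads; the captions and URLs are made-up placeholders, and `safe_title` is a hypothetical helper name.

```python
import re

# Mock response shaped like the search result read later in this article.
# The captions and URLs are placeholders, not real data.
mock_json = {
    "data": {
        "visionSearchPhoto": {
            "feeds": [
                {"photo": {"caption": "funny: clip?",
                           "photoUrl": "https://example.com/a.mp4"}},
                {"photo": {"caption": "line one\nline two",
                           "photoUrl": "https://example.com/b.mp4"}},
            ]
        }
    }
}


def safe_title(caption: str) -> str:
    """Replace characters that Windows does not allow in file names."""
    return re.sub(r'[\\/:*?"<>|\n]', "_", caption)


def parse(json_data: dict) -> list:
    """Return (title, url) pairs for every feed in the response."""
    feeds = json_data["data"]["visionSearchPhoto"]["feeds"]
    return [
        (safe_title(feed["photo"]["caption"]), feed["photo"]["photoUrl"])
        for feed in feeds
    ]


for title, url in parse(mock_json):
    print(title, "->", url)
```

Keeping the parse step free of network calls makes it easy to check the title cleanup on awkward captions before downloading anything.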

3. Single video

Import the required modules

    import json
    import os
    import re

    import requests

Send the request

    data = {
        'operationName': "visionSearchPhoto",
        'query': "query visionSearchPhoto($keyword: String, $pcursor: String, $searchSessionId: String, $page: String, $webPageArea: String) {\nvisionSearchPhoto(keyword: $keyword, pcursor: $pcursor, searchSessionId: $searchSessionId, page: $page, webPageArea: $webPageArea) {\nresult\nllsid\nwebPageArea\nfeeds {\ntype\nauthor {\nid\nname\nfollowing\nheaderUrl\nheaderUrls {\ncdn\nurl\n__typename\n}\n__typename\n}\ntags {\ntype\nname\n__typename\n}\nphoto {\nid\nduration\ncaption\nlikeCount\nrealLikeCount\ncoverUrl\nphotoUrl\nliked\ntimestamp\nexpTag\ncoverUrls {\ncdn\nurl\n__typename\n}\nphotoUrls {\ncdn\nurl\n__typename\n}\nanimatedCoverUrl\nstereoType\nvideoRatio\n__typename\n}\ncanAddComment\ncurrentPcursor\nllsid\nstatus\n__typename\n}\nsearchSessionId\npcursor\naladdinBanner {\nimgUrl\nlink\n__typename\n}\n__typename\n}\n}\n",
        'variables': {
            'keyword': '张三',
            'pcursor': ' ',
            'page': "search",
            'searchSessionId': "MTRfMjcwOTMyMTQ2XzE2Mjk5ODcyODQ2NTJf5oWi5pGHXzQzMQ"
        }
    }
    response = requests.post('https://www.kuaishou.com/graphql', data=data)

Add request headers

    headers = {
        # Content-Type describes the format of the request body (the data argument).
        # Common values include:
        #   application/x-www-form-urlencoded: ordinary form fields
        #   application/json: a JSON string
        #   text/xml: XML sent as a file
        #   multipart/form-data: used for file uploads
        'content-type': 'application/json',
        # Identifies the user
        'Cookie': 'kpf=PC_WEB; kpn=KUAISHOU_VISION; clientid=3; did=web_721a784b472981d650bcb8bbc5e9c9c2',
        # Browser information (makes the request look like it comes from a browser)
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    }

JSON serialization

    # JSON is a data interchange format. Before JSON appeared, XML was the usual way to exchange data.
    # Most languages support JSON and JSON supports several data types,
    # so it is widely used for everyday HTTP interaction and data storage.
    # Encode the Python object as a JSON string
    data = json.dumps(data)
    json_data = requests.post('https://www.kuaishou.com/graphql', headers=headers, data=data).json()

Extract values from the dictionary, then request and save each video

    feeds = json_data['data']['visionSearchPhoto']['feeds']
    os.makedirs('video', exist_ok=True)  # the output folder must exist before writing
    for feed in feeds:
        caption = feed['photo']['caption']
        photoUrl = feed['photo']['photoUrl']
        # Replace characters that are not allowed in Windows file names
        new_title = re.sub(r'[\\/:*?"<>|\n]', '-', caption)
        # Send another request, this time for the video itself
        resp = requests.get(photoUrl).content
        # Save the data
        with open(os.path.join('video', new_title + '.mp4'), mode='wb') as f:
            f.write(resp)
        print(new_title, 'downloaded successfully!')
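The reason for calling `json.dumps` before posting is that the body has to match the `application/json` content type. The sketch below uses only the standard library to contrast a form-encoded body with a JSON body; the flat `payload` is a simplified stand-in for the real request data and is an assumption made for illustration.

```python
import json
from urllib.parse import urlencode

# Simplified, flat stand-in for the real payload (illustrative assumption).
payload = {"operationName": "visionSearchPhoto", "page": "search"}

form_body = urlencode(payload)   # what a form-encoded body looks like
json_body = json.dumps(payload)  # what an application/json body looks like

print(form_body)  # operationName=visionSearchPhoto&page=search
print(json_body)  # {"operationName": "visionSearchPhoto", "page": "search"}

# The JSON body decodes back into the original dictionary.
assert json.loads(json_body) == payload
```

`requests.post` also accepts a `json=` argument, which serializes the dictionary and sets the `Content-Type` header itself, so it can be used in place of `data=json.dumps(...)`.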

4. Paginated scraping

Import the modules (this section reuses json, re, requests, and headers from the previous section)

    import concurrent.futures
    import os
    import time

Send the request

    def get_json(url, data):
        response = requests.post(url, headers=headers, data=data).json()
        return response

Clean up the title

    def change_title(title):
        # Windows file names cannot contain certain special characters
        # and are limited in length (roughly 255 characters),
        # so titles longer than 50 characters are cut to their first 10.
        new_title = re.sub(r'[/\\|:?<>"*\n]', '_', title)
        if len(new_title) > 50:
            new_title = new_title[:10]
        return new_title

Extract the data

    def parse(json_data):
        data_list = json_data['data']['visionSearchPhoto']['feeds']
        info_list = []
        for data in data_list:
            # Extract the title
            title = data['photo']['caption']
            new_title = change_title(title)
            url_1 = data['photo']['photoUrl']
            info_list.append([new_title, url_1])
        return info_list

Save the data

    def save(title, url_1):
        resp = requests.get(url_1).content
        os.makedirs('video', exist_ok=True)  # the output folder must exist before writing
        with open(os.path.join('video', title + '.mp4'), mode='wb') as f:
            f.write(resp)
        print(title, 'downloaded successfully!')

Main function that calls all the others

    def run(url, data):
        """Main function: calls all the other functions."""
        json_data = get_json(url, data)
        info_list = parse(json_data)
        for title, url_1 in info_list:
            save(title, url_1)

    if __name__ == '__main__':
        start_time = time.time()
        with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
            for page in range(1, 5):
                url = 'https://www.kuaishou.com/graphql'
                data = {
                    'operationName': "visionSearchPhoto",
                    'query': "query visionSearchPhoto($keyword: String, $pcursor: String, $searchSessionId: String, $page: String, $webPageArea: String) {\nvisionSearchPhoto(keyword: $keyword, pcursor: $pcursor, searchSessionId: $searchSessionId, page: $page, webPageArea: $webPageArea) {\nresult\nllsid\nwebPageArea\nfeeds {\ntype\nauthor {\nid\nname\nfollowing\nheaderUrl\nheaderUrls {\ncdn\nurl\n__typename\n}\n__typename\n}\ntags {\ntype\nname\n__typename\n}\nphoto {\nid\nduration\ncaption\nlikeCount\nrealLikeCount\ncoverUrl\nphotoUrl\nliked\ntimestamp\nexpTag\ncoverUrls {\ncdn\nurl\n__typename\n}\nphotoUrls {\ncdn\nurl\n__typename\n}\nanimatedCoverUrl\nstereoType\nvideoRatio\n__typename\n}\ncanAddComment\ncurrentPcursor\nllsid\nstatus\n__typename\n}\nsearchSessionId\npcursor\naladdinBanner {\nimgUrl\nlink\n__typename\n}\n__typename\n}\n}\n",
                    'variables': {
                        'keyword': '曹芬',
                        # 'keyword': keyword,
                        'pcursor': str(page),
                        'page': "search",
                        'searchSessionId': "MTRfMjcwOTMyMTQ2XzE2Mjk5ODcyODQ2NTJf5oWi5pGHXzQzMQ"
                    }
                }
                data = json.dumps(data)
                executor.submit(run, url, data)
        # The with block waits for every submitted task, so this covers all downloads
        print('Total time taken:', time.time() - start_time)
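One thing to be aware of with `executor.submit` is that an exception raised inside a worker is stored on the returned future and stays silent unless `result()` is called. The sketch below shows one way to surface failures per page; `download_page` is a hypothetical stand-in for `run(url, data)` that fails on page 3 on purpose, so no network access is needed.

```python
import concurrent.futures


def download_page(page: int) -> str:
    """Hypothetical stand-in for run(url, data); fails on page 3 on purpose."""
    if page == 3:
        raise ValueError(f"page {page} failed")
    return f"page {page} done"


results, errors = [], []
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    # Map each future back to the page number it was submitted for.
    futures = {
        executor.submit(download_page, page): page for page in range(1, 5)
    }
    for future in concurrent.futures.as_completed(futures):
        page = futures[future]
        try:
            results.append(future.result())  # re-raises any worker exception
        except Exception as exc:
            errors.append((page, str(exc)))

print(sorted(results))
print(errors)
```

Collecting the failed page numbers this way makes it possible to retry only those pages instead of rerunning the whole job.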

The run took 57.7 seconds.
