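URL analysis
Website: https://www.dbbqb.com/
Open any meme; its URL looks like this:
https://www.dbbqb.com/detail/320000.html
Changing the number in the URL loads a different meme, which gives the URL construction rule:
https://www.dbbqb.com/detail/<meme number>.html

Page analysis
Opening the developer tools (F12) shows that the page loads its content via AJAX.
On the XHR tab, one field of the JSON response matches the image URL.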
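As a rough illustration of that observation, the sketch below extracts the image location from a JSON response body. Only the `path` key and the `https://image.dbbqb.com/` prefix come from the scraper code in this post. The sample payload, its `path` value, and the helper name `image_url_from_json` are made up for illustration, and the real response may carry other fields.

```python
import json
from typing import Optional

# Image host used by the scraper code in this post.
IMAGE_HOST = 'https://image.dbbqb.com/'


def image_url_from_json(text: str) -> Optional[str]:
    """Return the full image URL for an API response body.

    Returns None when the body is not a JSON object or has no 'path' field.
    """
    data = json.loads(text)
    if not isinstance(data, dict):
        return None
    path = data.get('path')
    if not path:
        return None
    return IMAGE_HOST + path


# Hypothetical payload, not captured from the real site.
sample = '{"path": "example/abc.jpg"}'
print(image_url_from_json(sample))
```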
API endpoint construction rule:
https://www.dbbqb.com/api/image/<meme number>

Project structure
The files can be created from a shell:
    touch main.py
    mkdir image

Code
    from threading import Thread
    import json
    import os

    import requests

    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4947.3 Safari/537.36'
    HEADERS = {'User-Agent': USER_AGENT}


    def download_image(url: str, num: int):
        # Download the image at the given URL and save it as image/<num>.jpg
        response = requests.get(url, headers=HEADERS)
        with open(os.path.join('image', f'{num}.jpg'), 'wb') as f:
            f.write(response.content)


    def download(image_num: int):
        # Fetch the meme with the given ID through the API.
        # ':path' is an HTTP/2 pseudo-header derived from the URL, so it is not set manually.
        url = f'https://www.dbbqb.com/api/image/{image_num}'  # build the API URL
        response = requests.get(url, headers=HEADERS)
        response.encoding = 'utf-8'
        if response.status_code != 200:  # guard against missing IDs and other failures
            print(f'Error (ID: {image_num})')
            return
        data = json.loads(response.text)
        try:
            path = data['path']
        except KeyError:
            print(f'Bad JSON data: {data} (ID: {image_num})')
            return
        img_url = f'https://image.dbbqb.com/{path}'
        download_image(img_url, image_num)
        print(f'Downloaded meme (ID: {image_num})')


    def main():
        threads = []  # no thread queue here, to keep things short
        for i in range(1, 320001):
            # Note: a one-element tuple in Python needs the trailing comma
            th = Thread(target=download, args=(i,))
            threads.append(th)
        for t in threads:
            t.start()


    if __name__ == '__main__':
        main()

Note that some IDs have no meme, so error messages will be printed for them; this is normal.
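Starting one thread per ID means 320,000 threads are created at once. One possible alternative, sketched here on the assumption that a fixed number of worker threads is acceptable, is `concurrent.futures.ThreadPoolExecutor` from the standard library. The helper `run_bounded` and the `max_workers` values are illustrative choices, not part of the original script. `fake_download` stands in for `download` so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def run_bounded(worker: Callable[[int], None], ids: Iterable[int],
                max_workers: int = 32) -> int:
    """Call worker(i) for every i in ids using at most max_workers threads.

    Returns the number of ids whose worker raised an exception.
    """
    failures = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, i) for i in ids]
        for fut in futures:
            # exception() blocks until the task has finished
            if fut.exception() is not None:
                failures += 1
    return failures


done: List[int] = []


def fake_download(image_num: int) -> None:
    # Stand-in for download(); list.append is safe to call from several threads in CPython
    done.append(image_num)


failed = run_bounded(fake_download, range(1, 101), max_workers=8)
print(len(done), failed)
```

In main.py the same helper could be called as `run_bounded(download, range(1, 320001))` in place of the loop that builds and starts the threads.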
Result
[Parallel crawler example: scraping 320,000 memes with Python] Partial screenshots:
