
About Weibo's anti-crawling mechanism #54

Open
XueSeason opened this issue Apr 6, 2017 · 3 comments

Comments

@XueSeason

Weibo does limit crawl frequency these days, but looking at it from Weibo's side, they face a real dilemma: they want search engines to be able to index their data, yet they also need to keep other crawlers from putting pressure on their servers.

I've tried changing my crawler's UA to that of a search-engine bot such as Baidu's, and without a simulated login I could crawl a lot of data at high frequency. This tip could be added to the README.

@LiuXingMing
Owner

LiuXingMing commented Apr 6, 2017

Generally speaking, a site decides whether a crawler is a search engine not only by the UA but also by the corresponding IP range. Are you sure that setting Baidu's crawler UA lets you crawl at a higher frequency? Without logging in, a lot of content shouldn't be viewable at all. Which UA are you using? I'll give it a try.
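To illustrate the IP-range point above: a site that wants to verify a "Baiduspider" claim typically does a reverse DNS lookup on the client IP, checks that the hostname falls under a Baidu domain, then does a forward lookup to confirm the hostname resolves back to that IP. A minimal sketch (the domain suffixes and function names are illustrative assumptions, not Weibo's actual implementation):

```python
import socket

# Hostname suffixes commonly associated with Baidu's crawler;
# treated as an assumption here, not an authoritative list.
BAIDU_SUFFIXES = (".baidu.com", ".baidu.jp")

def hostname_claims_baidu(hostname: str) -> bool:
    """Pure string check: does a reverse-DNS hostname fall under a Baidu domain?"""
    host = hostname.rstrip(".").lower()
    return host.endswith(BAIDU_SUFFIXES)

def verify_baiduspider(ip: str) -> bool:
    """Reverse DNS on the IP, suffix check, then forward DNS to confirm.
    A UA string alone proves nothing; this is what checking the IP side means."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except socket.herror:
        return False
    if not hostname_claims_baidu(hostname):
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # forward confirm
    except socket.gaierror:
        return False
    return ip in forward_ips
```

A site doing this double lookup would reject a spoofed Baiduspider UA coming from an ordinary IP, which is why a UA swap alone working against Weibo is somewhat surprising.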

@XueSeason
Author

I can't speak for other sites, but I have tested this on Weibo. Crawling with a simulated account login, my IP got banned as soon as I sped up; with the Baidu UA, I could still fetch data after raising the frequency, and it avoids the risk of getting the account banned.

UA for reference:

'User-Agent': 'Baiduspider+(+http://www.baidu.com/search/spider.htm)'
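The header above could be attached like this (a sketch using Python's stdlib `urllib`; the target URL is a placeholder and no request is actually sent in this snippet, and whether Weibo still tolerates this UA is untested):

```python
import urllib.request

# UA string from the comment above.
BAIDU_UA = "Baiduspider+(+http://www.baidu.com/search/spider.htm)"

req = urllib.request.Request(
    "https://weibo.com/some-public-page",  # placeholder URL, not a real endpoint
    headers={"User-Agent": BAIDU_UA},
)
# urllib capitalizes header names internally ("User-agent"),
# so the spoofed UA is stored and would be sent on urlopen(req).
print(req.headers)
```

Even with the spoofed UA, it would be sensible to keep a delay between requests, since the discussion above suggests Weibo also rate-limits by IP.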

@XueSeason
Author

But as you said, a lot of information really can't be viewed without logging in. Still, for simple needs like scraping follower counts and the latest posts, it's an option worth considering.
