[DISCUSSION] Captcha #142

Closed
PaulMcInnis opened this issue Mar 30, 2021 · 8 comments

Comments

@PaulMcInnis
Owner

Hey everyone,

It seems that Indeed and others have caught on to scraping and have taken action to stop it.

We can integrate web-driven scraping but this is not easily automated or tested.

I think this may be a serious problem for this tool in general. The regexes we have built still work, but captcha catches the scrapers very easily, after fewer than a hundred jobs or so.

Does anyone have any ideas to help with this issue?

@PaulMcInnis
Owner Author

One option is to go the route of a web-driven scraper. Perhaps this tool could be made into some kind of browser extension?

@PaulMcInnis
Owner Author

Another option is to forgo scraping detailed job information entirely, but this will significantly degrade the matching and data quality.

@Nllii

Nllii commented Mar 30, 2021

I tried using this code from geohot a couple of years back, but I never got it to work. It's not practical code, just a doodle.

https://github.com/geohot/lolrecaptcha

https://www.blackhat.com/docs/asia-16/materials/asia-16-Sivakorn-Im-Not-a-Human-Breaking-the-Google-reCAPTCHA-wp.pdf

@PaulMcInnis
Owner Author

Well, one aspect of this is that I don't want to automate the captcha dodging, since I think that is ethically dubious, but I think we may have other options for the workflow.

One datapoint that I'm having a bit of trouble collecting is how many jobs, on average, one can scrape before getting captcha'd (a None error on the detail scrape).
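
For anyone who wants to report that datapoint, here is a minimal sketch of how it could be counted. It assumes each run is recorded as a list of detail-scrape results in scrape order, with `None` marking the first blocked job; the function names and that list layout are hypothetical and are not JobFunnel's actual API.

```python
from statistics import mean
from typing import Optional, Sequence


def jobs_before_captcha(details: Sequence[Optional[str]]) -> int:
    """Count the detail scrapes that succeeded before the first None result."""
    for count, detail in enumerate(details):
        if detail is None:
            return count
    return len(details)  # no None seen, so this run was never blocked


def average_jobs_before_captcha(runs: Sequence[Sequence[Optional[str]]]) -> float:
    """Average the per-run counts over several recorded runs (hypothetical helper)."""
    return mean(jobs_before_captcha(run) for run in runs)
```

Calling `average_jobs_before_captcha` on a handful of recorded runs would give a rough figure to compare across users and job sites.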

@aseams

aseams commented May 13, 2021

Maybe it could pick from a list of proxies? That would probably get rid of the captcha altogether.
Edit: I'd also like to add that at exactly 200 jobs, I got the captcha treatment.

@PaulMcInnis
Owner Author

Yeah, I get dinged pretty quickly nowadays. I figure I'm on their $hit list 😆

Not a bad idea about the proxies. That would be an interesting feature, so I'll create a little feature-stub for this.

@Nllii

Nllii commented May 19, 2021

> yeah I get dinged pretty quick nowadays, I figure i'm on their $hit list 😆
>
> Not a bad idea around the proxies, that would be an interesting feature, I'll create a little feature-stub for this.

For proxies I have used https://github.com/TheSpeedX/PROXY-List , mainly to get around mega.io limiting uploads and downloads.
https://github.com/tonikelope/megabasterd.git : MBD has a feature where, once it gets throttled, it switches to the next proxy in the list. I haven't used proxies on JobFunnel yet. Can't wait to try it out if I get blocked.

P.S. YouTube still gives me a captcha, but only once a week now. It had been every 4 hours since December 2020. I think they are outsourcing machine learning labels to me.
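
The "move to the next proxy once throttled" behaviour described for MBD could be sketched roughly as follows. This is only an illustration under stated assumptions: `Throttled`, `fetch_with_rotation`, and the `fetch(url, proxy)` callable are hypothetical names, not part of JobFunnel or MegaBasterd, and how a throttle or captcha is detected is left to the caller.

```python
from itertools import cycle
from typing import Callable, Iterable, Optional


class Throttled(Exception):
    """Raised by the (hypothetical) fetch callable when the site throttles or captchas us."""


def fetch_with_rotation(
    fetch: Callable[[str, str], str],
    url: str,
    proxies: Iterable[str],
    max_attempts: int = 5,
) -> Optional[str]:
    """Call fetch(url, proxy), switching to the next proxy whenever it is throttled."""
    proxy_list = list(proxies)
    if not proxy_list:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxy_list)  # wraps back to the first proxy after the last
    for _ in range(max_attempts):
        proxy = next(pool)
        try:
            return fetch(url, proxy)
        except Throttled:
            continue  # this proxy looks blocked, so try the next one in the list
    return None  # every attempt was throttled
```

With the `requests` library, `fetch` might wrap `requests.get(url, proxies={"http": proxy, "https": proxy})` and raise `Throttled` when the response looks like a captcha page, though that detection step is an assumption and would need tuning per site.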

@PaulMcInnis
Owner Author

Based on this discussion, we will move forwards with #145
