[DISCUSSION] Captcha #142

PaulMcInnis · 2021-03-30T00:21:02Z

Hey everyone,

It seems that Indeed and others have caught on to scraping and have taken action to stop it.

We can integrate web-driven scraping, but this is not easily automated or tested.

I think this may be a serious problem for this tool in general. The regexes we have built still work, but the captcha is catching the scrapers very easily, after fewer than a hundred jobs or so.

Does anyone have any ideas to help with this issue?

PaulMcInnis · 2021-03-30T00:21:38Z

One option is to go the route of a web-driven scraper. Perhaps this tool could be made into some kind of browser extension?

PaulMcInnis · 2021-03-30T00:29:24Z

Another option is to forgo scraping detailed job information entirely, but this will significantly degrade the matching and data quality.

Nllii · 2021-03-30T01:02:13Z

I tried using this code from geohot a couple of years back, but I never got it to work. It's not practical code, just a doodle.

https://github.com/geohot/lolrecaptcha

https://www.blackhat.com/docs/asia-16/materials/asia-16-Sivakorn-Im-Not-a-Human-Breaking-the-Google-reCAPTCHA-wp.pdf

PaulMcInnis · 2021-03-30T13:52:47Z

Well, one aspect of this is that I don't want to automate the captcha dodging, since I think that is ethically dubious, but I think we may have other options for the workflow.

One data point that I'm having a bit of trouble collecting is how many jobs, on average, one can scrape before getting captcha'd (None error on detail scrape).
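
A minimal sketch of how that number could be collected, assuming the detail scrape returns `None` once the captcha kicks in. The names `count_until_blocked` and `scrape_detail` are hypothetical and are not part of JobFunnel; the scraper is passed in as a callable so the counter can be tried without hitting a real site.

```python
from typing import Callable, Iterable, Optional


def count_until_blocked(
    job_urls: Iterable[str],
    scrape_detail: Callable[[str], Optional[dict]],
) -> int:
    """Return how many jobs were scraped before the first None result.

    Assumption: a None result from the detail scrape means we were captcha'd.
    """
    scraped = 0
    for url in job_urls:
        if scrape_detail(url) is None:
            # Stop at the first blocked request; later results are unreliable.
            break
        scraped += 1
    return scraped
```

Running this over several sessions and averaging the returned counts would give a rough estimate of the threshold.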

aseams · 2021-05-13T01:47:18Z

Maybe it could pick from a list of proxies? That would probably get rid of the captcha altogether.
Edit: I'd also like to add that at exactly 200 jobs, I got the captcha treatment.

PaulMcInnis · 2021-05-19T17:47:32Z

Yeah, I get dinged pretty quickly nowadays. I figure I'm on their $hit list 😆

Not a bad idea about the proxies; that would be an interesting feature. I'll create a little feature-stub for this.

Nllii · 2021-05-19T19:37:33Z

> yeah I get dinged pretty quick nowadays, I figure i'm on their $hit list 😆
>
> Not a bad idea around the proxies, that would be an interesting feature, I'll create a little feature-stub for this.

For proxies I have used https://github.com/TheSpeedX/PROXY-List , mainly because mega.io limits uploads and downloads.
https://github.com/tonikelope/megabasterd.git : MBD has a feature where it picks the next proxy in the list once it gets throttled. I haven't used proxies on JobFunnel yet. Can't wait to try it out if I get blocked.
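
A rough sketch of that rotate-on-throttle idea, under stated assumptions: `ProxyRotator` and `fetch_with_rotation` are hypothetical names, not MegaBasterd or JobFunnel code, and the `fetch` callable is assumed to return `None` when a request is throttled or blocked.

```python
from typing import Callable, Optional, Sequence


class ProxyRotator:
    """Walks through a fixed proxy list, wrapping around at the end."""

    def __init__(self, proxies: Sequence[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._index = 0

    @property
    def current(self) -> str:
        return self._proxies[self._index]

    def advance(self) -> str:
        """Switch to the next proxy in the list and return it."""
        self._index = (self._index + 1) % len(self._proxies)
        return self.current


def fetch_with_rotation(
    url: str,
    fetch: Callable[[str, str], Optional[str]],
    rotator: ProxyRotator,
    max_attempts: int,
) -> Optional[str]:
    """Try the current proxy; on a None result, rotate and try again."""
    for _ in range(max_attempts):
        result = fetch(url, rotator.current)
        if result is not None:
            return result
        # Assumption: None means this proxy was throttled or blocked.
        rotator.advance()
    return None
```

The rotator keeps its position between calls, so a proxy that has just been throttled is not retried until the rest of the list has been used.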

P.S. YouTube still gives me a captcha once a week now. It was every 4 hours since December 2020; now it's once a week. I think they are outsourcing machine learning labels to me.

PaulMcInnis · 2021-09-21T15:27:24Z

Based on this discussion, we will move forward with #145.

PaulMcInnis added the enhancement and help wanted labels Mar 30, 2021

PaulMcInnis added this to the 4.0 milestone Mar 30, 2021

thebigG mentioned this issue Jun 1, 2021

[PROPOSAL] Decouple The Web Engine #145

Closed

PaulMcInnis closed this as completed Sep 21, 2021
