
[PROPOSAL] Decouple The Web Engine #145

Closed
thebigG opened this issue Jun 1, 2021 · 2 comments

thebigG (Collaborator) commented Jun 1, 2021

Hi there

Hope you are all doing well!

Is your feature request related to a problem? Please describe.
This is related to several problems that have appeared recently (CAPTCHA), but also to issues we have had in the past (dynamically loaded websites such as Glassdoor). See issues #144 and #142.

Describe the solution you'd like
I think CAPTCHA-related problems could be solved by taking the approach suggested in #142, using https://github.com/pgaref/HTTP_Request_Randomizer. However, I think the best way to approach this would be to make the web engine (using Selenium) a factory. Instead of having the web engine be part of the Job class, it could be decoupled altogether into a function that looks something like:

def get_web_engine(headless: bool, *args, **kwargs):
    # Grab a fresh random proxy for every new engine instance
    proxy = get_random_proxy()
    # Initialize the Selenium-backed web engine with that proxy
    engine = init_web_engine(*args, proxy=proxy, headless=headless, **kwargs)
    ...
    return engine

This way, if we hit a CAPTCHA at any step of scraping (whether while getting the description, the number of job pages, etc.), we can just request a new web engine from the function above, which comes with a new proxy.
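To illustrate, a minimal sketch of that retry flow (the names scrape_page and CaptchaDetected are placeholders I'm making up here, not existing code):

    def scrape_with_retries(url: str, max_retries: int = 3):
        # Each CAPTCHA hit throws the old engine away and asks the factory
        # for a fresh one, which comes with a new random proxy.
        engine = get_web_engine(headless=True)
        for _ in range(max_retries):
            try:
                return scrape_page(engine, url)
            except CaptchaDetected:
                engine.quit()
                engine = get_web_engine(headless=True)
        raise RuntimeError(f"Still hitting CAPTCHA for {url} after {max_retries} tries")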

As you can see, this also implies switching to Selenium, which I suppose I'm proposing here as well. The reason is that with Selenium we support both static and dynamic sites. And it looks like the web drivers do have headless support, the lack of which was one of the main reasons we didn't use Selenium in the past.
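For reference, creating a headless Selenium driver that routes traffic through a proxy would look roughly like this (Chrome is just an example; the factory could pick any driver, and get_random_proxy is still the assumed helper from the snippet above):

    from selenium import webdriver

    def init_web_engine(proxy: str, headless: bool):
        options = webdriver.ChromeOptions()
        if headless:
            options.add_argument("--headless")
        # Route all traffic for this driver instance through the given proxy
        options.add_argument(f"--proxy-server={proxy}")
        return webdriver.Chrome(options=options)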

Describe alternatives you've considered
So far this is the only way I can think of to tackle this. If anyone has other ideas, please don't hesitate to provide feedback!

Additional context

Hope these ideas make sense.
Cheers
Lorenzo

PaulMcInnis (Owner) commented Jun 20, 2021

It may also be worth looking into what other web scraping services do, as there are commercial offerings that provide capabilities similar to jobfunnel's.

Other stopgaps are falling back to Selenium on scrape failure, or more configurability for VPNs (i.e. switching VPNs after N scrapes or on a scrape failure).
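A rough sketch of the "switch after N scrapes" idea, reusing the hypothetical get_web_engine factory from the proposal above (switching an actual VPN would need external tooling, so this only rotates the proxy-backed engine):

    class RotatingEngine:
        """Hand out one engine until it has served N scrapes, then rebuild
        it so it comes up with a fresh proxy."""

        def __init__(self, max_scrapes: int = 50):
            self.max_scrapes = max_scrapes
            self.count = 0
            self.engine = get_web_engine(headless=True)

        def get(self):
            if self.count >= self.max_scrapes:
                self.rotate()
            self.count += 1
            return self.engine

        def rotate(self):
            # Also called explicitly on a scrape failure
            self.engine.quit()
            self.engine = get_web_engine(headless=True)
            self.count = 0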

We can fairly easily detect the "I am human" page. In the short term I think we should provide a better error for Indeed specifically when this page is detected.
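As a sketch, that short-term check could look something like this (the marker strings and exception name are my assumptions, not Indeed's actual markup):

    class CaptchaDetected(Exception):
        """Raised when the scraper lands on a human-verification page."""

    CAPTCHA_MARKERS = ("hcaptcha", "verify you are human", "i am human")

    def check_for_captcha(page_source: str, url: str) -> None:
        # Fail with a descriptive error instead of a cryptic parse failure
        text = page_source.lower()
        if any(marker in text for marker in CAPTCHA_MARKERS):
            raise CaptchaDetected(
                f"Indeed served a human-verification page for {url}; "
                "try again later or from a different IP/proxy."
            )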

As an aside I just tested it now and got to ~66 scrapes before the CAPTCHA, oh well.

thebigG (Collaborator, Author) commented Jun 20, 2021

> As an aside I just tested it now and got to ~66 scrapes before the CAPTCHA, oh well.

Right. I noticed this too a couple of weeks back, and it is exactly why I thought the factory pattern for Selenium might be a good fit. If a scrape fails (and, like you said, we should have better error-detection mechanisms for when a CAPTCHA shows up), then we just send the request via a new random proxy.

Repository owner locked and limited conversation to collaborators Sep 21, 2021

This issue was moved to a discussion.

You can continue the conversation there.
