
[PROPOSAL] Decouple The Web Engine #145

Closed
thebigG opened this issue Jun 1, 2021 · 2 comments

thebigG (Collaborator) commented Jun 1, 2021

Hi there

Hope you are all doing well!

Is your feature request related to a problem? Please describe.
This is related to several problems that have appeared recently (CAPTCHA), but also to issues we have had in the past (dynamically loaded websites such as Glassdoor). See issues #144 and #142.

Describe the solution you'd like
I think CAPTCHA-related problems could be solved by taking the approach suggested in #142, using https://github.com/pgaref/HTTP_Request_Randomizer. However, I think the best way to approach this would be to make the web engine (using Selenium) a factory. Instead of having the web engine be part of the Job class, it could be decoupled altogether into a function that looks something like:

def get_web_engine(headless: bool, *args, **kwargs):
    # Grab a fresh random proxy for every new engine instance
    proxy = get_random_proxy()
    # Initialize the Selenium-backed web engine with that proxy
    engine = init_web_engine(*args, proxy=proxy, headless=headless, **kwargs)
    ...
    return engine

This way, if we hit a CAPTCHA at any step of scraping (whether while getting the description, the number of job pages, etc.), we can just request a new web engine from the function above, which comes with a new proxy.
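To illustrate, a minimal sketch of that retry flow (the names scrape_page and CaptchaDetected are placeholders I'm making up here, not existing code):

    def scrape_with_retries(url: str, max_retries: int = 3):
        # Each CAPTCHA hit throws the old engine away and asks the factory
        # for a fresh one, which comes with a new random proxy.
        engine = get_web_engine(headless=True)
        for _ in range(max_retries):
            try:
                return scrape_page(engine, url)
            except CaptchaDetected:
                engine.quit()
                engine = get_web_engine(headless=True)
        raise RuntimeError(f"Still hitting CAPTCHA for {url} after {max_retries} tries")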

As you can see, this also implies switching to Selenium, which I suppose I'm proposing here as well. The reason is that with Selenium we support both static and dynamic sites. And it looks like the web drivers do have headless support, the lack of which was one of the main reasons we didn't use Selenium in the past.
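For reference, creating a headless Selenium driver that routes traffic through a proxy would look roughly like this (Chrome is just an example; the factory could pick any driver, and get_random_proxy is still the assumed helper from the snippet above):

    from selenium import webdriver

    def init_web_engine(proxy: str, headless: bool):
        options = webdriver.ChromeOptions()
        if headless:
            options.add_argument("--headless")
        # Route all traffic for this driver instance through the given proxy
        options.add_argument(f"--proxy-server={proxy}")
        return webdriver.Chrome(options=options)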

Describe alternatives you've considered
So far this is the only way I can think of to tackle this. If anyone has other ideas, please don't hesitate to provide feedback!

Additional context

Hope these ideas make sense.
Cheers
Lorenzo

PaulMcInnis (Owner) commented Jun 20, 2021

It may also be worth looking into what other web scraping services do, as there are commercial offerings that provide capabilities similar to jobfunnel's.

Other stopgaps are falling back to Selenium on scrape failure, or more configurability for VPNs (i.e. switching VPNs after N scrapes or on a scrape failure).
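A rough sketch of the "switch after N scrapes" idea, reusing the hypothetical get_web_engine factory from the proposal above (switching an actual VPN would need external tooling, so this only rotates the proxy-backed engine):

    class RotatingEngine:
        """Hand out one engine until it has served N scrapes, then rebuild
        it so it comes up with a fresh proxy."""

        def __init__(self, max_scrapes: int = 50):
            self.max_scrapes = max_scrapes
            self.count = 0
            self.engine = get_web_engine(headless=True)

        def get(self):
            if self.count >= self.max_scrapes:
                self.rotate()
            self.count += 1
            return self.engine

        def rotate(self):
            # Also called explicitly on a scrape failure
            self.engine.quit()
            self.engine = get_web_engine(headless=True)
            self.count = 0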

We can fairly easily detect the "I am human" page. In the short term I think we should provide a better error for Indeed specifically when this page is detected.
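As a sketch, that short-term check could look something like this (the marker strings and exception name are my assumptions, not Indeed's actual markup):

    class CaptchaDetected(Exception):
        """Raised when the scraper lands on a human-verification page."""

    CAPTCHA_MARKERS = ("hcaptcha", "verify you are human", "i am human")

    def check_for_captcha(page_source: str, url: str) -> None:
        # Fail with a descriptive error instead of a cryptic parse failure
        text = page_source.lower()
        if any(marker in text for marker in CAPTCHA_MARKERS):
            raise CaptchaDetected(
                f"Indeed served a human-verification page for {url}; "
                "try again later or from a different IP/proxy."
            )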

As an aside I just tested it now and got to ~66 scrapes before the CAPTCHA, oh well.

thebigG (Collaborator, Author) commented Jun 20, 2021

> As an aside I just tested it now and got to ~66 scrapes before the CAPTCHA, oh well.

Right. I noticed this too a couple of weeks back, and it is exactly why I thought the factory pattern for Selenium might be a good fit. If a scrape fails (and, like you said, we should have better error-detection mechanisms for when a CAPTCHA shows up), then we just send the request via a new random proxy.

Repository owner locked and limited conversation to collaborators Sep 21, 2021

This issue was moved to a discussion.

You can continue the conversation there.
