
Issues scraping large cities #123

Closed
oosokoya opened this issue Nov 16, 2020 · 7 comments

@oosokoya

Description

Over the last few weeks I've had trouble using JobFunnel to scrape jobs in large cities (e.g. New York, Atlanta). Smaller cities such as Oklahoma City seem to be fine (under 5 pages, with under 300 jobs to scrape). Larger cities often have over 27 pages and 1300+ jobs to scrape, which seems to cause a problem: after the run completes, error messages are displayed (shown in the Actual behavior section) and no Excel file is created.

Note that I installed JobFunnel onto a new machine and encountered the exact same problem.

Steps to Reproduce

Example search

locale: USA_ENGLISH
State: NY
City : New York
Radius: 30 miles
Key Words: Project Manager

All other settings are default

Expected behavior

Indeed and Monster sites are scraped and an Excel file with the results is created.

Actual behavior

The scraping process completes, but the following error message is generated and no Excel file is created:

File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\base.py", line 196, in scrape
    job_soups = self.get_job_soups_from_search_result_listings()
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\monster.py", line 224, in get_job_soups_from_search_result_listings
    __get_job_soups_by_key_id(next_listings_page_soup)
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\monster.py", line 206, in __get_job_soups_by_key_id
    return {
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\monster.py", line 207, in
    self.get(JobField.KEY_ID, job_soup): job_soup
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\monster.py", line 109, in get
    return soup.find('h2', attrs={'class': 'title'}).find('a').get(
AttributeError: 'NoneType' object has no attribute 'find'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\Scripts\funnel-script.py", line 33, in
    sys.exit(load_entry_point('JobFunnel==3.0.1', 'console_scripts', 'funnel')())
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\__main__.py", line 28, in main
    job_funnel.run()
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\jobfunnel.py", line 114, in run
    scraped_jobs_dict = self.scrape()
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\jobfunnel.py", line 236, in scrape
    incoming_jobs_dict = scraper.scrape()
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\base.py", line 198, in scrape
    raise ValueError(
ValueError: Unable to extract jobs from initial search result page:
'NoneType' object has no attribute 'find'
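The final AttributeError comes from a chained lookup, soup.find('h2', ...).find('a').get(...), which assumes the search-results markup is always present; a CAPTCHA or error page breaks that assumption. A minimal sketch of a None-safe version, runnable standalone with a stub Tag class standing in for BeautifulSoup (the stub and the attribute name are illustrative, not Monster's actual markup):

```python
# None-safe extraction sketch. 'Tag' is a tiny stand-in for a
# BeautifulSoup tag so this example runs without bs4 installed;
# the 'data-m_impr_j_jobid' attribute name is an illustrative guess.

class Tag:
    def __init__(self, attrs=None, children=None):
        self.attrs = attrs or {}
        self.children = children or []

    def find(self, name=None, attrs=None):
        # Return the first child whose attributes match, else None
        # (mirroring BeautifulSoup returning None on a miss).
        for child in self.children:
            if attrs is None or all(child.attrs.get(k) == v
                                    for k, v in attrs.items()):
                return child
        return None

    def get(self, key):
        return self.attrs.get(key)


def safe_key_id(soup):
    """Return the job's key id, or None if the markup is unexpected
    (e.g. a CAPTCHA interstitial instead of a results page)."""
    title = soup.find('h2', attrs={'class': 'title'})
    if title is None:
        return None
    link = title.find('a')
    if link is None:
        return None
    return link.get('data-m_impr_j_jobid')
```

With a guard like this, the scraper could log and skip the bad listing instead of crashing the whole run.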

Possible issues

I noticed on one of the scrapes that when I pasted the link into a browser, a CAPTCHA page came up asking me to verify that I wasn't a robot. Could it be that larger scrapes trigger the CAPTCHA, causing the scrape to error and the whole process to fail?
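If the CAPTCHA hypothesis is right, one possible mitigation would be to check the fetched HTML for CAPTCHA markers before parsing, so the failure is reported clearly instead of surfacing as an AttributeError deep in the soup code. A hedged sketch (the marker strings are guesses, not Indeed's or Monster's actual markup):

```python
# Heuristic CAPTCHA detection for a fetched search-results page.
# The marker strings below are illustrative guesses.

CAPTCHA_MARKERS = ("hcaptcha", "recaptcha", "verify you are a human")

def looks_like_captcha(html: str) -> bool:
    """Return True if the page looks like a CAPTCHA interstitial."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```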

Let me know if you require more information.

Environment

Windows 10 machine

@oosokoya oosokoya added the bug label Nov 16, 2020
@thebigG
Collaborator

thebigG commented Nov 17, 2020

Interesting. Will try to reproduce.

Yes, we have had issues with CAPTCHA in the past, and we have even managed to work around it. But it's really tricky: basically we have to use Selenium, which literally opens up a browser window so that the user has the ability to solve a CAPTCHA if one comes up. The problem with this approach is that it is very slow compared to static scraping (which is what is done currently) and is not as smooth an experience for users as what we have now.

Like I said, I will try to reproduce and will give you more feedback on your issue.

@thebigG
Collaborator

thebigG commented Nov 17, 2020

Quick question: which site gave you the CAPTCHA? Or was it both of them?

@oosokoya
Author

I noticed the CAPTCHA on the Indeed site. I'll do some troubleshooting and see if anything occurs on the Monster site.

@thebigG
Collaborator

thebigG commented Nov 17, 2020

Thanks for the quick response! That's a new one. I'm currently running an instance of JobFunnel with your keywords/args and it has not crashed so far. Will keep you posted.

@thebigG
Collaborator

thebigG commented Nov 17, 2020

I was able to reproduce! I highly suspect you got the same error as me. Do you mind checking the log generated by JobFunnel? It should be under a folder with a name similar to ...search_results, and the log file should be called log.log. If you can find it, check whether there is an error message in there akin to "share duplicate key_id:".

This looks like an issue with the following snippet of code in base.py:

                if job:
                    # Handle inter-scraped data duplicates by key.
                    # TODO: move this functionality into duplicates filter
                    if job.key_id in jobs_dict:
                        self.logger.error(
                            "Job %s and %s share duplicate key_id: %s",
                            job.title, jobs_dict[job.key_id].title, job.key_id
                        )
                    else:
                        jobs_dict[job.key_id] = job
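The first-wins behavior of that branch can be exercised standalone; this sketch mirrors it with plain (key_id, title) tuples (the function and variable names are illustrative, not JobFunnel's API):

```python
import logging

def collect_jobs(jobs, logger=logging.getLogger("demo")):
    """Keep the first job seen per key_id and log later collisions,
    mirroring the duplicate-handling branch in base.py above."""
    jobs_dict = {}
    for key_id, title in jobs:
        if key_id in jobs_dict:
            # A collision silently drops the later job from the results.
            logger.error("Job %s and %s share duplicate key_id: %s",
                         title, jobs_dict[key_id], key_id)
        else:
            jobs_dict[key_id] = title
    return jobs_dict
```

Note that a collision only logs; the second job is dropped, so widespread key collisions would quietly shrink the result set even before any crash.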

Don't have any more time tonight to investigate this further because it's getting kind of late 😅, but if I had to guess, it looks like there is a key conflict in the job dictionary for some reason.

Will investigate further when I get more time tomorrow evening.

Thanks so much for bringing this to our attention!

Cheers!

@oosokoya
Author

I have just checked the logs and can see the same thing: "share duplicate key_id:"

@thebigG
Collaborator

thebigG commented Nov 21, 2020

Haven't had time to look at this issue in depth; I've had my hands full with JobFunnel testing at the moment. For now, as a quick fix, you can comment out one of your providers in your settings file like so:

# - INDEED

and scrape using one scraper at a time.
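For reference, a hypothetical excerpt of what that settings file might look like with Indeed disabled (the key names are assumed, so verify them against your own YAML file):

```yaml
# Hypothetical settings.yaml excerpt; exact key names may differ.
providers:
  # - INDEED   # commented out to avoid the crash for now
  - MONSTER
```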

Hope this helps!
