
Issues scraping large cities #123

Closed
oosokoya opened this issue Nov 16, 2020 · 7 comments

@oosokoya

Description

Over the last few weeks I've had trouble using JobFunnel to scrape jobs in large cities (e.g. New York, Atlanta). Smaller cities such as Oklahoma City seem to be fine (under 5 pages, with under 300 jobs to scrape). Larger cities often have over 27 pages and 1300+ jobs to scrape, which seems to cause a problem: after the run completes, error messages are displayed (shown in the Actual behavior section) and no Excel file is created.

Note that I installed JobFunnel onto a new machine and encountered the exact same problem.

Steps to Reproduce

Example search

locale: USA_ENGLISH
State: NY
City : New York
Radius: 30 miles
Key Words: Project Manager

All other settings are default

Expected behavior

Indeed and Monster sites are scraped and an Excel file with the results is created.

Actual behavior

The scraping process completes, but the following error message is generated and no Excel file is created:

File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\base.py", line 196, in scrape
    job_soups = self.get_job_soups_from_search_result_listings()
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\monster.py", line 224, in get_job_soups_from_search_result_listings
    __get_job_soups_by_key_id(next_listings_page_soup)
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\monster.py", line 206, in __get_job_soups_by_key_id
    return {
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\monster.py", line 207, in
    self.get(JobField.KEY_ID, job_soup): job_soup
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\monster.py", line 109, in get
    return soup.find('h2', attrs={'class': 'title'}).find('a').get(
AttributeError: 'NoneType' object has no attribute 'find'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\Scripts\funnel-script.py", line 33, in
    sys.exit(load_entry_point('JobFunnel==3.0.1', 'console_scripts', 'funnel')())
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\__main__.py", line 28, in main
    job_funnel.run()
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\jobfunnel.py", line 114, in run
    scraped_jobs_dict = self.scrape()
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\jobfunnel.py", line 236, in scrape
    incoming_jobs_dict = scraper.scrape()
File "C:\Users\Bukky\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\jobfunnel\backend\scrapers\base.py", line 198, in scrape
    raise ValueError(
ValueError: Unable to extract jobs from initial search result page:
'NoneType' object has no attribute 'find'
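The final AttributeError comes from a chained lookup, soup.find('h2', ...).find('a').get(...), which assumes the search-results markup is always present; a CAPTCHA or error page breaks that assumption. A minimal sketch of a None-safe version, runnable standalone with a stub Tag class standing in for BeautifulSoup (the stub and the attribute name are illustrative, not Monster's actual markup):

```python
# None-safe extraction sketch. 'Tag' is a tiny stand-in for a
# BeautifulSoup tag so this example runs without bs4 installed;
# the 'data-m_impr_j_jobid' attribute name is an illustrative guess.

class Tag:
    def __init__(self, attrs=None, children=None):
        self.attrs = attrs or {}
        self.children = children or []

    def find(self, name=None, attrs=None):
        # Return the first child whose attributes match, else None
        # (mirroring BeautifulSoup returning None on a miss).
        for child in self.children:
            if attrs is None or all(child.attrs.get(k) == v
                                    for k, v in attrs.items()):
                return child
        return None

    def get(self, key):
        return self.attrs.get(key)


def safe_key_id(soup):
    """Return the job's key id, or None if the markup is unexpected
    (e.g. a CAPTCHA interstitial instead of a results page)."""
    title = soup.find('h2', attrs={'class': 'title'})
    if title is None:
        return None
    link = title.find('a')
    if link is None:
        return None
    return link.get('data-m_impr_j_jobid')
```

With a guard like this, the scraper could log and skip the bad listing instead of crashing the whole run.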

Possible issues

I noticed on one of the scrapes that when I pasted the link into a browser, a CAPTCHA page came up asking me to verify that I wasn't a robot. Could it be that larger scrapes trigger the CAPTCHA, causing the scrape to error and the whole process to fail?
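If the CAPTCHA hypothesis is right, one possible mitigation would be to check the fetched HTML for CAPTCHA markers before parsing, so the failure is reported clearly instead of surfacing as an AttributeError deep in the soup code. A hedged sketch (the marker strings are guesses, not Indeed's or Monster's actual markup):

```python
# Heuristic CAPTCHA detection for a fetched search-results page.
# The marker strings below are illustrative guesses.

CAPTCHA_MARKERS = ("hcaptcha", "recaptcha", "verify you are a human")

def looks_like_captcha(html: str) -> bool:
    """Return True if the page looks like a CAPTCHA interstitial."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```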

Let me know if you require more information.

Environment

Windows 10 machine

@oosokoya oosokoya added the bug label Nov 16, 2020
@thebigG
Collaborator

thebigG commented Nov 17, 2020

Interesting. Will try to reproduce.

Yes, we have had issues with CAPTCHA in the past, and we have even managed to work around it. But it's really tricky: basically we have to use Selenium, which literally opens up a browser window so that the user has the ability to solve a CAPTCHA if one comes up. The problem with this approach is that it is very slow compared to static scraping (which is what is done currently) and is not as smooth an experience for users as what we have now.

Like I said, I will try to reproduce and will give you more feedback on your issue.

@thebigG
Collaborator

thebigG commented Nov 17, 2020

Quick question: which site gave you the CAPTCHA? Or was it both of them?

@oosokoya
Author

I noticed the CAPTCHA on the Indeed site. I'll do some troubleshooting and see if anything occurs on the Monster site.

@thebigG
Collaborator

thebigG commented Nov 17, 2020

Thanks for the quick response! That's a new one. I'm currently running an instance of JobFunnel with your keywords/args and it has not crashed so far. Will keep you posted.

@thebigG
Collaborator

thebigG commented Nov 17, 2020

I was able to reproduce! I highly suspect you got the same error as me. Do you mind checking the log generated by JobFunnel? It should be under a folder with a name similar to ...search_results, and the log file should be called log.log. If you can find it, check whether there is an error message in there akin to "share duplicate key_id:".

This looks like an issue with the following snippet of code in base.py:

                if job:
                    # Handle inter-scraped data duplicates by key.
                    # TODO: move this functionality into duplicates filter
                    if job.key_id in jobs_dict:
                        self.logger.error(
                            "Job %s and %s share duplicate key_id: %s",
                            job.title, jobs_dict[job.key_id].title, job.key_id
                        )
                    else:
                        jobs_dict[job.key_id] = job
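The first-wins behavior of that branch can be exercised standalone; this sketch mirrors it with plain (key_id, title) tuples (the function and variable names are illustrative, not JobFunnel's API):

```python
import logging

def collect_jobs(jobs, logger=logging.getLogger("demo")):
    """Keep the first job seen per key_id and log later collisions,
    mirroring the duplicate-handling branch in base.py above."""
    jobs_dict = {}
    for key_id, title in jobs:
        if key_id in jobs_dict:
            # A collision silently drops the later job from the results.
            logger.error("Job %s and %s share duplicate key_id: %s",
                         title, jobs_dict[key_id], key_id)
        else:
            jobs_dict[key_id] = title
    return jobs_dict
```

Note that a collision only logs; the second job is dropped, so widespread key collisions would quietly shrink the result set even before any crash.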

Don't have any more time tonight to investigate this further because it's getting kind of late 😅, but if I had to guess, it looks like there is a key conflict in the job dictionary for some reason.

Will investigate further when I get more time tomorrow evening.

Thanks so much for bringing this to our attention!

Cheers!

@oosokoya
Author

I have just checked the logs and can see the same thing: "share duplicate key_id:"

@thebigG
Collaborator

thebigG commented Nov 21, 2020

Haven't had time to look at this issue in depth; I've had my hands full with JobFunnel testing at the moment. For now, as a quick fix, you can comment out one of your providers in your settings file like so:

# - INDEED

and scrape using one scraper at a time.
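For reference, a hypothetical excerpt of what that settings file might look like with Indeed disabled (the key names are assumed, so verify them against your own YAML file):

```yaml
# Hypothetical settings.yaml excerpt; exact key names may differ.
providers:
  # - INDEED   # commented out to avoid the crash for now
  - MONSTER
```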

Hope this helps!
