Released version 2.1.8 failed on March-28-2021 #140

Closed
Nllii opened this issue Mar 29, 2021 · 7 comments
@Nllii

Nllii commented Mar 29, 2021

Description

pip3 install git+https://github.com/PaulMcInnis/[email protected]
I have been using this for a couple of years now. For some reason, it failed.
What I have done so far:

  1. Deleted all the data files in search (master_list.csv, jobfunnel.log, jobs_2021-03-22.pkl, jobs_2021-03-28.pkl, filter_list.json).
  2. Disabled adblocker.

Error

admin@Admins-MacBook-Pro ~ % bash job.sh                                                        
finding you jobs
jobfunnel initialized at 2021-03-28
no master-list, filter-list was not updated
jobfunnel indeed to pickle running @ 2021-03-28
failed to scrape Indeed: 'NoneType' object has no attribute 'contents'
jobfunnel monster to pickle running @ 2021-03-28
failed to scrape Monster: 'NoneType' object has no attribute 'text'
jobfunnel glassdoor to pickle running @ 2021-03-28
failed to scrape GlassDoor: 'NoneType' object has no attribute 'text'
Traceback (most recent call last):
  File "/usr/local/bin/funnel", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/jobfunnel/__main__.py", line 48, in main
    jp.update_masterlist()
  File "/usr/local/lib/python3.8/site-packages/jobfunnel/jobfunnel.py", line 358, in update_masterlist
    raise ValueError("No scraped jobs, cannot update masterlist")
ValueError: No scraped jobs, cannot update masterlist
DONE
admin@Admins-MacBook-Pro ~ % 

Apart from Google and YouTube insisting on a captcha every 3 hours for my IP, this has become unusable. The traffic coming from my machine is this code running.
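For readers hitting the same message: an error such as `'NoneType' object has no attribute 'text'` typically appears when an element lookup returns `None` (for example after a site changes its markup) and the code then reads an attribute from the result. The sketch below reproduces that pattern with the standard library's `xml.etree.ElementTree` as a stand-in parser. The page snippets and function names are hypothetical, and this is not JobFunnel's actual parsing code.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical markup: the "new" page no longer contains the <span>
# element that the lookup below expects.
OLD_PAGE = "<div><span id='count'>Page 1 of 4 jobs</span></div>"
NEW_PAGE = "<div><p id='total'>4 jobs</p></div>"


def result_count_text(page: str) -> Optional[str]:
    """Return the text of the <span> child, or None if it is missing."""
    root = ET.fromstring(page)
    node = root.find("span")  # find() returns None when nothing matches
    if node is None:
        return None
    return node.text


def unguarded(page: str) -> str:
    """Same lookup without the None check; raises AttributeError on NEW_PAGE."""
    return ET.fromstring(page).find("span").text
```

With the guard in place, a caller can log "element not found" and skip the listing instead of aborting the whole provider.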

@Nllii Nllii added the bug label Mar 29, 2021
@Nllii Nllii changed the title from "Released version 2.1.8 failed on March-28-2021, But works on kaggle WHY?" to "Released version 2.1.8 failed on March-28-2021" Mar 29, 2021
@PaulMcInnis
Owner

PaulMcInnis commented Mar 29, 2021

We are working on a release with a number of changes which may help.

Are you able to test the current master of this repository?
This is best done by installing in-place. You should back up any masterlist and filterlists first.
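A minimal sketch of such a backup, assuming the files live where the logs in this thread show them (`search/master_list.csv` and `search/data/filter_list.json`). The function name and the `backups` folder are hypothetical, not part of JobFunnel.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Iterable, List


def backup_files(paths: Iterable[Path], backup_root: Path) -> List[Path]:
    """Copy each existing file into a new timestamped folder under backup_root."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"backup-{stamp}"
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in paths:
        # Skip files that do not exist, e.g. a filter list that was never created.
        if src.is_file():
            copied.append(Path(shutil.copy2(src, target / src.name)))
    return copied


if __name__ == "__main__":
    # Paths assumed from the log output in this issue; adjust to your setup.
    saved = backup_files(
        [Path("search/master_list.csv"), Path("search/data/filter_list.json")],
        Path("backups"),
    )
    print("backed up:", [str(p) for p in saved])
```

Files are stored flat by name, so two inputs with the same file name would overwrite each other inside one backup folder.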

@PaulMcInnis
Owner

Seems we are having issues with scraping due to a regex.

@PaulMcInnis
Owner

OK just merged a PR that may fix this, but you should try using current master

@Nllii
Author

Nllii commented Mar 30, 2021

Are you able to test the current master of this repository?
Yes, I checked out the master repo last year, but I had to revert to 2.1.8. 2.1.8 was faster and straightforward, nothing fancy.

I don't know if this helps, but if the end user already has a copy of 2.1.8 on Kaggle and re-runs it, this is the outcome:
https://www.kaggle.com/bellphegor/job-search

  1. It filters the jobs with --max_listing_days 2 and finds jobs on Indeed to add to the CSV file after filtering.
  2. Then it fails when the end user runs it again.
  3. Why does it fail the second time it is run? I will try to get the current masterlist and filterlists from Kaggle to reproduce the outcome.

jobfunnel indeed to pickle running @ 2021-03-29
Found 4 indeed results for query=phlebotomist
getting indeed page 0 : http://www.indeed.com/jobs?q=phlebotomist&l=HOUSTON%2C+TX&radius=25&limit=50&filter=0&start=0
getting indeed page 1 : http://www.indeed.com/jobs?q=phlebotomist&l=HOUSTON%2C+TX&radius=25&limit=50&filter=0&start=50
getting indeed page 2 : http://www.indeed.com/jobs?q=phlebotomist&l=HOUSTON%2C+TX&radius=25&limit=50&filter=0&start=100
getting indeed page 3 : http://www.indeed.com/jobs?q=phlebotomist&l=HOUSTON%2C+TX&radius=25&limit=50&filter=0&start=150
date_filter running

delay of 10.00s, getting indeed search: http://www.indeed.com/viewjob?jk=ac44060dadbe32b3
delay of 10.00s, getting indeed search: http://www.indeed.com/viewjob?jk=a028d791865bb433
delay of 10.00s, getting indeed search: http://www.indeed.com/viewjob?jk=db0271a737679ea2
indeed scrape job took 206.649s
jobfunnel monster to pickle running @ 2021-03-29
failed to scrape Monster: 'NoneType' object has no attribute 'text'
no jobs filtered, missing search/data/filter_list.json
removed 0 jobs in blacklist from master-list
Found and removed 6 re-posts/duplicates via TFIDF cosine similarity!
no masterlist detected, added 5 jobs to search/master_list.csv
done. see un-archived jobs in search/master_list.csv

@Nllii
Author

Nllii commented Mar 30, 2021

OK just merged a PR that may fix this, but you should try using current master

Awesome thanks, I will update the module.

@PaulMcInnis
Owner

I just cut a release as well, so you should be able to simply try out 3.0.2.

I hear you on the complexity increase as well, flexibility definitely has a cost.

Given that we don't really have any upgrade or versioning plan currently, I would maintain a backup of all my search results as much as possible.

The older code has flaky matching and update logic, which the newer versions with TFIDF and ID matching can help with.
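To illustrate the general idea behind TF-IDF duplicate detection, here is a standard-library sketch that vectorises job descriptions and flags pairs whose cosine similarity exceeds a threshold. It is not JobFunnel's implementation. The function names, the whitespace tokeniser, and the 0.75 threshold are all assumptions chosen for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF keeps weights positive for terms in every document.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(docs: List[str], threshold: float = 0.75) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, with similarity at or above threshold."""
    vectors = tfidf_vectors(docs)
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if cosine(vectors[i], vectors[j]) >= threshold
    ]
```

Content-based matching like this catches re-posts that receive a new listing ID, while ID matching handles the case where the same listing is scraped again unchanged.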

@Nllii
Author

Nllii commented Mar 30, 2021

Awesome thanks, released version 3.0.2 works.

bash job.sh       
finding you jobs
[2021-03-29 19:31:31,209] [INFO] JobFunnel: Scraping local providers with: ['IndeedScraperUSAEng']
[2021-03-29 19:31:40,281] [INFO] IndeedScraperUSAEng: Found 4 pages of search results for query=Phlebotomist
[2021-03-29 19:31:48,047] [INFO] IndeedScraperUSAEng: Scraped 188 job listings from search results pages
100%|#######################################################| 188/188 [02:18<00:00,  1.35it/s]
[2021-03-29 19:34:07,240] [INFO] JobFunnel: Completed all scraping, found 188 new jobs.
[2021-03-29 19:34:07,394] [INFO] JobFunnel: Done. View your current jobs in demo_job_search_results/demo_search.csv
DONE

@Nllii Nllii closed this as completed Mar 30, 2021