JobFunnel: Failed to scrape jobs for IndeedScraperUSAEng #137

Closed
evb-gh opened this issue Mar 17, 2021 · 18 comments · Fixed by #139
@evb-gh

evb-gh commented Mar 17, 2021

Description

Running funnel load -s settings_USA.yml gives the following error:

[2021-03-16 18:34:58,123] [INFO] JobFunnel: Scraping local providers with: ['IndeedScraperUSAEng']
[2021-03-16 18:34:59,720] [ERROR] JobFunnel: Failed to scrape jobs for IndeedScraperUSAEng
[2021-03-16 18:34:59,720] [INFO] JobFunnel: Completed all scraping, found 0 new jobs.
[2021-03-16 18:34:59,882] [INFO] JobFunnel: Done. View your current jobs in demo_job_search_results/demo_search.csv

Environment

  • Build: 3.0.1
  • macOS 10.14

I would like to debug further but am not sure how to do it.

@evb-gh evb-gh added the bug label Mar 17, 2021
@corielljacob

Also receiving this. Monster working fine but Indeed fails every time, even with different search keywords. Using DEBUG logging, I was able to get the URL it was trying to hit and it seemed fine.

Environment:

  • Build 3.0.1
  • Ubuntu 18.04.4 LTS
  • Python 3.9.2

@PaulMcInnis
Owner

PaulMcInnis commented Mar 24, 2021

Thanks for opening an issue. I think we have some long-outstanding issues with parsing of the search URL for certain queries. If you are open to sharing your search URLs from the logs, it would be very helpful in identifying what the issue is.

We currently have CI for the US Indeed scraper, but it only performs a basic search.

Additionally, can you confirm that you are able to obtain results (non advertisement results) for the search you are performing on the Indeed website?

@corielljacob

corielljacob commented Mar 24, 2021

Sure. My jobfunnel has also been failing the Monster scrape for the past few days (using crontab to run once daily). I would also try to debug if I could, but I'm not very familiar with running Python projects and I couldn't figure out how to run from PyCharm with the source 😅
URL: https://www.indeed.com/jobs?q=Software Engineer&l=tulsa%2C+OK&radius=25&limit=50&filter=0

I also used the URL: https://www.indeed.com/jobs?q=Software&l=tulsa%2C+OK&radius=25&limit=50&filter=0
Just to see if maybe the space was throwing things off. That URL also failed.
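
For reference, here is a minimal sketch of how a search URL like the ones above can be built so that spaces and commas are always percent-encoded. The helper name build_search_url is hypothetical and is not JobFunnel's own function; only the query parameters are taken from the URLs quoted in this comment.

```python
from urllib.parse import urlencode


def build_search_url(query: str, location: str, radius: int = 25, limit: int = 50) -> str:
    """Hypothetical helper: build an Indeed-style search URL with encoded parameters."""
    params = {"q": query, "l": location, "radius": radius, "limit": limit, "filter": 0}
    # urlencode applies quote_plus by default: spaces become '+', ',' becomes '%2C'
    return "https://www.indeed.com/jobs?" + urlencode(params)


url = build_search_url("Software Engineer", "tulsa, OK")
print(url)
```

With this approach a multi-word keyword never leaves a raw space in the URL, which makes it easier to rule out the space as the cause of a failure.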

@PaulMcInnis
Owner

PaulMcInnis commented Mar 24, 2021

OK, yeah, looks like we need to improve the URL parsing! Can you instead try searching for two separate keywords, like this:

- Software
- Engineer

@PaulMcInnis
Owner

Oh, I see that you tried with a single keyword as well, OK. I think this might be some other issue.

One thing to try is to use the current master of this repo. You can do that by installing it in place with: pip install -e <path to this repo>

@corielljacob

corielljacob commented Mar 24, 2021

I went ahead and added the keywords separately as you suggested anyway, and also installed the current master. However, there is still no change (I was potentially already using the current master).
image

@PaulMcInnis
Owner

Ah ok, thanks for being so responsive, we’ll have to take a deeper look.

If you are feeling confident, I invite you to break execution in the scraper where we collect the number of pages of results from the search URL. I suspect the issue is there, since it ends up scraping no jobs.

@corielljacob

I would be interested in doing some debugging, but I may need some advice on how to do so from something like PyCharm (I'm open to another IDE if you recommend one). This is a tad out of scope for the issue, so pardon my intrusion.
I am trying to run JobFunnel-master\jobfunnel\__main__.py but doing so gets me an import error
image

Like I mentioned, I'm not super familiar with running Python, especially in a project like this, so this may be completely the wrong place to start 😅 But if you can point me in the right direction for how I might get to a point where I can set breakpoints and such, I'd be happy to play around.

@PaulMcInnis
Owner

Unfortunately PyCharm doesn't work for this project due to use of abstract base classes.

The best way to debug is to add an import pdb; pdb.set_trace() line in the code where you would like to debug.

Then you have access to a complete Python interpreter, e.g. pp var_im_interested_in
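
A minimal sketch of that approach, for illustration only. The function count_result_pages and its debug flag are hypothetical stand-ins and do not exist in JobFunnel; in the real code you would simply paste the pdb line where you want execution to stop.

```python
import pdb


def count_result_pages(num_results: int, per_page: int = 50, debug: bool = False) -> int:
    """Hypothetical stand-in for a scraper step that works out how many pages to fetch."""
    if debug:
        # Execution pauses here. At the (Pdb) prompt you can run e.g.
        # "pp num_results", "pp locals()", "n" (next line) or "c" (continue).
        pdb.set_trace()
    # Ceiling division without importing math
    return -(-num_results // per_page)


pages = count_result_pages(120)
print(pages)
```

Calling count_result_pages(120, debug=True) from a terminal drops into the debugger before the return statement.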

@marchbnr
Contributor

You should be able to debug modules, such as jobfunnel, in PyCharm like this:
https://stackoverflow.com/a/51268846

marchbnr added a commit to marchbnr/JobFunnel that referenced this issue Mar 25, 2021
Resolves an issue where indeed responses are not being decoded correctly.
Might resolve issue PaulMcInnis#137
marchbnr added a commit to marchbnr/JobFunnel that referenced this issue Mar 29, 2021
Resolves an issue where indeed responses are not being decoded correctly.
Might resolve issue PaulMcInnis#137
@evb-gh
Author

evb-gh commented Mar 29, 2021

If anyone reading this has the time and knowledge, could I ask you to write a step-by-step example of how to debug this code?
I would like to understand how to debug this repo by running it from a local directory with PyCharm, the CLI, or Emacs.

@PaulMcInnis
Owner

PaulMcInnis commented Mar 29, 2021

Re PyCharm: users have had issues using it with this repository in the past due to the ABC implementation: #90 (comment)

I highly recommend just adding the line import pdb; pdb.set_trace() anywhere in the base scraper or Indeed scraper and playing around with the available methods and variables (pp vars(self))

NOTE: to use pdb with multiprocessing.pool you will additionally want to set the number of workers to 1.
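
To illustrate the note above, here is a minimal sketch that assumes a thread-based pool (multiprocessing.pool.ThreadPool). The names scrape_one and scrape_all are hypothetical, and the actual JobFunnel setting that controls the worker count is not shown in this thread.

```python
from multiprocessing.pool import ThreadPool


def scrape_one(job_id: int) -> str:
    """Hypothetical per-job worker; a pdb.set_trace() would go inside a function like this."""
    return f"job-{job_id}"


def scrape_all(job_ids, workers: int = 1):
    # With workers=1 the jobs run one at a time, so at most one thread
    # can be sitting at a debugger prompt and the output is not interleaved.
    with ThreadPool(processes=workers) as pool:
        return pool.map(scrape_one, job_ids)


results = scrape_all([1, 2, 3])
print(results)
```

With more than one worker, several threads can hit the breakpoint at once and compete for the same terminal, which is why a single worker is easier to debug.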

@evb-gh
Author

evb-gh commented Mar 29, 2021

Thanks for the quick reply. I apologize if my questions seem lazy (I have very little experience with Python), but how do I run the code with test parameters (location, keywords) from a locally cloned repository?

@PaulMcInnis
Owner

> Thanks for the quick reply. I apologize if my questions seem lazy (I have very little experience with python) but how do I run the code with test parameters (location, keywords) from local cloned repository?

Totally fine, happy to help!

You should be able to run with test parameters by doing this:

wget https://git.io/JUWeP -O my_settings.yaml
funnel load -s my_settings.yaml

@evb-gh
Author

evb-gh commented Mar 29, 2021

Doesn't running funnel load -s my_settings.yaml run the code from /usr/local/bin/funnel, which then executes code from /usr/local/lib/python3.9/site-packages/jobfunnel?

What I'm trying to do is:

  1. Clone the repo locally to ~/jobfunnel
  2. Add import pdb; pdb.set_trace() to indeed.py or base.py
  3. Run the code from ~/jobfunnel with my_settings.yml
  4. Debug

@PaulMcInnis
Owner

Right, I recommend doing this to get a test version of jobfunnel:

  1. git clone this repo somewhere
  2. checkout the branch you want to test
  3. virtualenv venv
  4. source venv/bin/activate
  5. pip3 install -e ./jobfunnel

When done, you can exit the virtualenv with deactivate.
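
To confirm which copy of the package will actually run after the steps above, a small check like the following can help. It is a sketch using only the standard library; module_location is a hypothetical helper name, not part of JobFunnel.

```python
import importlib.util
from typing import Optional


def module_location(name: str) -> Optional[str]:
    """Return the file a top-level module would be imported from, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Run this inside the activated virtualenv: after an editable install the path
# should point into the cloned repo rather than into site-packages.
print(module_location("jobfunnel"))
```

If the printed path still points at a system-wide site-packages directory, the funnel command is not using the cloned source and any pdb lines added there will never be reached.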

@PaulMcInnis
Owner

PaulMcInnis commented Mar 29, 2021

OK, so I think the best place to start is indeed.py line 303 in the current master: query_resp.find returns None, and I believe this is because the encoding of the request_html is somehow incorrect. I'm taking a look as well, since I want this to work for everyone :P

<bound m�D������]���nd of <html><body><p>�J ��_�~�ް��уƽ����� O�
���#T��v�r�M����i�7����ϼ���r��v�'�C�F�!�c�W��
i���K��+^6�n�����hy\)���΋���Y���b!	j��Z��VH���k����L_���wР�BXk@��9B�N����$|�&gt;L����'�K�w�p�D��%6�c�*�	��,�l���X&amp;l�h@0���%�� �E�r�D\��xP��nȸc�[��C8�qH��_l����V1��-{.��<tl4z>�Jj6���
!K�!�^��B��2�R�����6�u'hǐ��gB��8�����2"���]��|�^�X�%���`�qx7R����\M�j�tR\]N��.bj�Y���n�6Åp�qr �`����7��v���ҪBnr��,�������zٳ���k!��
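
Output like the above is what a compressed response body tends to look like when it is treated as text. The thread does not show the actual fix, so the following is only a sketch of one plausible cause, under the assumption that the body was gzip-compressed (it could equally have been another encoding). SAMPLE_HTML, the result-count id and decode_body are all hypothetical.

```python
import gzip

# Hypothetical page: repeated rows plus one marker element the scraper would look for
SAMPLE_HTML = (
    "<html><body>"
    + "<div class='job'>Software Engineer</div>" * 20
    + '<div id="result-count">Page 1 of 120 jobs</div>'
    + "</body></html>"
)


def decode_body(raw: bytes) -> str:
    """Decode a response body, decompressing first if it starts with the gzip magic number."""
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")


compressed = gzip.compress(SAMPLE_HTML.encode("utf-8"))

# Treating the compressed bytes as text yields mojibake, so the marker cannot be found
garbled = compressed.decode("utf-8", errors="replace")
print("result-count" in garbled)

# Decompressing before decoding restores the markup
print("result-count" in decode_body(compressed))
```

A parser fed the garbled string finds no matching element, which is consistent with a find call returning None.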

@PaulMcInnis PaulMcInnis linked a pull request Mar 29, 2021 that will close this issue
15 tasks
PaulMcInnis pushed a commit that referenced this issue Mar 29, 2021
Resolves an issue where indeed responses are not being decoded correctly.
Might resolve issue #137
@PaulMcInnis
Owner

Didn't mean to close this abruptly, but I think the encoding was causing this. Please pull the latest changes and try again; this has resolved the issue on my end.
