[DISCUSSION] Duplicate Ids Across Different Scrapers and Return Code for Scraping #133
Comments
Thanks for bringing this up!

For issue 1, we may want to consider prefixing job IDs with the provider name; then we can do away with this check entirely. For issue 2, I think we just need better tests: ones that ensure we are always able to get a result, and ones that exercise the failure cases independently.
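A minimal sketch of what provider-prefixed IDs might look like; the `make_global_key` helper and the dict layout here are illustrative assumptions, not JobFunnel's actual API:

```python
# Sketch only: namespace each job key by its provider so that the same raw
# ID coming from two different sites can never collide in the master dict.
def make_global_key(provider: str, raw_job_id: str) -> str:
    """Build a collision-free key such as 'indeed_8a2b9c'."""
    return f"{provider.lower()}_{raw_job_id}"

master_jobs: dict = {}
master_jobs[make_global_key("indeed", "8a2b9c")] = {"title": "Software Engineer"}
master_jobs[make_global_key("monster", "8a2b9c")] = {"title": "Data Analyst"}

# Two providers reusing the raw ID '8a2b9c' now map to distinct keys,
# so the inter-scraper validity check becomes unnecessary.
assert len(master_jobs) == 2
```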
I think your solution for issue 1 can work really well 👍. For issue 2, yes, we need more testing. I've been trying to write more unit tests but have been held back by open issues, and I don't want to start writing tests for things that have pending issues. We'll get there one day 😅.
Haha, yeah, there is a lot of testing ground to cover. In an ideal world we'd have 100% coverage, but this is open source after all 🤓. To me the priority ought to be getting our large-city issue resolved; everything else can take a back seat.
…McInnis#123. Partially addresses PaulMcInnis#133.
I think the majority of this issue should be resolved by the changes within #136; however, we may have a limitation in relying on the provider URIs here. We may want to add duplicate jobs to the CSV with a special annotation, since duplicates will still be possible if the provider does not supply us independent URIs.
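As a rough illustration of that annotation idea (the column names and the URI-based duplicate key below are assumptions, not JobFunnel's actual CSV schema), duplicates could be written out and flagged rather than dropped:

```python
import csv

# Sketch only: keep duplicate jobs in the CSV, marking any row whose URI
# has already been seen so the user can filter them out later.
def write_jobs_csv(path: str, jobs: list) -> None:
    seen_uris = set()
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "title", "uri", "is_duplicate"])
        writer.writeheader()
        for job in jobs:
            writer.writerow({
                "id": job["id"],
                "title": job["title"],
                "uri": job["uri"],
                # Annotate instead of raising: the row stays visible.
                "is_duplicate": "TRUE" if job["uri"] in seen_uris else "",
            })
            seen_uris.add(job["uri"])
```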
This issue was moved to a discussion. You can continue the conversation there.
Hello everyone,

Hope you are all doing well. I just wanted to start a discussion, as I'd much rather talk these things through before making a PR. There are mainly two issues I want to raise:
1. The exception raised on inter-scraper key conflicts. This is on `jobfunnel.py:218` and is currently what is holding #132 back from being merged. I wonder whether this is even the proper way of handling inter-scraper key conflicts. I could be missing something, but I don't think this function should throw an exception, because it seems perfectly possible for two different sites to assign the same job ID to two different jobs. This is what my intuition tells me; I don't have anything concrete to support it. And even if two identical IDs from two different sites did mean the same job, I still don't think this function should throw. It could just return a `bool` and log a warning, and if it must raise an exception, that exception should at the very least be caught, which it is not at the moment. I hope these points make sense.
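A minimal sketch of that `bool`-plus-warning alternative; the function and field names are hypothetical stand-ins, not the actual code at `jobfunnel.py:218`:

```python
import logging

logger = logging.getLogger(__name__)

# Sketch only: report inter-scraper key conflicts instead of raising, so a
# single colliding ID cannot abort an otherwise successful scrape.
def check_inter_scraper_validity(master_jobs: dict, new_jobs: dict) -> bool:
    """Return True when no job IDs collide between scrapers."""
    conflicts = master_jobs.keys() & new_jobs.keys()
    for key in conflicts:
        logger.warning(
            "Job ID %s exists in both %s and %s; keeping the existing entry.",
            key,
            master_jobs[key].get("provider", "?"),
            new_jobs[key].get("provider", "?"),
        )
    return not conflicts
```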
2. The non-zero return code when no jobs are found. This looks intentional, and the purpose is clear: if there are no jobs in the dictionary, return a non-zero code, which means failure on most systems. The issue I see is that having no jobs does not necessarily mean there is a bug. It could very well be caused by forces outside of JobFunnel: a CAPTCHA, the sites being down at that moment, the network being down at that moment, etc. I totally understand the intent, but I'm not sure it is the best way of handling this, because our CI could "fail" due to a CAPTCHA, or because JobFunnel wasn't able to find any remote jobs, which is currently the case with our last GitHub Actions build.
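One way those two situations could be separated, sketched under the assumption of a hypothetical `scrape()` entry point and exit-code values that are not existing JobFunnel behavior: reserve non-zero codes for genuine errors and only warn on empty results.

```python
import logging
import sys

logger = logging.getLogger(__name__)

EXIT_OK = 0     # scrape ran to completion, even with zero jobs found
EXIT_ERROR = 1  # an actual bug or unrecoverable failure occurred

def scrape() -> list:
    """Hypothetical stand-in for JobFunnel's scraping entry point."""
    return []  # placeholder: the real implementation would query providers

def main() -> int:
    try:
        jobs = scrape()
    except Exception:
        logger.exception("Scraping failed with an unexpected error.")
        return EXIT_ERROR
    if not jobs:
        # An empty result may be a CAPTCHA, a site outage, or simply no
        # matching postings; warn loudly, but do not fail CI over it.
        logger.warning("No jobs found; this may be external to JobFunnel.")
    return EXIT_OK

if __name__ == "__main__":
    sys.exit(main())
```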
Just wanted to start some discussion around this before I attempt to make any changes.
Cheers!