[DISCUSSION] Duplicate Ids Across Different Scrapers and Return Code for Scraping #133
Comments
Thanks for bringing this up!

For issue 1, we may want to consider prefixing job IDs with the provider name; then we can do away with this check entirely. For issue 2, I think we just need better tests: ones that ensure we are always able to get a result, and ones that exercise the failure cases independently.
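A minimal sketch of what provider-prefixed IDs might look like; the `make_global_key` helper and the dict layout here are illustrative assumptions, not JobFunnel's actual API:

```python
# Sketch only: namespace each job key by its provider so that the same raw
# ID coming from two different sites can never collide in the master dict.
def make_global_key(provider: str, raw_job_id: str) -> str:
    """Build a collision-free key such as 'indeed_8a2b9c'."""
    return f"{provider.lower()}_{raw_job_id}"

master_jobs: dict = {}
master_jobs[make_global_key("indeed", "8a2b9c")] = {"title": "Software Engineer"}
master_jobs[make_global_key("monster", "8a2b9c")] = {"title": "Data Analyst"}

# Two providers reusing the raw ID '8a2b9c' now map to distinct keys,
# so the inter-scraper validity check becomes unnecessary.
assert len(master_jobs) == 2
```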
I think your solution for issue 1 can work really well 👍. For issue 2, yes, we need more testing. I've been trying to write more unit tests but have been held back by open issues, and I don't want to start writing tests for things that have pending issues. We'll get there one day 😅.
Haha, yeah, there is a lot of testing ground to cover. In an ideal world we'd have 100% coverage, but this is open source after all 🤓. To me the priority ought to be getting our large-city issue resolved; everything else can take a back seat.
…McInnis#123. Partially addresses PaulMcInnis#133.
I think the majority of this issue should be resolved by the changes within #136; however, we may have a limitation in relying on the provider URIs here. We may want to add duplicate jobs to the CSV with a special annotation, since duplicates will still be possible if the provider does not supply us independent URIs.
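As a rough illustration of that annotation idea (the column names and the URI-based duplicate key below are assumptions, not JobFunnel's actual CSV schema), duplicates could be written out and flagged rather than dropped:

```python
import csv

# Sketch only: keep duplicate jobs in the CSV, marking any row whose URI
# has already been seen so the user can filter them out later.
def write_jobs_csv(path: str, jobs: list) -> None:
    seen_uris = set()
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "title", "uri", "is_duplicate"])
        writer.writeheader()
        for job in jobs:
            writer.writerow({
                "id": job["id"],
                "title": job["title"],
                "uri": job["uri"],
                # Annotate instead of raising: the row stays visible.
                "is_duplicate": "TRUE" if job["uri"] in seen_uris else "",
            })
            seen_uris.add(job["uri"])
```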
This issue was moved to a discussion. You can continue the conversation there.
Hello everyone,

Hope you are all doing well. I just wanted to start a discussion, as I'd much rather talk these things through before making a PR. There are mainly two issues I want to raise:
1. The exception raised on inter-scraper key conflicts. This is on `jobfunnel.py:218` and is currently what is holding #132 back from being merged. I wonder whether this is even the proper way of handling inter-scraper key conflicts. I could be missing something, but I don't think this function should throw an exception, because it seems perfectly possible for two different sites to assign the same job ID to two different jobs. This is what my intuition tells me; I don't have anything concrete to support it. And even if two identical IDs from two different sites did mean the same job, I still don't think this function should throw. It could just return a `bool` and log a warning, and if it must raise an exception, that exception should at the very least be caught, which it is not at the moment. I hope these points make sense.
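A minimal sketch of that `bool`-plus-warning alternative; the function and field names are hypothetical stand-ins, not the actual code at `jobfunnel.py:218`:

```python
import logging

logger = logging.getLogger(__name__)

# Sketch only: report inter-scraper key conflicts instead of raising, so a
# single colliding ID cannot abort an otherwise successful scrape.
def check_inter_scraper_validity(master_jobs: dict, new_jobs: dict) -> bool:
    """Return True when no job IDs collide between scrapers."""
    conflicts = master_jobs.keys() & new_jobs.keys()
    for key in conflicts:
        logger.warning(
            "Job ID %s exists in both %s and %s; keeping the existing entry.",
            key,
            master_jobs[key].get("provider", "?"),
            new_jobs[key].get("provider", "?"),
        )
    return not conflicts
```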
2. The non-zero return code when no jobs are found. This looks intentional, and the purpose is clear: if there are no jobs in the dictionary, return a non-zero code, which means failure on most systems. The issue I see is that having no jobs does not necessarily mean there is a bug. It could very well be caused by forces outside of JobFunnel: a CAPTCHA, the sites being down at that moment, the network being down at that moment, etc. I totally understand the intent, but I'm not sure it is the best way of handling this, because our CI could "fail" due to a CAPTCHA, or because JobFunnel wasn't able to find any remote jobs, which is currently the case with our last GitHub Actions build.
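One way those two situations could be separated, sketched under the assumption of a hypothetical `scrape()` entry point and exit-code values that are not existing JobFunnel behavior: reserve non-zero codes for genuine errors and only warn on empty results.

```python
import logging
import sys

logger = logging.getLogger(__name__)

EXIT_OK = 0     # scrape ran to completion, even with zero jobs found
EXIT_ERROR = 1  # an actual bug or unrecoverable failure occurred

def scrape() -> list:
    """Hypothetical stand-in for JobFunnel's scraping entry point."""
    return []  # placeholder: the real implementation would query providers

def main() -> int:
    try:
        jobs = scrape()
    except Exception:
        logger.exception("Scraping failed with an unexpected error.")
        return EXIT_ERROR
    if not jobs:
        # An empty result may be a CAPTCHA, a site outage, or simply no
        # matching postings; warn loudly, but do not fail CI over it.
        logger.warning("No jobs found; this may be external to JobFunnel.")
    return EXIT_OK

if __name__ == "__main__":
    sys.exit(main())
```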
Just wanted to start some discussion around this before I attempt to make any changes.
Cheers!