remove webscraping for URL #488
Conversation
The URL format has been changed and no longer includes the random string. This means the web scraping with BeautifulSoup is no longer necessary.
Hi @Johann150, I'll have a look at your PR next week.

```python
return f'{year}.{release}'

def gen_url(when: time.struct_time = time.localtime()) -> str:
```
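The quoted `gen_url` signature hints that the URL is derived purely from the date. A minimal sketch of that idea, assuming a hypothetical base URL and an invented `year.release` numbering (only the `f'{year}.{release}'` format itself is from the diff):

```python
import time
from typing import Optional

# Hypothetical base URL for illustration; the real endpoint is project-specific.
BASE_URL = "https://example.org/gesamtdatenexport"

def gen_version(when: time.struct_time) -> str:
    # Matches the quoted f'{year}.{release}' format; numbering releases
    # by day of year is an assumption made for this sketch.
    return f"{when.tm_year}.{when.tm_yday}"

def gen_url(when: Optional[time.struct_time] = None) -> str:
    # Resolve the default at call time: a `when=time.localtime()` default,
    # as in the quoted signature, is evaluated only once, when the
    # function is defined, which is a subtle pitfall in long-running code.
    if when is None:
        when = time.localtime()
    return f"{BASE_URL}/{gen_version(when)}.zip"
```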
First of all thanks for this very useful PR @Johann150
One remark here: I think it would be great to have a fallback URL for the case that the current data is not online yet. This could happen, as you described, before 04:00.
Maybe we do the following: when the URL is used to download the "Gesamtdatenexport", it is wrapped in a `try`/`except` block. If that fails, the URL is changed to the one from the day before and the download is started again?
What do you think? And do you have time to make this change? Otherwise I can also do it.
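The suggested `try`/`except` fallback could look roughly like this sketch; `gen_url` and `fetch` are stand-ins for the project's actual URL builder and download call:

```python
import time
from typing import Callable

def previous_day(when: time.struct_time) -> time.struct_time:
    """Shift a struct_time back by one day."""
    return time.localtime(time.mktime(when) - 86400)

def download_with_fallback(
    gen_url: Callable[[time.struct_time], str],
    fetch: Callable[[str], bytes],
) -> bytes:
    """Try today's URL first; if the request fails, retry once with
    yesterday's URL, then give up."""
    today = time.localtime()
    try:
        return fetch(gen_url(today))
    except OSError:
        # The current export may not be online yet (e.g. before 04:00),
        # so fall back to the previous day's URL.
        return fetch(gen_url(previous_day(today)))
```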
I have made the change. Previously the status code of the download request was not checked at all, so instead of writing a giant `try` block I thought it would be better to "just" check the status code instead.
There is one potential situation I'm not quite sure about, thinking about this a bit more. Since the download can take a few minutes, if by coincidence you were to start the download right before the new file is published, I don't know whether the rest of the old file would still be downloaded correctly. But maybe that is a very hypothetical and contrived situation that does not really need to be considered.
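The status-code approach described here might be sketched as follows; `get` stands in for a `requests.get`-style call whose response exposes `status_code` and `content`:

```python
import time
from typing import Any, Callable

def download(
    url_for_day: Callable[[time.struct_time], str],
    get: Callable[[str], Any],
) -> bytes:
    """Check the HTTP status code instead of wrapping the request in a
    broad try/except. On a non-200 response, retry once with the
    previous day's URL; if that also fails, abort."""
    when = time.localtime()
    for _ in range(2):
        response = get(url_for_day(when))
        if response.status_code == 200:
            return response.content
        # Export for this day is not published yet: step back one day.
        when = time.localtime(time.mktime(when) - 86400)
    raise RuntimeError("download failed for the current and previous day")
```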
Yes, I think that won't happen too often, and people can then rerun the download.
When the generated URL is not found, the download is retried with the URL for the previous day. This applies when the download is attempted before the new file is published. If the retried download also fails, the process is aborted completely.
Tests never run successfully for PRs from outside, as secret API credentials are not revealed.
Workflow checklist
PR-Assignee
- 📙 Update the documentation: no API change, not applicable
Reviewer