remove webscraping for URL #488

Merged
merged 4 commits into from
Mar 12, 2024

Conversation

@Johann150 (Contributor) commented Feb 25, 2024

The URL format has been changed and no longer includes the random string, so the web scraping with BeautifulSoup is no longer necessary.

Workflow checklist

PR-Assignee

Reviewer

  • 🐙 Follow the Reviewer Guidelines
  • 🐙 Provide feedback and show sufficient appreciation for the work done

@FlorianK13 (Member) commented:
Hi @Johann150, I'll have a look at your PR next week.


Review comment on the changed code:

    return f'{year}.{release}'

    def gen_url(when: time.struct_time = time.localtime()) -> str:
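As a hypothetical sketch of what a date-based gen_url could look like (the base URL below is a placeholder, not the real "Gesamtdatenexport" URL, which is not shown in this thread):

```python
import time

# Placeholder base URL -- the real export URL is not shown in this thread.
BASE_URL = "https://example.invalid/Gesamtdatenexport_{date}.zip"

def gen_url(when: time.struct_time = time.localtime()) -> str:
    """Build the download URL from a date instead of scraping the page."""
    return BASE_URL.format(date=time.strftime("%Y%m%d", when))
```

One caveat with the signature shown in the diff: a default argument like `time.localtime()` is evaluated once, when the function is defined, so a long-running process would keep using a stale date; accepting `None` as the default and resolving the current time inside the function body avoids that.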
@FlorianK13 (Member):

First of all, thanks for this very useful PR @Johann150.
One remark here: I think it would be great to have a fallback URL in case the current data is not online yet. This could happen, as you described, before 04:00.
Maybe we do the following: when the URL is used to download the "Gesamtdatenexport", wrap that in a try-except block. If it fails, change the URL to the one from one day before and start the download again?

What do you think? And do you have time to make this change? Otherwise I can also do it.

@Johann150 (Contributor, Author):

I have made the change. The download request's status code was previously not checked at all, so instead of writing a giant try block I thought it would be a better idea to "just" check the status code instead.

There is one potential situation that I'm not quite sure about, when thinking about this a bit more. Since the download can take a few minutes, if by coincidence you were to start the download right before the new file is published, I don't know if the rest of the old file will be correctly downloaded. But maybe that is a very hypothetical and contrived situation that does not really need to be considered.

@FlorianK13 (Member):

Yes, I think that won't happen too often, and people can then rerun the download.

When the generated URL is not found, the download will be retried
with the URL for the previous day. This could be applicable if the
download is attempted before the new file is published.

If the re-tried download also fails, the process is aborted completely.
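A minimal sketch of that retry logic, assuming a `gen_url(when)` helper as in this PR; `candidate_times` and `download_export` are illustrative names, not the actual code:

```python
import time
import urllib.error
import urllib.request

def candidate_times(now: "time.struct_time | None" = None) -> list:
    """Return today's timestamp followed by the same time one day earlier."""
    if now is None:
        now = time.localtime()
    return [now, time.localtime(time.mktime(now) - 24 * 3600)]

def download_export(gen_url, dest: str) -> None:
    """Try the URL for today; on an HTTP error (e.g. 404 before the new
    file is published), retry once with yesterday's URL, then give up."""
    last_error = None
    for when in candidate_times():
        try:
            urllib.request.urlretrieve(gen_url(when), dest)
            return
        except urllib.error.HTTPError as err:
            last_error = err
    raise RuntimeError("Download failed for both today's and yesterday's URL") from last_error
```

This mirrors the behaviour described in the commit message: one fallback attempt with the previous day's URL, and a hard abort if that also fails.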
@FlorianK13 (Member) commented:
Tests never run successfully for PRs from outside, as the secret API credentials are not revealed.

@FlorianK13 FlorianK13 merged commit 13f099c into OpenEnergyPlatform:develop Mar 12, 2024
0 of 6 checks passed