remove webscraping for URL #488

Merged
merged 4 commits into from
Mar 12, 2024

Conversation

@Johann150 (Contributor) commented Feb 25, 2024

The URL format has been changed and no longer includes the random string, so the web scraping with BeautifulSoup is no longer necessary.

Workflow checklist

PR-Assignee

Reviewer

  • 🐙 Follow the Reviewer Guidelines
  • 🐙 Provide feedback and show sufficient appreciation for the work done

@FlorianK13 (Member) commented:
Hi @Johann150, I'll have a look at your PR next week.


Review comment on the changed code:

    return f'{year}.{release}'

    def gen_url(when: time.struct_time = time.localtime()) -> str:
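As a hypothetical sketch of what a date-based gen_url could look like (the base URL below is a placeholder, not the real "Gesamtdatenexport" URL, which is not shown in this thread):

```python
import time

# Placeholder base URL -- the real export URL is not shown in this thread.
BASE_URL = "https://example.invalid/Gesamtdatenexport_{date}.zip"

def gen_url(when: time.struct_time = time.localtime()) -> str:
    """Build the download URL from a date instead of scraping the page."""
    return BASE_URL.format(date=time.strftime("%Y%m%d", when))
```

One caveat with the signature shown in the diff: a default argument like `time.localtime()` is evaluated once, when the function is defined, so a long-running process would keep using a stale date; accepting `None` as the default and resolving the current time inside the function body avoids that.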
@FlorianK13 (Member):

First of all, thanks for this very useful PR @Johann150.
One remark here: I think it would be great to have a fallback URL in case the current data is not online yet. This could happen, as you described, before 04:00.
Maybe we do the following: when the URL is used to download the "Gesamtdatenexport", wrap that in a try-except block. If it fails, change the URL to the one from one day before and start the download again?

What do you think? And do you have time to make this change? Otherwise I can also do it.

@Johann150 (Contributor, Author):

I have made the change. The download request's status code was previously not checked at all, so instead of writing a giant try block I thought it would be a better idea to "just" check the status code instead.

There is one potential situation that I'm not quite sure about, when thinking about this a bit more. Since the download can take a few minutes, if by coincidence you were to start the download right before the new file is published, I don't know if the rest of the old file will be correctly downloaded. But maybe that is a very hypothetical and contrived situation that does not really need to be considered.

@FlorianK13 (Member):

Yes, I think that won't happen too often, and people can then rerun the download.

When the generated URL is not found, the download will be retried
with the URL for the previous day. This could be applicable if the
download is attempted before the new file is published.

If the re-tried download also fails, the process is aborted completely.
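A minimal sketch of that retry logic, assuming a `gen_url(when)` helper as in this PR; `candidate_times` and `download_export` are illustrative names, not the actual code:

```python
import time
import urllib.error
import urllib.request

def candidate_times(now: "time.struct_time | None" = None) -> list:
    """Return today's timestamp followed by the same time one day earlier."""
    if now is None:
        now = time.localtime()
    return [now, time.localtime(time.mktime(now) - 24 * 3600)]

def download_export(gen_url, dest: str) -> None:
    """Try the URL for today; on an HTTP error (e.g. 404 before the new
    file is published), retry once with yesterday's URL, then give up."""
    last_error = None
    for when in candidate_times():
        try:
            urllib.request.urlretrieve(gen_url(when), dest)
            return
        except urllib.error.HTTPError as err:
            last_error = err
    raise RuntimeError("Download failed for both today's and yesterday's URL") from last_error
```

This mirrors the behaviour described in the commit message: one fallback attempt with the previous day's URL, and a hard abort if that also fails.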
@FlorianK13 (Member) commented:
Tests never run successfully for PRs from outside, as the secret API credentials are not revealed.

@FlorianK13 FlorianK13 merged commit 13f099c into OpenEnergyPlatform:develop Mar 12, 2024
0 of 6 checks passed