remove webscraping for URL #488

Merged
merged 4 commits into from
Mar 12, 2024
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -9,7 +9,8 @@ and the versioning aims to respect [Semantic Versioning](http://semver.org/spec/
## [v0.XX.X] unreleased - 2024-XX-XX
### Added
### Changed
- Fix and add URLs of example projects in readme [#481]((https://github.com/OpenEnergyPlatform/open-MaStR/pull/481)
- Fix and add URLs of example projects in readme [#481](https://github.com/OpenEnergyPlatform/open-MaStR/pull/481)
- No longer require web scraping for bulk download [#488](https://github.com/OpenEnergyPlatform/open-MaStR/pull/488)
### Removed

## [v0.14.1] Hotfix - 2024-01-17
83 changes: 69 additions & 14 deletions open_mastr/xml_download/utils_download_bulk.py
@@ -5,30 +5,73 @@

import numpy as np
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

# setup logger
from open_mastr.utils.config import setup_logger

log = setup_logger()

def gen_version(when: time.struct_time = time.localtime()) -> str:
"""
Generates the current version.

The version number is determined according to a fixed release cycle,
which is by convention in sync with the changes to other German regulatory
frameworks of the energy sector, such as GeLi Gas and GPKE.

The release schedule is twice per year on 1st of April and October.
The version number is determined by the year of release and the running
number of the release, i.e. the release on April 1st is release 1,
while the release in October is release 2.

def get_url_from_Mastr_website() -> str:
"""Get the url of the latest MaStR file from markstammdatenregister.de.
Further, the release happens during the day, so on the day of the
changeover, the exported data will still be in the old version/format.

The file and the corresponding url are updated once per day.
The url has a randomly generated string appended, so it has to be
grabbed from the marktstammdatenregister.de homepage.
For further details visit https://www.marktstammdatenregister.de/MaStR/Datendownload
see <https://www.marktstammdatenregister.de/MaStRHilfe/files/webdienst/Release-Termine.pdf>

Examples:
2024-01-01 = version 23.2
2024-04-01 = version 23.2
2024-04-02 = version 24.1
2024-09-30 = version 24.1
2024-10-01 = version 24.1
2024-10-02 = version 24.2
2024-12-31 = version 24.2
"""

html = requests.get("https://www.marktstammdatenregister.de/MaStR/Datendownload")
soup = BeautifulSoup(html.text, "lxml")
# find the download button element on the website
element = soup.find_all("a", "btn btn-primary text-right")[0]
# extract the url from the html element
return str(element).split('href="')[1].split('" title')[0]
year = when.tm_year
release = 1

if when.tm_mon < 4 or (when.tm_mon == 4 and when.tm_mday == 1):
year = year - 1
release = 2
elif when.tm_mon > 10 or (when.tm_mon == 10 and when.tm_mday > 1):
release = 2

# only the last two digits of the year are used
year = str(year)[-2:]

return f'{year}.{release}'

def gen_url(when: time.struct_time = time.localtime()) -> str:
Member

First of all, thanks for this very useful PR, @Johann150.
One remark here: I think it would be great to have a fallback URL for the case that the current data is not online yet. This could happen, as you described, before 04:00.
Maybe we do the following: when the URL is used to download the "Gesamtdatenexport", the download is wrapped in a try/except block. If it fails, the URL is changed to the one from the day before and the download is started again?

What do you think? And do you have time to make this change? Otherwise I can also do it.

Contributor Author

I have made the change. Previously, the status code of the download request was not checked at all, so instead of writing a giant try block I thought it would be a better idea to "just" check the status code.

There is one potential situation that I'm not quite sure about, when thinking about this a bit more. Since the download can take a few minutes, if by coincidence you were to start the download right before the new file is published, I don't know if the rest of the old file will be correctly downloaded. But maybe that is a very hypothetical and contrived situation that does not really need to be considered.

Member

Yes, I think that won't happen too often, and people can then rerun the download.

"""
Generates the download URL for the specified date.

Note that not all dates are archived on the website.
Normally only today's export is available. The export is usually made
between 02:00 and 04:00, so before 04:00 the current data may not
yet be available and the download could fail.

Note also that this function cannot generate URLs for dates before 2024,
because a different URL scheme was used then, which had some random
data embedded in the name to make it harder to automate downloads.
"""

version = gen_version(when)
date = time.strftime("%Y%m%d", when)

return f'https://download.marktstammdatenregister.de/Gesamtdatenexport_{date}_{version}.zip'
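
For reference, the two functions above can be condensed into a self-contained sketch. It follows the release rule and URL pattern from the diff, with one deliberate difference that is an assumption of this sketch and not part of the PR: the default for `when` is `None` and is resolved at call time, because a default of `time.localtime()` in the signature is evaluated only once, when the module is imported.

```python
import time
from typing import Optional


def gen_version(when: Optional[time.struct_time] = None) -> str:
    """Return the export version (e.g. "24.1") that applies on the given date."""
    # Assumption of this sketch: resolve the default at call time. A default
    # of time.localtime() in the signature is evaluated once, at import.
    if when is None:
        when = time.localtime()

    year = when.tm_year
    release = 1
    if when.tm_mon < 4 or (when.tm_mon == 4 and when.tm_mday == 1):
        # Up to and including April 1st, last October's release still applies.
        year -= 1
        release = 2
    elif when.tm_mon > 10 or (when.tm_mon == 10 and when.tm_mday > 1):
        # From October 2nd onwards, the second release of the year applies.
        release = 2

    # Only the last two digits of the year are used.
    return f"{str(year)[-2:]}.{release}"


def gen_url(when: Optional[time.struct_time] = None) -> str:
    """Build the bulk download URL for the given date."""
    if when is None:
        when = time.localtime()
    date = time.strftime("%Y%m%d", when)
    return (
        "https://download.marktstammdatenregister.de/"
        f"Gesamtdatenexport_{date}_{gen_version(when)}.zip"
    )


when = time.strptime("2024-04-02", "%Y-%m-%d")
print(gen_url(when))
# https://download.marktstammdatenregister.de/Gesamtdatenexport_20240402_24.1.zip
```

Callers that always pass an explicit date, as `download_xml_Mastr` does in this PR, are not affected by the default either way.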


def download_xml_Mastr(
@@ -69,9 +112,21 @@ def download_xml_Mastr(
" You may want to download it another time."
)
print(print_message)
url = get_url_from_Mastr_website()

now = time.localtime()
url = gen_url(now)

time_a = time.perf_counter()
r = requests.get(url, stream=True)
if r.status_code == 404:
    # presumably today's download is not ready yet, retry with yesterday's date
    log.warning("Download file was not found. Assuming that the new file was not published yet and retrying with yesterday.")
    now = time.localtime(time.mktime(now) - (24 * 60 * 60))  # subtract 1 day from the date
    url = gen_url(now)  # rebuild the URL for yesterday's date before retrying
    r = requests.get(url, stream=True)
if r.status_code == 404:
    log.error("Could not download file: download URL not found")
    return

total_length = int(18000 * 1024 * 1024)
with open(save_path, "wb") as zfile, tqdm(
desc=save_path, total=(total_length / 1024 / 1024), unit=""
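
The fallback discussed in the review thread (try today's file, then yesterday's) can also be sketched in isolation. The helper below is hypothetical and not part of open-MaStR: `pick_available_url`, `demo_build_url`, and the `example.invalid` URLs are made up for illustration, and the availability check is passed in as a function so the logic runs without network access. It also swaps in `datetime.date` arithmetic for the "subtract 24 * 60 * 60 seconds" step, which sidesteps days that are not exactly 24 hours long around daylight saving changes.

```python
import datetime
from typing import Callable, Optional


def pick_available_url(
    build_url: Callable[[datetime.date], str],
    exists: Callable[[str], bool],
    today: Optional[datetime.date] = None,
) -> Optional[str]:
    """Return today's URL if it is available, else yesterday's, else None."""
    if today is None:
        today = datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)

    for day in (today, yesterday):
        url = build_url(day)  # the URL must be rebuilt for each date tried
        if exists(url):
            return url
    return None


# Hypothetical stand-ins so the example runs offline.
def demo_build_url(day: datetime.date) -> str:
    return f"https://example.invalid/export_{day:%Y%m%d}.zip"


published = {"https://example.invalid/export_20240614.zip"}

# Today's file (June 15th) is not published yet, so yesterday's is chosen.
print(pick_available_url(demo_build_url, published.__contains__, datetime.date(2024, 6, 15)))
# https://example.invalid/export_20240614.zip
```

In the real download, `exists` would correspond to issuing the request and checking for a 404 status code, as the PR does. A `datetime.date` can be turned into a `time.struct_time` with `.timetuple()` if the URL builder expects one.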
1 change: 0 additions & 1 deletion setup.py
@@ -52,7 +52,6 @@
"requests",
"keyring",
"tqdm",
"beautifulsoup4",
"pyyaml",
"xmltodict",
],
35 changes: 30 additions & 5 deletions tests/xml_download/test_utils_download_bulk.py
@@ -1,8 +1,33 @@
from open_mastr.xml_download.utils_download_bulk import get_url_from_Mastr_website
import time
from open_mastr.xml_download.utils_download_bulk import gen_url

def test_gen_url():
when = time.strptime("2024-01-01", "%Y-%m-%d")
url = gen_url(when)
assert type(url) == str
assert url == "https://download.marktstammdatenregister.de/Gesamtdatenexport_20240101_23.2.zip"

when = time.strptime("2024-04-01", "%Y-%m-%d")
url = gen_url(when)
assert type(url) == str
assert url == "https://download.marktstammdatenregister.de/Gesamtdatenexport_20240401_23.2.zip"

when = time.strptime("2024-04-02", "%Y-%m-%d")
url = gen_url(when)
assert type(url) == str
assert url == "https://download.marktstammdatenregister.de/Gesamtdatenexport_20240402_24.1.zip"

when = time.strptime("2024-10-01", "%Y-%m-%d")
url = gen_url(when)
assert type(url) == str
assert url == "https://download.marktstammdatenregister.de/Gesamtdatenexport_20241001_24.1.zip"

when = time.strptime("2024-10-02", "%Y-%m-%d")
url = gen_url(when)
assert type(url) == str
assert url == "https://download.marktstammdatenregister.de/Gesamtdatenexport_20241002_24.2.zip"

def test_get_url_from_Mastr_website():
url = get_url_from_Mastr_website()
assert len(url) > 10
when = time.strptime("2024-12-31", "%Y-%m-%d")
url = gen_url(when)
assert type(url) == str
assert "marktstammdaten" in url
assert url == "https://download.marktstammdatenregister.de/Gesamtdatenexport_20241231_24.2.zip"