Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Big query script #72

Merged
merged 4 commits into from
Jun 7, 2024
Merged

Big query script #72

merged 4 commits into from
Jun 7, 2024

Conversation

LadyChristina
Copy link
Member

@LadyChristina LadyChristina commented Jun 3, 2024

All Submissions:

  • Have you followed the guidelines in our Contributing documentation?
  • Have you verified that there aren't any other open Pull Requests for the same update/change?
  • Does the Pull Request pass all tests?

Description

Updated BigQuery data collection script so that it only takes one date as input (to_date) and performs the data collection until that date, starting from the last time each ledger was updated (which is saved in a json file called last_update.json and is updated automatically after each succesful data collection).

Copy link
Member

@dimkarakostas dimkarakostas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that the same granularity is always used. If say someone uses first monthly, collects the files, and then sets daily, then the script will start from the last (monthly) updated point and will not collect the (daily) data before that. That's fine for now, but I'd suggest to open an enhancement issue for the future for this.

:param granularity: The granularity of the data collection. Can be 'day', 'week', 'month', or 'year'.
:return: A dictionary with ledgers as keys and the corresponding start dates as values.
"""
with open(hlp.ROOT_DIR / "data_collection_scripts/last_update.json") as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the file doesn't exist, this will throw an error. If that's intended OK, otherwise I suggest you create a separate function which checks if the files exists and, if not, initializes it somehow (eg. reading the files in the input folder).

ledger_snapshot_dates = {ledger: [hlp.get_date_string_from_date(to_date)] for ledger in ledgers}
else:
ledger_from_dates = get_from_dates(granularity=granularity)
ledger_snapshot_dates = {ledger: hlp.get_dates_between(ledger_from_dates[ledger], to_date, granularity) for ledger in ledgers}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also throw an error if the ledger is not in the file (if that's intended, fine).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@LadyChristina
Copy link
Member Author

This assumes that the same granularity is always used. If say someone uses first monthly, collects the files, and then sets daily, then the script will start from the last (monthly) updated point and will not collect the (daily) data before that. That's fine for now, but I'd suggest to open an enhancement issue for the future for this.

Fixed. Now last_update.json includes is defined per granularity.

@dimkarakostas dimkarakostas merged commit 85041c2 into main Jun 7, 2024
1 check passed
@dimkarakostas dimkarakostas deleted the big_query_script branch June 7, 2024 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants