
Refactoring #74

Merged
merged 23 commits into from
Jul 16, 2024

Conversation

dimkarakostas (Member):

All Submissions:

  • Have you followed the guidelines in our Contributing documentation?
  • Have you verified that there aren't any other open Pull Requests for the same update/change?
  • Does the Pull Request pass all tests?

Description

Refactoring of how data is mapped and analyzed.

With this PR, a database file is no longer created per snapshot and per ledger. Instead, a single database file is created for the address mapping information per combination of mapping sources, and the ledger's raw data is loaded fully into memory during analysis.

The refactoring also enables support for combining and excluding mapping sources and computing clusters of entities that form due to the usage of different sources.

README.md Outdated
Place all raw data (which could be collected from
[BigQuery](https://cloud.google.com/bigquery/) for example) in the `input`
directory. Each file named as `<project_name>_<snapshot_date>_raw_data.json`
(e.g. `bitcoin_{2023-01-01}_raw_data.json`). By default, there is a (very
Member:

I guess this was like that before, but the curly brackets shouldn't be there, right?

if val and not force_analyze:
    metric_value = val[0]
    if 'tau' in default_metric_name:
        threshold = float(default_metric_name.split('=')[1])
Member:

Not a big deal but it would be nice to abstract this to a helper function so that it's easily updateable in case we change the metric name format in the future
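A minimal sketch of such a helper (the function name and exact behaviour are illustrative, not taken from the PR):

```python
# Hypothetical helper: centralise parsing of threshold-style metric names
# (e.g. 'tau=0.5'), so a future change to the name format only touches one place.
def get_threshold_from_metric_name(metric_name):
    """Return the float threshold encoded after '=' in a metric name, or None."""
    if '=' not in metric_name:
        return None
    return float(metric_name.split('=')[1])
```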

metric_value = val[0]
if 'tau' in default_metric_name:
    threshold = float(default_metric_name.split('=')[1])
    metric_value = compute_functions[default_metric_name](entries, circulation, threshold)[0]
Member:

I don't think we use the second returned value from the compute_tau function anywhere, so perhaps we can remove it completely and then we wouldn't need to differentiate between the two calls here

while clustered_balances:
    item = clustered_balances.popitem()
    if item[1] > balance_threshold:
        entries.append((item[1], ))
Member:

Is there a reason why we need each entry to be a tuple now? I think we had it this way before because of what the db was returning, but now that we process each entry separately, would it not make more sense to just store one value? (entries would then be a list of numbers instead of a list of tuples)
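For illustration, here is the loop rewritten so that entries becomes a flat list of numbers (a sketch of the suggestion, not the PR's code; the surrounding names are wrapped in a hypothetical function to make it self-contained):

```python
def filter_balances(clustered_balances, balance_threshold):
    """Pop every balance from the dict and keep those above the threshold,
    storing plain numbers rather than one-element tuples."""
    entries = []
    while clustered_balances:
        _, balance = clustered_balances.popitem()  # consume the dict as we go
        if balance > balance_threshold:
            entries.append(balance)
    return entries
```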

for filename in db_paths:
    input_filename = None
    input_paths = [input_dir / f'{ledger}_{date}_raw_data.csv' for input_dir in hlp.get_input_directories()]
    print(input_paths)
Member:

Forgotten print?

id INTEGER PRIMARY KEY,
address TEXT NOT NULL UNIQUE,
entity TEXT NOT NULL,
is_contract BIT DEFAULT 0
Member:

What purpose does the is_contract field serve here?

clusters = list(address_entities.values())
del address_entities

# If an entity is present in two entries, then these are merged.
Member:

I don't entirely understand this process. In general, the way we merge information from different sources seems like something that should be documented in more detail, so it's perhaps worth creating a new page in our docs about the mapping process, where this could also be explained (with examples) to make it easier to understand
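As a rough illustration of the merging step under discussion (a sketch of the general idea, not the PR's implementation): two entries are unioned whenever they share an entity, and this repeats until all entries are disjoint.

```python
def merge_overlapping(clusters):
    """Union entity sets that share at least one member until all are disjoint."""
    merged = []
    for cluster in map(set, clusters):
        # Absorb every already-merged set that intersects the current one,
        # so transitively connected entries end up in a single set.
        overlapping = [m for m in merged if m & cluster]
        for m in overlapping:
            cluster |= m
            merged.remove(m)
        merged.append(cluster)
    return merged
```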

Member:

This also seems like a fragment of code that could live in a separate function, as the logic here is about merging clusters, not just retrieving them as the name of the function it currently belongs to suggests

clusters = hlp.get_clusters('bitcoin')
assert clusters['entity1'] == clusters['entity2']
assert clusters['entity1'] == clusters['entity3']
assert clusters['entity4'] == clusters['entity5']
Member:

Also worth testing that the inactive sources (like test3) were not used, e.g. by asserting that entity7 is not in the clusters

@LadyChristina LadyChristina self-requested a review July 16, 2024 14:03
@LadyChristina (Member) left a comment:

LGTM

@LadyChristina LadyChristina merged commit 0a4821d into main Jul 16, 2024
1 check passed
@LadyChristina LadyChristina deleted the refactoring branch July 16, 2024 14:13
2 participants