Refactoring #74
Conversation
(force-pushed from 8bf40ab to 60c8ba6)
README.md (outdated)
> Place all raw data (which could be collected from
> [BigQuery](https://cloud.google.com/bigquery/) for example) in the `input`
> directory. Each file named as `<project_name>_<snapshot_date>_raw_data.json`
> (e.g. `bitcoin_{2023-01-01}_raw_data.json`). By default, there is a (very
I guess this was like that before, but the curly brackets shouldn't be there, right?
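That is, the example should presumably read `bitcoin_2023-01-01_raw_data.json`, matching how such names would be built with an f-string (a sketch; `project_name` and `snapshot_date` stand in for the placeholders above):

```python
# Produces e.g. 'bitcoin_2023-01-01_raw_data.json' (no literal braces).
filename = f'{project_name}_{snapshot_date}_raw_data.json'
```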
```python
if val and not force_analyze:
    metric_value = val[0]
if 'tau' in default_metric_name:
    threshold = float(default_metric_name.split('=')[1])
```
Not a big deal, but it would be nice to abstract this into a helper function, so that it's easy to update in case we change the metric name format in the future.
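Something like this, for instance (just a sketch; the name `parse_threshold` and the `<metric>=<value>` format assumption are mine):

```python
def parse_threshold(metric_name):
    """Extract the numeric parameter from a metric name like 'tau=0.5'.

    Returns None for unparameterized names. Keeping this in one place
    means only this function changes if the name format ever changes.
    """
    if '=' not in metric_name:
        return None
    return float(metric_name.split('=')[1])
```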
```python
    metric_value = val[0]
if 'tau' in default_metric_name:
    threshold = float(default_metric_name.split('=')[1])
    metric_value = compute_functions[default_metric_name](entries, circulation, threshold)[0]
```
I don't think we use the second value returned by the `compute_tau` function anywhere, so perhaps we can remove it completely, and then we wouldn't need to differentiate between the two calls here.
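If it really is unused, the call sites could drop the `[0]` indexing and `compute_tau` could return a single value along these lines (a sketch of the idea, not the repo's actual implementation; I'm assuming `entries` is a list of balances):

```python
def compute_tau(entries, circulation, threshold):
    """Return only the tau index: the minimum number of top entries whose
    combined share of the circulation exceeds the threshold. The second
    value previously returned alongside it is dropped, since no caller
    consumed it."""
    tau, share = 0, 0.0
    for balance in sorted(entries, reverse=True):
        if share > threshold:
            break
        share += balance / circulation
        tau += 1
    return tau
```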
```python
while clustered_balances:
    item = clustered_balances.popitem()
    if item[1] > balance_threshold:
        entries.append((item[1], ))
```
Is there a reason why we need each entry to be a tuple now? I think we had it this way before because of what the db was returning, but now that we are processing each entry separately, wouldn't it make more sense to just store one value? (So `entries` would be a list of numbers instead of a list of tuples.)
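Something like this, then (assuming nothing else depends on the tuple shape):

```python
while clustered_balances:
    _, balance = clustered_balances.popitem()
    if balance > balance_threshold:
        entries.append(balance)  # a plain number instead of a one-element tuple
```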
```python
for filename in db_paths:
    input_filename = None
    input_paths = [input_dir / f'{ledger}_{date}_raw_data.csv' for input_dir in hlp.get_input_directories()]
    print(input_paths)
```
Forgotten print?
```sql
id INTEGER PRIMARY KEY,
address TEXT NOT NULL UNIQUE,
entity TEXT NOT NULL,
is_contract BIT DEFAULT 0
```
What purpose does the `is_contract` field serve here?
```python
clusters = list(address_entities.values())
del address_entities

# If an entity is present in two entries, then these are merged.
```
I don't entirely understand this process. In general, the way we merge information from different sources seems like something that should be better documented, so perhaps it's worth creating a new page in our docs about the mapping process, where this could also be explained in more detail (and with examples) to make it easier to understand.
This also seems like a fragment of code that could live in a separate function, as the logic here is about merging clusters, not just retrieving them as the name of the function it currently belongs to suggests.
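To make both points concrete, an extracted merge step might look roughly like this (a sketch under my own assumptions: clusters are sets of entity names, and any two clusters sharing an entity should become one; the name `merge_clusters` is illustrative):

```python
def merge_clusters(clusters):
    """Merge clusters until they are pairwise disjoint.

    Example: [{'a', 'b'}, {'b', 'c'}, {'d'}] -> [{'a', 'b', 'c'}, {'d'}]
    """
    merged = []
    for cluster in clusters:
        cluster = set(cluster)
        # Absorb every already-merged cluster that shares an entity with this one.
        for other in [m for m in merged if m & cluster]:
            cluster |= other
            merged.remove(other)
        merged.append(cluster)
    return merged
```

A worked example like this would also fit nicely in the proposed docs page.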
(force-pushed from 8bef41a to bb86550)
(force-pushed from 842d7ef to 7c6b092)
```python
clusters = hlp.get_clusters('bitcoin')
assert clusters['entity1'] == clusters['entity2']
assert clusters['entity1'] == clusters['entity3']
assert clusters['entity4'] == clusters['entity5']
```
Also worth testing that the inactive sources (like test3) were not used, e.g. by asserting that entity7 is not in the clusters.
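Concretely, that could be a one-line addition to the test above (assuming entity7 appears only in the inactive test3 source):

```python
# Entities that appear only in inactive sources must not be clustered.
assert 'entity7' not in clusters
```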
LGTM
Description
Refactoring of how data is mapped and analyzed.
With this PR, a database file is no longer created per snapshot and per ledger. Instead, a single database file holding the address mapping information is created per combination of mapping sources, and a ledger's raw data is loaded fully into memory during analysis.
The refactoring also adds support for combining and excluding mapping sources, and for computing the clusters of entities that form through the use of different sources.
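For illustration, the per-combination naming could work along these lines (a sketch only; the helper name `mapping_db_path` and the path scheme are assumptions, not the code in this PR):

```python
from pathlib import Path

def mapping_db_path(output_dir, ledger, sources):
    """Build one database path per (ledger, combination of mapping sources),
    rather than one per (ledger, snapshot). Sorting the source names makes
    the same combination always resolve to the same file."""
    tag = '-'.join(sorted(sources)) if sources else 'no-sources'
    return Path(output_dir) / f'{ledger}_{tag}_mapping.db'
```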