
Refactoring #74

Merged
merged 23 commits into from
Jul 16, 2024

Conversation

dimkarakostas (Member):

All Submissions:

  • Have you followed the guidelines in our Contributing documentation?
  • Have you verified that there aren't any other open Pull Requests for the same update/change?
  • Does the Pull Request pass all tests?

Description

Refactoring of how data is mapped and analyzed.

With this PR, a database file is no longer created per snapshot and per ledger. Instead, a single database file is created for the address mapping information per combination of mapping sources, and the ledger's raw data is loaded fully into memory during analysis.

The refactoring also enables support for combining and excluding mapping sources and computing clusters of entities that form due to the usage of different sources.

README.md Outdated
Place all raw data (which could be collected from
[BigQuery](https://cloud.google.com/bigquery/) for example) in the `input`
directory. Each file named as `<project_name>_<snapshot_date>_raw_data.json`
(e.g. `bitcoin_{2023-01-01}_raw_data.json`). By default, there is a (very
Member:

I guess this was like that before, but the curly brackets shouldn't be there, right?

if val and not force_analyze:
    metric_value = val[0]
    if 'tau' in default_metric_name:
        threshold = float(default_metric_name.split('=')[1])
Member:

Not a big deal but it would be nice to abstract this to a helper function so that it's easily updateable in case we change the metric name format in the future
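A minimal sketch of such a helper (the function name and exact behaviour are illustrative, not taken from the PR):

```python
# Hypothetical helper: centralise parsing of threshold-style metric names
# (e.g. 'tau=0.5'), so a future change to the name format only touches one place.
def get_threshold_from_metric_name(metric_name):
    """Return the float threshold encoded after '=' in a metric name, or None."""
    if '=' not in metric_name:
        return None
    return float(metric_name.split('=')[1])
```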

metric_value = val[0]
if 'tau' in default_metric_name:
    threshold = float(default_metric_name.split('=')[1])
    metric_value = compute_functions[default_metric_name](entries, circulation, threshold)[0]
Member:

I don't think we use the second returned value from the compute_tau function anywhere, so perhaps we can remove it completely and then we wouldn't need to differentiate between the two calls here

while clustered_balances:
    item = clustered_balances.popitem()
    if item[1] > balance_threshold:
        entries.append((item[1], ))
Member:

Is there a reason why we need each entry to be a tuple now? I think we had it this way before because of what the db was returning, but now that we process each entry separately, would it not make more sense to just store one value? (entries would then be a list of numbers instead of a list of tuples)
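For illustration, here is the loop rewritten so that entries becomes a flat list of numbers (a sketch of the suggestion, not the PR's code; the surrounding names are wrapped in a hypothetical function to make it self-contained):

```python
def filter_balances(clustered_balances, balance_threshold):
    """Pop every balance from the dict and keep those above the threshold,
    storing plain numbers rather than one-element tuples."""
    entries = []
    while clustered_balances:
        _, balance = clustered_balances.popitem()  # consume the dict as we go
        if balance > balance_threshold:
            entries.append(balance)
    return entries
```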

for filename in db_paths:
    input_filename = None
    input_paths = [input_dir / f'{ledger}_{date}_raw_data.csv' for input_dir in hlp.get_input_directories()]
    print(input_paths)
Member:

Forgotten print?

id INTEGER PRIMARY KEY,
address TEXT NOT NULL UNIQUE,
entity TEXT NOT NULL,
is_contract BIT DEFAULT 0
Member:

What purpose does the is_contract field serve here?

clusters = list(address_entities.values())
del address_entities

# If an entity is present in two entries, then these are merged.
Member:

I don't entirely understand this process. In general, the way we merge information from different sources seems like something that should be documented in more detail, so it's perhaps worth creating a new page in our docs about the mapping process, where this could also be explained (with examples) to make it easier to understand
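As a rough illustration of the merging step under discussion (a sketch of the general idea, not the PR's implementation): two entries are unioned whenever they share an entity, and this repeats until all entries are disjoint.

```python
def merge_overlapping(clusters):
    """Union entity sets that share at least one member until all are disjoint."""
    merged = []
    for cluster in map(set, clusters):
        # Absorb every already-merged set that intersects the current one,
        # so transitively connected entries end up in a single set.
        overlapping = [m for m in merged if m & cluster]
        for m in overlapping:
            cluster |= m
            merged.remove(m)
        merged.append(cluster)
    return merged
```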

Member:

This also seems like a fragment of code that could live in a separate function, as the logic here is about merging clusters, not just retrieving them as the name of the function it currently belongs to suggests

clusters = hlp.get_clusters('bitcoin')
assert clusters['entity1'] == clusters['entity2']
assert clusters['entity1'] == clusters['entity3']
assert clusters['entity4'] == clusters['entity5']
Member:

Also worth testing that the inactive sources (like test3) were not used, e.g. by asserting that entity7 is not in the clusters

@LadyChristina LadyChristina self-requested a review July 16, 2024 14:03
@LadyChristina (Member) left a comment:

LGTM

@LadyChristina LadyChristina merged commit 0a4821d into main Jul 16, 2024
1 check passed
@LadyChristina LadyChristina deleted the refactoring branch July 16, 2024 14:13
2 participants