
Readonly Pipeline #229

Open: wants to merge 265 commits into base: master
Conversation

@haohangyan (Contributor) commented on Oct 20, 2024

This PR contains all the code we need for the readonly pipeline.

Running the command bash indra_db/readonly_dumping/test_bash.sh starts the pipeline.
An overview of the pipeline:

  • Set up environment variables and database passwords.
  • Get file paths for required initial dump files and verify them (see the sketch after this list).
  • Dump raw statements, reading text content meta, and text refs principal from the principal database.
  • Get all files that will be loaded into the readonly database.
  • Recreate the local database and import data using the readonly_dumping script.
  • (Not tested) create and remove a local database dump file after uploading it to S3.
  • (Not tested) upload an end-date file to S3 and restore the dump to a readonly instance using pg_restore.
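
As a rough illustration of the first two steps, here is a minimal sketch; the variable and file names below are placeholders, not the ones the pipeline actually uses (test_bash.sh defines the real ones):

import os
from pathlib import Path

# Placeholder names for illustration only.
REQUIRED_ENV = ["PGPASSWORD", "READONLY_DB_URL"]
REQUIRED_DUMPS = [
    "raw_statements.tsv.gz",
    "reading_text_content_meta.tsv.gz",
    "text_refs_principal.tsv.gz",
]

def verify_inputs(dump_dir: str) -> None:
    missing_env = [v for v in REQUIRED_ENV if v not in os.environ]
    if missing_env:
        raise RuntimeError(f"Missing environment variables: {missing_env}")
    missing = [f for f in REQUIRED_DUMPS if not (Path(dump_dir) / f).is_file()]
    if missing:
        raise RuntimeError(f"Missing initial dump files: {missing}")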

The pipeline contains two main parts: export_assembly and readonly_dumping.

Some main stages in the export assembly include:

  1. Running the knowledgebase pipeline
  2. Running statement distillation
  3. Running preprocessing
  4. Merging processed knowledgebase statements with processed raw statements
  5. Running grounding and deduplication
  6. Calculating refinements
  7. Calculating the belief score

In the readonly_dumping part, we dump the tables into the local Postgres database in the following order: “belief”, “raw_stmt_src”, “reading_ref_link”, “evidence_counts”, “pa_agent_counts”, “mesh_concept_ref_counts”, “mesh_term_ref_counts”, “name_meta”, “text_meta”, “other_meta”, “source_meta”, “agent_interactions”, “fast_raw_pa_link”, “raw_stmt_mesh_concepts”, “raw_stmt_mesh_terms”, “mesh_concept_meta”, “mesh_term_meta”.
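
To make the ordering concrete, here is a minimal sketch; the load_table helper and the local database name are hypothetical stand-ins for what the readonly_dumping script actually does:

import subprocess

# Load order as listed in the PR description.
TABLE_ORDER = [
    "belief", "raw_stmt_src", "reading_ref_link", "evidence_counts",
    "pa_agent_counts", "mesh_concept_ref_counts", "mesh_term_ref_counts",
    "name_meta", "text_meta", "other_meta", "source_meta",
    "agent_interactions", "fast_raw_pa_link", "raw_stmt_mesh_concepts",
    "raw_stmt_mesh_terms", "mesh_concept_meta", "mesh_term_meta",
]

def load_table(table: str, db: str = "indradb_readonly_local") -> None:
    # Hypothetical: assumes one dumped file per table, COPY-ed in via psql.
    subprocess.run(
        ["psql", "-d", db, "-c",
         f"\\copy readonly.{table} FROM '{table}.tsv'"],
        check=True,
    )

for table in TABLE_ORDER:
    load_table(table)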

Some tasks that need to be completed:

  • When the final readonly database is generated, the bash file needs a script to upload it to an S3 bucket. We still need to decide which bucket and prefix the database should be placed under (see the sketch after this list).
  • Everything currently runs through test_bash.sh; this will be fixed soon so that readonly_dumping_bash.sh is used instead.
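
For the first task, the upload itself could be as simple as the following sketch; the bucket and key are placeholders until the destination is decided:

import boto3

def upload_dump(dump_path: str,
                bucket: str = "bigmech",  # placeholder bucket
                key: str = "indra-db/dumps/readonly.dump") -> None:
    # boto3 handles multipart uploads automatically for large files.
    s3 = boto3.client("s3")
    s3.upload_file(dump_path, bucket, key)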

Depends on sorgerlab/indra#1460.

@kkaris self-requested a review on November 1, 2024 at 23:13
@kkaris (Contributor) left a comment:

There are updates needed from both of us, @haohangyan; in general I think it looks good.

Comment on lines +145 to +148
# tp = tas.process_from_web(affinity_class_limit=2,
#                           named_only=True,
#                           standardized_only=False)
tp = tas.process_csv('/Users/haohangyan/.data/readonly_pipeline/kb_source_data/tas.csv')

Restore the S3 loading.

@@ -149,6 +215,7 @@ def _get_statements(self):
    logger.info('Retrieving CBN network zip archive')
    tmp_zip = os.path.join(cbn_dir, 'cbn_human.zip')
    resp = requests.get(self.archive_url)
    resp.raise_for_status()

We should check whether raise_for_status() is called for all web downloads, or whether non-200 statuses are handled in some other way.
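
For instance, a small shared helper (hypothetical, not currently in the codebase) would make the check uniform across all downloads:

import requests

def download(url: str, path: str, timeout: int = 60) -> None:
    """Download url to path, failing loudly on any non-200 response."""
    resp = requests.get(url, timeout=timeout, stream=True)
    resp.raise_for_status()  # raise instead of silently saving an error page
    with open(path, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=8192):
            fh.write(chunk)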

def get_source_version(self):
    url = "https://downloads.thebiogrid.org/BioGRID/Release-Archive"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    version_numbers = re.findall(r'\d+\.\d+\.\d+', soup.text)

This looks a bit fragile: what if the homepage gets a minor update and another value matching the pattern is added somewhere on the page? I think it's better to be more precise about which part of the page is checked for the version number.
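
For example, a sketch only; it assumes each release link's href contains "BIOGRID-<version>", which would need to be verified against the actual page structure:

import re
import requests
from bs4 import BeautifulSoup

def get_biogrid_version() -> str:
    url = "https://downloads.thebiogrid.org/BioGRID/Release-Archive"
    resp = requests.get(url)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.content, "html.parser")
    # Only match version numbers inside link targets, not anywhere in
    # the page text, so unrelated numbers elsewhere cannot match.
    versions = [
        m.group(1)
        for a in soup.find_all("a", href=True)
        if (m := re.search(r"BIOGRID-(\d+\.\d+\.\d+)", a["href"]))
    ]
    # Compare numerically so that e.g. 4.4.210 sorts above 4.4.99.
    return max(versions, key=lambda v: tuple(map(int, v.split("."))))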

response = requests.head(url)
if 'Last-Modified' in response.headers:
    last_modified = response.headers['Last-Modified']
    version = version + subset + last_modified

Suggested change:
- version = version + subset + last_modified
+ version += subset + last_modified

expanded_stmts = [s for s in _expanded(stmts)]
def get_statements(self):
    from indra.sources import drugbank
    # For now, load from local since AWS is not set up

We can revert to using S3 now.


We should update this README given your recent additions. I can start going over it, but you should also have a look at it to see if any part needs to be updated for the code you added, @haohangyan.

ev_list = []
for source, count in summed_source_counts.items():
    for _ in range(count):
        ev_list.append(Evidence(source_api=source))

This needs to be updated with a mapping from the names used in the source counts to the source_api names, since source != source_api for certain knowledgebases. I'll take care of that.
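
A minimal sketch of that fix; the mapping entries below are placeholders, and the real table has to be derived from the DBInfo records:

from indra.statements import Evidence

# Placeholder entries; derive the real mapping from the DBInfo table.
SOURCE_TO_SOURCE_API = {
    "vhn": "virhostnet",
    "pe": "phosphoelm",
}

summed_source_counts = {"reach": 3, "vhn": 1}  # example input

ev_list = []
for source, count in summed_source_counts.items():
    source_api = SOURCE_TO_SOURCE_API.get(source, source)
    ev_list.extend(Evidence(source_api=source_api) for _ in range(count))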


num_srcs = len(src_count_dict)
has_rd = any(source in src_count_dict for source in SOURCE_GROUPS["reader"])
has_db = any(source in src_count_dict for source in SOURCE_GROUPS["database"])

We have to ensure that the mapping from indra_db names works with indra names, and that the database/reader distinction is correct here: the source counts use the names found in the DBInfo table, while SOURCE_GROUPS contains the names as they are in indra. I'll take a look at this as well.
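
A sketch of the normalization, again with placeholder entries; the real lookup must come from the DBInfo table:

# Placeholder lookup from DBInfo names to indra names.
DB_INFO_TO_INDRA = {
    "vhn": "virhostnet",
    "pe": "phosphoelm",
}

SOURCE_GROUPS = {"reader": {"reach", "sparser"},        # example values
                 "database": {"virhostnet", "phosphoelm"}}

src_count_dict = {"reach": 5, "vhn": 2}                 # example input

normalized = {DB_INFO_TO_INDRA.get(k, k): v for k, v in src_count_dict.items()}
num_srcs = len(normalized)
has_rd = any(source in normalized for source in SOURCE_GROUPS["reader"])
has_db = any(source in normalized for source in SOURCE_GROUPS["database"])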

echo "Dumping raw statements"
start=$(date +%s)
psql -d indradb_test \
    -h indradb-refresh.cwcetxbvbgrf.us-east-1.rds.amazonaws.com \

The URL needs to be updated.


What's the difference between readonly_dumping_bash.sh and test_bash.sh? They look almost the same, so they should probably be merged.
