
Readonly Pipeline #229

Open: wants to merge 265 commits into base: master
Conversation

@haohangyan (Contributor) commented on Oct 20, 2024

This PR contains all the code we need for the readonly pipeline.

Running the command bash indra_db/readonly_dumping/test_bash.sh starts the pipeline.
An overview of the pipeline:

  • Set up environment variables and database passwords.
  • Get file paths for required initial dump files and verify them (see the sketch after this list).
  • Dump raw statements, reading text content meta, and text refs principal from the principal database.
  • Get all files that will be loaded into the readonly database.
  • Recreate the local database and import data using the readonly_dumping script.
  • (Not tested) create and remove a local database dump file after uploading it to S3.
  • (Not tested) upload an end-date file to S3 and restore the dump to a readonly instance using pg_restore.
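
As a rough illustration of the first two steps, here is a minimal sketch; the variable and file names below are placeholders, not the ones the pipeline actually uses (test_bash.sh defines the real ones):

import os
from pathlib import Path

# Placeholder names for illustration only.
REQUIRED_ENV = ["PGPASSWORD", "READONLY_DB_URL"]
REQUIRED_DUMPS = [
    "raw_statements.tsv.gz",
    "reading_text_content_meta.tsv.gz",
    "text_refs_principal.tsv.gz",
]

def verify_inputs(dump_dir: str) -> None:
    missing_env = [v for v in REQUIRED_ENV if v not in os.environ]
    if missing_env:
        raise RuntimeError(f"Missing environment variables: {missing_env}")
    missing = [f for f in REQUIRED_DUMPS if not (Path(dump_dir) / f).is_file()]
    if missing:
        raise RuntimeError(f"Missing initial dump files: {missing}")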

The pipeline contains two main parts: export_assembly and readonly_dumping.

Some main stages in the export assembly include:

  1. Running the knowledgebase pipeline
  2. Running statement distillation
  3. Running preprocessing
  4. Merging processed knowledgebase statements with processed raw statements
  5. Running grounding and deduplication
  6. Calculating refinements
  7. Calculating the belief score

In the readonly_dumping part, we dump the tables into the local Postgres database in the following order: “belief”, “raw_stmt_src”, “reading_ref_link”, “evidence_counts”, “pa_agent_counts”, “mesh_concept_ref_counts”, “mesh_term_ref_counts”, “name_meta”, “text_meta”, “other_meta”, “source_meta”, “agent_interactions”, “fast_raw_pa_link”, “raw_stmt_mesh_concepts”, “raw_stmt_mesh_terms”, “mesh_concept_meta”, “mesh_term_meta”.
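
To make the ordering concrete, here is a minimal sketch; the load_table helper and the local database name are hypothetical stand-ins for what the readonly_dumping script actually does:

import subprocess

# Load order as listed in the PR description.
TABLE_ORDER = [
    "belief", "raw_stmt_src", "reading_ref_link", "evidence_counts",
    "pa_agent_counts", "mesh_concept_ref_counts", "mesh_term_ref_counts",
    "name_meta", "text_meta", "other_meta", "source_meta",
    "agent_interactions", "fast_raw_pa_link", "raw_stmt_mesh_concepts",
    "raw_stmt_mesh_terms", "mesh_concept_meta", "mesh_term_meta",
]

def load_table(table: str, db: str = "indradb_readonly_local") -> None:
    # Hypothetical: assumes one dumped file per table, COPY-ed in via psql.
    subprocess.run(
        ["psql", "-d", db, "-c",
         f"\\copy readonly.{table} FROM '{table}.tsv'"],
        check=True,
    )

for table in TABLE_ORDER:
    load_table(table)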

Some tasks that need to be completed:

  • When the final readonly database is generated, the bash file needs a script to upload it to an S3 bucket. We still need to decide which bucket and prefix the database should be placed under (see the sketch after this list).
  • Everything currently runs through test_bash.sh; this will be fixed soon so that readonly_dumping_bash.sh is used instead.
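
For the first task, the upload itself could be as simple as the following sketch; the bucket and key are placeholders until the destination is decided:

import boto3

def upload_dump(dump_path: str,
                bucket: str = "bigmech",  # placeholder bucket
                key: str = "indra-db/dumps/readonly.dump") -> None:
    # boto3 handles multipart uploads automatically for large files.
    s3 = boto3.client("s3")
    s3.upload_file(dump_path, bucket, key)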

Depends on sorgerlab/indra#1460.

@kkaris self-requested a review on November 1, 2024 at 23:13
@kkaris (Contributor) left a comment:

There are updates needed from both of us, @haohangyan; in general I think it looks good.

Comment on lines +145 to +148
# tp = tas.process_from_web(affinity_class_limit=2,
#                           named_only=True,
#                           standardized_only=False)
tp = tas.process_csv('/Users/haohangyan/.data/readonly_pipeline/kb_source_data/tas.csv')

Restore the S3 loading.

@@ -149,6 +215,7 @@ def _get_statements(self):
    logger.info('Retrieving CBN network zip archive')
    tmp_zip = os.path.join(cbn_dir, 'cbn_human.zip')
    resp = requests.get(self.archive_url)
    resp.raise_for_status()

We should check whether raise_for_status() is called for all web downloads, or whether non-200 statuses are handled in some other way.
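
For instance, a small shared helper (hypothetical, not currently in the codebase) would make the check uniform across all downloads:

import requests

def download(url: str, path: str, timeout: int = 60) -> None:
    """Download url to path, failing loudly on any non-200 response."""
    resp = requests.get(url, timeout=timeout, stream=True)
    resp.raise_for_status()  # raise instead of silently saving an error page
    with open(path, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=8192):
            fh.write(chunk)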

def get_source_version(self):
    url = "https://downloads.thebiogrid.org/BioGRID/Release-Archive"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    version_numbers = re.findall(r'\d+\.\d+\.\d+', soup.text)

This looks a bit fragile: what if the homepage gets a minor update and another value matching the pattern is added somewhere on the page? I think it's better to be more precise about which part of the page is checked for the version number.
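
For example, a sketch only; it assumes each release link's href contains "BIOGRID-<version>", which would need to be verified against the actual page structure:

import re
import requests
from bs4 import BeautifulSoup

def get_biogrid_version() -> str:
    url = "https://downloads.thebiogrid.org/BioGRID/Release-Archive"
    resp = requests.get(url)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.content, "html.parser")
    # Only match version numbers inside link targets, not anywhere in
    # the page text, so unrelated numbers elsewhere cannot match.
    versions = [
        m.group(1)
        for a in soup.find_all("a", href=True)
        if (m := re.search(r"BIOGRID-(\d+\.\d+\.\d+)", a["href"]))
    ]
    # Compare numerically so that e.g. 4.4.210 sorts above 4.4.99.
    return max(versions, key=lambda v: tuple(map(int, v.split("."))))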

response = requests.head(url)
if 'Last-Modified' in response.headers:
    last_modified = response.headers['Last-Modified']
    version = version + subset + last_modified

Suggested change:
- version = version + subset + last_modified
+ version += subset + last_modified

expanded_stmts = [s for s in _expanded(stmts)]
def get_statements(self):
    from indra.sources import drugbank
    # For now, load from local since AWS is not set up

We can revert to using S3 now.


We should update this README given your recent additions. I can start going over it, but you should also have a look at it to see if any part needs to be updated for the code you added, @haohangyan.

ev_list = []
for source, count in summed_source_counts.items():
    for _ in range(count):
        ev_list.append(Evidence(source_api=source))

This needs to be updated with a mapping from the names used in the source counts to the source_api names, since source != source_api for certain knowledgebases. I'll take care of that.
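
A minimal sketch of that fix; the mapping entries below are placeholders, and the real table has to be derived from the DBInfo records:

from indra.statements import Evidence

# Placeholder entries; derive the real mapping from the DBInfo table.
SOURCE_TO_SOURCE_API = {
    "vhn": "virhostnet",
    "pe": "phosphoelm",
}

summed_source_counts = {"reach": 3, "vhn": 1}  # example input

ev_list = []
for source, count in summed_source_counts.items():
    source_api = SOURCE_TO_SOURCE_API.get(source, source)
    ev_list.extend(Evidence(source_api=source_api) for _ in range(count))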


num_srcs = len(src_count_dict)
has_rd = any(source in src_count_dict for source in SOURCE_GROUPS["reader"])
has_db = any(source in src_count_dict for source in SOURCE_GROUPS["database"])

We have to ensure that the mapping from indra_db names works with indra names, and that the database/reader distinction is correct here: the source counts use the names found in the DBInfo table, while SOURCE_GROUPS contains the names as they are in indra. I'll take a look at this as well.
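
A sketch of the normalization, again with placeholder entries; the real lookup must come from the DBInfo table:

# Placeholder lookup from DBInfo names to indra names.
DB_INFO_TO_INDRA = {
    "vhn": "virhostnet",
    "pe": "phosphoelm",
}

SOURCE_GROUPS = {"reader": {"reach", "sparser"},        # example values
                 "database": {"virhostnet", "phosphoelm"}}

src_count_dict = {"reach": 5, "vhn": 2}                 # example input

normalized = {DB_INFO_TO_INDRA.get(k, k): v for k, v in src_count_dict.items()}
num_srcs = len(normalized)
has_rd = any(source in normalized for source in SOURCE_GROUPS["reader"])
has_db = any(source in normalized for source in SOURCE_GROUPS["database"])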

echo "Dumping raw statements"
start=$(date +%s)
psql -d indradb_test \
    -h indradb-refresh.cwcetxbvbgrf.us-east-1.rds.amazonaws.com \

The URL needs to be updated.


What's the difference between readonly_dumping_bash.sh and test_bash.sh? They look almost the same, so they should probably be merged.
