Readonly Pipeline #229
base: master
Conversation
There are updates needed from both of us, @haohangyan; in general I think it looks good.
# tp = tas.process_from_web(affinity_class_limit=2,
#                           named_only=True,
#                           standardized_only=False)
tp = tas.process_csv('/Users/haohangyan/.data/readonly_pipeline/kb_source_data/tas.csv')
Restore the S3 loading.
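For illustration, a minimal sketch of what the restored S3 load could look like; the bucket, key, and temp path below are hypothetical placeholders, not the project's actual configuration:

import os
import tempfile

import boto3
from indra.sources import tas

# Hypothetical bucket and key; substitute the pipeline's actual S3 location
bucket = 'bigmech'
key = 'readonly_pipeline/kb_source_data/tas.csv'

# Download the CSV to a temporary file, then process it as before
tmp_path = os.path.join(tempfile.gettempdir(), 'tas.csv')
boto3.client('s3').download_file(bucket, key, tmp_path)
tp = tas.process_csv(tmp_path)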
@@ -149,6 +215,7 @@ def _get_statements(self):
     logger.info('Retrieving CBN network zip archive')
     tmp_zip = os.path.join(cbn_dir, 'cbn_human.zip')
     resp = requests.get(self.archive_url)
+    resp.raise_for_status()
We should check whether the raise_for_status() check is done for all web downloads, or whether non-200 statuses are handled in other ways.
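One way to make the check uniform would be a small shared helper along these lines (the name and location are illustrative, not part of the codebase):

import requests

def download_to_file(url, path, chunk_size=8192):
    """Download url to path, raising on any non-2xx response."""
    resp = requests.get(url, stream=True)
    resp.raise_for_status()  # fail fast instead of writing an error page to disk
    with open(path, 'wb') as fh:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            fh.write(chunk)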
def get_source_version(self):
    url = "https://downloads.thebiogrid.org/BioGRID/Release-Archive"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    version_numbers = re.findall(r'\d+\.\d+\.\d+', soup.text)
This looks a bit fragile: what if the homepage gets a minor update and another value matching the pattern is added somewhere on the page? It would be better to be more precise about which part of the page is checked for the version number.
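As a sketch, the search could be restricted to the release links rather than the full page text; the assumption that release hrefs contain the version string would need to be verified against the actual page layout:

import re

import requests
from bs4 import BeautifulSoup

url = "https://downloads.thebiogrid.org/BioGRID/Release-Archive"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Assumption: each release is linked with its version in the href,
# e.g. a path component like BIOGRID-4.4.219
versions = []
for link in soup.find_all('a', href=True):
    match = re.search(r'\d+\.\d+\.\d+', link['href'])
    if match:
        versions.append(match.group())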
response = requests.head(url)
if 'Last-Modified' in response.headers:
    last_modified = response.headers['Last-Modified']
version = version + subset + last_modified
Suggested change:
- version = version + subset + last_modified
+ version += subset + last_modified
expanded_stmts = [s for s in _expanded(stmts)]

def get_statements(self):
    from indra.sources import drugbank
    # For now, load from local since AWS is not set up
We can revert to using S3 now.
We should update this README to reflect your changes. I can start going over it, but you should also have a look to see whether any part needs updating given what you added to the code, @haohangyan.
ev_list = []
for source, count in summed_source_counts.items():
    for _ in range(count):
        ev_list.append(Evidence(source_api=source))
This needs to be updated with a mapping from the names used in the source counts to the source_api names, since source != source_api for certain knowledge bases. I'll take care of that.
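A minimal sketch of what that could look like; the mapping entries below are hypothetical placeholders for the real source-to-source_api correspondences:

from indra.statements import Evidence

# Hypothetical examples; the real mapping must cover every knowledge base
# whose source-count name differs from its source_api name
SOURCE_TO_SOURCE_API = {
    'pc': 'biopax',
    'bel_lc': 'bel',
}

ev_list = []
for source, count in summed_source_counts.items():
    source_api = SOURCE_TO_SOURCE_API.get(source, source)
    for _ in range(count):
        ev_list.append(Evidence(source_api=source_api))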
num_srcs = len(src_count_dict)
has_rd = any(source in src_count_dict for source in SOURCE_GROUPS["reader"])
has_db = any(source in src_count_dict for source in SOURCE_GROUPS["database"])
We have to ensure that the mapping from INDRA DB names works with INDRA names and that the database/reader distinction is correct here, since the source counts use the names found in the DBInfo table, while SOURCE_GROUPS contains the names as they appear in INDRA. I'll take a look at this as well.
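For illustration, the check could normalize the DBInfo names to INDRA names first; the mapping entry here is a hypothetical placeholder to be filled in from the DBInfo table:

# Hypothetical DBInfo-name -> INDRA-name mapping
DB_NAME_TO_INDRA = {
    'vhn': 'virhostnet',
}

normalized = {DB_NAME_TO_INDRA.get(src, src) for src in src_count_dict}
num_srcs = len(normalized)
has_rd = any(source in normalized for source in SOURCE_GROUPS["reader"])
has_db = any(source in normalized for source in SOURCE_GROUPS["database"])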
echo "Dumping raw statements" | ||
start=$(date +%s) | ||
psql -d indradb_test \ | ||
-h indradb-refresh.cwcetxbvbgrf.us-east-1.rds.amazonaws.com \ |
The database host URL here needs to be updated.
What's the difference between readonly_dumping_bash.sh and test_bash.sh? They look almost the same, so they should probably be merged.
This PR contains all the code we need for the readonly pipeline.
When we run the command bash indra_db/readonly_dumping/test_bash.sh, the pipeline starts running.

An overview of the pipeline:
The pipeline contains two main parts: export_assembly and readonly_dumping.
Some of the main stages in the export_assembly part include:
In the readonly_dumping part, we dump the tables into the local Postgres database in the following order: “belief”, “raw_stmt_src”, “reading_ref_link”, “evidence_counts”, “pa_agent_counts”, “mesh_concept_ref_counts”, “mesh_term_ref_counts”, “name_meta”, “text_meta”, “other_meta”, “source_meta”, “agent_interactions”, “fast_raw_pa_link”, “raw_stmt_mesh_concepts”, “raw_stmt_mesh_terms”, “mesh_concept_meta”, “mesh_term_meta”.
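As an illustration of this ordering constraint, the dump step can be driven by an explicit list processed sequentially; dump_table below is a hypothetical stand-in for the per-table dump logic, not a function in this PR:

DUMP_ORDER = [
    "belief", "raw_stmt_src", "reading_ref_link", "evidence_counts",
    "pa_agent_counts", "mesh_concept_ref_counts", "mesh_term_ref_counts",
    "name_meta", "text_meta", "other_meta", "source_meta",
    "agent_interactions", "fast_raw_pa_link", "raw_stmt_mesh_concepts",
    "raw_stmt_mesh_terms", "mesh_concept_meta", "mesh_term_meta",
]

for table in DUMP_ORDER:
    dump_table(table)  # hypothetical helper encapsulating one table's dump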
Some tasks that need to be completed:
Depends on sorgerlab/indra#1460.