# Publications manifest generator for the Cancer Complexity Knowledge Portal (CCKP)
Manifests for the CCKP can be generated using Docker or Python (3.9+). Regardless of approach, a Synapse account is required, and an Entrez account is strongly recommended; failing to provide Entrez credentials will most likely result in timeout errors from NCBI.
Create a file called `.env` and update its contents with your Synapse Personal Access Token (PAT) and NCBI account info:

```
# Synapse Credentials
SYNAPSE_AUTH_TOKEN=<PAT>

# Entrez Credentials
ENTREZ_EMAIL=<email>
ENTREZ_API_KEY=<apikey>
```
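Before launching a run, it can help to sanity-check that the credentials are actually set. The helper below is hypothetical (not part of this repo); it only illustrates which variables the crawler expects:

```python
import os

# Required by the crawler; ENTREZ_API_KEY is optional but strongly recommended,
# since NCBI applies stricter rate limits without it.
REQUIRED = ("SYNAPSE_AUTH_TOKEN", "ENTREZ_EMAIL")

def missing_credentials(env=os.environ):
    """Return the names of required credential variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Example: a fully populated environment reports nothing missing.
print(missing_credentials({"SYNAPSE_AUTH_TOKEN": "tok", "ENTREZ_EMAIL": "me@example.org"}))  # []
```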
Run the Docker container, replacing `/path/to/.env` with your path to `.env`:

```
docker run --rm -ti \
  --env-file /path/to/.env \
  --volume $PWD/output:/tmp/output:rw \
  ghcr.io/mc2-center/pubmed-crawler
```
If this is your first time running the command, Docker will first pull the image (max. 1-2 minutes) before running the container.
To pull the latest Docker changes, run the following command:

```
docker pull ghcr.io/mc2-center/pubmed-crawler
```
Depending on how many new publications have been added to PubMed since the last scrape (and NCBI's current requests traffic), this step could take anywhere from 30 seconds to about 15 minutes. Once complete, a manifest will be found in a folder called `output`, with a name like `publications_manifest_<yyyy-mm-dd>.xlsx`, where `<yyyy-mm-dd>` is the current date.
- Clone this repo where you want on your local machine, e.g. the current directory, `Desktop`, etc.

  ```
  git clone https://github.com/mc2-center/pubmed-crawler.git
  ```

- In the `pubmed-crawler` directory, copy `.envTemplate` as `.env`, then update its contents with your Synapse Personal Access Token (PAT) and NCBI account info.

- Install the dependencies for the Python scripts, ideally in a virtual environment, e.g. conda or pyenv. For example:

  ```
  conda create -n pubmed-crawler python=3.9
  conda activate pubmed-crawler
  pip install -r requirements.txt
  ```

- Set environment variables from `.env` so that the scripts will have access to the credentials:

  ```
  export $(grep -v '^#' .env | xargs)
  ```
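The `export $(grep -v '^#' .env | xargs)` one-liner skips comment lines and hands the remaining `KEY=value` pairs to the shell. A rough Python equivalent (illustrative only; it assumes simple, unquoted values) is:

```python
def parse_env(text):
    """Parse simple KEY=value lines, skipping blank lines and '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # mirrors grep -v '^#'
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = "# Synapse Credentials\nSYNAPSE_AUTH_TOKEN=abc123\n"
print(parse_env(sample))  # {'SYNAPSE_AUTH_TOKEN': 'abc123'}
```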
While in the virtual environment, run the command:

```
python pubmed_crawler.py -t syn21868591
```

where `syn21868591` is the Synapse table containing publications already curated for the CCKP. PubMed Crawler uses this table to compare against publications found in PubMed, based on the grant numbers found in the Portal - Grants Merged table (`syn21918972`).
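Conceptually, that comparison boils down to a set difference: any PMID returned by PubMed for the portal's grant numbers that is not already in the curated table is treated as new. A minimal sketch (not the repo's actual code):

```python
def new_pmids(pubmed_pmids, curated_pmids):
    """PMIDs found in PubMed but not yet present in the curated Publications table."""
    return sorted(set(pubmed_pmids) - set(curated_pmids))

print(new_pmids(["111", "222", "333"], ["222"]))  # ['111', '333']
```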
To change the table of grants to query PubMed with, use `-g` or `--grantview_id`. For example:

```
python pubmed_crawler.py -t syn21868591 -g syn33657459
```
When using a different table of grants, ensure that its schema has at least the following columns:

- `grantNumber`
- `consortium`
- `theme`
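A quick way to confirm a custom grant table's schema before kicking off a run, assuming you have its column names as a list (hypothetical helper, not part of this repo):

```python
# Minimum columns the crawler expects in the grant table.
REQUIRED_COLUMNS = {"grantNumber", "consortium", "theme"}

def missing_columns(column_names):
    """Return required columns that the custom grant table lacks, sorted by name."""
    return sorted(REQUIRED_COLUMNS - set(column_names))

print(missing_columns(["grantNumber", "theme", "grantName"]))  # ['consortium']
```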
Below is the full usage of the script:

```
usage: pubmed_crawler.py [-h] [-g GRANT_ID] -t TABLE_ID [-o OUTPUT_NAME]

Get PubMed information from a list of grant numbers and put the results into a
CSV file. Table ID can be provided if interested in only scraping for new
publications.

optional arguments:
  -h, --help            show this help message and exit
  -g GRANT_ID, --grant_id GRANT_ID
                        Synapse table/view ID containing grant numbers in
                        'grantNumber' column. (Default: syn21918972)
  -t TABLE_ID, --table_id TABLE_ID
                        Current Synapse table holding PubMed info.
  -o OUTPUT_NAME, --output_name OUTPUT_NAME
```
Any PMIDs found in PubMed that are not found in the Publications table will be scraped. Depending on the number of new publications (and NCBI's current requests traffic), this step could take anywhere from 30 seconds to about 15 minutes. Once complete, a manifest will be found in a folder called `output`, with a name like `publications_manifest_<yyyy-mm-dd>.xlsx`, where `<yyyy-mm-dd>` is the current date.
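The `<yyyy-mm-dd>` suffix follows the ISO date format, which maps directly onto `date.isoformat()` in Python. The sketch below is illustrative, not the repo's actual naming code:

```python
from datetime import date

def manifest_name(on=None):
    """Build the dated manifest filename, e.g. publications_manifest_2024-01-31.xlsx."""
    on = on or date.today()  # default to the current date, as the crawler does
    return f"publications_manifest_{on.isoformat()}.xlsx"

print(manifest_name(date(2024, 1, 31)))  # publications_manifest_2024-01-31.xlsx
```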
Fill out the manifest(s) as needed, using the pre-defined Controlled Vocabulary listed in `standard_terms` for applicable columns. Once complete, validate and upload the manifest(s) with the Data Curator App (DCA).