MC2 Center PubMed Crawler

Publications manifest generator for the Cancer Complexity Knowledge Portal (CCKP)



Manifests for the CCKP can be generated using Docker or Python (3.9+). Either way, a Synapse account is required and an Entrez (NCBI) account is strongly recommended. Without Entrez credentials, requests to NCBI will most likely fail with timeout errors.

🐳 Generate with Docker

Setup

Create a file called .env and update its contents with your Synapse Personal Access Token (PAT) and NCBI account info.

# Synapse Credentials
SYNAPSE_AUTH_TOKEN=<PAT>

# Entrez Credentials
ENTREZ_EMAIL=<email>
ENTREZ_API_KEY=<apikey>
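
Before running anything, it can help to confirm that no placeholder was left unfilled in .env. The following is a minimal sketch, not part of this repo: parse_env and unfilled_keys are hypothetical helper names, and the sample text (including the example.org address) is illustrative only.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def unfilled_keys(values):
    """Return keys whose value is empty or still a <placeholder>."""
    return sorted(
        key for key, value in values.items()
        if not value or (value.startswith("<") and value.endswith(">"))
    )


# Illustrative file contents; in practice, read them from your own .env.
sample = """\
# Synapse Credentials
SYNAPSE_AUTH_TOKEN=<PAT>

# Entrez Credentials
ENTREZ_EMAIL=someone@example.org
ENTREZ_API_KEY=
"""

print(unfilled_keys(parse_env(sample)))
```

An empty list means every variable in the file has a value.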

Usage

Run the Docker container, replacing /path/to/.env with your path to .env.

docker run --rm -ti \
  --env-file /path/to/.env \
  --volume "$PWD/output:/tmp/output:rw" \
  ghcr.io/mc2-center/pubmed-crawler

The first time you run the command, Docker pulls the image (1-2 minutes at most) before running the container.

To pull the latest Docker changes, run the following command:

docker pull ghcr.io/mc2-center/pubmed-crawler

Output

Depending on how many new publications have been added to PubMed since the last scrape (and on NCBI's current request traffic), this step can take anywhere from 30 seconds to about 15 minutes. Once complete, the manifest is written to a folder called output, with a name like publications_manifest_<yyyy-mm-dd>.xlsx, where <yyyy-mm-dd> is the current date.
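
To locate today's manifest from a script, the naming pattern above can be rebuilt with the standard library. This is a minimal sketch: expected_manifest is a hypothetical helper, and it assumes the output folder sits in the current working directory and that the date in the filename is in ISO format, as the <yyyy-mm-dd> pattern suggests.

```python
from datetime import date
from pathlib import Path


def expected_manifest(output_dir="output", day=None):
    """Build the manifest path for a given day (defaults to today)."""
    day = day or date.today()
    # date.isoformat() yields yyyy-mm-dd, matching the documented pattern.
    return Path(output_dir) / f"publications_manifest_{day.isoformat()}.xlsx"


path = expected_manifest()
print(path, "exists" if path.exists() else "not found yet")
```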

🐍 Generate with Python

Setup

  1. Clone this repo anywhere on your local machine, e.g. current directory, Desktop, etc.

    git clone https://github.com/mc2-center/pubmed-crawler.git
    
  2. In the pubmed-crawler directory, copy .envTemplate to .env, then update its contents with your Synapse Personal Access Token (PAT) and NCBI account info.

  3. Install the dependencies for the Python scripts, ideally in a virtual environment, e.g. conda or pyenv. For example:

    conda create -n pubmed-crawler python=3.9
    conda activate pubmed-crawler
    pip install -r requirements.txt
    
  4. Set environment variables from .env so that the scripts will have access to the credentials.

    export $(grep -v '^#' .env | xargs)
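
After exporting, a quick check from Python can confirm the credentials are actually visible to the scripts. This is a minimal sketch: check_credentials is a hypothetical helper, and treating the Synapse token as required and the Entrez values as recommended is an assumption based on the requirements stated at the top of this README.

```python
import os

# Assumption: Synapse is required, Entrez is strongly recommended.
REQUIRED = ("SYNAPSE_AUTH_TOKEN",)
RECOMMENDED = ("ENTREZ_EMAIL", "ENTREZ_API_KEY")


def check_credentials(environ):
    """Return (missing required names, missing recommended names)."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    warnings = [name for name in RECOMMENDED if not environ.get(name)]
    return missing, warnings


missing, warnings = check_credentials(os.environ)
if missing:
    print("Missing required variables:", ", ".join(missing))
if warnings:
    print("Missing recommended variables:", ", ".join(warnings))
if not missing and not warnings:
    print("All credentials are set.")
```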
    

Usage

While in the virtual environment, run the command:

python pubmed_crawler.py -t syn21868591

where:

  • syn21868591 is the Synapse table containing publications already curated for the CCKP

PubMed Crawler compares this table against the publications found in PubMed for the grant numbers listed in the Portal - Grants Merged table (syn21918972). To query PubMed with a different table of grants, use -g. For example:

python pubmed_crawler.py -t syn21868591 -g syn33657459

When using a different table of grants, ensure that its schema has at least the following columns:

  • grantNumber
  • consortium
  • theme
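
The column requirement above can be checked before a run. This is a minimal sketch: missing_grant_columns is a hypothetical helper that takes a plain list of column names, however you obtain them from your table, and the sample list (including grantName) is illustrative only.

```python
# Column names taken from the list above.
REQUIRED_COLUMNS = {"grantNumber", "consortium", "theme"}


def missing_grant_columns(columns):
    """Return the required column names absent from `columns`, sorted."""
    return sorted(REQUIRED_COLUMNS - set(columns))


# Illustrative column list for a custom grants table.
print(missing_grant_columns(["grantNumber", "theme", "grantName"]))
```

An empty list means the table has every column the crawler expects.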

Below is the full usage of the script:

usage: pubmed_crawler.py [-h] [-g GRANT_ID] -t TABLE_ID [-o OUTPUT_NAME]

Get PubMed information from a list of grant numbers and put the results into a CSV file.
Table ID can be provided if interested in only scraping for new publications.

optional arguments:
  -h, --help            show this help message and exit
  -g GRANT_ID, --grant_id GRANT_ID
                        Synapse table/view ID containing grant numbers in 'grantNumber' column. 
                        (Default: syn21918972)
  -t TABLE_ID, --table_id TABLE_ID
                        Current Synapse table holding PubMed info.
  -o OUTPUT_NAME, --output_name OUTPUT_NAME

Output

Any PMIDs found in PubMed that are not found in the Publications table will be scraped. Depending on the number of new publications (and on NCBI's current request traffic), this step can take anywhere from 30 seconds to about 15 minutes. Once complete, the manifest is written to a folder called output, with a name like publications_manifest_<yyyy-mm-dd>.xlsx, where <yyyy-mm-dd> is the current date.
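
The comparison described above amounts to a set difference. The following is a minimal sketch of that idea, not the script's actual implementation: new_pmids is a hypothetical function, and the PMIDs shown are made-up placeholders.

```python
def new_pmids(pubmed_pmids, curated_pmids):
    """Return PMIDs present in PubMed results but not yet curated.

    IDs are compared as stripped strings; first-seen order is kept
    and duplicates are dropped.
    """
    curated = {str(pmid).strip() for pmid in curated_pmids}
    seen = set()
    result = []
    for pmid in pubmed_pmids:
        pmid = str(pmid).strip()
        if pmid not in curated and pmid not in seen:
            seen.add(pmid)
            result.append(pmid)
    return result


# Made-up IDs: "102" is already curated, so only "101" and "103" remain.
print(new_pmids(["101", "102", "103", "102"], [102, "104"]))
```

Normalizing to strings avoids a mismatch when one source stores PMIDs as integers and the other as text.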

✏️ Next Steps

Fill out the manifest(s) as needed, using the pre-defined Controlled Vocabulary listed in standard_terms for applicable columns. Once complete, validate and upload the manifest(s) with the Data Curator App (DCA).

Read more about annotating and using the DCA.