MC² Center Pubmed Crawler

Publications manifest generator for the Cancer Complexity Knowledge Portal (CCKP)

Manifests for the CCKP can be generated using Docker or Python (3.9+). Either way, a Synapse account is required and an Entrez (NCBI) account is strongly recommended. Without Entrez credentials, requests to NCBI will most likely time out.

🐳 Generate with Docker

Setup

Create a file called .env and update its contents with your Synapse Personal Access Token (PAT) and NCBI account info.

# Synapse Credentials
SYNAPSE_AUTH_TOKEN=<PAT>

# Entrez Credentials
ENTREZ_EMAIL=<email>
ENTREZ_API_KEY=<apikey>
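
Before starting a run, it can help to confirm which of these variables are actually set. The following is a minimal sketch and not part of the crawler itself: `check_credentials` is a hypothetical helper, and it only assumes the three variable names shown above.

```python
import os

# Variable names taken from the .env example above.
REQUIRED = ("SYNAPSE_AUTH_TOKEN",)
RECOMMENDED = ("ENTREZ_EMAIL", "ENTREZ_API_KEY")


def check_credentials(env=None):
    """Return (missing_required, missing_recommended) variable names.

    Hypothetical helper for illustration only. A variable counts as
    missing when it is unset or empty.
    """
    env = os.environ if env is None else env
    missing_required = [name for name in REQUIRED if not env.get(name)]
    missing_recommended = [name for name in RECOMMENDED if not env.get(name)]
    return missing_required, missing_recommended


if __name__ == "__main__":
    required, recommended = check_credentials()
    if required:
        print("Missing required variables:", ", ".join(required))
    if recommended:
        print("Missing recommended variables (NCBI timeouts are likely):",
              ", ".join(recommended))
```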

Usage

Run the Docker container, replacing /path/to/.env with your path to .env.

docker run --rm -ti \
  --env-file /path/to/.env \
  --volume $PWD/output:/tmp/output:rw \
  ghcr.io/mc2-center/pubmed-crawler

If this is your first time running the command, Docker will first pull the image (max. 1-2 minutes) before running the container.

To pull the latest Docker changes, run the following command:

docker pull ghcr.io/mc2-center/pubmed-crawler

Output

Depending on how many new publications have been added to PubMed since the last scrape (and on NCBI's current request traffic), this step could take anywhere from 30 seconds to about 15 minutes. Once complete, the manifest will be in a folder called output, with a name like publications_manifest_<yyyy-mm-dd>.xlsx, where <yyyy-mm-dd> is the current date.
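
To locate the manifest from another script, the expected file name can be rebuilt from the date. This is a sketch under the assumption that the name follows exactly the pattern described above; `manifest_name` is a hypothetical helper, not a function from the crawler.

```python
from datetime import date


def manifest_name(day=None):
    """Build the expected manifest file name for a given date.

    Assumes the pattern publications_manifest_<yyyy-mm-dd>.xlsx
    described in this README. Defaults to today's date.
    """
    day = date.today() if day is None else day
    return f"publications_manifest_{day.isoformat()}.xlsx"


if __name__ == "__main__":
    # The file is expected inside the "output" folder.
    print("output/" + manifest_name())
```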

🐍 Generate with Python

Setup

Clone this repo to any location on your local machine, e.g. the current directory or your Desktop.
```
git clone https://github.com/mc2-center/pubmed-crawler.git
```
In the pubmed-crawler directory, copy .envTemplate as .env, then update its contents with your Synapse Personal Access Token (PAT) and NCBI account info.
Install the dependencies for the Python scripts, ideally in a virtual environment, e.g. conda or pyenv. For example:
```
conda create -n pubmed-crawler python=3.9
conda activate pubmed-crawler
pip install -r requirements.txt
```
Set environment variables from .env so that the scripts will have access to the credentials.
```
export $(grep -v '^#' .env | xargs)
```
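
Where the `export` line above is inconvenient (for example, in a shell without `grep` and `xargs`), the same KEY=VALUE lines can be read from Python. This is a simplified sketch: `parse_env` and `load_env` are hypothetical helpers, and quoting and multi-line values are not handled.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_env(path=".env"):
    """Read the file at `path` and copy its variables into os.environ."""
    with open(path, encoding="utf-8") as handle:
        os.environ.update(parse_env(handle.read()))
```

Variables set this way are visible only to the current Python process and its children, so `load_env()` would need to run in the same process that starts the crawler.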

Usage

While in the virtual environment, run the command:

python pubmed_crawler.py -t syn21868591

where:

syn21868591 is the Synapse table containing publications already curated for the CCKP

PubMed Crawler compares the publications in this table against those found in PubMed for the grant numbers listed in the Portal - Grants Merged table (syn21918972). To query PubMed with a different table of grants, use -g or --grant_id. For example:

python pubmed_crawler.py -t syn21868591 -g syn33657459

When using a different table of grants, ensure that its schema has at least the following columns:

grantNumber
consortium
theme
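
A quick way to check a candidate grants table against this list is a set difference on its column names. The sketch below is an illustration only: `missing_columns` is a hypothetical helper, and the column names are assumed to have already been retrieved from Synapse by some other means.

```python
# Column names required by the README for a custom grants table.
REQUIRED_COLUMNS = {"grantNumber", "consortium", "theme"}


def missing_columns(columns):
    """Return the required column names absent from `columns`, sorted."""
    return sorted(REQUIRED_COLUMNS - set(columns))


if __name__ == "__main__":
    # Example column list for a table that lacks "consortium".
    print(missing_columns(["grantNumber", "theme", "grantName"]))
```

An empty result means the table has every column listed above. Extra columns are ignored.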

Below is the full usage of the script:

usage: pubmed_crawler.py [-h] [-g GRANT_ID] -t TABLE_ID [-o OUTPUT_NAME]

Get PubMed information from a list of grant numbers and put the results into a CSV file.
Table ID can be provided if interested in only scrapping for new publications.

optional arguments:
  -h, --help            show this help message and exit
  -g GRANT_ID, --grant_id GRANT_ID
                        Synapse table/view ID containing grant numbers in 'grantNumber' column. 
                        (Default: syn21918972)
  -t TABLE_ID, --table_id TABLE_ID
                        Current Synapse table holding PubMed info.
  -o OUTPUT_NAME, --output_name OUTPUT_NAME

Output

Any PMIDs found in PubMed that are not found in the Publications table will be scraped. Depending on the number of new publications (and on NCBI's current request traffic), this step could take anywhere from 30 seconds to about 15 minutes. Once complete, the manifest will be in a folder called output, with a name like publications_manifest_<yyyy-mm-dd>.xlsx, where <yyyy-mm-dd> is the current date.
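
The comparison described above amounts to filtering out PMIDs that are already curated. The sketch below shows that idea in isolation. It is not the crawler's actual implementation, and `new_pmids` is a hypothetical helper that treats PMIDs as strings.

```python
def new_pmids(found_in_pubmed, already_curated):
    """Return PMIDs from PubMed that are not yet in the curated table.

    Values are normalised to stripped strings, duplicates are dropped,
    and the order of first appearance is preserved.
    """
    curated = {str(pmid).strip() for pmid in already_curated}
    seen = set()
    result = []
    for pmid in found_in_pubmed:
        key = str(pmid).strip()
        if key not in curated and key not in seen:
            seen.add(key)
            result.append(key)
    return result


if __name__ == "__main__":
    # Made-up PMIDs for illustration only.
    print(new_pmids(["1", "2", "2", "3"], ["2"]))
```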

✏️ Next Steps

Fill out the manifest(s) as needed, using the pre-defined Controlled Vocabulary listed in standard_terms for applicable columns. Once complete, validate and upload the manifest(s) with the Data Curator App (DCA).

→ Read more about annotating and using the DCA.
