
Pipeline to literature

A Pipeline for Obtaining Relevant Literature Based on Given Keywords

It is a pipeline that helps researchers accelerate literature searches and information acquisition.

Let's walk through the steps!

Step 1

Obtaining query syntaxes for databases such as PubMed based on keywords

  1. Common approach

Take PubMed as an example.

We use the subject keywords of our current study (e.g., mycotoxin, enzyme, degrade, degradation) as an example.

Website: https://pubmed.ncbi.nlm.nih.gov/advanced/

Run a search using your keyword query statement.


Note: When you use a literature database to search for relevant literature, we recommend optimizing your keywords first. For example, if your research area is a medical topic, validate your keywords against MeSH (http://www.nlm.nih.gov/mesh/) to ensure that the most accurate controlled vocabulary is used. This maximizes the chance that the literature retrieved from the database is accurate and relevant.
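For illustration, a query assembled by hand from the example keywords above might look like the following; this is only a sketch, not the exact syntax produced by the tool described later:

(mycotoxin[Title/Abstract]) AND (enzyme[Title/Abstract]) AND (degrade[Title/Abstract] OR degradation[Title/Abstract])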

Download all retrieved literature information


For Web of Science:

Website: https://www.webofscience.com/wos/woscc/advanced-search

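As an illustration only (not the tool's exact output), an equivalent advanced search in Web of Science could use the TS (topic) field tag:

TS=(mycotoxin AND enzyme AND (degrade OR degradation))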

You can also supplement the results with relevant literature from other databases such as Google Scholar, ScienceDirect, etc.

  2. Pipeline approach using ptol

To minimize manual operations, we created a homemade tool, ptol, which can be installed directly via pip install ptol. Once installed, you can view an initial introduction to its subroutines with the ptol -h or ptol --help command.


The query_syntax subroutine in ptol automatically generates all possible lexical variations of the user-provided keywords, the corresponding PubMed and Web of Science query syntaxes, and the matching download links.

Module name: query_syntax

Usage:

Enter the following command in the terminal to see help on using the program:

ptol query_syntax -h


All parameters and descriptions are listed below:

Parameter Description
-m Run mode. On the first run, use -m init to download the dictionary libraries; once downloaded, use -m run for subsequent runs.
-i Path to a file containing only keywords (one per line).
-o Path of the output file.

Input file format:

keyword 1

keyword 2

keyword 3

...

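For example, a keywords.txt for the example study mentioned above might contain the following lines (illustrative content only):

mycotoxin
enzyme
degrade
degradation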

Initialization for the first run

ptol query_syntax -m init -i keywords.txt -o my_result.txt


When the last line printed in the terminal reads "Successfully downloaded wordnet and other thesaurus!", the initialization has succeeded and you can proceed to use the program.

Example run:

ptol query_syntax -m run -i keywords.txt -o my_result.txt

The output file contains, for each keyword, the generated lexical variations together with the corresponding PubMed and Web of Science query syntaxes and download links.
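For readers curious how such a step could be implemented, below is a minimal Python sketch, not the actual ptol source, that expands keywords with NLTK's WordNet and assembles a PubMed-style query; the file names and the query format are assumptions for illustration.

# Minimal sketch of keyword expansion and query assembly (NOT the ptol implementation).
# Assumes NLTK is installed; the WordNet corpus is fetched once, similar to "-m init".
import urllib.parse

import nltk
from nltk.corpus import wordnet


def expand_keyword(keyword):
    """Return the keyword plus its WordNet lemma variants as lowercase search terms."""
    variants = {keyword.lower()}
    for synset in wordnet.synsets(keyword):
        for lemma in synset.lemmas():
            variants.add(lemma.name().replace("_", " ").lower())
    return variants


def build_pubmed_query(keywords):
    """AND the keywords together, OR-ing the lexical variations of each keyword."""
    groups = []
    for kw in keywords:
        terms = sorted(expand_keyword(kw))
        groups.append("(" + " OR ".join(t + "[Title/Abstract]" for t in terms) + ")")
    return " AND ".join(groups)


if __name__ == "__main__":
    nltk.download("wordnet", quiet=True)  # one-time initialization
    with open("keywords.txt", encoding="utf-8") as fh:
        keywords = [line.strip() for line in fh if line.strip()]
    query = build_pubmed_query(keywords)
    print(query)
    # Ready-to-open PubMed search URL for the generated query.
    print("https://pubmed.ncbi.nlm.nih.gov/?term=" + urllib.parse.quote(query))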

After that, use the query syntaxes and links given by this program to download the retrieved literature information directly from PubMed or Web of Science. You can refer to the corresponding steps in section 1 (Common approach).

Step 2

Consolidation of literature information

Literature records collected from different databases are combined into one file in MS Excel. Keep only the Title and DOI columns and save the result as an .xlsx file.

The file is then deduplicated using the remove_duplicates subroutine.

Module name: remove_duplicates

Usage:

Enter the following command in the terminal to see help on using the program:

ptol remove_duplicates -h


All parameters and descriptions are listed below:

Parameter Description
-i Path to the MS Excel file (.xlsx extension).
-o Path of the output file.

Example run:

ptol remove_duplicates -i all_database_literatures_data.xlsx -o all_database_literatures_data_single.txt

The output file contains the deduplicated list of titles and DOIs.
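As a rough illustration of what this step does, here is a minimal pandas sketch, not the actual ptol source; the column names Title and DOI and the file names follow the example above.

# Minimal sketch of duplicate removal (NOT the ptol implementation).
# Assumes an .xlsx file with "Title" and "DOI" columns, as prepared above,
# and that pandas plus openpyxl are installed.
import pandas as pd

df = pd.read_excel("all_database_literatures_data.xlsx")
# Normalize DOIs and titles so formatting differences do not hide duplicates.
df["DOI"] = df["DOI"].astype(str).str.strip().str.lower()
df["Title"] = df["Title"].astype(str).str.strip()
# Drop records sharing a DOI, then records sharing a title.
deduplicated = df.drop_duplicates(subset="DOI").drop_duplicates(subset="Title")
deduplicated.to_csv("all_database_literatures_data_single.txt", sep="\t", index=False)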

Step 3

Download the literature

Based on the full list of relevant literature obtained earlier, download a PDF of each publication.

Note: To obtain all of the above literature as quickly as possible, we suggest a one-time batch download using tools such as EndNote or a crawler. At all times, please respect the copyrights of the authors and publishers of the literature, i.e., acquire the target literature through legal channels.

Here, we provide the download_pdfs subroutine, which can batch-download literature in PDF format. It is provided for reference only.

Module name: download_pdfs

Usage:

Enter the following command in the terminal to see help on using the program:

ptol download_pdfs -h


Note: This subroutine is intended for testing by interested parties only. To comply with publishers' copyrights, please download literature from the official links of the publishers or purchase the target literature you need.
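As one way to stay within open-access channels, the sketch below, which is not the ptol implementation, looks up each DOI via the Unpaywall API and downloads a PDF only when an open-access copy is reported; the contact email, file names, and input format are placeholders.

# Minimal sketch: fetch open-access PDFs for a list of DOIs via the Unpaywall API.
# This is NOT the ptol download_pdfs implementation; it only retrieves copies that
# Unpaywall reports as legally open access. Assumes the requests package is installed.
import os
import requests

EMAIL = "you@example.com"  # placeholder; Unpaywall asks callers to identify themselves
DOI_FILE = "all_database_literatures_data_single.txt"  # assumed: Title<TAB>DOI per line
OUT_DIR = "literatures_pdf"

os.makedirs(OUT_DIR, exist_ok=True)
with open(DOI_FILE, encoding="utf-8") as fh:
    dois = [line.rstrip("\n").split("\t")[-1] for line in fh if "10." in line]

for doi in dois:
    meta = requests.get("https://api.unpaywall.org/v2/" + doi,
                        params={"email": EMAIL}, timeout=30)
    if meta.status_code != 200:
        continue
    location = meta.json().get("best_oa_location") or {}
    pdf_url = location.get("url_for_pdf")
    if not pdf_url:
        print("No open-access PDF found for " + doi)
        continue
    pdf = requests.get(pdf_url, timeout=60)
    if pdf.ok:
        with open(os.path.join(OUT_DIR, doi.replace("/", "_") + ".pdf"), "wb") as out:
            out.write(pdf.content)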

Step 4

Convert pdf documents to text files

After downloading all the documents (PDF), use the pdf_to_text subroutine to batch-convert them into text files.

Module name: pdf_to_text

Usage:

Enter the following command in the terminal to see help on using the program:

ptol pdf_to_text -h


All parameters and descriptions are listed below:

Parameter Description
-m Conversion method. The script provides four methods for converting PDF files into text files, numbered 1, 2, 3, and 4; choose whichever you prefer. Only one method can be set per run. This design lets you move any PDFs that fail to convert into a separate directory and retry them with another method.
-i Path to the folder that contains only PDF-formatted literature.
-o Path of the output folder; all successfully converted text files will be stored in this directory.

Example run:

ptol pdf_to_text -m 4 -i literatures_pdf -o literatures_text

You can then open any of the converted text files in the literatures_text folder to check the result.

Note: For failed conversions, the file name of the document is logged in the terminal, which makes it convenient for users to follow up.
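For reference, here is a minimal sketch of one possible conversion method using pdfminer.six; this may or may not match any of the four methods ptol actually implements, and the folder names follow the example run above.

# Minimal sketch of batch PDF-to-text conversion (one method only; NOT the ptol source).
# Assumes pdfminer.six is installed: pip install pdfminer.six
import os
from pdfminer.high_level import extract_text

IN_DIR = "literatures_pdf"
OUT_DIR = "literatures_text"
os.makedirs(OUT_DIR, exist_ok=True)

for name in os.listdir(IN_DIR):
    if not name.lower().endswith(".pdf"):
        continue
    try:
        text = extract_text(os.path.join(IN_DIR, name))
        with open(os.path.join(OUT_DIR, name[:-4] + ".txt"), "w", encoding="utf-8") as fh:
            fh.write(text)
    except Exception:
        # Log failed conversions so they can be moved aside and retried with another method.
        print("Conversion failed: " + name)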

Access to large language model tools

After that, following the process described in our article, prepare the research question manually, then copy and paste the text file into the input box of a large language model such as ChatGPT. In this way, information can be captured from the literature by large language models instead of manually.
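If you prefer to automate this step through an API rather than the web interface, the following sketch shows one possible way with the openai Python package; the model name, file path, and research question are placeholders, and this is not part of the original pipeline.

# Optional sketch: send a converted text file plus a research question to an LLM via API.
# Assumes the openai package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

question = "Which enzymes are reported to degrade mycotoxins?"  # placeholder question
with open("literatures_text/example_paper.txt", encoding="utf-8") as fh:
    paper_text = fh.read()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You extract information from scientific papers."},
        {"role": "user", "content": question + "\n\nPaper text:\n" + paper_text},
    ],
)
print(response.choices[0].message.content)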

Finally, we sincerely hope that this pipeline can accelerate your research process, and we wish you the best of luck in your research.
