- Download the PubLayNet dataset:

  `wget -O training_data_generation/PubLayNet_PDF.tar.gz https://dax-cdn.cdn.appdomain.cloud/dax-publaynet/1.0.0/PubLayNet_PDF.tar.gz`
- Unpack the PubLayNet dataset. The dataset should be located at `training_data_generation/publaynet/`.
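The download and unpack steps can also be scripted. A minimal Python sketch, assuming the helper names `download_archive` and `unpack_archive` (they are ours, not part of this repository):

```python
import tarfile
import urllib.request
from pathlib import Path

def download_archive(url, archive_path):
    """Stream a remote archive to disk (PubLayNet_PDF.tar.gz is very large)."""
    Path(archive_path).parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, archive_path)

def unpack_archive(archive_path, dest):
    """Extract a .tar.gz archive into the `dest` directory."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
```

If the archive contains a top-level `publaynet/` directory (an assumption about its layout), `unpack_archive("training_data_generation/PubLayNet_PDF.tar.gz", "training_data_generation")` leaves the dataset at the expected path.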
- Download the COCO 2017 dataset to use its content as random (non-chemical) images, using the Kaggle API (https://github.com/Kaggle/kaggle-api):

  `kaggle datasets download awsaf49/coco-2017-dataset`
- Unpack the COCO dataset. The images should be located at `training_data_generation/random_images/`. We used the images from the `train` subset.
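Moving the train-subset images into place can be sketched as follows. The function name is ours, and it assumes the unpacked zip holds the train images as `.jpg` files in a single directory; point it at wherever they ended up:

```python
import shutil
from pathlib import Path

def collect_random_images(coco_train_dir,
                          dest="training_data_generation/random_images"):
    """Copy COCO train images into random_images/; returns the number copied."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for img in sorted(Path(coco_train_dir).glob("*.jpg")):
        shutil.copy2(img, dest / img.name)
        copied += 1
    return copied
```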
- Download the SMILES list from https://zenodo.org/record/5155037#.Y6r-9HbMK38 and save it as `smiles.txt` in `training_data_generation/`.
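A quick sanity check that `smiles.txt` was saved correctly. This sketch assumes one SMILES string per line (an assumption about the Zenodo file's format); `load_smiles` is a hypothetical helper:

```python
from pathlib import Path

def load_smiles(path="training_data_generation/smiles.txt"):
    """Read one SMILES string per line, dropping blanks and stray whitespace."""
    text = Path(path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]
```

After the download, `len(load_smiles())` should report a large number of structures.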
forked from OBrink/chemical_page_segmentation_dataset
Steinbeck-Lab/chemical_page_segmentation_dataset: artificial data generation for DECIMER-Segmentation