Inference Accelerated PDF batch parsing #106

Open · wants to merge 77 commits into base: main

Commits (77)
3d5d93a
sync
veya2ztn Jul 23, 2024
8bbef33
add rough layout module
veya2ztn Jul 23, 2024
ed136d4
add batch processing
veya2ztn Jul 24, 2024
c334a66
test in S1
veya2ztn Jul 24, 2024
10ccb40
alignmented
veya2ztn Jul 24, 2024
d8dded3
reformat to match data group format
veya2ztn Jul 24, 2024
f655e6b
sync
veya2ztn Jul 24, 2024
2816698
mnerge
veya2ztn Jul 24, 2024
8681980
lets try another size assign
veya2ztn Jul 25, 2024
5037ed0
sync
veya2ztn Jul 25, 2024
426f17c
sync
veya2ztn Jul 29, 2024
d50097a
what new?
veya2ztn Jul 29, 2024
b195c46
merge
veya2ztn Jul 29, 2024
b40a558
add ocr batch processing
veya2ztn Jul 30, 2024
e727dfa
add pytorchocr
veya2ztn Jul 30, 2024
addccda
we pass the layout+mfd+det+rec test
veya2ztn Jul 31, 2024
1b60a5e
sync
veya2ztn Jul 31, 2024
a1cef0a
add weight
veya2ztn Jul 31, 2024
76cb7f1
sync
veya2ztn Aug 2, 2024
3f007e6
add rec batch model
veya2ztn Aug 3, 2024
1446e75
sync
veya2ztn Aug 3, 2024
49ddeba
lets use tensorRT
veya2ztn Aug 5, 2024
aeb23f6
try to convert layoutlmV3 to onnx and tensorrt
veya2ztn Aug 5, 2024
d422430
add sample image
veya2ztn Aug 5, 2024
5031624
our layoutLMv3 use CasadeROIalign which is not caplabe to detectron2 …
veya2ztn Aug 6, 2024
81e5a21
sync
veya2ztn Aug 7, 2024
00b04d9
lets seperate the postprocess
veya2ztn Aug 7, 2024
b22a308
it same use numpy backend faster
veya2ztn Aug 7, 2024
56d4ca6
lets load tensorrt MFD model
veya2ztn Aug 7, 2024
d20f0b0
we find the YOLO tensorRT has strict on the input size
veya2ztn Aug 7, 2024
ac2a33c
we find the YOLO tensorRT has strict on the input size
veya2ztn Aug 7, 2024
1162692
sync
veya2ztn Aug 7, 2024
9f767b2
Merge branch 'ONNX2TensorRT' of https://github.com/veya2ztn/PDF-Extra…
veya2ztn Aug 7, 2024
ff3cccf
pad batch to avoid tensorRT mdf wrong
veya2ztn Aug 7, 2024
f989a21
pass tensorRT and compiled model
veya2ztn Aug 7, 2024
a546387
sync
veya2ztn Aug 8, 2024
cd99977
check
veya2ztn Aug 8, 2024
827ca37
merge
veya2ztn Aug 8, 2024
c11d603
add async implement
veya2ztn Aug 8, 2024
ad3733c
add smart batch size judge
veya2ztn Aug 8, 2024
d195745
pass async
veya2ztn Aug 8, 2024
b3f9334
pass async
veya2ztn Aug 8, 2024
9573dfd
make det through inner batch size to avoid too large tensor
veya2ztn Aug 8, 2024
7fdd259
sync
veya2ztn Aug 8, 2024
afad9bd
before code may make the dataiter skip at final page.
veya2ztn Aug 9, 2024
159cc1a
we now made aync mode online
veya2ztn Aug 9, 2024
887f31c
add batch run script
veya2ztn Aug 12, 2024
6f96a89
lets do an async data loader into memory
veya2ztn Aug 13, 2024
0faba59
stepped rough rec. Now the bottleneck appear dataloader setup
veya2ztn Aug 13, 2024
2d7405f
stepped rough rec. Now the bottleneck appear dataloader setup
veya2ztn Aug 13, 2024
8127be2
Not iterableDataset
veya2ztn Aug 16, 2024
241fc4a
sync
veya2ztn Aug 18, 2024
cf2e549
we add no paddle depend workflow and pass on beijing server
veya2ztn Aug 18, 2024
1b30bf3
some script is server based and be removed
veya2ztn Aug 18, 2024
b0a1d93
add checklog
veya2ztn Aug 18, 2024
f8fc7f1
forget add no_paddle object
veya2ztn Aug 19, 2024
1a5d8bc
lets use reverse ssh port transfer the lock
veya2ztn Aug 22, 2024
f1efc04
we pass the faster rec inference framework
veya2ztn Aug 27, 2024
239a5fd
merge
veya2ztn Aug 27, 2024
9194031
fix a bug for image crop, for early version data like layoutV1, pleas…
veya2ztn Aug 27, 2024
ee54b1c
clean and reconstruct
veya2ztn Sep 2, 2024
bce8204
pass batchrun
veya2ztn Sep 2, 2024
6a0968e
merge
veya2ztn Sep 3, 2024
0759cc3
add README.md
veya2ztn Sep 3, 2024
2e78c93
update README
veya2ztn Sep 3, 2024
afabaff
add batch det
veya2ztn Sep 4, 2024
bb18105
add missing part fixing
veya2ztn Sep 12, 2024
b734ec8
sync
veya2ztn Sep 12, 2024
3221eb0
Merge branch 'ONNX2TensorRT' of https://github.com/veya2ztn/PDF-Extra…
veya2ztn Sep 12, 2024
fc5adc0
update rec, we forget consider exclude the mfd part
veya2ztn Sep 19, 2024
d3f5d4a
Merge branch 'ONNX2TensorRT' of https://github.com/veya2ztn/PDF-Extra…
veya2ztn Sep 19, 2024
35d28a9
add replace mode and full data save mode
veya2ztn Sep 23, 2024
ec9b1be
merge
veya2ztn Sep 23, 2024
e6cdab3
Merge branch 'ONNX2TensorRT' of https://github.com/veya2ztn/PDF-Extra…
veya2ztn Sep 23, 2024
1817e53
sync
veya2ztn Sep 25, 2024
73bae56
sync
veya2ztn Sep 25, 2024
7f510b0
sync
veya2ztn Sep 25, 2024
9 changes: 6 additions & 3 deletions .gitignore
@@ -1,17 +1,20 @@
*.ipynb*
*.ipynb

models
# local data
output/*
data/*
temp*
test*

weights
# python
.ipynb_checkpoints
*.ipynb
**/__pycache__/

*.filelist
*.jsonl
analysis
physics_collection
# logs
*.log
*.out
91 changes: 91 additions & 0 deletions batch_running_task/README.md
@@ -0,0 +1,91 @@
# Inference Accelerated PDF parsing
This folder includes a series of inference-acceleration modules for the original PDF parsing pipeline, including:
- preprocessing integrated into the dataloader
- fast postprocessing
- torch.compile + bf16
- TensorRT
- [Torch-TensorRT](https://pytorch.org/TensorRT/)

These engines were tested on an 80,000,000-PDF dataset and achieve a 5-10x speedup compared with the original PDF parsing engine. Roughly, this reaches 6-10 pages per second on a single A100 GPU.

This is not a pipeline framework; it is separated into three task-wise batch processing engines, which can easily be integrated into your own pipeline framework:
- Detection (Bounding Boxing)
- Recognition (OCR)
- Math formula recognition (MFR)

## Detection (Bounding Boxing)
Check the unit case: 1,000 PDFs take around 20-30 min
```
python batch_running_task/task_layout/rough_layout.py
```
### LayoutLM
LayoutLM is built on `detectron2`. The main vision engine (ViT) is implemented via Hugging Face, while the postprocessing is based on detectron2.
There is a TensorRT version of the detectron2 model (https://github.com/NVIDIA/TensorRT/tree/main/samples/python/detectron2), but it only covers the Mask R-CNN backbone.
The TensorRT authors manually developed CUDA NMS and ROIAlign plugins and a `DET2GraphSurgeon` tool (see https://github.com/NVIDIA/TensorRT/blob/main/samples/python/detectron2/create_onnx.py) to convert the detectron2 model into a TensorRT engine.
For LayoutLM, there is no such tool to convert the whole model into a TensorRT engine.
There are several ways to accelerate the LayoutLM model:
- accelerate it part by part, e.g. a TensorRT ViT backbone combined with detectron2 ROIAlign and NMS
- use `torch.compile`
- use bf16

In this repo, I use `torch.compile` (1.5x) and bf16 (2x) to accelerate the LayoutLM model. The TensorRT version is not implemented yet.
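As a rough illustration of how these two pieces combine (a minimal sketch assuming a CUDA device; the toy model below is a stand-in, not the repo's LayoutLM predictor):
```
import torch
import torch.nn as nn

# Compile the forward pass once and run inference under bf16 autocast.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 8, 3, padding=1))
model = model.eval().cuda()
model = torch.compile(model)   # graph compilation (~1.5x in this repo)

images = torch.randn(4, 3, 1024, 1024, device="cuda")
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(images)    # bf16 autocast (~2x in this repo)
```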

Another way to accelerate LayoutLM is to avoid calling `.numpy()` on large GPU tensors. The original code uses
```
boxes = outputs["instances"].to("cpu")._fields["pred_boxes"].tensor.numpy()
labels = outputs["instances"].to("cpu")._fields["pred_classes"].numpy()
scores = outputs["instances"].to("cpu")._fields["scores"].numpy()
```
This copies the full tensors from GPU to CPU, which is unnecessary since we later only gather part of the data via a `mask`.
The better way is to do the slicing on the GPU tensor and then copy only the sliced tensor to the CPU (2x speedup; see batch_running_task/task_layout/get_batch_layout_model.py).
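A hedged sketch of that GPU-side filtering (field names follow the snippet above; the 0.5 score threshold is illustrative, the repo's actual mask lives in get_batch_layout_model.py):
```
instances = outputs["instances"]                  # keep everything on the GPU
keep = instances.scores > 0.5                     # boolean mask computed on GPU (threshold is illustrative)
boxes  = instances.pred_boxes.tensor[keep].cpu().numpy()   # copy only the kept rows
labels = instances.pred_classes[keep].cpu().numpy()
scores = instances.scores[keep].cpu().numpy()
```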


### MFD
MFD (Math Formula Detection) is a simple YOLO model built with `ultralytics`, which has a good TensorRT conversion tool chain; see https://docs.ultralytics.com/modes/export/ and convension/MDF/convert.py.
Download the engine via `huggingface-cli download --resume-download --local-dir-use-symlinks False LLM4SCIENCE/ultralytics-YOLO-MFD --local-dir models/MFD`. If you want to use the `trt_engine` directly, the `batchsize` and TensorRT version (`10.3.0`) must match.
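If you would rather rebuild the engine yourself, the ultralytics export API sketched below is the usual route; the checkpoint path, image size, and batch size here are placeholders, and the local TensorRT install must match the version the engine will run against (10.3.0 here):
```
from ultralytics import YOLO

model = YOLO("models/MFD/weights/mfd.pt")   # hypothetical local checkpoint path
# Export a fixed-batch TensorRT engine; the batch size must match what the runtime feeds it.
model.export(format="engine", imgsz=1888, batch=16, half=True)
```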

### PaddleOCR-Det
PaddleOCR-Det is the best text detector around, but the original Paddle det only supports one image per batch. In our detection task every image is normalized to the same size, so the original Paddle det does not fit our task. Refer to https://github.com/WenmuZhou/PytorchOCR: Zhou has converted PaddleOCR into PyTorch, which lets us run batched detection in PyTorch.

There is still a big speedup opportunity in the postprocessing of the PaddleOCR-Det module. Currently we use DB postprocessing (see https://github.com/PaddlePaddle/PaddleOCR/blob/main/ppocr/postprocess/db_postprocess.py), which is the slow part compared to the rest of the detection process. At the moment there is no speedup solution for the DB postprocessing.

### Detection Async (experimental)
See `batch_running_task/task_layout/rough_layout_with_aync.py`
The async detection overlaps postprocessing with GPU inference, and it works perfectly. However, on a Slurm system there is an `exit` error when the script terminates, which can leave your machine in a CPU soft lock. So I do not recommend using this script on Slurm.
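The gist of the overlap, as a minimal sketch with a single worker thread (this is not the repo's rough_layout_with_aync.py, just the shape of the idea):
```
from concurrent.futures import ThreadPoolExecutor

def run_async(batches, gpu_infer, cpu_postprocess):
    """Keep the GPU busy on batch N+1 while a thread postprocesses batch N."""
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for batch in batches:
            raw = gpu_infer(batch)                       # GPU inference for this batch
            if pending is not None:
                results.append(pending.result())         # collect the previous postprocess result
            pending = pool.submit(cpu_postprocess, raw)  # overlap CPU postprocess with the next batch
        if pending is not None:
            results.append(pending.result())
    return results
```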

## Recognition (OCR)
Check the unit case: 1,000 PDFs take around 2-5 min
```
python batch_running_task/task_rec/rough_rec.py
```
PaddleOCR-Rec is the best text recognizer around. The original Paddle rec supports batched image processing, and the original PaddleOCR is already very fast.
However, you can see I still use `PytorchOCR` in this part, simply to provide a Paddle-free solution.
Download the engine via `huggingface-cli download --resume-download --local-dir-use-symlinks False LLM4SCIENCE/pytorch_paddle_weight --local-dir models/pytorch_paddle_weight`. If you want to use the `trt_engine` directly, the `batchsize` and TensorRT version (`10.3.0`) must match.


## Math formula recognition (MFR)
Check the unit case: 1,000 PDFs take around 2-5 min
```
python batch_running_task/task_mfr/rough_mfr.py
```
The MFR model is a `nougat`-based model named `UniMERNet`. I tried to use the Hugging Face TensorRT conversion tool chain to convert the model to TensorRT, but it failed (the reshape module is not set properly). One way forward is to use `TensorRT-LLM`; see https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal and `convension/unimernet`.
- Notice `TensorRT-LLM` will by default install `mpi4py==4.*`, which requires `libmpi.so.40`. `conda install -c conda-forge openmpi` can only provide `openmpi==3.*`, so you need to build `openmpi` from source, or just `pip install mpi4py==3.*`.
- Notice you should use `srun --mpi=pmi2` when running the script in Slurm.

Download the engine via `huggingface-cli download --resume-download --local-dir-use-symlinks False LLM4SCIENCE/unimernet --local-dir models/MFR/unimernet`. If you want to use the `trt_engine` directly, the `batchsize` and TensorRT version (`10.3.0`) must match.

The difference between `LLM4SCIENCE/unimernet` and `wanderkid/unimernet` is that we delete the `counting` module from the weight file (it is only used during training), so it is a pure nougat model.


## Batch run the task
Each task has a `batch_deal_with_xxx` module which automatically schedules the task. For example, you can prepare a `.jsonl` file named `test.filelist` where each line is
```
{"track_id":"e8824f5a-9fcb-4ee5-b2d4-6bf2c67019dc","path":"10.1017/cbo9780511770425.012.pdf","file_type":"pdf","content_type":"application/pdf","content_length":80078,"title":"German Idealism and the Concept of Punishment || Conclusion","remark":{"file_id":"cbo9780511770425.012","file_source_type":"paper","original_file_id":"10.1017/cbo9780511770425.012","file_name":"10.1017/cbo9780511770425.012.pdf","author":"Merle, Jean-Christophe"}}
{"track_id":"64d182ba-21bf-478f-bb65-6a276aab3f4d","path":"10.1111/j.1365-2559.2006.02442.x.pdf","file_type":"pdf","content_type":"application/pdf","content_length":493629,"title":"Sensitivity and specificity of immunohistochemical antibodies used to distinguish between benign and malignant pleural disease: a systematic review of published reports","remark":{"file_id":"j.1365-2559.2006.02442.x","file_source_type":"paper","original_file_id":"10.1111/j.1365-2559.2006.02442.x","file_name":"10.1111/j.1365-2559.2006.02442.x.pdf","author":"J King; N Thatcher; C Pickering; P Hasleton"}}
```
and then run
```
python batch_running_task/task_layout/batch_deal_with_layout.py --root test.filelist
python batch_running_task/task_layout/batch_deal_with_rec.py --root test.filelist
python batch_running_task/task_layout/batch_deal_with_mfr.py --root test.filelist
```
36 changes: 36 additions & 0 deletions batch_running_task/batch_run.sh
@@ -0,0 +1,36 @@

TOTALNUM=30
CPU_NUM=$1  # number of jobs to launch, taken from the first argument (defaults to TOTALNUM)
if [ -z "$CPU_NUM" ]; then
    CPU_NUM=$TOTALNUM
fi

# Check the hostname: if it starts with SH, use the SH-cluster partition and GCC toolchain
if [[ $(hostname) == SH* ]]; then
    PARA="--quotatype=spot -p AI4Chem -N1 -c8 --gres=gpu:1"
    export LD_LIBRARY_PATH=/mnt/cache/share/gcc/gcc-7.5.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
    export PATH=/mnt/cache/share/gcc/gcc-7.5.0/bin:$PATH
else
    PARA="-p vip_gpu_ailab_low -N1 -c8 --gres=gpu:1"
fi
SCRIPT="batch_running_task/task_rec/run_rec.sh"
FILELIST="physics_collection/wait_for_ocr.filelist"

START=0
for ((CPU=0; CPU<CPU_NUM; CPU++));
do
    #sbatch --quotatype=spot -p AI4Chem -N1 -c8 --gres=gpu:1 run.sh sci_index_files.addon.filelist $(($CPU+$START)) $TOTALNUM
    #sbatch --quotatype=spot -p AI4Chem -N1 -c8 --gres=gpu:1 run_mfr.sh physics_collection/sci_index_files.remain.filelist 0 1
    sbatch $PARA $SCRIPT $FILELIST $(($CPU+$START)) $TOTALNUM
    #sbatch --quotatype=spot -p AI4Chem -N1 -c8 --gres=gpu:1 physics_collection/sci_index_files.finished.filelist $(($CPU+$START)) $TOTALNUM
    #sbatch --quotatype=spot -p AI4Chem -N1 -c8 --gres=gpu:1 batch_running_task/task_layout/run_layout_for_missing_page.sh physics_collection/analysis/not_complete_pdf_page_id.pairlist.remain.filelist $(($CPU+$START)) $TOTALNUM
    ## sleep 20s after every 10 submitted jobs
    if [ $(($CPU % 10)) -eq 9 ]; then
        sleep 20
    fi
done
114 changes: 114 additions & 0 deletions batch_running_task/batch_run_utils.py
@@ -0,0 +1,114 @@
from tqdm.auto import tqdm
from multiprocessing import Pool
import numpy as np
import argparse
import json
import os
from dataclasses import dataclass
from typing import List


@dataclass
class BatchModeConfig:
    task_name = 'temp'  # class-level label; task-specific configs override it
    root_path : str

    index_part : int = 0
    num_parts : int = 1
    datapath : str = None
    savepath : str = None
    logpath : str = "analysis"
    batch_num : int = 0
    redo : bool = False
    shuffle: bool = False
    verbose: bool = False
    ray_nodes: List[int] = None
    debug: bool = False

    @staticmethod
    def from_dict(kargs):
        return BatchModeConfig(**kargs)

    def to_dict(self):
        return self.__dict__


def process_files(func, file_list, args: BatchModeConfig):
    num_processes = args.batch_num
    if num_processes == 0:
        # serial mode: run the task function in the current process
        results = []
        for arxivpath in tqdm(file_list, desc="Main Loop:"):
            results.append(func((arxivpath, args)))
        return results
    else:
        # parallel mode: distribute (file, config) tuples over a process pool
        with Pool(processes=num_processes) as pool:
            args_list = [(file, args) for file in file_list]
            results = list(tqdm(pool.imap(func, args_list), total=len(file_list), desc="Main Loop:"))
        return results


def obtain_processed_filelist(args: BatchModeConfig, alread_processing_file_list=None):
    ROOT_PATH = args.root_path
    index_part = args.index_part
    num_parts = args.num_parts
    if alread_processing_file_list is None:
        if ROOT_PATH.endswith('.json'):
            with open(ROOT_PATH, 'r') as f:
                alread_processing_file_list = json.load(f)
        elif os.path.isfile(ROOT_PATH):
            if ROOT_PATH.endswith('.filelist'):
                with open(ROOT_PATH, 'r') as f:
                    alread_processing_file_list = [t.strip() for t in f.readlines()]
            elif ROOT_PATH.endswith('.arxivids'):
                with open(ROOT_PATH, 'r') as f:
                    alread_processing_file_list = [os.path.join(args.datapath, t.strip()) for t in f.readlines()]
            else:
                alread_processing_file_list = [ROOT_PATH]
        elif os.path.isdir(ROOT_PATH):
            ### process every file under this folder
            alread_processing_file_list = os.listdir(ROOT_PATH)
            alread_processing_file_list = [os.path.join(ROOT_PATH, t) for t in alread_processing_file_list]
        else:
            ### directly use the arxivid as input
            alread_processing_file_list = [os.path.join(args.datapath, ROOT_PATH)]

    totally_paper_num = len(alread_processing_file_list)
    if totally_paper_num > 1:
        divided_nums = np.linspace(0, totally_paper_num, num_parts + 1)
        divided_nums = [int(s) for s in divided_nums]
        start_index = divided_nums[index_part]
        end_index = divided_nums[index_part + 1]
    else:
        start_index = 0
        end_index = 1
    args.start_index = start_index
    args.end_index = end_index
    if args.shuffle:
        np.random.shuffle(alread_processing_file_list)
    alread_processing_file_list = alread_processing_file_list[start_index:end_index]

    return alread_processing_file_list


def save_analysis(analysis, debug, args):
    logpath = os.path.join(args.logpath, args.task_name)
    print(logpath)
    os.makedirs(logpath, exist_ok=True)
    if args.num_parts > 1:
        for key, val in analysis.items():
            print(f"{key}=>{len(val)}")
            fold = os.path.join(logpath, f"{key.lower()}.filelist.split")
            os.makedirs(fold, exist_ok=True)
            with open(os.path.join(fold, f"{args.start_index}-{args.end_index}"), 'w') as f:
                for line in val:
                    f.write(line + '\n')
    else:
        for key, val in analysis.items():
            print(f"{key}=>{len(val)}")
            if debug:
                print(val)
            else:
                with open(os.path.join(logpath, f"{key.lower()}.filelist"), 'w') as f:
                    for line in val:
                        f.write(line + '\n')
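A hypothetical driver showing how these helpers are meant to compose; the import path and `handle_one_pdf` are illustrative, not actual repo code:
```
from batch_run_utils import BatchModeConfig, obtain_processed_filelist, process_files, save_analysis

def handle_one_pdf(payload):
    path, args = payload          # process_files always passes (file, config) tuples
    # ... parse one PDF here ...
    return ("pass", path)

if __name__ == "__main__":
    config = BatchModeConfig(root_path="test.filelist", index_part=0, num_parts=4, batch_num=8)
    files = obtain_processed_filelist(config)               # slice this worker's share of the filelist
    results = process_files(handle_one_pdf, files, config)  # pool of 8 workers since batch_num=8
    analysis = {"pass": [p for status, p in results if status == "pass"]}
    save_analysis(analysis, debug=config.debug, args=config)
```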

35 changes: 35 additions & 0 deletions batch_running_task/check_log.sh
@@ -0,0 +1,35 @@

for file in .log/*;
do
    ### skip if it is not a file
    [ -f "$file" ] || continue
    ## if the third line from the end contains the string `is not`, delete this file
    if [ "$(tail -n 3 "$file"|head -n 1|grep -c 'is not')" -eq 1 ]; then
        echo "Deleting $file"
        rm -f "$file"
    fi
done

user=`whoami`
jobname='ParseSciHUB'

runningPID=`squeue -u $user -n $jobname | awk '{print $1}'`
for log_file in .log/*;
do
    ### skip if it is not a file
    [ -f "$log_file" ] || continue
    ## get the job ID from log_file; the naming rule is $PID-ParseSciHUB.out
    PID=$(echo $log_file|awk -F'/' '{print $2}'|awk -F'-' '{print $1}')
    ## if the PID is not among the running jobs, delete this file
    if [ "$(echo $runningPID|grep -c $PID)" -eq 0 ]; then
        #echo "Deleting $log_file"
        rm -f "$log_file"
    else
        #line=$(tail -n 30 "$log_file"|grep Data|tail -n 1| sed 's/\x1B\[A//g'| tr -d '\r')
        line=$(tail -n 1000 "$log_file"|grep "Images batch"|tail -n 1| sed 's/\x1B\[A//g'| tr -d '\r')
        #line=$(tail -n 1000 "$log_file"|grep "[Data]"|tail -n 1| sed 's/\x1B\[A//g'| tr -d '\r')
        echo $log_file $line
        #grep Error "$log_file"
    fi
done
#echo "$output"
89 changes: 89 additions & 0 deletions batch_running_task/dataaccelerate.py
@@ -0,0 +1,89 @@
#pip install prefetch_generator

# DataLoaderX: a DataLoader that prefetches batches in a background thread
from torch.utils.data import DataLoader
import numpy as np
import torch

def sendall2gpu(listinlist, device):
    # Recursively move (nested) lists/tuples/dicts of tensors to the target device
    if isinstance(listinlist, (list, tuple)):
        return [sendall2gpu(_list, device) for _list in listinlist]
    elif isinstance(listinlist, (dict)):
        return dict([(key, sendall2gpu(val, device)) for key, val in listinlist.items()])
    elif isinstance(listinlist, np.ndarray):
        return listinlist
    else:
        return listinlist.to(device=device, non_blocking=True)

try:
    from prefetch_generator import BackgroundGenerator
    class DataLoaderX(DataLoader):
        def __iter__(self):
            return BackgroundGenerator(super().__iter__())
except ImportError:
    pass  # DataLoaderX = DataLoader

class DataSimfetcher():
    def __init__(self, loader, device='auto'):
        if device == 'auto':
            self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        else:
            self.device = device
        self.loader = iter(loader)

    def next(self):
        try:
            self.batch = next(self.loader)
            self.batch = sendall2gpu(self.batch, self.device)
        except StopIteration:
            self.batch = None
        return self.batch

class DataPrefetcher():
    def __init__(self, loader, device='auto'):
        if device == 'auto':
            self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        else:
            self.device = device
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        # With Amp, it isn't necessary to manually convert data to half.
        # if args.fp16:
        #     self.mean = self.mean.half()
        #     self.std = self.std.half()
        self.preload()

    def preload(self):
        try:
            self.batch = next(self.loader)
        except StopIteration:
            self.batch = None
            return
        with torch.cuda.stream(self.stream):
            # Copy the next batch to the GPU on a side stream so it overlaps with compute
            self.batch = sendall2gpu(self.batch, self.device)
            # With Amp, it isn't necessary to manually convert data to half.
            # if args.fp16:
            #     self.next_input = self.next_input.half()
            # else:
            #     self.next_input = self.next_input.float()

    def next(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.batch
        self.preload()
        return batch

class infinite_batcher:
    def __init__(self, data_loader, device='auto'):
        self.length = len(data_loader)
        self.now = -1
        self.data_loader = data_loader
        self.prefetcher = None
        self.device = device

    def next(self):
        # Restart the fetcher whenever the underlying loader is exhausted
        if (self.now >= self.length) or (self.now == -1):
            if self.prefetcher is not None:
                del self.prefetcher
            self.prefetcher = DataSimfetcher(self.data_loader, device=self.device)
            self.now = 0
        self.now += 1
        return self.prefetcher.next()
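A minimal usage sketch, assuming the module is importable as `dataaccelerate`: drain a DataLoader through `DataSimfetcher` so each batch is moved to the target device as it arrives.
```
import torch
from torch.utils.data import DataLoader, TensorDataset
from dataaccelerate import DataSimfetcher   # import path is an assumption

loader = DataLoader(TensorDataset(torch.randn(64, 3, 32, 32)), batch_size=8)
fetcher = DataSimfetcher(loader, device='auto')   # 'auto' picks cuda:0 when available
while True:
    batch = fetcher.next()
    if batch is None:        # StopIteration was swallowed inside the fetcher
        break
    # ... run inference on batch ...
```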
