Uses pdfminer, a Python library, to extract key concepts from several books.
Install the required packages before running the program.
Required Python version: Python 3
pip install -r requirement.txt
Before you run the program, you need to:
(1) Put all books in ./data/corpus/ . Use a clear file name for each book: bookname.pdf.
(2) Put all defined concept pair files in ./data/concepts/ , with each file named concepts_bookname.xlsx.
Then write your conf file in ./conf, using ./conf/task_conf.yaml and ./conf/books_info.yaml as references.
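The naming convention above can be checked before a run. The following is a minimal sketch, not part of this repository: the helper name missing_concept_files is hypothetical, and it assumes only the layout described above (bookname.pdf in ./data/corpus/ and concepts_bookname.xlsx in ./data/concepts/).

```python
from pathlib import Path


def missing_concept_files(corpus_dir="data/corpus", concepts_dir="data/concepts"):
    """Return the names of books in corpus_dir that lack concepts_<bookname>.xlsx.

    Hypothetical helper: it only checks file names, not file contents.
    """
    corpus = Path(corpus_dir)
    concepts = Path(concepts_dir)
    missing = []
    # Each book is expected to be stored as <bookname>.pdf.
    for pdf in sorted(corpus.glob("*.pdf")):
        expected = concepts / f"concepts_{pdf.stem}.xlsx"
        if not expected.is_file():
            missing.append(pdf.stem)
    return missing
```

Called from the repository root, missing_concept_files() returns an empty list when every book has a matching concept file, and otherwise the list of book names that still need one.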
When everything is ready, run:
python main.py --conf_path conf/task_{your}_conf.yaml
# for example
python main.py --conf_path conf/task_conf.yaml
The output will be:
./data/concepts/all_concepts.csv records the concept-concept_idx relation.
./data/concepts/book_chapter_ids.csv records the chapter-chapter_idx relation.
./data/concepts_page_nums/all_words_info.csv has the format: (book_idx,course_idx,fre)
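As an illustration of working with the (book_idx,course_idx,fre) format, here is a minimal sketch that sums fre per book. It is not part of this repository: the function name total_frequency_per_book is hypothetical, and it assumes that fre is an integer frequency count, that the columns appear in the order listed, and that any header row can be skipped because its third field is not numeric.

```python
import csv
from collections import defaultdict


def total_frequency_per_book(csv_path="data/concepts_page_nums/all_words_info.csv"):
    """Sum the fre column for each book_idx (assumed column order: book_idx, course_idx, fre)."""
    totals = defaultdict(int)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 3:
                continue  # skip blank or incomplete rows
            book_idx, _course_idx, fre = row[:3]
            try:
                value = int(fre)
            except ValueError:
                # Assumed to be a header row or a non-numeric value; skip it.
                continue
            totals[book_idx] += value
    return dict(totals)
```

The result maps each book_idx (kept as a string) to the sum of its fre values.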
You can use this tool to convert the output data into a more readable format.
cd tools
python get_word_stat_info.py --word_idx xxx
# or
python get_word_stat_info.py --word