TextbookExtraction

introduction

Uses pdfminer, a Python library, to extract key concepts from several books.
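As a rough illustration of the idea, the sketch below counts how often each concept phrase appears on each page of a PDF. It is not the code in this repository: it assumes the pdfminer.six distribution (`extract_pages`, `LTTextContainer`), and the function names `count_concepts` and `concepts_per_page` are hypothetical.

```python
import re
from collections import Counter


def count_concepts(page_text, concepts):
    """Count case-insensitive whole-phrase occurrences of each concept in one page."""
    counts = Counter()
    for concept in concepts:
        pattern = r"\b" + re.escape(concept) + r"\b"
        hits = len(re.findall(pattern, page_text, flags=re.IGNORECASE))
        if hits:
            counts[concept] = hits
    return counts


def concepts_per_page(pdf_path, concepts):
    """Yield (page_number, Counter) for every page of a PDF."""
    # Assumption: pdfminer.six is installed. Imported lazily so that
    # count_concepts stays usable on plain text without it.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, layout in enumerate(extract_pages(pdf_path), start=1):
        # Join the text of every text container found on the page.
        text = "".join(
            element.get_text()
            for element in layout
            if isinstance(element, LTTextContainer)
        )
        yield page_number, count_concepts(text, concepts)
```

Keeping the counting step separate from the PDF parsing step makes it easy to try the matching logic on plain strings before running it over whole books.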

how to use

Install the required packages first.

Required Python version: Python 3

pip install -r requirement.txt

Before you run the program, you need to:

(1) put all books in ./data/corpus/, preferably with a clear file name such as bookname.pdf.

(2) put all concept pair files in ./data/concepts/, each named concepts_bookname.xlsx.

Then write your conf file in ./conf, using ./conf/task_conf.yaml and ./conf/books_info.yaml as references.
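The file layout from (1) and (2) can be sanity-checked before a run. The sketch below is not part of the repository: `check_layout` is a hypothetical helper, and it assumes every bookname.pdf in ./data/corpus/ is expected to have a matching concepts_bookname.xlsx in ./data/concepts/.

```python
from pathlib import Path


def check_layout(data_dir="./data"):
    """Return the book names in corpus/ that lack a concepts_<bookname>.xlsx file."""
    data_dir = Path(data_dir)
    corpus_dir = data_dir / "corpus"
    concepts_dir = data_dir / "concepts"

    missing = []
    for pdf in sorted(corpus_dir.glob("*.pdf")):
        # Assumed naming rule: bookname.pdf pairs with concepts_bookname.xlsx.
        expected = concepts_dir / f"concepts_{pdf.stem}.xlsx"
        if not expected.is_file():
            missing.append(pdf.stem)
    return missing


if __name__ == "__main__":
    for name in check_layout():
        print(f"no concept file found for book: {name}")
```

An empty result means every book has a concept file under the assumed naming rule.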

When everything is ready, run:

python main.py --conf_path conf/task_{your}_conf.yaml
# for example
python main.py --conf_path conf/task_conf.yaml

The output will be:

./data/concepts/all_concepts.csv records the concept to concept_idx mapping.

./data/concepts/book_chapter_ids.csv records the chapter to chapter_idx mapping.

./data/concepts_page_nums/all_words_info.csv stores rows in the format (book_idx,course_idx,fre).
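As one way to inspect all_words_info.csv, the sketch below sums the frequency column per book. It rests on assumptions the README does not confirm: that each row holds exactly three integer fields in the order (book_idx,course_idx,fre), and that any header line can simply be skipped. `total_frequency_by_book` is a hypothetical name.

```python
import csv
from collections import defaultdict


def total_frequency_by_book(csv_path):
    """Sum the fre column per book_idx, assuming rows of (book_idx, course_idx, fre)."""
    totals = defaultdict(int)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            try:
                book_idx, _course_idx, fre = (int(value) for value in row)
            except ValueError:
                # Skips a header line or any row that is not exactly three integers.
                continue
            totals[book_idx] += fre
    return dict(totals)


if __name__ == "__main__":
    path = "./data/concepts_page_nums/all_words_info.csv"
    for book_idx, total in sorted(total_frequency_by_book(path).items()):
        print(book_idx, total)
```

If the real file carries extra columns, the row unpacking would need to be adjusted to match.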

tools

You can use this tool to transform the output data into a more readable format.

cd tools
python get_word_stat_info.py --word_idx xxx
# or
python get_word_stat_info.py --word

LalZzy/textbook_extraction

Folders and files

Latest commit

History

Repository files navigation

TextbookExtraction

introduction

how to use

tools

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages