This project is part of my graduation project. It crawls structured information about papers from digital libraries.
A profile spider will be released soon.
- ACM (done; supports Digital Library search results)
- IEEE (in development; supports single pages)
- arXiv (done; supports all categories)
- AAAI (done; supports the 2009-2019 AAAI conferences)
Keywords: Python, Scrapy, MySQL, Papers
- Python 3.6
- MySQL 8.0.17
- scrapy
- selenium
- PhantomJS (optional; only needed for IEEE_Spider)
- scrapy_proxies
- pymysql
- twisted
- fake_useragent
You can execute papers.sql to initialize the database.
- MYSQL_DBNAME = 'papers'
- TABLE_NAME = {'ACM_Data', 'IEEE_Data', 'arXiv_Data'}
attribute | data_type | length | not NULL |
---|---|---|---|
p_id | int | 0 | ✅(key) |
title | varchar | 255 | |
authors | varchar | 2047 | |
year | varchar | 255 | |
type | varchar | 255 | |
subjects | varchar | 255 | |
url | varchar | 255 | |
abstract | varchar | 4095 | |
citation | int | 0 | |
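
As a rough illustration of how a record could be shaped to fit this schema before it is written, the sketch below mirrors the columns in a dataclass and builds a parameterized INSERT statement. The names `PaperRecord`, `truncate_to_schema`, and `build_insert` are hypothetical and not part of the project; the `%s` placeholders assume the statement is executed through pymysql.

```python
from dataclasses import dataclass, astuple, fields, replace
from typing import Optional


@dataclass
class PaperRecord:
    """One row of a *_Data table; field names follow the schema table."""
    p_id: int
    title: Optional[str] = None
    authors: Optional[str] = None
    year: Optional[str] = None
    type: Optional[str] = None
    subjects: Optional[str] = None
    url: Optional[str] = None
    abstract: Optional[str] = None
    citation: Optional[int] = None


# varchar lengths copied from the schema table.
MAX_LENGTHS = {
    "title": 255, "authors": 2047, "year": 255, "type": 255,
    "subjects": 255, "url": 255, "abstract": 4095,
}

# Table names listed in TABLE_NAME.
ALLOWED_TABLES = {"ACM_Data", "IEEE_Data", "arXiv_Data"}


def truncate_to_schema(record):
    """Return a copy whose string fields fit the varchar lengths."""
    updates = {}
    for name, limit in MAX_LENGTHS.items():
        value = getattr(record, name)
        if value is not None and len(value) > limit:
            updates[name] = value[:limit]
    return replace(record, **updates)


def build_insert(table, record):
    """Build (sql, params) for a parameterized INSERT into a known table."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    columns = [f.name for f in fields(record)]
    column_list = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO `{table}` ({column_list}) VALUES ({placeholders})"
    return sql, astuple(record)
```

The resulting pair can be passed to a cursor as `cursor.execute(sql, params)`, which keeps values out of the SQL string itself.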
- A script runs automatically to fetch free proxies (HTTP only). It will be integrated into the Scrapy-based main program soon.
- Every request is sent with a randomly chosen proxy and User-Agent.
- Data can be stored as a TXT file, as raw JSON (not strictly valid JSON), or in MySQL.
- Optional level-based logging is available.
- The MySQL pipeline stores data asynchronously, so the program stays efficient and reliable when it encounters a flood of data from a spider.
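
The per-request proxy and User-Agent rotation described above can be sketched as a class with the `process_request` hook that Scrapy downloader middlewares use. This is a minimal sketch, not the project's actual middleware: the class name is hypothetical, and plain lists stand in for the proxies and user agents that the project obtains through scrapy_proxies and fake_useragent.

```python
import random


class RandomProxyUserAgentMiddleware:
    """Hypothetical middleware: picks a random proxy and User-Agent per request."""

    def __init__(self, proxies, user_agents, rng=None):
        if not user_agents:
            raise ValueError("at least one User-Agent string is required")
        self.proxies = list(proxies)
        self.user_agents = list(user_agents)
        # An injectable random generator makes the choice reproducible in tests.
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Set a fresh User-Agent header on every outgoing request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta.
        if self.proxies:
            request.meta["proxy"] = self.rng.choice(self.proxies)
        # Returning None lets Scrapy continue processing the request.
        return None
```

To take effect, a class like this would have to be registered in `DOWNLOADER_MIDDLEWARES` in the project settings.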
Before you launch Scrapy, customize the settings first. To run IEEE_Spider, you also need to enable the JS middleware based on selenium and PhantomJS.
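
A settings fragment along the following lines shows the kind of customization meant here. It is only a sketch: `MYSQL_DBNAME` comes from this README and `LOG_LEVEL`, `DOWNLOADER_MIDDLEWARES`, and `ITEM_PIPELINES` are standard Scrapy settings, but the other `MYSQL_*` names, the `paper_spider` package, and every class path are assumptions that must be replaced with the names used in the actual project.

```python
# settings.py (fragment); package and class names below are hypothetical.

# Database connection. Only MYSQL_DBNAME is taken from the README;
# the remaining keys and values are assumed for illustration.
MYSQL_HOST = "localhost"
MYSQL_PORT = 3306
MYSQL_DBNAME = "papers"
MYSQL_USER = "root"
MYSQL_PASSWORD = ""

# Standard Scrapy log level: DEBUG, INFO, WARNING, ERROR, or CRITICAL.
LOG_LEVEL = "INFO"

# Lower numbers run closer to the engine, higher numbers closer to the downloader.
DOWNLOADER_MIDDLEWARES = {
    "paper_spider.middlewares.RandomProxyUserAgentMiddleware": 400,
    # Enable only when running IEEE_Spider (needs selenium and PhantomJS):
    # "paper_spider.middlewares.JSMiddleware": 543,
}

# Item pipelines run in ascending order of their number.
ITEM_PIPELINES = {
    "paper_spider.pipelines.MySQLAsyncPipeline": 300,
}
```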
In a terminal:
scrapy crawl ACM_Spider
or
scrapy crawl IEEE_Spider
etc.
- IEEE Spider (The HTML is JS-dynamic.)
- arXiv (easy)
- Proxy Downloader Integration
- MongoDB Storage
- More Robust XPath Rules
- UUID for Database
- Crawl Specific Pages
- arXiv_Spider gets no search results when it sends too many requests.
- The pipeline encounters a MySQL error.
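
If the empty arXiv results come from rate limiting on the server side (an assumption, since the cause is not confirmed here), slowing the spider down may help. The sketch below collects standard Scrapy throttling settings in a dict. The name `ARXIV_THROTTLE_SETTINGS` and all values are illustrative and untuned.

```python
# Hypothetical per-spider overrides; the values are starting points only.
ARXIV_THROTTLE_SETTINGS = {
    # Wait about 3 seconds between requests to the same site.
    "DOWNLOAD_DELAY": 3.0,
    # Vary the delay between 0.5x and 1.5x of DOWNLOAD_DELAY.
    "RANDOMIZE_DOWNLOAD_DELAY": True,
    # Send one request at a time to each domain.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    # Let Scrapy adapt the delay to the observed response latency.
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 3.0,
    "AUTOTHROTTLE_MAX_DELAY": 60.0,
    # Retry failed requests a few more times than the default.
    "RETRY_TIMES": 5,
}
```

A dict like this can be assigned to the `custom_settings` class attribute of the spider, so that only arXiv_Spider is slowed down.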