ACM-IEEE-arXiv Info Spider (Developing)

This project is part of my graduation project. It crawls structured information about papers from digital libraries.

A profile spider will be released soon.

Supported Libraries

ACM (Done, supports Digital Library search results)
IEEE (Developing, supports single pages)
arXiv (Done, supports all categories)
AAAI (Done, supports the 2009-2019 AAAI conferences)

Keywords: Python, Scrapy, MySQL, Papers

Dependencies & Requirements

Python 3.6
MySQL 8.0.17
scrapy
selenium
PhantomJS (optional, only needed for IEEE_Spider)
scrapy_proxies
pymysql
twisted
fake_useragent

Data Structure of Database

You can execute papers.sql to initialize the database.

MYSQL_DBNAME = 'papers'
TABLE_NAME = {'ACM_Data', 'IEEE_Data', 'arXiv_Data'}

attribute	data_type	length	not NULL
p_id	int	0	✅(key)
title	varchar	255
authors	varchar	2047
year	varchar	255
type	varchar	255
subjects	varchar	255
url	varchar	255
abstract	varchar	4095
citation	int	0
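
To keep scraped items within the column sizes above, a small helper can trim string fields before they reach the database. This is only a sketch: `COLUMN_LIMITS` and `fit_to_schema` are hypothetical names that mirror the table, not code from this repository, and `papers.sql` remains the authoritative schema. `p_id` is left out on the assumption that the database assigns it.

```python
# Maximum lengths of the varchar columns, copied from the table above.
COLUMN_LIMITS = {
    "title": 255,
    "authors": 2047,
    "year": 255,
    "type": 255,
    "subjects": 255,
    "url": 255,
    "abstract": 4095,
}


def fit_to_schema(item):
    """Return a new dict whose values fit the table's columns.

    String fields are truncated to their varchar length, `citation` is
    coerced to int (None when missing or unparseable), and keys that are
    not columns of the table are dropped.
    """
    row = {}
    for column, limit in COLUMN_LIMITS.items():
        value = item.get(column)
        row[column] = None if value is None else str(value)[:limit]

    citation = item.get("citation")
    try:
        row["citation"] = None if citation is None else int(citation)
    except (TypeError, ValueError):
        row["citation"] = None
    return row
```

Truncating silently is a design choice made for brevity here. Logging or rejecting oversized values may be preferable if losing the tail of an abstract matters.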

Features

A script runs automatically to fetch free proxies (HTTP only). It will be integrated into the Scrapy-based main program soon.
Every request is sent with a randomly chosen proxy and user agent.
Data can be stored in a TXT file, as raw JSON (not strictly valid JSON), or in MySQL.
Optional level-based logging is available.
The MySQL pipeline stores data asynchronously, so the program stays efficient and reliable when the spider produces a flood of items.
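
One common way to get asynchronous MySQL storage with the listed dependencies is Twisted's `adbapi` connection pool on top of `pymysql`. The sketch below shows that approach and is not necessarily how this repository's pipeline is written. `AsyncMySQLPipeline`, `build_insert`, and the `MYSQL_HOST`, `MYSQL_USER` and `MYSQL_PASSWORD` setting names are assumptions made for illustration. Only `MYSQL_DBNAME` and the table names come from this README.

```python
# Columns of the data tables, excluding the key p_id.
COLUMNS = ("title", "authors", "year", "type",
           "subjects", "url", "abstract", "citation")


def build_insert(table, item):
    """Return a parameterized INSERT statement and its values for one item."""
    sql = "INSERT INTO `{}` ({}) VALUES ({})".format(
        table,
        ", ".join("`{}`".format(column) for column in COLUMNS),
        ", ".join(["%s"] * len(COLUMNS)),
    )
    return sql, tuple(item.get(column) for column in COLUMNS)


class AsyncMySQLPipeline:
    """Sketch of an item pipeline that inserts rows without blocking the crawl."""

    def __init__(self, db_settings, table="arXiv_Data"):
        self.db_settings = db_settings
        self.table = table
        self.dbpool = None

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls({
            "host": settings.get("MYSQL_HOST", "localhost"),  # hypothetical name
            "user": settings.get("MYSQL_USER"),                # hypothetical name
            "password": settings.get("MYSQL_PASSWORD"),        # hypothetical name
            "database": settings.get("MYSQL_DBNAME", "papers"),
            "charset": "utf8mb4",
        })

    def open_spider(self, spider):
        # Imported here so that build_insert can be used without Twisted.
        from twisted.enterprise import adbapi
        # adbapi runs the blocking pymysql calls in a thread pool.
        self.dbpool = adbapi.ConnectionPool("pymysql", **self.db_settings)

    def close_spider(self, spider):
        if self.dbpool is not None:
            self.dbpool.close()

    def process_item(self, item, spider):
        sql, params = build_insert(self.table, dict(item))
        # runInteraction commits on success and rolls back on error.
        deferred = self.dbpool.runInteraction(
            lambda cursor: cursor.execute(sql, params))
        deferred.addErrback(
            lambda failure: spider.logger.error("MySQL insert failed: %s", failure))
        return item  # hand the item on without waiting for the insert
```

Because `process_item` returns immediately, a slow or failing insert does not stall the spider. The errback only logs the failure, so rows that fail to insert are lost unless they are retried elsewhere.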

Install & Run

Before launching Scrapy, customize the settings. To run IEEE_Spider, you also need to add the JS middleware based on selenium and PhantomJS.

In a terminal:

scrapy crawl ACM_Spider

or

scrapy crawl IEEE_Spider

etc.
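
Part of customizing the settings is registering downloader middlewares, for example one that attaches a random proxy and user agent to each request. The following is a hand-rolled sketch of that idea, not the code this project uses (the dependency list suggests `scrapy_proxies` and `fake_useragent` handle this here). The class name, the `PROXY_LIST_FILE` and `USER_AGENT_LIST` settings, and the `acaSpider.middlewares` module path are all assumptions.

```python
import random


class RandomProxyUserAgentMiddleware:
    """Downloader middleware that picks a proxy and user agent per request."""

    def __init__(self, proxies, user_agents):
        self.proxies = list(proxies)
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST_FILE and USER_AGENT_LIST are hypothetical setting names.
        # Each line of the file is assumed to be a full proxy URL,
        # such as http://host:port.
        path = crawler.settings.get("PROXY_LIST_FILE", "proxy_list.txt")
        with open(path) as handle:
            proxies = [line.strip() for line in handle if line.strip()]
        return cls(proxies, crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.proxies:
            # Scrapy's built-in HttpProxyMiddleware reads this meta key.
            request.meta["proxy"] = random.choice(self.proxies)
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let Scrapy continue processing the request


# In settings.py (the module path is an assumption):
DOWNLOADER_MIDDLEWARES = {
    "acaSpider.middlewares.RandomProxyUserAgentMiddleware": 543,
}
```

The priority 543 places the middleware before Scrapy's built-in `HttpProxyMiddleware`, so the chosen proxy is already in `request.meta` when that middleware runs.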

Development in Progress

IEEE Spider (The HTML is JS-dynamic.)
arXiv (easy)
Proxy Downloader Integration
MongoDB Storage
More Robust XPath Rules
UUID for Database
Crawl Specific Pages

Bugs Found (Ask for help)

arXiv_Spider returns no results when too many requests are sent.
The pipeline encounters a MySQL error.
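
If the empty arXiv results are caused by rate limiting (an assumption, not something confirmed here), slowing the crawl down may help. The fragment below uses standard Scrapy settings. The values are illustrative and have not been tuned against arXiv.

```python
# settings.py fragment: illustrative values, not tuned for arXiv.
DOWNLOAD_DELAY = 3                    # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True       # vary the delay between 0.5x and 1.5x
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # one request at a time per domain

AUTOTHROTTLE_ENABLED = True           # adapt the delay to observed latency
AUTOTHROTTLE_START_DELAY = 3          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60           # upper bound when responses are slow

RETRY_ENABLED = True
RETRY_TIMES = 3                       # retries per failed request
```

The same keys can be placed in a spider's `custom_settings` dictionary to limit the slowdown to arXiv_Spider.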



xyjigsaw/ACM-IEEE-arXiv-Spider
