A spider written using Scrapy that crawls the dblp website to extract data about authors, such as their co-authors' names, the communities to which they belong, and the articles they have published. The extracted data is then used to build a co-authorship network graph.
"Siddhartha Anand" "Partha Basuchowdhuri"
"Siddhartha Anand" "Khusbu Mishra"
...
The above example is an edge list (author_one -> author_two). Each edge connects two co-authors who have worked on a paper together, so the edge list as a whole describes a co-authorship graph.
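A minimal sketch of loading such an edge list into a graph, assuming it has been saved to a file named coauthors.txt (an illustrative name) and using the networkx package:

# Build the co-authorship graph from the edge list.
# 'coauthors.txt' is an assumed filename; adjust it to your output.
import csv
import networkx as nx

G = nx.Graph()
with open("coauthors.txt") as f:
    # Each line holds two quoted author names separated by a space.
    for author_one, author_two in csv.reader(f, delimiter=" "):
        G.add_edge(author_one, author_two)

print(G.number_of_nodes(), "authors,", G.number_of_edges(), "co-authorships")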
Simply run the following command:
$ git clone https://github.com/SiddharthaAnand/dblp-spider.git
This will clone this repository to your local system.
Make sure you are in the working directory of dblp-spider. Then run the following command:
$ scrapy crawl dblpspider [-o <filename>]
This will start the spider, which sends requests asynchronously, receives the responses, and stores the extracted output (the '-o' flag writes the output to the filename you provide).
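Internally, the crawl follows Scrapy's standard spider pattern. The snippet below is an illustrative sketch, not this project's actual spider; the start URL and CSS selector are assumptions, so see the spider module in this repository for the real extraction logic:

import scrapy

class DblpSpider(scrapy.Spider):
    name = "dblpspider"
    start_urls = ["https://dblp.org/"]  # hypothetical starting point

    def parse(self, response):
        # Scrapy calls parse() with each downloaded response; every dict
        # yielded here becomes one item in the '-o' output file.
        yield {
            "author_name": response.css("h1::text").get(),
        }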
You can store the extracted data in different file formats, thanks to Scrapy's built-in feed exports. You can store the file in .csv, .json, or .jl (JSON Lines) format. A .jl file holds one JSON object per line, which makes it easier to stream and to append to during a crawl than a single .json array.
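For example (the output filenames here are illustrative; Scrapy infers the format from the file extension):

$ scrapy crawl dblpspider -o dblp_data.csv
$ scrapy crawl dblpspider -o dblp_data.json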
I have used .jl just as an example:
$ scrapy crawl dblpspider -o dblp_data.jl
Below is sample data that you might get after the crawl is over, shown pretty-printed for readability. You can optionally use the built-in json package to pretty-print the contents yourself; see the snippet after the sample.
$ head dblp_data.jl
{
"author_articles_published": [
"Spanning tree-based fast community detection methods in social networks."
],
"author_name": "Siddhartha Anand",
"coauthor_communities_list": [
"show coauthor community: group 1",
"show coauthor community: group 1",
"show coauthor community: group 1",
"show coauthor community: group 1",
"show coauthor community: group 1"
],
"coauthors_name_list": [
"Partha Basuchowdhuri",
"Subhashis Majumder",
"Riya Roy",
"Sanjoy Kumar Saha",
"Diksha Roy Srivastava"
]
}
...
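As mentioned above, Python's built-in json package can pretty-print the raw .jl output. A minimal sketch, assuming the crawl was run with '-o dblp_data.jl' as shown earlier:

# Pretty-print each JSON object in the .jl output file.
import json

with open("dblp_data.jl") as f:
    for line in f:
        record = json.loads(line)  # one JSON object per line
        print(json.dumps(record, indent=4, sort_keys=True))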
This project is licensed under the Apache License; see LICENSE.md for more details.
- Add a NoSQL database to store the extracted data
- Deploy the spider on a server for large-scale crawls
- Extract more data from dblp
- Visualize the data using a visualization tool
Contributions and suggestions are always welcome. Feel free to modify the spider and extract even more data from dblp.