pv-812 edited this page Apr 16, 2015 · 11 revisions

Welcome to the MailingListParser wiki!

Goal -

The main objective of developing a mailing list parser is to extract information from mailing lists, such as senders and receivers. This information is then used to construct an organizational (or communication) structure using hypergraphs and to perform analysis on it.

Details -

1. The parser is Python-based and makes use of the web scraping tool Scrapy.

2. The archive of mail threads of the Linux-RT Users mailing list is used as the dataset.

Update 1 -

1. Read up on the documentation of Scrapy.

2. Built a crawler to extract the links to all mail threads from the index of mails.
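The crawler itself is built with Scrapy, but the core link-extraction step can be sketched with the standard library alone. The HTML snippet and the `msg*.html` naming pattern below are hypothetical stand-ins for the real archive's index page, not taken from the project:

```python
from html.parser import HTMLParser

class ThreadLinkExtractor(HTMLParser):
    """Collects href values of anchor tags from a mail index page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Assumed pattern: each mail thread page ends in .html
                if name == "href" and value and value.endswith(".html"):
                    self.links.append(value)

# Hypothetical index page; the real archive's markup differs.
index_html = """
<ul>
  <li><a href="msg00001.html">[RT] scheduling latency</a></li>
  <li><a href="msg00002.html">Re: [RT] scheduling latency</a></li>
</ul>
"""

extractor = ThreadLinkExtractor()
extractor.feed(index_html)
print(extractor.links)  # the two msg*.html links above
```

In the actual spider, the same extraction is handled by Scrapy's selectors; this sketch only shows the idea of walking an index page and collecting thread links.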

Update 2 -

1. Storage of scraped data: looked into NoSQL databases for storing the scraped data. Of the many NoSQL databases, Redis and Cassandra were considered as possible candidates.

Redis :

a. Data is stored in terms of key-value pair.

b. Disk-backed in-memory database

c. Master-slave architecture with replication

d. Provides different data structures such as strings, sets, lists, etc. for storage of data

e. Provides transactions
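A minimal sketch of how one scraped mail could be laid out in Redis's key-value model, here simulated with a plain dict so no server is needed; with the redis-py client the same shape would map to `HSET` and `SADD` calls. The key names and field names are illustrative assumptions, not the project's actual schema:

```python
# Simulating Redis's key-value layout with a plain dict (no server needed).
# With redis-py this would be r.hset("mail:42", ...) and r.sadd("sent:...", ...).
store = {}

def store_mail(store, mail_id, sender, receivers, subject):
    # One hash per mail: key "mail:<id>" -> field/value pairs.
    store[f"mail:{mail_id}"] = {
        "sender": sender,
        "receivers": ",".join(receivers),
        "subject": subject,
    }
    # One set per sender, for quick lookup of everything they wrote.
    store.setdefault(f"sent:{sender}", set()).add(mail_id)

store_mail(store, "42", "alice@example.com", ["bob@example.com"], "[RT] latency")
print(store["mail:42"]["sender"])  # alice@example.com
```

Note how the single-key design shows the limitation discussed below: looking up mails by anything other than the chosen key requires maintaining extra index structures by hand.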



Cassandra :

a. Column-family database

b. Provides querying capabilities similar to those of relational databases; uses CQL3 for querying

c. No master-slave architecture; all nodes are considered equal

d. Writes can be faster than reads, when reads are disk bound.

e. Map/reduce possible with Apache Hadoop.
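For comparison, a hypothetical CQL3 table for the same mail record might look like the following; the table and column names are illustrative, not taken from the project:

```sql
-- Hypothetical column-family layout for scraped mails (CQL3).
CREATE TABLE mails (
    mail_id     text PRIMARY KEY,
    sender      text,
    receivers   list<text>,
    subject     text,
    in_reply_to text
);
```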

Both databases have Python APIs.

While Redis offers better performance because updates are done in memory, we cannot set more than one attribute as a key and hence cannot perform multi-key operations. Also, the dataset size is limited by the size of the machine's RAM. Considering the above points, I concluded that Cassandra would be used for storing the scraped data. However, this idea was dropped on further consideration, as most of Cassandra's processing utility would not be used, making it just a temporary store. It was finally decided to store the scraped data in plain files.
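The file-based storage that was finally chosen can be sketched as one JSON file per mail. The output directory, file naming, and record fields below are assumptions for illustration, not the project's actual layout:

```python
import json
from pathlib import Path

def save_mail(out_dir, mail_id, record):
    """Write one scraped mail as a JSON file named after its id."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{mail_id}.json"  # assumed naming scheme
    path.write_text(json.dumps(record, indent=2))
    return path

record = {"sender": "alice@example.com", "subject": "[RT] latency"}
saved = save_mail("scraped_mails", "42", record)
print(saved.name)  # 42.json
```

One file per mail keeps the store trivially inspectable and avoids running a database process just to hold intermediate scrape output.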

2. Changed the mailing list archive from the Linux Real Time Users archive to the Linux Kernel Mailing List archive, as certain important fields, such as the reference number for each mail, were not available in the previous archive.

3. Built a crawler to extract data from the mailing list archive. Unable to retrieve all of the data due to network problems.
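Network failures like these are commonly mitigated with a retry loop around each fetch. This is a generic sketch of that idea (demonstrated with a stub fetcher so no network is needed), not the project's actual fix; Scrapy also has its own built-in retry handling:

```python
import time

def with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on network errors with a fixed delay."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(url)
        except OSError as err:  # network errors surface as OSError subclasses
            last_error = err
            time.sleep(delay)
    raise last_error

# Demonstration with a stub that fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connection reset")
    return f"<html>{url}</html>"

print(with_retries(flaky_fetch, "msg00001.html", delay=0))  # <html>msg00001.html</html>
```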