Welcome to the MailingListParser wiki!

Goal -

The main objective of developing a mailing list parser is to extract information from a mailing list, such as the sender and receivers of each mail. This information is then used to construct an organizational (or communication) structure using hypergraphs and to perform analysis on it.
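As a rough illustration of the intended structure, the sketch below represents each mail as a hyperedge connecting its sender and receivers. The function and field names are illustrative assumptions, not the project's actual code.

```python
from collections import defaultdict

def build_hypergraph(mails):
    """Builds a communication hypergraph: each mail is a hyperedge
    connecting its sender and receivers (illustrative sketch)."""
    hyperedges = {}               # mail id -> set of participants
    incident = defaultdict(set)   # participant -> set of mail ids
    for mail in mails:
        participants = {mail['sender'], *mail['receivers']}
        hyperedges[mail['id']] = participants
        for person in participants:
            incident[person].add(mail['id'])
    return hyperedges, incident

# Two made-up mails with overlapping participants.
mails = [
    {'id': 'm1', 'sender': 'alice', 'receivers': ['bob', 'carol']},
    {'id': 'm2', 'sender': 'bob', 'receivers': ['alice']},
]
edges, incident = build_hypergraph(mails)
print(incident['alice'])  # {'m1', 'm2'}
```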

Details -

1. The parser is Python-based and makes use of the web scraping framework Scrapy.

2. The archive of mail threads of the LINUX-RT Users mailing list is used as the dataset.

Update 1 -

1. Read up on the Scrapy documentation.

2. Built a crawler to extract the links to all mail threads from the index of mails; a minimal sketch follows.
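The sketch below assumes the index page links to each thread through ordinary `<a href>` anchors; the start URL and CSS selector are placeholders, not the archive's real markup.

```python
import scrapy

class ThreadLinkSpider(scrapy.Spider):
    """Collects links to individual mail threads from an archive index page."""
    name = 'thread_links'
    # Placeholder index URL; the real archive's index page would go here.
    start_urls = ['http://www.example.org/mailing-list/index.html']

    def parse(self, response):
        # Treat every anchor on the index page as a candidate thread link.
        for href in response.css('a::attr(href)').getall():
            yield {'thread_url': response.urljoin(href)}
```

It can be run standalone with `scrapy runspider thread_links.py -o links.json`, which writes the collected links to a JSON file.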

Update 2 -

1. Storage of scraped data: Looked into NoSQL databases for storing the scraped data. Of the many NoSQL databases, Redis and Cassandra were considered as possible choices.

   Cassandra (a column-family database):
   - Provides querying capabilities similar to those of relational databases, using CQL3.
   - Has no master/slave architecture; all nodes are considered equal.
   - Writes can be faster than reads when reads are disk-bound.
   - Supports map/reduce through Apache Hadoop.
   While Redis offers better performance because updates are done in-memory, only a single attribute can be set as a key, so multi-key operations are not possible, and the dataset size is limited by the machine's RAM. Considering the above points, I concluded that Cassandra would be used for storing the scraped data. However, this idea was dropped on further consideration, as most of Cassandra's processing utility would go unused, making it just a temporary store. It was finally decided to store the scraped data in plain files, as in the pipeline sketch below.
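A minimal sketch of the plain-file approach, using Scrapy's standard item-pipeline hooks to write one JSON object per line; the file name and item shape are assumptions:

```python
import json

class JsonLinesWriterPipeline:
    """Writes each scraped mail as one line of JSON in a plain file."""

    def open_spider(self, spider):
        self.file = open('scraped_mails.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item
```

The pipeline would be enabled through the ITEM_PIPELINES setting in the project's settings.py.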

2. Changed the mailing list archive from the Linux Real Time Users archive to the Linux Kernel Mailing List archive, as certain important fields, such as the reference number of each mail, were not available in the previous archive (a header-parsing sketch follows).
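The sketch below illustrates why these reference fields matter for threading: with standard Message-ID and References headers, a reply can be linked back to the mail it answers. The raw message here is made up for illustration.

```python
from email import message_from_string

# Illustrative raw mail carrying the headers needed for threading.
raw = (
    'From: alice@example.org\n'
    'To: linux-kernel@vger.kernel.org\n'
    'Message-ID: <msg-2@example.org>\n'
    'References: <msg-1@example.org>\n'
    'Subject: Re: scheduler question\n'
    '\n'
    'Reply body here.\n'
)

msg = message_from_string(raw)
parent_ids = msg.get('References', '').split()
print(msg['Message-ID'], '->', parent_ids)  # ties the reply to its parent mail
```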

3. Built a crawler to extract data from the mailing list archive. Unable to retrieve all of the data due to network problems; one possible mitigation is sketched below.
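The network problems could perhaps be softened by making the crawl more tolerant of flaky connections through Scrapy's built-in retry and throttling settings. The values below are illustrative and would go in the project's settings.py.

```python
# Retry failed requests instead of dropping them.
RETRY_ENABLED = True
RETRY_TIMES = 5

# Allow slow responses before timing out, and slow the crawl down
# so the archive server is not overwhelmed.
DOWNLOAD_TIMEOUT = 60
DOWNLOAD_DELAY = 0.5
AUTOTHROTTLE_ENABLED = True
```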