Refactoring Suggestions

General Refactoring Ideas:

Use os.path.join() function to create paths in the drivers.
lib/analysis/user.py has keyword_clusters() function. The code is incorrect; it needs to be improved.
lib/slack/analysis/network.py has a message_number_graph() function which is copied from lib/analysis/network.py, but the code fails for slack logs. We need to fix this problem.
The reader.py incorrectly assumes that all months have 31 days; the loops in the reader.py (lines-53 to 62) also impose arithmetic relation on dates. It is much better to use a calendar / date module which has abstractions like next day, next month and next year.
The lib.vis.matplotlob_csv_heatmap_generator() function must be renamed to lib.vis.matplotlib_csv_heatmap_generator(). The corresponding tests must also change.
Move config.py to the directory of user drivers (top level for now). See issue #178.
Remove plot.ly credentials from config.py. See issue #187.
Move util functions from channel.py to util.py. See issue #176.
degree_analysis_on_graph() function has date parameter. The parameter remains unused and can be removed.
Remove percentile cutoff from functions of channel.py. That way, we can complete analysis for all users and then apply different statistical cutoffs as filters (see issue #175).
Port correlation analysis from Ubuntu drivers to lib/. See issue #182.
Modify the heatmap function to work with flexible time range. For now, it only works with days. See issue #151.
The lib.analysis.network.channel_user_presence_graph_and_csv() function is 270 lines long. It needs to be refactored.
The lib.analysis.channel.conv_len_conv_refr_time() function is 152 lines long. It needs to be refactored.
An inner function lib.analysis.network.compare_spliced_nick() calls an outer function lib.analysis.network.nick_receiver_from_conn_comp(). The order of invocation is incorrect.
Need efficient data structure for storing keywords in memory and in files.
Use loggers and move the debug messages away from terminal to files, see issue #74.
Remove repetition of work done by various functions across modules and make the functions composable. See issue #177,
Use better data structures (may be radix tree) for nicknames, keywords, chat logs.
Right now we consider input lines as unstructured text. If we create a structure for each line as {timestamp, sender, receiver, message}, we might be able to index and process the input must faster.
Use the Python's map-reduce functions to process the input very quickly. We can also improve the in-memory representation of the logs using memory mapped files or iter tools. See issue #78.
Use concurrency to reduce the analysis time. We can use python multiprocessing module for introducing concurrency into the project. (see issue #174)
Functional programming concepts of Python can be used to simplify the lot of inner functions. See overview, tutorials - 1, 2
Use profiler to get time and memory performance of different functions. See issue #67 and issue #171.
Use use config object (dictionary) instead of config.py. See issue #84.
Remove specification of artificial upper bounds on the lists. See issue #40.
The repository size has grown beyond 100MB, mainly because of increase in the size of test data files. We need to shrink the repository
Use pajek software for hierarchical community detection and visualization of large scale networks.
Migrate all the plotting requirements from plotly to d3.js, see issue #108 and issue #170.
We need a polymorphic graph reduction function, see issue #179.
Refactor nicktracker.py. nicks_same_list is initialised to list of list of size max_number_of_diff_nicks. also consider writing the code in a way that a common script could take care of the three programs.
Reduce the input data set sizes for all the tests. This is important because the test/ directory is already 145MB in size.

Performance Bottlenecks

The code base has been run on slackware for the month of Jan 2013. Here are a few run time and memory consumption numbers.

Conversation Characteristics

Date Range	Run Time	Virtual Memory	Resident Memory
3 days (1 to 3 Jan, 2013)	9 sec	640MB	150MB
13 days (1 to 13 Jan, 2013)	133 sec	730MB	240MB
31 days (1 to 31 Jan, 2013)	600 sec	640MB	180MB
186 days (1-Jan to 30-Jun, 2013)	does not finish in 4 hrs	640MB	180MB

Lesson: solve issue-175 and simplify both channel.conv_len_conv_refr_time() and channel.response_time() functions.

Message Exchange Graph

In the following table graph is a short notation for message exchange graph.

Date Range	graph stats	generate graph	save graph using dot	conversation characteristics
3 days (1 to 3 Jan, 2013)	66 users, 494 messages	10 sec	0.8 sec	9 sec
13 days (1 to 13 Jan, 2013)	195 users, 2833 messages	114 sec	243 sec	133 sec
31 days (1 to 31 Jan, 2013)	337 users, 6999 messages	215 sec	> 1 hr	600 sec
186 days (1-Jan to 30-Jun, 2013)	768 users, 26812 messages	870 sec	not tried	does not finish in 4 hrs

generate graph is done using network.message_number_graph() function.
save graph using dot command is done using saver.draw_nx_graph() function.
Both these functions involve drawing large graphs; that's why they are slower.

Plot of Infomaps Communities

The performance of vis.plot_infomap_igraph() function. (memory consumption remains stable at 640MB of virtual memory for all cases)

Date Range	edge weight cutoff	graph size	execution time
5 days (1 to 5 Jan, 2013)	0	95 users, 960 messages	0.4 sec
5 days (1 to 5 Jan, 2013)	10	18 users, 245 messages	0.1 sec
5 days (1 to 5 Jan, 2013)	20	2 users, 20 messages	0.05 sec
13 days (1 to 13 Jan, 2013)	0	195 users, 2833 messages	1 sec
13 days (1 to 13 Jan, 2013)	10	30 users, 1059 messages	0.12 sec
13 days (1 to 13 Jan, 2013)	20	9 users, 464 messages	0.07 sec
31 days (1 to 31 Jan, 2013)	0	337 users, 6999 messages	does not finish in 50 minutes
31 days (1 to 31 Jan, 2013)	10	61 users, 3145 messages	0.24 sec
31 days (1 to 31 Jan, 2013)	20	27 users, 1915 messages	0.11 sec

This function uses igraph.plot() which is the likely culprit for the delay.

HITS Algorithm

The performance of network.identify_hubs_and_experts() function.
(memory consumption remains stable at 840MB of virtual memory for all cases)
This function internally makes calls to network.message_number_graph() and networkx.hits() functions. The plotting of graph generated by HITS is not part of this function.

Date Range	edge weight cutoff	graph size	execution time	generate graph
31 days (1 to 31 Jan, 2013)	10	61 users, 3145 messages	480 sec	213 sec
186 days (1-Jan to 30-Jun, 2013)	0	768 users, 4707 messages	2428 sec
186 days (1-Jan to 30-Jun, 2013)	10	147 users, 588 messages	2428 sec	870 sec
186 days (1-Jan to 30-Jun, 2013)	20	74 users, 244 messages	2428 sec	sec

Even after accounting for the time taken to generate the graph, HITS is an exponential algorithm. The running time of identify_hubs_and_experts() is also independent of the message cutoffs. This is because of the way the function is written. This function computes the complete message exchange graph each time it is called and then as a last step, truncates all the edges below a certain weight. It is better to pass a message exchange graph to this function and ask it to compute HITS graph.

Specific Refactoring Ideas

Reader

Rather than loading all log data and storing it in raw format, we can preprocess the raw data into a meaningful format that helps reduce the runtime of analysis. 1a. The general format and interesting data present in one line of log data is <sender's nick> <receiver's nick> Generalising the format we can represent the log data as -
- <sender's nick> <receiver's nick, (optional)>
- type can be login message, logout message, channel exit message, a general message (message sent from one user with a specific receiver) , a conversation message (message directed to specific user)

Data Structure: dict with keys as token type. Present work prototype: https://github.com/kaushik-rohit/irclog-mirror

Message Exchange Graph

Line 75 - 79 are useless and can be moved outside the for loop.

Home

Research Docs

Research Outline

Developer Docs

Miscellaneous

References

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactoring Suggestions

General Refactoring Ideas:

Performance Bottlenecks

Conversation Characteristics

Message Exchange Graph

Plot of Infomaps Communities

HITS Algorithm

Specific Refactoring Ideas

Reader

Message Exchange Graph

Research Docs

Developer Docs

Miscellaneous

Clone this wiki locally