Skip to content

Refactoring Suggestions

Prasad Talasila edited this page Apr 23, 2018 · 39 revisions

General Refactoring Ideas:

  1. Use os.path.join() function to create paths in the drivers.
  2. lib/analysis/user.py has keyword_clusters() function. The code is incorrect; it needs to be improved.
  3. lib/slack/analysis/network.py has a message_number_graph() function which is copied from lib/analysis/network.py, but the code fails for slack logs. We need to fix this problem.
  4. The reader.py incorrectly assumes that all months have 31 days; the loops in the reader.py (lines-53 to 62) also impose arithmetic relation on dates. It is much better to use a calendar / date module which has abstractions like next day, next month and next year.
  5. The lib.vis.matplotlob_csv_heatmap_generator() function must be renamed to lib.vis.matplotlib_csv_heatmap_generator(). The corresponding tests must also change.
  6. Move config.py to the directory of user drivers (top level for now). See issue #178.
  7. Remove plot.ly credentials from config.py. See issue #187.
  8. Move util functions from channel.py to util.py. See issue #176.
  9. degree_analysis_on_graph() function has date parameter. The parameter remains unused and can be removed.
  10. Remove percentile cutoff from functions of channel.py. That way, we can complete analysis for all users and then apply different statistical cutoffs as filters (see issue #175).
  11. Port correlation analysis from Ubuntu drivers to lib/. See issue #182.
  12. Modify the heatmap function to work with flexible time range. For now, it only works with days. See issue #151.
  13. The lib.analysis.network.channel_user_presence_graph_and_csv() function is 270 lines long. It needs to be refactored.
  14. The lib.analysis.channel.conv_len_conv_refr_time() function is 152 lines long. It needs to be refactored.
  15. An inner function lib.analysis.network.compare_spliced_nick() calls an outer function lib.analysis.network.nick_receiver_from_conn_comp(). The order of invocation is incorrect.
  16. Need efficient data structure for storing keywords in memory and in files.
  17. Use loggers and move the debug messages away from terminal to files, see issue #74.
  18. Remove repetition of work done by various functions across modules and make the functions composable. See issue #177,
  19. Use better data structures (may be radix tree) for nicknames, keywords, chat logs.
  20. Right now we consider input lines as unstructured text. If we create a structure for each line as {timestamp, sender, receiver, message}, we might be able to index and process the input must faster.
  21. Use the Python's map-reduce functions to process the input very quickly. We can also improve the in-memory representation of the logs using memory mapped files or iter tools. See issue #78.
  22. Use concurrency to reduce the analysis time. We can use python multiprocessing module for introducing concurrency into the project. (see issue #174)
  23. Functional programming concepts of Python can be used to simplify the lot of inner functions. See overview, tutorials - 1, 2
  24. Use profiler to get time and memory performance of different functions. See issue #67 and issue #171.
  25. Use use config object (dictionary) instead of config.py. See issue #84.
  26. Remove specification of artificial upper bounds on the lists. See issue #40.
  27. The repository size has grown beyond 100MB, mainly because of increase in the size of test data files. We need to shrink the repository
  28. Use pajek software for hierarchical community detection and visualization of large scale networks.
  29. Migrate all the plotting requirements from plotly to d3.js, see issue #108 and issue #170.
  30. We need a polymorphic graph reduction function, see issue #179.
  31. Refactor nicktracker.py. nicks_same_list is initialised to list of list of size max_number_of_diff_nicks. also consider writing the code in a way that a common script could take care of the three programs.
  32. Reduce the input data set sizes for all the tests. This is important because the test/ directory is already 145MB in size.

Performance Bottlenecks

The code base has been run on slackware for the month of Jan 2013. Here are a few run time and memory consumption numbers.

Conversation Characteristics

Date Range Run Time Virtual Memory Resident Memory
3 days (1 to 3 Jan, 2013) 9 sec 640MB 150MB
13 days (1 to 13 Jan, 2013) 133 sec 730MB 240MB
31 days (1 to 31 Jan, 2013) 600 sec 640MB 180MB
186 days (1-Jan to 30-Jun, 2013) does not finish in 4 hrs 640MB 180MB

Lesson: solve issue-175 and simplify both channel.conv_len_conv_refr_time() and channel.response_time() functions.

Message Exchange Graph

In the following table graph is a short notation for message exchange graph.

Date Range graph stats generate graph save graph using dot conversation characteristics
3 days (1 to 3 Jan, 2013) 66 users, 494 messages 10 sec 0.8 sec 9 sec
13 days (1 to 13 Jan, 2013) 195 users, 2833 messages 114 sec 243 sec 133 sec
31 days (1 to 31 Jan, 2013) 337 users, 6999 messages 215 sec > 1 hr 600 sec
186 days (1-Jan to 30-Jun, 2013) 768 users, 26812 messages 870 sec not tried does not finish in 4 hrs

generate graph is done using network.message_number_graph() function.
save graph using dot command is done using saver.draw_nx_graph() function.
Both these functions involve drawing large graphs; that's why they are slower.

Plot of Infomaps Communities

The performance of vis.plot_infomap_igraph() function. (memory consumption remains stable at 640MB of virtual memory for all cases)

Date Range edge weight cutoff graph size execution time
5 days (1 to 5 Jan, 2013) 0 95 users, 960 messages 0.4 sec
5 days (1 to 5 Jan, 2013) 10 18 users, 245 messages 0.1 sec
5 days (1 to 5 Jan, 2013) 20 2 users, 20 messages 0.05 sec
13 days (1 to 13 Jan, 2013) 0 195 users, 2833 messages 1 sec
13 days (1 to 13 Jan, 2013) 10 30 users, 1059 messages 0.12 sec
13 days (1 to 13 Jan, 2013) 20 9 users, 464 messages 0.07 sec
31 days (1 to 31 Jan, 2013) 0 337 users, 6999 messages does not finish in 50 minutes
31 days (1 to 31 Jan, 2013) 10 61 users, 3145 messages 0.24 sec
31 days (1 to 31 Jan, 2013) 20 27 users, 1915 messages 0.11 sec

This function uses igraph.plot() which is the likely culprit for the delay.

HITS Algorithm

The performance of network.identify_hubs_and_experts() function.
(memory consumption remains stable at 840MB of virtual memory for all cases)
This function internally makes calls to network.message_number_graph() and networkx.hits() functions. The plotting of graph generated by HITS is not part of this function.

Date Range edge weight cutoff graph size execution time generate graph
31 days (1 to 31 Jan, 2013) 10 61 users, 3145 messages 480 sec 213 sec
186 days (1-Jan to 30-Jun, 2013) 0 768 users, 4707 messages 2428 sec
186 days (1-Jan to 30-Jun, 2013) 10 147 users, 588 messages 2428 sec 870 sec
186 days (1-Jan to 30-Jun, 2013) 20 74 users, 244 messages 2428 sec sec

Even after accounting for the time taken to generate the graph, HITS is an exponential algorithm. The running time of identify_hubs_and_experts() is also independent of the message cutoffs. This is because of the way the function is written. This function computes the complete message exchange graph each time it is called and then as a last step, truncates all the edges below a certain weight. It is better to pass a message exchange graph to this function and ask it to compute HITS graph.

Specific Refactoring Ideas

Reader

  1. Rather than loading all log data and storing it in raw format, we can preprocess the raw data into a meaningful format that helps reduce the runtime of analysis. 1a. The general format and interesting data present in one line of log data is <sender's nick> <receiver's nick> Generalising the format we can represent the log data as -
    • <sender's nick> <receiver's nick, (optional)>
    • type can be login message, logout message, channel exit message, a general message (message sent from one user with a specific receiver) , a conversation message (message directed to specific user)

Data Structure: dict with keys as token type. Present work prototype: https://github.com/kaushik-rohit/irclog-mirror

Message Exchange Graph

Line 75 - 79 are useless and can be moved outside the for loop.