Prasad Talasila edited this page Feb 10, 2018 · 3 revisions

Observations

  1. Too many subscribers are on the CC/BCC list. They are marked as recipients but remain silent. What social role do they play in this scenario?

  2. If only three points in a frequency distribution are statistically significant (at the 5% significance level), what can we say about the parametric estimation / distribution estimation?

  3. Are kernel methods any better than parametric estimation at prediction?

  4. It has been suggested that visual inspection is the best way to verify the estimate. It is also said that the CDF of a distribution is a better predictor than the PDF itself. How can we use these two pieces of information?

  5. Distribution hubs like [email protected] can be distinguished by the fact that they have no out-degree. Hence it is possible to analyze the network formed by just the distribution hubs. However, the distribution hubs need to be verified manually so as not to remove people with a high out-degree/in-degree ratio like [email protected].
    For the real social network created by the users, we need to remove the distribution hubs. An email sent to a distribution hub goes to everyone, so there is no real targeted interaction there. We can then look at the social network and its communities after removing the distribution hubs. It would be interesting to see if there are multi-level communities in the induced network. The library version of Infomap can provide details of multi-level communities and can produce a .tree file.
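
The hub-removal step in observation 5 could be sketched roughly as follows. This is only an illustration under stated assumptions: the email data is assumed to be available as a list of directed `(sender, recipient)` pairs, and the names `candidate_hubs`, `remove_hubs`, `min_in_degree` and the sample addresses are all hypothetical. The manual verification step is represented by a hand-edited set, not automated.

```python
from collections import defaultdict

def degree_counts(edges):
    """Return (in_degree, out_degree) dicts for a directed edge list."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for sender, recipient in edges:
        out_deg[sender] += 1
        in_deg[recipient] += 1
    return in_deg, out_deg

def candidate_hubs(edges, min_in_degree=2):
    """Addresses that receive mail but never send any (zero out-degree).

    min_in_degree is an assumed threshold to skip addresses seen only once.
    """
    in_deg, out_deg = degree_counts(edges)
    return {
        addr
        for addr, d in in_deg.items()
        if d >= min_in_degree and out_deg.get(addr, 0) == 0
    }

def remove_hubs(edges, hubs):
    """Drop every edge that touches a confirmed distribution hub."""
    return [(s, r) for s, r in edges if s not in hubs and r not in hubs]

# Hypothetical sample data: "list" stands in for a distribution address.
edges = [
    ("alice", "list"),
    ("bob", "list"),
    ("carol", "list"),
    ("alice", "bob"),
    ("bob", "alice"),
    ("carol", "alice"),
]

candidates = candidate_hubs(edges)

# Manual verification: keep only the candidates a human has confirmed.
manually_confirmed = {"list"}
confirmed = candidates & manually_confirmed

user_edges = remove_hubs(edges, confirmed)
```

The resulting `user_edges` list is the user-to-user network on which community detection would then be run; exporting it to whatever input format the chosen community-detection tool expects is left out of this sketch.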

Future Work

  1. Cluster and classify authors using network and text mining approaches. First try these two approaches separately and then combine them.
  2. Reduce the dimensions of the text feature vector for the authors.
  3. Provide analytical explanation for the importance of the detected dimensions.
  4. There seems to be a good amount of work on average graphs and on graph clustering for detecting an average graph from a collection of graphs. Look this up.
  5. Any model that comes out of this research must have statistically significant predictive power.
  6. Jeh, Glen, and Jennifer Widom. "SimRank: a measure of structural-context similarity." Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002.
    The above paper can be used to find similar nodes. If these nodes also have the same expertise, then the nodes are presumably substitutable. This is equivalent to automatically finding a new leader in the network with similar expertise and network effects to an existing leader.
  7. The communities for a specific author/subscriber are likely to be based on the interests of the subscriber. It would be interesting to see the evolution of these communities around an author over time. A good way to approach the problem is to pick the top five authors by in-degree and the top five by out-degree, and track the evolution of the community around each of these authors.
    In order to track these ten authors, we can take one month as a unit of time and pick all the threads that got started in that month as the basis for constructing this author graph. As a start, doing this for two consecutive months would be good.
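
The monthly tracking described in item 7 might be prototyped along these lines. This is a minimal sketch, not a settled method: the thread records (a `started` date plus a list of `(sender, recipient)` edges) are an assumed data layout, all names and sample values are hypothetical, and an author's direct neighbours are used as a crude stand-in for the community around that author. The Jaccard overlap between two months is likewise just one possible way to quantify change.

```python
from collections import defaultdict
from datetime import date

def monthly_edges(threads):
    """Group edges by the (year, month) in which each thread started."""
    by_month = defaultdict(list)
    for thread in threads:
        started = thread["started"]
        by_month[(started.year, started.month)].extend(thread["edges"])
    return dict(by_month)

def neighbourhood(edges, author):
    """Authors directly connected to `author`, in either direction."""
    nbrs = set()
    for sender, recipient in edges:
        if sender == author:
            nbrs.add(recipient)
        elif recipient == author:
            nbrs.add(sender)
    nbrs.discard(author)  # ignore self-addressed mail
    return nbrs

def jaccard(a, b):
    """Overlap of two sets in [0, 1]; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical sample threads spanning two consecutive months.
threads = [
    {"started": date(2018, 1, 5), "edges": [("alice", "bob"), ("carol", "alice")]},
    {"started": date(2018, 1, 20), "edges": [("alice", "dave")]},
    {"started": date(2018, 2, 3), "edges": [("bob", "alice"), ("alice", "erin")]},
]

by_month = monthly_edges(threads)
jan_nbrs = neighbourhood(by_month[(2018, 1)], "alice")
feb_nbrs = neighbourhood(by_month[(2018, 2)], "alice")
overlap = jaccard(jan_nbrs, feb_nbrs)
```

Repeating the last three lines for each of the ten tracked authors gives one overlap value per author per pair of consecutive months. Replacing `neighbourhood` with the output of a proper community-detection run would bring the sketch closer to what the item describes.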

Prof. Ashwin Sreenivasan suggested trying out the K4.3 clustering and classification algorithm to auto-detect important features.