eugenefratkin edited this page Mar 23, 2011 · 2 revisions

Daisy Wang has been doing information extraction (text labeling) work using Conditional Random Fields in Postgres that Joe would like to port to MADlib.
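
The core inference step in linear-chain CRF labeling is Viterbi decoding. As a rough illustration of what would need porting, here is a minimal Python sketch with made-up scores (the function name, score dictionaries, and label set are all hypothetical, not from Daisy's code):

```python
# Viterbi decoding for a linear-chain model -- the core inference step in
# CRF-style text labeling.  Scores are illustrative, not from a trained model.

def viterbi(obs_scores, trans_scores):
    """obs_scores: list of {label: score} per token;
    trans_scores: {(prev_label, label): score}.
    Returns the highest-scoring label sequence."""
    labels = list(obs_scores[0])
    # best[i][y] = (score of best path ending in label y at position i, backpointer)
    best = [{y: (obs_scores[0][y], None) for y in labels}]
    for i in range(1, len(obs_scores)):
        row = {}
        for y in labels:
            prev, score = max(
                ((p, best[i - 1][p][0] + trans_scores[(p, y)] + obs_scores[i][y])
                 for p in labels),
                key=lambda t: t[1])
            row[y] = (score, prev)
        best.append(row)
    # backtrack from the best final label
    y = max(labels, key=lambda y: best[-1][y][0])
    path = [y]
    for i in range(len(best) - 1, 0, -1):
        y = best[i][y][1]
        path.append(y)
    return path[::-1]
```

A database port would express the same dynamic program as a recursive or iterative SQL query over per-token score tables.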

Inference and learning methods for graphical models (e.g. Bayes nets).
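
For a sense of the simplest inference primitive involved, here is a toy posterior computation by enumeration on a two-node Bayes net (Rain → WetGrass); all numbers are made-up illustrative values, not part of any proposed API:

```python
# Exact inference by enumeration on a tiny Bayes net: Rain -> WetGrass.
# All probabilities are illustrative.

p_rain = {True: 0.2, False: 0.8}                       # P(Rain)
p_wet_given_rain = {True:  {True: 0.9, False: 0.1},    # P(Wet | Rain)
                    False: {True: 0.2, False: 0.8}}

def posterior_rain_given_wet():
    """P(Rain | WetGrass=True) via Bayes' rule: normalize the joint."""
    joint = {r: p_rain[r] * p_wet_given_rain[r][True] for r in (True, False)}
    z = sum(joint.values())
    return {r: joint[r] / z for r in joint}
```

The in-database versions would generalize this to factored joint distributions stored as tables, with aggregation doing the marginalization.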

One-pass approximate quantiles: We should either invent an extension to the CountMin sketch approach for discrete domains, or look into one of these algorithms:
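
As a baseline for the discrete-domain case, exact one-pass quantiles only need a count per distinct value plus a cumulative walk; a CountMin-style sketch would replace the exact counts when the domain is too large. A minimal sketch (the function name is ours, purely illustrative):

```python
# One-pass quantiles over a small discrete domain: count each value in a
# single pass, then walk the cumulative counts to the q-th fraction.
from collections import Counter

def stream_quantile(stream, q):
    counts = Counter(stream)       # single pass over the data
    n = sum(counts.values())
    cum = 0
    for value in sorted(counts):   # domain assumed small enough to sort
        cum += counts[value]
        if cum >= q * n:
            return value
```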

Graph algorithms (e.g. for social network analysis)

  • clustering coefficients (Joe has a naive SQL implementation, but one can do much better)
  • PageRank (we have a Greenplum MapReduce implementation)
  • centrality metrics
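
For reference, the local clustering coefficient of a node is the fraction of pairs of its neighbors that are themselves connected. A naive in-memory sketch (not Joe's SQL version, and deliberately simple; `adj` is an assumed adjacency-set representation of an undirected graph):

```python
# Local clustering coefficient of node v: connected neighbor pairs divided
# by all neighbor pairs.  adj maps each node to its set of neighbors.

def clustering_coefficient(adj, v):
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in nbrs for b in nbrs
                if a < b and b in adj[a])     # each pair counted once
    return links / (k * (k - 1) / 2)
```

The "much better" versions would avoid enumerating all neighbor pairs per node, e.g. via triangle-counting joins.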

Sampling methods.
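
One obvious candidate primitive here is reservoir sampling (Vitter's Algorithm R), which draws a uniform sample of k items from a stream of unknown length in one pass; a minimal sketch, with the seed parameter added only for reproducibility:

```python
# Reservoir sampling (Algorithm R): keep the first k items, then replace a
# random slot with item i+1 with probability k/(i+1).
import random

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)   # uniform in [0, i]; lands in the
            if j < k:               # reservoir with probability k/(i+1)
                sample[j] = item
    return sample
```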

Sets of primitives necessary for algorithm development:

  • vector operations: assume v and w are vectors, M is a matrix, and a is a scalar. We need hash(v); element-wise operations v*w, v+w, v/w, v-w, v*a, v-a; dot(v,v); v^a; versions of the same where NULL is treated as 0; sum(v); distance(v,w) with some options; covariance(v,w)
  • expose the functionality of LAPACK to SQL, including matrix transforms, decompositions, inversions, and so on.
  • sets of statistical functions: for each distribution (Uniform, Gaussian, Poisson, Chi-squared, Exponential, Binomial, Multinomial, Student's t) provide a random number generator, PDF, CDF, and inverse CDF (where applicable).
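
To pin down the intended "NULL treated as 0" semantics of the vector primitives, here is a Python sketch (function names `vec_add`, `dot`, `distance` are ours; NULL is modeled as None; a SQL UDF version would follow the same logic):

```python
# Sketch of element-wise vector primitives with NULL-as-0 semantics.
import math

def _z(x):
    """Coerce NULL (None) to 0."""
    return 0.0 if x is None else x

def vec_add(v, w):
    return [_z(a) + _z(b) for a, b in zip(v, w)]

def dot(v, w):
    return sum(_z(a) * _z(b) for a, b in zip(v, w))

def distance(v, w):
    # Euclidean here; other metrics would be the "some options" above
    return math.sqrt(sum((_z(a) - _z(b)) ** 2 for a, b in zip(v, w)))
```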