madlib_the_sql.tex

% THIS IS AN EXAMPLE DOCUMENT FOR VLDB 2012
% based on ACM SIGPROC-SP.TEX VERSION 2.7
% Modified by  Gerald Weber <gerald@cs.auckland.ac.nz>
% Removed the requirement to include *bbl file in here. (AhmetSacan, Sep2012)
% Fixed the equation on page 3 to prevent line overflow. (AhmetSacan, Sep2012)

\documentclass{vldb}
\usepackage{times}
\usepackage{graphicx}
%\usepackage{balance}  % for  \balance command ON LASMAD PAGE  (only there!)
\usepackage{alltt,algo}
\usepackage{ctable} % Nicer tables, e.g., \toprule
\usepackage{verbatim} % Also provides the comment environment
\usepackage{dcolumn} % Decimal point alignment in tables
\usepackage{xcolor} % for color comments
% \usepackage[
% % bookmarks,
% % colorlinks=false,
% % linkcolor=blue,
% % citecolor=blue,
% % pagebackref=false,
%   pdftitle={Introducing MADlib (MAD Skills, the SQL)},
%   pdfauthor={}
% ]{hyperref}                            % Für PDF-Features
\usepackage{url}
\usepackage[protrusion=true,expansion=true]{microtype}
% "txfonts" is apparently needed to fixup the formatting of lstlistings
\usepackage{txfonts}
\usepackage{paralist}
\usepackage{listings}
\lstset{
basicstyle=\ttfamily\scriptsize,       % the size of the fonts that are used for the code
numbers=left,                   % where to put the line-numbers
numberstyle=\ttfamily,      % the size of the fonts that are used for the line-numbers
%aboveskip=0pt,
%belowskip=0pt,
stepnumber=1,                   % the step between two line-numbers. If it is 1 each line will be numbered
%numbersep=10pt,                  % how far the line-numbers are from the code
breakindent=0pt,
firstnumber=1,
%backgroundcolor=\color{white},  % choose the background color. You must add \usepackage{color}
showspaces=false,               % show spaces adding particular underscores
showstringspaces=false,         % underline spaces within strings
showtabs=false,                 % show tabs within strings adding particular underscores
frame=leftline,
tabsize=2,  		% sets default tabsize to 2 spaces
captionpos=b,   		% sets the caption-position to bottom
breaklines=false,    	% sets automatic line breaking
breakatwhitespace=true,    % sets if automatic breaks should only happen at whitespace
%escapeinside={\%}{)}          % if you want to add a comment within your code
columns=fixed,
basewidth=0.52em,
% are you fucking kidding me lstlistings?  who puts the line numbers outside the margin?
xleftmargin=6mm,
xrightmargin=-6mm,
numberblanklines=false,
language=C++,
morekeywords={table,scratch,channel,interface,periodic,bloom,state,bootstrap,morph,monotone,lset,lbool,lmax,lmap}
}
% \lstset{escapeinside={(*}{*)}}

\makeatletter
% % The vldb style file is an anachronsism. So no need to worry about hacks anyway.
% % We redefined the listing environment by package minted here. (If we did not,
% % we would also get an error because \aboveparskip is undefined.)
% % Copied from vldb.csl
% \@ifundefined{code}{\newcounter {code}} % this is for LaTeX2e
% 
% \def\fps@code{tbp}
% \def\ftype@code{1}
% \def\ext@code{lof}
% \def\fnum@code{Listing \thecode}
% \def\code{\@float{code}}
% \let\endcode\end@float
% \@namedef{code*}{\@dblfloat{code}}
% \@namedef{endcode*}{\end@dblfloat}
% \makeatother
% 
% \usepackage{minted}

% BEGIN Layout
	\newcommand{\otoprule}{\midrule[\heavyrulewidth]}
	
	\newcolumntype{+}{>{\global\let\currentrowstyle\relax}}
	\newcolumntype{^}{>{\currentrowstyle}}
	\newcommand{\rowstyle}[1]{\gdef\currentrowstyle{#1}%
		#1\ignorespaces
	}
% END Layout

% BEGIN Convenience commands
  % % BEGIN Listing environments
  % \newminted{cpp}{mathescape,
  %         linenos,
  %         numbersep=5pt,
  %         fontsize=\scriptsize,
  %         framesep=2mm}
  % \newminted{python}{mathescape,
  %           linenos,
  %           numbersep=5pt,
  %           fontsize=\scriptsize,
  %           framesep=2mm}
  % \newminted{sql}{mathescape,
  %         numbersep=5pt,
  %         tabsize=4,
  %         fontsize=\scriptsize,
  %         framesep=2mm}
  % % END Listing environments
	% BEGIN COMMENTS
  % \newcommand{\jmh}[1]{{\textcolor{red}{#1 -- jmh}}}
  % \newcommand{\fs}[1]{{\textcolor{orange}{#1 -- fs}}}
  % \newcommand{\ksn}[1]{{\textcolor{blue}{#1 -- ksn}}}
  % END COMMENTS
% END Convenience commands

% BEGIN Mathematical Definitions
	% BEGIN Set Symbols
		\newcommand{\setsymbol}[1]{\mathbb{#1}}
		\newcommand{\N}{\@ifstar{\setsymbol{N}_0}{\setsymbol{N}}}
		\newcommand{\R}{\setsymbol{R}}
	% END Set Symbols
	\renewcommand{\vec}[1]{\ensuremath{\boldsymbol{#1}}}
% END Mathematical Definitions

\hyphenation{hand-ling}

\graphicspath{{FIGS/}}


\begin{document}

% ****************** TITLE ****************************************

\title{The MADlib Analytics Library\\{\LARGE\em or MAD Skills, the SQL}}


\numberofauthors{11} %  in this sample file, there are a *total*
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
% reasons) and the remaining two appear in the \additionalauthors section.

\author{Joseph M. Hellerstein\\{\small  hellerstein@berkeley.edu}\\{\small U.C. Berkeley} \and
 Christoper R\'{e}\\{\small chrisre@cs.wisc.edu}\\{\small U. Wisconsin} \and
 Florian Schoppmann\\{\small Florian.Schoppmann@emc.com}\\{\small Greenplum} \and
 Daisy Zhe Wang\\{\small daisyw@cise.ufl.edu}\\{\small U. Florida} \and
 % \and
 % Gavin Sherry\\{\small Greenplum} \and
 Eugene Fratkin\\{\small Eugene.Fratkin@emc.com}\\{\small Greenplum} \and
 Aleksander Gorajek\\{\small Aleksander.Gorajek@emc.com}\\{\small Greenplum} \and
 Kee Siong Ng\\{\small KeeSiong.Ng@emc.com}\\{\small Greenplum} \and
 Caleb Welton\\{\small Caleb.Welton@emc.com}\\{\small Greenplum} \and
 Xixuan Feng\\{\small xfeng@cs.wisc.edu} \\ {\small U. Wisconsin} \and
 Kun Li\\{\small kli@cise.ufl.edu} \\{\small U. Florida} \and
 Arun Kumar\\{\small arun@cs.wisc.edu} \\{\small U. Wisconsin} \and
 % Steven Hillion\\{\small Alpine Data Labs} \and
 % Luke Lonergan\\{\small Greenplum} \and
 % Kee Siong Ng\\{\small Greenplum} \and 
 }

\maketitle

\begin{abstract}
  MADlib is a free, open-source library of in-database analytic methods.
  It provides an evolving suite of SQL-based algorithms for machine
  learning, data mining and statistics that run at scale within a
  database engine, with no need for data import/export to other tools.
  The goal is for MADlib to eventually serve a role for scalable
  database systems that is similar to the CRAN library for R: a
  community repository of statistical methods, this time written with
  scale and parallelism in mind.

  In this paper we introduce the MADlib project, including the
  background that led to its beginnings, and the motivation for its
  open-source nature.  We provide an overview of the library's
  architecture and design patterns, and provide a description
  of various statistical methods in that context.  We include performance and speedup results of a core design pattern from one of those methods over the Greenplum parallel DBMS on a modest-sized test cluster.    We then report on two 
  initial efforts at incorporating academic research into MADlib, which is one of the 
  project's goals. 
  
  MADlib is freely
  available at \url{http://madlib.net}, and the project is open
  for contributions of both new methods, and ports to additional
  database platforms.  
\end{abstract}


\section{Introduction:\\From Warehousing to Science}
\noindent
Until fairly recently, large databases were used mainly for accounting
purposes in enterprises, supporting financial record-keeping and
reporting at various levels of granularity.  {\em Data
Warehousing} was the name given to industry practices for these
database workloads.  Accounting, by definition, involves significant
care and attention to detail. Data Warehousing practices followed suit
by encouraging careful and comprehensive database design, and by following exacting policies regarding the quality of data loaded into the database.

Attitudes toward large databases have changed quickly in the
past decade, as the focus of usage has shifted from
accountancy to analytics.  The need for correct accounting and data
warehousing practice has not gone away, but it is becoming a
shrinking fraction of the volume---and the value---of large-scale data
management.  The emerging trend focuses on the use of a wide range of
potentially noisy data to support predictive analytics, provided via 
statistical models and algorithms.  {\em Data Science} is a
name that is gaining currency for the industry practices evolving
around these workloads.

Data scientists make use of database engines in a very different way than traditional data warehousing professionals.  Rather than carefully
designing global schemas and ``repelling'' data until it is integrated,
they load data into private schemas in whatever form is convenient.  Rather than focusing on simple OLAP-style drill-down reports, they implement rich statistical
models and algorithms in the database, using extensible SQL as a
language for orchestrating data movement between disk, memory, and multiple parallel machines.  In short, for data scientists a DBMS is a scalable analytics runtime---one that is conveniently compatible with the database systems widely used for transactions and accounting.

In 2008, a group of us from the database industry, consultancy,
academia, and end-user analytics got together to describe this usage
pattern as we observed it in the field.  We dubbed it {\bf MAD}, an acronym for the {\em Magnetic} (as opposed to
repellent) aspect of the platform, the {\em Agile} design patterns used for
modeling, loading and iterating on data, and the {\em Deep} statistical models and
algorithms being used.  The ``MAD Skills'' paper that resulted described
this pattern, and included a number of non-trivial analytics
techniques implemented as simple SQL scripts~\cite{DBLP:journals/pvldb/CohenDDHW09}.

After the publication of the paper, 
significant interest emerged not only in its design aspects, but
also in the actual SQL implementations of statistical methods.  This
interest came from many directions: customers were
requesting it of consultants and vendors, and academics were
increasingly publishing papers on the topic.  What was missing was a
software framework to focus the energy of the community, and connect
the various interested constituencies.  This led to the design of
\textbf{MADlib}, the subject of this paper.  

\subsection*{Introducing MADlib}
MADlib is a library of analytic methods that can be installed and
executed within a relational database engine that supports extensible
SQL.  A snapshot of the current contents of MADlib including methods and
ports is provided in Table~\ref{tab:methods}.  This set of methods and ports is intended to grow over time.

The methods in MADlib are designed both for in- or out-of-core execution, and for the shared-nothing, ``scale-out''
parallelism offered by modern parallel database engines, ensuring that
computation is done close to the data.  The core functionality is
written in declarative SQL statements, which orchestrate data movement to and from disk, and across networked machines.  Single-node inner loops use SQL
extensibility to call out to high-performance math libraries in
user-defined scalar and aggregate functions.  At the highest level,
tasks that require iteration and/or structure definition are coded in
Python driver routines, which are used only to kick off the data-rich
computations that happen within the database engine.

MADlib is hosted publicly at github, and readers are encouraged to
browse the code and documentation via the MADlib website
\url{http://madlib.net}.  The initial MADlib codebase reflects contributions
from both industry (Greenplum) and academia (UC Berkeley, the
University of Wisconsin, and the University of Florida).  Code
management and Quality Assurance efforts have been contributed by
Greenplum.  At this time, the project is ready to consider
contributions from additional parties, including both new methods and
ports to new platforms.

\begin{table}
  \begin{center}
  \begin{tabular}{|l|l|}
    \hline
    {\bf Category} & {\bf Method} \\ \hline\hline
    Supervised Learning
          & Linear Regression\\
          & Logistic Regression \\
          & Naive Bayes Classification\\
          & Decision Trees (C4.5)\\
          & Support Vector Machines\\ \hline
    Unsupervised Learning
          & k-Means Clustering\\
          & SVD Matrix Factorization\\
          & Latent Dirichlet Allocation\\
          & Association Rules\\ \hline
    Decriptive Statistics
          & Count-Min Sketch\\
          & Flajolet-Martin Sketch\\
          & Data Profiling\\
          & Quantiles\\ \hline
    Support Modules
          & Sparse Vectors\\
          & Array Operations\\
          & Conjugate Gradient Optimization \\\hline
  \end{tabular}
  \end{center}
  \vspace{-1em}
\caption{Methods provided in MADlib v0.3. This version has been tested on two DBMS platforms: PostgreSQL (single-node open source) and Greenplum Database (massively parallel commercial system, free and fully-functional for research use.)}
\label{tab:methods}
\end{table}
% 
% \begin{table*} 
%   \begin{center}
%   \begin{tabular}{|p{1.25in}|p{1.75in}|p{3.5in}|}
%   \hline
%   {\bf DBMS} & {\bf R\&D Access} & {\bf Notes}\\ \hline
%     PostgreSQL& Open source & Single-node only. \\ \hline
%     Greenplum Database & ``Community Edition'' available free for non-production use & Shared-nothing parallel DBMS for clusters and multicore. \\ \hline
%      \end{tabular}
% \end{center}
% \caption{Current MADlib ports. \jmh{This isn't very useful.  Worth pointing out somewhere that GP Community Edition is free and not crippleware.} \ksn{A two line table probably doesn't deserve to take up so much space. We can say this in two sentences.}}
% \label{tab:ports}
% \end{table*}

\section{Goals of the Project} 
The primary goal of the MADlib open-source project is to accelerate
innovation and technology transfer in the Data Science community via a shared
library of scalable in-database analytics, much as the CRAN library
serves the R community~\cite{ripley2001r}.  Unlike CRAN, which is
customized to the R analytics engine, we hope that MADlib's grounding
in standard SQL can lead to community ports to a
variety of parallel database engines. 
% \fs{True in theory, but we do currently use a lot of PG and GP specific SQL and extensibility. On the SQL layer, few abstraction efforts have been undertaken (essentially just the m4 preprocessor). At this point, we certainly do not have an edge over R in terms of portability. (While I agree that this is what the C++ AL is for, why we are trying to port pgSQL to Python for driver code, etc.)}
%\jmh{We gain little by hedging here.  This is a good goal for us to embrace, and if we fail at it we should take the consequences.}

% I decided to remove the benchmarking discussion.  --JMH
% In addition to its primary goal, MADlib can serve a number of other
% purposes for the community.  As a standard and relatively rich
% open source platform, it enables apples-to-apples comparisons that can
% be deeper than the traditional TPC benchmarks. For example, the DBMS
% backend can be held constant, so that two implementations for the same task
% (e.g., entity extraction from text) can be compared for runtime and answer
% quality.  Similarly, as MADlib is ported to more platforms, an
% algorithm can be held constant, and two backend DBMS engines can be
% compared for performance.  This latter comparison has been notoriously
% difficult in the past, due to the closed-source, black-box nature of
% ``data mining'' and analytic toolkits that were not only customized to
% specific platforms, but also lacked transparency into their analytic
% algorithms.  In MADlib, such ports may differ on database extension interfaces and calls to single-node math libraries, but should be able to retain the same macro-programmed implementation in SQL (Section~\ref{sec:macro}.)

% \fs{While I'd love this to be true, I don't see that MADlib would be more of an apple-to-apples comparison than using existing benchmarks. Performance would again largely depend on the quality of the port, i.e., how well the port adapts to vendor-specific implementations. Currently, we are using lots of PG/GP specific SQL. In fact, much code is unfortunately still written in pgSQL. (I don't like it, but that's how it is.) Not even user-defined aggregates---arguably our most fundamental building block---is standard SQL.}
\subsection{Why Databases?}

For decades, statistical packages like SAS, Matlab and R have been the key tools for deep analytics, and the practices surrounding these tools have been elevated into widely-used traditional methodologies.  
% DELETED -- better not to pick fights with specific vendors.  -JMH
% SAS in particular has a
% large customer base, and its proposed analytics methodologies have
% become engrained in the modern enterprise. Nevertheless, usurpers loom
% on the horizon. 
One standard analytics methodology advocated in this domain
is called SEMMA: Sample, Explore, Modify, Model, Assess. The ``EMMA''
portion of this cycle identifies a set of fundamental tasks
that an analyst needs to perform, but the first, ``S'' step makes less and less sense in many settings today.  The costs of computation and storage are increasingly cheap, and entire data sets can often be processed efficiently by a cluster of computers.  Meanwhile, competition for extracting value from data has become increasingly refined.  Consider fiercely competitive application domains like online advertising or politics.  It is of course important to target ``typical'' people (customers, voters) that would be captured by sampling the database.  But the fact that SEMMA is standard practice means that optimizing for a sample provides no real competitive advantage.  Winning today requires extracting advantages in the long tail of ``special interests'', a practice known as ``microtargeting'', ``hypertargeting'' or ``narrowcasting''.  In that context, the first step of SEMMA essentially defeats the remaining four steps, leading to simplistic, generalized decision-making that may not translate well to small populations in the tail of the distribution. In the era of ``Big Data'', this argument for enhanced attention to long tails applies to an increasing range of use cases.
%\fs{That explains why you do not want to weed out anything \emph{before} storing the data. But what is the reason for doing large-scale machine learning \emph{after} you have identified you microtarget? Concretely: Why prefer logistic regression over 1~billion rows to 1~million rows? If the model is reasonable, shouldn't the difference always be insignificant?}
%\jmh{Sure, at some point you often have small-data problems.  That's fine.  The point here was to demonstrate that you will also have big-data problems, and you can't solve those with a small-data solution.}

% throws away the very
% competitive advantage that customers hope to get from acquiring their
% valuable data. Sampling decouples the human intelligence in
% modeling from where these insights are put to use (on the entire
% data). This decoupling makes analysts less effective: they must guess
% about the statistical and performance robustness of their model. If
% they guess incorrectly, they may not know about it for weeks or
% months. (SAID BETTER BY PEOPLE WHO TALK GET TO REGULARLY TALK TO
% CUSTOMERS)

Driven in part by this observation, momentum has been gathering around efforts to develop scalable full-dataset analytics. One popular alternative is to
push the statistical methods directly into new parallel processing platforms---notably, Apache Hadoop. For example, the open-source Mahout project
aims to implement machine learning tools within Hadoop, harnessing
 interest in both academia and industry~\cite{DBLP:conf/nips/ChuKLYBNO06,mahout}. This is
certainly a plausible path to a solution, and Hadoop is being advocated as a promising approach even by major database players, including EMC/Greenplum, Oracle and Microsoft.

At the same time that the Hadoop ecosystem has been evolving, the SQL-based analytics ecosystem has grown rapidly as well, and large volumes of valuable data are likely to pour into
SQL systems for many years to come. There is a rich
ecosystem of tools, know-how, and organizational requirements that
encourage this.  For these cases, it would be helpful to push statistical methods into the DBMS. And as we will see, massively parallel databases form a surprisingly useful platform for sophisticated analytics.  MADlib currently targets this environment of in-database analytics.

%% There is nothing to be gained by pissing on Hadoop.  -- JMH
% there
% are drawbacks with the Hadoop approach: its performance is untenably
% slow compared to database processing as it must provide higher levels
% of fault tolerance than other systems. Additionally, the sheer number
% of machines that are required to achieve good performance of makes it
% unclear that Hadoop systems are cost-effect in all but the most scale
% heavy environments.
% 
% For such users, Greenplum provides a cost-effective solution for
% scalable analytics. It does not require a complicated and error-prone
% import/export cycle to Haddop nor forces the user to work over a
% snapshot of the data: one works on the raw data and does their
% analysis on production (or very near-to-production) data. This allows
% user to maximize the value proposition of storing all that data in a
% cost-effective manner.

%%   Q1: why not SAS?  Q2: what about the fact that SQL nerds dont do
%%   data analysis?  Q3: does Hadoop and Mahout make this irrelevant?

%% Chris asked Q1 and Q2 this way:

%% There is another major shift that MADlib potentially allows:
%%   analysts can get closer to their data than ever before. For
%%   example, SAS promotes a model of an analysts workflow called SEMMA
%%   model (Sample, Explore, Modify, Model, Assess). This is (afaik) the
%%   industry standard. To me, what's totally busted about this loop is
%%   the S -- if you're looking for something rare, the sampling step
%%   throws out the most interesting bits! Then your EMM steps where you
%%   build understanding are of a small fraction of your data. This is
%%   the part where the analyst is currently far from the data. As a
%%   result, their entire conversation is not with the data but with a
%%   small sample. If something goes wrong, you (maybe) find out about
%%   it in the A step (which is on the whole data).

%%   Moreover, it's totally unnecessary to do the S loop in the MADlib
%%   world view, and the fact that the S step has been elevated to the
%%   level of feature is a testament to how broken the current tool
%%   chain is.
%% A couple follow-on points:

%%     * WRT SAS and sampling, the way I heard it from the FAN guys its all about competitive advantage in the tails of the distribution.  Something like anybody can tackle the 20% of cases in the head of the distribution.  The competitive advantage is in tackling the long tails: e.g. target ads to toyota-truck-driving-latin-music-loving-sushi-eating women in cold climates.  Simple sample/extract techniques blow on that stuff.
%%     * WRT Hadoop/Mahout:  I think we need to mention it here and acknowledge its been an attractive path to a solution, and Hadoop certainly is getting the attention of the Web and ML communities [Stanford paper].  Mahout is an attempt to package that energy, with some institutional support from MapR.  But despite the success of MapReduce, lots of important data is still going into databases and will continue to do so for years to come for a host of reasons that are both technical and organizational.  Even if Mahout succeeds wildly (and it isnt doing so to date, but I dont think we want to bother saying that), theres a critical vacuum to be filled in SQL-land.  What we can (re-)learn from the Hadoop community is the power of open source teamwork, and the desire for agile platforms for analytics.  Theres no reason we cant direct that agile thinking toward the data in databases.
%% *


\subsection{Why Open Source?}


From the beginning, MADlib was designed as an open-source project with
corporate backing, rather than a closed-source corporate effort with
academic consulting. This decision was motivated by a number of
factors, including the following:

\begin{itemize}
    \item \textbf{The benefits of customization}: Statistical methods are rarely used as turnkey solutions.  As a result, it is common for data scientists to want to modify and adapt canonical models and methods (e.g., regression, classification, clustering) to their own purposes.  This is a very tangible benefit of open-source libraries over traditional closed-source packages. Moreover, in an open-source community there is a process and a set of positive incentives for useful modifications to be shared back to the benefit of the entire community.
    \item \textbf{Valuable data vs.\ valuable software}:  In many emerging business sectors, the corporate value is captured in the data itself, not in the software used to analyze that data.  Indeed, it is in the interest of these companies to have the open-source community adopt and improve their software.  Open-source efforts can also be synergistic for vendors that sell commercial software, as evidenced by companies like EMC/Greenplum, Oracle, Microsoft and others beginning to provide Apache Hadoop alongside their commercial databases.  Most IT shops today run a mix of open source and proprietary software, and many software vendors are finding it wise to position themselves intelligently in that context.  
    % \fs{Reviewer~2 wanted this section shorter, and the next sentence is, IMHO, indeed an instance where he is right. The sentence raises more questions (why is it an advantage if a Greenplum-sponsored project helps competitors?) than it answers.} Meanwhile, for most database system vendors, their core competency is not in statistical methods, but rather in the engines that support those methods, and the service industry that evolves around them.  For these vendors, involvement and expertise with an open-source library like MADlib is an opportunity to expand their functionality and service offerings.
    %\fs{As a reader I ask ask myself here: Does Greenplum have competency in statistical methods? If so, why do they want to let competitors benefit from that? The argument would be plausible if MADlib was a joint venture, but leaves open questions as a unilateral move by Greenplum.}
%MADlib *is* a joint venture, and that's the point -- it's a collaboration between industry and research.  Wouldn't have happened had I not pushed it, and had Greenplum not supported it.
    \item \textbf{Closing the research-to-adoption loop}:  Very few traditional database customers have the capacity to develop significant in-house research into computing or data science. 
    % \fs{Again, one would assume that the Greenplum side of MADlib could not care less if other companies are lacking certain capacities.} 
    % \jmh{Clarified to make sure that I'm referring to DBMS customers, not vendors.}
    On the other hand, it is hard for academics doing computing research to understand and influence the way that analytic processes are done in the field.  An open-source project like MADlib has the potential to connect academic researchers not only to industrial software vendors, but also directly to the end-users of analytics software.  This can improve technology transfer from academia into practice without requiring database software vendors to serve as middlemen.  It can similarly enable end-users in specific application domains to influence the research agenda in academia.
    \item \textbf{Leveling the playing field, encouraging innovation}:  Over the past two decades, database software vendors have developed proprietary data mining toolkits consisting of textbook algorithms.  It is hard to assess their relative merits.  Meanwhile, other communities in machine learning and internet advertising have also been busily innovating, but their code is typically not well packaged for reuse, and the code that is available was not written to run in a database system.  None of these projects has demonstrated the vibrancy and breadth we see in the open-source community surrounding R and its CRAN package. The goal of MADlib is to fill this gap: 
     %     
     % A robust open source project like MADlib 
     % \fs{We should find the right mix of self-confidence and at the same time acknowledging that MADlib is still at an early stage and beta in many respects.} 
     % \jmh{Well said.  Will try to address}
     %can 
     bring the database community up to a baseline level of competence on standard statistical algorithms, remove opportunities for proprietary \textit{FUD}, and help focus a large community on innovation and technology transfer.
\end{itemize}

\subsection{A Model for Open-Source Collaboration}

The design of MADlib comes at a time when the connections between
open-source software and academic research seem particularly frayed.
MADlib is designed in part as an experiment in binding these
communities more tightly, to face current realities in software
development.

In previous decades, many important open-source packages originated in universities and evolved into significant commercial products.  Examples include the Ingres and Postgres database systems, the BSD UNIX and Mach operating systems, the X Window user interfaces and the Kerberos authentication suite.  These projects were characterized by aggressive application of new research ideas, captured in workable but fairly raw public releases that matured slowly with the help of communities outside the university.  While all of the above examples were incorporated into commercial products, many of those efforts emerged years or decades after the initial open-source releases, and often with significant changes.

Today, we expect successful open-source projects to be quite mature,
often comparable to commercial products.  To achieve this level of
maturity, most successful open-source projects have one or more major
corporate backers who pay some number of committers and provide
professional support for Quality Assurance (QA).  This kind of investment is typically
made in familiar software packages, not academic research projects.  Many of the most popular examples---Hadoop, Linux,
OpenOffice---began as efforts to produce open-source alternatives to well-identified,
pre-existing commercial efforts. 
%\fs{I think Linux is too big a project to say it does not contain research. This paragraph also sounds a bit too negative to me. Like ``Today's OSS consists mainly of boring rip-offs of commercial products.'' 
% \jmh{Moderated somewhat.  I didn't mean that Linux doesn't contain research, but rather that it wasn't a research project at base.  And yes -- today's OSS projects do mostly consist of boring rip-offs of stuff from industry.}
% While not completely untrue, there is more to the story.}

MADlib is making an explicit effort to explore a new model for industry support of academic research via open source.  Many academic research projects are generously supported by financial grants and gifts from companies.  In MADlib, the corporate donation has largely consisted of a commitment to allocate significant professional software engineering time to bootstrap an open-source sandbox for academic research and tech transfer to practice. 
% \fs{This paragraph sounds as if MADlib is contrasting, say, Linux here in that (a) it contains cutting-edge research and (b) corporate donations consist mainly of labor instead of money. I do not see that MADlib is singular in that respect. Think of what, e.g., RedHat does, or in fact any company that makes money with GPL'd software.} 
% \jmh{The contrast in the paragraph is pretty clear: as opposed to writing somebody like Joe Hellerstein a check, we offer him an engineering team.  I didn't say anything about Linux.}
This leverages a strength of industry that is not easily replicated by government and other non-profit funding sources.  Companies can recruit high-quality, experienced software
engineers with the attraction of well-compensated, long-term career
paths.  Equally important, software shops can offer an entire software
engineering pipeline that cannot be replicated on campus:
this includes QA processes and QA engineering staff.  The hope is that the corporate staffing of research projects like MADlib can enable more impactful
academic open-source research, and speed technology transfer to industry. %\fs{Linux is used and advanced by lots of students and academics, and there are companies like RedHat, IBM, etc.}
%\jmh{Well sure, and Windows is also used and tweaked by lots of academics.  I don't see many companies bootstrapping new research projects for academia like MADlib; they usually just write checks.}

\subsection{MADlib Status}

MADlib is still young, currently (as of March, 2012) at Version 0.3.  The initial versions have focused on establishing a baseline of 
useful functionality, while laying the groundwork for future evolution.  Initial development began with the non-trivial work of building the general-purpose
framework described in Section~\ref{sec:gen}.  Additionally, we wanted robust
implementations of textbook methods that were most frequently
requested from customers we met through contacts at Greenplum.  Finally, we wanted
to validate MADlib as a research vehicle, by fostering a small number of  university groups working in the area to experiment with the platform and get their code disseminated (Section~\ref{sec:university}).

\section{MADlib Architecture}
\label{sec:gen}
% 
% \ksn{We support PostgreSQL, but make almost no mention of non-parallel in-database algorithms/design issues.
%  Ideally, we would start with issues surrounding the use of non-parallelised SQL to implement statistical methods, then move on to additional complications introduced by parallelism.
%  It's probably too late to change things now, but maybe we can add a sentence near the beginning of Section 3 to say that the MADlib architecture is designed to exploit parallel databases, and that's what we will focus on (exclusively) in Section 3.}
% \fs{The first paragraph of the abstract puts emphasize on parallelism. Likewise, the second paragraph of Section 1 under "Introducing MADlib". So I think we are covered in general. But I don't feel strongly here.}
% 
% \jmh{I put in some brief mention of moving data to/from disk.  I think the VLDB audience will understand that as a built-in feature of using SQL.}
The core of traditional SQL---\texttt{SELECT...} \texttt{FROM...} \texttt{WHERE}... \texttt{GROUP BY}---is
quite a powerful harness for orchestrating bulk data processing across
one or many processors and disks.  It is also a portable, native
language supported by a range of widely-deployed open-source and
commercial database engines.  This makes SQL an attractive
framework for writing data-intensive programs.  Ideally, we would like
MADlib methods to be written entirely in straightforward and portable
SQL. Unfortunately, the portable core of ``vanilla'' SQL is often not
quite enough to express the kinds of algorithms needed for advanced
analytics.

Many statistical methods boil down to linear-algebra expressions over
matrices.  For relational databases to operate over very large
matrices, this presents challenges at two scales.  At a macroscopic
scale, the matrices must be intelligently partitioned into chunks that
can fit in memory on a single node.  Once partitioned, the pieces can
be keyed in such a way that SQL constructs can be used to orchestrate
the movement of these chunks into and out of memory across one or many
machines.  At a microscopic scale, the database engine must invoke
efficient linear-algebra routines on the pieces of data it gets in
core.  To this end it has to have the ability to very quickly invoke
well-tuned linear-algebra methods.

We proceed to discuss issues involved at both of these levels in a bit more detail, and solutions we chose to implement in MADlib.
\subsection{Macro-Programming (Orchestration)}
\label{sec:macro}
A scalable method for linear algebra depends upon divide-and-conquer techniques: intelligent
partitioning of the matrix, and a pattern to process the pieces and
merge results back together.  This partitioning and dataflow is
currently outside the scope of a traditional query optimizer or
database design tool.  But there is a rich literature from scientific
computing on these issues (e.g.,~\cite{choi1996scalapack}) 
% \fs{I suppose this refers to ``Applied numerical linear algebra''?} 
that database programmers can use to craft efficient in-database implementations.  Once data is properly partitioned, database engines shine at orchestrating the
resulting data movement of partitions and the piecewise results of
computation.


\subsubsection{User-Defined Aggregation}
% 
% \ksn{This section looks like it should follow the first paragraph of Section 3.1 to prove the point that SQL is pretty good as the basis of a macro-programming or parallel declarative programming language.
%   We should mention that UDA and parallel constructs like the merge and final functions lend themselves nicely to the implementation of the large body of work on online learning algorithms and model averaging techniques over such algorithms currently being actively pursued in the machine learning literature. The Zinkevish, Weimer, Smola, Li paper is a good reference.}
%  \fs{I followed KeeSiong's suggestions.}

The most basic building block in the macro-programming of MADlib is the use of user-defined aggregates (UDAs). In general, aggregates---and the related window functions---are the natural way in SQL to implement mathematical functions that take as input the values of an arbitrary number of rows (tuples). DBMSs typically implement aggregates as data-parallel streaming algorithms. And there is a large body of recent work on online learning algorithms and model-averaging techniques that fit the computational model of aggregates well (see, e.g., \cite{Zinkevich10}).

Unfortunately, extension interfaces for user-defined aggregates vary widely across vendors and open-source systems.
% , and many vendors do  SQL standard does not prescribe an extension interface for user-defined aggregates \jmh{we'd better triple-check that} \fs{I'll have another look. Can somebody else check, too?}, and they are typically implemented using vendor-specific APIs. 
Nonetheless, the aggregation paradigm (or in functional programming terms, ``fold'' or ``reduce'') is natural and ubiquitous, and we expect the basic algorithmic patterns for user-defined aggregates to be very portable. In most widely-used DBMSs (e.g., in PostgreSQL, MySQL, Greenplum, Oracle, SQL Server, Teradata), a user-defined aggregate consists of a well-known pattern of two or three user-defined functions:
\begin{enumerate}
	\item A \emph{transition function} that takes the current transition state and a new data point. It combines both into into a new transition state.
	% DB people won't be helped by analogies to functional programming
	%The transition function is equivalent to the ``combining'' function passed to linear \emph{left-fold} functions in functional-programming languages.
	\item An optional \emph{merge function} that takes two transition states and computes a new combined transition state. This function is only needed for parallel execution.
	% Again in functional-programming terms, a merge operation is a tree-like fold.
	\item A \emph{final function} that takes a transition state and transforms it into the output value.
\end{enumerate}
Clearly, a user-defined aggregate is inherently data-parallel if 
% it is Distributive~\cite{gray1997data}: 
the transition function is associative and the merge function returns the same result as if the transition function was called repeatedly for every individual element in the second state.

Unfortunately, user-defined aggregates are not enough. In designing the high-level orchestration of data movement for analytics, we ran across two main
limitations in standard SQL that we describe next.  We addressed both these limitations using driver code written in simple
script-based user-defined functions (UDFs), which in turn kick off more
involved SQL queries.  When implemented correctly, the performance of the scripting language code is not critical, since its logic is invoked only
occasionally to kick off much larger bulk tasks that are executed by
the core database engine.

\subsubsection{Driver Functions for Multipass Iteration} \label{sec:DriverFunctions}
% \ksn{I would put this section before Section `Templated Queries'. \fs{I did this.}
%   Do we need to clarify that iteration is only an issue when each iteration requires access to multiple data points?
%   An iterative algorithm that processes training data points one at a time is perfectly suited to SQL implementation as an aggregate function, so iterations by itself is not necessarily a problem.}
The first problem we faced is the prevalence of ``iterative'' algorithms for many
methods in statistics, which make many passes over a data set.  Common examples include optimization methods like Gradient
Descent and Markov Chain Monte Carlo (MCMC) simulation in which the number of iterations
is determined by a data-dependent stopping condition at the end of
each round.  There are multiple SQL-based workarounds for this
problem, whose applicability depends on the context.

\noindent{\bf Counted Iteration via Virtual Tables}.
In order to drive a fixed number $n$ of independent
iterations, it is often simplest (and very efficient) to declare a virtual table with $n$ rows (e.g., via PostgreSQL's \texttt{generate\_series} table function), and join it with a view representing a single iteration.  This approach was used to implement
$m$-of-$n$ Bootstrap sampling in the original MAD Skills paper~\cite{DBLP:journals/pvldb/CohenDDHW09}.  It is supported in some fashion in a variety of DBMSs, sometimes by writing a simple table function consisting of a few lines of code.

\noindent{\bf Window Aggregates for Stateful Iteration}.
For
settings where the current iteration depends on previous iterations,
SQL's windowed aggregate feature can be used to carry state across iterations. 
% \fs{What is the idea here? We should either have a little bit more detail here, so that readers have an understanding, or omit this comment.}  
Wang et al.\ took
this approach to implement in-database MCMC inference~\cite{wang2011hybrid} (Section~\ref{sec:mcmc}).  Unfortunately the level of support for window aggregates varies across SQL engines.
% Alternatively, some DBMSs might support multi-pass user-defined aggregate together with a stopping criterion. \jmh{Can somebody back up the previous sentence with an example?  I'm not familiar with this feature.} \fs{I checked DB2, Oracle, Teradata, MySQL, hsqldb. None of these have multi-pass aggregates. Still, multi-pass aggregates would make a lot of sense from a user perspective.}
%\jmh{Sounds like a research project -- probably ends up being isomorphic to recursive queries.}

\noindent{\bf Recursive Queries}.
Most
generally, it is possible to use the recursion features of SQL to
perform iteration with arbitrary stopping conditions---this was used by
Wang et al.\ to implement Viterbi inference~\cite{DBLP:journals/pvldb/WangFGH10} (Section~\ref{sec:viterbi}).  
Unfortunately, like windowed aggregates,
recursion support in SQL
varies across database products, and does not form a reliable
basis for portability.

\noindent{\bf Driver Functions}.
None of the above methods provides both generality and portability.  As a result, in MADlib we chose to implement complex iterative methods by
writing a driver UDF in Python to control iteration, which passes state across iterations intelligently.  A standard
pitfall in this style of programming is for the driver code to pull a large amount of data
out of the database; this becomes a
scalability bottleneck since the driver code typically does not
parallelize and hence pulls all data to a single node.  We avoid this
via a design pattern in which the driver UDF kicks off each iteration
and stages any inter-iteration output into a temporary table via \texttt{CREATE TEMP TABLE...
AS SELECT...} It then reuses the resulting temp table in subsequent iterations as needed.  Final outputs are also often stored in temp tables unless they are small, and can be interrogated using small
aggregate queries as needed.  As a result, all large-data movement is
done within the database engine and its buffer pool.
Database engines typically provide efficient parallelism as well as
buffering and spill files on disk for large temp tables, so this
pattern is quite efficient in practice.  Sections \ref{sec:logistic} and \ref{sec:kmeans} provide discussion of this pattern in the context of specific algorithms. 

\subsubsection{Templated Queries}
% 
% \ksn{  The two problems with first-order logic in this context are 1) lack of a proper type system; and 2) no support for higher-order functions.
%   The current description seems to conflate the two problems, which maybe why it's slightly confusing.
%   I don't really understand what's the problem being solved and what's the solution being proposed here.
%   A relevant problem here is the unnatural way we support function arguments to algorithms, for example the distance metric argument in k-means and the kernel function in SVM, which can be traced directly to the lack of direct support for higher-order functions in SQL.
%   The reference to iterative algorithms, and the solution approach illustrated in Listing 3, seem to be related to this lack of higher-order functions support as well.}
% \fs{Section `Templated Queries' discussed variable arity, which is a problem encountered by the profile module and also by the decision tree module (I think). Problem 1) was only mentioned in Section 4.2.1, and Problem 2) was discussed in Section 3.3 but none are discussed in `Templated Queries', were they? I now have mentioned problems 1 and 2 in this section.}
A second problem is a limitation of SQL's roots in
first-order logic, which requires that queries be cognizant of the
schema of their input tables, and produce output tables with a fixed
schema.  In many cases we want to write ``templated'' queries that
work over arbitrary schemas, with the details of arity, column names and
types to be filled in later. 

For example, the MADlib \texttt{profile} module takes an arbitrary table as input, producing univariate summary statistics for each of its columns.  The input schema to this module is not fixed, and the output schema is a function of the input schema (a certain number of output columns for each input column).
% regression algorithm of Section~\ref{sec:regression} is designed to run over any subset
% of the columns of an input table, producing an output table including
% the same columns as well as a predicted output column. \fs{Actually, in most algorithms we only work with vectors/arrays, and variable arity does not pose a problem. Only when an algorithm prescribes how non-numerical input columns should be encoded, we have to work with variable arity (e.g., the C4.5 decision-tree algoorithm does that). Still, \emph{all} iterative algorithm necessarily rely on templated SQL.} \jmh{Please help me clarify here.  My mental example is the profile module, in which I explicitly synthesize SQL in Python.  But I didn't want to take the time to describe it here, so maybe you can tell me what ou mean by all the algorithms relying on templated SQL.} \fs{OK, I didn't have the profile module on my mind. So my only concern is that variable arity is not an issue for the example modules.} 
% SQL helps with
% problems of data types and casting, but cannot help with the variable
% arity of inputs and outputs.  
To address this issue, we use Python
UDFs to interrogate the database catalog for details of input tables,
and then synthesize customized SQL queries based on templates to
produce outputs.  Simpler versions of this issue arise in most of our iterative algorithms.

Unfortunately, templated SQL relies on identifiers or expressions passed as strings to represent database objects like tables. As such, the DBMS backend will discover syntactical errors only when the generated SQL is executed, often leading to error messages that are enigmatic to the user. As a result, templated SQL necessitates that MADlib code perform additional validation and error handling up front, in order to not compromise usability.
% {\bf Figure~\ref{fig:secondorder} shows Python code that illustrates this.} \fs{I assume profile is the only Python module dealing with variable arity. The DT code is not Python unfortunately. Figure~\ref{fig:log-reg-driver} gives an example of templated SQL, but not variable arity.} 
% This pattern is currently done in an ad hoc way in
% each method that needs it.  
In the future we plan to support this pattern
as a Python library that ships with MADlib and provides useful programmer APIs and user feedback. 
% Similarly, the lack of native support for higher-order functions in SQL again necessitates templated SQL or a technical solution as discussed in Section~\ref{sec:C++AL}.


\subsection{Micro-Programming: Data Representations and Inner Loops}
\label{sec:micro}

In addition to doing the coarse-grained orchestration of chunks, the
database engine must very efficiently invoke the single-node code that
performs arithmetic on those chunks.  For UDFs that operate at the row level (perhaps called multiple times per row), the standard practice is to implement them in C or C++. When computing dense matrix operations, these functions would
make native calls to an open-source library like LAPACK~\cite{laug} or Eigen~\cite{eigenweb}.

Sparse
matrices are not as well-handled by standard math libraries, and require
more customization for efficient representations both on disk and in
memory.  We chose to write our own sparse matrix library in C for
MADlib, which implements a run-length encoding scheme.  Both of
these solutions require careful low-level coding, and formed part of
the overhead of getting MADlib started. 
% \fs{We inherited the sparse-vector implementation and did not actually make a decision on expected needs. So maybe we should deemphasize this point.}
% Let's include Luke as an author and give him credit for getting us off the ground.

The specifics of a given method's linear algebra can be coded in a
low-level way using loops of basic arithmetic in a language like C,
but it is nicer if they can be expressed in a higher-level syntax that
captures the semantics of the linear algebra at the level of matrices
and arrays. We turn to this issue next.
% Moreover, for maintainability as well as portability it is best to separate database logic and APIs from mathematical code.  We therefore provide a C++ abstraction layer in MADlib for writing performance-critical inner loops, which we describe next. \fs{The last two sentences here are now redundant with the first two sentences in the next paragraph. I would just remove everything between `Moreover' and `describe next'.}

\subsection{A C++ Abstraction Layer for UDFs} \label{sec:cplusplus}
There are a number of complexities involved in writing C or C++-based user-defined functions over a legacy DBMS like PostgreSQL, all of which can get in the way of maintainable, portable application logic.  This complexity can be especially frustrating for routines whose pseudocode amounts to a short linear-algebra expression that {\em should} result in a compact implementation. 
% \fs{I don't understand the last sentence.}  
MADlib provides a C++ abstraction layer both to ease the burden of writing high-performance UDFs, and to encapsulate DBMS-specific logic inside the abstraction layer, rather than spreading the cost of porting across all the UDFs in the library.  In brief, the MADlib C++ abstraction provides three classes of functionality: type bridging, resource management shims, and math library integration.  

Type bridging is provided via an encapsulated mapping of C++ types and methods to database types and functions.  UDFs can be written with standard C++ atomic types, as well as the vector and matrix types that are native to a high-performance linear-algebra library.  We have successfully layered multiple alternative libraries under this interface, and are currently using Eigen~\cite{eigenweb} (see Section~\ref{sec:performance} for performance results and lessons learned about integrating linear-algebra libraries in a RDBMS).  The translation to and from database types (including composite or array types like \texttt{double precision[]} for vectors) is handled by the abstraction layer.  Similarly, higher-order functions in C++ can be mapped to the appropriate object IDs of UDFs in the database, with the abstraction layer taking care of looking up the function in the database catalog, verifying argument lists, ensuring type-safety, etc.

%  One is the mapping of C++ language types and function pointers to database types and DBMS-registered functions.  A second is the correct use of DBMS facilities ordinarily associated with operating-system or language-level standard libraries: memory management, exception handling, signal handling and so on.  Finally, the key goal in MADlib is for mathematical expressions to take advantage of the expressivity and performance tuning of mature linear algebra libraries.  The typical coding style for PostgreSQL extensions makes all these issues explicit, verbose, and easy to get wrong.  We cover each of these three topics briefly.
% 
% UDFs written in C or C++ are invoked as dynamically-linked function pointers, with arguments passed as an array of pointers and additional metadata.  These UDFs typically begin with lengthy, system-specific boilerplate code for type-checking: it must ensure that the passed data is of the correct type, it must copy immutable data before doing modifications, verify array lengths, etc. The MADlib C++ abstraction layer encapsulates these issues within a recursive C++ class called \texttt{AnyType} that can contain either a primitive type (like, e.g., \texttt{int} or \texttt{double}) or multiple other values of type \texttt{AnyType} (for representing a composite type). This encapsulation works both for passing data from the DBMS to the C++ function, as well as returning values back from C++. To give an example: A simple, portable, and completely type-safe (though arguably not very useful) function that adds two numbers can thus be implemented with essentially as little code as in a high-level scripting language:
% \begin{cppcode*}{gobble=2}
%   AnyType
%   sum_two_doubles::run(AnyType &args) {
%       return args[0].getAs<double>()
%            + args[1].getAs<double>();
%   }
% \end{cppcode*}
% \jmh{I don't really understand this.}
% 
% A second responsibility of the abstraction layer is to supplement the bridging with additional semantics, in order to facilitate rapid implementation of mathematical functions: For instance, double-precision arrays in the DBMS are the canonical way to represent vectors in Euclidean space. Our C++ abstraction layer therefore not only provides an array-to-array bridge but also maps DBMS arrays to vectors of the linear algebra library Eigen~\cite{eigenweb}. That way users can immediately make use of the very sophisticated vector and matrix operations provided by Eigen.
% 
% \jmh{This is about higher-order functions, not 2nd-order logic.}
% The abstraction layer can also help compensate for SQL's lack of higher-order logic: For instance, an \texttt{AnyType} object can contain a ``pointer'' to a user-defined function. With the syntactic sugar possible in C++, this essentially makes in-database function first-class objects like they commonly are in modern programming languages. Internally, the abstraction layer maps UDFs to their object ID in the database, and it takes care of looking up the function in the database catalog, verifying argument lists, ensuring type-safety, etc.

The second aspect of the C++ abstraction layer is to provide a safe and robust standard runtime interface to DBMS-managed resources.  This includes layering C++ object allocation/deallocation over DBMS-managed memory interfaces, providing shims between C++ exception handling and DBMS handlers, and correctly propagating system signals to and from the DBMS.
% 
% . For instance, PostgreSQL maintains a hierarchy of memory contexts: When a query is started, a new memory context is created and all transient memory allocations are supposed to occur within this context. When the query ends, disposing of the query context provides a simple and effective way of garbage collection. Our C++ abstraction layer makes sure that such modi operandi are followed. On the other hand, the C++ abstraction layer also facilitates writing C++ code with a well-defined interface. This is particularly necessary if (as is typically the case) a DBMS only provides a C plugin interface: In that case it is important that exceptions, signals, etc. to not cross runtime boundaries.

Finally, by incorporating proven third-party libraries, the C++ abstraction layer makes it easy for MADlib developers to write correct and performant code.  For example, the Eigen linear-algebra library contains well-tested and well-tuned code that makes use of the SIMD instruction sets (like SSE) found in today's CPUs. Likewise, the abstraction layer itself has been tuned for efficient value marshalling, and code based on it will automatically benefit from future improvements. 
% Some examples include: All type bridges are aware of mutable and immutable objects and avoid making copies whenever possible. DBMS-catalogue lookups are minimized by caching.  
By virtue of being a template library, the runtime and abstraction overhead is reduced to a minimum.

As an illustration of the high-level code one can write over our abstraction layer, Listings~\ref{fn:lin-reg-trans} and \ref{fn:lin-reg-final} show reduced, but fully functional code snippets that implement multiple linear regression (as discussed further in Section~\ref{sec:regression}).

% We have spent significant time tuning the C++ abstraction layer over PostgreSQL and Eigen.  Figure~\ref{fig:regression} illustrates our progress in performance tuning, but also shows that some runtime overhead still has to be resolved. One of our goals for version 1.0 is to reduce the overhead of the C++ abstraction layer to virtually zero. 
% \ksn{Preceding paragraph. This is too much of a forward reference. Move the paragraph to Section 4.4?} \fs{Tend to agree, except you probably mean 4.5, not 4.4.}

\section{Examples}

To illustrate the above points, we look at three different algorithmic scenarios.  The first is Linear Regression using Ordinary Least Squares (OLS), which is an example of a widely useful, simple single-pass aggregation technique.  The second is binary Logistic Regression, another widely used technique, but one that employs an iterative algorithm.  Finally, we look at $k$-Means 
% \fs{changed this from `K-Means' to `$k$-means' because that's what I used everywhere else.} 
Clustering, an iterative algorithm with large intermediate states spread across machines.

\subsection{Single-Pass: Ordinary Least Squares} \label{sec:regression}
% \ksn{A question people would ask on reading the algorithm is what happens when the number of independent variables is large (as is common in text analytics problem).
   % Would the implementation crash if the matrix to be inverted can't fit in memory? Does the system fail gracefully in that case?
   % Is there a sentence or two we can add to try to head off that criticism up front?}
  % \jmh{I don't believe we've dealt with that in our current implementation, and it complicates the discussion of a single-pass UDA anyhow.  So we'll ignore.}

In ordinary-least-squares (OLS) linear regression the goal is to fit a linear function to a set of points, with the objective of minimizing the sum of squared residuals. Formally, we are given points $(\vec x_1, y_1), \dots, (\vec x_n, y_n)$, where $\vec x_i \in \R^d$ and $y_i \in \R$, and our goal is to find the vector $\widehat{\vec b}$ that minimizes $\sum_{i=1}^n (y_i - \langle \widehat{\vec b}, \vec x_i \rangle)^2$. OLS is one of the most fundamental methods in statistics. Typically, each $y_i$ is assumed to be an (independent) noisy measurement of $\langle \vec b, \vec x_i \rangle$, where $\vec b$ is an unknown but fixed vector and the noise is uncorrelated with mean 0 and unknown but fixed variance. Under these assumptions, $\widehat{\vec b}$ is the best linear unbiased estimate of~$\vec b$ (Gauss-Markov). Under additional assumptions (normality, independence), $\widehat{\vec b}$ is also the maximum-likelihood estimate. Letting $X$ denote the matrix whose rows are $\vec x_i^T$, and defining $\vec y := (y_1, \dots, y_n)^T$, it is well-known that the sum of squared residuals is minimized by $\widehat{\vec b} = (X^TX)^{-1}X^T \vec y$ (for exposition purposes we assume the full-rank case here, though this is not a requirement for MADlib).

It has been observed before that computing $\widehat{\vec b}$ lends itself well to data-parallel implementations in databases~\cite{ordonez-sigmod2000,ordonez-tkde10} and map-reduce \cite{DBLP:conf/nips/ChuKLYBNO06}.  In extensible-database terms, this task can be done with a simple user-defined aggregate. The principal observation is this: $X^TX = \sum_{i=1}^n \vec x_i \vec x_i^T$ and $X^T \vec y = \sum_{i=1}^n \vec x_i y_i$ are just sums of transformations of each data point.  Summation is associative, so data parallelism virtually comes for free---we can compute the per-process subsums of the previous expressions locally in each process, and then sum up all subsums during a second-phase aggregation. As a final non-parallelized step, we compute the inverse of $X^TX$ and then multiply with $X^T \vec y$. These final operations are comparatively cheap, since the number of independent variables (and thus the dimensions of $X^T X$ and $X^T \vec y$) is typically ``small''.

\subsubsection{MADlib Implementation}
We assume that data points are stored as tuples \texttt{(x DOUBLE PRECISION[], y DOUBLE PRECISION)}. Linear regression is then implemented as a user-defined aggregate with a transition and final function roughly as in Listings~\ref{fn:lin-reg-trans} and \ref{fn:lin-reg-final}, respectively. (For compactness, we omitted finiteness checks and several output statistics in the example here.)
 The merge function, which is not shown, just adds all values in the transition states together.  Running the code produces the following:

\begin{scriptsize}
\begin{verbatim}
psql# SELECT (linregr(y, x)).* FROM data;
-[ RECORD 1 ]+--------------------------------------------
coef         | {1.7307,2.2428}
r2           | 0.9475
std_err      | {0.3258,0.0533}
t_stats      | {5.3127,42.0640}
p_values     | {6.7681e-07,4.4409e-16}
condition_no | 169.5093
\end{verbatim}
\end{scriptsize}

Note that the \texttt{linregr} Python UDF produces a composite record type in the output, which is a feature of PostgreSQL and Greenplum.   This would be easy to flatten into a string in a strictly relational implementation.
% \jmh{That's some pretty non-relational looking output.  Shall we explain?} \fs{Do you mean explaining that logregr returns a composite type?}  \ksn{Please reduce the precision of the floating point numbers to something sensible.}


\begin{figure}[t]
\begin{scriptsize}
  % \begin{code}
  %   \begin{cppcode}
  \begin{lstlisting}
AnyType
linregr_transition::run(AnyType& args) {
    // Transition state is a class that wraps an array.
    // We expect a mutable array. If DBMS allows
    // modifications, copying will be avoided.
    LinRegrTransitionState<
        MutableArrayHandle<double> > state = args[0];
    // Dependent variable is a double-precision float
    double y = args[1].getAs<double>();
    // Vector of independent variables wraps an immutable
    // array (again, no unnecessary copying). This maps
    // to an Eigen type
    MappedColumnVector x
        = args[2].getAs<MappedColumnVector>();
    
    if (state.numRows == 0)
        // The first row determines the number 
        // of independent variables
        state.initialize(*this, x.size());
    state.numRows++;
    state.y_sum += y;
    state.y_square_sum += y * y;
    // noalias informs Eigen to multiply in-place
    state.X_transp_Y.noalias() += x * y;
    // Since X^T X is symmetric, we only need to
    // compute a triangular part
    triangularView<Lower>(state.X_transp_X)
        += x * trans(x);
    
    return state;
}
  \end{lstlisting}
%	\end{cppcode}
	\caption{Linear regression transition function}
	\label{fn:lin-reg-trans}
% \end{code}
\end{scriptsize}
\end{figure}

\begin{figure}[t]
\begin{scriptsize}
% \begin{code}
  \begin{lstlisting}
AnyType
linregr_final::run(AnyType& args) {
    // Immutable array: Array will never be copied
    LinRegrTransitionState<ArrayHandle<double> > state
    		= args[0];

    // The following is a MADlib class that wraps Eigen's
    // solver for self-adjoint matrices.
    SymmetricPositiveDefiniteEigenDecomposition<Matrix>
        decomposition(state.X_transp_X, EigenvaluesOnly,
            ComputePseudoInverse);
    
    Matrix inverse_of_X_transp_X
        = decomposition.pseudoInverse();
    // Let backend allocate array for coefficients so to
    // avoid copying on return (if supported by DBMS).
    MutableMappedColumnVector coef(
        allocateArray<double>(state.widthOfX));
    coef.noalias() = inverse_of_X_transp_X
        * state.X_transp_Y;

    // Return a composite value.
    AnyType tuple;
    tuple << coef << decomposition.conditionNo();
    return tuple;
}
\end{lstlisting}
%	\end{cppcode}
	\caption{Linear regression final function}
	\label{fn:lin-reg-final}
	% \end{code}
	\end{scriptsize}
	\end{figure}
	
\subsection{Multi-Pass: (Binary) Logistic Regression}
\label{sec:logistic}
In (binary) logistic regression, we are given points $(\vec x_1, y_1)$, $\dots$, $(\vec x_n, y_n)$, where $\vec x_i \in \R^d$ and $y_i \in \{ 0,1 \}$, and our goal is to find the vector $\widehat{\vec b}$ that maximizes $\prod_{i=1}^n \sigma((-1)^{y_i + 1} \cdot \langle \widehat{\vec b}, \vec x_i \rangle)$. Here, $\sigma(z) = \frac{1}{1+\exp(z)}$ denotes the logistic function. Statistically, this is the max\-i\-mum-likelihood estimate for an unknown vector $\vec b$ under the assumption that each $y_i$ is a random variate with $\Pr[y_i = 1 \mid \vec x_i] = \sigma( \langle \vec b, \vec x_i \rangle )$ and that all observations are independent.

It is well-known that, in general, no closed-formula expression for $\widehat{\vec b}$ exists. Instead, $\widehat{\vec b}$ can be computed as the solution of a convex program via standard iterative methods. Arguably, the most common method is to maximize the logarithm of the likelihood using Newton's method. In the case of logistic regression this reduces to \emph{iteratively reweighted least squares} with iteration rule $\widehat{\vec \beta}_{m+1} = (X^T D_m X)^{-1} X^T D_m \vec z_m$. 
%\ksn{I don't think D in the iteration rule is explained.} \fs{Fixed.} 
Here, the diagonal matrix $D_m$ and the vector $\vec z_m$ are transformations of $X$ and $\widehat{\vec \beta}_m$.

\subsubsection{MADlib Implementation} \label{sec:log-regression-impl}

%\jmh{I think the logistic regression section will need work from you, to help the user understand how the SQL statement invokes the Python driver, and how the Python driver in turn invokes each iteration in SQL.  Clarifying that control flow will help a lot more than Listing 3, so if a picture is in order feel free to replace Listing 3.  (BTW feel free to upload a photo of a pencil drawing for now, we can clean it up tomorrow).  Alternatively, be sure to add comments or caption to Listing 3 so it's understandable.}

Each individual iteration can be implemented via a user-defined aggregate using linear regression as a blueprint. However, the handling of iterations and checking for convergence require a further outer loop. We therefore implement a driver UDF in Python. The control flow follows the high-level outline from Section~\ref{sec:DriverFunctions} and is illustrated as an activity diagram in Figure~\ref{fig:log-reg-driver}. Here, the shaded shapes are executions of generated SQL, where \texttt{\textit{current\_iteration}} is a template parameter that is substituted with the corresponding Python variable. 

Specifically, the UDF first creates a temporary table for storing the inter-iteration states. Then, the Python code iteratively calls the UDA for updating the iteration state, each time adding a new row to the temporary table. Once the convergence criterion has been reached, the state is converted into the return value. The important point to note is that there is no data movement between the driver function and the database engine---all heavy lifting is done within the database engine.

Unfortunately, implementing logistic regression using a driver function leads to a different interface than the one we provided for linear regression:

\begin{scriptsize}
  \lstset{language=SQL, numbers=none, frame=none}
\begin{lstlisting}
SELECT * FROM logregr('y', 'x', 'data');
\end{lstlisting}
\end{scriptsize}

A problem with this implementation is that the \texttt{logregr} UDF is not an aggregate function and cannot be used in grouping constructs. To perform multiple logistic regressions at once, one needs to use a join construct instead.  We intend to address this non-uniformity in interface in a future version of MADlib. 
%\fs{That's not the full story. Without support for what I previously called ``multi-pass UDAs'', uniformity would require changing linear regression. That is, making linear regression a UDF, too. This is not a good solution because it is often desirable to perform multiple regressions at once, using GROUP BY.} 
%\jmh{You can drive multiple computations using JOIN more generally.  GROUP BY is basically a fancy self-join if you like to look at it that way. Anyhow I didn't say *how* we'd make it uniform, just that we'd "address" it.  We don't need to belabor this.}
We highlight the issue here in part to point out that SQL can be a somewhat ``over-rich'' language.  In many cases there are multiple equivalent patterns for constructing simple interfaces, but no well-accepted, uniform design patterns for the kind of algorithmic expressions we tend to implement in MADlib.  We are refining these design patterns as we evolve the library.
%\fs{Hmm. My honest perspective is quite different. Syntactically, SQL is a very restrictive language---it's (ugly) workarounds like above that we need to find on the fly. I find passing names of database objects as string a bad thing. It's just unavoidable currently.}
%\jmh{If SQL was recursion-free Datalog, many of our weirdnesses would go away without increasing the language expressiveness.  A problem with SQL that is causing us trouble is that it's got too many ways to do the same relational patterns.  Doing something beyond relational is a separate point.  I hear what you're saying from another angle, but I think it's not what I want to highlight for this audience.}

% \fs{I would now remove the rest of the subsubsection. For now, I am leaving the old discussion for reference.}
% In addition, this syntax hides identifiers from the SQL parser. \jmh{I don't know what that means.} \fs{My main point is that identifiers passed as strings only become visible to the DBMS when the generated SQL is executed. However, we then should not rely on the SQL parser any more to give reasonable error messages. Waiting for the generated SQL to fail will usually lead to error messages that are enigmatic to the user.} Instead, MADlib is burdened with the responsibility of name binding, including all needed validation and error handling. \jmh{When we say MADlib, we main the logistic regression code.  Yes?  What's the lesson here?  Iterative methods cannot be aggs? Or they can but we failed to think about them properly?} \fs{``Multi-pass'' aggregates would indeed be very useful from a MADlib perspective and enhance the user experience. -- That said, I'll the whole paragraph should be rephrased. See also KeeSiong's comment.}  Also, any errors will only be detected at runtime.
% \ksn{The discussion ends with too much gloom. Are we not going to propose some solution moving forward? Or does this go down as one of the fundamental limitations of implementing things in SQL?}

\begin{figure}
	\centering
	\includegraphics[scale=0.71]{LogisticRegression}
	\caption{Sequence diagram for logistic regression}
	\label{fig:log-reg-driver}
\end{figure}

% \begin{figure}[t]
% \begin{scriptsize}
%   \lstset{language=Python}
%   % \begin{code}
%   \begin{lstlisting}
%return __runIterativeAlg(
%    stateType = "FLOAT8[]",
%    initialState = "NULL",
%    source = source,
%    updateExpr = """
%        {MADlibSchema}.logregr_{optimizer}_step(
%            ({depColumn})::BOOLEAN,
%            ({indepColumn})::FLOAT8[],
%            {{state}}
%        )
%        """.format(
%            MADlibSchema = MADlibSchema,
%            depColumn = depColumn,
%            indepColumn = indepColumn,
%            optimizer = optimizer),
%    terminateExpr = """
%        {MADlibSchema}.
%        internal_logregr_{optimizer}_step_distance(
%            {{newState}}, {{oldState}}
%        ) < {precision}
%        """.format(
%            MADlibSchema = MADlibSchema,
%            optimizer = optimizer,
%            precision = precision),
%    maxNumIterations = maxNumIterations)
%     \end{lstlisting}
% % \end{pythoncode}
%	\caption{Logistic regression driver}
%	\label{fn:log-reg-driver}
% % \end{code}
% \end{scriptsize}
% \end{figure}
\subsection{Large-State Iteration: k-Means}
\label{sec:kmeans}
In $k$-means clustering, we are given $n$ points $x_1, \dots, x_n \in \R^d$, and our goal is to position $k$ centroids $c_1, \dots, c_k \in \R^d$ so that the sum of squared distances between each point and its closest centroid is minimized. Formally, we wish to minimize
\begin{math}
    \sum_{i=1}^n \min_{j=1}^k \|x_i - c_j \|^2.
\end{math}
Solving this problem exactly is usually prohibitively expensive (for theoretical hardness results see, e.g., \cite{ADH09a,MNV10a}).
% \ksn{This is a bit mysterious, since d and k are always fixed before we run the algorithm.
%   Can we word this more carefully? Is the statement even needed at all?}
% \fs{If the dimension of the input vectors is not a priori fixed but part of the input, or likewise, if k is part of the input, then k-means is NP hard. Otherwise, it is not. It's pretty typical that complexity differs depending on which parameters are part of the input. So I might not understand your concern. Would it be clearer to write "unless d and k are constants"?}
However, the local-search heuristic proposed by Lloyd~\cite{L82a} performs reasonably well both in theory and in practice~\cite{AV07a,AMR09a}. At a high level, it works as follows:
%
\begin{enumerate}
	\item Seeding phase: Find initial positions for $k$ centroids $c_1, \dots, c_k$.
	\item Assign each point $x_1, \dots, x_n$ to its closest centroid. \label{enum:kmeans_abstract_points}
	\item Reposition each centroid to the barycenter (mean) of all points assigned to it.
	\item If no (or only very few) points got reassigned, stop. Otherwise, goto \eqref{enum:kmeans_abstract_points}.
\end{enumerate}

\subsubsection{MADlib implementation}
$k$-means has a natural implementation in SQL~\cite{DBLP:journals/tkde/Ordonez06}.
Based on the assumption that we can always comfortably store $k$ centroids in main memory, we can implement $k$-means similarly to logistic regression: Using a driver function that iteratively calls a user-defined aggregate. In the following, we take a closer look at this implementation. It is important to make a clear distinction between the inter-iteration state (the output of the UDA's final function) and intra-iteration state (as maintained by the UDA's transition and merge functions). During aggregation, the transition state contains both inter\nobreakdash- and intra-iteration state, but only modifies the intra-iteration state. We only store $k$ centroids in both the inter\nobreakdash- and intra-iteration states, and consider the assignments of points to centroids as implicitly given.

In the transition function, we first compute the centroid that the current point was closest to at the beginning of the iteration using the inter-iteration state. We then update the barycenter of this centroid in the intra-iteration state. Only as the final step of the aggregate, the intra-iteration state becomes the new inter-iteration state.

Unfortunately, in order to check the convergence criterion that no or only few points got reassigned, we have to do two closest-centroid computations per point and iteration: First, we need to compute the closest centroid in the previous iteration and then the closest one in the current iteration. If we stored the closest points explicitly, we could avoid half of the closest-centroid calculations.

We can store points in a table called \texttt{points} that has a \texttt{coords} attribute containing the points' coordinates and has a second attribute for the current \texttt{centroid\_id} for the point.  The iteration state stores the centroids' positions in a matrix called \texttt{centroids}.  MADlib provides a UDF \texttt{closest\_column(a,b)} that determines the column in a matrix \texttt{a} that is closest to vector \texttt{b}.  Thus, we can make the point-to-centroid assignments explicit using the following SQL:

\begin{scriptsize}
  \lstset{language=SQL, numbers=none, frame=none}
\begin{lstlisting}
UPDATE points
SET centroid_id = closest_column(centroids, coords)
\end{lstlisting}
\end{scriptsize}

\noindent
Ideally, we would like to perform the point reassignment and the repositioning with a single pass over the data. Unfortunately, this cannot be expressed in standard SQL.\footnote{While PostgreSQL and Greenplum provide an optional \texttt{RETURNING} clause for \texttt{UPDATE} commands, this returns only one row for each row affected by the \texttt{UPDATE}, and aggregates cannot be used within the \texttt{RETURNING} clause. Moreover, an \texttt{UPDATE ... RETURNING} cannot be used as a subquery.}

Therefore, while we can reduce the number of closest-centroid calculations by one half, PostgreSQL processes queries one-by-one (and does not perform cross-statement optimization), so it will need to make two passes over the data per one $k$-means iteration. In general, the performance benefit of explicitly storing points depends on the DBMS, the data, and the operating environment.

The pattern of updating temporary state is made a bit more awkward in PostgreSQL due to its legacy of versioned storage. PostgreSQL performs an update by first inserting a new row and then marking the old row as invisible \cite[Section~23.1.2]{postgres:9.1.3}. As a result, for updates that touch many rows it is typically faster to copy the updated data into a new table (i.e., \texttt{CREATE TABLE AS SELECT} and \texttt{DROP TABLE}) rather than issue an \texttt{UPDATE}. These kinds of DBMS-specific  performance tricks may merit encapsulation in an abstraction layer for SQL portability.
% Our goal is to eventually hide much of the SQL generation in versatile abstraction layers. \fs{That's my goal, but do we really work in that direction?}

% \subsection{MADlib in Practice}
% 
% \fs{A quick draft about how a customer uses logistic regression. Feel free to remove this again if it feels out of place.} \ksn{ This section should probably be Section 5, since the material doesn't really belong to section named Examples...}
% \jmh{I found the 4.4 section hard to read, and without a real customer story it didn't seem to help.  We could talk it over, but my inclination is to drop it.  We're also a bit over-space just now, though we could fix that.  Depends on how much energy we both have the next couple days (I'm waning, have to move on to other stuff).}
% \fs{I don't think there is enough to make a real customer story out of 4.4. It's mainly an example of how to use the catalog to convert normalized data into MADlib's array/vector representation. I don't feel strongly about it. Another option: Instead of having a section with the pretentious title "MADlib in Practice", we could move the catalog example into the logistic-regression section.}
% 
% A primary design objective in MADlib is to provide modular components that can easily be used as building blocks for new algorithms like, e.g., ensemble or multi-stage learning. While we intend to include simplified interfaces for important special use cases, the most flexible way of using MADlib is always via its core ``textbook'' interfaces. Practitioners are therefore employing the same techniques used in the design of MADlib itself: Using templated SQL and interrogating the database catalog. Suppose, e.g., we want to do a logistic regression on a table called \texttt{crime} that has a large number of columns, and we intend to experiment which columns to include in our analysis. The MADlib logistic-regression function, as seen in Section~\ref{sec:log-regression-impl}, expects a vector/array column as argument. The database catalog is the natural means of automating the mapping between column names and array indices. This can be expressed in standard SQL using the \texttt{information\_schema}, but PostgreSQL's native catalog provides an even more succinct way to access table meta data: Extracting all column names---and initializing them as included in the regression---could be done as follows:
% %
% \begin{scriptsize}
%   \lstset{language=SQL, numbers=none, frame=none}
%   \begin{lstlisting}
% CREATE TABLE crime_columns AS
% SELECT attnum, attname::TEXT, TRUE AS selected
% FROM pg_attribute
% WHERE attnum > 0 AND attrelid = 'crime'::regclass;
% \end{lstlisting}
% \end{scriptsize}
% %
% Now select or deselecting columns can be done in standard ways, if desired also with a GUI tool. From there, it is a simple step to generate SQL using array and string functions prevalent in DBMSs. Thus, we could call MADlib's logistic regression as follows (PostgreSQL syntax again):
% %
% \begin{scriptsize}
%   \lstset{language=SQL, numbers=none, frame=none}
%   \begin{lstlisting}
%   SELECT * FROM logregr(
%     'crime',
%     'crimerat >= 110',
%     'ARRAY[1, ' || array_to_string( (
%         SELECT array_agg(ivar ORDER BY pos)
%         FROM crime_columns
%     ), ', ') || ']');
%     \end{lstlisting}
%     \end{scriptsize}%
% We remark that the dependent variable here is the boolean condition \texttt{(crimerat >= 110)}, and we include a constant 1 among the independent variables to determine the intercept in the logistic regression.

\subsection{Infrastructure Performance Trends} \label{sec:performance}
% \ksn{ I think we're going to cop a lot criticism on this preliminary performance testing.
%  The whole point of the in-database approach is to scale to large datasets, and we don't even attempt to solve a large enough linear regression problem that doesn't fit completely in memory?   Very hard to argue against that...}
% \fs{I addressed that point and hoped the message would be: We have perfect scalability for aggregates, and disk I/O will not affect scalability. It would hide the C++ AL overhead in our measurements though. So we are actually more critical of ourselves.}
% 
In its current beta version, MADlib has been tuned a fair bit over PostgreSQL and Greenplum, though much remains to be done.  Here we report on some results for a basic scenario that exercises our core functionality, including the C++ abstraction layer, our ability to call out to linear-algebra packages, and parallel-speedup validation. We defer macro-benchmarking of MADlib's current methods to future work, that will focus on specific algorithm implementations.  

The basic building block of MADlib is a user-defined aggregate, typically one that calls out to a linear-algebra library.  In order to evaluate the scalability of this construct, we ran linear regression over Greenplum's parallel DBMS on various data sizes, using a 24-core test cluster we had available, which was outfitted with 144 GB of RAM over 51 TB of raw storage.\footnote{Our cluster is made up of four SuperMicro X8DTT-H server modules, each equipped with one six-core Intel Xeon X5670 processor, clocked at 2.93~GHz. While hyperthreading is enabled, we only run a single Greenplum ``segment'' (query process) per physical core. Each machine has 24~GB of RAM, an LSI MegaRAID~2108 ``Raid On a Chip'' controller with six attached 360~GB solid-state drives, and a Brocade~1020 converged network adapter. The operating system is Red Hat Enterprise Linux Server release 5.5 (Tikanga). On that we are running Greenplum Database 4.2.0, compiled with gcc~4.4.2.  
%The following gcc options were used to compile the Greenplum database: \texttt{-O3 -funroll-loops -fargument-noalias-global -fno-omit-frame-pointer -finline-limit=1800 -fno-strict-aliasing -fwrapv -g}
}  This is obviously a relatively modest-sized cluster by today's standards, but it is sufficient to illuminate (a) our efforts at minimizing performance overheads, and (b) our ability to achieve appropriate parallel speedup.  
%\jmh{Should we add something like this: ``We note that the widely-discussed Mahout library has typically reported performance over much smaller data sets and configurations.  We also note that none of the current libraries for scalable machine learning target multi-rack or multi-datacenter tasks.''} \fs{I we really want, I could redo the test on a larger dataset. But from previous experience, I would strongly assume that run-times go up a lot, the differences between the versions become small, and scalability remains the same. I would probably not say that Mahout tests are usually smaller. That sounds too defensive for me.}

For running linear regression as outlined in Section~\ref{sec:regression}, we expect runtime
\begin{math}
	O(k^3 + (n \cdot k^2)/p)
\end{math}
where $k$ is the number of independent variables, $n$ is the number of observations, and $p$ is the number of query processes. The $k^3$ time is needed for the matrix inversion, and the $k^2$ is needed for computing each outer product $\vec x_i \vec x_i^T$ and adding it to the running sum. It turns out that our runtime measurements fit these expectations quite well, and the constant factors are relatively small. See Figures~\ref{fig:regression} and \ref{fig:regression-diagram}. In particular we note:
\begin{itemize}
	\item The overhead for a single query is very low and only a fraction of a second. This also implies that we lose little in implementating iterative algorithms using driver functions that run multiple SQL queries.
	\item Given the previous points, the Greenplum database achieves perfect linear speedup in the example shown.
\end{itemize}

In the example, all data was essentially in the buffer cache, and disk I/O was not a limiting factor. This is quite typical in large parallel installations, since an analyst often runs a machine-learning algorithm several times with different parameters. Given massive amounts of main memory on each node in a well-provisioned cluster, much of the data should moreover remain in the cache. Finally, the computational cost per row grows at least quadratically, and thus will easily surpass I/O cost for complex models. As we compared various cluster sizes, numbers of independent variables and the respective execution times for previous versions of MADlib, the lesson we learned is that even though we anticipate non-trivial overhead by the DBMS, careful performance tuning---e.g., by making use of instruction-accurate profiling using Valgrind~\cite{NS07a}---still makes significant differences:
\begin{itemize}
	\item Version 0.1alpha is an implementation in C that computes the outer-vector products $\vec x_i \vec x_i^T$ as a simple nested loop.
	\item 
	% \ksn{I suspect nobody cares about the issues we had with Armadillo in Verion 0.2.1beta... do we really want to devote so much space to that?} \fs{The take-away point was supposed to be: "Do not hide behind the argument disk I/O will overshadow everything." Performance tuning at a fairly low level is still important.}
		Version 0.2.1beta introduced an implementation in C++ that used the Armadillo \cite{armadillo} linear-algebra library as a frontend for LAPACK/BLAS. It turns out that this version was much slower for two reasons: The BLAS library used was the default one shipped with CentOS 5, which is built from the untuned reference BLAS. Profiling and examining the critial code paths revealed that computing $\vec y^T \vec y$ for a row vector $\vec y$ is about three to four times slower than computing $\vec x \vec x^T$ for a column vector $\vec x$ of the same dimension (and the MADlib implementation unfortunately used to do the former). Interestingly, the same holds for Apple's Accelerate framework on Mac~OS~X, which Apple promises to be a tuned library. The second reason for the speed disadvantage is runtime overhead in the first incarnation of the C++ abstraction layer (mostly due to locking and calls into the DBMS backend).
	\item Version 0.3 has an updated linear-regression implementation that relies on the Eigen C++ linear-algebra library and takes advantage of the fact that the matrix $X^T X$ is symmetric positive definite. Runtime overhead has been reduced, but some calls into the database backend still need better caching.
\end{itemize}
%
Other noteworthy results during our performance studies included that there are no measurable performance differences between PostgreSQL 9.1.1 (both in single and multi-user mode) and GP 4.1 in running the aggregate function on a single core. Moreover, while testing linear/logistic-regression execution times, single-core performance of even laptop CPUs (like the Core i5 540M) did not differ much from today's server CPUs (like the Xeon family). Typically the differences were even less than what the difference in clock speeds might have suggested, perhaps due to compiler and/or architecture issues that we have yet to unpack.
%  \jmh{Care to speculate why?} \fs{Pure speculation at this stage: Memory speed does not scale similarly, non-linear parts of the code/branches, different pipeline lengths, different number of clock cycles per FP operation, ...?}

\begin{comment}
% The following script was used for testing linear regression.
% --8<--
#!/usr/bin/env bash
PORT=5555
DB=schopf
VARS_LIST="10 20 40 80 160 320"
ROWS_LIST=10000000
GENERATOR=./LinearRegressionRandom
PSQL_ARGS="--port ${PORT} --tuples-only --output /dev/null ${DB}"
TABLE_NAME_FMT='linregr_v${VARS}_r${ROWS}'

for ROWS in ${ROWS_LIST}; do
	for VARS in ${VARS_LIST}; do
		TABLE_NAME=$(eval "echo ${TABLE_NAME_FMT}")
		psql ${PSQL_ARGS} &> /dev/null <<-EOF
			DROP TABLE IF EXISTS ${TABLE_NAME};
			CREATE TABLE ${TABLE_NAME} (
				id SERIAL,
				x DOUBLE PRECISION[],
				y DOUBLE PRECISION
			);
		EOF
		${GENERATOR} -i ${VARS} -r ${ROWS} -t | \
			psql ${PSQL_ARGS} -c \
				"COPY ${TABLE_NAME}(x, y) FROM STDIN;"
		psql ${PSQL_ARGS} <<-EOF
			SELECT count(*) FROM ${TABLE_NAME};
			\timing
			\echo MADlib v0.2.1beta. Vars: ${VARS} Rows: ${ROWS}
			SELECT (madlib_0_2_1beta.linregr(y,x)).coef[1] FROM ${TABLE_NAME};
			\echo
		EOF
		psql ${PSQL_ARGS} <<-EOF
			SELECT count(*) FROM ${TABLE_NAME};
			\timing
			\echo MADlib v0.3. Vars: ${VARS} Rows: ${ROWS}
			SELECT (madlib_0_3_g553770b.linregr(y,x)).coef[1] FROM ${TABLE_NAME};
			\echo
		EOF
		psql ${PSQL_ARGS} <<-EOF
			SELECT count(*) FROM ${TABLE_NAME};
			\timing
			\echo Greenplum. Vars: ${VARS} Rows: ${ROWS}
			SELECT (mregr_coef(y,x))[1] FROM ${TABLE_NAME};
			\echo
		EOF
    done
done
% -->8--

\end{comment}

\begin{figure*}
\centering
\begin{tabular}{+r^r^rD{.}{.}{3.3}D{.}{.}{3.3}D{.}{.}{3.4}}
\toprule\rowstyle{\bfseries}
\# segments & \# variables & \multicolumn{1}{^c}{\# rows}
										  & \multicolumn{1}{^c}{v0.3}
													& \multicolumn{1}{^c}{v0.2.1beta}
																 & \multicolumn{1}{^c}{v0.1alpha}\\
			&             & \multicolumn{1}{^c}{(million)}
										  & \multicolumn{1}{^c}{(s)}
													& \multicolumn{1}{^c}{(s)}
																 & \multicolumn{1}{^c}{(s)} \\
\otoprule
          6 &     10       & 10 &   4.447 &   9.501 &   1.337 \\
          6 &     20       & 10 &   4.688 &  11.60  &   1.874 \\
          6 &     40       & 10 &   6.843 &  17.96  &   3.828 \\
          6 &     80       & 10 &  13.28  &  52.94  &  12.98  \\
          6 &    160       & 10 &  35.66  & 181.4   &  51.20  \\
          6 &    320       & 10 & 186.2   & 683.8   & 333.4   \\
\midrule
         12 &     10       & 10 &   2.115 &   4.756 &   0.9600\\
         12 &     20       & 10 &   2.432 &   5.760 &   1.212 \\
         12 &     40       & 10 &   3.420 &   9.010 &   2.046 \\
         12 &     80       & 10 &   6.797 &  26.48  &   6.469 \\
         12 &    160       & 10 &  17.71  &  90.95  &  25.67  \\
         12 &    320       & 10 &  92.41  & 341.5   & 166.6   \\
\midrule
         18 &     10       & 10 &   1.418 &   3.206 &   0.6197\\
         18 &     20       & 10 &   1.648 &   3.805 &   1.003 \\
         18 &     40       & 10 &   2.335 &   5.994 &   1.183 \\
         18 &     80       & 10 &   4.461 &  17.73  &   4.314 \\
         18 &    160       & 10 &  11.90  &  60.58  &  17.14  \\
         18 &    320       & 10 &  61.66  & 227.7   & 111.4   \\
\midrule
         24 &     10       & 10 &   1.197 &   2.383 &   0.3904\\
         24 &     20       & 10 &   1.276 &   2.869 &   0.4769\\
         24 &     40       & 10 &   1.698 &   4.475 &   1.151 \\
         24 &     80       & 10 &   3.363 &  13.35  &   3.263 \\
         24 &    160       & 10 &   8.840 &  45.48  &  13.10  \\
         24 &    320       & 10 &  46.18  & 171.7   &  84.59  \\
\bottomrule
\end{tabular}
\caption{Linear regression execution times}
\label{fig:regression}
\end{figure*}

\begin{figure}
\centering
\includegraphics{linregr-diagram}
\vspace{-2em}
\caption{Linear regression execution times using MADlib v0.3 on Greenplum Database 4.2.0, 10 million rows}
\label{fig:regression-diagram}
\end{figure}

% \subsection{Using MADlib}
% 
% Typical User Workflow \fs{(needs to be substantiated)}:
% \begin{itemize}
%   \item Data loading and cleansing
%   \item Example: Dealing with a large number of independent variables. Convenient to use catalog information to generate code that uses MADlib.
%   \item Model hypothesis
%   \item ...
% \end{itemize}

\section{University Research and MADlib}
\label{sec:university}
An initial goal of the MADlib project was to engage closely with academic researchers, and provide a platform and distribution channel for their work.  To that end, we report on two collaborations over the past years.  We first discuss a more recent collaboration with researchers at the University of Wisconsin.  We then report on a collaboration with researchers at Florida and Berkeley, which co-evolved with MADlib and explored similar issues (as part of the BayesStore project~\cite{DBLP:journals/pvldb/WangFGH10,wang2011hybrid}) but was actually integrated with the MADlib infrastructure only recently.
\subsection{Wisconsin Contributions: \\Convex Optimization}

% \ksn{Stochastic gradient descent is the more modern term for Incremental gradient descent.
%   The last paragraph on better performance compared to the existing MADlib SVM implementation is pretty misguided.
%   The existing MADlib SVM implementation uses the same stochastic gradient descent approach.
%   The difference is that the SVM implementation works for both linear and non-linear kernels.
%   The new SVM implementation based on the convex programming meta framework only works for linear kernels...
%   The proper comparison is against the lsvm\_classification() algorithm (special case of linear SVM) in the existing SVM module, and I would be extremely surprised if that is slower.}

The MADlib framework goes a long way toward making in-database
analytic tools easier to deploy inside an RDBMS. Nevertheless, to
implement an algorithm within the MADlib framework, a developer must
undertake several steps: they must specify a model, select the
algorithm used to implement that model, optimize the algorithm, test
the model, and finally ensure that the resulting algorithm and model
are robust. This is a time consuming process that creates code that
must be tested and maintained for each data analysis technique. 

A focus of the MADlib work at the University of Wisconsin has been to explore
techniques to reduce this burden.  Our approach is to expose a mathematical abstraction on top of MADlib
that allows a developer to specify a smaller amount of code, which we
hope will lower the development time to add new techniques in many cases. We discuss
the challenge that we faced implementing this abstraction. 
% \fs{Maybe make clear already here that this is a mathematical generalization, not a software-engineering abstraction (like the C++ AL)?} 
To
demonstrate our ideas, we have implemented all of the models shown in
Table~\ref{fig:objs:grads} within the single abstraction (built within
MADlib) that we describe below.

\paragraph*{The Key Abstraction} An ideal abstraction would allow us to
decouple the specification of the model from the algorithm used to
solve the specification. Fortunately, there is a beautiful, powerful
abstraction called convex optimization 
% \fs{The previous section mentions `convex program' without any explanation.} 
that has been developed for the
last few decades~\cite{Rockafellar1996Convex,boyd:cvx} that allows one
to perform this decoupling. More precisely, in convex optimization, we
minimize a convex function over a convex set.  The archetype convex
function is $f(x) = x^2$ and is shown in
Figure~\ref{fig:xsquare}. Like all convex functions, any local minimum
of $f$ is a global minimum of $f$. Many different popular statistical
models are defined by convex optimization problems, e.g., linear
regression, support vector machines, logistic regression, conditional
random fields.  Not every data analysis problem is
convex---notable exceptions include the a priori algorithm and graph mining algorithms---but many are convex. Table~\ref{fig:objs:grads} lists models that we have
implemented in our abstraction over MADlib, all of which are convex. (Note that the previous built-in MADlib examples of linear and logistic regression fall into this category!)
% \fs{I would like a comment on the overlap of linear and logistic regression from the previous section.}


\begin{figure}[b]
\centering
\includegraphics[width=2in]{FIGS/x_square}
\caption{The archetypical convex function $f(x) = x^2$.}
\label{fig:xsquare}
\vspace{-1em}
\end{figure}

In spite of the expressive power of convex optimization, even simple
algorithms converge at provable rates to the true solution. For
intuition, examine the archetypical function $f(x) = x^2$ shown in
Figure~\ref{fig:xsquare}. The graph of this function is like all
convex sets: bowl shaped. To minimize the function, we just need to
get to the bottom of the bowl. As a result, even greedy schemes that
decrease the function at each step will converge to an optimal
solution. 
% \fs{Minor detail: Can we phrase the previous sentence in a mathematically exact way? Of course, just because a monotonically decreasing sequence is bounded from below by some number does not yet imply convergence against that number.} 
One such popular greedy method is called a gradient
method. The idea is to find the steepest descent direction. In 1-d,
this direction is the opposite direction of the derivative; in higher
dimensions, it's called the gradient of $f$.
% \footnote{One standard notation for the gradient is $\nabla_x f$~\cite{boyd:cvx}.} 
Using the gradient, we iteratively move toward a solution. This process can
be described by the following pseudocode:
\[ x \leftarrow x - \alpha \cdot G(x) \]
\noindent
where $G(x)$ is the gradient of $f(x)$ and $\alpha$ is a positive
number called the \textit{stepsize} that goes to zero with more
iterations.  For example, it suffices to set $\alpha=1/k$ where $k$ is
the number of iterations. In the $f(x) = x^2$ example, the gradient is
the derivative, $G(x) = 2x$. Since $x=0$ is the minimum value, we have
that for $x > 0$, $G(x) < 0$ while for $x < 0$ we have that $G(x) >
0$.  For a convex function, the gradient always tells us which
direction to go to find a minimum value, and the process described
above is guaranteed to converge at a known rate. One can provide a
provable rate of convergence to the minimum value, which is in
sharp contrast to a typical greedy search.  In our prototype
implementation in MADlib, we picked up one such simple greedy
algorithm, called stochastic (or sometimes, ``incremental'') gradient descent
(SGD)~\cite{RobbinsMonro51,BertsekasNLP}, that goes back to the
1960s. SGD is an approximation of gradient methods that is useful when the
convex function we are considering, $f(x)$, has the form:
 
\[ f(x)= \sum_{i=1}^N f_i (x) \]

If each of the $f_i$ is convex, then so is
$f$~\cite[pg.~38]{boyd:cvx}. Notice that all problems in Table~\ref{fig:objs:grads} are
of this form: intuitively each of these models is finding some model
(i.e., a vector $w$) that is scored on many different training
examples. SGD leverages the above form to construct a rough estimate
of the gradient of $f$ using the gradient of a single term: for
example, the estimate if we select $i$ is the gradient of $f_i$ (that
we denote $G_i(x)$). The resulting algorithm is then described as:
\begin{equation}
x \leftarrow x - \alpha N \cdot G_i(x)   
\label{eq:incremental}\end{equation}
This approximation is guaranteed to converge to an optimal solution~\cite{Nedic00}. 
% \fs{We should have the reference here.}

 
\paragraph*{Using the MADlib framework}
In our setting, each tuple in the input table for an analysis task
encodes a single $f_i$. We use the {\em micro-programming} interfaces of
Sections~\ref{sec:micro} and~\ref{sec:cplusplus} to perform the mapping from the tuples to the
vector representation that is used in Eq.~\ref{eq:incremental}.  Then,
we observe Eq.~\ref{eq:incremental} is simply an expression over each
tuple (to compute $G_i(x)$) which is then averaged together. Instead
of averaging a single number, we average a vector of numbers. Here, we
use the {\em macro-programming} provided by MADlib to handle all data
access, spills to disk, parallelized scans, etc.
% \footnote{Averaging
%   is more than a metaphor: recently a model-averaging technique where
%   one runs multiple models in parallel and averages the resulting
%   vectors component-wise is shown to improve convergence rates over a
%   serial approach~\cite{Zinkevich10}. Our implementation leverages
%   this observation to allow the DBMS freedom to process the aggregate in
%   whatever way it sees fit.
%   \ksn{The foot note "Averaging is more than a metaphor: recently a model-averaging?. " is also slightly misguided.
%   This is by no means a "recent" technique: such model averaging techniques have been around for quite some time (Bayes point machines, etc); the Smola paper merely gave a much more refined analysis to a well-known algorithm.}}
Finally, the macro programming layer
helps us test for convergence (which is implemented with either a
python combination or C driver.) Using this approach, we were able to
add in implementations of all the models in
Table~\ref{fig:objs:grads} in a matter of days.

In an upcoming paper we report initial experiments showing that our SGD based approach
achieves higher performance than prior data mining tools for some
datasets~\cite{Re:sigmod12}.
% For example, on the benchmark Forest dataset, our Logistic
% Regression implementation is up to $4.6\times$ faster than the original
% implementation in MADlib for runtime until 1\% convergence on a
% single-node multicore machine. Similarly, our SVM implementation is up
% to $5.8\times$ faster than the original MADlib implementation.\jmh{Do we need to rerun these comparisons?} \fs{I find this paragraph too vague as it is now. Can we name the data set? What is 1\% convergence? The difference in log-likelihood between two subsequent iterations is less than 1\%? That does not necessarily mean that the optimum has really been found.}


%% The model averaging is the final ingredient to implement the entire
%% SGD algorithm as a standard user-defined aggregation function: the
%% model vector is the state of the aggregate, the transition function
%% (or \texttt{sfunc}, in PostgreSQL terms) is the gradient step, and the
%% final merge step is to simply average the models. There are important
%% technical consequences of this observation. First, we automatically
%% leverage decades of database know-how in aggregation processing for
%% scalability (e.g., parallelism).  Second, we get many data analysis
%% tools almost for free: 

%% SGD is one of a spectrum of gradient methods that differ in how they
%% approximate they gradient term above: at one extreme is SGD where we
%% approximate the gradient using a single term, at the other extreme are
%% traditional batch gradient or conjugate gradient where one computes
%% the full gradient exactly (conjugate gradient is provided by MADlib as
%% well). In between, there are methods called {\em mini-batches} that
%% estimate the gradient using several terms of the function (in contrast
%% to both incremental gradient, which uses a single term, and batch
%% gradient that uses all the terms in the function to compute the true
%% gradient.). It is an interesting question to understand these
%% performance trade-offs in more detail for in-database analytics.


%% That said, our initial experiments show that our SGD based approach
%% acheives higher performance than prior data mining tools for some
%% datasets. For example, on the benchmark Forest dataset, our Logistic
%% Regression implementation is up to $4.6\times$ faster than the original
%% implementation in MADlib for runtime until 1\% convergence on a
%% single-node multicore machine. Similarly, our SVM implementation is up
%% to $5.8\times$ faster than the original MADlib implementation.

\begin{table}
\centering
{\small
\begin{tabular}[t]{|l||l|}
\hline
Application & Objective \\
\hline
\hline
Least Squares & $\sum_{(u,y) \in \Omega} (x^Tu - y)^2$\\
Lasso~\cite{Tibshirani94regressionshrinkage} & $\sum_{(u,y) \in \Omega} (x^Tu - y)^2 + \mu  \|x\|_1$\\
Logisitic Regression & $\sum_{(u,y) \in \Omega} \log(1 + \exp( - y x^tu ))$\\
Classification (SVM)& $\sum_{(u,y) \in \Omega} (1 - yx^{T}u)_{+}$\\
\hline
Recommendation & $\sum_{(i,j)\in\Omega} (L_i^T R_j - M_{ij})^2 + \mu\|L,R\|_F^2$\\
Labeling (CRF)~\cite{wallach:crf} & $\sum_{k} \left[ \sum_{j} x_j F_j(y_k, z_k) - \log Z(z_k) \right]$ \\
\hline
\end{tabular}}
\caption{Models currently Implemented in MADlib using the SGD-based
  approach. 
  % In classification and regression methods, given
  % a dataset $\Omega \subseteq \mathbb{R}^{N} \times \mathbb{R}$, we
  % minimize the error of a predictor $x \in \mathbb{R}^N$ on a
  % dataset. Given $(u,v) \in \Omega$, the methods above predict the
  % value of $y$ by the equation $x^Tu$. One can add a regularization
  % term to any regression or classification model, e.g., to combat
  % overfitting. For example, in Lasso, we minimize the square loss
  % against the data subject to an $\ell_1$ regularization.  In
  % recommendation, given a matrix $M \in \mathbb{R}^{n \times m}$
  % observed on a set of entries $\Omega \subseteq [n] \times [m]$, we
  % fit a so called low-rank approximation, i.e., $L \in \mathbb{R}^{n
  %   \times k}$ and $R \in \mathbb{R}^{m \times k}$ for $k \ll
  % \min\{m,n\}$ .  The low-rank matrix is regularized by the sums of the
  % squares of its factors.  In Labeling with Conditional Random fields,
  % we maximize the weights associated with features $(F_j)$ in the text
  % predict the labels. 
  }
\label{fig:objs:grads}
\end{table}

%% \paragraph*{Future Work and Challenges} We plan to work on three challenges: (1) improving performance, (2) an infrastructure to make tools in MADlib statistically robust, and (3) SAS-style Model Management. 
%% \begin{enumerate}
%% \item \textbf{Improving Performance}. SGDs are not a panacea. They
%%   have suboptimal runtime performance if extremely high precision is
%%   required. We plan to add in more both mathematical and systems-level
%%   optimization methods, which can be done transparently leveraging the
%%   decoupling provided by convex analysis.

%% \item \textbf{Increasing Robustness}. In its current version, MADlib
%%   encodes parameters of the algorithms (sometimes called
%%   hyperparameters) in the code itself. Some statistical techniques
%%   like regularization (a common technique to prevent overfitting or
%%   deal with poorly conditioned data) require that a user tune a
%%   parameter on their data. We would like to be able to expose such
%%   parameters in a generic way and then provide automated tuning
%%   techniques. In turn, our hope is to increase the robustness of these
%%   statistical tools automatically.

%% \item \textbf{Managing Models, SAS-style}. The term model management
%%   in the SAS literature is a distinct concept from the database
%%   term. The idea is to track how a model is derived, how robust its
%%   parameters are, which features are used in its construction. This
%%   has proved extremely useful in the SAS area to support higher-level
%%   business analytics. We plan to study this problem and develop an
%%   abstraction to support it.

%% \end{enumerate}

\subsection{Florida/Berkeley Contributions: \\Statistical Text Analytics}
\label{sec:viterbi}
\label{sec:mcmc}
%Importance of statistical text analysis 
The focus of the MADlib work at Florida and Berkeley has been to integrate statistical text analytics into a DBMS.
In many domains, structured
data and unstructured text are both important assets for data
analysis. For example, electronic medical record systems use transactional databases to store 
both structured data such as lab results and monitor readings, and unstructured data such as notes from physicians. 
With our goal to push analysis close to the data, it is important for MADlib to support basic statistical 
methods for data scientists to implement text analysis tasks.

% different text analysis tasks
Basic text analysis tasks include part-of-speech (POS) tagging,
named entity extraction (NER), and entity resolution (ER)~\cite{feldman2007text,JurafskyMartin}.
Different statistical models and algorithms are implemented for each
of these tasks with different runtime-accuracy tradeoffs. For
example, an entity resolution task could be to find all mentions in
a text corpus that refer to a real-world entity $X$. Such a task can
be done efficiently by approximate string matching~\cite{Navarro:2001:GTA:375360.375365}
techniques to find all mentions in text that approximately match the
name of entity $X$. However, such a method is not as accurate as the
state-of-the-art collective entity resolution algorithms based on
statistical models, such as Conditional Random Fields
(CRFs)~\cite{lafferty2001conditional}.
% \fs{This section introduces CRFs (and other techniques) essentially without explanation, whereas in previous sections we even describe linear/logistic regression or convex optimization.}

% our goal
% what we've done
\paragraph*{Pushing Statistical Text Analysis into MADlib}
Based on the MADlib framework, our group set out to implement
statistical methods in SQL to support various text analysis tasks.  We
use CRFs as the basic statistical model to perform more advanced text
analysis. Similar to Hidden Markov Models (HMM)~cite{}, CRFs are a
leading probabilistic model for solving many text analysis tasks,
including POS, NER and ER~\cite{lafferty2001conditional}. To support sophisticated text
analysis, we implement four key methods: text feature extraction, most-likely inference
over a CRF (Viterbi), MCMC inference, and
approximate string matching (Table~\ref{tab:uflmethods}). 

\begin{table}
  \centering
  \begin{small}
\begin{tabular}{|c||c|c|c|}
  \hline
  % after \\: \hline or \cline{col1-col2} \cline{col3-col4} ...
  Statistical Methods & POS & NER & ER \\ \hline\hline
  Text Feature Extraction & $\checkmark$ & $\checkmark$ & $\checkmark$ \\
  Viterbi Inference & $\checkmark$ & $\checkmark$ &  \\
  MCMC Inference &  & $\checkmark$ & $\checkmark$ \\
  Approximate String Matching &  &  & $\checkmark$ \\
  \hline
\end{tabular}
\end{small}
\caption{Statistical text analysis methods}
\label{tab:uflmethods}
\end{table}

%% \dzw{Add query-driven text analysis? -- Motivation for query-driven
%% text analysis techniques for IE and ER. Compare to tasks that does
%% not need to be real-time query-driven (e.g., POS, feature
%% extraction).}

\textbf{Text Feature Extraction:} Text feature extraction is a step in
most statistical text analysis methods, and it can be an expensive
operation. To achieve high quality, CRF methods often assign hundreds
of features to each token in the document. Examples of such features
include: (1) dictionary features: {\it does this token exist in a
  provided dictionary?} (2) regex features: {\it does this token
  match a provided regular expression?} (3) edge features: {\it is
  the label of a token correlated with the label of a previous
  token?} (4) word features: {\it does this the token appear in the
  training data?} and (5) position features: {\it is this token the
  first or last in the token sequence?} The right combination of
features depends on the application, and so leveraging the micro-programming
 interface, we implemented a common set of text feature extractors.

\textbf{Approximate String Matching:} A recurring primitive operation
in text processing applications is the ability to match strings
approximately. The technique we use is based on qgrams~\cite{gravano2001using}. 
% This technique is an
% approximate string match that was first introduced in the SQouT
% project~\cite{}. 
We used the trigram module in PostgreSQL to create
and index 3-grams over text.  Given a string ``Tim Tebow'' we can
create a 3-gram by using a sliding window of 3 characters over this
text string. Using the 3-gram index, we created an approximate matching UDF 
that takes in a query string and returns all documents in the corpus that contain at 
least one approximate match.

%The similarity between $term_1$ and $term_2$ is computed using the
%following equation:
%
%\begin{equation}\label{}
%    S(term_1, term_2) = \frac{|term_1 \cup
%    term_2|}{min\{|term_1|,|term_2|\}}.
%\end{equation}

Once we have the features, the next step is to perform inference on
the model. We also implemented two types of statistical inference
within the database: Viterbi (when we only want the most likely answer
from the model) and MCMC (when we want the probabilities or confidence
of an answer as well).

\textbf{Viterbi Inference:} The \textit{Viterbi} dynamic
programming algorithm is a popular algorithm to find
the top-k most likely labelings of a document for (linear chain) CRF
models~\cite{viterbi}. 
% The following dynamic programming computes the top-1 label
% sequence. (This can be easily extended to compute the top-k).
% 
% \vspace{-0.4cm}{\small
% \begin{equation}\label{Equation:crf_viterbi}
% V(i,y) = \left\{\begin{array}{lll}
%     \max_{y'} (V(i-1,y') & &\\
%     \quad + \sum^K_{k=1} \lambda_k f_k(y,y',x_i)), & $if$ & i\geq 0\\
%     0, & $if$ & i = -1.
%     \end{array}\right.
% \end{equation}}%
% \fs{If we give the recursion formula, then all symbols used should be explained (IMHO). What are the meanings of $i$, $y$, $\lambda_k$, etc.?}

Like any dynamic programming algorithm, the Viterbi algorithm is recursive. We experimented with two different implementations of macro-coordination over time. First, we chose to implement it using a combination of recursive SQL and window aggregate functions.  We discussed this implementation at some length in earlier work~\cite{DBLP:journals/pvldb/WangFGH10}. Our initial recursive SQL implementation only runs over PostgreSQL versions 8.4 and later; it does not run in Greenplum. Second, we chose to implement a Python UDF that uses iterations to drive the recursion in Viterbi. This iterative implementation runs over both PostgreSQL and Greenplum. In Greenplum, Viterbi can be run in parallel over different subsets of the document on a multi-core machine.

\textbf{MCMC Inference:} Markov chain Monte Carlo (MCMC) methods are
classical sampling algorithms that can be used to estimate probability
distributions. We implemented two MCMC methods in MADlib: Gibbs
sampling and Metropolis-Hastings (MCMC-MH). 
% The iterative Gibbs
% sampling algorithm is shown in Figure~\ref{Algo:Gibbs}.

The MCMC algorithms involve iterative procedures where the current iteration depends on previous iterations. We used SQL window aggregates for macro-coordination in this case, to carry ``state'' across iterations to perform the Markov-chain process. This window function based implementation runs over PostgreSQL version 8.4 and later. We discussed this implementation at some length in recent work~\cite{wang2011hybrid}.  We are currently working on integrating MCMC algorithms into Greenplum DBMS. We also plan to implement MCMC using Python UDF macro-coordination analogous to Section~\ref{sec:kmeans}, and compare the performance between the two implementations.

% \ksn{Is Figure 4 really necessary?
%    A question people would ask is this: Why don't we just use NLTK through PL/Python to access all the different  text analytics functionalities?}
% 
% \begin{figure}
% \hrulefill\\
% {\small
% \begin{algo}{Gibbs}{N}
%     w_0 \: \CALL{Init}(); w \: w_0; |  // initialize|
%     \FOR{idx = 1,...,N}
%         i \: idx \% n; |  // propose variable to sample next|
%         w' \sim \pi (w_{i}\mid w_{-i}) |  // generate sample|
%         \RETURN{| | next | | w'} |  // return a new sample|
%         w \: w'; |  // update current world|
%     \ENDFOR
% \end{algo}}
% \hrulefill\\
% \caption{Pseudo-code for MCMC Gibbs sampling algorithm over a model
% with $n$ variables.}  \label{Algo:Gibbs}
% \end{figure}

% main results, lessons, surprises
\paragraph*{Using the MADlib Framework} 
Because this work predated the release of MADlib, it diverged from the macro-coordination patterns of Section~\ref{sec:macro}, and took advantage of  PostgreSQL features that are less portable.  These details require refactoring to fit into the current MADlib design style, which we are currently working on.
In addition to the work reported above, there are a host of other features of both PostgreSQL and MADlib
that are valuable for text analytics.  Extension libraries for PostgreSQL and Greenplum provides text processing features such
as inverted indexes, trigram indexes for approximate string matching,
and array data types for model parameters.  Existing modules in MADlib,
such as Naive Bayes and Sparse/Dense Matrix manipulations, are
building blocks to implement statistical text analysis
methods. Leveraging this diverse set of tools and techniques that are
already within the database allowed us to build a sophisticated text
analysis engine that has comparable raw performance to off-the-shelf
implementations, but runs natively in a DBMS close to the data~\cite{DBLP:journals/pvldb/WangFGH10,wang2011hybrid}. 

%% Statistical methods can be implemented in a relational database
%% efficiently. Prior work~\cite{} shows comparable performance between
%% in-database SQL and off-the-shelf implementations. Real-time
%% query-driven text analysis can be supported using indexes over a large
%% document corpus, which can be orders-of-magnitude faster than an
%% off-line batch process.


%% \textbf{Query-Driven Techniques}

%% Information Extraction (Viterbi, Feature Extraction)

%% Entity Resolution (MCMC, Approximate String Matching)

%% \textbf{Future Work}

%% \begin{itemize}
%%   \item Performance Improvement (Feature Extraction)
%%   \item MPP Implementation
%%   \item POS/IE/ER models
%%   \item Text Analysis query interface design
%% \end{itemize}
%% Performance Improvement (Feature Extraction) MPP Implementation
%% POS/IE/ER models Query-driven Text Analysis Interface (including
%% key-word search and more advanced)

\section{Related Work}
The space of approaches to marry analytics with data is large and
seems to be growing rapidly. At a high-level there are two approaches,
(1) bring a statistical language to a data processing substrate, or
(2) provide a framework to express statistical techniques on top of a
data processing substrate. Specifically, we split the approaches to
achieve this marriage into two groups: (1) {\em top-down approaches}
begin with a high-level statistical programming language, e.g., R or
Matlab. The technical goal is to build a parallel data processing
infrastructure that is able to support running this high-level
language in parallel. Examples of approach include System ML from
IBM~\cite{systemml}, Revolution
Analytics~\cite{Revolution}, and
SNOW~\cite{SNOW:Tierney:2009:SPC:1598913.1598918}. (2) We call the
second group {\em framework-based approaches} whose goal is to provide
a set of building blocks (individual machine-learning algorithms)
along with library support for micro- and macro-programming to write the
algorithms. Recent examples are Mahout~\cite{mahout},
Graphlab~\cite{Graphlab:conf/uai/LowGKBGH10}, SciDB~\cite{scidb} and MADlib. 
% % One can view
% % the work of the 90s of integrating the machine learning toolkits as an
% early framework-based approach.

\paragraph*{Top-Down, Language-based Approaches} 
Within the approaches of type (1), we again divide the space into
imperative and declarative approaches. In the imperative approach, an
analyst takes the responsibility of expressing how to parallelize the
data access, e.g., SNOW and Parallel R packages provide an
MPI-interface along with calls for data partitioning within the
framework of R. In contrast in a declarative approach, the analyst
declares their analysis problem and it is the responsibility of the
system to achieve this goal (in analogy with a standard
RDBMS). Examples of the declarative approach include
SystemML~\cite{systemml}, which is an
effort from IBM to provide an R-like language to specify machine
learning algorithms. In SystemML, these high-level tasks are then
compiled down to a Hadoop-based infrastructure. The essence of
SystemML is its compilation techniques: they view R as a declarative
language, and the goal of their system is to compile this language
into an analog of the relational algebra. Similar approaches are taken
in Revolution Analytics~\cite{Revolution} and Oracle's new parallel R
offering~\cite{ORE} in that these approaches attempt to automatically
parallelize code written in R.

\paragraph*{Framework-based Approaches} 
Framework-based ap\-proach\-es attempt to provide the analyst with
low-level primitives and coordination primitives built on top of a
data processing substrate. Typically, framework-based approaches
offer a template whose goal is to automate the common aspects of
deploying an analytic task. Framework-based approaches differ in what
their data-processing substrates offer. For example, MADlib is in in this
category as it provides a library of functions over an RDBMS. The
macro- and micro-programming described earlier are examples of design
templates. The goal of Apache Mahout~\cite{mahout} is to provide an
open-source machine learning library over Apache Hadoop. Currently,
Mahout provides a library of machine learning algorithms and a
template to extend this library. At a lower level, the DryadLINQ large
vector library provides the necessary data types to build analysis
tasks (e.g., vectors) that are commonly used by machine learning
techniques. 
SciDB advocates a completely rewritten DBMS engine for numerical computation, arguing that RDBMSs have ``the wrong data model, the wrong operators, and are missing required capabilities''~\cite{scidb}.  MADlib may be seen as a partial refutation of this claim. Indeed, the assertion in~\cite{scidb} that ``RDBMSs, such as GreenPlum~[sic]...must convert a table to an array inside user-defined functions'' is incorrect: the MADlib C++ library hides representation from the UDF developer and only pays for copies when modifying immutable structures. 
GraphLab is a framework to simplify the design of
programming parallel machine learning tasks. The core computational
abstraction is a graph of processing nodes that allows one to define
asynchronous updates. Graphlab initially focused on providing a
framework for easy access to multicore
parallelism~\cite{Graphlab:conf/uai/LowGKBGH10}. However, the core
computational abstraction is general and is now also deployed in a
cluster setting. 
Another popular toolkit in this space is
Vowpal Wabbit~\cite{vw} that is extensively used in the academic
machine learning community and provides
high performance.

Two recent efforts to provide Scala-based domain-specific languages (DSLs) and parallel execution frameworks blur distinctions between frameworks and language-based approaches. Spark~\cite{Zaharia:EECS-2011-82} is a Scala DSL targeted at Machine Learning, providing access to the
fault-tolerant, main-memory {\em resilient distributed datasets}:
which are read-only collections of data that can be partitioned across
a cluster. ScalOps~\cite{scalops} provides a Scala DSL for Machine Learning that is translated to Datalog, which is then optimized to run in parallel on the Hyracks infrastructure~\cite{hyracks}.  Given its roots in Datalog and parallel relational algebra, ScalOps bears more similarity to MADlib than any of the other toolkits mentioned here.  It would be interesting to try and compile its DSL to the MADlib runtime.


\paragraph*{Other Data Processing Systems}
There are a host of data processing techniques that can be used to
solve elements of the underlying data analysis problem. For example,
Pregel \cite{Pregel:Malewicz:2010:PSL:1807167.1807184} from Google is
designed for data analysis over graphs.  In Pregel, the abstraction is
to write ones code using a graph-based abstraction: each function can
be viewed as a node that sends and receives messages to its neighbors
in the graph. Pregel distributes the computation and provides fault
tolerance.
% \jmh{I think we need a pro-forma related work section that cites various efforts for scalable ML, particularly in open source.}
%       Traditional analysts bring data to the code: see the SAS \textit{SEMMA}
%       approach.  Import/export cycle with SAS, R, etc.  This will
%       always exist, but it will be increasingly less useful.  There
%       are active efforts to scale and parallelize these, but those are
%       still quite limited, and companies like SAS are working hard to
%       partner with approaches from below...
% 
%       The new model of scalable analytics depends on bringing the code
%       to the data in a scalable storage/analysis framework.  So you
%       can structure the space in terms of where the data lives.  Not
%       going to get into \textit{what's better} here -- truth is that there
%       are pros and cons, and many organizations use more than one.
% 
%       Things are really in flux.  Segments kinda like so:
%       \begin{itemize}
%           \item SQL universe has proprietary {\em data mining} toolkits, many of which date back to the 1990s.  These cost a lot and evolve very slowly.  Cite the Teradata research papers, IBM white book.  MADlib is an effort to raise all boats in this domain.
%           \item HDFS/Hadoop universe is burgeoning.  Mahout is the center of energy for ML algorithms right now. (also System ML?)  Exciting trend there, and while the ML libraries are still quite young, they may evolve quickly.
%           \item A variety of emerging in-memory distributed systems like Graphlab, Vowpal Wabbit, Spark, Pregel/Giraph.  These merit a close look.  Some of them present alternative architectures to traditional dataflow parallelism (e.g. GraphLab, VW).  Others are targeted to problems like graph algorithms that may prove to be weak spots for more general purpose systems (e.g. Graphlab, Pregel/Giraph).
%         \end{itemize}


% \section{MADlib Beta Release: Status Report and Future Direction}
% \jmh{Do we need this?  or is the table and discussion up front enough?} \fs{I would include the references in the table up front. Future plans could be moved to the conclusion. Possibly renamed to Conclusion and Outlook.}
% \ksn{This looks awfully like Section 2.4.}
%      Engineering team leading to Beta was XX Greenplum heads + 1
%      Berkeley head.  Greenplum also donating ongoing automated
%      testing on x OSes and y DBMS versions (PG \& GP, multiple
%      versions of each).  Contributors from Wisconsin and Florida
%      constitutes YYY heads to date.
% 
%      Expand the following from the user docs.  For each method, should be refs to literature on the algorithms being used.
%      \begin{itemize}
%          \item Stat and linear algebra fundamentals
%          \item Descriptive statistics (deterministic, and one-pass approximation algorithms)
%          \item Regression
%          \item Unsupervised learning/Data mining
%          \item Supervised learning
%        \end{itemize}
\section{Conclusion and Discussion}
Scalable analytics are a clear priority for the research and industrial communities.  MADlib was designed to fill a vacuum for scalable analytics in SQL DBMSs, and connect database research to market needs.
In our experience, a parallel DBMS provides a very efficient and flexible dataflow substrate for implementing statistical and analytic methods at scale. 
Standardized support for SQL extensions across DBMSs could be better---robust and portable support for recursion, window aggregates and linear-algebra packages would simplify certain tasks.  Since this is unlikely to occur across vendors in the short term, we believe that MADlib can evolve reasonable workarounds at the library and ``design pattern'' level.

The popular alternative to a DBMS infrastructure today is Hadoop MapReduce, which provides much lower-level programming APIs than SQL.
We have not yet undertaken performance comparisons with Hadoop-based analytics projects like Mahout.  Performance comparisons between MADlib and Mahout today would likely boil down to (a) variations in algorithm implementations, and (b) well-known (and likely temporary) tradeoffs between the current states of DBMSs and Hadoop: C vs. Java, pipelining vs. checkpointing, etc.~\cite{pavlo2009comparison}  None of these variations and tradeoffs seem endemic, and they may well converge over time.  Moreover, from a marketplace perspective, such comparisons are not urgent: many users deploy both platforms and desire analytics libraries for each.  So a reasonable strategy for the community is to foster analytics work in both SQL and Hadoop environments, and explore new architectures (GraphLab, SciDB, ScalOps, etc.) at the same time.  

MADlib has room for growth in multiple
dimensions.  The library infrastructure itself is still in beta, and
has room to mature. There is room for enhancements in its core
treatment of mathematical kernels (e.g., linear algebra over both
sparse and dense matrices) especially in out-of-core settings.  And of
course there will always be an appetite for additional statistical models and
algorithmic methods, both textbook techniques and cutting-edge research.  Finally,
there is the challenge of porting MADlib to DBMSs other than PostgreSQL and Greenplum. As discussed in Section~\ref{sec:gen}, MADlib's macro-coordination logic is written in largely standard Python and SQL, but its finer grained ``micro-programming'' layer exploits proprietary DBMS extension interfaces.
Porting MADlib across DBMSs is a
mechanical but non-trivial software development effort.  At the macro level, porting will involve the package infrastructure (e.g.\ a cross-platform
installer) and software engineering framework (e.g.\ testing scripts for additional database engines).  At the micro-programming logic level, inner loops of various methods will need to be revisited (particularly user-defined
functions). Finally, since Greenplum retains most of the extension interfaces exposed by PostgreSQL, the current MADlib portability interfaces (e.g., the C++ abstraction of Section~\ref{sec:cplusplus}) will likely require revisions when porting to a system without PostgreSQL roots.
%\fs{True for pure SQL modules. For pgSQL/C modules, porting at this point is essentially rewriting. And for C++ AL modules, porting is a non-trivial rewrite of the AL.}
% \jmh{I think that's all covered above.}

Compared to most highly scalable analytics packages today, MADlib v0.3 provides a relatively large number of widely-used statistical and analytic methods.  It is still in its early stages of development, but is already in use both at research universities and at customer sites.  As the current software matures, we hope to foster more partnerships with academic institutions, database vendors, and customers.  In doing so, we plan to pursue additional analytic methods prioritized by both research ``push'' and customer ``pull''.  We also look forward to ports across DBMSs.


% More interesting questions arise by focusing a level up, at the language layer.  MADlib's use of Python, SQL, and C++ abstractions for algebra reflect the realities of the SQL development ecosystem, rather than an idealized language design.  Perhaps a better approach would be to design a clean high-level language for machine learning, and compile down to the pragmatic mix that works on a contemporary DBMS.  systemml and ScalOps take this approach for MapReduce.  More interestingly, one could try to {\em cross-compile} such a language to multiple platforms---including not only the well-known parallel dataflow platforms (DBMSs, MapReduce) but also specialized machine learning engines like GraphLab that are architected quite differently.  Doing so would not only unify the open source ecosystem for analytics around a single language, but could also highlight important tradeoffs between these very different architectures for parallelism.

\section{Acknowledgments}
In addition to the authors, the following people have contributed to MADlib and its development:
Jianwang Ao, 
Joel Cardoza, 
Brian Dolan,
Hlya Emir-Farinas, 
%Eugene Fratkin, 
Christan Grant, 
Hitoshi Harada, 
Steven Hillion, 
Luke Lonergan, 
%KeeSiong Ng, 
Gavin Sherry,
Noelle Sio, 
Kelvin So,
Gavin Yang, 
Ren Yi, 
Jin Yu,
Kazi Zaman, 
Huanming Zhang.  %Thanks to them all.
{\small \bibliographystyle{abbrv-etal}
  \bibliography{chris.local.wopages,WriteupRelWork} }

\end{document}