
MADlib Installer Specs

agorajek edited this page Apr 16, 2011 · 7 revisions

Created by Gavin Sherry. Edited by Florian Schoppmann, Aleks Gorajek.

Contents:

  1. Introduction
  2. Overview
  3. Package Managers
  4. Requirements for (OS-level) Installation
  5. Requirements for Command-Line Tool
  6. Functional Specs
    • 6.1. OS-level Installation
    • 6.2. DB-level Installation
      • 6.2.1. Installing/Updating MADlib
      • 6.2.2. Uninstalling MADlib
      • 6.2.3. Displaying version info
  7. Database Objects Migration
  8. Design Considerations
    • 8.1. Instance, Database, Schema
    • 8.2. Multi-host environments
    • 8.3. User dependencies on MADlib code

1. Introduction

Many database systems already support some statistical functions. MADlib is an open source project which provides a common set of functions to relational database systems. An open source project which is hard to install, manage and use is not a compelling one. As such, an important aspect of MADlib will be the ability to easily procure the code and make its functionality available to the data processing system. This calls for a user-friendly mechanism to aid installation, upgrades and uninstallation.

2. Overview

MADlib installation has two layers. At the operating-system level, files and libraries are installed using the platform-native installation infrastructure. Since in general there is a one-to-many correspondence between the operating system and the databases running on it, a second tool is needed that makes the MADlib user-level API available at the database level. This is the job of the MADlib command-line installation tool specified in this document.

This document is guided by the following important criteria for success:

  • Don't reinvent the wheel. Using the existing and proven tools on each platform ensures that MADlib can be incorporated into existing software-distribution infrastructures, and users immediately feel at home.
  • Clean interface. A command line and programmatic interface is provided for making the MADlib user-level API available in a particular database. The command-line tool is easy to use and similar to other tools that perform actions on a single database (e.g., createdb, pg_dump, pg_restore, ...).
  • Version tracking. The tool must handle version-stamped releases and understand the relationship between versions. In particular, it must distinguish major releases, which modify interfaces, from minor releases, which do not.
  • Robustness. The package manager will be used in production environments where serious bugs or flaws taint the reputation of the project.
  • Just works. The tool should work intuitively.
  • Completeness. The tool provides the host of minor features that make it polished and usable in unexpected ways.
  • Portability. The ability to deploy packages for a variety of data processing platforms. In the initial release, these will be RedHat/CentOS (rpm/yum), Solaris (pkg-get) and Mac OS X (MacPorts) on the system level, and Greenplum and Postgres on the database level. Support will follow soon after for other RDBMS platforms.

3. Package Managers

There are a significant number of software package managers in the open source world. Some popular ones are the RPM Package Manager (RPM), the Comprehensive Perl Archive Network (CPAN), RubyGems and Debian Packages (dpkg). Though they target different users and platforms, these package managers share several things in common:

  • Well defined interface. A command line tool and API are provided. The API allows users to write extensions, such as GUIs.
  • Internet-based repository. A centralised location for the publishing and retrieval of packages. The client ships with pre-defined methods of locating repositories. Packages are downloaded via common Internet protocols like HTTP or FTP.
  • Versions. Packages can be versioned and version information is understood by the package manager.
  • Dependencies. A graph of arbitrary complexity can be constructed and managed to determine dependencies of one package upon another.
  • Robust management. Packages can be downloaded as source or as a platform-specific binary. They can be built locally, activated, deactivated, rebuilt, listed, searched, removed and more. The installation step itself allows scripts to be added for pre- and post-processing. The package manager allows users to construct their own packages for local use or for publishing to the centralized repository.
  • Architecture awareness. Packages are available for specific architectures and platforms, and the package manager is smart enough to install the appropriate package for the host platform.

4. Requirements for (OS-level) Installation

Requirement Description
Version 1.0
P1 If possible, MADlib package will provide binaries only. Installation from source will only occur where necessary (e.g., MacPorts).
P2 MADlib installation will not interfere with other packages. In particular, uninstalling the databases that MADlib was used in will never leave stray files.
P3 All dependencies on 3rd party packages (e.g., LAPACK) will be automatically resolved when possible.

5. Requirements for Command-Line Tool

Requirement Description
Version 1.0
R1 The command-line tool is able to "install" the MADlib user-level API on a selected database instance. "Installing" includes the initial setup, up- and down-grading, and uninstalling.
R2 Packages may be installed on Postgres and Greenplum Database systems.
R3 The command-line installation tool comes with basic manual page support.
Post Version 1.0
R4 MADlib can be installed to template1 (on Postgres-based databases) so that any newly created database will automatically contain MADlib.
R5 MADlib will be added to repositories for supported platforms.
R6 The package management client is able to manage packages for other systems, such as MySQL and Oracle.

6. Functional Specs

6.1. OS-level Installation

Installing MADlib at the OS level should require no more than the following (example taken from RedHat/CentOS):

yum install madlib

on a single-node setup (like PostgreSQL or Greenplum Single Node Edition) and

gpssh -f host_file "yum install madlib"

on a multi-node setup (like a Greenplum cluster).

6.2. DB-level Installation: The MADlib command-line tool

The basic form for the command-line tool's arguments is:

madpack <command> <command-options> <connect-string>

where the commands are:

  • install/update Installs or updates the database layer MADlib code.
  • uninstall Removes all database-level objects without touching the OS layer.
  • version Reports version of the MADlib OS and/or DB layer.

and command-options are:

  • -s schemaname The name of the target database schema for MADlib objects. If not specified, "madlib" will be used by default.
  • -v verbose Verbose output style.
  • -db database-type Type of the target database platform, like PostgreSQL or Greenplum.

and connect-string is: user[/pass]@[host:]port/dbname

  • user Database user name to connect as.
  • pass Password for the database user. If not supplied, the installer will prompt for it.
  • host Database host to connect to. If not supplied, it defaults to "localhost".
  • port TCP port to connect to.
  • dbname Database name to install MADlib into.
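
As an illustration of the connect-string format above, here is a minimal Python sketch of a parser. The names `ConnectInfo` and `parse_connect_string` are hypothetical and not part of this spec. The sketch assumes that the user name contains neither "/" nor "@", that the password contains no "@", and that the port is numeric.

```python
import re
from typing import NamedTuple, Optional

# Pattern for user[/pass]@[host:]port/dbname as described above.
# Assumption: no "@" in the password, no "/" or "@" in the user name.
_CONNECT_RE = re.compile(
    r"^(?P<user>[^/@]+)"          # user (required)
    r"(?:/(?P<password>[^@]*))?"  # /pass (optional)
    r"@(?:(?P<host>[^:/]+):)?"    # host: (optional)
    r"(?P<port>\d+)"              # port (required, numeric)
    r"/(?P<dbname>.+)$"           # dbname (required)
)


class ConnectInfo(NamedTuple):
    """Hypothetical container for the parsed connect-string fields."""
    user: str
    password: Optional[str]  # None means: prompt the user for it
    host: str
    port: int
    dbname: str


def parse_connect_string(text: str) -> ConnectInfo:
    """Split a connect string into its parts, applying the documented defaults."""
    match = _CONNECT_RE.match(text)
    if match is None:
        raise ValueError(f"invalid connect string: {text!r}")
    return ConnectInfo(
        user=match.group("user"),
        password=match.group("password"),
        host=match.group("host") or "localhost",  # default per the spec
        port=int(match.group("port")),
        dbname=match.group("dbname"),
    )
```

For example, `parse_connect_string("gpadmin@mdw:5432/testdb")` yields the host "mdw" and no password, so a tool built on this sketch would prompt for one.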

6.2.1. Installing/Updating MADlib

The install command brings the MADlib in-database API in line with the MADlib version installed at the OS level. Installation can therefore be the first setup, an upgrade or a downgrade. For example:

madpack install -s madlib -db gp gpadmin@mdw:5432/testdb

6.2.2. Uninstalling MADlib

The uninstall command removes all database-level MADlib objects without touching the OS layer. For example:

madpack uninstall -db gp gpadmin@mdw:5432/testdb

6.2.3. Displaying version info

The version command reports the version of MADlib installed in the database as well as the version of the madpack command-line tool (which is the version of the files installed at the OS level). For example:

madpack version -db gp gpadmin@mdw:5432/testdb

If neither a database type nor a connect-string is supplied, only the version of the madpack tool is reported.

7. Database Objects Migration

To provide an easy way to install MADlib objects while avoiding destructive actions on user objects that depend on MADlib objects of the old version, we propose the following design.

Definitions:

  • MADLIB_SCHEMA - name of the schema to install MADlib objects into, specified by the user (or read from the default value: madlib)

For each case, the install actions and the rollback actions (if needed during installation) are listed.

Version 1.0

  • Case: MADLIB_SCHEMA does not exist
    • Install actions:
      1) Create MADLIB_SCHEMA
      2) Create MADlib objects
    • Rollback actions:
      Drop MADLIB_SCHEMA
  • Case: MADLIB_SCHEMA exists w/o MADlib objects
    • Install actions:
      1) Create MADlib objects
    • Rollback actions:
      WARNING Message: Drop MADLIB_SCHEMA before retrying OR install MADlib into a different schema
  • Case: MADLIB_SCHEMA exists w/ MADlib objects
    • Install actions:
      1) INFO Message: MADLIB_SCHEMA will be renamed to MADLIB_SCHEMA_vXYZ
      2) Rename MADLIB_SCHEMA to MADLIB_SCHEMA_vXYZ
      3) Create MADLIB_SCHEMA
      4) Create MADlib objects
    • Rollback actions:
      1) Drop MADLIB_SCHEMA
      2) Rename MADLIB_SCHEMA_vXYZ to MADLIB_SCHEMA

After Version 1.0

  • Case: MADLIB_SCHEMA exists w/ MADlib objects
    • Install actions:
      Upgrade MADlib objects one by one (if possible) based on MADlib metadata about each version
    • Rollback actions:
      Undo all steps executed so far. More research is needed here.
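
The Version 1.0 cases above can be sketched as a small planning function in Python. This is an illustration under stated assumptions, not the installer's actual implementation: the function name `plan_install`, its parameters, and the plain-text action descriptions are all hypothetical. How the tool detects whether a schema exists or already contains MADlib objects is left out.

```python
from typing import List, Tuple


def plan_install(
    schema: str,
    schema_exists: bool,
    has_madlib_objects: bool,
    old_version: str = "XYZ",  # placeholder suffix, as in MADLIB_SCHEMA_vXYZ
) -> Tuple[List[str], List[str]]:
    """Return (install_actions, rollback_actions) for the Version 1.0 cases.

    Actions are human-readable descriptions, not SQL statements.
    """
    if not schema_exists:
        # Case 1: schema does not exist.
        return (
            [f"create schema {schema}", "create MADlib objects"],
            [f"drop schema {schema}"],
        )
    if not has_madlib_objects:
        # Case 2: schema exists without MADlib objects; nothing is undone
        # automatically, the user is only warned.
        return (
            ["create MADlib objects"],
            [f"warn: drop {schema} before retrying, "
             "or install MADlib into a different schema"],
        )
    # Case 3: schema exists with MADlib objects; keep the old schema under
    # a versioned name so dependent user objects are not destroyed.
    backup = f"{schema}_v{old_version}"
    return (
        [
            f"info: {schema} will be renamed to {backup}",
            f"rename schema {schema} to {backup}",
            f"create schema {schema}",
            "create MADlib objects",
        ],
        [f"drop schema {schema}", f"rename schema {backup} to {schema}"],
    )
```

Returning the rollback list together with the install list keeps each case self-describing: if any install step fails, the caller can execute the matching rollback steps in order.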

8. Design Considerations

8.1. Instance, Database, Schema

These terms are very charged in the database world. Postgres and Greenplum Database have instances, which host databases, which themselves host schemas, which contain objects. Oracle, on the other hand, defines an instance as the instantiation of a given database. TODO: We need to research the details of this for all possible db platform ports and perhaps make the database & schema optional.

8.2. Multi-host environments

While Postgres and the Greenplum Database largely present identically to the user, the latter is a cluster of networked hosts. To create a C language user-defined function, the shared library that the C function was compiled to needs to be present on all hosts comprising the cluster. This means that installing the package is not just a case of copying the data from the package to the file system of the host invoking the package management client. For the Greenplum Database, we must assemble a host list and call the gpssh utility to run the OS-level package manager on all nodes. As long as MADlib is not part of mainstream repositories, we must include detailed instructions on how to modify the package-manager configurations on all nodes at once.
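
As a rough sketch of the multi-host step, the following Python function assembles the gpssh invocation shown in section 6.1 for a given host file. The function name `build_cluster_install_command` is hypothetical, and the sketch assumes that yum is the OS-level package manager on every node and that the host file has already been assembled.

```python
import shlex
from typing import List


def build_cluster_install_command(host_file: str, package: str = "madlib") -> List[str]:
    """Build the argument list that runs the OS-level install on all hosts.

    Assumption: every host listed in host_file uses yum.
    """
    # The remote command is passed to gpssh as a single argument.
    remote_command = f"yum install {shlex.quote(package)}"
    return ["gpssh", "-f", host_file, remote_command]
```

The returned list can be handed to `subprocess.run` unchanged. Keeping it as a list rather than a single string avoids a second round of shell quoting on the invoking host.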

8.3. User dependencies on MADlib code

Users are going to create views and functions which reference objects created by MADlib packages. This establishes a dependency outside of the MADlib schema. By understanding major and minor version changes, that is, those revisions which modify an interface and those which do not, the command-line tool can attempt to install a new revision of a package without breaking an application which uses it.
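
The major/minor distinction could be checked along the following lines. This is a minimal sketch, assuming version strings of the form "major.minor" (optionally followed by further components) and assuming that only a change in the major number modifies an interface. Both function names are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Parse 'major.minor[.more]' into (major, minor).

    Assumption: at least two dot-separated numeric components are present.
    """
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def is_interface_compatible(installed: str, candidate: str) -> bool:
    """True if moving from installed to candidate keeps the same major version,
    i.e. (by the assumption above) does not modify any interface."""
    return parse_version(installed)[0] == parse_version(candidate)[0]
```

Under these assumptions, going from "1.0" to "1.2" counts as compatible, while going from "1.2" to "2.0" does not, so the tool could warn about dependent user views and functions before proceeding.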
