
# Using Loom

## Overview

Loom is organized as a collection of Python modules that wrap stand-alone C++ engine tools. Using Loom is file-oriented: the main tasks are purely functional transformations between files. The highest-level tasks live in the `loom.tasks` module.

The most common operations are accessible both in Python and at the command line, with syntax like

```
python -m loom.<modulename> function argument1 key1=value1 key2=value2
```

To see help for a particular module, simply run

```
python -m loom.<modulename>    # lists available commands
```

The most common transforms are available from the top-level module:

```
python -m loom    # lists available commands
```

Loom stores all intermediate files in a tree rooted at `loom.store.STORE`, which defaults to `$LOOM/data` and can be overridden by setting the `LOOM_STORE` environment variable.
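For example, a minimal sketch of overriding the store location from Python, assuming `loom.store` reads `LOOM_STORE` when it is first imported:

```python
import os

# Hypothetical location; set before loom.store is imported,
# assuming STORE is computed at import time.
os.environ['LOOM_STORE'] = '/tmp/loom-data'

import loom.store
print loom.store.STORE  # -> /tmp/loom-data
```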

See Adapting Loom for a detailed description of the dataflow.

## Input format

Loom inputs a pair of files, `schema.csv` and `rows.csv`, via `loom.tasks.transform`, which produces a stricter schema and a table with basic types suitable for `loom.tasks.ingest`. The `schema.csv` accepted by `loom.tasks.transform` should document all relevant columns in `rows.csv`.
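A hedged sketch of the resulting pipeline; the exact argument names are assumptions, so consult `loom.tasks` for the real signatures:

```python
import loom.tasks

name = 'my-dataset'  # hypothetical dataset name

# fluent schema + raw csv -> stricter schema + basic-typed rows
loom.tasks.transform(name, 'schema.csv', 'rows.csv')

# basic-typed csv -> protobuf streams under $LOOM_STORE/my-dataset/ingest
loom.tasks.ingest(name)
```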

An example `schema.csv`:

| Feature Name | Type |
| --- | --- |
| full name | id |
| start date | optional_date |
| age | real |
| zipcode | unbounded_categorical |
| description | text |
| misc | unused |
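In raw CSV form this schema might look like the following; whether a header row is required is an assumption:

```
Feature Name,Type
full name,id
start date,optional_date
age,real
zipcode,unbounded_categorical
description,text
misc,unused
```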

Loom currently supports the following fluent feature types in `loom.tasks.transform`:

| Fluent Type | Example Values | Transforms To |
| --- | --- | --- |
| boolean | '0', '1', 'true', 'false' | bb |
| categorical | 'Monday', 'June' | dd |
| unbounded_categorical | 'CRM', '90210' | dpd |
| count | '0', '1', '2', '3', '4' | gp |
| real | '-100.0', '1e-4' | nich |
| sparse_real | '0', '0', '0', '0', '123456.78', '0', '0', '0' | bb + nich |
| date | '2014-03-31', '10pm, August 1, 1979' | many nich + many dpd |
| text | 'This is a text feature.', 'Hello World!' | many bb |
| tags | '', 'big_data machine_learning platform' | many bb |
| optional_(TYPE) | '', ...examples of TYPE... | bb + TYPE |

Text fields typically transform to 100-1000 boolean features indicating the presence of words that occur in at least 1% of documents. Date fields transform to absolute, cyclic, and relative features, so having many date fields is expensive. An input schema with N date fields transforms to:

```
N date = N nich         # absolute date
       + N(N-1)/2 nich  # relative dates for each pair
       + 4 N dpd        # month + day-of-month + day-of-week + hour-of-day
```
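For example, an input schema with N = 3 date fields transforms to 3 + 3 = 6 nich features and 4 × 3 = 12 dpd features.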

Loom currently supports the following basic feature models in `loom.tasks.ingest`:

| Name | Basic Type | Example Values | Probabilistic Model | Relative Cost |
| --- | --- | --- | --- | --- |
| bb | boolean | 0, 1, true, false | Beta-Bernoulli | 1 |
| dd | categorical (up to 256) | Monday, June | Dirichlet-Discrete | 1.5 |
| dpd | unbounded categorical | CRM, 90210 | Dirichlet-Process-Discrete | 3 |
| gp | count | 0, 1, 2, 3, 4 | Gamma-Poisson | 50 |
| nich | real number | -100.0, 1e-4 | Normal-Inverse-Chi-Squared | 20 |

## Loom file formats

Loom ingests data as various gzipped files: csv (`.csv`), protobuf messages (`.pb`), and protobuf streams (`.pbs`). See `src/schema.proto` for the protobuf message formats.

The `loom.store` module gives programmatic access to the loom store. Loom stores files under `$LOOM_STORE`, defaulting to `loom/data/`. The store is organized as

```
$LOOM_STORE/
$LOOM_STORE/my-dataset/
$LOOM_STORE/my-dataset/ingest/      # directory for starting dataset
$LOOM_STORE/my-dataset/samples/     # directory for inference
$LOOM_STORE/my-dataset/.../         # directories for various operations
```
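As a hedged sketch of programmatic access (`get_paths` and the shape of its result are assumptions; see `loom.store` for the real helpers):

```python
import loom.store

# Resolve the standard directory layout for a dataset (assumed helper).
paths = loom.store.get_paths('my-dataset')
print paths['ingest']   # e.g. $LOOM_STORE/my-dataset/ingest
print paths['samples']  # e.g. $LOOM_STORE/my-dataset/samples
```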

Each directory has a standard set of files:

```
ingest/                             # ingested data
  version.txt                       # loom version when importing data
  transforms.pickle.gz              # feature transforms
  schema.json | schema.json.gz      # schema to interpret csv data
  rows.csv.gz | rows_csv/*.csv.gz   # one or more input data files
  rowids.csv.gz                     # internal <-> external id mapping
  encoding.json.gz                  # csv <-> protobuf encoding definition
  rows.pbs.gz                       # stream of data rows
  schema_row.pb.gz                  # example row to serve as schema
  tares.pbs.gz                      # list of tare rows
  diffs.pbs.gz                      # compressed rows stream
samples/                            # inferred samples
  sample.0/                         # per-sample data for sample 0
    config.pb.gz                    # inference configuration
    init.pb.gz                      # initial model parameters etc.
    shuffled.pbs.gz                 # shuffled, compressed rows
    model.pb.gz                     # learned model parameters
    groups/                         # sufficient statistics
      mixture.0.pbs.gz              # sufficient statistics for kind 0
      mixture.1.pbs.gz              # sufficient statistics for kind 1
      ...
    assign.pbs.gz                   # stream of inferred group assignments
    infer_log.pbs                   # stream of log messages
    checkpoint.pb.gz                # checkpointed inference state
  sample.1/                         # per-sample data for sample 1
    ...
query/                              # query server data
  config.pb.gz                      # query configuration
  query_log.pbs                     # stream of log messages
```
You can inspect any of these files with

```
python -m loom cat FILENAME         # parse + prettyprint
```

and watch log files with

```
python -m loom watch /path/to/infer_log.pbs
```

In single-pass inference rows can be streamed in via stdin and assignments can be streamed out via stdout.

## Querying Results

Loom supports flexible, interactive querying of inference results. These queries are divided between low-level operations, implemented in `loom.runner.query`, and higher-level operations in `loom.preql`.

The low-level primitives `sample`, `score`, `entropy`, and `score_derivative` are written in C++. They can be accessed as a transformation of protobuf messages via the `loom.runner.query` function. See `src/schema.proto` for the query message format.

The `loom.query` module provides a convenient way to create a persistent query server with both protobuf and Python interfaces.

Higher-level operations are supported via the `loom.preql` module. Assuming you have completed an inference run named `iris`, you can create a query server using the following convenience method in `loom.tasks`:

```python
import loom.tasks
server = loom.tasks.query('iris')
```
- `relate` returns scores between 0 and 1 representing the strength of the predictive relationship between two columns or groups of columns. A score of 1 indicates that Loom has found very strong evidence for a relationship between the columns, and a score of 0 means that no evidence was found; intermediate values represent graded levels of confidence.

  ```python
  print server.relate(['class'])
  ```

- `predict` is a very flexible operation that returns simulated values for one or more unknown columns, given fixed values for a different subset of columns. This flexibility is possible because Loom learns a joint model of the data. Standard classification and regression tasks are therefore a special case, in which the value of one "dependent" or "target" column is predicted given values for all of the other columns (known as "independent variables", "predictors", "regressors", or "features").

  Some examples of `predict` usage are as follows (see the sketch after this list):

  - Fix all but one column and predict that column.
  - Fix some columns, predict some other columns.
  - Fix no columns, predict all columns.
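A hedged sketch of a classification-style `predict` call; the signature and the blank-cell CSV convention are assumptions, so see `loom.preql` for the real interface:

```python
import loom.tasks

server = loom.tasks.query('iris')

# Hypothetical query file: same header as rows.csv, with the cells to be
# predicted (here the 'class' column) left blank.
server.predict('queries.csv', count=10, result_out='predictions.csv')
```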