
Statistics generator

Introduction

The statistics generator module is a part of the CoAnSys project. It calculates various statistics over input data that can be represented as groups of string pairs. The module ships with a set of predefined operations and can be extended with user-defined components. It uses Apache Hadoop and Oozie as the backend and Protocol Buffers as the input and output data format.

Getting started - building and running statistics generator

To get the sources of the statistics generator (and the whole CoAnSys project), type:

git clone https://github.com/CeON/CoAnSys.git

Build the module:

cd CoAnSys/statistics-generator
mvn install -D jobPackage

The built artefact is located here:

statistics-generator-workflow/target/statistics-generator-workflow-<VERSION>-oozie-wf.tar.gz

Copy it to your Hadoop cluster, unpack it and put it into HDFS:

tar xzvf statistics-generator-workflow-*-oozie-wf.tar.gz
hadoop fs -put pl.edu.icm.synatlogsanalysis-synatlogsanalysis-workflow

Prepare the job.properties file containing at least:

nameNode=hdfs://namenodehost:8020
jobTracker=jobtrackerhost:8021
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/pl.edu.icm.synatlogsanalysis-synatlogsanalysis-workflow
unfilteredInput=inputDataDir
filteredInput=filteredInputDir
allStatistics=statsDir
aggregatedStatistics=aggrStatsDir

where inputDataDir, filteredInputDir, statsDir and aggrStatsDir are HDFS directories and can be replaced by any other names. The inputDataDir directory must already exist and contain the input data (see the section Input and output data).

To run the workflow, execute the following commands:

export OOZIE_URL=http://ooziehost:11000/oozie
oozie job -run -config job.properties
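
Oozie prints the identifier of the submitted job. You can then track the workflow's progress with the standard Oozie CLI, e.g.:

oozie job -info <JOB-ID>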

Input and output data

The input data is grouped into records, each containing a set of string pairs which can be treated as key-value pairs. The number of key-value pairs may vary, but a significant subset of the input records should share a common set of keys.

The input data must be stored in Hadoop sequence files. Every input record is serialized with Protocol Buffers into a byte array and stored in the sequence file as a BytesWritable object. The output data is also stored in sequence files and serialized with Protocol Buffers.
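
For illustration, here is a minimal sketch of producing such a file with the plain Hadoop API. The records are assumed to be already serialized (e.g. with InputEntry.toByteArray()); the NullWritable key type and the file name are assumptions of this sketch, not requirements stated by the module.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

public class InputDataWriter {

    // Writes already-serialized records (Protocol Buffers byte arrays)
    // into a sequence file inside the given HDFS directory.
    public static void writeRecords(Configuration conf, String dir,
            Iterable<byte[]> serializedEntries) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(dir, "part-00000"); // hypothetical file name
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, NullWritable.class, BytesWritable.class);
        try {
            for (byte[] entryBytes : serializedEntries) {
                writer.append(NullWritable.get(), new BytesWritable(entryBytes));
            }
        } finally {
            writer.close();
        }
    }
}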

The Protocol Buffers classes used in the statistics generator module are defined in pl.edu.icm.coansys.models.StatisticsProtos:

pl.edu.icm.coansys.models.StatisticsProtos$InputEntry – for the input data;

the outputs of the 2nd and 3rd phases (see the Data processing phases section) use dedicated message types from the same StatisticsProtos container.

Data processing phases

The data processing is performed in three phases: filtering the input data, calculating statistics, and aggregating (grouping, sorting and limiting) the statistics. The first and third phases are optional; to disable them, put the following into the job.properties file:

filter_input_data=false
aggregate_statistics=false

Phase 1 - input data filtering

To filter the input data, you can write a formula containing the and, or and not operators and brackets. Put the formula into the job.properties file as the input_filter_formula property, e.g.:

input_filter_formula=(af1 and af2) or not af3

af1, af2 and af3 are labels of atomic filters which have to be defined as follows:

input_filter_names=af1;af2;af3
input_filter_classes=EQFILTER;EQFILTER;NONEMPTYFILTER
input_filter_classes_args=key1#value1;key2#value2;key3

EQFILTER is the label of a predefined filtering class which checks whether the input record contains the pair <key1, value1> (for filter af1) or <key2, value2> (for filter af2).

NONEMPTYFILTER identifies a class which checks whether the input record contains a key-value pair with the given key (key3 in our example) and a non-empty value.

As you can see, the semicolon (;) separates atomic filters and the options belonging to individual filters, while the number sign (hash, #) separates the options of a single filter. The same convention is used in other properties.
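
For example, with the formula (af1 and af2) or not af3 above, a record containing both <key1, value1> and <key2, value2> passes the filter regardless of key3, and a record that lacks key3 (or has an empty value for it) passes as well.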

You can implement your own filtering class and use it here. See the User-defined operations section.

The output of the filtering phase is a subset of the original data, in exactly the same format.

Phase 2 - calculating statistics

Partitions

As indicated in the section Input and output data, the input data is a set of records, each containing key-value pairs. Alternatively, you can picture this data as a multidimensional space, where the keys are dimensions and their values are positions within those dimensions.

In the second phase, after filtering, selected dimensions of the space are divided into parts (every part receives an id). As a result, the whole space is divided into segments, called "partitions" below. Every partition is identified by a set of key/part-id pairs.

Usually, one record belongs to one partition, but this is not a strict rule.

To configure the partitioning of the input data, set the following options (a sample configuration is shown after the list of predefined classes below):

partitions_names – the names of the keys used to partition the input data, separated by semicolons;

partitions_classes – the labels of the partitioning classes, predefined or user-defined. The number of classes must equal the number of partition names; they must be separated by semicolons;

partition_classes_args – partitioning classes may require parameters, which must be placed here. The parameter sets of individual classes are separated by semicolons, and the parameters of a single class are separated by number signs (#).

Here is the list of predefined partitioning classes:

EQUALS – puts records with the same value of the given key into the same partition. This class doesn't require any additional parameters; the value of the key serves as the identifier of the part.

REGEX – partitions the data according to a given regular expression. The regular expression, containing at least one group (in brackets), must be specified as an additional parameter of the class. If the second or a further group is intended to be the part identifier, the number of that group must be given as a second parameter;

DATERANGES – creates partitions for date ranges, each ending at the current date and beginning n days in the past (before the current date, or before a date specified in the configuration as the end of the logs). There is also one partition containing all records (so the partitions aren't disjoint). For example, the parameters 7#30#180 mean that four partitions will be created: 1) last 7 days, 2) last 30 days, 3) last 180 days, 4) all data. Besides the numbers of days, additional options such as the timezone or the end date of the logs can be given, e.g.: 7#30#timezone=Europe/Warsaw#logsEnd=2014-05-01
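
To make this concrete, a hypothetical configuration partitioning the records by an exact resourceId value and by date ranges over an eventDate key might look as follows (the key names are illustrative, and the empty parameter slot for EQUALS, which takes no arguments, is an assumption of this sketch):

partitions_names=resourceId;eventDate
partitions_classes=EQUALS;DATERANGES
partition_classes_args=;7#30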

Users can also define their own partitioning classes; see the section User-defined operations.

Statistics

After partitioning, statistics are computed for each partition. The class which computes a statistic receives an iterator over all records in the partition and can use any data available in those records. The computed statistic is a floating-point number (double).

The simplest class (predefined in the module) simply counts the records in a partition. Its identifier (needed in the configuration) is COUNT.

The configuration of statistics is similar to the configuration of partitioning. It consists of the following options, illustrated by the sample after the list:

statistics_names – labels for the computed statistics, separated by semicolons;

statistics_classes – identifiers of the classes computing the statistics (e.g. COUNT);

statistics_classes_args – additional parameters for the classes computing the statistics.
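
For instance, a hypothetical configuration counting the records in each partition with the predefined COUNT class (which needs no arguments) could be:

statistics_names=visits
statistics_classes=COUNT
statistics_classes_args=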

In this phase, one output record per partition is generated. It contains the data identifying the partition (the key/part-id pairs) and all statistics computed for that partition.

Phase 3 - grouping, sorting, limiting

In the third (optional) phase, operations similar to the SQL GROUP BY, ORDER BY and LIMIT clauses are applied to the records containing the computed statistics.

Statistics are grouped according to a subset of the keys that were used to partition the input data (partitions_names). Put those keys into the group_keys variable.

The sort_stat variable contains the name of the statistic (see statistics_names) which will be used to sort the records within each group.

sort_order contains the sorting order, ASC or DESC.

The limit variable contains the number of records which will be kept.

The output of this phase is a single output record wrapping the statistics records, at most as many of them as specified in the limit variable.
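
Continuing the hypothetical example above, a configuration keeping the 10 records with the most visits in each date-range group could be:

group_keys=eventDate
sort_stat=visits
sort_order=DESC
limit=10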

User-defined operations

As mentioned in the previous sections, users can define their own classes to filter the input data, partition the data or compute statistics. Those classes should implement one of the following interfaces:

pl.edu.icm.coansys.statisticsgenerator.operationcomponents.FilterComponent
pl.edu.icm.coansys.statisticsgenerator.operationcomponents.Partitioner
pl.edu.icm.coansys.statisticsgenerator.operationcomponents.StatisticsCalculator

A common method of those interfaces is

public void setup(String... params)

The parameters set in the configuration in the *_classes_args variables are passed to this method.

Furthermore, you have to implement one specialized method. In the filtering class:

public boolean filter(InputEntry entry)

For the given input record, the method must return true if the record satisfies the filter.

In the partitioning class:

public String[] partition(String inputField)

For a given value, which comes from a key-value pair of a certain input record, the method should return an array of identifiers of the partitions the record belongs to. Typically this will be a single-element array.
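
As an example, here is a sketch of a custom partitioning class that assigns records to parts by the first few characters of the field value. The interface and its methods are the ones listed above; the class itself and its package are hypothetical.

package com.example.stats; // hypothetical package

import pl.edu.icm.coansys.statisticsgenerator.operationcomponents.Partitioner;

public class PrefixPartitioner implements Partitioner {

    private int prefixLength = 1;

    @Override
    public void setup(String... params) {
        // the first parameter from partition_classes_args, e.g. "3"
        if (params.length > 0) {
            prefixLength = Integer.parseInt(params[0]);
        }
    }

    @Override
    public String[] partition(String inputField) {
        int end = Math.min(prefixLength, inputField.length());
        // here every record belongs to exactly one part
        return new String[]{inputField.substring(0, end)};
    }
}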

In the statistics class:

public double calculate(Iterable<BytesWritable> messages)

This method iterates through the serialized records of a partition (Protocol Buffers byte arrays wrapped in BytesWritable) and returns the computed statistic.
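
Similarly, a sketch of a statistics class mirroring the predefined COUNT operation: it simply counts the serialized records it receives.

import org.apache.hadoop.io.BytesWritable;

import pl.edu.icm.coansys.statisticsgenerator.operationcomponents.StatisticsCalculator;

public class RecordCounter implements StatisticsCalculator {

    @Override
    public void setup(String... params) {
        // this calculator needs no parameters
    }

    @Override
    public double calculate(Iterable<BytesWritable> messages) {
        long count = 0;
        for (BytesWritable ignored : messages) {
            count++;
        }
        return count;
    }
}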

To use your classes, build a jar containing them and place it in the lib subdirectory of the workflow (see Getting started).

In the configuration, set two additional parameters:

user_operations – place here labels for your custom operations, separated by semicolons. You will be able to use them in the input_filter_classes, partitions_classes and statistics_classes variables;

user_operation_classes – the fully qualified names of your classes (package plus class name), in the same order as their labels, also separated by semicolons. A sample registration is shown below.
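
For example, registering the hypothetical PrefixPartitioner sketched above and using it for partitioning could look like this:

user_operations=PREFIX
user_operation_classes=com.example.stats.PrefixPartitioner
partitions_names=title
partitions_classes=PREFIX
partition_classes_args=3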

Use statistics generator as a subworkflow

The statistics generator project is built using oozie-maven-plugin, so you can simply use it as a subproject, without downloading and building it yourself (it will be downloaded automatically from the Maven repository). For detailed information, see the oozie-maven-plugin documentation:

https://github.com/CeON/oozie-maven-plugin