Fast Data: Application Logs

In this demo we take a look at interactively analyzing application logs. As the source of the application logs we're using WordPress, a popular blogging engine.

The demo shows how to ingest the application logs into Minio, an object store akin to Amazon S3, and how to query those logs with SQL using Apache Drill, a distributed schema-free query engine.

  • Estimated time for completion:
      • Install: 20min
  • Target audience: Anyone interested in interactive application log analysis.

Table of Contents:

  • Architecture
  • Prerequisites
  • Install
  • Use
  • Discussion

Architecture

Application Logs demo architecture

Log data is generated in WordPress (WP) by an end user interacting with it; this data is loaded into Minio, and Apache Drill is then used to interactively query it.

Prerequisites

  • A running DC/OS 1.8.7 or higher cluster with at least 3 private agents and 1 public agent, each with 2 CPUs and 5 GB of RAM available, as well as the DC/OS CLI installed in version 0.14 or higher.
  • The dcos/demos Git repo must be available locally; if you haven't cloned it yet, use: git clone https://github.com/dcos/demos.git
  • The JSON query utility jq must be installed.
  • SSH cluster access must be set up.

Going forward, we'll refer to the directory you cloned the dcos/demos Git repo into as $DEMO_HOME.

Install

Marathon-LB

For Minio and Apache Drill we need to have Marathon-LB installed:

$ dcos package install marathon-lb
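One quick way to check that the load balancer deployed successfully is to list the Marathon apps and look for a healthy marathon-lb entry:

$ dcos marathon app list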

Minio

In this demo we use Minio to serve the log data for analysis in Drill, just as you would use, say, S3 on AWS.

To set up Minio, find out the IP of the public agent and store it in an environment variable called $PUBLIC_AGENT_IP, for example:

$ export PUBLIC_AGENT_IP=52.24.255.200
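If you don't know the public agent's IP yet, one way to narrow it down is to list the agents carrying the public_ip attribute, which the standard DC/OS cloud templates set (an assumption about your setup). Note that on cloud providers the hostname reported here is usually the internal address, so you may still have to map it to the external IP in your cloud console:

$ dcos node --json | jq -r '.[] | select(.attributes.public_ip == "true") | .hostname'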

Now you can install the Minio package like so:

$ cd $DEMO_HOME/1.8/applogs/
$ ./install-minio.sh

After this, Minio is available on port 80 of the public agent, so open http://$PUBLIC_AGENT_IP in your browser and you should see its UI.
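If you prefer the command line, a quick header check confirms that Minio is answering before you open the browser; any HTTP response means the service is up (the exact status code and headers depend on the Minio version):

$ curl -I http://$PUBLIC_AGENT_IP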

Next, we will need to get the Minio credentials in order to access the Web UI (and later on the HTTP API). The credentials used by Minio are akin to the ones you might know from Amazon S3, called $ACCESS_KEY_ID and $SECRET_ACCESS_KEY. In order to obtain these credentials, go to the Services tab of the DC/OS UI and select the running Minio service; click on the Logs tab and you should see:

Obtaining Minio credentials

Note that you can learn more about Minio and the credentials in the respective example.
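Instead of the UI, you can also pull the credentials straight from the task log on the command line. This assumes the Minio server prints its AccessKey and SecretKey on startup, as current releases do; adjust the pattern if your version's log output differs:

$ dcos task log --lines 100 minio | grep -i -E 'accesskey|secretkey'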

Apache Drill

Apache Drill is a distributed SQL query engine, allowing you to interactively explore heterogeneous datasets across data sources (CSV, JSON, HDFS, HBase, MongoDB, S3).

A prerequisite for the Drill install to work is that three environment variables are defined: $PUBLIC_AGENT_IP (the public agent IP address), as well as $ACCESS_KEY_ID and $SECRET_ACCESS_KEY (the Minio credentials); all of these are explained in the previous section. I've been using the following (specific to my setup):

$ export PUBLIC_AGENT_IP=52.24.255.200
$ export ACCESS_KEY_ID=MRQZLLB72IJRPUGY30MJ
$ export SECRET_ACCESS_KEY=f5nGdq3lxlvpJF1nMOFAgk8h71ZMlM0h4fzUwakj

Now do the following to install Drill:

$ cd $DEMO_HOME/1.8/applogs/
$ ./install-drill.sh
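Drill can take a minute or two to come up; one way to watch the deployment is to list its tasks (assuming the install script registers the service under the name drill — check the Services tab of the DC/OS UI if the filter comes back empty):

$ dcos task drill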

Go to http://$PUBLIC_AGENT_IP:8047/ to access the Drill Web UI:

Apache Drill Web UI

Next we need to configure the S3 storage plugin in order to access data on Minio. For this, go to the Storage tab in Drill, enable the s3 plugin, click on the Update button, and paste the content of your (local) drill-s3-plugin-config.json into the field, overwriting everything that was there before:

Apache Drill storage plugin config
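The authoritative content is the drill-s3-plugin-config.json shipped with the demo in $DEMO_HOME/1.8/applogs/. Purely as an illustration of the general shape, and not the demo's actual file, a Drill s3 storage plugin configuration pointing at Minio looks roughly like this (the bucket name, endpoint, and format settings below are assumptions):

{
  "type": "file",
  "enabled": true,
  "connection": "s3a://test",
  "config": {
    "fs.s3a.access.key": "<ACCESS_KEY_ID>",
    "fs.s3a.secret.key": "<SECRET_ACCESS_KEY>",
    "fs.s3a.endpoint": "http://<PUBLIC_AGENT_IP>:80"
  },
  "workspaces": {
    "root": { "location": "/", "writable": false, "defaultInputFormat": null }
  },
  "formats": {
    "log": { "type": "text", "extensions": ["log"], "delimiter": " ", "extractHeader": true }
  }
}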

After another click on the Update button the data is stored in ZooKeeper and persists even if you restart Drill.

To check if everything is working fine, go to Minio, create a test bucket, and upload drill/apache.log into it. Now, go to the Drill UI, change to the Query tab, and execute the following query to verify your setup:

select * from s3.`apache.log`

You should see something like the following:

Apache Drill test query result

WordPress

Next we install WordPress, acting as the data source for the logs. Note that the environment variable $PUBLIC_AGENT_IP must still be exported:

$ cd $DEMO_HOME/1.8/applogs/
$ ./install-wp.sh

Discover where WP is available via the HAProxy stats page at http://$PUBLIC_AGENT_IP:9090/haproxy?stats (look for the wordpress_XXXXX frontend):

WP on Marathon-LB
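Instead of scanning the stats page visually, you can also query the CSV endpoint that Marathon-LB's HAProxy exposes and filter for the WordPress entries; the frontend name typically encodes the service port:

$ curl -s "http://$PUBLIC_AGENT_IP:9090/haproxy?stats;csv" | grep wordpress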

In my case, WP is available via port 10102 on the public agent, that is, via http://$PUBLIC_AGENT_IP:10102/:

WP setup

Finally, complete the WP install so that it can be used.

Use

The following sections describe how to use the demo after having installed it.

First interact with WP, that is, create some posts and surf around. Then, to capture the logs, execute the following locally (on your machine):

$ cd $DEMO_HOME/1.8/applogs/
$ echo remote ignore0 ignore1 timestamp request status size origin agent > session.log && dcos task log --lines 1000 wordpress | tail -n +30 | sed 's, \[\(.*\)\] , \"\1\" ,' >> session.log

This writes a header line with the column names to session.log, then appends the most recent 1,000 lines of the WordPress task log, dropping the leading non-access-log output and wrapping the bracketed timestamp in quotes so that it is treated as a single field.

Next upload session.log into the test bucket in Minio.
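You can do the upload via the Minio browser UI as before; alternatively, assuming you have the Minio client mc installed (it is not part of this demo), a command-line upload would look roughly like this, with demo as a hypothetical alias for your Minio endpoint:

$ mc config host add demo http://$PUBLIC_AGENT_IP $ACCESS_KEY_ID $SECRET_ACCESS_KEY
$ mc cp session.log demo/test/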

Now you can use Drill to understand the usage patterns, for example:

List HTTP requests with more than 1000 bytes payload:

select remote, request from s3.`session.log` where size > 1000

The above query results in something like:

Apache Drill query result 1

List HTTP requests that succeeded (HTTP status code 200):

select remote, request, status from s3.`session.log` where status = 200

The above query results in something like:

Apache Drill query result 2
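Since Drill speaks standard SQL, aggregations work as well; for example, a (hypothetical) breakdown of requests per HTTP status code:

select status, count(*) as requests from s3.`session.log` group by status order by requests desc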

Discussion

In this demo we ingested application log data from WordPress into Minio and queried it using Apache Drill.

  • An area for improvement is the ingestion process, which is currently carried out locally, that is, by manually using the DC/OS CLI on your machine. A more advanced scenario would, for example, use a DC/OS Job to periodically ingest the logs in a timestamped manner into Minio.
  • While Drill is set up in distributed mode, currently only a single Drillbit is used; by scaling the Drill service, one can query more data faster.
  • The current result is of tabular form (as a result of the SQL queries issued). A more insightful way to render the query results would be to use BI tools such as Tableau or Datameer, connecting them via the JDBC interface.

Should you have any questions or suggestions concerning the demo, please raise an issue in Jira or let us know via the [email protected] mailing list.