---
copyright:
lastupdated: "2019-03-07"
---
{:shortdesc: .shortdesc} {:new_window: target="_blank"} {:codeblock: .codeblock} {:screen: .screen} {:tip: .tip} {:pre: .pre}
{: #big-data-log-analytics}
In this tutorial, you will build a log analysis pipeline designed to collect, store and analyze log records to support regulatory requirements or aid information discovery. This solution leverages several services available in {{site.data.keyword.cloud_notm}}: {{site.data.keyword.messagehub}}, {{site.data.keyword.cos_short}}, SQL Query and {{site.data.keyword.streaminganalyticsshort}}. A program will assist you by simulating transmission of web server log messages from a static file to {{site.data.keyword.messagehub}}.
With {{site.data.keyword.messagehub}}, the pipeline scales to receive millions of log records from a variety of producers. By applying {{site.data.keyword.streaminganalyticsshort}}, log data can be inspected in real time to integrate business processes. Log messages can also be redirected to long-term storage using {{site.data.keyword.cos_short}}, where developers, support staff and auditors can work directly with the data using SQL Query.
While this tutorial focuses on log analysis, it is applicable to other scenarios: storage-limited IoT devices can similarly stream messages to {{site.data.keyword.cos_short}} or marketing professionals can segment and analyze customer events across digital properties with SQL Query. {:shortdesc}
{: #objectives}
- Understand Apache Kafka publish-subscribe messaging
- Store log data for audit and compliance requirements
- Monitor logs to create exception handling processes
- Conduct forensic and statistical analysis on log data
{: #services}
This tutorial uses the following runtimes and services:
- {{site.data.keyword.cos_short}}
- {{site.data.keyword.messagehub}}
- {{site.data.keyword.sqlquery_short}}
- {{site.data.keyword.streaminganalyticsshort}}
This tutorial may incur costs. Use the Pricing Calculator to generate a cost estimate based on your projected usage.
{: #architecture}
- Application generates log events to {{site.data.keyword.messagehub}}
- Log event is intercepted and analyzed by {{site.data.keyword.streaminganalyticsshort}}
- Log event is appended to a CSV file located in {{site.data.keyword.cos_short}}
- Auditor or support staff issues SQL job
- SQL Query executes on log file in {{site.data.keyword.cos_short}}
- Result set is stored in {{site.data.keyword.cos_short}} and delivered to auditor and support staff
{: #prereqs}
- Install Git
- Install {{site.data.keyword.Bluemix_notm}} CLI
- Install Node.js
- Download Kafka 0.10.2.X client
{: #setup}
In this section, you will create the services required to perform analysis of log events generated by your applications.
This section uses the command line to create service instances. Alternatively, you may do the same from the service page in the catalog using the provided links. {: tip}
- Log in to {{site.data.keyword.cloud_notm}} via the command line and target your Cloud Foundry account. See CLI Getting Started.
{: pre}
ibmcloud login
{: pre}
ibmcloud target --cf
- Create a Lite instance of {{site.data.keyword.cos_short}}.
{: pre}
ibmcloud resource service-instance-create log-analysis-cos cloud-object-storage lite global
- Create a Lite instance of SQL Query.
{: pre}
ibmcloud resource service-instance-create log-analysis-sql sql-query lite us-south
- Create a Standard instance of {{site.data.keyword.messagehub}}.
{: pre}
ibmcloud service create messagehub standard log-analysis-hub
{: #topics}
Begin by creating a {{site.data.keyword.messagehub}} topic and {{site.data.keyword.cos_short}} bucket. Topics define where applications deliver messages in publish-subscribe messaging systems. After messages are received and processed, they will be stored within a file located in an {{site.data.keyword.cos_short}} bucket.
- In your browser, access the `log-analysis-hub` service instance from the Resource List.
- Click the + button to create a topic.
- Enter the Topic Name `webserver` and click the Create topic button.
- Click Service Credentials and the New Credential button.
- In the resulting dialog, type `webserver-flow` as the Name and click the Add button.
- Click View Credentials and copy the information to a safe place. It will be used in the next section.
- Back in the Resource List, select the `log-analysis-cos` service instance.
- Click Create bucket.
- Enter a unique Name for the bucket.
- Select Cross Region for Resiliency.
- Select us-geo as the Location.
- Click Create bucket.
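To make the role of a topic concrete, the following is a minimal in-memory sketch of publish-subscribe messaging. It is illustrative only: `TinyBroker` and its methods are hypothetical names, not the Kafka or {{site.data.keyword.messagehub}} API, and a real broker also persists messages and tracks consumer offsets.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class TinyBroker:
    """Toy in-memory broker; illustrative only, not the Kafka API."""

    def __init__(self) -> None:
        # Maps a topic name to the handlers subscribed to it.
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to every subscriber of the topic; return the count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = TinyBroker()
stored: List[str] = []   # stands in for long-term storage
alerts: List[str] = []   # stands in for a real-time analysis step

# Two independent consumers listen on the same topic.
broker.subscribe("webserver", stored.append)
broker.subscribe("webserver", alerts.append)

delivered = broker.publish("webserver", '{"responseCode": 200}')
ignored = broker.publish("other-topic", "nobody is listening")
print(delivered, ignored)
```

The producer only names the topic. It does not need to know how many consumers exist, which is what lets storage and analysis be attached to the same stream of log messages independently.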
{: #streamsflow}
In this section, you will begin configuring a Streams flow that receives log messages. The {{site.data.keyword.streaminganalyticsshort}} service is powered by {{site.data.keyword.streamsshort}}, which can analyze millions of events per second, enabling sub-millisecond response times and instant decision-making.
- In your browser, access Watson Data Platform.
- Select the New project button or tile, then the Basic tile and click OK.
  - Enter the Name `webserver-logs`.
  - The Storage option should be set to `log-analysis-cos`. If not, select the service instance.
  - Click the Create button.
- On the resulting page, select the Settings tab and check Streams Designer in Tools. Finish by clicking the Save button.
- Click the Add to project button then Streams flow from the top navigation bar.
  - Click Associate an IBM Streaming Analytics instance with a container-based plan.
  - Create a new {{site.data.keyword.streaminganalyticsshort}} instance by selecting the Lite radio button and clicking Create. Do not select Lite VM.
  - Provide the Service name as `log-analysis-sa` and click Confirm.
  - Type the streams flow Name as `webserver-flow`.
  - Finish by clicking Create.
- On the resulting page, select the {{site.data.keyword.messagehub}} tile.
  - Click Add Connection and select your `log-analysis-hub` {{site.data.keyword.messagehub}} instance. If you do not see your instance listed, select the IBM {{site.data.keyword.messagehub}} option. Manually enter the Connection details that you obtained from the Service credentials in the previous section. Name the connection `webserver-flow`.
  - Click Create to create the connection.
  - Select `webserver` from the Topic dropdown.
  - Select Start with the first new message from the Initial Offset dropdown.
  - Click Continue.
- Leave the Preview Data page open; it will be used in the next section.
{: #kafkatools}
The `webserver-flow` is currently idle and awaiting messages. In this section, you will configure Kafka console tools to work with {{site.data.keyword.messagehub}}. Kafka console tools allow you to produce arbitrary messages from the terminal and send them to {{site.data.keyword.messagehub}}, which will trigger the `webserver-flow`.
- Download and unzip the Kafka 0.10.2.X client.
- Change directory to `bin` and create a text file named `message-hub.config` with the following contents.
{: pre}
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="USER" password="PASSWORD";
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
ssl.protocol=TLSv1.2
ssl.enabled.protocols=TLSv1.2
ssl.endpoint.identification.algorithm=HTTPS
- Replace `USER` and `PASSWORD` in your `message-hub.config` file with the `user` and `password` values seen in Service Credentials from the previous section. Save `message-hub.config`.
- From the `bin` directory, run the following command. Replace `KAFKA_BROKERS_SASL` with the `kafka_brokers_sasl` value seen in Service Credentials. An example is provided.
{: pre}
./kafka-console-producer.sh --broker-list KAFKA_BROKERS_SASL \
  --producer.config message-hub.config --topic webserver
./kafka-console-producer.sh --broker-list \
  kafka04-prod02.messagehub.services.us-south.bluemix.net:9093,kafka05-prod02.messagehub.services.us-south.bluemix.net:9093,kafka02-prod02.messagehub.services.us-south.bluemix.net:9093,kafka01-prod02.messagehub.services.us-south.bluemix.net:9093,kafka03-prod02.messagehub.services.us-south.bluemix.net:9093 \
  --producer.config message-hub.config --topic webserver
- The Kafka console tool is awaiting input. Copy and paste the log message from below into the terminal. Hit `enter` to send the log message to {{site.data.keyword.messagehub}}. Notice the sent messages also display on the `webserver-flow` Preview Data page.
{: pre}
{ "host": "199.72.81.55", "timestamp": "01/Jul/1995:00:00:01 -0400", "request": "GET /history/apollo/ HTTP/1.0", "responseCode": 200, "bytes": 6245 }
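The credential substitutions in the steps above can also be scripted. The following is a minimal sketch, assuming the Service Credentials JSON exposes `user`, `password` and `kafka_brokers_sasl` (taken here to be a list of `host:port` strings). The credential values are placeholders, and `render_config` and `broker_list` are hypothetical helper names.

```python
from typing import Dict

# Same properties as message-hub.config, with placeholders for the credentials.
CONFIG_TEMPLATE = "\n".join([
    "sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule "
    'required username="{user}" password="{password}";',
    "security.protocol=SASL_SSL",
    "sasl.mechanism=PLAIN",
    "ssl.protocol=TLSv1.2",
    "ssl.enabled.protocols=TLSv1.2",
    "ssl.endpoint.identification.algorithm=HTTPS",
]) + "\n"


def render_config(credentials: Dict[str, object]) -> str:
    """Fill the USER and PASSWORD slots of the config file."""
    return CONFIG_TEMPLATE.format(
        user=credentials["user"], password=credentials["password"]
    )


def broker_list(credentials: Dict[str, object]) -> str:
    """Build the comma-separated value expected by --broker-list."""
    brokers = credentials["kafka_brokers_sasl"]
    if isinstance(brokers, str):
        return brokers  # already a single comma-separated string
    return ",".join(brokers)


# Placeholder credentials (assumption); paste your own Service Credentials instead.
example_credentials = {
    "user": "EXAMPLE_USER",
    "password": "EXAMPLE_PASSWORD",
    "kafka_brokers_sasl": [
        "kafka01-prod02.messagehub.services.us-south.bluemix.net:9093",
        "kafka02-prod02.messagehub.services.us-south.bluemix.net:9093",
    ],
}

config_text = render_config(example_credentials)
brokers = broker_list(example_credentials)
print(config_text)
print(brokers)
```

Write `config_text` to `message-hub.config` and pass `brokers` as the `--broker-list` argument.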
{: #streamstarget}
In this section, you will complete the streams flow configuration by defining a target. The target will be used to store incoming log messages in the {{site.data.keyword.cos_short}} bucket created earlier. The process of storing and appending incoming log messages to a file will be done automatically by {{site.data.keyword.streaminganalyticsshort}}.
- On the `webserver-flow` Preview Page, click the Continue button.
- Select the {{site.data.keyword.cos_full_notm}} tile as a target.
  - Click Add Connection and select `log-analysis-cos`.
  - Click Create.
  - Enter the File path `/YOUR_BUCKET_NAME/http-logs_%TIME.csv`. Replace `YOUR_BUCKET_NAME` with the one used in the first section.
  - Select csv in the Format dropdown.
  - Check the Column header row checkbox.
  - Select File Size in the File Creation Policy dropdown.
  - Set the limit to be 100MB by entering `102400` in the File Size (KB) textbox.
  - Click Continue.
- Click Save.
- Click the > play button to Start the streams flow.
- After the flow is started, again send multiple log messages from the Kafka console tool. You can watch as messages arrive by viewing the `webserver-flow` in Streams Designer.
{: pre}
{ "host": "199.72.81.55", "timestamp": "01/Jul/1995:00:00:01 -0400", "request": "GET /history/apollo/ HTTP/1.0", "responseCode": 200, "bytes": 6245 }
- Return to your bucket in {{site.data.keyword.cos_short}}. A new `http-logs_TIME.csv` file will exist after enough messages have entered the flow or the flow is restarted.
{: #streamslogic}
Up to now, the Streams flow is a simple pipe, moving messages from {{site.data.keyword.messagehub}} to {{site.data.keyword.cos_short}}. More than likely, teams will want to know about events of interest in real time. For example, individual teams might benefit from alerts when HTTP 500 (application error) events occur. In this section, you will add conditional logic to the flow to separate HTTP 200 (OK) responses from responses with codes of 300 and above.
- Use the pencil button to Edit the streams flow.
- Create a filter node that handles HTTP 200 responses.
  - From the Nodes palette, drag the Filter node from PROCESSING AND ANALYTICS to the canvas.
  - Type `OK` in the name textbox, which currently contains the word `Filter`.
  - Enter the following statement in the Condition Expression text area.
{: pre}
responseCode == 200
  - With your mouse, draw a line from the {{site.data.keyword.messagehub}} node's output (right side) to your OK node's input (left side).
  - From the Nodes palette, drag the Debug node found under TARGETS to the canvas.
  - Connect the Debug node to the OK node by drawing a line between the two.
- Repeat the process to create a `Not OK` filter using the same nodes and the following condition statement.
{: pre}
responseCode >= 300
- Click the play button to Save and run the streams flow.
- If prompted, click the link to run the new version.
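The two filter conditions can be checked outside Streams Designer. The following is a small sketch that applies the same expressions to a handful of made-up events. The function names and the sample response codes are illustrative assumptions, not part of the streams flow.

```python
from typing import Dict, List


def is_ok(event: dict) -> bool:
    # Mirrors the OK filter condition: responseCode == 200
    return event["responseCode"] == 200


def is_not_ok(event: dict) -> bool:
    # Mirrors the Not OK filter condition: responseCode >= 300
    return event["responseCode"] >= 300


def route(events: List[dict]) -> Dict[str, List[int]]:
    """Group response codes by the filter branch they would reach."""
    branches: Dict[str, List[int]] = {"OK": [], "Not OK": [], "unmatched": []}
    for event in events:
        if is_ok(event):
            branches["OK"].append(event["responseCode"])
        elif is_not_ok(event):
            branches["Not OK"].append(event["responseCode"])
        else:
            branches["unmatched"].append(event["responseCode"])
    return branches


# Made-up sample events covering each case.
sample_events = [{"responseCode": code} for code in (200, 304, 404, 500, 204)]
result = route(sample_events)
print(result)
```

With these two conditions, a response such as 204 reaches neither Debug node, because it is neither equal to 200 nor at least 300. If every non-200 response should be flagged, a condition such as `responseCode != 200` would be needed instead.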
{: #streamsload}
To view conditional handling in your Streams flow, you will increase the message volume sent to {{site.data.keyword.messagehub}}. The provided Node.js program simulates a realistic flow of messages to {{site.data.keyword.messagehub}} based on traffic to the webserver. To demonstrate the scalability of {{site.data.keyword.messagehub}} and {{site.data.keyword.streaminganalyticsshort}}, you will increase the throughput of log messages.
This section uses node-rdkafka. See the npmjs page for troubleshooting instructions if the simulator installation fails. If problems persist, you can skip to the next section and manually upload the data.
- Download and unzip the Jul 01 to Jul 31, ASCII format, 20.7 MB gzip compressed log file from NASA.
- Clone and install the log simulator from IBM-Cloud on GitHub.
{: pre}
git clone https://github.com/IBM-Cloud/kafka-log-simulator.git
- Change to the simulator's directory and run the following commands to set up the simulator and produce log event messages. Replace `LOGFILE` with the file you downloaded. Replace `BROKERLIST` and `APIKEY` with the corresponding Service Credentials used earlier. An example is provided.
npm install
npm run build
{: pre}
node dist/index.js --file LOGFILE --parser httpd --broker-list BROKERLIST \
  --api-key APIKEY --topic webserver --rate 100
node dist/index.js --file /Users/ibmcloud/Downloads/NASA_access_log_Jul95 \
  --parser httpd --broker-list \
  "kafka04-prod02.messagehub.services.us-south.bluemix.net:9093,kafka05-prod02.messagehub.services.us-south.bluemix.net:9093,kafka02-prod02.messagehub.services.us-south.bluemix.net:9093,kafka01-prod02.messagehub.services.us-south.bluemix.net:9093,kafka03-prod02.messagehub.services.us-south.bluemix.net:9093" \
  --api-key Np15YZKN3SCdABUsOpJYtpue6jgJ7CwYgsoCWaPbuyFbdM4R \
  --topic webserver --rate 100
- In your browser, return to your `webserver-flow` after the simulator begins producing messages.
- Stop the simulator after a desired number of messages have gone through the conditional branches using `control+C`.
- Experiment with {{site.data.keyword.messagehub}} scaling by increasing or decreasing the `--rate` value.
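The `--parser httpd` option suggests that the simulator turns each raw web server log line into a JSON message like the one you pasted earlier. The following is a rough Python sketch of that idea, assuming the log uses the Common Log Format. The regular expression, the function name and the choice to treat a `-` byte count as 0 are assumptions for illustration, not the simulator's actual code.

```python
import json
import re
from typing import Optional

# Common Log Format: host ident user [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>.*)" (?P<status>\d{3}) (?P<bytes>\d+|-)$'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return the log line as a message dict, or None if it is malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    size = match.group("bytes")
    return {
        "host": match.group("host"),
        "timestamp": match.group("timestamp"),
        "request": match.group("request"),
        "responseCode": int(match.group("status")),
        # Assumption: a "-" byte count (no body sent) is recorded as 0.
        "bytes": 0 if size == "-" else int(size),
    }


sample = (
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
    '"GET /history/apollo/ HTTP/1.0" 200 6245'
)
message = parse_log_line(sample)
print(json.dumps(message))
```

Returning `None` for lines that do not match lets a caller skip or count malformed records instead of failing partway through a large file.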
The simulator will delay sending the next message based on the elapsed time in the webserver log. Setting `--rate 1` sends events in real time. Setting `--rate 100` means that for every 1 second of elapsed time in the webserver log, a 10ms delay between messages is used.
{: tip}
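The arithmetic behind the rate setting can be sketched as follows, assuming the delay is simply the elapsed log time divided by the rate, which is how the tip above reads. `delay_seconds` is a hypothetical helper, not a function from the simulator.

```python
def delay_seconds(elapsed_log_seconds: float, rate: float) -> float:
    """Wall-clock delay before the next message for a given replay rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return elapsed_log_seconds / rate


# One second between two log entries, replayed at different rates.
realtime_delay = delay_seconds(1.0, 1)    # --rate 1: wait the full second
fast_delay = delay_seconds(1.0, 100)      # --rate 100: wait 10 ms
print(realtime_delay, fast_delay)
```

Under this reading, doubling the rate halves the time needed to replay the whole log file.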
{: #sqlquery}
Depending on the number of messages sent by the simulator, the log file on {{site.data.keyword.cos_short}} has certainly grown in file size. You will now act as an investigator answering audit or compliance questions by combining SQL Query with your log file. The benefit of using SQL Query is that the log file is directly accessible - no additional transformations or database servers are necessary.
If you prefer not to wait for the simulator to send all log messages, upload the complete CSV file to {{site.data.keyword.cos_short}} to get started immediately. {: tip}
- Access the `log-analysis-sql` service instance from the Resource List. Select Open UI to launch SQL Query.
- Enter the following SQL into the Type SQL here ... text area.
-- What are the top 10 web pages on NASA from July 1995?
-- Which mission might be significant?
SELECT REQUEST, COUNT(REQUEST)
FROM cos://us-geo/YOUR_BUCKET_NAME/http-logs_TIME.csv
WHERE REQUEST LIKE '%.htm%'
GROUP BY REQUEST
ORDER BY 2 DESC
LIMIT 10
{: pre}
- Retrieve the Object SQL URL from the logs file.
  - From the Resource List, select the `log-analysis-cos` service instance.
  - Select the bucket you created previously.
  - Click the overflow menu on the `http-logs_TIME.csv` file and select Object SQL URL.
  - Copy the URL to the clipboard.
- Update the `FROM` clause with your Object SQL URL and click Run.
- The result can be seen on the Result tab. While some pages, like the Kennedy Space Center home page, are expected, one mission was quite popular at the time.
- Select the Query Details tab to view additional information such as the location where the result was stored on {{site.data.keyword.cos_short}}.
- Try the following question and answer pairs by adding them individually to the Type SQL here ... text area.
-- Who are the top 5 viewers?
SELECT HOST, COUNT(*)
FROM cos://us-geo/YOUR_BUCKET_NAME/http-logs_TIME.csv
GROUP BY HOST
ORDER BY 2 DESC
LIMIT 5
{: pre}
-- Which viewer has suspicious activity based on application failures?
SELECT HOST, COUNT(*)
FROM cos://us-geo/YOUR_BUCKET_NAME/http-logs_TIME.csv
WHERE `responseCode` == 500
GROUP BY HOST
ORDER BY 2 DESC;
{: pre}
-- Which requests showed a page not found error to the user?
SELECT DISTINCT REQUEST
FROM cos://us-geo/YOUR_BUCKET_NAME/http-logs_TIME.csv
WHERE `responseCode` == 404
{: pre}
-- What are the top 10 largest files?
SELECT DISTINCT REQUEST, BYTES
FROM cos://us-geo/YOUR_BUCKET_NAME/http-logs_TIME.csv
WHERE BYTES > 0
ORDER BY CAST(BYTES as Integer) DESC
LIMIT 10
{: pre}
-- What is the distribution of total traffic by hour?
SELECT SUBSTRING(TIMESTAMP, 13, 2), COUNT(*)
FROM cos://us-geo/YOUR_BUCKET_NAME/http-logs_TIME.csv
GROUP BY 1
ORDER BY 1 ASC
{: pre}
-- Why did the previous result return an empty hour?
-- Hint, find the malformed hostname.
SELECT HOST, REQUEST
FROM cos://us-geo/YOUR_BUCKET_NAME/http-logs_TIME.csv
WHERE SUBSTRING(TIMESTAMP, 13, 2) == ''
{: pre}
FROM clauses are not limited to a single file. Use `cos://us-geo/YOUR_BUCKET_NAME/` to run SQL queries on all files in the bucket.
{: tip}
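To see what the hourly distribution query computes, the following sketch performs the same grouping locally on a few sample rows. The column names are assumed to match the keys of the JSON log message. All rows except the first are made up for illustration, and `traffic_by_hour` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter
from typing import List, Tuple

# Assumed CSV layout: a header row followed by one line per log event.
SAMPLE_CSV = """host,timestamp,request,responseCode,bytes
199.72.81.55,01/Jul/1995:00:00:01 -0400,GET /history/apollo/ HTTP/1.0,200,6245
host-a.example,01/Jul/1995:00:15:42 -0400,GET /index.html HTTP/1.0,200,1204
host-b.example,01/Jul/1995:13:02:10 -0400,GET /missing.html HTTP/1.0,404,0
"""


def traffic_by_hour(csv_text: str) -> List[Tuple[str, int]]:
    """Count events per hour, like GROUP BY SUBSTRING(TIMESTAMP, 13, 2)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # SQL SUBSTRING is 1-based, so position 13 with length 2 is the slice [12:14].
    counts = Counter(row["timestamp"][12:14] for row in reader)
    return sorted(counts.items())


hourly = traffic_by_hour(SAMPLE_CSV)
print(hourly)
```

A row whose timestamp is shorter than expected would yield an empty string for the hour, which is the kind of malformed record the last query in the list above looks for.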
{: #expand}
Congratulations, you have built a log analysis pipeline with {{site.data.keyword.cloud_notm}}. Below are additional suggestions to enhance your solution.
- Use additional targets in Streams Designer to store data in {{site.data.keyword.cloudant_short_notm}} or execute code in {{site.data.keyword.openwhisk_short}}
- Follow the Build a data lake using Object Storage tutorial to add a dashboard to log data
- Integrate additional systems with {{site.data.keyword.messagehub}} using {{site.data.keyword.appconserviceshort}}.
{: #removal}
From the Resource List, use the Delete or Delete service menu item in the overflow menu to remove the following service instances.
- log-analysis-sa
- log-analysis-hub
- log-analysis-sql
- log-analysis-cos
{:related}