Skip to content

Latest commit

 

History

History
526 lines (406 loc) · 22.4 KB

File metadata and controls

526 lines (406 loc) · 22.4 KB

Hive Connector over Backblaze B2 Cloud Storage

This tutorial is closely based on the trino-minio tutorial by bitsondatadev. In this version of the tutorial, you will configure Trino to use Backblaze B2 Cloud Storage for file storage.

Introduction

If you are new to Trino or Presto®, I recommend that you check out the following blog to get a sense for the Hive connector architecture. A gentle introduction to the Hive connector

Goals

In this tutorial, you will:

  1. Learn how to run a CTAS (CREATE TABLE AS) statement in Trino.
  2. Learn the roles of the Trino runtime, metastore, and storage.
  3. Dive into the relational database that contains the Hive model and metadata that is stored in the Hive metastore service.
  4. Run SQL queries against the real-world Backblaze Drive Stats public data set.

Steps

Running Services

First, you want to start the services. Make sure that you are in the trino-getting-started/hive/trino-b2 directory. Now run the following command:

docker compose up -d

You should expect to see the following output (you may also have to download the Docker images before you see the "done" message):

[+] Running 5/5
 ✔ Network trino-b2_trino-network          Created  0.0s 
 ✔ Volume "trino-b2_maria_volume"          Created  0.0s 
 ✔ Container trino-b2-mariadb-1            Started  0.4s 
 ✔ Container trino-b2-trino-coordinator-1  Started  0.4s 
 ✔ Container trino-b2-hive-metastore-1     Started  0.7s

Open Trino CLI

Once this is complete, you can log into the Trino coordinator node. We will do this by using the exec command and run the trino CLI executable as the command we run on that container. Notice the container id is trino-b2-trino-coordinator-1 so the command you will run is:

docker container exec -it trino-b2-trino-coordinator-1 trino

When you start this step, you should see the trino cursor once the startup is complete. It should look like this when it is done:

trino>

The first step to understanding the Hive metastore's role in the Hive connector is to run a CTAS (CREATE TABLE AS) query that pushes data from one of the TPC connectors into the hive catalog that points to Backblaze B2. The TPC connectors generate data on the fly so that we can run simple tests like this.

First, run a command to show the catalogs to see the tpch and b2 catalogs since these are what we will use in the CTAS query.

SHOW CATALOGS;

You should see that the b2 catalog is registered. This is actually a Hive connector configured under the name b2 to delineate the underlying storage we are using.

Create a Backblaze Account

If you already have a Backblaze account with B2 enabled, you can skip to the next section.

If you have a Backblaze account, but you don't see the B2 Cloud Storage menu on the left after you sign in, then you need to enable Backblaze B2 for your account. On the left, under Account, click My Settings, then, under Enabled Products, click the checkbox for B2 Cloud Storage.

Enable B2

The page will refresh and you will see the B2 Cloud Storage menu on the left.

If you do not yet have a Backblaze account, navigate to the Backblaze B2 signup page, enter your email address and a password, and click Sign Up for Backblaze B2. Your account includes 10 GB of storage free of charge, and you don't need to submit any credit card details until you need more.

Create a Bucket in Backblaze B2

If we look at the Trino Architecture, we're first going to prep the file storage where the actual data will reside. In earlier iterations of big data systems, this layer was commonly HDFS or other S3 compatible storage and AWS S3. For our example, we're using Backblaze B2, which is also S3 compatible. Creating a bucket gives us a location to write our data to and we can tell Trino where to find it.

Storage

Sign in to Backblaze B2, and, in the B2 Cloud Storage menu on the left, click Buckets.

Click Buckets

If you don't see the B2 Cloud Storage menu, you need to enable B2 on your Backblaze account.

Click Create a Bucket and give the bucket a name. We're using trino-tiny in this tutorial, since the dataset we will be transferring will be small, but you will need to choose your own bucket name, since bucket names in Backblaze B2 must be globally unique.

Create a bucket

Leave the remaining settings with their defaults and click Create a Bucket.

Make a note of the endpoint value in the bucket details; you'll need that to configure Trino in a moment. The region is the second part of the endpoint - for example, in the endpoint s3.us-west-004.backblazeb2.com, the region is us-west-004.

Bucket Endpoint

Create an Application Key in Backblaze B2

Now you have a bucket in Backblaze B2, you have to create an application key that Trino's Hive connector can use to access it. In the menu on the left, under Account, click App Keys, then click Add a New Application Key.

It's good practice to limit a key to access a single bucket if you can, so give the key a name, again, we're using trino-tiny as an example, and select the bucket you created in the Allow Access to Bucket(s) dropdown. Many tools require the ability to list all of the buckets in an account, even if they will only be using a single bucket, so enable Allow List All Bucket Names. Leave the remaining settings with their defaults and click Create New Key.

Create an Application Key

IMPORTANT: You MUST copy the application key immediately after you create it. You cannot access the key after you leave this page. Make a note of the key id as well as the application key.

Application Key

Configuring Trino

We need to configure Trino's Hive Connector to access the bucket in Backblaze B2. There are several edits across three configuration files, so, before you start, ensure you have the required information to hand. The configuration files contain the following placeholders:

  • BUCKET_NAME
  • APPLICATION_KEY
  • KEY_ID
  • ENDPOINT
  • REGION

BUCKET_NAME is simply the name of the bucket you created; you should have a note of APPLICATION_KEY, KEY_ID, ENDPOINT and REGION from creating the bucket and application key earlier. Note that you must use the full https URL for ENDPOINT; for example, https://s3.us-west-004.backblazeb2.com.

Edit each of the following three files and replace the placeholders with your values:

  • conf/core-site.xml
  • conf/metastore-site.xml
  • etc/catalog/b2.properties

For example, in conf/core-site.xml, the first edit is to change BUCKET_NAME to the name of your bucket. i.e. change

    <property>
        <name>fs.defaultFS</name>
        <value>s3a://BUCKET_NAME</value>
    </property>

to

    <property>
        <name>fs.defaultFS</name>
        <value>s3a://trino-tiny</value>
    </property>

replacing trino-tiny with the name of your bucket.

Once you've completed editing the configuration files, save them, and restart Trino by running the following command from the trino-getting-started/hive/trino-b2 directory:

docker compose restart trino-b2-trino-coordinator-1

You will also need to restart the Trino CLI:

docker container exec -it trino-b2-trino-coordinator-1 trino

Querying Trino

Now that we've set up the Backblaze B2 bucket and application key, lets move to creating a SCHEMA that points us to the bucket in Backblaze B2 and then run our CTAS query. When we create a table using CTAS, we're telling the table to copy the table schema and the data from the source table into the table we're creating. This will make more sense when you see the query below.

Note: There are two meanings we just used when saying the word "schema". There is the table schema that defines columns of a table, then there is the SCHEMA that I intentionally put in all caps that signifies the SCHEMA in the containment hierarchy used by Trino. Trino defines a CATALOG which contains multiple SCHEMAS, which contain multiple TABLES. In other databases like Hive and MySQL

Runtime

Back in the terminal create the b2.tiny SCHEMA. This will be the first call to the metastore to save the location of the S3 schema location in Backblaze B2. Again, substitute the name of your bucket for trino-tiny:

CREATE SCHEMA b2.tiny
WITH (location = 's3a://trino-tiny/');

Now that we have a SCHEMA that references the bucket where we store our tables in Backblaze B2, we now can create our first table.

Optional: To view your queries run, log into the Trino UI and log in using any username (it doesn't matter since no security is set up).

Move the customer data from the tiny generated tpch data into Backblaze B2 using a CTAS query. Run the following query and if you like, watch it running on the Trino UI:

CREATE TABLE b2.tiny.customer
WITH (
    format = 'ORC',
    external_location = 's3a://trino-tiny/customer/'
) 
AS SELECT * FROM tpch.tiny.customer;

Go back to the Backblaze B2 UI, and click the Upload/Download button for the trino-tiny bucket. You will now see a customer directory generated from that table and underneath that directory will be a file with a name comprised of uuid and date. This is the orc file generated by the Trino runtime residing in Backblaze B2.

Now that there is a table under Backblaze B2, you can query this data by running the following SQL statement:

SELECT * FROM b2.tiny.customer LIMIT 50;

So the question now is: how does Trino know where to find the orc file residing in Backblaze B2 when all we specify is the catalog, schema, and table? How does Trino know what columns exist in the orc file, and even the file it is retrieving is an orc file to being with? Find out more in the next step.

Exploring the Hive Metastore

Metastore

In order for Trino to know where to locate this file, it uses the Hive metastore to manage and store this information or metadata in a relational database that the metastore points to, in this case our mariadb instance. Execute the following statement to log into the mariadb instance and follow the remaining commands to learn how the metadata gets split into different tables. Understanding this model will also solidify the metastore's role in the scheme of Trino's use of it in the Hive connector.

Open another terminal and run the following command:

docker container exec -it "trino-b2-mariadb-1" /bin/bash

Once you see the root@mariadb terminal, enter into the cli.

mysql -uroot -p"$MYSQL_ROOT_PASSWORD"

Now that you're in the metastore's database command line interface, you can run SQL commands on this database to see where the metadata is stored. First, let's look at the databases stored in the metastore.

SELECT
 DB_ID,
 DB_LOCATION_URI,
 NAME, 
 OWNER_NAME,
 OWNER_TYPE,
 CTLG_NAME
FROM metastore_db.DBS;
+-------+---------------------------+---------+------------+------------+-----------+
| DB_ID | DB_LOCATION_URI           | NAME    | OWNER_NAME | OWNER_TYPE | CTLG_NAME |
+-------+---------------------------+---------+------------+------------+-----------+
|     1 | file:/user/hive/warehouse | default | public     | ROLE       | hive      |
|     2 | s3a://trino-tiny/         | tiny    | trino      | USER       | hive      |
+-------+---------------------------+---------+------------+------------+-----------+

This shows the databases. What may be strange at first glance, is this is showing the schema that we created under the database table. This is because the Hive metastore has two abstractions for its metadata, databases and tables. Since Trino follows the traditional 3 level ANSI SQL catalog standard, schema is equivalent to a database. So just as a database contains multiple tables, a schema will contain multiple tables. Notice the DB_LOCATION_URI is in the bucket location created before in Backblaze B2 and set when you created this schema. The owner is the trino user coming from the user in the trino instance. Also note the CTLG_NAME references the trino catalog.

The next command will show us metadata about the customer table created in the previous step

SELECT 
 t.TBL_ID, 
 t.DB_ID, 
 t.OWNER, 
 t.TBL_NAME, 
 t.TBL_TYPE,
 t.SD_ID
FROM metastore_db.TBLS t 
 JOIN metastore_db.DBS d 
  ON t.DB_ID= d.DB_ID 
WHERE d.NAME = 'tiny';
+--------+-------+-------+----------+----------------+-------+
| TBL_ID | DB_ID | OWNER | TBL_NAME | TBL_TYPE       | SD_ID |
+--------+-------+-------+----------+----------------+-------+
|      1 |     2 | trino | customer | EXTERNAL_TABLE |     1 |
+--------+-------+-------+----------+----------------+-------+

There's nothing unexpected here. You should note that the DB_ID matches with the id of the tiny database (ie schema) name. The owner is the same trino user from our trino instance. The TBL_NAME is the name of the customer table created in the last step.

You may notice the location for the table seems to be missing but that information is actually on another table. The next query will show this location. Take note of the SD_ID before running the next query.

SELECT 
 s.SD_ID,
 s.INPUT_FORMAT,
 s.LOCATION,
 s.SERDE_ID 
FROM metastore_db.TBLS t 
 JOIN metastore_db.DBS d
  ON t.DB_ID = d.DB_ID
 JOIN metastore_db.SDS s 
  ON t.SD_ID = s.SD_ID
WHERE t.TBL_NAME = 'customer'
 AND d.NAME='tiny';
+-------+-------------------------------------------------+---------------------------+----------+
| SD_ID | INPUT_FORMAT                                    | LOCATION                  | SERDE_ID |
+-------+-------------------------------------------------+---------------------------+----------+
|     1 | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | s3a://trino-tiny/customer |        1 |
+-------+-------------------------------------------------+---------------------------+----------+

This table should contain a row that matches the SD_ID from the last query result. You should also see the expected INPUT_FORMAT class which since we specified we were storing orc files it should be the OrcInputFormat. Also notice the LOCATION is the schema location we set. If we hadn't set this it would have defaulted to <schema_url>/<table_name>. Then there is the SERDE_ID. SerDe is an abbreviation for serializer/deserializer. This will point us to another table that contains the information to find which serializer to use when parsing the file in Backblaze B2.

To find out the serializer used, run the following query:

SELECT 
 sd.SERDE_ID,
 sd.NAME,
 sd.SLIB
FROM metastore_db.TBLS t 
 JOIN metastore_db.DBS d
  ON t.DB_ID = d.DB_ID
 JOIN metastore_db.SDS s 
  ON t.SD_ID = s.SD_ID
 JOIN metastore_db.SERDES sd 
  ON s.SERDE_ID = sd.SERDE_ID
WHERE t.TBL_NAME = 'customer'
 AND d.NAME='tiny';
+----------+----------+-------------------------------------------+
| SERDE_ID | NAME     | SLIB                                      |
+----------+----------+-------------------------------------------+
|        1 | customer | org.apache.hadoop.hive.ql.io.orc.OrcSerde |
+----------+----------+-------------------------------------------+

This is a pretty simple table, you will notice the NAME refers to the table the serializer is used for, and SLIB is the serializer library used when parsing the file in Backblaze B2.

Our last metadata query is looking at the columns on the table.

SELECT c.* 
FROM metastore_db.TBLS t
 JOIN metastore_db.DBS d
  ON t.DB_ID = d.DB_ID
 JOIN metastore_db.SDS s
  ON t.SD_ID = s.SD_ID
 JOIN metastore_db.COLUMNS_V2 c
  ON s.CD_ID = c.CD_ID
WHERE t.TBL_NAME = 'customer'
 AND d.NAME='tiny'
ORDER by CD_ID, INTEGER_IDX;
+-------+---------+-------------+--------------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME    | INTEGER_IDX |
+-------+---------+-------------+--------------+-------------+
|     1 | NULL    | custkey     | bigint       |           0 |
|     1 | NULL    | name        | varchar(25)  |           1 |
|     1 | NULL    | address     | varchar(40)  |           2 |
|     1 | NULL    | nationkey   | bigint       |           3 |
|     1 | NULL    | phone       | varchar(15)  |           4 |
|     1 | NULL    | acctbal     | double       |           5 |
|     1 | NULL    | mktsegment  | varchar(10)  |           6 |
|     1 | NULL    | comment     | varchar(117) |           7 |
+-------+---------+-------------+--------------+-------------+

You'll notice that the COLUMNS_V2 table has a foreign key CD_ID to the SDS storage table. Each key will correlate to a specific table and so you'll see that the columns are for the customer table. You can now notice the COLUMN_NAME, TYPE_NAME, and the order these fields are expected in the INTEGERD_IDX.

So now you have a working understanding of the Hive metastore and the model it uses to store metadata about the files that are generated and written to when inserting using the Hive connector.

Accessing the Backblaze Drive Stats Data Set

Drive Stats is a public data set of the daily metrics on the hard drives in Backblaze’s cloud storage infrastructure that Backblaze has open-sourced starting with April 2013. Currently, Drive Stats comprises over 346 million records, rising by over 200,000 records per day. Drive Stats is an append-only dataset effectively logging daily statistics that once written are never updated or deleted.

Each day, Backblaze collects a Drive Stats record from each hard drive containing some or all the following fields:

  • date: the date of collection.
  • serial_number: the unique serial number of the drive.
  • model: the manufacturer’s model number of the drive.
  • capacity_bytes: the drive’s capacity, in bytes.
  • failure: 1 if this was the last day that the drive was operational before failing, 0 if all is well.
  • datacenter: the data center containing the drive; currently one of ams5 (Amsterdam, Netherlands), iad1 (Reston, VA), phx1 (Phoenix, AZ), sac0 (Sacramento, CA), sac2 (Stockton, CA), or an empty string.
  • cluster_id: the cluster containing the drive; currently one of 0, 20, 31, 40, 50, 52.
  • vault_id: the vault containing the drive; four digits, for example, 1042.
  • pod_id: the identifier of the drive's pod within its vault; an integer between 1 and 20 inclusive.
  • pod_slot_num: the drive's slot within its pod; an integer between 1 and 59 inclusive. May be NULL.
  • is_legacy_format: for future use; currently always 0.
  • A collection of SMART attributes. The number of attributes collected has risen over time; currently we store 87 SMART attributes in each record, each one in both raw and normalized form, with field names of the form smart_n_normalized and smart_n_raw, where n is between 1 and 255.

The vault_id, pod_id, and is_legacy_format fields were introduced in Q2 2023 and are explained further in Backblaze Drive Stats for Q2 2023, while datacenter, cluster_id and pod_slot_num appeared in Q3 2023 and are covered in Backblaze Drive Stats for Q3 2023. These fields are NULL in records created before their introduction.

In total, each record currently comprises 195 data fields describing the location and state of an individual hard drive on a given day (the number of SMART attributes collected has risen over time).

The entire Backblaze Drive Stats data set is available in Parquet format in a public Backblaze B2 bucket. At the time of writing, the data set comprises 200 files occupying 21.7 GiB of storage.

To access the Drive Stats data set via Trino, start Trino as explained above, then follow the instructions above for configuring Trino, with the following configuration values:

  • BUCKET_NAME: drivestats-parquet
  • APPLICATION_KEY: K004cogT4GIeHHfhCyPPLsPBT4NyY1A
  • KEY_ID: 0045f0571db506a0000000007
  • ENDPOINT: https://s3.us-west-004.backblazeb2.com
  • REGION: us-west-004

Run the following command to load the Drive Stats schema into Trino's metastore:

docker container exec -it trino-b2-trino-coordinator-1 trino -f /etc/trino/drivestats.sql

Now open the CLI, setting the default catalog and schema:

docker container exec -it trino-b2-trino-coordinator-1 trino --catalog b2 --schema ds

Here are some sample queries to get you started:

How many records are in the current Drive Stats data set?

SELECT COUNT(*) 
FROM drivestats;

How many hard drives was Backblaze spinning on a given date?

SELECT COUNT(*) 
FROM drivestats 
WHERE year = 2022 AND month = 9 AND day = 30;

How many exabytes of raw storage was Backblaze managing on a given date?

SELECT ROUND(SUM(CAST(capacity_bytes AS bigint))/1e+18, 2) 
FROM drivestats 
WHERE year = 2022 AND month = 9 AND day = 30;

What are the top 10 most common drive models in the data set?

SELECT model, COUNT(DISTINCT serial_number) AS count 
FROM drivestats 
GROUP BY model
ORDER BY count DESC
LIMIT 10;

How many drives were in the various Backblaze data centers on a given date?

SELECT datacenter, COUNT(*) AS count
FROM drivestats
WHERE year = 2024 AND month = 6 AND day = 30
GROUP BY datacenter;

You can learn more about querying the Drive Stats data set from Querying a Decade of Drive Stats Data.

Stopping Services

Once you're done, the resources used for this exercise can be released by running the following command:

docker compose down

See trademark and other legal notices.