HTTP Archive data pipeline playbook

This playbook documents the processes for maintaining HTTP Archive's monthly data pipeline and how to handle various issues that may arise.

How the monthly data pipeline works (normally)

TODO

There are two types of data output by the pipeline: HAR and CSV. The legacy pipeline uses MySQL tables to store summary stats about pages and requests, which it exports in CSV format. The sync_csv.sh script takes the pages/requests CSV data and creates the corresponding summary_pages and summary_requests BigQuery tables for desktop and mobile.
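
As a rough sketch of what that load step amounts to (not the actual script; the GCS path, schema file, and flags here are assumptions, and `sync_csv.sh` holds the real values), each CSV ends up in a per-crawl summary table along these lines:

```bash
# Sketch only: load a pages CSV into its per-crawl summary table.
# The GCS path and schema file are hypothetical; sync_csv.sh holds the real values.
bq load --source_format=CSV --replace \
  "httparchive:summary_pages.2021_12_01_desktop" \
  "gs://httparchive/example-crawl/pages_desktop.csv.gz" \
  ./schema/pages.json
```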

The raw JSON data emitted by WebPageTest is stored in HAR files, one per test. Each HAR file can be tens of MB, so across millions of tests the total size per crawl is on the order of terabytes. The sync_har.sh script kicks off a Dataflow job that opens each HAR file, parses the JSON into semantically meaningful chunks of data, and pipes the data to several HAR-based BigQuery datasets (pages, requests, response_bodies, technologies, and lighthouse), each of which gets a per-crawl table.
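
For orientation, each HAR is just (gzipped) JSON, so an individual test can be inspected straight from the bucket. A minimal sketch, assuming a hypothetical object path (list the bucket to find the real layout for a given crawl) and that jq is installed:

```bash
# Hypothetical object path; the actual bucket layout varies by crawl.
HAR_OBJECT="gs://httparchive/example-crawl/example-test.har.gz"

# Standard HAR structure: log.pages holds page-level data and log.entries holds
# one object per request; these feed the pages and requests tables respectively.
gsutil cat "$HAR_OBJECT" | gunzip | jq '{pages: (.log.pages | length), requests: (.log.entries | length)}'
```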

Remediating issues

CSV data

If there is a loss of CSV data, it can be regenerated from the HAR data entirely within BigQuery.

TODO: add link to query and document.
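
Until that query is linked, the general shape of the recovery is a bq query that re-derives summary columns from the HAR payload and writes them to a destination table. The sketch below is illustrative only: the selected fields and destination are assumptions, not the canonical query.

```bash
# Shape of the recovery only; the canonical query is what the TODO above should link to.
bq query --use_legacy_sql=false --replace \
  --destination_table="httparchive:summary_pages.2021_12_01_mobile" '
  SELECT
    url,
    JSON_EXTRACT_SCALAR(payload, "$.log.pages[0].startedDateTime") AS startedDateTime
  FROM `httparchive.pages.2021_12_01_mobile`'
```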

HAR data

If there is a loss of HAR data, we may be able to recover it depending on where in the pipeline the issue occurred.

A loss of HAR data would be apparent from the BigQuery Meta Dashboard: compare the # rows and # GB figures for any of the HAR-based tables listed above (pages, response_bodies, etc.) against previous crawls.
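
The same numbers can be pulled from the command line via each dataset's __TABLES__ metadata view (a sketch; it assumes the standard per-crawl table naming):

```bash
# Report row count and size (GB) for one crawl's table in each HAR-based dataset.
CRAWL="2021_12_01_mobile"
for dataset in pages requests response_bodies technologies lighthouse; do
  bq query --use_legacy_sql=false "
    SELECT '${dataset}' AS dataset, row_count, ROUND(size_bytes / POW(2, 30), 1) AS size_gb
    FROM \`httparchive.${dataset}.__TABLES__\`
    WHERE table_id = '${CRAWL}'"
done
```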

If the HAR files are all safely stored in their Google Cloud Storage bucket, we can regenerate the BigQuery tables by rerunning the sync_har.sh script.

1. Delete all existing HAR-based tables for the given crawl
   - Go to BigQuery
   - Search for the table name (e.g. `2021_12_01_mobile`)
   - Select all tables from HAR-based datasets
   - Select "Delete Table" from the BigQuery UI
2. Rerun the script (a command-line sketch of both steps follows this list)
   - Access a local or cloud machine with a copy of the HTTPArchive/bigquery repo
   - Run `gcloud auth login` to authenticate with the Google Cloud CLI
   - Run `sync_har.sh` to kick off the Dataflow pipeline
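
For reference, the same two steps can be scripted. A minimal sketch, assuming the httparchive project, the per-crawl table naming above, and a local clone of HTTPArchive/bigquery (the script's exact arguments may differ):

```bash
# Sketch only: drop the HAR-based tables for one crawl, then rerun the pipeline.
CRAWL="2021_12_01_mobile"

# Step 1: delete the existing HAR-based tables for this crawl.
for dataset in pages requests response_bodies technologies lighthouse; do
  bq rm -f -t "httparchive:${dataset}.${CRAWL}"
done

# Step 2: authenticate and rerun the Dataflow pipeline from the bigquery repo.
gcloud auth login
cd bigquery        # local clone of HTTPArchive/bigquery
./sync_har.sh      # check the script for any required arguments (e.g. crawl date)
```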