Skip to content

Latest commit

 

History

History
243 lines (200 loc) · 10.7 KB

README.md

File metadata and controls

243 lines (200 loc) · 10.7 KB

CockroachDB Geo Tourist

Use the spatial features in CockroachDB to find pubs, restaurants, cafes, etc.

Screenshot pubs (App shown running on a laptop)

This is a simple Python Flask and Javascript app which illustrates some of the new spatial capabilities in CockroachDB 20.2. The scenario is this: in the Web app, an icon represents the user, and this user is situated at a location randomly chosen from a set of destinations, each time the page is refreshed. Then, an HTTP POST is made from the Javascript front end, including the user's location and the type of amenity to search for. Within the Python Flask app, those values are featured in a SQL query against a CockroachDB instance loaded with spatial data. This query uses the following spatial data types, operators, and indexes to find and return a set of the nearest amenities, sorted by distance:

  1. GEOGRAPHY: the data type to represent each of the POINT data elements associated with the amenity
  2. ST_Distance: used to calculate the distance from the user to each of these locations
  3. ST_Y and ST_X: are used to retrieve the longitude and latitude of each of these points, for plotting onto the map
  4. ST_DWithin: used in the WHERE clause of the SQL query to constrain the results to points within 5km of the user's location
  5. ST_MakePoint: converts the longitude and latitude representing the user's location into a POINT
  6. A GIN index on the ref_point column in the osm table speeds up the calculation done by ST_DWithin
  7. ST_GeoHash: to create the primary key for the tourist_locations table

These types and operators, together with the GIN index, will be familiar to users of PostGIS, the popular spatial extension available for PostgreSQL. In CockroachDB, this layer was created from scratch and PostGIS was not used, though its API was preserved.

One aspect of CockroachDB's spatial capability is especially interesting: the way the spatial index works. In order to preserve CockroachDB's unique ability to scale horizontally by adding nodes to a running cluster, its approach to spatial indexing is to decompose of the space being indexed into buckets of various sizes. Deeper discussion of this topic is available in the docs and in this blog post.

Running on iPhone

(App running in an iPhone, in Safari, maps by OpenStreetMap)

Data

The data set is a sample of an extract of the OpenStreetMap Planet Dump which is accessible from here. The planet-latest.osm.pbf file was downloaded (2020-08-01) and then processed using Osmosis as documented in this script. The bounding box specified for the extract was --bounding-box top=72.253800 left=-12.666450 bottom=33.120960 right=34.225994, corresponding to the area shown in the figure below. The result of this operation was a 36 GB Bzip'd XML file (not included here). This intermediate file was then processed using this Perl script, with the result being piped through grep and, finally, gzip to produce a smaller data set containing a smaller set of points which lie in the areas the app focuses on.

Boundary of OSM data extract

DDL and sample SQL queries: The data set is loaded into one table which has a primary key and one secondary index. Here is the DDL:

DROP TABLE IF EXISTS osm;
CREATE TABLE osm
(
  id BIGINT
  , date_time TIMESTAMP WITH TIME ZONE
  , uid TEXT
  , name TEXT
  , key_value TEXT[]
  , ref_point GEOGRAPHY
  , geohash4 TEXT -- First 4 characters of geohash, corresponding to a box of about +/- 20 km
  , amenity TEXT
  , CONSTRAINT "primary" PRIMARY KEY (geohash4 ASC, amenity ASC, id ASC)
);
CREATE INDEX ON osm USING GIN(ref_point);

NOTE: ./load_osm_stdin.py creates the osm table and the GIN index if they don't already exist.

There is an additional table, tourist_locations (see below), which contains the set of places where our "tourist" might be situated when the page loads. This is populated by load_osm_stdin.py. Only locations for which enabled is TRUE will be used, so the number of possible locations can be managed by manipulating the existing rows in this table, or by adding new ones, though the data set may need to be expanded to accommodate the new values. The DDL for this table contains features worth mentioning: the goal was to use the geohash column as the primary key, but to also derive this value from the lat and lon values; lines 7 and 8 show how this can be acheived within CockroachDB:

1	CREATE TABLE tourist_locations
2	(
3	  name TEXT
4	  , lat FLOAT8
5	  , lon FLOAT8
6	  , enabled BOOLEAN DEFAULT TRUE
7	  , geohash CHAR(9) AS (ST_GEOHASH(ST_SETSRID(ST_MAKEPOINT(lon, lat), 4326), 9)) STORED
8	  , CONSTRAINT "primary" PRIMARY KEY (geohash ASC)
9	);

The Flask app runs one of two variations of a query, depending on whether the environment variable USE_GEOHASH is set and, if so, its value (true or false), as shown in the following code block (line numbers have been added here). The main difference is that, when USE_GEOHASH is set to true, the GIN index is not used, but rather the four character substring of the geohash of the point is used, which effectively constrains the search area to a +/- 20 km box (see lines 14 - 17). This geohash4 column is the leading component of the primary key, so is indexed, allowing this to perform very well while also having lower impact on data load speeds. Now, if this query was more complex than "find me the N closest points within a radius of X", the GIN index would be preferable since it permits far more complex comparisons.

 1	  sql = """
 2	  WITH q1 AS
 3	  (
 4	    SELECT
 5	      name,
 6	      ST_Distance(ST_MakePoint(%s, %s)::GEOGRAPHY, ref_point)::NUMERIC(9, 2) dist_m,
 7	      ST_Y(ref_point::GEOMETRY) lat,
 8	      ST_X(ref_point::GEOMETRY) lon,
 9	      date_time,
10	      key_value
11	    FROM osm
12	    WHERE
13	  """
14	  if useGeohash:
15	    sql += "geohash4 = SUBSTRING(%s FOR 4) AND amenity = %s"
16	  else:
17	    sql += "ST_DWithin(ST_MakePoint(%s, %s)::GEOGRAPHY, ref_point, 5.0E+03, TRUE) AND key_value && ARRAY[%s]"
18	  sql += """
19	  )
20	  SELECT * FROM q1
21	  """
22	  if useGeohash:
23	    sql += "WHERE dist_m < 5.0E+03"
24	  sql += """
25	  ORDER BY dist_m ASC
26	  LIMIT 10;
27	  """

Some extra SQL query examples

Here is an example showing a query to find all pubs within 300 meters of a path along the center of the Thames in central London. It uses GeoJSON functions and this nice UI.

Run the app in one of 3 ways: (1) locally, (2) locally, but with app in a Docker container, (3) in Kubernetes (K8s)

If running locally, with or without Docker

  • Download, install, and start a CockroachDB cluster using version 20.2 or above. Installation instructions can be found here, and the startup procesure is documented here. The default user is root and the default database is defaultdb, so these values don't need to be set.

  • Load the data (see above) using this script:

$ export PGHOST=localhost
$ export PGPORT=26257
$ curl -s -k https://storage.googleapis.com/crl-goddard-gis/osm_50k_eu.txt.gz | gunzip - | ./load_osm_stdin.py

Run the app locally, without Docker

  • Start the Python Flask app, which provides the data REST service and also serves the app's HTML template and static assets (PNG, CSS, and JS files):
$ export PGHOST=localhost
$ export PGPORT=26257

Run the app via its Docker image

  • Edit ./docker_run_image.sh, changing environment variables as necessary to suit your deployment.
$ ./docker_run_image.sh

Optional: stop the app, disable the use of the GIN index in favor of the primary key index on the geoash substring, then restart the app. Try both ways (e.g. unset USE_GEOHASH vs. export USE_GEOHASH=true) and compare the time it takes to load the amenity icons in the browser.

$ export USE_GEOHASH=true

Deploy the app in Kubernetes (K8s) using the CockroachDB K8s operator

  • You'll need access to a K8s environment. This document describes running this in Google's GKE.

  • What follows is derived from these docs.

  • The procedure outlined below demonstrates the following:

    • Deploying the CockroachDB K8s operator
    • Using that to spin up a 3 node CockroachDB cluster
    • The DB Console
    • Deployment of the CockroachDB Geo Tourist web app
    • Loading data for the app into the CockroachDB cluster
    • Performing a zero-downtime upgrade of the CockroachDB software
    • Scaling the cluster from 3 to 4 pods
    • Terminating one of the pods and verifying that the app remains available
  • The files in the ./k8s subdirectory are used for a K8s deployment. They are:

  • Change to the ./k8s directory: cd ./k8s/

  • Edit ./deploy_k8s.sh, changing any of the following to suit your needs:

MACHINETYPE="e2-standard-4"
NAME="${USER}-geo-tourist"
ZONE="us-east4-b"
N_NODES=5
  • Run the script and follow the prompts: ./deploy_k8s.sh

If you need to rebuild the Docker image

Edit Dockerfile as necessary, and then change ./docker_include.sh to set docker_id and anything else you'd like to change.

$ ./docker_build_image.sh
$ ./docker_tag_publish.sh

Reference

Geohash precision as a function of length

(source: Wikipedia)