Merge pull request #196 from onderkalaci/crunchy_bridge_for_analytics: Add Crunchy Bridge for Analytics

rschu1ze authored Aug 4, 2024
2 parents deb427b + 1b989f6 commit a0a2aee
Showing 6 changed files with 318 additions and 0 deletions.

48 changes: 48 additions & 0 deletions crunchy-bridge-for-analytics/README.md

[Crunchy Bridge for Analytics](https://docs.crunchybridge.com/analytics) allows users to query Parquet files on S3 directly.


1) Set up the cluster:

- Go to https://crunchybridge.com/
- Click `Create Cluster` -> `Create Analytics Cluster`
- Choose the region (`eu-central-1`) where `s3://clickhouse-public-datasets/hits_compatible/hits.parquet` is located
- Click `Analytics-256`
- Click `Create Cluster`
- At "Step Two: Set Up Analytics Credentials", click "Skip for now"
- Wait until the state of the machine becomes "Running"

2) Set up a VM on AWS in the same region as the cluster (`eu-central-1`).

This keeps the latency between the server and the client low. We will need `psql` on this VM, so install it, e.g. `sudo yum install -y postgresql16`, depending on the Linux distro.
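If you are unsure which package manager the VM uses, a small helper like the following can pick the install command. This is a sketch; the function name is hypothetical and the package names are assumptions, so verify them against your distro's repositories:

```shell
#!/bin/bash
# Hypothetical helper: print the psql install command for a given package
# manager. Package names are assumptions; verify them for your distro.
pg_install_cmd() {
  case "$1" in
    yum|dnf) echo "sudo $1 install -y postgresql16" ;;
    apt)     echo "sudo apt-get install -y postgresql-client-16" ;;
    *)       echo "unsupported package manager: $1" >&2; return 1 ;;
  esac
}

pg_install_cmd yum
```

Run the printed command, then confirm the client is available with `psql --version`.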


3) Get the connection strings:

3.1) Get the application connection string:
- Click the "Connection" tab from the left menu
- Pick role: application, Format psql
- Click "Copy"

Set `APPCONNCMD` to the command you copied above:
```bash
export APPCONNCMD='psql postgres://application:[email protected]:5432/postgres'
```

3.2) Get the postgres connection string:

- Click the "Connection" tab from the left menu
- Pick role: postgres, Format psql
- Click "Copy"

Set `SUPERUSERCONNCMD` to the command you copied above:
```bash
export SUPERUSERCONNCMD='psql postgres://postgres:[email protected]:5432/postgres'
```
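Before running the benchmark, it is worth sanity-checking both connection commands. A minimal sketch, assuming the two variables are exported as shown above (the `check_conn` helper is hypothetical):

```shell
#!/bin/bash
# Verify a psql connection command stored in an environment variable.
# Prints "<NAME> OK" on success, a diagnostic otherwise.
check_conn() {
  local name=$1 cmd=$2
  if [ -z "$cmd" ]; then
    echo "$name is not set"
    return 1
  fi
  $cmd -Atc 'SELECT 1' >/dev/null && echo "$name OK"
}

# "|| true" so a missing variable does not abort a "set -e" shell.
check_conn APPCONNCMD "$APPCONNCMD" || true
check_conn SUPERUSERCONNCMD "$SUPERUSERCONNCMD" || true
```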

4) Run the script:
```bash
./run.sh
```

For the cold run, we access S3 directly while running the queries. For the warm runs, we first download the file from S3 to a local cache drive and then run the queries. This logic is implemented in the `run.sh` script.
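Between the cold and warm runs you can inspect what has actually been cached; `crunchy_file_cache.list()` is the same function `run.sh` uses to clear the cache. A sketch, assuming `APPCONNCMD` is exported:

```shell
#!/bin/bash
# List the locally cached files, or explain why we cannot connect.
list_cache() {
  if [ -n "$APPCONNCMD" ]; then
    $APPCONNCMD -c 'SELECT path FROM crunchy_file_cache.list();'
  else
    echo "APPCONNCMD not set; skipping cache listing"
  fi
}

list_cache
```

After the `crunchy_file_cache.add(...)` step in `run.sh`, the `hits.parquet` path should appear in this listing.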
114 changes: 114 additions & 0 deletions crunchy-bridge-for-analytics/create.sql
-- Auto-infer the columns from hits.parquet
CREATE FOREIGN TABLE hits_auto () SERVER crunchy_lake_analytics OPTIONS (path 's3://clickhouse-public-datasets/hits_compatible/hits.parquet');

-- hits.parquet uses integers for date and time,
-- which most databases cannot pass directly
-- to date/time functions. Hence, we convert them to date and
-- timestamptz in a view that wraps around the table.
CREATE OR REPLACE VIEW hits AS SELECT
WatchID,
JavaEnable,
Title,
GoodEvent,
to_timestamp(EventTime) AS EventTime,
to_date(EventDate) AS EventDate,
CounterID,
ClientIP,
RegionID,
UserID,
CounterClass,
OS,
UserAgent,
URL,
Referer,
IsRefresh,
RefererCategoryID,
RefererRegionID,
URLCategoryID,
URLRegionID,
ResolutionWidth,
ResolutionHeight,
ResolutionDepth,
FlashMajor,
FlashMinor,
FlashMinor2,
NetMajor,
NetMinor,
UserAgentMajor,
UserAgentMinor,
CookieEnable,
JavascriptEnable,
IsMobile,
MobilePhone,
MobilePhoneModel,
Params,
IPNetworkID,
TraficSourceID,
SearchEngineID,
SearchPhrase,
AdvEngineID,
IsArtifical,
WindowClientWidth,
WindowClientHeight,
ClientTimeZone,
ClientEventTime,
SilverlightVersion1,
SilverlightVersion2,
SilverlightVersion3,
SilverlightVersion4,
PageCharset,
CodeVersion,
IsLink,
IsDownload,
IsNotBounce,
FUniqID,
OriginalURL,
HID,
IsOldCounter,
IsEvent,
IsParameter,
DontCountHits,
WithHash,
HitColor,
LocalEventTime,
Age,
Sex,
Income,
Interests,
Robotness,
RemoteIP,
WindowName,
OpenerName,
HistoryLength,
BrowserLanguage,
BrowserCountry,
SocialNetwork,
SocialAction,
HTTPError,
SendTiming,
DNSTiming,
ConnectTiming,
ResponseStartTiming,
ResponseEndTiming,
FetchTiming,
SocialSourceNetworkID,
SocialSourcePage,
ParamPrice,
ParamOrderID,
ParamCurrency,
ParamCurrencyID,
OpenstatServiceName,
OpenstatCampaignID,
OpenstatAdID,
OpenstatSourceID,
UTMSource,
UTMMedium,
UTMCampaign,
UTMContent,
UTMTerm,
FromTag,
HasGCLID,
RefererHash,
URLHash,
CLID
FROM hits_auto;
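The `to_timestamp(EventTime)` call above implies EventTime holds Unix epoch seconds. The same conversion can be sanity-checked outside the database with GNU `date` (the epoch value below is a sample chosen for illustration, not taken from the dataset):

```shell
#!/bin/bash
# Convert a sample Unix epoch integer to a UTC timestamp, mirroring what
# to_timestamp() does inside the view (GNU coreutils date syntax).
epoch=1372636800
date -u -d "@$epoch" +'%Y-%m-%d %H:%M:%S'   # prints 2013-07-01 00:00:00
```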
43 changes: 43 additions & 0 deletions crunchy-bridge-for-analytics/queries.sql
SELECT COUNT(*) FROM hits;
SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0;
SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;
SELECT AVG(UserID) FROM hits;
SELECT COUNT(DISTINCT UserID) FROM hits;
SELECT COUNT(DISTINCT SearchPhrase) FROM hits;
SELECT MIN(EventDate), MAX(EventDate) FROM hits;
SELECT AdvEngineID, COUNT(*) FROM hits WHERE AdvEngineID <> 0 GROUP BY AdvEngineID ORDER BY COUNT(*) DESC;
SELECT RegionID, COUNT(DISTINCT UserID) AS u FROM hits GROUP BY RegionID ORDER BY u DESC LIMIT 10;
SELECT RegionID, SUM(AdvEngineID), COUNT(*) AS c, AVG(ResolutionWidth), COUNT(DISTINCT UserID) FROM hits GROUP BY RegionID ORDER BY c DESC LIMIT 10;
SELECT MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhoneModel ORDER BY u DESC LIMIT 10;
SELECT MobilePhone, MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhone, MobilePhoneModel ORDER BY u DESC LIMIT 10;
SELECT SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
SELECT SearchPhrase, COUNT(DISTINCT UserID) AS u FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY u DESC LIMIT 10;
SELECT SearchEngineID, SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase <> '' GROUP BY SearchEngineID, SearchPhrase ORDER BY c DESC LIMIT 10;
SELECT UserID, COUNT(*) FROM hits GROUP BY UserID ORDER BY COUNT(*) DESC LIMIT 10;
SELECT UserID, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, SearchPhrase ORDER BY COUNT(*) DESC LIMIT 10;
SELECT UserID, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, SearchPhrase LIMIT 10;
SELECT UserID, extract(minute FROM EventTime) AS m, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, m, SearchPhrase ORDER BY COUNT(*) DESC LIMIT 10;
SELECT UserID FROM hits WHERE UserID = 435090932899640449;
SELECT COUNT(*) FROM hits WHERE URL LIKE '%google%';
SELECT SearchPhrase, MIN(URL), COUNT(*) AS c FROM hits WHERE URL LIKE '%google%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
SELECT SearchPhrase, MIN(URL), MIN(Title), COUNT(*) AS c, COUNT(DISTINCT UserID) FROM hits WHERE Title LIKE '%Google%' AND URL NOT LIKE '%.google.%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
SELECT * FROM hits WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10;
SELECT SearchPhrase FROM hits WHERE SearchPhrase <> '' ORDER BY EventTime LIMIT 10;
SELECT SearchPhrase FROM hits WHERE SearchPhrase <> '' ORDER BY SearchPhrase LIMIT 10;
SELECT SearchPhrase FROM hits WHERE SearchPhrase <> '' ORDER BY EventTime, SearchPhrase LIMIT 10;
SELECT CounterID, AVG(length(URL)) AS l, COUNT(*) AS c FROM hits WHERE URL <> '' GROUP BY CounterID HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\\.)?([^/]+)/.*$', '\1') AS k, AVG(length(Referer)) AS l, COUNT(*) AS c, MIN(Referer) FROM hits WHERE Referer <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
SELECT SUM(ResolutionWidth), SUM(ResolutionWidth + 1), SUM(ResolutionWidth + 2), SUM(ResolutionWidth + 3), SUM(ResolutionWidth + 4), SUM(ResolutionWidth + 5), SUM(ResolutionWidth + 6), SUM(ResolutionWidth + 7), SUM(ResolutionWidth + 8), SUM(ResolutionWidth + 9), SUM(ResolutionWidth + 10), SUM(ResolutionWidth + 11), SUM(ResolutionWidth + 12), SUM(ResolutionWidth + 13), SUM(ResolutionWidth + 14), SUM(ResolutionWidth + 15), SUM(ResolutionWidth + 16), SUM(ResolutionWidth + 17), SUM(ResolutionWidth + 18), SUM(ResolutionWidth + 19), SUM(ResolutionWidth + 20), SUM(ResolutionWidth + 21), SUM(ResolutionWidth + 22), SUM(ResolutionWidth + 23), SUM(ResolutionWidth + 24), SUM(ResolutionWidth + 25), SUM(ResolutionWidth + 26), SUM(ResolutionWidth + 27), SUM(ResolutionWidth + 28), SUM(ResolutionWidth + 29), SUM(ResolutionWidth + 30), SUM(ResolutionWidth + 31), SUM(ResolutionWidth + 32), SUM(ResolutionWidth + 33), SUM(ResolutionWidth + 34), SUM(ResolutionWidth + 35), SUM(ResolutionWidth + 36), SUM(ResolutionWidth + 37), SUM(ResolutionWidth + 38), SUM(ResolutionWidth + 39), SUM(ResolutionWidth + 40), SUM(ResolutionWidth + 41), SUM(ResolutionWidth + 42), SUM(ResolutionWidth + 43), SUM(ResolutionWidth + 44), SUM(ResolutionWidth + 45), SUM(ResolutionWidth + 46), SUM(ResolutionWidth + 47), SUM(ResolutionWidth + 48), SUM(ResolutionWidth + 49), SUM(ResolutionWidth + 50), SUM(ResolutionWidth + 51), SUM(ResolutionWidth + 52), SUM(ResolutionWidth + 53), SUM(ResolutionWidth + 54), SUM(ResolutionWidth + 55), SUM(ResolutionWidth + 56), SUM(ResolutionWidth + 57), SUM(ResolutionWidth + 58), SUM(ResolutionWidth + 59), SUM(ResolutionWidth + 60), SUM(ResolutionWidth + 61), SUM(ResolutionWidth + 62), SUM(ResolutionWidth + 63), SUM(ResolutionWidth + 64), SUM(ResolutionWidth + 65), SUM(ResolutionWidth + 66), SUM(ResolutionWidth + 67), SUM(ResolutionWidth + 68), SUM(ResolutionWidth + 69), SUM(ResolutionWidth + 70), SUM(ResolutionWidth + 71), SUM(ResolutionWidth + 72), SUM(ResolutionWidth + 73), SUM(ResolutionWidth + 74), SUM(ResolutionWidth + 75), SUM(ResolutionWidth + 76), SUM(ResolutionWidth + 77), SUM(ResolutionWidth + 78), SUM(ResolutionWidth + 79), SUM(ResolutionWidth + 80), SUM(ResolutionWidth + 81), SUM(ResolutionWidth + 82), SUM(ResolutionWidth + 83), SUM(ResolutionWidth + 84), SUM(ResolutionWidth + 85), SUM(ResolutionWidth + 86), SUM(ResolutionWidth + 87), SUM(ResolutionWidth + 88), SUM(ResolutionWidth + 89) FROM hits;
SELECT SearchEngineID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits WHERE SearchPhrase <> '' GROUP BY SearchEngineID, ClientIP ORDER BY c DESC LIMIT 10;
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits WHERE SearchPhrase <> '' GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;
SELECT URL, COUNT(*) AS c FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10;
SELECT 1, URL, COUNT(*) AS c FROM hits GROUP BY 1, URL ORDER BY c DESC LIMIT 10;
SELECT ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3, COUNT(*) AS c FROM hits GROUP BY ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3 ORDER BY c DESC LIMIT 10;
SELECT URL, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND DontCountHits = 0 AND IsRefresh = 0 AND URL <> '' GROUP BY URL ORDER BY PageViews DESC LIMIT 10;
SELECT Title, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND DontCountHits = 0 AND IsRefresh = 0 AND Title <> '' GROUP BY Title ORDER BY PageViews DESC LIMIT 10;
SELECT URL, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 AND IsLink <> 0 AND IsDownload = 0 GROUP BY URL ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
SELECT TraficSourceID, SearchEngineID, AdvEngineID, CASE WHEN (SearchEngineID = 0 AND AdvEngineID = 0) THEN Referer ELSE '' END AS Src, URL AS Dst, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 GROUP BY TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
SELECT URLHash, EventDate, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 AND TraficSourceID IN (-1, 6) AND RefererHash = 3594120000172545465 GROUP BY URLHash, EventDate ORDER BY PageViews DESC LIMIT 10 OFFSET 100;
SELECT WindowClientWidth, WindowClientHeight, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 AND DontCountHits = 0 AND URLHash = 2868770270353813622 GROUP BY WindowClientWidth, WindowClientHeight ORDER BY PageViews DESC LIMIT 10 OFFSET 10000;
SELECT DATE_TRUNC('minute', EventTime) AS M, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-14' AND EventDate <= '2013-07-15' AND IsRefresh = 0 AND DontCountHits = 0 GROUP BY DATE_TRUNC('minute', EventTime) ORDER BY DATE_TRUNC('minute', EventTime) LIMIT 10 OFFSET 1000;
{
"system": "Crunchy Bridge for Analytics (Parquet)",
"date": "2024-06-05",
"machine": "Analytics-256GB (64 vCores, 256 GB)",
"cluster_size": 1,
"comment":"",
"tags": ["column-oriented", "PostgreSQL compatible", "analytical", "managed"],
"data_size": 14779976446,

"result": [
[0.0628,0.0726,0.0699],
[0.3610,0.0713,0.0706],
[0.4281,0.0821,0.0819],
[0.4133,0.0884,0.0879],
[0.3735,0.1454,0.1463],
[0.4762,0.2440,0.2360],
[0.2493,0.0761,0.0774],
[0.1362,0.0734,0.0737],
[0.2802,0.1653,0.1656],
[0.5351,0.2102,0.2109],
[0.3286,0.1102,0.1099],
[0.2949,0.1135,0.1133],
[0.5787,0.2420,0.2544],
[0.5019,0.2800,0.2818],
[0.4142,0.2518,0.2536],
[0.2518,0.1629,0.1565],
[0.5032,0.2965,0.2888],
[0.4987,0.2764,0.2721],
[1.2427,0.8276,0.8378],
[5.9475,0.7041,0.7026],
[1.5523,0.5729,0.5729],
[1.4388,0.4994,0.4961],
[2.3632,0.7972,0.7951],
[5.7294,2.3540,2.3470],
[0.4353,0.1463,0.1382],
[0.3627,0.1325,0.1334],
[0.4214,0.1420,0.1406],
[1.4605,0.5591,0.5534],
[2.5890,2.1074,2.1023],
[0.2560,0.1601,0.1592],
[0.5807,0.1927,0.1964],
[0.8045,0.2173,0.2202],
[1.0320,0.6704,0.6718],
[1.6656,0.9118,0.9003],
[1.6751,0.9396,0.9062],
[0.2671,0.1728,0.1730],
[0.4445,0.1802,0.1823],
[0.2588,0.1354,0.1346],
[0.4247,0.1536,0.1526],
[0.8441,0.3354,0.3201],
[0.2188,0.0809,0.0808],
[0.2217,0.0787,0.0777],
[0.3362,0.2414,0.2417]
]
}
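Each row in the `result` array holds three timings in seconds: one cold run followed by two warm runs. A quick way to turn a row into a cold-vs-warm speedup, using one of the rows above as input:

```shell
#!/bin/bash
# Compute the cold/warm speedup for one result row (cold, warm, warm):
# the cold time divided by the average of the two warm times.
echo "5.9475,0.7041,0.7026" | awk -F, '{printf "%.1fx\n", $1 / (($2 + $3) / 2)}'
# prints 8.5x
```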
57 changes: 57 additions & 0 deletions crunchy-bridge-for-analytics/run.sh
#!/bin/bash

# disable cache manager to prevent auto-loading of files
$SUPERUSERCONNCMD -c 'ALTER SYSTEM SET crunchy_query_engine.enable_cache_manager TO off;'
$SUPERUSERCONNCMD -c 'SELECT pg_reload_conf();'
$SUPERUSERCONNCMD -c 'SHOW crunchy_query_engine.enable_cache_manager;'

# make sure there are no leftover files locally available
$APPCONNCMD -c 'select crunchy_file_cache.remove(path) FROM crunchy_file_cache.list();'

# create the table
$APPCONNCMD -f create.sql

# Indexed array holding a comma-separated timing list per query
declare -a results

# Function to run queries and store results
run_queries() {
local run_number=$1
local query_index=0

while read -r query; do
time_ms=$($APPCONNCMD -c '\timing' -c "$query" | grep 'Time' | awk '{print $2}')
time_sec=$(echo "scale=4; $time_ms / 1000" | bc)
time_sec=$(printf "%.4f" "$time_sec") # restore the leading zero that bc drops (.0628 -> 0.0628)

# Append the result to the appropriate query's results array
if [ $run_number -eq 0 ]; then
results[$query_index]="$time_sec"
else
results[$query_index]+=",$time_sec"
fi
query_index=$((query_index + 1))
done < queries.sql
}

# Run queries for the first time without file downloaded locally, add to results
run_queries 0

# Enable loading the file locally
$SUPERUSERCONNCMD -c 'ALTER SYSTEM SET crunchy_query_engine.enable_cache_manager TO on;'
$SUPERUSERCONNCMD -c 'SELECT pg_reload_conf();'
$SUPERUSERCONNCMD -c 'SHOW crunchy_query_engine.enable_cache_manager;'

# Synchronously load the file to the local machine
$APPCONNCMD -c "SELECT * FROM crunchy_file_cache.add('s3://clickhouse-public-datasets/hits_compatible/hits.parquet');"

# Run queries for the second and third times with file locally available
run_queries 1
run_queries 2

# Format the results
echo '"result": ['
for ((i = 0; i < ${#results[@]}; i++)); do
echo "[${results[$i]}],"
done | sed '$ s/,$//'
echo "]"