Commit 5e1605d

Add files via upload
jovanpop-msft authored Oct 19, 2020
1 parent d737f03 commit 5e1605d
Showing 2 changed files with 527 additions and 0 deletions.
257 changes: 257 additions & 0 deletions Notebooks/TSQL/Jupiter/_content/tutorials/covid-ecdc.ipynb
@@ -0,0 +1,257 @@
{
"metadata": {
"kernelspec": {
"name": "SQL",
"display_name": "SQL",
"language": "sql"
},
"language_info": {
"name": "sql",
"version": ""
}
},
"nbformat_minor": 2,
"nbformat": 4,
"cells": [
{
"cell_type": "markdown",
"source": [
"# Analyze ECDC COVID data using Azure Synapse serverless SQL pool\n",
"\n",
"In this notebook, you will see how you can analyze the distribution of COVID cases reported in Serbia (Europe) using Synapse SQL endpoint in Synapse Analytics. Synapse SQL engine is the perfect choice for ad-hoc data analytics for the data analysts with T-SQL skills. The data set is placed on [Azure storage](https://azure.microsoft.com/en-us/services/open-datasets/catalog/ecdc-covid-19-cases/ \"https://azure.microsoft.com/en-us/services/open-datasets/catalog/ecdc-covid-19-cases/\") and formatted as parquet. \n",
"\n",
"## Explore your data\n",
"\n",
"As a first step we need to explore data in the file place in Azure storage using `OPENROWSET` function:"
],
"metadata": {
"azdata_cell_guid": "b0bfbef2-8271-48da-be46-9d102c04ae3e"
}
},
{
"cell_type": "code",
"source": [
"select top 10 *\r\n",
"from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',\r\n",
" format='parquet') as a"
],
"metadata": {
"azdata_cell_guid": "fef36ba3-39d7-4dda-a544-29d301e85724"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"Here we can see that some of the columns interesting for analysis are `DATE_REP` and `CASES`. I would like to analyze number of cases reported in Serbia, so I would need to filter the results using `GEO_ID` column.\n",
"\n",
"We are not sure what is `geo_id` value for Serbia, so we will find all distinct countries and `geo_id` values where country is something like Serbia:"
],
"metadata": {
"azdata_cell_guid": "892b5348-a006-45bb-bbab-9d3eb935c643"
}
},
{
"cell_type": "code",
"source": [
"select distinct countries_and_territories, geo_id\r\n",
"from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',\r\n",
" format='parquet') as a\r\n",
"where countries_and_territories like '%ser%'"
],
"metadata": {
"azdata_cell_guid": "d9a34e8f-8ee8-498a-87c0-d0210ba08bd0"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"Since we see that `GEO_ID` for Serbia is `RS`, we can find dayly number of cases in Serbia:"
],
"metadata": {
"azdata_cell_guid": "934198a6-a06b-42cb-a027-3f30614ca0f6"
}
},
{
"cell_type": "code",
"source": [
"select DATE_REP, CASES\r\n",
"from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',\r\n",
" format='parquet') as a\r\n",
"where geo_id = 'RS'\r\n",
"order by date_rep"
],
"metadata": {
"azdata_cell_guid": "c399cbd1-bf74-4ea4-8fa4-72c486270ae7"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"We can show this in the chart to see trend analysis of reported COVID cases in Serbia. By looking at this chart, we can see that the peek is somewhere between 15th and 20th April and the peak in the second wave is second half of July.\n",
"\n",
"The points on time series charts are shown per daily basis. This might lead to daily variation, so you might want to show the graph with average values calculated in the window with +/- 1-2 days. T-SQL enables you to easily calculate average values if you specify time window:\n",
"\n",
"```\n",
"AVG(CASES) OVER(order by date_rep ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING )\n",
"```\n",
"\n",
"We need to specify how to locally order data and number of preceding/following rows that AVG function should use to calculate the average value within the window. The time series query that uses average values is shown on the following code:"
],
"metadata": {
"azdata_cell_guid": "91e78e94-3aa0-46a4-ba9c-2b0a046f0f6c"
}
},
{
"cell_type": "code",
"source": [
"select DATE_REP,\r\n",
" CASES_AVG = AVG(CASES) OVER(ORDER BY date_rep ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING )\r\n",
"from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet', format='parquet') as a\r\n",
"where geo_id = 'RS'\r\n",
"order by date_rep"
],
"metadata": {
"azdata_cell_guid": "75f965dd-4565-4561-9dcd-fbf517bb5250"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"We can also show cumulative values to see increase of the number of cases over time (this is known as running total):"
],
"metadata": {
"azdata_cell_guid": "3d861e0d-a2a6-4599-a84d-620479c66c68"
}
},
{
"cell_type": "code",
"source": [
"select DATE_REP,\r\n",
" CUMULATIVE = SUM(CASES) OVER (ORDER BY date_rep)\r\n",
"from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',\r\n",
" format='parquet') as a\r\n",
"where geo_id = 'RS'\r\n",
"order by date_rep"
],
"metadata": {
"azdata_cell_guid": "f7b7457e-2e58-43f3-908e-0e564b262a7b"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"If we switch to chart we can see cumulative number of cases that are reported since the first COVID case.\n",
"\n",
"SQL language enables us to easily lookup number of reported cases couple of days after or before using LAG and LEAD functions. the following query will return number of cases reported 7 days ago:"
],
"metadata": {
"azdata_cell_guid": "2dc5f582-3371-41fb-9e83-293a20876beb"
}
},
{
"cell_type": "code",
"source": [
"select TOP 10 date_rep,\r\n",
" cases,\r\n",
" prev = LAG(CASES, 7) OVER(partition by geo_id order by date_rep )\r\n",
"from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',\r\n",
" format='parquet') as a\r\n",
"where geo_id = 'RS'\r\n",
"order by date_rep desc;"
],
"metadata": {
"azdata_cell_guid": "3ce22b19-d0c8-4e33-ac99-65616680fbb7"
},
"outputs": [],
"execution_count": null
},
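{
"cell_type": "markdown",
"source": [
"For comparison, the `LEAD` function looks the specified number of rows ahead instead of behind. The following query is a minimal sketch that reuses the same file and columns (the `prev_week` and `next_week` aliases are just illustrative); rows close to the most recent date have no row 7 days ahead, so `next_week` is NULL for them:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"select TOP 10 date_rep,\r\n",
"       cases,\r\n",
"       -- number of cases reported 7 days before the current row\r\n",
"       prev_week = LAG(CASES, 7) OVER(partition by geo_id order by date_rep),\r\n",
"       -- number of cases reported 7 days after the current row\r\n",
"       next_week = LEAD(CASES, 7) OVER(partition by geo_id order by date_rep)\r\n",
"from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',\r\n",
"                format='parquet') as a\r\n",
"where geo_id = 'RS'\r\n",
"order by date_rep desc;"
],
"metadata": {},
"outputs": [],
"execution_count": null
},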
{
"cell_type": "markdown",
"source": [
"You can notice in the result that prev column lag 7 days to the current column. Now we can easily compare the difference between the current number of reported cases of the number of reported cases reported or percent of increase:\n",
"\n",
"```\n",
"WoW% = (cases - prev) / prev\n",
" = cases/prev - 1\n",
"```\n",
"\n",
"Instead of simple comparison of current and previous value, we can make this more reliable and first calculate the average values in the 7-day windows and then calculate increase using these values:"
],
"metadata": {
"azdata_cell_guid": "6a888617-355b-4b9d-ac22-352aa0661a20"
}
},
{
"cell_type": "code",
"source": [
"with ecdc as (\r\n",
" select\r\n",
" date_rep,\r\n",
" cases = AVG(CASES) OVER(partition by geo_id order by date_rep ROWS BETWEEN 7 PRECEDING AND CURRENT ROW ),\r\n",
" prev = AVG(CASES) OVER(partition by geo_id order by date_rep ROWS BETWEEN 14 PRECEDING AND 7 PRECEDING )\r\n",
" from\r\n",
" openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',\r\n",
" format='parquet') as a\r\n",
" where\r\n",
" geo_id = 'RS'\r\n",
")\r\n",
"select date_rep, cases, prev, [WoW%] = 100*(1.0*cases/prev - 1)\r\n",
"from ecdc\r\n",
"where prev > 10\r\n",
"order by date_rep desc;"
],
"metadata": {
"azdata_cell_guid": "c5828567-0d08-4f43-bfd1-34c78e643343"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"This query will calculate the average number of cases in 7-day window and calculate week over week change.\n",
"\n",
"We can go step further and use the same query to run analysis across all countries in the world to calculate weekly changes and find the countries with the highest increase of COVID cases compared to the previous week."
],
"metadata": {
"azdata_cell_guid": "08ea9412-2c60-4a7c-affc-2eb6115f20b3"
}
},
{
"cell_type": "code",
"source": [
"with weekly_cases as (\r\n",
" select geo_id, date_rep, country = countries_and_territories,\r\n",
" current_avg = AVG(CASES) OVER(partition by geo_id order by date_rep ROWS BETWEEN 7 PRECEDING AND CURRENT ROW ),\r\n",
" prev_avg = AVG(CASES) OVER(partition by geo_id order by date_rep ROWS BETWEEN 14 PRECEDING AND 7 PRECEDING )\r\n",
" from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',\r\n",
" format='parquet') as a \r\n",
")\r\n",
"select top 10 \r\n",
" country, \r\n",
" current_avg,\r\n",
" prev_avg, \r\n",
" [WoW%] = CAST((100*(1.* current_avg / prev_avg - 1)) AS smallint)\r\n",
"from weekly_cases\r\n",
"where date_rep = CONVERT(date, DATEADD(DAY, -1, GETDATE()), 23)\r\n",
"and prev_avg > 100\r\n",
"order by (1. * current_avg / prev_avg -1) desc"
],
"metadata": {
"azdata_cell_guid": "f338436e-d98d-4c1b-8c61-3ad838059cc0"
},
"outputs": [],
"execution_count": null
}
]
}