diff --git a/demos/wind-turbine/README.md b/demos/wind-turbine/README.md
index 95349b6b..d5c1f025 100644
--- a/demos/wind-turbine/README.md
+++ b/demos/wind-turbine/README.md
@@ -1,39 +1,68 @@
-# Wind Turbine (Spark Demo)
+# Wind Turbine (Spark - Livy - Sparkmagic)
+
+In this demonstration, you use Spark to explore a dataset and train a Gradient-Boosted Tree (GBT) regressor that
+leverages various features, such as wind speed and direction, to estimate the power output of a wind turbine.
![wind-farm](images/wind-farm.jpg)
-Welcome! In this experiment we delve into the world of wind turbines
-and harness the power of machine learning to predict their energy production!
-In this demonstration, we will be using Spark to explore the training dataset
-and train a Gradient-Boosted Tree (GBT) regressor that will utilize various
-features, such as wind speed and direction, to estimate the power output of a
-wind turbine.
+Wind turbines hold tremendous potential as a sustainable source of energy, capable of supplying a substantial portion
+of the world's power needs. However, the inherent unpredictability of power generation poses a challenge when it comes
+to optimizing this process.
-Wind turbines hold tremendous potential as a sustainable source of energy,
-capable of supplying a substantial portion of the world's power needs. However,
-the inherent unpredictability of power generation poses a challenge when it
-comes to optimizing this process.
+Fortunately, you have a powerful tool at your disposal: Machine Learning (ML). By leveraging advanced algorithms and data
+analysis, you can develop models that accurately predict the power production of wind turbines. This enables you to
+optimize the power generation process and overcome the challenges associated with its ingrained variability.
-Fortunately, we have a powerful tool at our disposal: machine learning. By
-leveraging advanced algorithms and data analysis, we can develop models that
-accurately predict the power production of wind turbines. This enables us to
-optimize the power generation process and overcome the challenges associated
-with its ingrained variability.
+1. [What You'll Need](#what-youll-need)
+1. [Procedure](#procedure)
+1. [How it Works](#how-it-works)
+1. [References](#references)
## What You'll Need
-To complete the tutorial follow the steps below:
+For this tutorial, ensure you have:
+
+- Access to an HPE Ezmeral Unified Analytics cluster.
+
+## Procedure
+
+To complete this tutorial, follow the steps below:
+
+1. Log in to your Ezmeral Unified Analytics (EzUA) cluster using your credentials.
+1. Create a new Notebook server using the `jupyter-data-science` image. Request at least `4Gi` of memory for the
+ Notebook server.
+1. Connect to the Notebook server and clone the repository locally (see the example command after this list).
+1. Navigate to the tutorial's directory (`ezua-tutorials/demos/wind-turbine`).
+1. Launch a new terminal window and create a new conda environment using the specified `environment.yaml` file:
+
+ ```bash
+ conda env create -f environment.yaml
+ ```
+
+1. Add the new conda environment as an ipykernel:
+
+ ```bash
+ python -m ipykernel install --user --name=wind-turbine
+ ```
+
+1. Refresh your browser tab to access the updated environment.
+1. Launch the `wind-turbine.ipynb` notebook file and follow the instructions. Make sure to select the `wind-turbine`
+ environment kernel.
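+
+For the clone step above, a minimal sequence looks like the following. The repository URL is an assumption; substitute
+the location where `ezua-tutorials` is hosted for your environment:
+
+```bash
+# Hypothetical URL - point this at the ezua-tutorials repository you have access to
+git clone https://github.com/HewlettPackard/ezua-tutorials.git
+cd ezua-tutorials/demos/wind-turbine
+```
+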
-1. Login to you EzAF cluster.
-1. Create a new notebook server using the `jupyter-data-science` image.
-1. Clone the repository locally.
-1. Launch the `wind-turbine.ipynb` notebook file and follow the instructions.
+## How it Works
-> It is recommended to create a Notebook server with more than 3Gi of memory
+In this tutorial, you use Livy and Sparkmagic to remotely execute Python code in a Spark cluster. Livy is an open-source
+REST service that enables remote and interactive analytics on Apache Spark clusters. It provides a way to interact with
+Spark clusters programmatically using a REST API, allowing you to submit Spark jobs, run interactive queries, and manage
+Spark sessions.
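+
+To make this concrete, the sketch below shows roughly what talking to Livy's REST API directly looks like, using the
+in-cluster endpoint referenced later in the notebook (the session and statement IDs are illustrative; the real ones come
+back in Livy's responses). You never have to run these commands yourself, because Sparkmagic issues equivalent requests
+on your behalf:
+
+```bash
+# Create an interactive PySpark session
+curl -s -X POST http://livy-0.livy-svc.spark.svc.cluster.local:8998/sessions \
+  -H "Content-Type: application/json" \
+  -d '{"kind": "pyspark"}'
+
+# Run a statement in session 0, then poll for its result
+curl -s -X POST http://livy-0.livy-svc.spark.svc.cluster.local:8998/sessions/0/statements \
+  -H "Content-Type: application/json" \
+  -d '{"code": "1 + 1"}'
+curl -s http://livy-0.livy-svc.spark.svc.cluster.local:8998/sessions/0/statements/0
+```
+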
-> If you created your notebook server using the `jupyter-data-science` image
-> you should be good to go. However, you can always create a separate conda
-> environment for this tutorial using the provided `environment.yaml` file.
-> To create a separate enviroment run `conda env create -f environment.yaml`.
+To communicate with Livy and manage your sessions, you use Sparkmagic, an open-source tool that provides a Jupyter kernel
+extension. Sparkmagic integrates with Livy to provide the underlying communication layer between the Jupyter kernel and
+the Spark cluster.
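+
+The notebook drives this through a handful of Jupyter magics provided by Sparkmagic. As a rough sketch, with each
+snippet living in its own notebook cell:
+
+```python
+# Load the Sparkmagic extension and open the endpoint/session management widget
+%load_ext sparkmagic.magics
+%manage_spark
+```
+
+```python
+%%spark
+# A cell starting with %%spark runs remotely in the Livy-managed Spark session
+spark.range(5).show()
+```
+
+```python
+%%spark -o sample_df
+# -o copies the named (small) Spark DataFrame back to the local kernel as pandas
+sample_df = spark.range(5).toDF("n")
+```
+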
+## References
+1. [Spark: Unified engine for large-scale data analytics](https://spark.apache.org/)
+1. [Livy: A REST Service for Apache Spark](https://livy.apache.org/)
+1. [Sparkmagic: Jupyter magics and kernels for working with remote Spark clusters](https://github.com/jupyter-incubator/sparkmagic)
+1. [Wind Turbine Scada Dataset](https://www.kaggle.com/datasets/berkerisen/wind-turbine-scada-dataset/data)
diff --git a/demos/wind-turbine/images/spark-interactive-ui.png b/demos/wind-turbine/images/spark-interactive-ui.png
new file mode 100644
index 00000000..39512b38
Binary files /dev/null and b/demos/wind-turbine/images/spark-interactive-ui.png differ
diff --git a/demos/wind-turbine/wind-turbine.ipynb b/demos/wind-turbine/wind-turbine.ipynb
index 4c231058..e09e0297 100644
--- a/demos/wind-turbine/wind-turbine.ipynb
+++ b/demos/wind-turbine/wind-turbine.ipynb
@@ -7,30 +7,30 @@
"tags": []
},
"source": [
- "# Wind Turbine (Spark Demo)\n",
+ "# Wind Turbines\n",
"\n",
- "Welcome! In this experiment we delve into the world of wind turbines\n",
- "and harness the power of machine learning to predict their energy production!\n",
- "In this demonstration, we will be using Spark to explore and augment the training\n",
- "dataset and train a Gradient-Boosted Tree (GBT) regressor that will utilize various\n",
- "features, such as wind speed and direction, to estimate the power output of a\n",
- "wind turbine.\n",
+ "In this tutorial you delve into the world of wind turbines and harness the power of Machine Learning (ML) to predict\n",
+ "their energy production. To this end, you use Spark to explore and augment a training dataset, train a Gradient-Boosted\n",
+ "Tree (GBT) regressor, and predict the power output of a wind turbine.\n",
"\n",
"
\n",
"
(Image generated by Stable Diffusion)
\n",
"
\n",
"\n",
+ "Wind turbines hold tremendous potential as a sustainable source of energy, capable of supplying a substantial portion of\n",
+ "the world's power needs. However, the inherent unpredictability of power generation poses a challenge when it comes to\n",
+ "optimizing this process.\n",
"\n",
- "Wind turbines hold tremendous potential as a sustainable source of energy,\n",
- "capable of supplying a substantial portion of the world's power needs. However,\n",
- "the inherent unpredictability of power generation poses a challenge when it\n",
- "comes to optimizing this process.\n",
+ "Fortunately, you have a powerful tool at your disposal: Machine Learning. By leveraging advanced algorithms and data\n",
+ "analysis, you can develop models that accurately predict the power production of wind turbines. This enables you to\n",
+ "optimize the power generation process and overcome the challenges associated with its ingrained variability.\n",
"\n",
- "Fortunately, we have a powerful tool at our disposal: machine learning. By\n",
- "leveraging advanced algorithms and data analysis, we can develop models that\n",
- "accurately predict the power production of wind turbines. This enables us to\n",
- "optimize the power generation process and overcome the challenges associated\n",
- "with its ingrained variability."
+ "## Table of Contents:\n",
+ "\n",
+ "* [Create a Spark Interactive Session](#create-a-spark-interactive-session)\n",
+ "* [Load the Dataset](#load-the-dataset)\n",
+ "* [Data Exploration](#data-exploration)\n",
+ "* [Training a GBT Regression](#training-a-gbt-regressor)"
]
},
{
@@ -42,38 +42,31 @@
"source": [
"## Create a Spark Interactive Session\n",
"\n",
- "Let's begin! In this demo you'll be using Livy to create and manage an interactive\n",
- "Spark session. Livy is an open-source REST service that enables remote and interactive\n",
- "analytics on Apache Spark clusters. It provides a way to interact with Spark clusters\n",
- "programmatically using a REST API, allowing you to submit Spark jobs, run interactive\n",
- "queries, and manage Spark sessions.\n",
- "\n",
- "First you need to connect to the Livy endpoint and create a new Spark interactive session.\n",
- "This session will allow you to interact with Spark using your familiar notebook\n",
- "environment, and execute Spark code to perform data processing tasks in an interactive manner.\n",
+ "Let's begin! In this demo you use Livy to create and manage an interactive Spark session. Livy is an open-source REST\n",
+ "service that enables remote and interactive analytics on Apache Spark clusters. It provides a way to interact with Spark\n",
+ "clusters programmatically using a REST API, allowing you to submit Spark jobs, run interactive queries, and manage Spark\n",
+ "sessions.\n",
"\n",
- "The Spark interactive session is particularly useful for exploratory data analysis,\n",
- "prototyping, and iterative development. It allows you to interactively work with large datasets,\n",
- "perform transformations, apply analytical operations, and build machine learning models using\n",
- "Spark's distributed computing capabilities. This is exactly what you'll do in this Notebook!\n",
+ "First, you need to connect to the Livy endpoint and create a new Spark interactive session. The Spark interactive\n",
+ "session is particularly useful for exploratory data analysis, prototyping, and iterative development. It allows you to\n",
+ "interactively work with large datasets, perform transformations, apply analytical operations, and build machine learning\n",
+ "models using Spark's distributed computing capabilities. This is exactly what you do in this Notebook!\n",
"\n",
- "To communicate with Livy and manage your sessions you'll be using sparkmagic, an open-source\n",
- "tool that provides a Jupyter kernel extension. Sparkmagic integrates with Livy, to provide\n",
- "the underlying communication layer between the Jupyter kernel and the Spark cluster.\n",
+ "To communicate with Livy and manage your sessions you use Sparkmagic, an open-source tool that provides a Jupyter kernel\n",
+ "extension. Sparkmagic integrates with Livy, to provide the underlying communication layer between the Jupyter kernel and\n",
+ "the Spark cluster.\n",
"\n",
"Execute the cell below and:\n",
"\n",
- "1. Select `Add Endpoint`\n",
- "1. Select `Basic_Access`, paste your Livy endpoint and authenticate with your credentials\n",
- "1. Select `Create Session`\n",
- "1. Provide a name select `python` and click `Create Session`\n",
+ "1. Select `Add Endpoint`.\n",
+ "1. Select `Single Sign-On`, keep existing Livy endpoint (should look like http://livy-0.livy-svc.spark.svc.cluster.local:8998).\n",
+ "1. Select `Create Session`.\n",
+ "1. Provide a name, select the `python` language, and click `Create Session`.\n",
"\n",
- "When your session is ready the `Manage Sessions` pane will become active, providing you the session ID.\n",
- "The session state will become `idle` which means that you are good to go!\n",
+ "Give it a few minutes for your session to initialize. Once ready, the Manage Sessions pane will activate, displaying\n",
+ "your session ID. When the session state turns to idle, you're all set. This process can take up to five minutes.\n",
"\n",
- "> To configure Sparkmagic, you can make use of a config.json file located at ~/.sparkmagic/config.json.\n",
- "> If you're running this notebook in a server that was created using the jupyter-data-science image,\n",
- "> the default settings should suffice, and no additional configuration is required."
+ "> To further configure Sparkmagic, you can make use of a config.json file located at ~/.sparkmagic/config.json."
]
},
{
@@ -96,16 +89,48 @@
"tags": []
},
"source": [
- "You are now prepared to embark on your first interaction with Livy.\n",
- "Whenever you initiate a cell with the `%%spark` magic command, the code within\n",
- "that cell will not be executed by your IPython kernel. Instead, it will be executed\n",
- "within a Spark context managed by Livy. Rest assured, Livy will handle everything\n",
- "seamlessly and provide you with a response containing the desired results.\n",
+ "You can also use the EzUA UI to view the logs and code executions:\n",
+ "\n",
+ "![spark-interactive-ui](images/spark-interactive-ui.png)\n",
+ "\n",
+ "You are now prepared to embark on your first interaction with Livy. Whenever you initiate a cell with the `%%spark`\n",
+ "magic command, the code within that cell is not executed locally. Instead, it is executed within a Spark context managed\n",
+ "by Livy. Livy handles everything seamlessly and provide you with a response containing the results.\n",
"\n",
- "Moreover, with Sparkmagic at your side, you can leave the networking part to it. Sparkmagic\n",
- "takes care of all the intricate details, effortlessly creating and sending requests to\n",
- "the Livy server. Finally, it seamlessly renders the response within the Jupyter user interface,\n",
- "ensuring a smooth and hassle-free experience."
+ "With Sparkmagic at your side, you can leave the networking part to it. Sparkmagic takes care of all the intricate\n",
+ "details, effortlessly creating and sending requests to the Livy server. Finally, it seamlessly renders the response\n",
+ "within the Jupyter user interface, ensuring a smooth and hassle-free experience.\n",
+ "\n",
+ "Before you begin, let's make sure that the dataset is copied in the right location:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e78f2f94-4ad9-41c9-a7f6-cef590163e62",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import shutil\n",
+ "\n",
+ "# create the `auto-spark` folder if it does not exist\n",
+ "os.makedirs(f\"{os.getenv('HOME')}/shared/auto-spark/\", exist_ok=True)\n",
+ "\n",
+ "# copy the file to the right location\n",
+ "dataset_file = \"dataset/T1.csv\"\n",
+ "destination = f\"{os.getenv('HOME')}/shared/auto-spark/T1.csv\"\n",
+ "shutil.copyfile(dataset_file, destination)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d478d922-8640-43a6-8802-c6bc577f508d",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "Next, let's import the libraries you use inside the Spark session and get a handle on it."
]
},
{
@@ -177,11 +202,9 @@
"source": [
"## Load the Dataset\n",
"\n",
- "It's time to load the dataset, which comes in a convenient CSV format.\n",
- "You'll leverage the spark session you created to load it as a DataFrame. Once loaded,\n",
- "you can delve into its contents by examining the first five rows and its schema.\n",
- "Additionally, you'll get an idea of the dataset's size by printing the number of\n",
- "examples available to us."
+ "It's time to load the dataset, which comes in a convenient CSV format. You leverage the spark session you created to\n",
+ "load it as a DataFrame. Once loaded, you can delve into its contents by examining the first five rows and its schema. \n",
+ "Additionally, you get an idea of the dataset's size by printing the number of examples available to us."
]
},
{
@@ -194,7 +217,9 @@
"outputs": [],
"source": [
"%%spark\n",
- "spark_df = spark.read.csv('file:///mounts/shared-volume/shared/spark/T1.csv', header=True, inferSchema=True)"
+ "spark_df = spark.read.csv(\n",
+ " 'file:///mounts/shared-volume/shared/auto-spark/T1.csv',\n",
+ " header=True, inferSchema=True)"
]
},
{
@@ -204,10 +229,9 @@
"tags": []
},
"source": [
- "Next, you should cache the dataset in memory. The `cache()` method is used to persist\n",
- "or cache the contents of a DataFrame, Dataset, or RDD (Resilient Distributed Dataset)\n",
- "in memory. Caching data in memory can significantly improve the performance of iterative\n",
- "algorithms or repeated computations by avoiding the need to recompute or fetch the data\n",
+ "Next, you should cache the dataset in memory. The `cache()` method is used to persist or cache the contents of a\n",
+ "DataFrame, Dataset, or RDD (Resilient Distributed Dataset) in memory. Caching data in memory can significantly improve\n",
+ "the performance of iterative algorithms or repeated computations by avoiding the need to recompute or fetch the data\n",
"from disk."
]
},
@@ -244,11 +268,10 @@
"tags": []
},
"source": [
- "In this experiment, the objective is to predict the power production of a wind turbine\n",
- "(`lv activepower (kw)`) based on the dataset's other features. These features include\n",
- "the current date and hour, the wind speed, and the wind direction. By analyzing the\n",
- "relationships between these variables, you aim to uncover valuable insights and create\n",
- "a predictive model that can estimate the power output of the turbine."
+ "In this experiment, the objective is to predict the power production of a wind turbine (`lv activepower (kw)`) based on\n",
+ "the other features. These features include the current date and hour, the wind speed, and the wind direction. By\n",
+ "analyzing the relationships between these variables, you aim to uncover valuable insights and create a predictive model\n",
+ "that can estimate the power output of the turbine."
]
},
{
@@ -260,8 +283,8 @@
"source": [
"## Data exploration\n",
"\n",
- "Let's start exploring and transforming the dataset. First step, separate the `date/time`\n",
- "column into two separate columns, one for the month and one for the hour of the day."
+ "Let's start exploring and transforming the dataset. First step, separate the `date/time` column into two distinct\n",
+ "columns, one for the month and one for the hour of the day."
]
},
{
@@ -292,9 +315,9 @@
"tags": []
},
"source": [
- "Moving forward, let's examine some essential statistical characteristics of our features,\n",
- "specifically the mean and standard deviation. However, it's important to note that these statistics\n",
- "are relevant only for the wind speed, theoretical power curve, and active power variables."
+ "Moving forward, let's examine some essential statistical characteristics of the features, specifically the mean and\n",
+ "standard deviation. However, it's important to note that these statistics are relevant only for the wind speed,\n",
+ "theoretical power curve, and active power variables."
]
},
{
@@ -318,10 +341,9 @@
"tags": []
},
"source": [
- "Let's extract a random sample from the dataset and start creating a visual model.\n",
- "By visualizing the data, we can gain a deeper understanding of its patterns, trends, and relationships.\n",
- "Through this visual exploration, we'll uncover valuable insights that can guide us in our analysis\n",
- "and decision-making processes."
+ "Let's extract a random sample from the dataset and start creating a visual model. By visualizing the data, you can gain\n",
+ "a deeper understanding of its patterns, trends, and relationships. Through this visual exploration, you uncover valuable\n",
+ "insights that can guide you through your analysis and decision-making processes."
]
},
{
@@ -405,6 +427,7 @@
"sample_df[columns].corr()\n",
"plt.clf()\n",
"sns.pairplot(sample_df[columns], markers='*');\n",
+ "\n",
"%matplot plt"
]
},
@@ -627,7 +650,7 @@
" sns.boxplot(x=df[each])\n",
" # plt.title(each)\n",
" i += 1\n",
- " \n",
+ "\n",
"%matplot plt"
]
},
@@ -753,11 +776,10 @@
"source": [
"# Training a GBT Regressor\n",
"\n",
- "You are now ready to train our GBT regresson. To begin, you'll carefully specify\n",
- "the features and the label for our model. Then, we'll split the dataset into training\n",
- "and test subsets to ensure robust evaluation of our model's performance.\n",
- "Finally, we'll initiate the training process, allowing our GBT regressor to learn\n",
- "from the training data and make accurate predictions."
+ "You are now ready to train the GBT regresson. To begin, you carefully specify the features and the label for the model.\n",
+ "Then, you split the dataset into training and test subsets to ensure robust evaluation of the model's performance.\n",
+ "Finally, you initiate the training process, allowing the GBT regressor to learn from the training data, and generate\n",
+ "accurate predictions."
]
},
{
@@ -837,7 +859,7 @@
"outputs": [],
"source": [
"%%spark\n",
- "gbm_model.write().overwrite().save(\"file:///mounts/shared-volume/user/spark/GBM.model\")"
+ "gbm_model.write().overwrite().save('file:////mounts/shared-volume/user/auto-spark/GBM.model')"
]
},
{
@@ -850,7 +872,7 @@
"outputs": [],
"source": [
"%%spark\n",
- "gbm_model_dtap = GBTRegressionModel.load(\"file:///mounts/shared-volume/user/spark/GBM.model\")"
+ "gbm_model_dtap = GBTRegressionModel.load(\"file:///mounts/shared-volume/user/auto-spark/GBM.model\")"
]
},
{
@@ -919,9 +941,9 @@
],
"metadata": {
"kernelspec": {
- "display_name": "Python 3 (ipykernel)",
+ "display_name": "wind-turbine",
"language": "python",
- "name": "python3"
+ "name": "wind-turbine"
},
"language_info": {
"codemirror_mode": {
diff --git a/demos/wind-turbine/wind-turbine.py b/demos/wind-turbine/wind-turbine.py
deleted file mode 100644
index 337999ec..00000000
--- a/demos/wind-turbine/wind-turbine.py
+++ /dev/null
@@ -1,377 +0,0 @@
-import pyspark
-import numpy as np
-import pandas as pd
-import matplotlib.pyplot as plt
-import seaborn as sns
-
-from math import radians
-from warnings import filterwarnings
-
-from pyspark.sql import SparkSession
-from pyspark.conf import SparkConf
-from pyspark import SparkContext
-from pyspark.sql.functions import substring
-from pyspark.sql.types import IntegerType
-from pyspark.sql import functions as F
-from pyspark.ml.feature import VectorAssembler
-from pyspark.ml.regression import GBTRegressor
-from pyspark.ml.regression import GBTRegressionModel
-from pyspark.ml.evaluation import RegressionEvaluator
-from py4j.java_gateway import java_import
-
-
-sns.set_style('white')
-filterwarnings('ignore')
-
-java_import(spark._sc._jvm, "org.apache.spark.sql.api.python.*")
-
-# Reading Dataset
-spark_df = spark.read.csv(
- 'file:///mounts/shared-volume/spark/T1.csv',
- header=True, inferSchema=True)
-
-# Caching the dataset
-spark_df.cache()
-
-# Converting all the column names to lower case
-spark_df = spark_df.toDF(*[c.lower() for c in spark_df.columns])
-
-print('Show the first 5 rows')
-print(spark_df.show(5))
-
-print('What are the variable data types?')
-print(spark_df.printSchema())
-
-print('How many observations do we have?')
-print(spark_df.count())
-
-# Extracting a substring from columns to create month and hour variables
-spark_df = spark_df.withColumn("month", substring("date/time", 4, 2))
-spark_df = spark_df.withColumn("hour", substring("date/time", 12, 2))
-
-# Converting string month and hour variables to integer
-spark_df = spark_df.withColumn('month', spark_df.month.cast(IntegerType()))
-spark_df = spark_df.withColumn('hour', spark_df.hour.cast(IntegerType()))
-
-print(spark_df.show(5))
-
-# Exploratory Data Analysis
-pd.options.display.float_format = '{:.2f}'.format
-spark_df.select('wind speed (m/s)', 'theoretical_power_curve (kwh)',
- 'lv activepower (kw)').toPandas().describe()
-
-# Taking a random sample from the big data
-sample_df = spark_df.sample(withReplacement=False,
- fraction=0.1, seed=42).toPandas()
-
-# Visualizing the distributions with the sample data
-columns = ['wind speed (m/s)', 'wind direction (deg)', 'month', 'hour',
- 'theoretical_power_curve (kwh)', 'lv activepower (kw)']
-i = 1
-plt.figure(figsize=(10, 12))
-for each in columns:
- plt.subplot(3, 2, i)
- sample_df[each].plot.hist(bins=12)
- plt.title(each)
- i += 1
-
-# Question: Is there any difference between the months
-# for average power production?
-
-plt.clf()
-
-# Average power production by month
-monthly = (spark_df.groupby('month')
- .mean('lv activepower (kw)')
- .sort('avg(lv activepower (kw))')
- .toPandas())
-sns.barplot(x='month', y='avg(lv activepower (kw))', data=monthly)
-plt.title('Months and Average Power Production')
-print(monthly)
-
-# Question: Is there any difference between the hours for average power
-# production?
-
-plt.clf()
-
-# Average power production by hour
-hourly = (spark_df.groupby('hour')
- .mean('lv activepower (kw)')
- .sort('avg(lv activepower (kw))')
- .toPandas())
-sns.barplot(x='hour', y='avg(lv activepower (kw))', data=hourly)
-plt.title('Hours and Average Power Production')
-print(hourly)
-
-# Question: Is there any correlation between the wind speed,
-# wind direction and power production?
-pd.set_option('display.max_columns', None)
-sample_df[columns].corr()
-plt.clf()
-sns.pairplot(sample_df[columns], markers='*')
-
-# Question: What is the average power production level for
-# different wind speeds?
-# Finding average power production for 5 m/s wind speed increments
-wind_speed = []
-avg_power = []
-for i in [0, 5, 10, 15, 20]:
- avg_value = (spark_df.filter((spark_df['wind speed (m/s)'] > i)
- & (spark_df['wind speed (m/s)'] <= i+5))
- .agg({'lv activepower (kw)': 'mean'})
- .collect()[0][0])
- avg_power.append(avg_value)
- wind_speed.append(str(i) + '-' + str(i+5))
-
-plt.clf()
-sns.barplot(x=wind_speed, y=avg_power, color='orange')
-plt.title('Avg Power Production for 5 m/s Wind Speed Increments')
-plt.xlabel('Wind Speed')
-plt.ylabel('Average Power Production')
-
-# Question: What is the power production for different wind directions and
-# speeds?
-
-# Creating the polar diagram
-plt.clf()
-plt.figure(figsize=(8, 8))
-ax = plt.subplot(111, polar=True)
-# Inside circles are the wind speed and marker color and size represents the
-# amount of power production
-sns.scatterplot(x=[radians(x) for x in sample_df['wind direction (deg)']],
- y=sample_df['wind speed (m/s)'],
- size=sample_df['lv activepower (kw)'],
- hue=sample_df['lv activepower (kw)'],
- alpha=0.7, legend=None)
-# Setting the polar diagram's top represents the North
-ax.set_theta_zero_location('N')
-# Setting -1 to start the wind direction clockwise
-ax.set_theta_direction(-1)
-# Setting wind speed labels in a better position to see
-ax.set_rlabel_position(110)
-plt.title('Wind Speed - Wind Direction - Power Production Diagram')
-plt.ylabel(None)
-
-# Question: Does the manufacturer's theoritical power production curve fit
-# well with the real production?
-
-plt.clf()
-plt.figure(figsize=(10, 6))
-sns.scatterplot(x='wind speed (m/s)', y='lv activepower (kw)', color='orange',
- label='Real Production', alpha=0.5, data=sample_df)
-sns.lineplot(x='wind speed (m/s)', y='theoretical_power_curve (kwh)',
- color='blue', label='Theoritical Production', data=sample_df)
-plt.title('Wind Speed and Power Production Chart')
-plt.ylabel('Power Production (kw)')
-
-# Question: What is the wind speed threshold value for zero theorical power?
-# Filter the big data where the real and theoritical power productions are
-# equal to 0.
-zero_theo_power = (spark_df.filter(
- (spark_df['lv activepower (kw)'] == 0)
- & (spark_df['theoretical_power_curve (kwh)'] == 0))
- .toPandas())
-
-print(zero_theo_power[['wind speed (m/s)', 'theoretical_power_curve (kwh)',
- 'lv activepower (kw)']].sample(5))
-plt.clf()
-# Let's see the wind speed distribution for 0 power production
-zero_theo_power['wind speed (m/s)'].hist()
-plt.title('Wind Speed Distribution for 0 Power Production')
-plt.xlabel('Wind speed (m/s)')
-plt.ylabel('Counts for 0 Power Production')
-
-# Question: Why there aren't any power production in some observations while
-# the wind speed is higher than 3 m/s?
-
-# Observations for the wind speed > 3m/s and power production = 0,
-# While theoritically there should be power production
-zero_power = spark_df.filter((spark_df['lv activepower (kw)'] == 0)
- & (spark_df['theoretical_power_curve (kwh)'] != 0)
- & (spark_df['wind speed (m/s)'] > 3)).toPandas()
-print(zero_power.head())
-print('No of Observations (while Wind Speed > 3 m/s'
- ' and Power Production = 0): ', len(zero_power))
-
-plt.clf()
-zero_power['wind speed (m/s)'].plot.hist(bins=8)
-plt.xlabel('Wind Speed (m/s)')
-plt.ylabel('Counts for Zero Production')
-plt.title('Wind Speed Counts for Zero Power Production')
-plt.xticks(ticks=np.arange(4, 18, 2))
-
-# Let's see the monthly distribution for zero power production.
-
-plt.clf()
-sns.countplot(zero_power['month'])
-
-# Excluding the observations meeting the filter criterias
-spark_df = spark_df.filter(~((spark_df['lv activepower (kw)'] == 0)
- & (spark_df['theoretical_power_curve (kwh)'] != 0)
- & (spark_df['wind speed (m/s)'] > 3)))
-spark_df.show(20)
-
-# Question: Is there any other outliers?
-
-columns = ['wind speed (m/s)', 'wind direction (deg)',
- 'theoretical_power_curve (kwh)', 'lv activepower (kw)']
-i = 1
-plt.clf()
-plt.figure(figsize=(20, 3))
-for each in columns:
- df = spark_df.select(each).toPandas()
- plt.subplot(1, 4, i)
- # plt.boxplot(df)
- sns.boxplot(x=df[each])
- plt.title(each)
- i += 1
-
-# We will find the upper and lower threshold values for the wind speed data,
-# and analyze the outliers.
-
-# Create a pandas df for visualization
-wind_speed = spark_df.select('wind speed (m/s)').toPandas()
-
-# Defining the quantiles and interquantile range
-Q1 = wind_speed['wind speed (m/s)'].quantile(0.25)
-Q3 = wind_speed['wind speed (m/s)'].quantile(0.75)
-IQR = Q3-Q1
-# Defining the lower and upper threshold values
-lower = Q1 - 1.5*IQR
-upper = Q3 + 1.5*IQR
-
-print('Quantile (0.25): ', Q1, ' Quantile (0.75): ', Q3)
-print('Lower threshold: ', lower, ' Upper threshold: ', upper)
-
-# Fancy indexing for outliers
-outlier_tf = ((wind_speed['wind speed (m/s)'] < lower)
- | (wind_speed['wind speed (m/s)'] > upper))
-
-print('Total Number of Outliers: ',
- len(wind_speed['wind speed (m/s)'][outlier_tf]))
-print('--'*15)
-print('Some Examples of Outliers:')
-print(wind_speed['wind speed (m/s)'][outlier_tf].sample(10))
-
-# Out of 47033, there is only 407 observations while the wind speed is over
-# 19 m/s.
-# Now Lets see average power production for these high wind speed
-(spark_df.select('wind speed (m/s)', 'lv activepower (kw)')
- .filter(spark_df['wind speed (m/s)'] >= 19)
- .agg({'lv activepower (kw)': 'mean'})
- .show())
-
-spark_df = spark_df.withColumn('wind speed (m/s)',
- F.when(F.col('wind speed (m/s)') > 19.447, 19)
- .otherwise(F.col('wind speed (m/s)')))
-spark_df.count()
-
-# Question: What are the general criterias for power production?
-
-# High level power production
-(spark_df.filter(
- ((spark_df['month'] == 3) | (spark_df['month'] == 8) | (spark_df['month'] == 11))
- & ((spark_df['hour'] >= 16) | (spark_df['hour'] <= 24))
- & ((spark_df['wind direction (deg)'] > 0) | (spark_df['wind direction (deg)'] < 90))
- & ((spark_df['wind direction (deg)'] > 180) | (spark_df['wind direction (deg)'] < 225)))
- .agg({'lv activepower (kw)': 'mean'})
- .show())
-
-# Low level power production
-(spark_df.filter(
- (spark_df['month'] == 7)
- & ((spark_df['hour'] >= 9) | (spark_df['hour'] <= 11))
- & ((spark_df['wind direction (deg)'] > 90) | (spark_df['wind direction (deg)'] < 160)))
- .agg({'lv activepower (kw)': 'mean'})
- .show())
-
-
-# Data Preparation for ML Algorithms
-# Preparing the independent variables (Features)
-# Converting lv activepower (kw) variable as label
-spark_df = spark_df.withColumn('label', spark_df['lv activepower (kw)'])
-
-# Defining the variables to be used
-variables = ['month', 'hour', 'wind speed (m/s)', 'wind direction (deg)']
-vectorAssembler = VectorAssembler(inputCols=variables,
- outputCol='features')
-va_df = vectorAssembler.transform(spark_df)
-
-# Combining features and label column
-final_df = va_df.select('features', 'label')
-final_df.show(10)
-
-# Train Test Split
-splits = final_df.randomSplit([0.8, 0.2])
-train_df = splits[0]
-test_df = splits[1]
-
-print('Train dataset: ', train_df.count())
-print('Test dataset : ', test_df.count())
-
-# Creating the Initial Model
-# Creating the gbm regressor object
-gbm = GBTRegressor(featuresCol='features', labelCol='label')
-
-# Training the model with train data
-gbm_model = gbm.fit(train_df)
-
-# Predicting using the test data
-y_pred = gbm_model.transform(test_df)
-
-# Initial look at the target and predicted values
-y_pred.select('label', 'prediction').show(20)
-
-# Store our model in user shared volume
-gbm_model.write().overwrite().save(
- 'file:///mounts/shared-volume/spark/GBM.model')
-
-gbm_model_dtap = GBTRegressionModel.load(
- "file:///mounts/shared-volume/spark/GBM.model")
-
-# Let's evaluate our model's success
-# Initial model success
-evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='label')
-
-print('R2:\t', evaluator.evaluate(y_pred, {evaluator.metricName: 'r2'}))
-print('MAE:\t', evaluator.evaluate(y_pred, {evaluator.metricName: 'mae'}))
-print('RMSE:\t', evaluator.evaluate(y_pred, {evaluator.metricName: 'rmse'}))
-
-# Comparing Real, Theoritical and Predicted Power Productions
-
-# Converting sample_df back to Spark dataframe
-eva_df = spark.createDataFrame(sample_df)
-
-# Converting lv activepower (kw) variable as label
-eva_df = eva_df.withColumn('label', eva_df['lv activepower (kw)'])
-
-# Defining the variables to be used
-variables = ['month', 'hour', 'wind speed (m/s)', 'wind direction (deg)']
-vectorAssembler = VectorAssembler(inputCols=variables, outputCol='features')
-vec_df = vectorAssembler.transform(eva_df)
-
-# Combining features and label column
-vec_df = vec_df.select('features', 'label')
-
-# Using ML model to predict
-preds = gbm_model.transform(vec_df)
-preds_df = preds.select('label', 'prediction').toPandas()
-
-# Compining dataframes to compare
-frames = [sample_df[['wind speed (m/s)', 'theoretical_power_curve (kwh)']],
- preds_df]
-sample_data = pd.concat(frames, axis=1)
-
-plt.clf()
-# Visualizing real, theoritical and predicted power production
-plt.figure(figsize=(10, 7))
-sns.scatterplot(x='wind speed (m/s)', y='label', alpha=0.5,
- label='Real Power', data=sample_data)
-sns.scatterplot(x='wind speed (m/s)', y='prediction', alpha=0.7,
- label='Predicted Power', marker='o', data=sample_data)
-sns.lineplot(x='wind speed (m/s)', y='theoretical_power_curve (kwh)',
- label='Theoritical Power', color='purple', data=sample_data)
-plt.title('Wind Turbine Power Production Prediction')
-plt.ylabel('Power Production (kw)')
-plt.legend()
-
diff --git a/tutorials/feast/README.md b/tutorials/feast/README.md
new file mode 100644
index 00000000..d4f2032c
--- /dev/null
+++ b/tutorials/feast/README.md
@@ -0,0 +1,40 @@
+# Bike Sharing (Feast)
+
+This tutorial explores how to leverage Feast for generating training data and enhancing online model inference. In this
+use case, your goal is to train a ride-sharing driver-satisfaction prediction model on a training dataset built with
+Feast.
+
+1. [What You'll Need](#what-youll-need)
+1. [Procedure](#procedure)
+1. [References](#references)
+
+## What You'll Need
+
+For this tutorial, ensure you have:
+
+- Access to an HPE Ezmeral Unified Analytics cluster.
+
+## Procedure
+
+To complete the tutorial, follow the steps below:
+
+1. Log in to your Ezmeral Unified Analytics (EzUA) cluster using your credentials.
+1. Create a new Notebook server using the `jupyter-data-science` image. Request at least `4Gi` of memory for the
+ Notebook server.
+1. Connect to the Notebook server and clone the repository locally.
+1. Navigate to the tutorial's directory (`ezua-tutorials/tutorials/feast`).
+1. Launch a new terminal window and create a new conda environment using the specified `environment.yaml` file:
+
+   ```bash
+   conda env create -f environment.yaml
+   ```
+
+1. Add the new conda environment as an ipykernel:
+
+   ```bash
+   python -m ipykernel install --user --name=ride-sharing
+   ```
+
+1. Refresh your browser tab to access the updated environment.
+1. Launch the `ride-sharing.ipynb` Notebook and execute the code cells. Make sure to select the `ride-sharing`
+ environment kernel.
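+
+As a taste of what the `ride-sharing.ipynb` notebook does, building a training dataframe with Feast follows the general
+pattern below. This is a sketch that assumes a local feature repository with quickstart-style `driver_hourly_stats`
+features; the notebook's actual entity dataframe and feature names may differ:
+
+```python
+from datetime import datetime
+
+import pandas as pd
+from feast import FeatureStore
+
+# Entity dataframe: which drivers, and at which event timestamps, we want features for
+entity_df = pd.DataFrame(
+    {
+        "driver_id": [1001, 1002, 1003],
+        "event_timestamp": [datetime(2021, 4, 12, 10, 59, 42)] * 3,
+    }
+)
+
+store = FeatureStore(repo_path=".")  # path to the Feast feature repository
+
+# Point-in-time correct join of historical feature values onto the entity dataframe
+training_df = store.get_historical_features(
+    entity_df=entity_df,
+    features=[
+        "driver_hourly_stats:conv_rate",
+        "driver_hourly_stats:acc_rate",
+        "driver_hourly_stats:avg_daily_trips",
+    ],
+).to_df()
+```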
+
+## References
+
+1. [Feast: Open Source Feature Store for Production ML](https://feast.dev/)
\ No newline at end of file
diff --git a/tutorials/mlflow/environment.yaml b/tutorials/mlflow/environment.yaml
index 90b1defe..6ba99172 100644
--- a/tutorials/mlflow/environment.yaml
+++ b/tutorials/mlflow/environment.yaml
@@ -1,7 +1,6 @@
name: bike-sharing
channels:
- conda-forge
- - defaults
dependencies:
- python=3.8
- graphviz [conda-forge]