diff --git a/demos/wind-turbine/README.md b/demos/wind-turbine/README.md
index 95349b6b..d5c1f025 100644
--- a/demos/wind-turbine/README.md
+++ b/demos/wind-turbine/README.md
@@ -1,39 +1,68 @@
-# Wind Turbine (Spark Demo)
+# Wind Turbine (Spark - Livy - Sparkmagic)
+
+In this demonstration, you use Spark to explore a dataset and train a Gradient-Boosted Tree (GBT) regressor that
+leverages various features, such as wind speed and direction, to estimate the power output of a wind turbine.

 ![wind-farm](images/wind-farm.jpg)

-Welcome! In this experiment we delve into the world of wind turbines
-and harness the power of machine learning to predict their energy production!
-In this demonstration, we will be using Spark to explore the training dataset
-and train a Gradient-Boosted Tree (GBT) regressor that will utilize various
-features, such as wind speed and direction, to estimate the power output of a
-wind turbine.
+Wind turbines hold tremendous potential as a sustainable source of energy, capable of supplying a substantial portion
+of the world's power needs. However, the inherent unpredictability of power generation poses a challenge when it comes
+to optimizing this process.

-Wind turbines hold tremendous potential as a sustainable source of energy,
-capable of supplying a substantial portion of the world's power needs. However,
-the inherent unpredictability of power generation poses a challenge when it
-comes to optimizing this process.
+Fortunately, you have a powerful tool at your disposal: Machine Learning (ML). By leveraging advanced algorithms and data
+analysis, you can develop models that accurately predict the power production of wind turbines. This enables you to
+optimize the power generation process and overcome the challenges associated with its ingrained variability.

-Fortunately, we have a powerful tool at our disposal: machine learning. By
-leveraging advanced algorithms and data analysis, we can develop models that
-accurately predict the power production of wind turbines. This enables us to
-optimize the power generation process and overcome the challenges associated
-with its ingrained variability.
+1. [What You'll Need](#what-youll-need)
+1. [Procedure](#procedure)
+1. [How it Works](#how-it-works)
+1. [References](#references)

 ## What You'll Need

-To complete the tutorial follow the steps below:
+For this tutorial, ensure you have:
+
+- Access to an HPE Ezmeral Unified Analytics cluster.
+
+## Procedure
+
+To complete this tutorial, follow the steps below:
+
+1. Log in to your Ezmeral Unified Analytics (EzUA) cluster using your credentials.
+1. Create a new Notebook server using the `jupyter-data-science` image. Request at least `4Gi` of memory for the
+   Notebook server.
+1. Connect to the Notebook server and clone the repository locally.
+1. Navigate to the tutorial's directory (`ezua-tutorials/demos/wind-turbine`).
+1. Launch a new terminal window and create a new conda environment using the specified `environment.yaml` file:
+
+   ```bash
+   conda env create -f environment.yaml
+   ```
+
+1. Add the new conda environment as an ipykernel:
+
+   ```bash
+   python -m ipykernel install --user --name=wind-turbine
+   ```
+
+1. Refresh your browser tab to access the updated environment.
+1. Launch the `wind-turbine.ipynb` notebook file and follow the instructions. Make sure to select the `wind-turbine`
+   environment kernel.

-1. Login to you EzAF cluster.
-1. Create a new notebook server using the `jupyter-data-science` image.
-1. Clone the repository locally.
-1. 
Launch the `wind-turbine.ipynb` notebook file and follow the instructions. +## How it Works -> It is recommended to create a Notebook server with more than 3Gi of memory +In this tutorial, you use Livy and Sparkmagic to remotely execute Python code in a Spark cluster. Livy is an open-source +REST service that enables remote and interactive analytics on Apache Spark clusters. It provides a way to interact with +Spark clusters programmatically using a REST API, allowing you to submit Spark jobs, run interactive queries, and manage +Spark sessions. -> If you created your notebook server using the `jupyter-data-science` image -> you should be good to go. However, you can always create a separate conda -> environment for this tutorial using the provided `environment.yaml` file. -> To create a separate enviroment run `conda env create -f environment.yaml`. +To communicate with Livy and manage your sessions you use Sparkmagic, an open-source tool that provides a Jupyter kernel +extension. Sparkmagic integrates with Livy, to provide the underlying communication layer between the Jupyter kernel and +the Spark cluster. +## References +1. [Spark: Unified engine for large-scale data analytics](https://spark.apache.org/) +1. [Livy: A REST Service for Apache Spark](https://livy.apache.org/) +1. [Sparkmagic: Jupyter magics and kernels for working with remote Spark clusters](https://github.com/jupyter-incubator/sparkmagic) +1. [Wind Turbine Scada Dataset](https://www.kaggle.com/datasets/berkerisen/wind-turbine-scada-dataset/data) diff --git a/demos/wind-turbine/images/spark-interactive-ui.png b/demos/wind-turbine/images/spark-interactive-ui.png new file mode 100644 index 00000000..39512b38 Binary files /dev/null and b/demos/wind-turbine/images/spark-interactive-ui.png differ diff --git a/demos/wind-turbine/wind-turbine.ipynb b/demos/wind-turbine/wind-turbine.ipynb index 4c231058..e09e0297 100644 --- a/demos/wind-turbine/wind-turbine.ipynb +++ b/demos/wind-turbine/wind-turbine.ipynb @@ -7,30 +7,30 @@ "tags": [] }, "source": [ - "# Wind Turbine (Spark Demo)\n", + "# Wind Turbines\n", "\n", - "Welcome! In this experiment we delve into the world of wind turbines\n", - "and harness the power of machine learning to predict their energy production!\n", - "In this demonstration, we will be using Spark to explore and augment the training\n", - "dataset and train a Gradient-Boosted Tree (GBT) regressor that will utilize various\n", - "features, such as wind speed and direction, to estimate the power output of a\n", - "wind turbine.\n", + "In this tutorial you delve into the world of wind turbines and harness the power of Machine Learning (ML) to predict\n", + "their energy production. To this end, you use Spark to explore and augment a training dataset, train a Gradient-Boosted\n", + "Tree (GBT) regressor, and predict the power output of a wind turbine.\n", "\n", "
(Image generated by Stable Diffusion)
\n", "\n", + "Wind turbines hold tremendous potential as a sustainable source of energy, capable of supplying a substantial portion of\n", + "the world's power needs. However, the inherent unpredictability of power generation poses a challenge when it comes to\n", + "optimizing this process.\n", "\n", - "Wind turbines hold tremendous potential as a sustainable source of energy,\n", - "capable of supplying a substantial portion of the world's power needs. However,\n", - "the inherent unpredictability of power generation poses a challenge when it\n", - "comes to optimizing this process.\n", + "Fortunately, you have a powerful tool at your disposal: Machine Learning. By leveraging advanced algorithms and data\n", + "analysis, you can develop models that accurately predict the power production of wind turbines. This enables you to\n", + "optimize the power generation process and overcome the challenges associated with its ingrained variability.\n", "\n", - "Fortunately, we have a powerful tool at our disposal: machine learning. By\n", - "leveraging advanced algorithms and data analysis, we can develop models that\n", - "accurately predict the power production of wind turbines. This enables us to\n", - "optimize the power generation process and overcome the challenges associated\n", - "with its ingrained variability." + "## Table of Contents:\n", + "\n", + "* [Create a Spark Interactive Session](#create-a-spark-interactive-session)\n", + "* [Load the Dataset](#load-the-dataset)\n", + "* [Data Exploration](#data-exploration)\n", + "* [Training a GBT Regression](#training-a-gbt-regressor)" ] }, { @@ -42,38 +42,31 @@ "source": [ "## Create a Spark Interactive Session\n", "\n", - "Let's begin! In this demo you'll be using Livy to create and manage an interactive\n", - "Spark session. Livy is an open-source REST service that enables remote and interactive\n", - "analytics on Apache Spark clusters. It provides a way to interact with Spark clusters\n", - "programmatically using a REST API, allowing you to submit Spark jobs, run interactive\n", - "queries, and manage Spark sessions.\n", - "\n", - "First you need to connect to the Livy endpoint and create a new Spark interactive session.\n", - "This session will allow you to interact with Spark using your familiar notebook\n", - "environment, and execute Spark code to perform data processing tasks in an interactive manner.\n", + "Let's begin! In this demo you use Livy to create and manage an interactive Spark session. Livy is an open-source REST\n", + "service that enables remote and interactive analytics on Apache Spark clusters. It provides a way to interact with Spark\n", + "clusters programmatically using a REST API, allowing you to submit Spark jobs, run interactive queries, and manage Spark\n", + "sessions.\n", "\n", - "The Spark interactive session is particularly useful for exploratory data analysis,\n", - "prototyping, and iterative development. It allows you to interactively work with large datasets,\n", - "perform transformations, apply analytical operations, and build machine learning models using\n", - "Spark's distributed computing capabilities. This is exactly what you'll do in this Notebook!\n", + "First, you need to connect to the Livy endpoint and create a new Spark interactive session. The Spark interactive\n", + "session is particularly useful for exploratory data analysis, prototyping, and iterative development. 
It allows you to\n", + "interactively work with large datasets, perform transformations, apply analytical operations, and build machine learning\n", + "models using Spark's distributed computing capabilities. This is exactly what you do in this Notebook!\n", "\n", - "To communicate with Livy and manage your sessions you'll be using sparkmagic, an open-source\n", - "tool that provides a Jupyter kernel extension. Sparkmagic integrates with Livy, to provide\n", - "the underlying communication layer between the Jupyter kernel and the Spark cluster.\n", + "To communicate with Livy and manage your sessions you use Sparkmagic, an open-source tool that provides a Jupyter kernel\n", + "extension. Sparkmagic integrates with Livy, to provide the underlying communication layer between the Jupyter kernel and\n", + "the Spark cluster.\n", "\n", "Execute the cell below and:\n", "\n", - "1. Select `Add Endpoint`\n", - "1. Select `Basic_Access`, paste your Livy endpoint and authenticate with your credentials\n", - "1. Select `Create Session`\n", - "1. Provide a name select `python` and click `Create Session`\n", + "1. Select `Add Endpoint`.\n", + "1. Select `Single Sign-On`, keep existing Livy endpoint (should look like http://livy-0.livy-svc.spark.svc.cluster.local:8998).\n", + "1. Select `Create Session`.\n", + "1. Provide a name, select the `python` language, and click `Create Session`.\n", "\n", - "When your session is ready the `Manage Sessions` pane will become active, providing you the session ID.\n", - "The session state will become `idle` which means that you are good to go!\n", + "Give it a few minutes for your session to initialize. Once ready, the Manage Sessions pane will activate, displaying\n", + "your session ID. When the session state turns to idle, you're all set. This process can take up to five minutes.\n", "\n", - "> To configure Sparkmagic, you can make use of a config.json file located at ~/.sparkmagic/config.json.\n", - "> If you're running this notebook in a server that was created using the jupyter-data-science image,\n", - "> the default settings should suffice, and no additional configuration is required." + "> To further configure Sparkmagic, you can make use of a config.json file located at ~/.sparkmagic/config.json." ] }, { @@ -96,16 +89,48 @@ "tags": [] }, "source": [ - "You are now prepared to embark on your first interaction with Livy.\n", - "Whenever you initiate a cell with the `%%spark` magic command, the code within\n", - "that cell will not be executed by your IPython kernel. Instead, it will be executed\n", - "within a Spark context managed by Livy. Rest assured, Livy will handle everything\n", - "seamlessly and provide you with a response containing the desired results.\n", + "You can also use the EzUA UI to view the logs and code executions:\n", + "\n", + "![spark-interactive-ui](images/spark-interactive-ui.png)\n", + "\n", + "You are now prepared to embark on your first interaction with Livy. Whenever you initiate a cell with the `%%spark`\n", + "magic command, the code within that cell is not executed locally. Instead, it is executed within a Spark context managed\n", + "by Livy. Livy handles everything seamlessly and provide you with a response containing the results.\n", "\n", - "Moreover, with Sparkmagic at your side, you can leave the networking part to it. Sparkmagic\n", - "takes care of all the intricate details, effortlessly creating and sending requests to\n", - "the Livy server. 
Finally, it seamlessly renders the response within the Jupyter user interface,\n", - "ensuring a smooth and hassle-free experience." + "With Sparkmagic at your side, you can leave the networking part to it. Sparkmagic takes care of all the intricate\n", + "details, effortlessly creating and sending requests to the Livy server. Finally, it seamlessly renders the response\n", + "within the Jupyter user interface, ensuring a smooth and hassle-free experience.\n", + "\n", + "Before you begin, let's make sure that the dataset is copied in the right location:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e78f2f94-4ad9-41c9-a7f6-cef590163e62", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import shutil\n", + "\n", + "# create the `auto-spark` folder if it does not exist\n", + "os.makedirs(f\"{os.getenv('HOME')}/shared/auto-spark/\", exist_ok=True)\n", + "\n", + "# copy the file to the right location\n", + "dataset_file = \"dataset/T1.csv\"\n", + "destination = f\"{os.getenv('HOME')}/shared/auto-spark/T1.csv\"\n", + "shutil.copyfile(dataset_file, destination)" + ] + }, + { + "cell_type": "markdown", + "id": "d478d922-8640-43a6-8802-c6bc577f508d", + "metadata": { + "tags": [] + }, + "source": [ + "Next, let's import the libraries you use inside the Spark session and get a handle on it." ] }, { @@ -177,11 +202,9 @@ "source": [ "## Load the Dataset\n", "\n", - "It's time to load the dataset, which comes in a convenient CSV format.\n", - "You'll leverage the spark session you created to load it as a DataFrame. Once loaded,\n", - "you can delve into its contents by examining the first five rows and its schema.\n", - "Additionally, you'll get an idea of the dataset's size by printing the number of\n", - "examples available to us." + "It's time to load the dataset, which comes in a convenient CSV format. You leverage the spark session you created to\n", + "load it as a DataFrame. Once loaded, you can delve into its contents by examining the first five rows and its schema. \n", + "Additionally, you get an idea of the dataset's size by printing the number of examples available to us." ] }, { @@ -194,7 +217,9 @@ "outputs": [], "source": [ "%%spark\n", - "spark_df = spark.read.csv('file:///mounts/shared-volume/shared/spark/T1.csv', header=True, inferSchema=True)" + "spark_df = spark.read.csv(\n", + " 'file:///mounts/shared-volume/shared/auto-spark/T1.csv',\n", + " header=True, inferSchema=True)" ] }, { @@ -204,10 +229,9 @@ "tags": [] }, "source": [ - "Next, you should cache the dataset in memory. The `cache()` method is used to persist\n", - "or cache the contents of a DataFrame, Dataset, or RDD (Resilient Distributed Dataset)\n", - "in memory. Caching data in memory can significantly improve the performance of iterative\n", - "algorithms or repeated computations by avoiding the need to recompute or fetch the data\n", + "Next, you should cache the dataset in memory. The `cache()` method is used to persist or cache the contents of a\n", + "DataFrame, Dataset, or RDD (Resilient Distributed Dataset) in memory. Caching data in memory can significantly improve\n", + "the performance of iterative algorithms or repeated computations by avoiding the need to recompute or fetch the data\n", "from disk." ] }, @@ -244,11 +268,10 @@ "tags": [] }, "source": [ - "In this experiment, the objective is to predict the power production of a wind turbine\n", - "(`lv activepower (kw)`) based on the dataset's other features. 
These features include\n", - "the current date and hour, the wind speed, and the wind direction. By analyzing the\n", - "relationships between these variables, you aim to uncover valuable insights and create\n", - "a predictive model that can estimate the power output of the turbine." + "In this experiment, the objective is to predict the power production of a wind turbine (`lv activepower (kw)`) based on\n", + "the other features. These features include the current date and hour, the wind speed, and the wind direction. By\n", + "analyzing the relationships between these variables, you aim to uncover valuable insights and create a predictive model\n", + "that can estimate the power output of the turbine." ] }, { @@ -260,8 +283,8 @@ "source": [ "## Data exploration\n", "\n", - "Let's start exploring and transforming the dataset. First step, separate the `date/time`\n", - "column into two separate columns, one for the month and one for the hour of the day." + "Let's start exploring and transforming the dataset. First step, separate the `date/time` column into two distinct\n", + "columns, one for the month and one for the hour of the day." ] }, { @@ -292,9 +315,9 @@ "tags": [] }, "source": [ - "Moving forward, let's examine some essential statistical characteristics of our features,\n", - "specifically the mean and standard deviation. However, it's important to note that these statistics\n", - "are relevant only for the wind speed, theoretical power curve, and active power variables." + "Moving forward, let's examine some essential statistical characteristics of the features, specifically the mean and\n", + "standard deviation. However, it's important to note that these statistics are relevant only for the wind speed,\n", + "theoretical power curve, and active power variables." ] }, { @@ -318,10 +341,9 @@ "tags": [] }, "source": [ - "Let's extract a random sample from the dataset and start creating a visual model.\n", - "By visualizing the data, we can gain a deeper understanding of its patterns, trends, and relationships.\n", - "Through this visual exploration, we'll uncover valuable insights that can guide us in our analysis\n", - "and decision-making processes." + "Let's extract a random sample from the dataset and start creating a visual model. By visualizing the data, you can gain\n", + "a deeper understanding of its patterns, trends, and relationships. Through this visual exploration, you uncover valuable\n", + "insights that can guide you through your analysis and decision-making processes." ] }, { @@ -405,6 +427,7 @@ "sample_df[columns].corr()\n", "plt.clf()\n", "sns.pairplot(sample_df[columns], markers='*');\n", + "\n", "%matplot plt" ] }, @@ -627,7 +650,7 @@ " sns.boxplot(x=df[each])\n", " # plt.title(each)\n", " i += 1\n", - " \n", + "\n", "%matplot plt" ] }, @@ -753,11 +776,10 @@ "source": [ "# Training a GBT Regressor\n", "\n", - "You are now ready to train our GBT regresson. To begin, you'll carefully specify\n", - "the features and the label for our model. Then, we'll split the dataset into training\n", - "and test subsets to ensure robust evaluation of our model's performance.\n", - "Finally, we'll initiate the training process, allowing our GBT regressor to learn\n", - "from the training data and make accurate predictions." + "You are now ready to train the GBT regresson. 
To begin, you carefully specify the features and the label for the model.\n", + "Then, you split the dataset into training and test subsets to ensure robust evaluation of the model's performance.\n", + "Finally, you initiate the training process, allowing the GBT regressor to learn from the training data, and generate\n", + "accurate predictions." ] }, { @@ -837,7 +859,7 @@ "outputs": [], "source": [ "%%spark\n", - "gbm_model.write().overwrite().save(\"file:///mounts/shared-volume/user/spark/GBM.model\")" + "gbm_model.write().overwrite().save('file:////mounts/shared-volume/user/auto-spark/GBM.model')" ] }, { @@ -850,7 +872,7 @@ "outputs": [], "source": [ "%%spark\n", - "gbm_model_dtap = GBTRegressionModel.load(\"file:///mounts/shared-volume/user/spark/GBM.model\")" + "gbm_model_dtap = GBTRegressionModel.load(\"file:///mounts/shared-volume/user/auto-spark/GBM.model\")" ] }, { @@ -919,9 +941,9 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "wind-turbine", "language": "python", - "name": "python3" + "name": "wind-turbine" }, "language_info": { "codemirror_mode": { diff --git a/demos/wind-turbine/wind-turbine.py b/demos/wind-turbine/wind-turbine.py deleted file mode 100644 index 337999ec..00000000 --- a/demos/wind-turbine/wind-turbine.py +++ /dev/null @@ -1,377 +0,0 @@ -import pyspark -import numpy as np -import pandas as pd -import matplotlib.pyplot as plt -import seaborn as sns - -from math import radians -from warnings import filterwarnings - -from pyspark.sql import SparkSession -from pyspark.conf import SparkConf -from pyspark import SparkContext -from pyspark.sql.functions import substring -from pyspark.sql.types import IntegerType -from pyspark.sql import functions as F -from pyspark.ml.feature import VectorAssembler -from pyspark.ml.regression import GBTRegressor -from pyspark.ml.regression import GBTRegressionModel -from pyspark.ml.evaluation import RegressionEvaluator -from py4j.java_gateway import java_import - - -sns.set_style('white') -filterwarnings('ignore') - -java_import(spark._sc._jvm, "org.apache.spark.sql.api.python.*") - -# Reading Dataset -spark_df = spark.read.csv( - 'file:///mounts/shared-volume/spark/T1.csv', - header=True, inferSchema=True) - -# Caching the dataset -spark_df.cache() - -# Converting all the column names to lower case -spark_df = spark_df.toDF(*[c.lower() for c in spark_df.columns]) - -print('Show the first 5 rows') -print(spark_df.show(5)) - -print('What are the variable data types?') -print(spark_df.printSchema()) - -print('How many observations do we have?') -print(spark_df.count()) - -# Extracting a substring from columns to create month and hour variables -spark_df = spark_df.withColumn("month", substring("date/time", 4, 2)) -spark_df = spark_df.withColumn("hour", substring("date/time", 12, 2)) - -# Converting string month and hour variables to integer -spark_df = spark_df.withColumn('month', spark_df.month.cast(IntegerType())) -spark_df = spark_df.withColumn('hour', spark_df.hour.cast(IntegerType())) - -print(spark_df.show(5)) - -# Exploratory Data Analysis -pd.options.display.float_format = '{:.2f}'.format -spark_df.select('wind speed (m/s)', 'theoretical_power_curve (kwh)', - 'lv activepower (kw)').toPandas().describe() - -# Taking a random sample from the big data -sample_df = spark_df.sample(withReplacement=False, - fraction=0.1, seed=42).toPandas() - -# Visualizing the distributions with the sample data -columns = ['wind speed (m/s)', 'wind direction (deg)', 'month', 'hour', - 
'theoretical_power_curve (kwh)', 'lv activepower (kw)'] -i = 1 -plt.figure(figsize=(10, 12)) -for each in columns: - plt.subplot(3, 2, i) - sample_df[each].plot.hist(bins=12) - plt.title(each) - i += 1 - -# Question: Is there any difference between the months -# for average power production? - -plt.clf() - -# Average power production by month -monthly = (spark_df.groupby('month') - .mean('lv activepower (kw)') - .sort('avg(lv activepower (kw))') - .toPandas()) -sns.barplot(x='month', y='avg(lv activepower (kw))', data=monthly) -plt.title('Months and Average Power Production') -print(monthly) - -# Question: Is there any difference between the hours for average power -# production? - -plt.clf() - -# Average power production by hour -hourly = (spark_df.groupby('hour') - .mean('lv activepower (kw)') - .sort('avg(lv activepower (kw))') - .toPandas()) -sns.barplot(x='hour', y='avg(lv activepower (kw))', data=hourly) -plt.title('Hours and Average Power Production') -print(hourly) - -# Question: Is there any correlation between the wind speed, -# wind direction and power production? -pd.set_option('display.max_columns', None) -sample_df[columns].corr() -plt.clf() -sns.pairplot(sample_df[columns], markers='*') - -# Question: What is the average power production level for -# different wind speeds? -# Finding average power production for 5 m/s wind speed increments -wind_speed = [] -avg_power = [] -for i in [0, 5, 10, 15, 20]: - avg_value = (spark_df.filter((spark_df['wind speed (m/s)'] > i) - & (spark_df['wind speed (m/s)'] <= i+5)) - .agg({'lv activepower (kw)': 'mean'}) - .collect()[0][0]) - avg_power.append(avg_value) - wind_speed.append(str(i) + '-' + str(i+5)) - -plt.clf() -sns.barplot(x=wind_speed, y=avg_power, color='orange') -plt.title('Avg Power Production for 5 m/s Wind Speed Increments') -plt.xlabel('Wind Speed') -plt.ylabel('Average Power Production') - -# Question: What is the power production for different wind directions and -# speeds? - -# Creating the polar diagram -plt.clf() -plt.figure(figsize=(8, 8)) -ax = plt.subplot(111, polar=True) -# Inside circles are the wind speed and marker color and size represents the -# amount of power production -sns.scatterplot(x=[radians(x) for x in sample_df['wind direction (deg)']], - y=sample_df['wind speed (m/s)'], - size=sample_df['lv activepower (kw)'], - hue=sample_df['lv activepower (kw)'], - alpha=0.7, legend=None) -# Setting the polar diagram's top represents the North -ax.set_theta_zero_location('N') -# Setting -1 to start the wind direction clockwise -ax.set_theta_direction(-1) -# Setting wind speed labels in a better position to see -ax.set_rlabel_position(110) -plt.title('Wind Speed - Wind Direction - Power Production Diagram') -plt.ylabel(None) - -# Question: Does the manufacturer's theoritical power production curve fit -# well with the real production? - -plt.clf() -plt.figure(figsize=(10, 6)) -sns.scatterplot(x='wind speed (m/s)', y='lv activepower (kw)', color='orange', - label='Real Production', alpha=0.5, data=sample_df) -sns.lineplot(x='wind speed (m/s)', y='theoretical_power_curve (kwh)', - color='blue', label='Theoritical Production', data=sample_df) -plt.title('Wind Speed and Power Production Chart') -plt.ylabel('Power Production (kw)') - -# Question: What is the wind speed threshold value for zero theorical power? -# Filter the big data where the real and theoritical power productions are -# equal to 0. 
-zero_theo_power = (spark_df.filter( - (spark_df['lv activepower (kw)'] == 0) - & (spark_df['theoretical_power_curve (kwh)'] == 0)) - .toPandas()) - -print(zero_theo_power[['wind speed (m/s)', 'theoretical_power_curve (kwh)', - 'lv activepower (kw)']].sample(5)) -plt.clf() -# Let's see the wind speed distribution for 0 power production -zero_theo_power['wind speed (m/s)'].hist() -plt.title('Wind Speed Distribution for 0 Power Production') -plt.xlabel('Wind speed (m/s)') -plt.ylabel('Counts for 0 Power Production') - -# Question: Why there aren't any power production in some observations while -# the wind speed is higher than 3 m/s? - -# Observations for the wind speed > 3m/s and power production = 0, -# While theoritically there should be power production -zero_power = spark_df.filter((spark_df['lv activepower (kw)'] == 0) - & (spark_df['theoretical_power_curve (kwh)'] != 0) - & (spark_df['wind speed (m/s)'] > 3)).toPandas() -print(zero_power.head()) -print('No of Observations (while Wind Speed > 3 m/s' - ' and Power Production = 0): ', len(zero_power)) - -plt.clf() -zero_power['wind speed (m/s)'].plot.hist(bins=8) -plt.xlabel('Wind Speed (m/s)') -plt.ylabel('Counts for Zero Production') -plt.title('Wind Speed Counts for Zero Power Production') -plt.xticks(ticks=np.arange(4, 18, 2)) - -# Let's see the monthly distribution for zero power production. - -plt.clf() -sns.countplot(zero_power['month']) - -# Excluding the observations meeting the filter criterias -spark_df = spark_df.filter(~((spark_df['lv activepower (kw)'] == 0) - & (spark_df['theoretical_power_curve (kwh)'] != 0) - & (spark_df['wind speed (m/s)'] > 3))) -spark_df.show(20) - -# Question: Is there any other outliers? - -columns = ['wind speed (m/s)', 'wind direction (deg)', - 'theoretical_power_curve (kwh)', 'lv activepower (kw)'] -i = 1 -plt.clf() -plt.figure(figsize=(20, 3)) -for each in columns: - df = spark_df.select(each).toPandas() - plt.subplot(1, 4, i) - # plt.boxplot(df) - sns.boxplot(x=df[each]) - plt.title(each) - i += 1 - -# We will find the upper and lower threshold values for the wind speed data, -# and analyze the outliers. - -# Create a pandas df for visualization -wind_speed = spark_df.select('wind speed (m/s)').toPandas() - -# Defining the quantiles and interquantile range -Q1 = wind_speed['wind speed (m/s)'].quantile(0.25) -Q3 = wind_speed['wind speed (m/s)'].quantile(0.75) -IQR = Q3-Q1 -# Defining the lower and upper threshold values -lower = Q1 - 1.5*IQR -upper = Q3 + 1.5*IQR - -print('Quantile (0.25): ', Q1, ' Quantile (0.75): ', Q3) -print('Lower threshold: ', lower, ' Upper threshold: ', upper) - -# Fancy indexing for outliers -outlier_tf = ((wind_speed['wind speed (m/s)'] < lower) - | (wind_speed['wind speed (m/s)'] > upper)) - -print('Total Number of Outliers: ', - len(wind_speed['wind speed (m/s)'][outlier_tf])) -print('--'*15) -print('Some Examples of Outliers:') -print(wind_speed['wind speed (m/s)'][outlier_tf].sample(10)) - -# Out of 47033, there is only 407 observations while the wind speed is over -# 19 m/s. -# Now Lets see average power production for these high wind speed -(spark_df.select('wind speed (m/s)', 'lv activepower (kw)') - .filter(spark_df['wind speed (m/s)'] >= 19) - .agg({'lv activepower (kw)': 'mean'}) - .show()) - -spark_df = spark_df.withColumn('wind speed (m/s)', - F.when(F.col('wind speed (m/s)') > 19.447, 19) - .otherwise(F.col('wind speed (m/s)'))) -spark_df.count() - -# Question: What are the general criterias for power production? 
- -# High level power production -(spark_df.filter( - ((spark_df['month'] == 3) | (spark_df['month'] == 8) | (spark_df['month'] == 11)) - & ((spark_df['hour'] >= 16) | (spark_df['hour'] <= 24)) - & ((spark_df['wind direction (deg)'] > 0) | (spark_df['wind direction (deg)'] < 90)) - & ((spark_df['wind direction (deg)'] > 180) | (spark_df['wind direction (deg)'] < 225))) - .agg({'lv activepower (kw)': 'mean'}) - .show()) - -# Low level power production -(spark_df.filter( - (spark_df['month'] == 7) - & ((spark_df['hour'] >= 9) | (spark_df['hour'] <= 11)) - & ((spark_df['wind direction (deg)'] > 90) | (spark_df['wind direction (deg)'] < 160))) - .agg({'lv activepower (kw)': 'mean'}) - .show()) - - -# Data Preparation for ML Algorithms -# Preparing the independent variables (Features) -# Converting lv activepower (kw) variable as label -spark_df = spark_df.withColumn('label', spark_df['lv activepower (kw)']) - -# Defining the variables to be used -variables = ['month', 'hour', 'wind speed (m/s)', 'wind direction (deg)'] -vectorAssembler = VectorAssembler(inputCols=variables, - outputCol='features') -va_df = vectorAssembler.transform(spark_df) - -# Combining features and label column -final_df = va_df.select('features', 'label') -final_df.show(10) - -# Train Test Split -splits = final_df.randomSplit([0.8, 0.2]) -train_df = splits[0] -test_df = splits[1] - -print('Train dataset: ', train_df.count()) -print('Test dataset : ', test_df.count()) - -# Creating the Initial Model -# Creating the gbm regressor object -gbm = GBTRegressor(featuresCol='features', labelCol='label') - -# Training the model with train data -gbm_model = gbm.fit(train_df) - -# Predicting using the test data -y_pred = gbm_model.transform(test_df) - -# Initial look at the target and predicted values -y_pred.select('label', 'prediction').show(20) - -# Store our model in user shared volume -gbm_model.write().overwrite().save( - 'file:///mounts/shared-volume/spark/GBM.model') - -gbm_model_dtap = GBTRegressionModel.load( - "file:///mounts/shared-volume/spark/GBM.model") - -# Let's evaluate our model's success -# Initial model success -evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='label') - -print('R2:\t', evaluator.evaluate(y_pred, {evaluator.metricName: 'r2'})) -print('MAE:\t', evaluator.evaluate(y_pred, {evaluator.metricName: 'mae'})) -print('RMSE:\t', evaluator.evaluate(y_pred, {evaluator.metricName: 'rmse'})) - -# Comparing Real, Theoritical and Predicted Power Productions - -# Converting sample_df back to Spark dataframe -eva_df = spark.createDataFrame(sample_df) - -# Converting lv activepower (kw) variable as label -eva_df = eva_df.withColumn('label', eva_df['lv activepower (kw)']) - -# Defining the variables to be used -variables = ['month', 'hour', 'wind speed (m/s)', 'wind direction (deg)'] -vectorAssembler = VectorAssembler(inputCols=variables, outputCol='features') -vec_df = vectorAssembler.transform(eva_df) - -# Combining features and label column -vec_df = vec_df.select('features', 'label') - -# Using ML model to predict -preds = gbm_model.transform(vec_df) -preds_df = preds.select('label', 'prediction').toPandas() - -# Compining dataframes to compare -frames = [sample_df[['wind speed (m/s)', 'theoretical_power_curve (kwh)']], - preds_df] -sample_data = pd.concat(frames, axis=1) - -plt.clf() -# Visualizing real, theoritical and predicted power production -plt.figure(figsize=(10, 7)) -sns.scatterplot(x='wind speed (m/s)', y='label', alpha=0.5, - label='Real Power', data=sample_data) 
-sns.scatterplot(x='wind speed (m/s)', y='prediction', alpha=0.7,
-                label='Predicted Power', marker='o', data=sample_data)
-sns.lineplot(x='wind speed (m/s)', y='theoretical_power_curve (kwh)',
-             label='Theoritical Power', color='purple', data=sample_data)
-plt.title('Wind Turbine Power Production Prediction')
-plt.ylabel('Power Production (kw)')
-plt.legend()
-
diff --git a/tutorials/feast/README.md b/tutorials/feast/README.md
new file mode 100644
index 00000000..d4f2032c
--- /dev/null
+++ b/tutorials/feast/README.md
@@ -0,0 +1,40 @@
+# Bike Sharing (Feast)
+
+This tutorial explores how to leverage Feast for generating training data and enhancing online model inference. In this
+use case, your goal is to train a ride-sharing driver satisfaction prediction model using a training dataset built
+using Feast.
+
+1. [What You'll Need](#what-youll-need)
+1. [Procedure](#procedure)
+1. [References](#references)
+
+## What You'll Need
+
+For this tutorial, ensure you have:
+
+- Access to an HPE Ezmeral Unified Analytics cluster.
+
+## Procedure
+
+To complete the tutorial, follow the steps below:
+
+1. Log in to your Ezmeral Unified Analytics (EzUA) cluster using your credentials.
+1. Create a new Notebook server using the `jupyter-data-science` image. Request at least `4Gi` of memory for the
+   Notebook server.
+1. Connect to the Notebook server and clone the repository locally.
+1. Navigate to the tutorial's directory (`ezua-tutorials/tutorials/feast`).
+1. Launch a new terminal window and create a new conda environment using the specified `environment.yaml` file:
+   ```
+   conda env create -f environment.yaml
+   ```
+1. Add the new conda environment as an ipykernel:
+   ```
+   python -m ipykernel install --user --name=ride-sharing
+   ```
+1. Refresh your browser tab to access the updated environment.
+1. Launch the `ride-sharing.ipynb` Notebook and execute the code cells. Make sure to select the `ride-sharing`
+   environment kernel.
+
+## References
+
+1. [Feast: Open Source Feature Store for Production ML](https://feast.dev/)
\ No newline at end of file
diff --git a/tutorials/mlflow/environment.yaml b/tutorials/mlflow/environment.yaml
index 90b1defe..6ba99172 100644
--- a/tutorials/mlflow/environment.yaml
+++ b/tutorials/mlflow/environment.yaml
@@ -1,7 +1,6 @@
 name: bike-sharing
 channels:
   - conda-forge
-  - defaults
 dependencies:
   - python=3.8
   - graphviz [conda-forge]