add time series tutorial #1738

Open · wants to merge 2 commits into base: master
3 changes: 3 additions & 0 deletions docs/ml_training.md
@@ -16,6 +16,8 @@ Understand how machine learning models can be trained from within Flyte, with an
- Word embedding and topic modelling on lee background corpus with Gensim
* - {doc}`Forecast Sales Using Rossmann Store Sales <auto_examples/forecasting_sales/index>`
- Forecast sales data with data-parallel distributed training using Horovod on Spark.
* - {doc}`Time Series Modeling <auto_examples/time_series_modeling/index>`
- Train models for making forecasts on time series data.
```

```{toctree}
@@ -28,4 +30,5 @@ auto_examples/house_price_prediction/index
auto_examples/mnist_classifier/index
auto_examples/nlp_processing/index
auto_examples/forecasting_sales/index
auto_examples/time_series_modeling/index
```
2 changes: 2 additions & 0 deletions docs/tutorials.md
@@ -38,6 +38,8 @@ Train machine learning models using your framework of choice.
- Word embedding and topic modelling on lee background corpus with Gensim
* - {doc}`Sales Forecasting <auto_examples/forecasting_sales/index>`
- Use the Rossmann Store data to forecast sales with distributed training using Horovod on Spark.
* - {doc}`Time Series Modeling <auto_examples/time_series_modeling/index>`
- Train models for making forecasts on time series data.
```

## 🛠 Feature Engineering
31 changes: 31 additions & 0 deletions examples/time_series_modeling/Dockerfile
@@ -0,0 +1,31 @@
FROM python:3.8-slim-buster
LABEL org.opencontainers.image.source https://github.com/flyteorg/flytesnacks

WORKDIR /root
ENV VENV /opt/venv
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ENV PYTHONPATH /root

# Install system dependencies (graphics libraries, build tools, and curl)
RUN apt-get update && apt-get install -y libsm6 libxext6 libxrender-dev ffmpeg build-essential curl

# Virtual environment
RUN python3 -m venv ${VENV}
ENV PATH="${VENV}/bin:$PATH"

# Install Python dependencies
COPY requirements.in /root
RUN pip install -r /root/requirements.in
RUN pip freeze

# Copy the actual code
COPY . /root

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE $tag
45 changes: 45 additions & 0 deletions examples/time_series_modeling/README.md
@@ -0,0 +1,45 @@
(time_series_modeling)=

# Time Series Modeling

```{eval-rst}
.. tags:: Advanced, MachineLearning
```

Time series data is fundamentally different from independent and identically
distributed (IID) data, the assumption underlying many standard machine learning
tasks. Here are a few key differences:

1. **Temporal Dependency**: In time series data, observations are ordered
chronologically and exhibit temporal dependencies. Each data point is related
to its past and future values. This sequential nature is crucial for
forecasting and trend analysis. In contrast, IID data assumes that each
observation is independent of others.
2. **Non-stationarity**: Time series often display trends, seasonality, or cyclic
patterns that evolve over time. This non-stationarity means that statistical
properties like mean and variance can change, making analysis more complex. IID
data, by definition, maintains constant statistical properties.
3. **Autocorrelation**: Time series data frequently shows autocorrelation, where
an observation is correlated with its own past values. This structure is central
to many time series models and is, by definition, absent from IID data (see the
short sketch after this list).
4. **Importance of Order**: The sequence of observations in time series data is
critical and cannot be shuffled without losing information. In IID data, the
order of observations is assumed to be irrelevant.
5. **Inference is Focused on Forecasting**: Time series analysis often aims to
predict future values based on historical patterns, whereas many machine
learning tasks with IID data focus on classification or regression without
a temporal component.
6. **Specific Modeling Techniques**: Time series data requires specialized
modeling techniques like ARIMA, Prophet, or RNNs that can capture temporal
dynamics. These models are not typically used with IID data.
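
The autocorrelation and ordering points are easy to demonstrate directly. The
following is a minimal sketch (it assumes only `numpy` and `pandas`, which are
not dependencies of this tutorial): the lag-1 autocorrelation of a trending
series is high, but it disappears once the observations are shuffled and the
temporal order is lost.

```python
import numpy as np
import pandas as pd

# A noisy upward trend: consecutive observations are strongly related.
rng = np.random.default_rng(seed=0)
series = pd.Series(np.arange(100) * 0.5 + rng.normal(scale=2.0, size=100))
print(series.autocorr(lag=1))  # high, close to 1

# Shuffling destroys the temporal structure, so the same values now look IID.
shuffled = series.sample(frac=1.0, random_state=0).reset_index(drop=True)
print(shuffled.autocorr(lag=1))  # near 0
```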

Understanding these differences is crucial for selecting appropriate analysis
methods and interpreting results in time series modeling tasks.

Below are examples demonstrating how to use Flyte to train time series models.

## Examples

```{auto-examples-toc}
neural_prophet
```
4 changes: 4 additions & 0 deletions examples/time_series_modeling/requirements.in
@@ -0,0 +1,4 @@
flytekit>=1.7.0
wheel
matplotlib
flytekitplugins-deck-standard
Empty file.
116 changes: 116 additions & 0 deletions examples/time_series_modeling/time_series_modeling/neural_prophet.py
@@ -0,0 +1,116 @@
# %% [markdown]
# # Train a Neural Prophet Model
#
# This script demonstrates how to train a model for time series forecasting
# using the [NeuralProphet](https://neuralprophet.com/) library.

# %% [markdown]
# ## Imports and Setup
#
# First, we import necessary libraries to run the training workflow.

import pandas as pd
from flytekit import Deck, ImageSpec, current_context, task, workflow
from flytekit.types.file import FlyteFile

# %% [markdown]
# ## Define an ImageSpec
#
# For reproducibility, we create an `ImageSpec` object with the packages
# required by our tasks.

image = ImageSpec(
    name="neuralprophet",
    packages=[
        "neuralprophet",
        "matplotlib",
        "ipython",
        "pandas",
        "pyarrow",
    ],
    # This registry is for a local flyte demo cluster. Replace this with your
    # own registry, e.g. `docker.io/<username>/<imagename>`
    registry="localhost:30000",
)

# %% [markdown]
# ## Data Loading Task
#
# This task loads the time series data from a hard-coded URL pointing to a
# sample dataset from the NeuralProphet tutorial data repository.

URL = "https://github.com/ourownstory/neuralprophet-data/raw/main/kaggle-energy/datasets/tutorial01.csv"


@task(container_image=image)
def load_data() -> pd.DataFrame:
    return pd.read_csv(URL)


# %% [markdown]
# ## Model Training Task
#
# This task trains the NeuralProphet model on the loaded data.
# We train the model at an hourly frequency for ten epochs.


@task(container_image=image)
def train_model(df: pd.DataFrame) -> FlyteFile:
    from neuralprophet import NeuralProphet, save

    working_dir = current_context().working_directory
    model = NeuralProphet()
    model.fit(df, freq="H", epochs=10)
    model_fp = f"{working_dir}/model.np"
    save(model, model_fp)
    return FlyteFile(model_fp)


# %% [markdown]
# ## Forecasting Task
#
# This task loads the trained model, makes predictions, and visualizes the
# results using a Flyte Deck.


@task(
    container_image=image,
    enable_deck=True,
)
def make_forecast(df: pd.DataFrame, model_file: FlyteFile) -> pd.DataFrame:
    from neuralprophet import load

    model_file.download()
    model = load(model_file.path)

    # Create a new dataframe reaching 365 days into the future
    # for our forecast, n_historic_predictions also shows historic data
    df_future = model.make_future_dataframe(
        df,
        n_historic_predictions=True,
        periods=365,
    )

    # Predict the future
    forecast = model.predict(df_future)

    # Plot on a Flyte Deck
    fig = model.plot(forecast)
    Deck("Forecast", fig.to_html())

    return forecast


# %% [markdown]
# ## Main Workflow
#
# Finally, this workflow orchestrates the entire process: loading data,
# training the model, and making forecasts.


@workflow
def main() -> pd.DataFrame:
    df = load_data()
    model_file = train_model(df)
    forecast = make_forecast(df, model_file)
    return forecast
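

# %% [markdown]
# ## Running the Workflow
#
# Because Flyte tasks and workflows are ordinary Python callables, the pipeline
# can be executed locally for a quick sanity check before registering it on a
# cluster. The block below is a minimal sketch and not part of the original
# example; it assumes the dependencies listed in the `ImageSpec` are also
# installed locally.

if __name__ == "__main__":
    # Runs load_data, train_model, and make_forecast locally and prints the forecast.
    print(main())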