Skip to content

Aadik1ng/ML-Pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Stock Price Prediction Pipeline

This project aims to build a machine learning pipeline for predicting stock prices using various features extracted from historical data. The pipeline consists of several components, including data ingestion, data visualization, feature engineering, and model building.

MLOps Integration

In this project, we have integrated MLOps practices to enhance the efficiency, reproducibility, and collaboration of our machine learning workflow. The primary tools utilized for MLOps in this project are DVC (Data Version Control) and MLflow.

DVC (Data Version Control)

DVC is an open-source version control system designed specifically for managing machine learning projects. It allows data scientists and engineers to track changes in data, models, and experiments, similar to how Git manages code. Here are some key features and benefits of using DVC:

  • Data Versioning: DVC enables us to version our datasets, ensuring that we can easily revert to previous versions if needed.
  • Pipeline Management: DVC allows us to define and manage complex data pipelines.
  • Integration with Git: DVC integrates seamlessly with Git, allowing us to keep our data and code in sync.

Example of DVC Usage

dvc init  # Initialize DVC in the project
dvc add data/raw_data.csv  # Add the raw data file to DVC
git add data/raw_data.csv.dvc data/.gitignore  # Track the DVC file in Git

MLflow

MLflow is a powerful platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment. It provides tools for tracking experiments, packaging code into reproducible runs, and sharing and deploying models. Key features of MLflow include:

  • Experiment Tracking: MLflow allows us to log parameters, metrics, and artifacts for each experiment run.
  • Model Registry: MLflow provides a centralized model registry, enabling us to manage and version our machine learning models.
  • Integration with Various Frameworks: MLflow supports various machine learning libraries and frameworks, making it flexible for different use cases.

Example of MLflow Usage

import mlflow

with mlflow.start_run():
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_metric("mse", mean_squared_error(y_true, y_pred))
    mlflow.sklearn.log_model(model, "model")

Project Structure

The project is organized into several key scripts, each responsible for a specific aspect of the pipeline:

  • Data Ingestion: data_ingestion.py
  • Data Visualization: data_visualization.py
  • Feature Engineering: feature_engineering.py
  • Model Building: model_building.py

Data Ingestion

The data_ingestion.py script is responsible for loading and transforming the dataset. It includes robust logging to track the data loading process.

import pandas as pd
import logging

logger = logging.getLogger('data_ingestion')

def load_data(data_url: str) -> pd.DataFrame:
    df = pd.read_csv(data_url)
    logger.debug('Data loaded successfully.')
    return df

Data Visualization

In data_visualization.py, we create insightful visualizations to understand the data better.

import seaborn as sns
import matplotlib.pyplot as plt

def visualize_distribution(df: pd.DataFrame, stock_name: str):
    sns.histplot(df[df['Stock_Name'] == stock_name]['Price'], kde=True)
    plt.title(f'Distribution of {stock_name} Prices')
    plt.show()

Feature Engineering

The feature_engineering.py script handles the creation of new features that enhance model performance.

from sklearn.preprocessing import LabelEncoder

def convert_stock_names_to_ids(df: pd.DataFrame, column: str) -> pd.DataFrame:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    return df

Model Building

Finally, in model_building.py, we build and evaluate our models.

from sklearn.ensemble import RandomForestRegressor

def train_model(X_train, y_train):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    return model

Conclusion

By integrating DVC and MLflow into our stock price prediction pipeline, we enhance our ability to manage data and track experiments efficiently. This project not only demonstrates the power of MLOps but also serves as a robust framework for predicting stock prices. Join us in exploring the exciting world of financial predictions!

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages