[Python] Examples using Python API for AI model training (lakesoul-io#327)

* init python/examples
* Update README.md
* Update README.md
* update import titanic by pyspark
* add titanic, imdb, food101 examples.
* update examples README file.
* update requirements.txt
* update lakesoul jar version.
* update README for python env.
* update whl file instructions.
* Update README.md
* update README.md and requirements.txt.
* update README.md.

Signed-off-by: zenghua <[email protected]>
Co-authored-by: zenghua <[email protected]>
Co-authored-by: Sun Kai <[email protected]>
1 parent ff85b49 · commit a930c09 · Showing 22 changed files with 1,254 additions and 0 deletions.
@@ -0,0 +1,54 @@
# LakeSoul Python Examples

## Prerequisites

### Deploy the Docker Compose environment

```bash
cd docker/lakesoul-docker-compose-env
docker compose up -d
```

### Pull the Spark image

```bash
docker pull bitnami/spark:3.3.1
```
### Download the LakeSoul release jar
1. Download [maven-package-upload.zip](https://github.com/lakesoul-io/LakeSoul/suites/16162659724/artifacts/922875223).
2. Unzip the archive and extract `lakesoul-spark-2.3.0-spark-3.3-SNAPSHOT.jar` from `maven-package-upload/lakesoul-spark/target/`, as shown below.
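For example (a minimal sketch; the jar location inside the zip follows the layout above):

```bash
unzip maven-package-upload.zip
cp maven-package-upload/lakesoul-spark/target/lakesoul-spark-2.3.0-spark-3.3-SNAPSHOT.jar .
```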
### Download the LakeSoul wheel file
For users of Python 3.8, Python 3.9, and Python 3.10, we have prepared different wheel files for each version. Please download the appropriate one based on your requirements.
* For Python 3.8 users: [lakesoul-1.0.0b0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/python/v1.0/lakesoul-1.0.0b0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)
* For Python 3.9 users: [lakesoul-1.0.0b0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/python/v1.0/lakesoul-1.0.0b0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)
* For Python 3.10 users: [lakesoul-1.0.0b0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/python/v1.0/lakesoul-1.0.0b0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)
Assuming we are using Python 3.8, we can download the wheel file as below:

```bash
wget https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/python/v1.0/lakesoul-1.0.0b0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```
### Install the Python virtual environment
```bash
conda create -n lakesoul_test python=3.8
conda activate lakesoul_test
# run this from your working directory (LakeSoul/python/examples)
pip install -r requirements.txt
```
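Then install the LakeSoul wheel downloaded earlier into the same environment (assuming the Python 3.8 wheel from above):

```bash
pip install lakesoul-1.0.0b0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```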
## Run Examples
Before running the examples, please export the LakeSoul environment variables by executing the command:

```bash
source lakesoul_env.sh
```
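The script sets the connection info for LakeSoul's PostgreSQL metadata store; the shipped file is the source of truth. As a rough illustration, the variables involved typically look like this (names and values here are assumptions based on the docker-compose defaults, not the actual script):

```bash
# Illustrative sketch only -- use the lakesoul_env.sh shipped with the examples.
export LAKESOUL_PG_DRIVER="com.lakesoul.shaded.org.postgresql.Driver"
export LAKESOUL_PG_URL="jdbc:postgresql://localhost:5432/lakesoul_test?stringtype=unspecified"
export LAKESOUL_PG_USERNAME="lakesoul_test"
export LAKESOUL_PG_PASSWORD="lakesoul_test"
```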
Afterwards, we can test the examples using the instructions below; a minimal read check follows the table.
| Project | Dataset | Base Model |
|:-------------------------------------|:-------------------------------------|:------------------------------------------|
| [Titanic](./titanic/) | [Kaggle Titanic Dataset](https://www.kaggle.com/competitions/titanic) | `DNN` |
| [IMDB Sentiment Analysis](./imdb/) | [Hugging Face IMDB dataset](https://huggingface.co/datasets/imdb/tree/refs%2Fconvert%2Fparquet/plain_text/train) | [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) |
| [Food Image Search](./food101/) | [Hugging Face Food101 dataset](https://huggingface.co/datasets/food101/tree/refs%2Fconvert%2Fparquet) | [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) |
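Once the environment variables are set, a quick way to confirm the Python bindings can reach an imported table is to stream a few records (a sketch; `clip_dataset` assumes the Food101 import step described in its README has already been run):

```python
import datasets
import lakesoul.huggingface  # needed for datasets.IterableDataset.from_lakesoul

# Stream a few records from an imported LakeSoul table.
dataset = datasets.IterableDataset.from_lakesoul("clip_dataset")
for i, item in enumerate(dataset):
    print(sorted(item.keys()))
    if i >= 2:
        break
```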
@@ -0,0 +1,31 @@
# Image search on Food101 dataset
## Introduction
This example validates the inference task of a multimodal model on the Food101 dataset stored in LakeSoul. Assuming our current working directory is `LakeSoul/python/examples/`.

## Prepare dataset
We can download the data from the [Hugging Face Food101 dataset](https://huggingface.co/datasets/food101/tree/refs%2Fconvert%2Fparquet) into the `food101/dataset/` directory.
## Import data into LakeSoul
```shell
export lakesoul_jar=lakesoul-spark-2.3.0-spark-3.3-SNAPSHOT.jar
sudo docker run --rm -ti --net lakesoul-docker-compose-env_default \
-v $PWD/"${lakesoul_jar}":/opt/spark/work-dir/jars/"${lakesoul_jar}" \
-v $PWD/../../python/lakesoul/:/opt/bitnami/spark/lakesoul \
-v $PWD/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \
-v $PWD/food101:/opt/spark/work-dir/food101 \
--env lakesoul_home=/opt/spark/work-dir/lakesoul.properties \
bitnami/spark:3.3.1 spark-submit --jars /opt/spark/work-dir/jars/"${lakesoul_jar}" --driver-memory 16G --executor-memory 16G --master "local[4]" --conf spark.pyspark.python=./venv/bin/python3 /opt/spark/work-dir/food101/import_data.py
```
## Vectorize images in LakeSoul
```shell
python food101/embedding.py clip_dataset > food101/embs.tsv
```
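Each line written to `embs.tsv` is tab-separated: an integer id, the image URL, and the space-separated embedding values (512 floats for clip-ViT-B-32). Schematically, with illustrative values:

```
0	http://im-api.dmetasoul.com/food101/<image_path>	0.0123 -0.0456 ... 0.0789
```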
## Search for images
Since it is a food dataset, you can try food-related keywords.

```shell
python food101/search.py food101/embs.tsv 5
```
@@ -0,0 +1,62 @@
# SPDX-FileCopyrightText: 2023 LakeSoul Contributors
#
# SPDX-License-Identifier: Apache-2.0

import sys

import datasets
import lakesoul.huggingface  # needed for datasets.IterableDataset.from_lakesoul

from io import BytesIO
from PIL import Image
from sentence_transformers import SentenceTransformer


def batchify(dataset, batch_size):
    """Group the streaming dataset into fixed-size batches of records."""
    batch = []
    for i, item in enumerate(dataset):
        record = {
            "ids": i,
            "image_bytes": item["image_bytes"],
            "image_path": item["image_path"]
        }
        batch.append(record)

        if len(batch) == batch_size:
            yield batch
            batch = []

    # Handle the remaining records that don't fill up a full batch
    if len(batch) > 0:
        yield batch


if __name__ == '__main__':
    data_source = sys.argv[1]
    device = 'cuda'
    base_url = 'http://im-api.dmetasoul.com/food101'
    img_model = SentenceTransformer('clip-ViT-B-32')
    # Set to a positive number (e.g. 10000) to cap how many images are
    # embedded; -1 processes the whole table.
    max_images = -1

    img_id = 0
    dataset = datasets.IterableDataset.from_lakesoul(data_source)
    for batch in batchify(dataset, batch_size=4):
        ids = list(range(img_id, img_id + len(batch)))
        urls = [f"{base_url}/{row['image_path']}" for row in batch]
        images = [Image.open(BytesIO(row['image_bytes'])).convert('RGB') for row in batch]
        try:
            embs = img_model.encode(images, device=device,
                convert_to_numpy=True, show_progress_bar=False, normalize_embeddings=True)
            embs = embs.tolist()
        except Exception:
            # Skip batches whose images fail to decode or encode.
            continue

        img_id += len(batch)
        # One TSV line per image: id, URL, space-separated embedding values.
        for _id, _url, _emb in zip(ids, urls, embs):
            print(_id, _url, ' '.join(map(str, _emb)), sep='\t')

        if max_images > 0 and img_id > max_images:
            break
@@ -0,0 +1,49 @@
# SPDX-FileCopyrightText: 2023 LakeSoul Contributors
#
# SPDX-License-Identifier: Apache-2.0

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from lakesoul.spark import LakeSoulTable

if __name__ == "__main__":
    # Local Spark session configured for the LakeSoul catalog and the MinIO
    # S3 endpoint from the docker-compose environment.
    spark = SparkSession.builder \
        .master("local[4]") \
        .config("spark.driver.memoryOverhead", "1500m") \
        .config("spark.sql.extensions", "com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension") \
        .config("spark.sql.catalog.lakesoul", "org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog") \
        .config("spark.sql.defaultCatalog", "lakesoul") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.buffer.dir", "/opt/spark/work-dir/s3a") \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
        .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider") \
        .config("spark.hadoop.fs.s3a.access.key", "minioadmin1") \
        .config("spark.hadoop.fs.s3a.secret.key", "minioadmin1") \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    dataset_table = "clip_dataset"
    table_path = "s3://lakesoul-test-bucket/clip_dataset"

    print("Debug -- Show tables before importing data")
    spark.sql("show tables").show()
    spark.sql(f"drop table if exists {dataset_table}")

    # Import the eight parquet shards, writing one range partition per shard.
    for i in range(8):
        file_name = f"{i:04d}"
        file_path = f"/opt/spark/work-dir/food101/dataset/{file_name}.parquet"
        df = spark.read.format("parquet").load(file_path)
        df.withColumn("range", lit(file_name))\
            .write.mode("append").format("lakesoul")\
            .option("rangePartitions", "range")\
            .option("shortTableName", dataset_table)\
            .save(table_path)
        print("write dataset partition:", file_name)

    print("Debug -- Show tables after importing data")
    spark.sql("show tables").show()
    LakeSoulTable.forName(spark, dataset_table).toDF().show(20)

    spark.stop()
@@ -0,0 +1,60 @@
# SPDX-FileCopyrightText: 2023 LakeSoul Contributors
#
# SPDX-License-Identifier: Apache-2.0

import sys

import numpy as np
from sentence_transformers import SentenceTransformer, util


def load_model():
    # Multilingual text encoder aligned with the clip-ViT-B-32 image encoder.
    return SentenceTransformer('clip-ViT-B-32-multilingual-v1')

def get_image_embs(emb_file):
    """Load the TSV written by embedding.py: row id, image URL, embedding."""
    embs = []
    imgs = {}
    with open(emb_file, 'r', encoding='utf8') as f:
        for line in f:
            line = line.strip('\r\n')
            if not line:
                continue
            row_id, img_url, img_emb = line.split('\t')[:3]
            img_emb = list(map(float, img_emb.split(' ')))
            img_id = len(embs)
            imgs[img_id] = {'url': img_url}
            embs.append(img_emb)
    return imgs, np.array(embs, dtype=np.float32)

def get_query_emb(query, model):
    query_emb = model.encode([query], convert_to_numpy=True, show_progress_bar=False)
    return query_emb

def search(query_emb, img_embs, imgs, k=3):
    # Cosine-similarity search of the query against all image embeddings.
    hits = util.semantic_search(query_emb, img_embs, top_k=k)[0]
    results = []
    for hit in hits:
        img_id = hit['corpus_id']
        score = hit['score']
        if img_id not in imgs:
            continue
        img_url = imgs[img_id]['url']
        results.append({'score': score, 'image': img_url})
    return results


if __name__ == '__main__':
    emb_file = sys.argv[1]
    top_k = int(sys.argv[2])

    model = load_model()
    print("Model loaded successfully")
    imgs, img_embs = get_image_embs(emb_file)
    print("Vector database loaded successfully", img_embs.shape)

    while True:
        query = input("Please enter a keyword to search for images: ").strip()
        if not query:
            # An empty query exits the loop.
            break
        query_emb = get_query_emb(query, model)
        results = search(query_emb, img_embs, imgs, k=top_k)
        print(results)
        print("*" * 80)
@@ -0,0 +1,34 @@
# Text classification on IMDB dataset
## Introduction
This example demonstrates fine-tuning a BERT-family model with the Hugging Face Trainer API on the IMDB dataset served from a LakeSoul data source. Assuming our current working directory is `LakeSoul/python/examples/`.

## Prepare data
We can download the data from the [Hugging Face IMDB dataset](https://huggingface.co/datasets/imdb/tree/refs%2Fconvert%2Fparquet/plain_text/train) into the `imdb/dataset/` directory.
## Import data into LakeSoul
```shell
export lakesoul_jar=lakesoul-spark-2.3.0-spark-3.3-SNAPSHOT.jar
sudo docker run --rm -ti --net lakesoul-docker-compose-env_default \
-v $PWD/"${lakesoul_jar}":/opt/spark/work-dir/jars/"${lakesoul_jar}" \
-v $PWD/../../python/lakesoul/:/opt/bitnami/spark/lakesoul \
-v $PWD/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \
-v $PWD/imdb:/opt/spark/work-dir/imdb \
--env lakesoul_home=/opt/spark/work-dir/lakesoul.properties \
bitnami/spark:3.3.1 spark-submit --jars /opt/spark/work-dir/jars/"${lakesoul_jar}" --driver-memory 16G --executor-memory 16G --master "local[4]" --conf spark.pyspark.python=./venv/bin/python3 /opt/spark/work-dir/imdb/import_data.py
```
## Train the model using the Hugging Face Trainer API
```shell
conda activate lakesoul_test
python imdb/train.py
```
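For orientation, the core of such a training script could look roughly like this (a hypothetical sketch, not the shipped `imdb/train.py`; the table name `imdb_dataset` and the `text`/`label` column names are assumptions):

```python
import datasets
import lakesoul.huggingface  # needed for datasets.IterableDataset.from_lakesoul
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

# Stream the training split straight out of LakeSoul.
train_dataset = datasets.IterableDataset.from_lakesoul("imdb_dataset").map(
    tokenize, batched=True)

args = TrainingArguments(
    output_dir="imdb_model",
    per_device_train_batch_size=8,
    max_steps=1000,  # streaming datasets have no length, so bound training by steps
)
Trainer(model=model, args=args, train_dataset=train_dataset).train()
```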
## Run inference with the trained model
```shell
python imdb/inference.py
```
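As with training, the inference script itself is not shown here; a minimal hypothetical sketch, assuming the fine-tuned model was saved under `imdb_model/`:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint and classify a review.
classifier = pipeline("sentiment-analysis", model="imdb_model")
print(classifier("This movie was a complete waste of time."))
```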
## References
1. https://huggingface.co/docs/transformers/tasks/sequence_classification
2. https://huggingface.co/datasets/imdb