Commit a930c09

[Python] Examples using Python API for AI model training (lakesoul-io#327)
* init python/examples

Signed-off-by: zenghua <[email protected]>

* Update README.md

Signed-off-by: zenghua <[email protected]>

* Update README.md

Signed-off-by: zenghua <[email protected]>

* update import titanic by pyspark

Signed-off-by: zenghua <[email protected]>

* add titanic, imdb, food101 examples.

* update examples README file.

* update requirements.txt

* update lakesoul jar version.

* update README for python env.

* update whl file instructions.

* Update README.md

* update README.md and requirements.txt.

* update README.md.

---------

Signed-off-by: zenghua <[email protected]>
Co-authored-by: zenghua <[email protected]>
Co-authored-by: Sun Kai <[email protected]>
3 people authored Sep 15, 2023
1 parent ff85b49 commit a930c09
Showing 22 changed files with 1,254 additions and 0 deletions.
54 changes: 54 additions & 0 deletions python/examples/README.md
# LakeSoul Python Examples

## Prerequisites

### Deploy the Docker Compose environment

```bash
cd docker/lakesoul-docker-compose-env
docker compose up -d
```

### Pull the Spark image

```bash
docker pull bitnami/spark:3.3.1
```

### Download LakeSoul release jar
1. Download [maven-package-upload.zip](https://github.com/lakesoul-io/LakeSoul/suites/16162659724/artifacts/922875223).
2. Unzip the archive and extract `lakesoul-spark-2.3.0-spark-3.3-SNAPSHOT.jar` from `maven-package-upload/lakesoul-spark/target/`.

### Download LakeSoul wheel file
We provide wheel files for Python 3.8, 3.9, and 3.10. Please download the one that matches your Python version.
* For Python 3.8 users: [lakesoul-1.0.0b0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/python/v1.0/lakesoul-1.0.0b0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)
* For Python 3.9 users: [lakesoul-1.0.0b0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/python/v1.0/lakesoul-1.0.0b0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)
* For Python 3.10 users: [lakesoul-1.0.0b0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/python/v1.0/lakesoul-1.0.0b0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)

Assuming we are using Python 3.8, we can download the wheel file as follows:
```bash
wget https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/python/v1.0/lakesoul-1.0.0b0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```

### Install Python virtual environment
```bash
conda create -n lakesoul_test python=3.8
conda activate lakesoul_test
# replace ${PWD} in requirements.txt with your working directory
pip install -r requirements.txt
```
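
As a quick sanity check (a minimal sketch, not part of the shipped examples), confirm that the wheel and its Hugging Face integration import cleanly in the new environment:

```python
# Minimal import check. lakesoul.huggingface is the module the examples below
# rely on to expose datasets.IterableDataset.from_lakesoul.
import datasets
import lakesoul.huggingface

print("from_lakesoul available:", hasattr(datasets.IterableDataset, "from_lakesoul"))
```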

## Run Examples
Before running the examples, please export the LakeSoul environment variables by executing the command:

```bash
source lakesoul_env.sh
```
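
The variables exported by `lakesoul_env.sh` presumably configure the connection to the metadata database and the MinIO object store from the Docker Compose environment; the script itself is authoritative. As a hedged illustration (the variable names below are assumptions), you can verify they are visible to Python before launching an example:

```python
# Hypothetical check -- the variable names are assumptions; consult
# lakesoul_env.sh for the actual list it exports.
import os

expected = [
    "LAKESOUL_PG_URL", "LAKESOUL_PG_USERNAME", "LAKESOUL_PG_PASSWORD",
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
]
missing = [name for name in expected if name not in os.environ]
print("missing env vars:", missing or "none")
```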

Afterwards, we can test the examples using the instructions below.

| Project | Dataset | Base Model |
|:-------------------------------------|:-------------------------------------|:------------------------------------------|
| [Titanic](./titanic/) | [Kaggle Titanic Dataset](https://www.kaggle.com/competitions/titanic) | `DNN` |
| [IMDB Sentiment Analysis](./imdb/)   | [Hugging Face IMDB dataset](https://huggingface.co/datasets/imdb/tree/refs%2Fconvert%2Fparquet/plain_text/train) | [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) |
| [Food Image Search](./food101/)      | [Hugging Face Food101 dataset](https://huggingface.co/datasets/food101/tree/refs%2Fconvert%2Fparquet) | [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) |
31 changes: 31 additions & 0 deletions python/examples/food101/README.md
# Image search on Food101 dataset
## Introduction
This example validates the inference task of a multimodal model on the Food101 dataset stored in LakeSoul. We assume the current working directory is `LakeSoul/python/examples/`.

## Prepare dataset
We can download the parquet files from the [Hugging Face Food101 dataset](https://huggingface.co/datasets/food101/tree/refs%2Fconvert%2Fparquet) into the `food101/dataset/` directory.
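
One hedged way to fetch those shards is via `huggingface_hub` (a sketch, assuming a reasonably recent `huggingface_hub`; the shard names on the Hub are discovered at runtime, and `import_data.py` below expects eight files named `0000.parquet` … `0007.parquet` under `food101/dataset/`, so renaming may be needed):

```python
# Sketch: download the parquet shards of the food101 dataset from its
# refs/convert/parquet branch into food101/dataset/. Rename the downloaded
# files to 0000.parquet ... 0007.parquet if import_data.py requires it.
from huggingface_hub import hf_hub_download, list_repo_files

revision = "refs/convert/parquet"
for name in list_repo_files("food101", repo_type="dataset", revision=revision):
    if name.endswith(".parquet"):
        hf_hub_download(
            "food101",
            name,
            repo_type="dataset",
            revision=revision,
            local_dir="food101/dataset",
        )
```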


## Import data into LakeSoul
```shell
export lakesoul_jar=lakesoul-spark-2.3.0-spark-3.3-SNAPSHOT.jar
sudo docker run --rm -ti --net lakesoul-docker-compose-env_default \
-v $PWD/"${lakesoul_jar}":/opt/spark/work-dir/jars/"${lakesoul_jar}" \
-v $PWD/../../python/lakesoul/:/opt/bitnami/spark/lakesoul \
-v $PWD/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \
-v $PWD/food101:/opt/spark/work-dir/food101 \
--env lakesoul_home=/opt/spark/work-dir/lakesoul.properties \
bitnami/spark:3.3.1 spark-submit --jars /opt/spark/work-dir/jars/"${lakesoul_jar}" --driver-memory 16G --executor-memory 16G --master "local[4]" --conf spark.pyspark.python=./venv/bin/python3 /opt/spark/work-dir/food101/import_data.py
```

## Vectorizing pictures in LakeSoul
```shell
python food101/embedding.py clip_dataset > food101/embs.tsv
```

## Search for images
Since it is a food dataset, you can try food-related keywords.

```shell
python food101/search.py food101/embs.tsv 5
```
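
For each query, `search.py` prints the top-k matches as a list of `{'score': ..., 'image': ...}` dictionaries, where `image` is a URL under `http://im-api.dmetasoul.com/food101`.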
62 changes: 62 additions & 0 deletions python/examples/food101/embedding.py
# SPDX-FileCopyrightText: 2023 LakeSoul Contributors
#
# SPDX-License-Identifier: Apache-2.0

import sys

import datasets
import lakesoul.huggingface  # needed for datasets.IterableDataset.from_lakesoul below
import torch

from io import BytesIO
from PIL import Image
from sentence_transformers import SentenceTransformer


def batchify(dataset, batch_size):
batch = []
for i, item in enumerate(dataset):
record = {
"ids": i,
"image_bytes": item["image_bytes"],
"image_path": item["image_path"]
}
batch.append(record)

if len(batch) == batch_size:
yield batch
batch = []

# Handle the remaining records that don't fill up a full batch
if len(batch) > 0:
yield batch

if __name__ == '__main__':
data_source = sys.argv[1]
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
base_url = 'http://im-api.dmetasoul.com/food101'
img_model = SentenceTransformer('clip-ViT-B-32')
    max_images = -1  # set to a positive number to limit how many images are embedded

img_id = 0
dataset = datasets.IterableDataset.from_lakesoul(data_source)
for batch in batchify(dataset, batch_size=4):
ids = list(range(img_id, img_id+len(batch)))
urls = [f"{base_url}/{row['image_path']}" for row in batch]
images = [Image.open(BytesIO(row['image_bytes'])).convert('RGB') for row in batch]
try:
embs = img_model.encode(images, device=device,
convert_to_numpy=True, show_progress_bar=False, normalize_embeddings=True)
embs = embs.tolist()
        except Exception:
            # skip batches whose images fail to decode or embed
            continue

img_id += len(batch)
for _id, _url, _emb in zip(ids, urls, embs):
print(_id, _url, ' '.join(map(str, _emb)), sep='\t')

if max_images > 0 and img_id > max_images:
break
49 changes: 49 additions & 0 deletions python/examples/food101/import_data.py
# SPDX-FileCopyrightText: 2023 LakeSoul Contributors
#
# SPDX-License-Identifier: Apache-2.0

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col
from lakesoul.spark import LakeSoulTable

if __name__ == "__main__":
spark = SparkSession.builder \
.master("local[4]") \
.config("spark.driver.memoryOverhead", "1500m") \
.config("spark.sql.extensions", "com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension") \
.config("spark.sql.catalog.lakesoul", "org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog") \
.config("spark.sql.defaultCatalog", "lakesoul") \
.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.buffer.dir", "/opt/spark/work-dir/s3a") \
.config("spark.hadoop.fs.s3a.path.style.access", "true") \
.config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
.config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider") \
.config("spark.hadoop.fs.s3a.access.key", "minioadmin1") \
.config("spark.hadoop.fs.s3a.secret.key", "minioadmin1") \
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

dataset_table = "clip_dataset"
tablePath = "s3://lakesoul-test-bucket/clip_dataset"

print("Debug -- Show tables before importing data")
spark.sql("show tables").show()
spark.sql("drop table if exists clip_dataset")
dataset_table = "clip_dataset"

for i in range(8):
fileName = f"{i:04d}"
filePath = f"/opt/spark/work-dir/food101/dataset/{fileName}.parquet"
df = spark.read.format("parquet").load(filePath)
df.withColumn("range", lit(fileName))\
.write.mode("append").format("lakesoul")\
.option("rangePartitions", "range")\
.option("shortTableName", dataset_table)\
.save(tablePath)
print("write dataset partion:", fileName)

print("Debug -- Show tables after importing data")
spark.sql("show tables").show()
LakeSoulTable.forName(spark, dataset_table).toDF().show(20)

spark.stop()
60 changes: 60 additions & 0 deletions python/examples/food101/search.py
# SPDX-FileCopyrightText: 2023 LakeSoul Contributors
#
# SPDX-License-Identifier: Apache-2.0

import sys

import numpy as np
from sentence_transformers import SentenceTransformer, util


def load_model():
return SentenceTransformer('clip-ViT-B-32-multilingual-v1')

def get_image_embs(emb_file):
embs = []
imgs = {}
with open(emb_file, 'r', encoding='utf8') as f:
for line in f:
line = line.strip('\r\n')
if not line:
continue
row_id, img_url, img_emb = line.split('\t')[:3]
img_emb = list(map(float, img_emb.split(' ')))
img_id = len(embs)
imgs[img_id] = {'url':img_url}
embs.append(img_emb)
return imgs, np.array(embs, dtype=np.float32)

def get_query_emb(query, model):
query_emb = model.encode([query], convert_to_numpy=True, show_progress_bar=False)
return query_emb

def search(query_emb, img_embs, imgs, k=3):
hits = util.semantic_search(query_emb, img_embs, top_k=k)[0]
results = []
for hit in hits:
img_id = hit['corpus_id']
score = hit['score']
if img_id not in imgs:
continue
img_url = imgs[img_id]['url']
results.append({'score': score, 'image': img_url})
return results


if __name__ == '__main__':
emb_file = sys.argv[1]
top_k = int(sys.argv[2])

model = load_model()
print("Model loaded successfully")
imgs, img_embs = get_image_embs(emb_file)
print("Vector database loaded successfully", img_embs.shape)

    while True:
        query = input("Please enter a keyword to search for images: ").strip()
        if not query:
            break  # empty input exits
query_emb = get_query_emb(query, model)
results = search(query_emb, img_embs, imgs, k=top_k)
print(results)
print("*"*80)
34 changes: 34 additions & 0 deletions python/examples/imdb/README.md
# Text classification on IMDB dataset
## Introduction
This example demonstrates fine-tuning a BERT model with the Hugging Face Trainer API on the IMDB dataset read from a LakeSoul table. We assume the current working directory is `LakeSoul/python/examples/`.

## Prepare data
We can download the parquet files from the [Hugging Face IMDB dataset](https://huggingface.co/datasets/imdb/tree/refs%2Fconvert%2Fparquet/plain_text/train) into the `imdb/dataset/` directory.
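
As a hedged alternative to downloading the files by hand (a sketch; the output file name `train.parquet` is an assumption — match whatever `imdb/import_data.py` expects):

```python
# Sketch: pull the IMDB train split with the datasets library and write it as
# a single parquet file under imdb/dataset/.
import datasets

imdb_train = datasets.load_dataset("imdb", split="train")
imdb_train.to_parquet("imdb/dataset/train.parquet")
```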

## Import data into LakeSoul
```shell
export lakesoul_jar=lakesoul-spark-2.3.0-spark-3.3-SNAPSHOT.jar
sudo docker run --rm -ti --net lakesoul-docker-compose-env_default \
-v $PWD/"${lakesoul_jar}":/opt/spark/work-dir/jars/"${lakesoul_jar}" \
-v $PWD/../../python/lakesoul/:/opt/bitnami/spark/lakesoul \
-v $PWD/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \
-v $PWD/imdb:/opt/spark/work-dir/imdb \
--env lakesoul_home=/opt/spark/work-dir/lakesoul.properties \
bitnami/spark:3.3.1 spark-submit --jars /opt/spark/work-dir/jars/"${lakesoul_jar}" --driver-memory 16G --executor-memory 16G --master "local[4]" --conf spark.pyspark.python=./venv/bin/python3 /opt/spark/work-dir/imdb/import_data.py
```

## Train model using HuggingFace Trainer API
```shell
conda activate lakesoul_test
python imdb/train.py
```
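
For orientation, here is a minimal sketch of how a training script can stream the imported table through the Hugging Face integration; the table name `imdb_dataset` and the column name `text` are assumptions — `imdb/import_data.py` and `imdb/train.py` are authoritative:

```python
# Sketch: read the LakeSoul table as a Hugging Face IterableDataset and
# tokenize it for distilbert-base-uncased.
import datasets
import lakesoul.huggingface  # exposes datasets.IterableDataset.from_lakesoul
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
dataset = datasets.IterableDataset.from_lakesoul("imdb_dataset")
tokenized = dataset.map(lambda row: tokenizer(row["text"], truncation=True))
```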

## Run inference with the trained model
```shell
python imdb/inference.py
```

## References
1. https://huggingface.co/docs/transformers/tasks/sequence_classification
2. https://huggingface.co/datasets/imdb
