[Python] Examples using Python API for AI model training (lakesoul-io#327)

* init python/examples
* Update README.md
* Update README.md
* update import titanic by pyspark
* add titanic, imdb, food101 examples.
* update examples README file.
* update requirements.txt
* update lakesoul jar version.
* update README for python env.
* update whl file instructions.
* Update README.md
* update README.md and requirements.txt.
* update README.md.

Signed-off-by: zenghua <[email protected]>
Co-authored-by: zenghua <[email protected]>
Co-authored-by: Sun Kai <[email protected]>
1 parent ff85b49 · commit a930c09 · Showing 22 changed files with 1,254 additions and 0 deletions.
@@ -0,0 +1,54 @@
# LakeSoul Python Examples

## Prerequisites

### Deploy the Docker Compose environment

```bash
cd docker/lakesoul-docker-compose-env
docker compose up -d
```

### Pull the Spark image

```bash
docker pull bitnami/spark:3.3.1
```
### Download the LakeSoul release jar
1. Download [maven-package-upload.zip](https://github.com/lakesoul-io/LakeSoul/suites/16162659724/artifacts/922875223).
2. Unzip the archive and extract `lakesoul-spark-2.3.0-spark-3.3-SNAPSHOT.jar` from `maven-package-upload/lakesoul-spark/target/`, as shown below.
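For example (a minimal sketch; the jar location inside the zip follows the layout above):

```bash
unzip maven-package-upload.zip
cp maven-package-upload/lakesoul-spark/target/lakesoul-spark-2.3.0-spark-3.3-SNAPSHOT.jar .
```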
### Download the LakeSoul wheel file
For users of Python 3.8, Python 3.9, and Python 3.10, we have prepared different wheel files for each version. Please download the appropriate one based on your requirements.
* For Python 3.8 users: [lakesoul-1.0.0b0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/python/v1.0/lakesoul-1.0.0b0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)
* For Python 3.9 users: [lakesoul-1.0.0b0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/python/v1.0/lakesoul-1.0.0b0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)
* For Python 3.10 users: [lakesoul-1.0.0b0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/python/v1.0/lakesoul-1.0.0b0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)
Assuming we are using Python 3.8, we can download the wheel file as below:

```bash
wget https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/python/v1.0/lakesoul-1.0.0b0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```
### Install the Python virtual environment
```bash
conda create -n lakesoul_test python=3.8
conda activate lakesoul_test
# run this from your working directory (LakeSoul/python/examples)
pip install -r requirements.txt
```
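Then install the LakeSoul wheel downloaded earlier into the same environment (assuming the Python 3.8 wheel from above):

```bash
pip install lakesoul-1.0.0b0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```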
## Run Examples
Before running the examples, please export the LakeSoul environment variables by executing the command:

```bash
source lakesoul_env.sh
```
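The script sets the connection info for LakeSoul's PostgreSQL metadata store; the shipped file is the source of truth. As a rough illustration, the variables involved typically look like this (names and values here are assumptions based on the docker-compose defaults, not the actual script):

```bash
# Illustrative sketch only -- use the lakesoul_env.sh shipped with the examples.
export LAKESOUL_PG_DRIVER="com.lakesoul.shaded.org.postgresql.Driver"
export LAKESOUL_PG_URL="jdbc:postgresql://localhost:5432/lakesoul_test?stringtype=unspecified"
export LAKESOUL_PG_USERNAME="lakesoul_test"
export LAKESOUL_PG_PASSWORD="lakesoul_test"
```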
Afterwards, we can test the examples using the instructions below; a minimal read check follows the table.
| Project | Dataset | Base Model |
|:-------------------------------------|:-------------------------------------|:------------------------------------------|
| [Titanic](./titanic/) | [Kaggle Titanic Dataset](https://www.kaggle.com/competitions/titanic) | `DNN` |
| [IMDB Sentiment Analysis](./imdb/) | [Hugging Face IMDB dataset](https://huggingface.co/datasets/imdb/tree/refs%2Fconvert%2Fparquet/plain_text/train) | [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) |
| [Food Image Search](./food101/) | [Hugging Face Food101 dataset](https://huggingface.co/datasets/food101/tree/refs%2Fconvert%2Fparquet) | [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) |
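Once the environment variables are set, a quick way to confirm the Python bindings can reach an imported table is to stream a few records (a sketch; `clip_dataset` assumes the Food101 import step described in its README has already been run):

```python
import datasets
import lakesoul.huggingface  # needed for datasets.IterableDataset.from_lakesoul

# Stream a few records from an imported LakeSoul table.
dataset = datasets.IterableDataset.from_lakesoul("clip_dataset")
for i, item in enumerate(dataset):
    print(sorted(item.keys()))
    if i >= 2:
        break
```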
@@ -0,0 +1,31 @@
# Image search on Food101 dataset
## Introduction
This example validates the inference task of a multimodal model on the Food101 dataset stored in LakeSoul. Assuming our current working directory is `LakeSoul/python/examples/`.

## Prepare dataset
We can download the data from the [Hugging Face Food101 dataset](https://huggingface.co/datasets/food101/tree/refs%2Fconvert%2Fparquet) into the `food101/dataset/` directory.
## Import data into LakeSoul
```shell
export lakesoul_jar=lakesoul-spark-2.3.0-spark-3.3-SNAPSHOT.jar
sudo docker run --rm -ti --net lakesoul-docker-compose-env_default \
-v $PWD/"${lakesoul_jar}":/opt/spark/work-dir/jars/"${lakesoul_jar}" \
-v $PWD/../../python/lakesoul/:/opt/bitnami/spark/lakesoul \
-v $PWD/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \
-v $PWD/food101:/opt/spark/work-dir/food101 \
--env lakesoul_home=/opt/spark/work-dir/lakesoul.properties \
bitnami/spark:3.3.1 spark-submit --jars /opt/spark/work-dir/jars/"${lakesoul_jar}" --driver-memory 16G --executor-memory 16G --master "local[4]" --conf spark.pyspark.python=./venv/bin/python3 /opt/spark/work-dir/food101/import_data.py
```
## Vectorize images in LakeSoul
```shell
python food101/embedding.py clip_dataset > food101/embs.tsv
```
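Each line written to `embs.tsv` is tab-separated: an integer id, the image URL, and the space-separated embedding values (512 floats for clip-ViT-B-32). Schematically, with illustrative values:

```
0	http://im-api.dmetasoul.com/food101/<image_path>	0.0123 -0.0456 ... 0.0789
```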
## Search for images
Since it is a food dataset, you can try food-related keywords.

```shell
python food101/search.py food101/embs.tsv 5
```
@@ -0,0 +1,62 @@
# SPDX-FileCopyrightText: 2023 LakeSoul Contributors
#
# SPDX-License-Identifier: Apache-2.0

import sys

import datasets
import lakesoul.huggingface  # needed for datasets.IterableDataset.from_lakesoul

from io import BytesIO
from PIL import Image
from sentence_transformers import SentenceTransformer


def batchify(dataset, batch_size):
    """Group the streaming dataset into fixed-size batches of records."""
    batch = []
    for i, item in enumerate(dataset):
        record = {
            "ids": i,
            "image_bytes": item["image_bytes"],
            "image_path": item["image_path"]
        }
        batch.append(record)

        if len(batch) == batch_size:
            yield batch
            batch = []

    # Handle the remaining records that don't fill up a full batch
    if len(batch) > 0:
        yield batch


if __name__ == '__main__':
    data_source = sys.argv[1]
    device = 'cuda'
    base_url = 'http://im-api.dmetasoul.com/food101'
    img_model = SentenceTransformer('clip-ViT-B-32')
    # Set to a positive number (e.g. 10000) to cap how many images are
    # embedded; -1 processes the whole table.
    max_images = -1

    img_id = 0
    dataset = datasets.IterableDataset.from_lakesoul(data_source)
    for batch in batchify(dataset, batch_size=4):
        ids = list(range(img_id, img_id + len(batch)))
        urls = [f"{base_url}/{row['image_path']}" for row in batch]
        images = [Image.open(BytesIO(row['image_bytes'])).convert('RGB') for row in batch]
        try:
            embs = img_model.encode(images, device=device,
                convert_to_numpy=True, show_progress_bar=False, normalize_embeddings=True)
            embs = embs.tolist()
        except Exception:
            # Skip batches whose images fail to decode or encode.
            continue

        img_id += len(batch)
        # One TSV line per image: id, URL, space-separated embedding values.
        for _id, _url, _emb in zip(ids, urls, embs):
            print(_id, _url, ' '.join(map(str, _emb)), sep='\t')

        if max_images > 0 and img_id > max_images:
            break
@@ -0,0 +1,49 @@
# SPDX-FileCopyrightText: 2023 LakeSoul Contributors
#
# SPDX-License-Identifier: Apache-2.0

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from lakesoul.spark import LakeSoulTable

if __name__ == "__main__":
    # Local Spark session configured for the LakeSoul catalog and the MinIO
    # S3 endpoint from the docker-compose environment.
    spark = SparkSession.builder \
        .master("local[4]") \
        .config("spark.driver.memoryOverhead", "1500m") \
        .config("spark.sql.extensions", "com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension") \
        .config("spark.sql.catalog.lakesoul", "org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog") \
        .config("spark.sql.defaultCatalog", "lakesoul") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.buffer.dir", "/opt/spark/work-dir/s3a") \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
        .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider") \
        .config("spark.hadoop.fs.s3a.access.key", "minioadmin1") \
        .config("spark.hadoop.fs.s3a.secret.key", "minioadmin1") \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    dataset_table = "clip_dataset"
    table_path = "s3://lakesoul-test-bucket/clip_dataset"

    print("Debug -- Show tables before importing data")
    spark.sql("show tables").show()
    spark.sql(f"drop table if exists {dataset_table}")

    # Import the eight parquet shards, writing one range partition per shard.
    for i in range(8):
        file_name = f"{i:04d}"
        file_path = f"/opt/spark/work-dir/food101/dataset/{file_name}.parquet"
        df = spark.read.format("parquet").load(file_path)
        df.withColumn("range", lit(file_name))\
            .write.mode("append").format("lakesoul")\
            .option("rangePartitions", "range")\
            .option("shortTableName", dataset_table)\
            .save(table_path)
        print("write dataset partition:", file_name)

    print("Debug -- Show tables after importing data")
    spark.sql("show tables").show()
    LakeSoulTable.forName(spark, dataset_table).toDF().show(20)

    spark.stop()
@@ -0,0 +1,60 @@
# SPDX-FileCopyrightText: 2023 LakeSoul Contributors
#
# SPDX-License-Identifier: Apache-2.0

import sys

import numpy as np
from sentence_transformers import SentenceTransformer, util


def load_model():
    # Multilingual text encoder aligned with the clip-ViT-B-32 image encoder.
    return SentenceTransformer('clip-ViT-B-32-multilingual-v1')

def get_image_embs(emb_file):
    """Load the TSV written by embedding.py: row id, image URL, embedding."""
    embs = []
    imgs = {}
    with open(emb_file, 'r', encoding='utf8') as f:
        for line in f:
            line = line.strip('\r\n')
            if not line:
                continue
            row_id, img_url, img_emb = line.split('\t')[:3]
            img_emb = list(map(float, img_emb.split(' ')))
            img_id = len(embs)
            imgs[img_id] = {'url': img_url}
            embs.append(img_emb)
    return imgs, np.array(embs, dtype=np.float32)

def get_query_emb(query, model):
    query_emb = model.encode([query], convert_to_numpy=True, show_progress_bar=False)
    return query_emb

def search(query_emb, img_embs, imgs, k=3):
    # Cosine-similarity search of the query against all image embeddings.
    hits = util.semantic_search(query_emb, img_embs, top_k=k)[0]
    results = []
    for hit in hits:
        img_id = hit['corpus_id']
        score = hit['score']
        if img_id not in imgs:
            continue
        img_url = imgs[img_id]['url']
        results.append({'score': score, 'image': img_url})
    return results


if __name__ == '__main__':
    emb_file = sys.argv[1]
    top_k = int(sys.argv[2])

    model = load_model()
    print("Model loaded successfully")
    imgs, img_embs = get_image_embs(emb_file)
    print("Vector database loaded successfully", img_embs.shape)

    while True:
        query = input("Please enter a keyword to search for images: ").strip()
        if not query:
            # An empty query exits the loop.
            break
        query_emb = get_query_emb(query, model)
        results = search(query_emb, img_embs, imgs, k=top_k)
        print(results)
        print("*" * 80)
@@ -0,0 +1,34 @@
# Text classification on IMDB dataset
## Introduction
This example demonstrates fine-tuning a BERT-family model with the Hugging Face Trainer API on the IMDB dataset served from a LakeSoul data source. Assuming our current working directory is `LakeSoul/python/examples/`.

## Prepare data
We can download the data from the [Hugging Face IMDB dataset](https://huggingface.co/datasets/imdb/tree/refs%2Fconvert%2Fparquet/plain_text/train) into the `imdb/dataset/` directory.
## Import data into LakeSoul
```shell
export lakesoul_jar=lakesoul-spark-2.3.0-spark-3.3-SNAPSHOT.jar
sudo docker run --rm -ti --net lakesoul-docker-compose-env_default \
-v $PWD/"${lakesoul_jar}":/opt/spark/work-dir/jars/"${lakesoul_jar}" \
-v $PWD/../../python/lakesoul/:/opt/bitnami/spark/lakesoul \
-v $PWD/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \
-v $PWD/imdb:/opt/spark/work-dir/imdb \
--env lakesoul_home=/opt/spark/work-dir/lakesoul.properties \
bitnami/spark:3.3.1 spark-submit --jars /opt/spark/work-dir/jars/"${lakesoul_jar}" --driver-memory 16G --executor-memory 16G --master "local[4]" --conf spark.pyspark.python=./venv/bin/python3 /opt/spark/work-dir/imdb/import_data.py
```
## Train the model using the Hugging Face Trainer API
```shell
conda activate lakesoul_test
python imdb/train.py
```
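For orientation, the core of such a training script could look roughly like this (a hypothetical sketch, not the shipped `imdb/train.py`; the table name `imdb_dataset` and the `text`/`label` column names are assumptions):

```python
import datasets
import lakesoul.huggingface  # needed for datasets.IterableDataset.from_lakesoul
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

# Stream the training split straight out of LakeSoul.
train_dataset = datasets.IterableDataset.from_lakesoul("imdb_dataset").map(
    tokenize, batched=True)

args = TrainingArguments(
    output_dir="imdb_model",
    per_device_train_batch_size=8,
    max_steps=1000,  # streaming datasets have no length, so bound training by steps
)
Trainer(model=model, args=args, train_dataset=train_dataset).train()
```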
## Run inference with the trained model
```shell
python imdb/inference.py
```
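As with training, the inference script itself is not shown here; a minimal hypothetical sketch, assuming the fine-tuned model was saved under `imdb_model/`:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint and classify a review.
classifier = pipeline("sentiment-analysis", model="imdb_model")
print(classifier("This movie was a complete waste of time."))
```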
## References
1. https://huggingface.co/docs/transformers/tasks/sequence_classification
2. https://huggingface.co/datasets/imdb