ViAVSP-LLM (Vietnamese Audio-Visual Speech Processing incorporated with LLM)

This is the PyTorch code for Vietnamese Automatic Speech Recognition Utilizing Auditory and Visual Data. It builds on the VSP-LLM codebase.

Introduction

We propose ViAVSP-LLM, a novel framework that harnesses the powerful context modeling capabilities of large language models (LLMs) to advance Vietnamese audio-visual speech processing. By employing a self-supervised visual speech model, our approach maps input video directly into the latent space of an LLM, enabling a seamless integration of visual and linguistic data. To address redundant information in input frames, we introduce a deduplication technique that reduces the number of embedded audio-visual features. Coupled with Low-Rank Adaptation (LoRA), this method allows ViAVSP-LLM to be trained in a computationally efficient manner, optimizing both performance and resource utilization.

Results

| Model | VASR Test WER (%) | VASR Test CER (%) | Config | Checkpoint |
| --- | --- | --- | --- | --- |
| ViAVSP-LLM (base) | 17.28 | 10.56 | config | huggingface |
| ViAVSP-LLM (final) | 12.03 | 7.2 | config | huggingface |
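
WER and CER in the table above are edit-distance metrics over words and characters respectively. The following is a minimal sketch of how such scores are commonly computed, not the evaluation code of this repository; the helper names `edit_distance`, `wer`, and `cer` are hypothetical, and the reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate in percent for one (reference, hypothesis) pair."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent for one (reference, hypothesis) pair."""
    return 100.0 * edit_distance(list(ref), list(hyp)) / len(ref)


if __name__ == "__main__":
    print(wer("a b c d", "a b d"))  # 25.0: one deleted word out of four
```

Corpus-level scores are usually obtained by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.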

Demo

Try our ViAVSP-LLM demo on HuggingFace.

Installation

Create an environment with Python 3.9.19

conda create -n vasr python=3.9.19 -y
conda activate vasr

Install torch, torchvision and torchaudio

pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118

Install fairseq
```
cd src/libs
pip install -e fairseq
```

Install other requirements

cd ../..
pip install -r requirements.txt

Data

Downloading

The VASR dataset is available here for research use under strict ethical guidelines. To ensure the protection of speaker anonymity, prospective users are required to provide their contact information and formally agree to the stipulated terms and conditions before gaining access to the dataset. This process is not only a formality but a crucial step in upholding privacy standards and promoting the responsible and ethical use of sensitive data.

Dataset layout

└── data
    |
    ├── raw
    |   |
    |   └── vasr
    |       |
    |       ├── audio
    |       |   |
    |       |   ├── 000
    |       |   |   |
    |       |   |   ├── 0000016.wav
    |       |   |   ├── ...
    |       |   |   └── <example_id>.wav
    |       |   |
    |       |   ├── ...
    |       |   └── <shard_id>
    |       |
    |       ├── visual
    |       |   |
    |       |   ├── 000
    |       |   |   |
    |       |   |   ├── 0000016.mp4
    |       |   |   ├── ...
    |       |   |   └── <example_id>.mp4
    |       |   |
    |       |   ├── ...
    |       |   └── <shard_id>
    |       |
    |       └── metadata.parquet
    |
    └── processed
        |
        └── vasr
            |
            ├── train.tsv               # Audio and video paths for training
            ├── train.wrd               # Target labels for training
            ├── train.cluster_counts    # Cluster counts used for deduplication in training
            ├── valid.tsv               # Audio and video paths for validation
            ├── valid.wrd               # Target labels for validation
            ├── valid.cluster_counts    # Cluster counts used for deduplication in validation
            ├── test.tsv                # Audio and video paths for testing
            ├── test.wrd                # Target labels for testing
            └── test.cluster_counts     # Cluster counts used for deduplication in testing
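
In the raw layout above, each audio file appears to have a video file with the same shard and example ID. A minimal sketch of resolving both paths under that assumption; `example_paths` is a hypothetical helper and is not part of this repository.

```python
from pathlib import Path


def example_paths(root, shard_id, example_id):
    """Return (audio_path, video_path) for one example under the raw data root."""
    root = Path(root)
    audio = root / "audio" / shard_id / f"{example_id}.wav"
    video = root / "visual" / shard_id / f"{example_id}.mp4"
    return audio, video


if __name__ == "__main__":
    audio, video = example_paths("data/raw/vasr", "000", "0000016")
    print(audio.as_posix())  # data/raw/vasr/audio/000/0000016.wav
    print(video.as_posix())  # data/raw/vasr/visual/000/0000016.mp4
```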

Preprocessing

1. Create manifest for splits.

Run the following command to create manifest for training, validation and test splits.

python src/process_data.py \
    --process create_manifest \
    --data_dir data/raw/vasr \
    --split <split> \
    --frac <frac-of-split> \
    --output_dir data/processed/vasr

split: Split name to extract (train, valid, test).
frac: Fraction of the split to use. Enter 1 for the entire split.

2. Extract audio-visual features using AV-HuBERT.

Run the following command to extract audio-visual features.

python src/process_data.py \
    --process dump_feature \
    --tsv_dir path/to/manifest/file.tsv \
    --split <split> \
    --nshard <num_shards> \
    --rank <rank> \
    --feat_dir path/to/output/feature/directory \
    --ckpt_path path/to/AV-Hubert/large_vox_iter5.pt \
    --layer 12

split: Split name to process (train, valid, test).
nshard: Number of shards.
rank: Which shard to process (from 0 to nshard - 1).
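
To illustrate how `nshard` and `rank` interact, the sketch below splits a manifest of `total` entries into contiguous shards. It is modeled on the rounding scheme used in fairseq's HuBERT feature utilities; whether `src/process_data.py` follows it exactly is an assumption, and `shard_range` is a hypothetical name.

```python
def shard_range(total, nshard, rank):
    """Return the half-open index range [start, end) handled by one shard."""
    if not 0 <= rank < nshard:
        raise ValueError(f"rank must be in [0, {nshard - 1}], got {rank}")
    start = round(total / nshard * rank)
    end = round(total / nshard * (rank + 1))
    return start, end


if __name__ == "__main__":
    # Ten manifest entries split across three shards.
    for rank in range(3):
        print(rank, shard_range(10, 3, rank))
```

Under this scheme every entry belongs to exactly one shard, so each rank can be run as an independent job.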

3. Train K-Means model.

Run the following command to train K-Means model.

python src/process_data.py \
    --process learn_kmeans \
    --feat_dir path/to/feature/directory \
    --split <split> \
    --nshard <nshard> \
    --km_path path/to/output/km_model.km \
    --n_clusters <n_clusters> \
    --percent <percent>

split: Split name to process (train, valid, test).
nshard: Number of shards. Must match the value used in Step 2.
n_clusters: Number of clusters for K-Means.
percent: Percent of the split used to train K-Means.

4. Get pseudo labels from K-Means model.

Run the following command to obtain pseudo labels from K-Means model.

python src/process_data.py \
    --process dump_label \
    --feat_dir path/to/feature/directory \
    --split <split> \
    --km_path path/to/output/km_model.km \
    --nshard <nshard> \
    --rank <rank> \
    --lab_dir path/to/output/labels/directory

split: Split name to process (train, valid, test).
nshard: Number of shards. Must match the value used in Steps 2 and 3.
rank: Which shard to process (from 0 to nshard - 1).
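
Conceptually, a pseudo label is the index of the K-Means centroid closest to a frame's feature vector. The following is a minimal pure-Python sketch of that assignment, assuming squared Euclidean distance; `assign_labels` is a hypothetical name and the real script presumably operates on the saved K-Means model and feature arrays instead.

```python
def assign_labels(frames, centroids):
    """Return, for each frame, the index of its nearest centroid."""
    labels = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every centroid.
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        labels.append(dists.index(min(dists)))
    return labels


if __name__ == "__main__":
    centroids = [[0.0, 0.0], [10.0, 10.0]]
    frames = [[1.0, 0.0], [9.0, 9.0], [0.0, 2.0], [8.0, 11.0]]
    print(assign_labels(frames, centroids))  # [0, 1, 0, 1]
```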

5. Count similar frames.

Run the following command to count similar consecutive frames.

python src/process_data.py \
    --process count_clusters \
    --split <split> \
    --nshard <nshard> \
    --lab_dir path/to/output/labels/directory \
    --output_dir path/to/output/directory

split: Split name to process (train, valid, test).
nshard: Number of shards. Must match the value used in Steps 2, 3, and 4.
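
The idea behind this step can be sketched as run-length counting over the pseudo labels: consecutive frames that share a label form one run, and the run lengths are what a `.cluster_counts` file would hold. The sketch also shows one way the counts could be used for deduplication, by averaging the features within each run as VSP-LLM does; that this repository pools features the same way is an assumption, and both function names are hypothetical.

```python
from itertools import groupby


def count_consecutive(labels):
    """Lengths of runs of identical consecutive labels."""
    return [sum(1 for _ in run) for _, run in groupby(labels)]


def average_by_counts(features, counts):
    """Collapse each run of frames into the mean of its feature vectors."""
    pooled, start = [], 0
    for n in counts:
        chunk = features[start:start + n]
        pooled.append([sum(col) / n for col in zip(*chunk)])
        start += n
    return pooled


if __name__ == "__main__":
    labels = [5, 5, 5, 2, 2, 7]
    counts = count_consecutive(labels)
    print(counts)  # [3, 2, 1]

    features = [[1.0], [2.0], [3.0], [4.0], [6.0], [9.0]]
    print(average_by_counts(features, counts))  # [[2.0], [5.0], [9.0]]
```

In this example six frames are reduced to three vectors, which shortens the sequence passed to the LLM.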

Pretrained Backbones

Use these pretrained backbones for training.

| Backbone | Checkpoint |
| --- | --- |
| AV-HuBERT Large (LRS3 + VoxCeleb2) | link |
| VinaLLaMA | link |

Training

Open the training script (scripts/train.sh) and replace these variables:

# Experiment's name.
EXP=???

# Path to training dataset directory.
DATA_DIR=???

# Path to where experiments will be located.
EXP_DIR=???

# Path to downloaded pre-trained AV-HuBERT.
PRETRAINED_MODEL_PATH=???

# HuggingFace LLaMA repo ID or path to LLaMA checkpoint.
LLM_PATH=???

Run the training script:

$ bash scripts/train.sh

Decoding

Open the decoding script (scripts/decode.sh) and replace these variables:

# Experiment's name.
EXP=???

# Evaluation set to be used.
EVAL_SET=???

# Path to evaluation dataset directory.
DATA_DIR=???

# Path to where experiments will be located.
EXP_DIR=???

# Path to the trained model.
MODEL_PATH=???

# HuggingFace LLaMA repo ID or path to LLaMA checkpoint.
LLM_PATH=???

Run the decoding script:

$ bash scripts/decode.sh
