data_prep_scripts

Name	Last commit message	Last commit date
README.MD	updated structure	Aug 24, 2024
book_indicvoices_length.py	updated structure	Aug 24, 2024
create_indicvoices.py.old	updated structure	Aug 24, 2024
rough.ipynb	updated structure	Aug 24, 2024
test_set_analysis.ipynb	updated structure	Aug 24, 2024
utils_dataset_clean.py	Update utils_dataset_clean.py	Aug 25, 2024
utils_dataset_create_internal_manifest.sh	updated structure	Aug 24, 2024
utils_dataset_create_iv.py	updated structure	Aug 24, 2024
utils_dataset_create_iv_flac.py	Update utils_dataset_create_iv_flac.py	Dec 14, 2024
utils_dataset_create_manifest.sh	updated structure	Aug 24, 2024
utils_dataset_create_manifest_test.sh	updated structure	Aug 24, 2024
utils_dataset_create_verbatim_manifest.py	updated structure	Aug 24, 2024
utils_dataset_custom_transforms.py	updated structure	Aug 24, 2024
utils_dataset_process.sh	updated structure	Aug 24, 2024
utils_tokenizer_create.sh	updated structure	Aug 24, 2024
utils_tokenizer_download.sh	updated structure	Aug 24, 2024
utils_tokenizer_refine_doc.py	updated structure	Aug 24, 2024

README.MD

Extract the tar files so that the TGZ folder and the language-specific folders sit at the same directory level.
Run the following command to downsample all audio files in place to 16 kHz mono:

find . -type f \( -name "*.wav" \) -print0 | xargs -0 -I {} -P 128 bash -c 'ffmpeg -y -loglevel warning -hide_banner -stats -i "$1" -ar "$2" -ac "$3" "${1%.*}_${2}.wav" && rm "$1" && mv "${1%.*}_${2}.wav" "$1"' -- {} 16000 1

Alternatively, run the following command to downsample only the files inside a v3 directory:

find . -type f \( -wholename "*/v3/*.wav" \) -print0 | xargs -0 -I {} -P 128 bash -c 'ffmpeg -y -loglevel warning -hide_banner -stats -i "$1" -ar "$2" -ac "$3" "${1%.*}_${2}.wav" && rm "$1" && mv "${1%.*}_${2}.wav" "$1"' -- {} 16000 1

Run create_indicvoices.py to build a chunked version of IndicVoices. Make sure to change the input and output paths in the script first.
Run create_manifest.sh to create manifest files from the processed dataset. Make sure to change the source and destination paths in the script first.
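
For readers who prefer to drive the downsampling step from Python, the sketch below mirrors the find/xargs commands above. It is not part of this repository: the function names (build_ffmpeg_cmd, temp_path, find_wavs, resample_tree) and the default worker count are assumptions made for illustration. It assumes ffmpeg is on PATH and, like the shell version, overwrites each file in place.

```python
from __future__ import annotations

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def temp_path(src: Path, sample_rate: int) -> Path:
    """Temporary output name, mirroring "${1%.*}_${2}.wav" in the shell command."""
    return src.with_name(f"{src.stem}_{sample_rate}.wav")


def build_ffmpeg_cmd(src: Path, dst: Path, sample_rate: int = 16000,
                     channels: int = 1) -> list[str]:
    """Argument list for one ffmpeg resampling call (same flags as the README)."""
    return [
        "ffmpeg", "-y", "-loglevel", "warning", "-hide_banner", "-stats",
        "-i", str(src),
        "-ar", str(sample_rate),   # target sample rate in Hz
        "-ac", str(channels),      # number of output channels (1 = mono)
        str(dst),
    ]


def find_wavs(root: Path, only_v3: bool = False) -> list[Path]:
    """Collect .wav files under root; optionally keep only those below a v3 directory."""
    wavs = sorted(p for p in root.rglob("*.wav") if p.is_file())
    if only_v3:
        # Keep files that have a directory named "v3" somewhere between root and the file.
        wavs = [p for p in wavs if "v3" in p.relative_to(root).parts[:-1]]
    return wavs


def resample_in_place(src: Path, sample_rate: int = 16000, channels: int = 1) -> None:
    """Resample one file to a temporary name, then replace the original with it."""
    tmp = temp_path(src, sample_rate)
    subprocess.run(build_ffmpeg_cmd(src, tmp, sample_rate, channels), check=True)
    tmp.replace(src)  # the original is only overwritten after ffmpeg succeeded


def resample_tree(root: Path, only_v3: bool = False, workers: int = 8) -> int:
    """Resample every matching file under root in parallel; returns the file count."""
    wavs = find_wavs(root, only_v3)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces evaluation so that any ffmpeg failure is raised here.
        list(pool.map(resample_in_place, wavs))
    return len(wavs)


if __name__ == "__main__":
    print(resample_tree(Path("."), only_v3=False))
```

Threads are enough here because each worker only waits on an external ffmpeg process. The worker count is deliberately lower than the -P 128 used in the shell command and should be tuned to the machine.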
