This document describes the metadata needed to reconstruct the dataset we used to train our models.
The metadata format is similar to NLLB bitext format with some small differences.
The metadata files are tab-separated, gzip-compressed files. Each file corresponds to one alignment direction.
File naming convention:
- for text, we use 3 letters, e.g. `fra`, `eng`, `tur`
- for audio, we use 2 letters and an 'A', e.g. `frA`, `enA`, `trA`
For example, the direction `eng-trA` corresponds to information for reconstructing English text with Turkish speech alignments. Similarly, `enA-jpn` corresponds to "English speech with Japanese text", and `enA-frA` corresponds to "English speech with French speech".
Each line has 11 columns.
For audio, the columns correspond to:
- `cc_warc`: The warc file reference containing the public audio url
- `cc_sha`: not used
- `audio_speech_segment_url`: space-separated audio reference. See below.
- `cc_lineno`: not used
- `paragraph_digest`: expected duration of the whole audio file (without start/end frame trimming)
- `sentence_digest`: not used
- `text_lid_score`: not used
- `laser_score`: score of the alignment
- `direction`: direction, e.g. `enA-jpn`
- `side`: side, e.g. `enA` or `jpn`
- `line_no`: alignment number
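As a sketch of how these columns can be read (assuming Python; the file name below is just an example following the naming convention above):

```python
import gzip

# Audio-side column names, in file order (see the list above).
AUDIO_COLUMNS = [
    "cc_warc", "cc_sha", "audio_speech_segment_url", "cc_lineno",
    "paragraph_digest", "sentence_digest", "text_lid_score",
    "laser_score", "direction", "side", "line_no",
]

# Hypothetical file name following the naming convention above.
with gzip.open("seamless.dataset.metadata.public.enA-jpn.tsv.gz", "rt") as fh:
    for line in fh:
        record = dict(zip(AUDIO_COLUMNS, line.rstrip("\n").split("\t")))
        # Text-side rows ("jpn" here) use the text column layout described
        # further below; only audio side codes end with an uppercase 'A'.
        if not record["side"].endswith("A"):
            continue
        print(record["direction"], record["side"], record["laser_score"])
```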
`audio_speech_segment_url` is a space-separated audio reference with the following format:

`<url> <start_frame> <end_frame>`

where `start_frame` and `end_frame` delimit the segment that needs to be extracted from the audio file referenced at `<url>`, after it has been resampled to 16000 Hz.
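A minimal sketch of extracting such a segment, assuming the file behind `<url>` has already been downloaded locally and that `torchaudio` is used for decoding and resampling (both assumptions, not part of this release):

```python
import torchaudio
import torchaudio.functional as F

def load_segment(local_path: str, start_frame: int, end_frame: int, target_sr: int = 16000):
    # Decode the whole file, resample to 16 kHz, then slice out the frame
    # range given in `audio_speech_segment_url`.
    waveform, sr = torchaudio.load(local_path)
    if sr != target_sr:
        waveform = F.resample(waveform, sr, target_sr)
    return waveform[:, start_frame:end_frame]

# Hypothetical usage with one parsed metadata row:
# url, start, end = record["audio_speech_segment_url"].split()
# segment = load_segment("downloaded_audio.wav", int(start), int(end))
```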
For text, the columns are similar to the NLLB format (except being tab-separated here); a sketch for checking the digests follows these lists:
- If the metadata comes from Common Crawl:
  - `cc_warc`: the reference to the Common Crawl WET file
  - `cc_sha`: the document sha1 in the WET file
  - `cc_document_url`: the url of the document referenced in the WET file
  - `cc_lineno`: the line number in the document referenced in the WET file
  - `paragraph_digest`: xxhash.xxh3_64_intdigest of the paragraph
  - `sentence_digest`: xxhash.xxh3_64_intdigest of the sentence
  - `text_lid_score`: language identification score, when available
  - `laser_score`: score of the alignment
  - `direction`: direction, e.g. `enA-jpn`
  - `side`: side, e.g. `enA` or `jpn`
  - `line_no`: alignment number
- If the metadata comes from another corpus:
  - `corpus`: corpus name
  - `cc_sha`: not used
  - `cc_document_url`: not used
  - `lineno`: line number in the document
  - `paragraph_digest`: xxhash.xxh3_64_intdigest of the paragraph
  - `sentence_digest`: xxhash.xxh3_64_intdigest of the sentence
  - `text_lid_score`: language identification score, when available
  - `laser_score`: score of the alignment
  - `direction`: direction, e.g. `enA-jpn`
  - `side`: side, e.g. `enA` or `jpn`
  - `line_no`: alignment number
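The digests make it possible to check that text recovered from the WET files (or another corpus) matches what was aligned. A minimal sketch, assuming the `xxhash` Python package and that the digest is taken over the UTF-8 text:

```python
import xxhash

def matches_digest(text: str, expected_digest: int) -> bool:
    # `paragraph_digest` / `sentence_digest` are xxhash.xxh3_64_intdigest
    # values computed on the paragraph or sentence text.
    return xxhash.xxh3_64_intdigest(text) == expected_digest

# Hypothetical usage with values from one metadata row:
# matches_digest(recovered_sentence, int(record["sentence_digest"]))
```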
Update: 30 Nov 2023
We are publishing an extension of the previous speech-to-speech release.
afA-enA amA-enA arA-enA asA-enA azA-enA beA-enA bgA-enA bnA-enA bsA-enA caA-enA csA-enA cyA-enA daA-enA deA-enA elA-enA enA-esA enA-etA enA-fiA enA-frA enA-gaA enA-glA enA-guA enA-heA enA-hiA enA-hrA enA-huA enA-hyA enA-idA enA-isA enA-itA enA-jaA enA-jvA enA-kaA enA-kiA enA-kkA enA-knA enA-koA enA-kyA enA-lgA enA-loA enA-ltA enA-lvA enA-mkA enA-mlA enA-mnA enA-mrA enA-msA enA-mtA enA-neA enA-nlA enA-noA enA-orA enA-paA enA-pbA enA-plA enA-psA enA-ptA enA-rnA enA-ruA enA-sdA enA-skA enA-slA enA-srA enA-svA enA-swA enA-taA enA-teA enA-tgA enA-thA enA-trA enA-ukA enA-urA enA-uzA enA-viA enA-yoA enA-zhA
Update: 25 Sep 2023
We are publishing updated metadata with the expected duration of the original audio file in the column `paragraph_digest` (originally not used for audio).
arb-enA ben-enA cat-enA dan-enA enA-est enA-fin enA-jpn enA-mlt enA-nld enA-pol enA-por enA-ron enA-slk enA-swe enA-swh enA-tur enA-ukr enA-urd enA-vie arA-enA arA-eng beA-enA caA-enA caA-eng csA-enA csA-eng cyA-enA cyA-eng daA-enA daA-eng deA-enA deA-eng enA-esA enA-fiA enA-frA enA-hiA enA-idA enA-itA enA-knA enA-koA enA-mtA enA-nlA enA-plA enA-ptA enA-rnA enA-ruA enA-skA enA-svA enA-swA enA-taA enA-teA enA-tgA enA-thA enA-trA enA-ukA enA-urA enA-uzA enA-viA enA-zhA eng-esA eng-fiA eng-frA eng-hiA eng-idA eng-itA eng-knA eng-koA eng-mtA eng-nlA eng-plA eng-ptA eng-rnA eng-ruA eng-skA eng-swA eng-taA eng-teA eng-tgA eng-thA eng-trA eng-ukA eng-urA eng-uzA eng-viA eng-zhA
You can find the legacy metadata (without duration information) here:
arb-enA ben-enA cat-enA dan-enA enA-est enA-fin enA-jpn enA-mlt enA-nld enA-pol enA-por enA-ron enA-slk enA-swe enA-swh enA-tur enA-ukr enA-urd enA-vie arA-enA arA-eng beA-enA caA-enA caA-eng csA-enA csA-eng cyA-enA cyA-eng daA-enA daA-eng deA-enA deA-eng enA-esA enA-fiA enA-frA enA-hiA enA-idA enA-itA enA-knA enA-koA enA-mtA enA-nlA enA-plA enA-ptA enA-rnA enA-ruA enA-skA enA-svA enA-swA enA-taA enA-teA enA-tgA enA-thA enA-trA enA-ukA enA-urA enA-uzA enA-viA enA-zhA eng-esA eng-fiA eng-frA eng-hiA eng-idA eng-itA eng-knA eng-koA eng-mtA eng-nlA eng-plA eng-ptA eng-rnA eng-ruA eng-skA eng-swA eng-taA eng-teA eng-tgA eng-thA eng-trA eng-ukA eng-urA eng-uzA eng-viA eng-zhA
You can use the `wet_lines` script to download and gather aligned text information from the metadata. This script can be found here.
```bash
zcat seamless.dataset.metadata.public.enA-swA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | wet_lines
```
Based on the metadata it receives from stdin, `wet_lines` will download the corpora, find the paragraphs, and print the input with an additional column which corresponds to the text of the paragraph.
To retrieve the sentences from these paragraphs, one can use the sentence splitter available here. It will print the input (metadata + paragraph) with an additional column which corresponds to the text of the sentence.
```bash
xzcat metadatafile.xz | egrep ^crawl-data | wet_lines | python -c "from sentence_cleaner_splitter.cleaner_splitter import *; split_clean()"
```