This document describes the metadata needed to reconstruct the dataset we used for training our models.
The metadata format is similar to the NLLB bitext format, with a few small differences.
The metadata files are tab-separated, gzip-compressed files. Each file corresponds to one alignment direction.
File naming convention:
- for text, we use 3 letters: e.g. `fra`, `eng`, `tur`
- for audio, we use 2 letters followed by `A`: e.g. `frA`, `enA`, `trA`

For example, the direction `eng-trA` corresponds to information for reconstructing English text with Turkish speech alignments. Similarly, `enA-jpn` corresponds to "English speech with Japanese text", and `enA-frA` corresponds to "English speech with French speech".
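As a small sketch of this convention, the modality of each side of a direction string can be derived from its code (3 lowercase letters for text; 2 lowercase letters plus `A` for audio). `side_modality` and `direction_modalities` are hypothetical helpers for illustration, not part of any released tooling:

```python
def side_modality(side: str) -> str:
    """Classify a direction side as 'audio' or 'text'.

    Audio sides are 2 lowercase letters followed by 'A' (e.g. 'enA');
    text sides are 3 lowercase letters (e.g. 'eng').
    """
    if len(side) == 3 and side.endswith("A") and side[:2].islower():
        return "audio"
    if len(side) == 3 and side.islower():
        return "text"
    raise ValueError(f"unrecognized side code: {side!r}")


def direction_modalities(direction: str) -> tuple:
    """Split a direction like 'enA-jpn' into its two side modalities."""
    src, tgt = direction.split("-")
    return side_modality(src), side_modality(tgt)
```

For instance, `direction_modalities("enA-jpn")` yields `("audio", "text")`.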
Each line has 11 columns.
For audio, the columns correspond to:
- `cc_warc`: the WARC file reference containing the public audio URL
- `cc_sha`: not used
- `audio_speech_segment_url`: space-separated audio reference. See below.
- `cc_lineno`: not used
- `paragraph_digest`: not used
- `sentence_digest`: not used
- `text_lid_score`: not used
- `laser_score`: score of the alignment
- `direction`: direction, e.g. `enA-jpn`
- `side`: side, e.g. `enA` or `jpn`
- `line_no`: alignment number
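A minimal sketch of parsing one such line into named fields, assuming the column order documented above (`parse_metadata_line` is a hypothetical helper, not part of the released tooling):

```python
# Column order as documented above; names for the unused fields are kept.
AUDIO_COLUMNS = [
    "cc_warc", "cc_sha", "audio_speech_segment_url", "cc_lineno",
    "paragraph_digest", "sentence_digest", "text_lid_score",
    "laser_score", "direction", "side", "line_no",
]


def parse_metadata_line(line: str) -> dict:
    """Split one tab-separated metadata line into a column-name -> value dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(AUDIO_COLUMNS):
        raise ValueError(f"expected {len(AUDIO_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(AUDIO_COLUMNS, fields))
```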
`audio_speech_segment_url` is a space-separated audio reference. It has the following format:

```
<url> <start_frame> <end_frame>
```

where `start_frame` and `end_frame` correspond to the segment that needs to be extracted from the audio file referenced at `<url>`, resampled at 16000 Hz.
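A small sketch of working with such a reference: parsing it and converting the frame offsets to seconds at the 16 kHz sample rate. The `ffmpeg` invocation is only one illustrative way to resample and cut the segment, not the official extraction method:

```python
SAMPLE_RATE = 16_000  # frames are offsets into the audio resampled at 16000 Hz


def parse_audio_ref(ref: str):
    """Parse '<url> <start_frame> <end_frame>' into (url, start_sec, end_sec)."""
    url, start_frame, end_frame = ref.split(" ")
    return url, int(start_frame) / SAMPLE_RATE, int(end_frame) / SAMPLE_RATE


def ffmpeg_cut_command(ref: str, out_path: str):
    """Build an illustrative ffmpeg command that resamples and cuts the segment."""
    url, start, end = parse_audio_ref(ref)
    return [
        "ffmpeg", "-i", url,
        "-ar", str(SAMPLE_RATE),   # resample to 16000 Hz
        "-ss", f"{start:.3f}",     # segment boundaries in seconds
        "-to", f"{end:.3f}",
        out_path,
    ]
```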
For text, the columns are similar to the NLLB format (except being tab-separated here):

- If the metadata comes from Common Crawl:
  - `cc_warc`: the reference to the Common Crawl WET file
  - `cc_sha`: the document sha1 in the WET file
  - `cc_document_url`: the URL of the document referenced in the WET file
  - `cc_lineno`: the line number in the document referenced in the WET file
  - `paragraph_digest`: `xxhash.xxh3_64_intdigest` of the paragraph
  - `sentence_digest`: `xxhash.xxh3_64_intdigest` of the sentence
  - `text_lid_score`: language identification score, when available
  - `laser_score`: score of the alignment
  - `direction`: direction, e.g. `enA-jpn`
  - `side`: side, e.g. `enA` or `jpn`
  - `line_no`: alignment number
- If the metadata comes from another corpus:
  - `corpus`: corpus name
  - `cc_sha`: not used
  - `cc_document_url`: not used
  - `lineno`: line number in the document
  - `paragraph_digest`: `xxhash.xxh3_64_intdigest` of the paragraph
  - `sentence_digest`: `xxhash.xxh3_64_intdigest` of the sentence
  - `text_lid_score`: language identification score, when available
  - `laser_score`: score of the alignment
  - `direction`: direction, e.g. `enA-jpn`
  - `side`: side, e.g. `enA` or `jpn`
  - `line_no`: alignment number
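Both layouts share the same positional structure, so one hedged sketch can parse either. The dispatch on whether the first column starts with `crawl-data` is an assumption on my part (it matches the `egrep ^crawl-data` filter used in the examples below); the digest columns can in principle be checked against reconstructed text with the third-party `xxhash` package, noted here only in a comment:

```python
CC_COLUMNS = [
    "cc_warc", "cc_sha", "cc_document_url", "cc_lineno",
    "paragraph_digest", "sentence_digest", "text_lid_score",
    "laser_score", "direction", "side", "line_no",
]
CORPUS_COLUMNS = [
    "corpus", "cc_sha", "cc_document_url", "lineno",
    "paragraph_digest", "sentence_digest", "text_lid_score",
    "laser_score", "direction", "side", "line_no",
]


def parse_text_metadata_line(line: str) -> dict:
    """Map a tab-separated text-metadata line to named fields.

    Assumption: Common Crawl rows reference WET files whose paths start with
    'crawl-data'; everything else is treated as coming from another corpus.
    paragraph_digest / sentence_digest hold xxhash.xxh3_64_intdigest values
    and can be used to verify reconstructed text (xxhash is a third-party
    package, not used here).
    """
    fields = line.rstrip("\n").split("\t")
    names = CC_COLUMNS if fields[0].startswith("crawl-data") else CORPUS_COLUMNS
    return dict(zip(names, fields))
```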
The available alignment directions are:

arb-enA ben-enA cat-enA dan-enA enA-est enA-fin enA-jpn enA-mlt enA-nld enA-pol enA-por enA-ron enA-slk enA-swe enA-swh enA-tur enA-ukr enA-urd enA-vie arA-enA arA-eng beA-enA caA-enA caA-eng csA-enA csA-eng cyA-enA cyA-eng daA-enA daA-eng deA-enA deA-eng enA-esA enA-fiA enA-frA enA-hiA enA-idA enA-itA enA-knA enA-koA enA-mtA enA-nlA enA-plA enA-ptA enA-rnA enA-ruA enA-skA enA-svA enA-swA enA-taA enA-teA enA-tgA enA-thA enA-trA enA-ukA enA-urA enA-uzA enA-viA enA-zhA eng-esA eng-fiA eng-frA eng-hiA eng-idA eng-itA eng-knA eng-koA eng-mtA eng-nlA eng-plA eng-ptA eng-rnA eng-ruA eng-skA eng-swA eng-taA eng-teA eng-tgA eng-thA eng-trA eng-ukA eng-urA eng-uzA eng-viA eng-zhA
You can use the `wet_lines` script to download and gather aligned text information from the metadata. This script can be found here.

```
zcat seamless.dataset.metadata.public.enA-swA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | wet_lines
```
Based on the metadata it receives on stdin, `wet_lines` will download the corpora, find the paragraphs, and print the input with an additional column containing the text of each paragraph.

To retrieve the sentences from these paragraphs, one can use the sentence splitter available here. It will print the input (metadata + paragraph) with an additional column containing the text of the sentence.
```
xzcat metadatafile.xz | egrep ^crawl-data | wet_lines | python -c "from sentence_cleaner_splitter.cleaner_splitter import *; split_clean()"
```
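As a last sketch, the shell filtering above can also be done in Python, for example to keep only high-confidence alignments. The 1.06 threshold is purely illustrative, not an official recommendation, and `filter_high_confidence` is a hypothetical helper:

```python
def filter_high_confidence(lines, min_laser=1.06):
    """Keep metadata rows whose laser_score (8th of the 11 columns) >= min_laser.

    `lines` is any iterable of tab-separated metadata lines; malformed rows
    (wrong column count) are skipped.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 11 and float(fields[7]) >= min_laser:
            kept.append(fields)
    return kept
```

For a gzipped metadata file, this can be fed from `gzip.open(path, "rt")`.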