Since 2018, the model for Na has included tone-group boundaries, but as of October 2018 it still disregards word boundaries. A look at the story-fold cross-validation materials suggests that longer words have somewhat different acoustic properties, so there could be value for phoneme and tone recognition in adding word boundaries to the training data.
A first step (suggested by @oadams) could be to produce separate error rates for short words versus longer words, using the word segmentation in the reference transcription as a guide.
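A rough sketch of what that diagnostic could look like, assuming the reference and hypothesis have already been aligned into (reference word, hypothesis word) pairs (the alignment step and the length threshold are assumptions, not anything currently in the codebase):

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, sym_a in enumerate(a, 1):
        cur = [i]
        for j, sym_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (sym_a != sym_b)))  # substitution
        prev = cur
    return prev[-1]

def error_rates_by_length(pairs, threshold=2):
    """Compute an error rate separately for short words (length <= threshold
    symbols in the reference) and for longer words.

    `pairs` is a list of (reference_word, hypothesis_word) tuples; the
    threshold of 2 is an arbitrary placeholder to be tuned against the
    Na corpus.
    """
    buckets = {"short": [0, 0], "long": [0, 0]}  # [edits, reference symbols]
    for ref, hyp in pairs:
        key = "short" if len(ref) <= threshold else "long"
        buckets[key][0] += edit_distance(ref, hyp)
        buckets[key][1] += len(ref)
    return {k: (e / n if n else 0.0) for k, (e, n) in buckets.items()}
```

If longer words really do have distinct acoustic properties, the "long" bucket should show a noticeably different error rate from the "short" one.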
(Suggested label for this Issue: Yongning Na)
This relates to #214, in that the word boundary in the training corpus is a space.
"it's important that if users want to explictly predict spaces (in character prediction), then that is accounted for. Probably best with a flag to segment_into_chars() or something similar, which would generate special tokens that represent spaces, such as underscores, for training and decoding. These then would get removed as a postprocessing step."