-
Notifications
You must be signed in to change notification settings - Fork 62
[WIP] additionnal lib/cleanup for French language to improve quality of inputs #635
Conversation
...some of my first commits, and my first steps in JS. reviews are welcome. (...please neither shoot or shout at me ;-D ) Co-Authored-By: Nicolas Panel <[email protected]>
roman + century numerals, roman numerals, full convertion of all ACRONYMES
eleventh, ... first, second,... 1/2 => 1 sur 2
server/lib/cleanup/languages/fr.js
Outdated
.replace(/(^|\s)Ie(r)? s.(\s|\.|,|\?|!|$)/g, ' premier siècle ') | ||
.replace(/(^|\s)II(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' deuxième siècle ') | ||
.replace(/(^|\s)III(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' troisième siècle ') | ||
.replace(/(^|\s)IV(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' quatrième siècle ') | ||
.replace(/(^|\s)V(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' cinquième siècle ') | ||
.replace(/(^|\s)VI(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' sixième siècle ') | ||
.replace(/(^|\s)VII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' septième siècle ') | ||
.replace(/(^|\s)VIII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' huitième siècle ') | ||
.replace(/(^|\s)(VIIII|IX)(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' neuvième siècle ') | ||
.replace(/(^|\s)X(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' dixième siècle ') | ||
.replace(/(^|\s)XI(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' onzième siècle ') | ||
.replace(/(^|\s)XII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' douxième siècle ') | ||
.replace(/(^|\s)XIII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' treizième siècle ') | ||
.replace(/(^|\s)(XIIII|XIV)(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' quatorzième siècle ') | ||
.replace(/(^|\s)XV(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' quinzième siècle ') | ||
.replace(/(^|\s)XVI(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' seixième siècle ') | ||
.replace(/(^|\s)XVII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' dix-septième siècle ') | ||
.replace(/(^|\s)XVIII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' dix-huitième siècle ') | ||
.replace(/(^|\s)(XIX|XVIIII)(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' dix_neuvième siècle ') | ||
.replace(/(^|\s)XX(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' vingtième siècle ') | ||
.replace(/(^|\s)XXI(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' vingt-et-unième siècle ') | ||
.replace(/(^|\s)XXII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' vingt-deuxième siècle ') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use a dedicated function like
https://stackoverflow.com/a/9083076 in combination with passing a callback to replace()
(https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace)
- You could drop the test for the
s.
suffix. - It would be reusable for the non-ordinal case you handled below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well,
First,
I have my regex to work with :
-- kudos to gregory-r in this post in stackoverflow, licence CC BY-SA
function deromanize(roman) {
var r = 0;
// regular expressions to check if valid Roman Number.
if (!/^M*(?:D?C{0,3}|C[MD])(?:L?X{0,3}|X[CL])(?:V?I{0,3}|I[XV])$/.test(roman))
throw new Error('Invalid Roman Numeral.');
roman.replace(/[MDLV]|C[MD]?|X[CL]?|I[XV]?/g, function(i) {
r += {M:1000, CM:900, D:500, CD:400, C:100, XC:90, L:50, XL:40, X:10, IX:9, V:5, IV:4, I:1}[i];
});
return r;
}
So, I'll put it here to not forget it, because :
Second,
I don't have a clue of how to implement a function in JS ! (I know Pascal and VB ✌️ , but never learned JS 😭... I have to jump in the rabbit hole 🕳️ 🐇 (it is this way) )
Edit: My feeling about that ? "Why JS? because!" xkcd explined it well.
...To however follow, either the white rabbit, or crumbs, including future me, well, this link
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#specifying_a_function_as_the_replacement
could be a good start to look at.
/me : jumps in the 🕳️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before you jump into that rabbit hole, you might want to check out my other comment regarding the roman numerals, as I don't think we need this at all. I'll also have a look at your other issue and add the documentation to the README for better understanding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I don't understand that, collecting sentences, we can NOT "correct and clean" before "validation".
no problem, I'll wait for the issue #636 😸
server/lib/cleanup/languages/fr.js
Outdated
//first, second, etc. | ||
.replace(/(^|\s)1er?s?(\s|\.|,|\?|!|$)/g, ' premier ') | ||
.replace(/(^|\s)1(e|è)res?(\s|\.|,|\?|!|$)/g, ' premier ') | ||
.replace(/(^|\s)2(e|è)?me?s?(\s|\.|,|\?|!|$)/g, ' deuxième ') | ||
.replace(/(^|\s)2n?ds?(\s|\.|,|\?|!|$)/g, ' second ') | ||
.replace(/(^|\s)2n?des?(\s|\.|,|\?|!|$)/g, ' seconde ') | ||
.replace(/(^|\s)3i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' troisième ') | ||
.replace(/(^|\s)4i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' quatrième ') | ||
.replace(/(^|\s)5i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' cinquième ') | ||
.replace(/(^|\s)6i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' sixième ') | ||
.replace(/(^|\s)7i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' septième ') | ||
.replace(/(^|\s)8i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' huitième ') | ||
.replace(/(^|\s)9i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' neuvième ') | ||
.replace(/(^|\s)10i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' dixième ') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually don't know how to import external library, I think it's a bit out of my league right now.
server/lib/cleanup/languages/fr.js
Outdated
.replace(/(^|\s)ANPE(\s|\.|,|\?|!|$)/g, ' Agence Nationale Pour l\'Emploi ') | ||
.replace(/(^|\s)APL(\s|\.|,|\?|!|$)/g, ' Aide personnalisée au logement ') | ||
.replace(/(^|\s)CDI(\s|\.|,|\?|!|$)/g, ' Contrat à Durée Indéterminée ') | ||
.replace(/(^|\s)CICE(\s|\.|,|\?|!|$)/g, ' Crédit d\'impôt pour la compétitivité et l\'emploi ') | ||
.replace(/(^|\s)DRH(\s|\.|,|\?|!|$)/g, ' Direction des Ressources Humaines ') | ||
.replace(/(^|\s)EDF(\s|\.|,|\?|!|$)/g, ' Electricité de France ') | ||
.replace(/(^|\s)FN(\s|\.|,|\?|!|$)/g, ' Front National ') | ||
.replace(/(^|\s)HLM(\s|\.|,|\?|!|$)/g, ' Habitation à Loyer Modéré ') | ||
.replace(/(^|\s)IGN(\s|\.|,|\?|!|$)/g, ' Institut Géographique National ') | ||
.replace(/(^|\s)INPI(\s|\.|,|\?|!|$)/g, ' Institut National de la Propriété Intellectuelle ') | ||
.replace(/(^|\s)ISF(\s|\.|,|\?|!|$)/g, ' Impôt sur la fortune ') | ||
.replace(/(^|\s)IUT(\s|\.|,|\?|!|$)/g, ' Institut Universitaire de Technologie ') | ||
.replace(/(^|\s)LREM(\s|\.|,|\?|!|$)/g, ' La Réplublique En Marche ') | ||
.replace(/(^|\s)NUPES(\s|\.|,|\?|!|$)/g, ' Nupès ') | ||
.replace(/(^|\s)PHP(\s|\.|,|\?|!|$)/g, ' Protocole Hypertexte Protocolaire ') | ||
.replace(/(^|\s)PMA(\s|\.|,|\?|!|$)/g, ' Procréation médicalement assistée ') | ||
.replace(/(^|\s)PME(\s|\.|,|\?|!|$)/g, ' Petite et Moyenne Entreprise ') | ||
.replace(/(^|\s)RN(\s|\.|,|\?|!|$)/g, ' Rassemblement National ') | ||
.replace(/(^|\s)RSA(\s|\.|,|\?|!|$)/g, ' Revenu de Solidarité Active ') | ||
.replace(/(^|\s)RSA(\s|\.|,|\?|!|$)/g, ' Revenu de Solidarité Active ') | ||
.replace(/(^|\s)RSI(\s|\.|,|\?|!|$)/g, ' Régime Social des Indépendants ') | ||
.replace(/(^|\s)RTE(\s|\.|,|\?|!|$)/g, ' Réseau de Transport d\'Électricité ') | ||
.replace(/(^|\s)SNCF(\s|\.|,|\?|!|$)/g, ' Société Nationale des Chemins de Fer ') | ||
.replace(/(^|\s)TGV(\s|\.|,|\?|!|$)/g, ' Train à Grande Vitesse ') | ||
.replace(/(^|\s)TVA(\s|\.|,|\?|!|$)/g, ' Taxe sur la Valeur Ajoutée ') | ||
.replace(/(^|\s)UDI(\s|\.|,|\?|!|$)/g, ' Union des Démocrates Indépendants ') | ||
.replace(/(^|\s)UMP(\s|\.|,|\?|!|$)/g, ' Union pour un Mouvement Populaire ') | ||
.replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the last character was a ?
or a !
I don't think it should be removed. It serves for the intonation purpose and isn't part of the acronym.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Est ce que le commit aef4284 répond au problème ?
EDIT : sorry, replied in FR. Translation : does the commit answer the issue ?
.replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ') | ||
|
||
//replace fraction 1/2 => '1 sur 2' | ||
¨.replace(/(^| )(\d+)(\s)?(\/)(\s)?(\d+)(\s|\.|,|\?|!|$)/g, '$2 sur $6') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this one. This could very well be a date. Il est né le 25/12
qui ne devrait pas être remplacé de la sorte et prend priorité sur la regexp suivante (date format mm/yy)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh shoot ! you're right !
To solve it, we shall test numbers, and see "if it's in date [1..31]/[1..12] (dd/mm) or [1..12]/[00..99] or [1..12]/[1700..2030] (mm/yy(yy)?) range, it shall be a date stamp, otherwise it's a fraction ?!"
Wow! It escalated quickly!
Sometimes, only context will show us that "mettre 1/4 de litre de lait" will go to "mettre un avril de litre de lait"... But it shall be more readable than "mettre de litre de lait".
On the other hand, if we manage poolry date (not checking mm/yy but only mm/yyyy), it will go like : "il est né en 12/88." to "il est né en douze sur quatre-vingt huit." instead of "il est née en décembre quatre-vingt huit."... Again, it's more readable than ""il est né en."
IMHO, [1..31]/[1..12] (dd/mm) or [1..12]/[1700..2030] (mm/yyyy) range shall work fine. The reste will go through 'fraction replacement', and will always be better that 'full removal'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's first wait for issue #636 resolution, 'coz I don't want to work for nothing :-/
server/lib/cleanup/languages/fr.js
Outdated
.replace((^|\s)\d{1,2}\/\d{1,2}\/(\d{2}[^\d]|\d{4})(\s|$), ' ') //date format dd/mm/yy ou dd/mm/yyyy | ||
.replace((^|\s)\d{1,2}\/(\d{2}[^\d]|\d{4})(\s|$), ' ') //date format mm/yy ou mm/yyyy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the impact of creating blank in the middle of a sentence after the replacement of a date, I'm a proponent of either:
- omitting these sentence altogether
- making deep replacement (spelling date, month, year) which may or may not be practicable.
error: 'Sentence should not contain numbers - les phrases ne doivent pas contenir de nombres', | ||
}, { | ||
regex: /[<>+*#@%^[\]()/]/, | ||
error: 'Sentence should not contain symbols - les phrases de doivent pas contenir de symboles \(\*, \#, \(, etc\)', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested substituting "–" (long hyphen) with the common ("-") counterpart and testing against it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit 18fdf37
// as users wouldn't know how to pronounce the uppercase letters. | ||
regex: /[A-Z]{2,}|[A-Z]+\.*[A-Z]+/, | ||
error: 'Sentence should not contain abbreviations - Les phrases ne doivent pas contenir des abréviations ou sigles', | ||
}]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't we missing something about the french apostrophe (´)? I don't know how the model would benefit (or not) from this situation. I'm pretty sure text processing would do the uniformisation (before TTS for example) so I believe we should make it too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit 1a0cff5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, sources files like https://github.com/common-voice/commonvoice-fr/blob/master/CommonVoice-Data/data/debats-assemblee-nationale/20130718093000000.txt are using french apostrophe ?!
(it only shows in "ui-monospace" code font, not the usual "Segoe UI" font.
line 2 : J’appelle maintenant, dans le texte de la commission, les articles du projet de loi.
if I change it myself :
line 2 : J'appelle maintenant, dans le texte de la commission, les articles du projet de loi.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so... commit 8d7d973
server/lib/cleanup/languages/fr.js
Outdated
.replace(/(^|\s)TVA(\s|\.|,|\?|!|$)/g, ' Taxe sur la Valeur Ajoutée ') | ||
.replace(/(^|\s)UDI(\s|\.|,|\?|!|$)/g, ' Union des Démocrates Indépendants ') | ||
.replace(/(^|\s)UMP(\s|\.|,|\?|!|$)/g, ' Union pour un Mouvement Populaire ') | ||
.replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: Acronyms are a strange case : They are actually spelled as distincts letters (except for NUPES).
But we don't want "STT" to process R.S.A.
as erre et ça
but we currently have nothing to train it that way.
Should we expect a specific future dictionary or just not strip some of them in the first place (but remove the dots)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⚠️ Firsts commits in GitHub, first pull request ever in a repo, so please neither shoot or shout at me 😅⚠️
No worries, you're doing fine :) Thanks for your contributions!
I have a few comments, but generally I think it's a good thing to update these files.
server/lib/cleanup/languages/fr.js
Outdated
.replace(/(^|\s)(XIX|XVIIII)(\s|\.|,|\?|!|$)/g, ' dix-neuf ') | ||
.replace(/(^|\s)XX(\s|\.|,|\?|!|$)/g, ' vingt ') | ||
.replace(/(^|\s)XXI(\s|\.|,|\?|!|$)/g, ' vingt-et-un ') | ||
.replace(/(^|\s)XXII(\s|\.|,|\?|!|$)/g, ' vingt-deux ') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove all of the roman numerals, as they are not allowed from the beginning (see the abbreviation pattern in the validation file).
server/lib/cleanup/languages/fr.js
Outdated
.replace(/(^|\s)7i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' septième ') | ||
.replace(/(^|\s)8i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' huitième ') | ||
.replace(/(^|\s)9i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' neuvième ') | ||
.replace(/(^|\s)10i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' dixième ') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can be removed, as numbers are not allowed anyway in the Sentence Collector.
server/lib/cleanup/languages/fr.js
Outdated
.replace(/(^|\s)TVA(\s|\.|,|\?|!|$)/g, ' Taxe sur la Valeur Ajoutée ') | ||
.replace(/(^|\s)UDI(\s|\.|,|\?|!|$)/g, ' Union des Démocrates Indépendants ') | ||
.replace(/(^|\s)UMP(\s|\.|,|\?|!|$)/g, ' Union pour un Mouvement Populaire ') | ||
.replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can be removed, as abbreviations are not allowed in the validation file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think I understand wrongly how it works. Let's solve this #636 issue first 😸
.replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ') | ||
|
||
//replace fraction 1/2 => '1 sur 2' | ||
¨.replace(/(^| )(\d+)(\s)?(\/)(\s)?(\d+)(\s|\.|,|\?|!|$)/g, '$2 sur $6') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numbers are not allowed by the validation file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we keep it for "future-proof rules" and/or "bulk load rules if you don't have yours" ?
Let's discuss it here : https://discourse.mozilla.org/t/sentence-collector-cleanup-before-export-vs-cleanup-on-upload/105411/15
server/lib/cleanup/languages/fr.js
Outdated
//dates, digits and numbers fr-FR cleanup | ||
//todo : CONVERT TO TEXT instead of removing it | ||
.replace((^|\s)\d{1,2}\/\d{1,2}\/(\d{2}[^\d]|\d{4})(\s|$), ' ') //date format dd/mm/yy ou dd/mm/yyyy | ||
.replace((^|\s)\d{1,2}\/(\d{2}[^\d]|\d{4})(\s|$), ' ') //date format mm/yy ou mm/yyyy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numbers are not allowed by the validation file.
Also note https://discourse.mozilla.org/t/sentence-collector-cleanup-before-export-vs-cleanup-on-upload/105411, though that probably doesn't matter too much for this here :) |
of course :) Co-authored-by: Michael Kohler <[email protected]>
Notepad++ > Edit > transform TABS in SPACE. Yeahhhhhh
as requested, only keeping FR and removing EN error description... But adding EN comment, for ^FR readers sake.
We do not remove PUNCTUATION (?!, etc.) after finding them, we just put them back as we found them.
remplacing – (long hyphen) by - (short hyphen)
Normalize ´ (french apostroph) into ' (usual apostroph)
adding an other apostroph found in original source document as https://github.com/common-voice/commonvoice-fr/blob/master/CommonVoice-Data/data/debats-assemblee-nationale/20130718093000000.txt
Any hope to get this in, in one shape or another? |
The Sentence Collector has now been integrated into the Common Voice platform. Therefore I'm archiving this project here. The validation files now live here and I'm sure it would still benefit from the validation rules being added there: https://github.com/common-voice/common-voice/tree/main/server/src/core/sentences/validation. Unfortunately moving this PR over there is way harder than manually recreating it. Would you mind creating a new PR for this? Sorry for the troubles. |
better french input for better french output ?
Hi !
⚠️ Firsts commits in GitHub, first pull request ever in a repo, so please neither shoot or shout at me 😅 ⚠️
This commit comes from different discussion, that I propose you to have a look to understand the context.
and it is somehow connected to :
TL;DR:
I'm trying to improve input quality for french sentence collector, to avoid
A/ garbage IN - garbage OUT, and
B/ avoid to remove in the end (in the CorporaCreator) the garbage that went through collecting, recording, review... to finally being dropped and not being included in the final release for training batchs. Better to remove it as soon as sentence collector.
What have you done ? 😱
To do so, I duplicated a EN to FR file in sentence collector, and modified it according to a previous job made by Nicolas Panel for CorporaCreator (and sadly not commited).
According to discussion (links here... Did you follow it from the list above ?!), it's recommanded that I create a WIP pull request, to allow everyone to comment, throw tomatos and/or additionnal commits to it.
footnote : it can be hard (...it IS hard !!!) to understand REGEX (REGular EXpressions). Do not hesitate to catch up with https://regex101.com/ to understand and test it.