Option to respect original casing and encoding #58
Both ideas are discussed in #58.

- `skip_cleaning` completely skips multiple stages of cleaning names and name parts.
- `prefixes` allows the user to modify the list of prefixes, exposed from the default `config.PREFIXES`.

This currently only affects the `Name` class and is not part of the CLI or a full release while I consider:

1. How to implement a general way to override config lists such as suffixes, conjunctions, titles, etc. It's currently possible only to monkeypatch these, but since we can now adjust prefixes we should extend this to other constants.
2. Whether and how to incorporate any of these changes into the CLI. I'm leaning toward no, but perhaps `--skip-cleaning` is fine.
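To make the first open question concrete, here is a minimal sketch of one way config lists could be overridden without monkeypatching. Everything in it is hypothetical: `ParserConfig`, `with_overrides`, and the short default sets are illustrative stand-ins and are not part of nominally or its real `config` module.

```python
from dataclasses import dataclass, replace

# Illustrative subsets only; these are NOT nominally's real default lists.
DEFAULT_PREFIXES = frozenset({"de", "van", "von"})
DEFAULT_SUFFIXES = frozenset({"jr", "sr"})


@dataclass(frozen=True)
class ParserConfig:
    """Hypothetical container for the word lists a parser consults."""

    prefixes: frozenset = DEFAULT_PREFIXES
    suffixes: frozenset = DEFAULT_SUFFIXES

    def with_overrides(self, **lists):
        """Return a new config with the named lists replaced.

        Unknown list names raise TypeError (from dataclasses.replace),
        so a misspelled keyword fails loudly instead of being ignored.
        """
        frozen = {name: frozenset(values) for name, values in lists.items()}
        return replace(self, **frozen)


# Per-call override: module-level defaults stay untouched.
custom = ParserConfig().with_overrides(prefixes={"de", "le"})
```

Because the config object is immutable and copied on override, two callers with different prefix lists cannot interfere with each other, which is the main drawback of monkeypatching module constants.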
Thanks for your comments -- I realize it's been quite a while, but this is helpful feedback. I've digested some of your notes and suggestions into the following:
I'm going to think more about this as I work toward a new release version.

(1) and (2): Adding new option to skip cleaning

This addresses (1) and (2). Separating out the details of cleaning, where accents and casing are somewhat entangled, was more than I wanted to take on. However, the new option lets you preprocess your own data, for example, with your own regex or with a package such as [clean-text](https://github.com/jfilter/clean-text/tree/main). I have added initial implementations in f56c303; there is more to consider about the CLI and how to expose the rest of

(3): Adding new option to customize prefixes

This isn’t exactly what you’re looking for, but I’ve added an argument to allow prefixes to be specified as part of Name() instantiation. This is a step toward making config more flexible, which might contribute to easier changes in default behavior.

(4): Not changing two-word raw names due to existing behavior

Because nominally is targeted to parse full names specifically, I decided to treat even prefixes as first names when the name is presented that way: e.g., both “Van Jones” and “Van Halen” would produce the first name “Van.” An existing test case addresses this. This is a case where there’s simply no correct answer, because one can certainly envision datasets where some full-name fields only have last names. However, clustering last names from these would cause a number of ambiguous first-name/prefix terms to break in a common use case. Della and Van are particularly common first names in an American sample, as I found when I started thinking about #26.

(5): Unsure how to improve handling of Hispanic names

I read this as related to (3) above, but the README describes nominally's input as “a personal name written in Western name order.” I should perhaps add “Anglophonic.” A three-word name in an American employee database is likely to be first, middle, last; but in a database with many Hispanic names, we might expect instead first, last, and last. Unlike prefixes, we would not be able to make a concise list of Hispanic last names that should not be considered middle names.
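For the preprocessing route mentioned under (1) and (2), a minimal sketch of a do-it-yourself cleaning step might look like the following. It uses only the standard library and is an assumption about what a user might want (keep casing and accents, drop stray symbols, tidy whitespace); the function name `preprocess` is hypothetical and this is not nominally's own cleaning logic.

```python
import re
import unicodedata


def preprocess(raw: str) -> str:
    """Tidy a raw name while keeping its original casing and accents."""
    # Compose characters (e.g. "o" + combining acute -> "ó") so that
    # accented letters count as single word characters below.
    text = unicodedata.normalize("NFC", raw)
    # Replace anything that is not a letter/digit/underscore, whitespace,
    # apostrophe, period, comma, or hyphen with a space.
    text = re.sub(r"[^\w\s'.,-]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


cleaned = preprocess("  Hervé   Le Corre ")
# The result could then be passed to Name(cleaned, skip_cleaning=True),
# using the option described in this thread (not executed here).
```

Whether punctuation such as periods and commas should survive depends on how the parser uses them when cleaning is skipped, so the character class above is a starting point rather than a recommendation.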
I am amazed by the power and usefulness of this library, but is there a way to respect the original casing? Example:
database: Hervé Le Corre->nominally->Name: herve familyname: le corre
while I'd love to have:
database: Hervé Le Corre->nominally->Name: Herve familyname: Le Corre
In my language, for example, there are many family names similar to "De Petris" or "de Iuliis", where the "de" particle originally meant "from the family of". Its casing also carries information, since generally speaking the lowercase "de" was found in aristocratic families only.
Please also note that in the above example the é in Hervé is an e with an acute accent, and I would need to preserve such characters.
Other examples in my sample which could be troublesome:
database: De Petris->nominally->Name: de Lastname: petris
This wrongly assumes that the lastname prefix De is a first name.
Carlos Ruiz Zafón->nominally->Name: carlos ruiz Lastname: zafon
Here Ruiz gets wrongly processed as a middle name, and the accented ó gets demoted to a plain o.
Marcos Ordóñez->nominally->Name: marcos Lastname: ordonez
Here the n loses its tilde.
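Until casing and accents can be preserved by the parser itself, one possible workaround is to map the parsed, lowercased, accent-stripped parts back onto the spelling in the original string. The sketch below assumes the parsed output looks like the examples above (e.g. `zafon` for `Zafón`); the helper names `fold` and `restore_spelling` are hypothetical, and the folding shown is a common standard-library technique, not necessarily what nominally does internally.

```python
import unicodedata


def fold(token: str) -> str:
    """Lowercase a token and strip its diacritics (lossy)."""
    decomposed = unicodedata.normalize("NFKD", token)
    # Drop combining marks such as the acute accent or the tilde.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def restore_spelling(raw: str, parsed_part: str) -> str:
    """Re-spell a parsed name part using the tokens of the raw string."""
    lookup = {}
    for token in raw.split():
        # Keep the first raw token seen for each folded form.
        lookup.setdefault(fold(token), token)
    # Words with no match in the raw string are left as parsed.
    return " ".join(lookup.get(word, word) for word in parsed_part.split())


last = restore_spelling("Carlos Ruiz Zafón", "zafon")
```

This only fixes spelling; it does not change which words were assigned to which name part, so "Ruiz" would still land wherever the parser put it. Tokens that the parser alters beyond case and accents (for example by removing punctuation) will not match and fall back to the parsed form.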