Option to respect original casing and encoding #58

rjalexa · 2023-01-13T12:49:58Z

I am amazed by the power and usefulness of this library, but is there a way to respect the original casing? Example:
database: Hervé Le Corre->nominally->Name: herve familyname: le corre
while I'd love to have:
database: Hervé Le Corre->nominally->Name: Herve familyname: Le Corre
In my language, for example, there are many family names similar to "De Petris" or "de Iuliis" where the "de" particle originally meant "from the family of". Its casing also carries information, since generally speaking the lowercase "de" was found in aristocratic families only.
Please also note that in the above example the é in Hervé is an acute-accented e, and I would need to preserve such characters.
Other examples in my sample which could be troublesome:
database: De Petris->nominally->Name: de Lastname: petris
This wrongly assumes that the lastname prefix De is a first name.
Carlos Ruiz Zafón->nominally->Name: carlos ruiz Lastname: zafon
Here Ruiz gets wrongly processed as a middle name and the accented ó gets demoted to a plain o.
Marcos Ordóñez->nominally->Name: marcos Lastname: ordonez
Here the n loses its tilde.


Both ideas are discussed in #58.

- skip_cleaning completely skips multiple stages of cleaning names and name parts.
- prefixes allows the user to modify the list of prefixes, exposed from the default config.PREFIXES.

This currently only affects the Name class and is not part of the CLI or a full release while I consider:

1. How to implement a general way to override config lists such as suffixes, conjunctions, titles, etc. It's currently possible only to monkeypatch these, but since we can now adjust prefixes we should extend this to other constants.
2. Whether and how to incorporate any of these changes into the CLI. I'm leaning toward no, but perhaps --skip-cleaning is fine.

vaneseltine · 2024-07-30T23:49:51Z

Thanks for your comments -- I realize it's been quite a while, but this is helpful feedback. I've digested some of your notes and suggestions into the following:

1. Provide an option to respect original casing.
2. Provide an option to preserve accents. ("Émile Durkheim" can be "Émile" "Durkheim")
3. Process middle names more accurately.
4. Allow a two-word prefix and name to be parsed as a last name alone. ("Van Halen" can be first "" and last "Van Halen")
5. Improve handling of Hispanic names.

I'm going to think more about this as I work toward a new release version.

(1) and (2): Adding new option to skip cleaning

This addresses (1) and (2). Separating out the details of cleaning, where accents and casing are somewhat entangled, was more than I wanted to take on. However, the new skip_cleaning argument to nominally.Name() skips several steps of cleaning.

This allows you to preprocess your own data, for example, with your own regex or with a package such as [clean-text](https://github.com/jfilter/clean-text/tree/main).
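As a rough illustration of that kind of preprocessing, here is a minimal stdlib-only sketch that tidies whitespace and stray characters while keeping casing and accents intact. The function name `light_clean` and its set of allowed punctuation are assumptions made for this example; neither is part of nominally.

```python
import re
import unicodedata


def light_clean(raw: str) -> str:
    """Hypothetical pre-cleaner: keeps casing and accents, tidies the rest."""
    # NFC composes "e" + combining acute into a single "é",
    # so visually identical names compare equal.
    text = unicodedata.normalize("NFC", raw)
    # Turn tabs, newlines and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", text)
    # Keep letters (L*), combining marks (M*) and a few punctuation
    # characters that commonly appear in names; drop everything else.
    text = "".join(
        ch
        for ch in text
        if unicodedata.category(ch)[0] in ("L", "M") or ch in " -'’.,"
    )
    # Dropping characters can leave doubled or trailing spaces behind.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    for name in ["  Herve\u0301   Le\tCorre ", "Carlos Ruiz Zafón (1964)"]:
        print(repr(light_clean(name)))
```

The cleaned string could then be passed to `Name(..., skip_cleaning=True)` as described above; how the parsed parts come out depends on the implementation in f56c303, which this sketch does not try to reproduce.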

I have added initial implementations in f56c303 — there is more to consider about the CLI and how to expose the rest of

(3): Adding new option to customize prefixes

This isn’t exactly what you’re looking for, but I’ve added an argument to allow prefixes to be specified as part of Name() instantiation. This is a step toward a more flexible config, which may make it easier to change default behavior.
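To make the effect of a configurable prefix list concrete, here is a small self-contained sketch of the general idea. It is not nominally's parsing algorithm: `split_name`, `EXAMPLE_PREFIXES`, and the splitting rule are all hypothetical, chosen only to show how changing the prefix set changes which tokens attach to the last name.

```python
# Illustrative prefix set; NOT the contents of nominally's config.PREFIXES.
EXAMPLE_PREFIXES = frozenset({"de", "del", "della", "le", "van", "von"})


def split_name(raw: str, prefixes=EXAMPLE_PREFIXES):
    """Return (given, last), attaching prefix particles to the last name."""
    tokens = raw.split()
    if len(tokens) < 2:
        return (raw.strip(), "")
    # Start by treating only the final token as the last name.
    cut = len(tokens) - 1
    # Pull preceding particles into the last name, but always leave
    # at least one token for the given name.
    while cut > 1 and tokens[cut - 1].lower() in prefixes:
        cut -= 1
    return (" ".join(tokens[:cut]), " ".join(tokens[cut:]))


if __name__ == "__main__":
    print(split_name("Hervé Le Corre"))
    print(split_name("Hervé Le Corre", prefixes=frozenset()))
    print(split_name("De Petris"))
```

With the example set, "Le" attaches to "Corre"; with an empty set it stays with the given name. A two-word input keeps its first token as the given name even when that token is a known prefix.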

(4): Not changing two-word raw names due to existing behavior

Because nominally is targeted to parse full names specifically, I decided to treat even prefixes as first names when the name is presented that way — e.g., both “Van Jones” and “Van Halen” would produce the first name “Van.” An existing test case addresses this. This is a case where there’s simply no correct answer, because one can certainly envision datasets where some full-name fields only have first names. However, clustering last names from these would cause a number of ambiguous first-name/prefix terms to break in a common use case. Della and Van are particularly common first names in an American sample, as I found when I started thinking about #26.

Della   29,219
Van     22,943
Von      4,608
Del      4,454

(5): Unsure how to improve handling of Hispanic names

I read this as related to (3) above but I have described Nominally in the README as “a personal name written in Western name order.” I should perhaps add “Anglophonic.” A three-word name in an American employee database is likely to be first, middle, last; but in a database with many Hispanic names, we might expect instead first, last, and last. Unlike prefixes, we would not be able to make a concise list of Hispanic last names that should not be considered middle names.
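The two readings of a three-word name can be sketched as follows. This is only an illustration of the ambiguity: the `convention` parameter and the `parse_three` helper are hypothetical and do not correspond to any nominally option.

```python
from typing import NamedTuple


class Parsed(NamedTuple):
    first: str
    middle: str
    last: str


def parse_three(raw: str, convention: str = "first-middle-last") -> Parsed:
    """Split a three-word name under one of two naming conventions."""
    tokens = raw.split()
    if len(tokens) != 3:
        raise ValueError("this sketch only handles three-word names")
    a, b, c = tokens
    if convention == "first-middle-last":
        # Typical reading in an Anglophone dataset.
        return Parsed(first=a, middle=b, last=c)
    if convention == "first-last-last":
        # Reading with two surnames, as is common for Hispanic names.
        return Parsed(first=a, middle="", last=f"{b} {c}")
    raise ValueError(f"unknown convention: {convention!r}")


if __name__ == "__main__":
    print(parse_three("Carlos Ruiz Zafón"))
    print(parse_three("Carlos Ruiz Zafón", convention="first-last-last"))
```

Nothing in the string itself indicates which convention applies, so the choice has to come from knowledge about the dataset rather than from a word list.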

vaneseltine added the test__issues.py (this issue is included in test/test__issues.py) and enhancement (new feature or request) labels on Jul 31, 2024.
