Option to respect original casing and encoding #58

Open
rjalexa opened this issue Jan 13, 2023 · 1 comment
Labels
enhancement New feature or request test__issues.py This issue is included in test/test__issues.py

Comments


rjalexa commented Jan 13, 2023

I am amazed by the power and usefulness of this library, but is there a way to respect the original casing? Example:
database: Hervé Le Corre->nominally->Name: herve familyname: le corre
while I'd love to have:
database: Hervé Le Corre->nominally->Name: Herve familyname: Le Corre
In my language, for example, there are many family names similar to "De Petris" or "de Iuliis" where the "de" particle originally meant "from the family of". Its casing also carries information, since the lowercase "de" was generally found only in aristocratic families.
Please also note that in the above example the é in Hervé is an e with an acute accent, and I would need to preserve such characters.
Other examples in my sample which could be troublesome:
database: De Petris->nominally->Name: de Lastname: petris
This wrongly assumes that the lastname prefix De is a first name.
Carlos Ruiz Zafón->nominally->Name: carlos ruiz Lastname: zafon
Here Ruiz is wrongly processed as a middle name, and the accented ó is demoted to a plain o.
Marcos Ordóñez->nominally->Name: marcos Lastname: ordonez
Here the n loses its tilde.
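Until the parser can preserve casing and accents itself, one possible workaround is to map each cleaned token back to its original spelling after parsing. The sketch below uses only the standard library and is not part of nominally; `fold` and `restore_original` are hypothetical helpers, and the `first`/`middle`/`last` keys are assumed from the output shown in the examples above.

```python
import unicodedata


def fold(token):
    """Lowercase a token and drop combining accents, e.g. 'Zafón' -> 'zafon'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def restore_original(raw, parsed):
    """Map each cleaned word in `parsed` back to its spelling in `raw`."""
    originals = {}
    for token in raw.split():
        # Keep the first spelling seen for each folded form.
        originals.setdefault(fold(token), token)
    return {
        key: " ".join(originals.get(word, word) for word in value.split())
        for key, value in parsed.items()
    }


# Assumed parser output, copied from the examples in this issue.
parsed = {"first": "carlos", "middle": "ruiz", "last": "zafon"}
restored = restore_original("Carlos Ruiz Zafón", parsed)
```

This only restores spelling; it does not change which part a word is assigned to. It also assumes whitespace-separated tokens, so a word with punctuation attached (such as a trailing comma) would not match and would be left in its cleaned form.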

vaneseltine added a commit that referenced this issue Jul 30, 2024
Both ideas are discussed in #58.

- skip_cleaning completely skips multiple stages of cleaning names
and name parts.
- prefixes allows the user to modify the list of prefixes, exposed
from the default config.PREFIXES

This currently only affects the Name class and is not part of the
CLI or a full release while I consider:

1. How to implement a general way to override config lists such as
suffixes, conjunctions, titles, etc. It's currently possible only
to monkeypatch these, but since we can now adjust prefixes we
should extend this to other constants.
2. Whether and how to incorporate any of these changes into the
CLI. I'm leaning toward no, but perhaps --skip-cleaning is fine.
@vaneseltine
Owner

Thanks for your comments -- I realize it's been quite a while, but this is helpful feedback. I've digested some of your notes and suggestions into the following:

  1. Provide an option to respect original casing.
  2. Provide an option to preserve accents. ("Émile Durkheim" can be "Émile" "Durkheim")
  3. Process middle names more accurately.
  4. Allow a two-word prefix and name to be parsed as a last name ("Van Halen" can be first "" and last "Van Halen").
  5. Improve handling of Hispanic names.

I'm going to think more about this as I work toward a new release version.

(1) and (2): Adding new option to skip cleaning

This addresses (1) and (2). Separating out the details of cleaning, where accents and casing are somewhat entangled, was more than I wanted to take on. However, the new skip_cleaning argument to nominally.Name() skips several steps of cleaning.

This allows you to preprocess your own data, for example with your own regex or with a package such as [clean-text](https://github.com/jfilter/clean-text/tree/main).
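As a rough illustration of that kind of preprocessing, here is a minimal standard-library sketch that tidies a raw string while keeping its casing and accents. The `preclean` helper is hypothetical, and the set of characters it keeps is an assumption to adjust for your own data.

```python
import re
import unicodedata


def preclean(raw):
    """Tidy a raw name while keeping its original casing and accents."""
    # Compose characters so "o" + combining acute becomes a single "ó".
    text = unicodedata.normalize("NFC", raw)
    # Keep word characters, whitespace, and punctuation that names rely on.
    text = re.sub(r"[^\w\s,.'\-]", " ", text)
    # Drop digits and underscores, which \w also matches.
    text = re.sub(r"[\d_]", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


cleaned = preclean("  Carlos   Ruiz  Zafo\u0301n (#42) ")
# The result could then be passed to the parser with cleaning skipped,
# e.g. Name(cleaned, skip_cleaning=True), using the argument described above.
```

Since skip_cleaning skips several of nominally's own cleaning steps, whatever normalization the data still needs has to happen in a function like this one.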

I have added initial implementations in f56c303 — there is more to consider about the CLI and how to expose the rest of

(3): Adding new option to customize prefixes

This isn’t exactly what you’re looking for, but I’ve added an argument to allow prefixes to be specified as part of Name() instantiation. This is a step toward making config more flexible that might be able to contribute to easier changes in default behavior.
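To show why an adjustable prefix list matters, here is a toy splitter that attaches prefix words to the last name. It is a simplified sketch, not nominally's actual parsing logic: `split_name` and `DEFAULT_PREFIXES` are hypothetical names, and the prefix list is illustrative rather than the contents of config.PREFIXES.

```python
DEFAULT_PREFIXES = frozenset({"de", "del", "della", "le", "van", "von"})  # illustrative


def split_name(raw, prefixes=DEFAULT_PREFIXES):
    """Split a 'first [middle ...] last' name, attaching prefixes to the last name."""
    tokens = raw.lower().split()
    if not tokens:
        return {"first": "", "middle": "", "last": ""}
    first, rest = tokens[0], tokens[1:]
    last = rest[-1:]
    middle = rest[:-1]
    # Move trailing prefix words out of the middle and into the last name.
    while middle and middle[-1] in prefixes:
        last.insert(0, middle.pop())
    return {"first": first, "middle": " ".join(middle), "last": " ".join(last)}


with_prefixes = split_name("Herve Le Corre")
without_prefixes = split_name("Herve Le Corre", prefixes=frozenset())
```

With the default list, "le" joins the last name ("le corre"); with an empty list it falls back to being a middle name. The exact type the real prefixes argument expects is not stated in this thread, so check the Name signature before relying on it.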

(4): Not changing two-word raw names due to existing behavior

Because nominally is targeted to parse full names specifically, I decided to treat even prefixes as first names when the name is presented that way — e.g., both “Van Jones” and “Van Halen” would produce the first name “Van.” An existing test case addresses this. This is a case where there’s simply no correct answer, because one can certainly envision datasets where some full-name fields only have first names. However, clustering last names from these would cause a number of ambiguous first-name/prefix terms to break in a common use case. Della and Van are particularly common first names in an American sample, as I found when I started thinking about #26.

Della   29,219
Van     22,943
Von      4,608
Del      4,454

(5): Unsure how to improve handling of Hispanic names

I read this as related to (3) above, but I have described Nominally in the README as “a personal name written in Western name order.” I should perhaps add “Anglophonic.” A three-word name in an American employee database is likely to be first, middle, last; but in a database with many Hispanic names, we might expect instead first, last, and last. Unlike prefixes, we would not be able to make a concise list of Hispanic last names that should not be considered middle names.
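The two readings of a three-word name can be made concrete with a small sketch. This is not a nominally feature: `split_three` and its `double_surname` flag are hypothetical, and the choice between conventions is assumed to be made by the caller for the whole dataset, since it cannot be inferred from a single name.

```python
def split_three(raw, double_surname=False):
    """Split a three-word name under one of two dataset-level conventions."""
    # Raises ValueError unless the name has exactly three words.
    first, second, third = raw.split()
    if double_surname:
        # Hispanic convention: two surnames, no middle name.
        return {"first": first, "middle": "", "last": f"{second} {third}"}
    # Anglophone convention: first, middle, last.
    return {"first": first, "middle": second, "last": third}


anglophone = split_three("Carlos Ruiz Zafón")
hispanic = split_three("Carlos Ruiz Zafón", double_surname=True)
```

The same input gives "Ruiz" as a middle name under one convention and "Ruiz Zafón" as the last name under the other, which is the ambiguity described above.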

@vaneseltine vaneseltine added test__issues.py This issue is included in test/test__issues.py enhancement New feature or request labels Jul 31, 2024