Genderator is a Python library to process Spanish names (from Spain) to guess their gender.
For this to work, the libray uses the next datasets from Instituto Nacional de Estadística:
- name_surname_ratio: List of words that could be both, a name or a surname, and shows the probability to be a surname.
- names_ine: List of registered names on Spain, with the probability for each one to be a male or a female name.
- surnames_ine: List of registeres surnames on Spain.
The easiest way to install the latest version is by using pip to pull it from PyPI:
pip install genderator
You may also use Git to clone the repository from Github and install it manually:
git clone https://github.com/davidmogar/genderator.git cd genderator python setup.py install
Python 3.3 & 3.4 are supported.
The next code shows a sample usage of this library:
import genderator
guesser = genderator.Parser()
answer = guesser.guess_gender('David Moreno García')
if answer:
print(answer)
else:
print('Name doesn\'t match')
Output:
OrderedDict([ ('names', ['david']), ('surnames', ['moreno', 'garcia']), ('real_name', 'david'), ('gender', 'Male'), ('confidence', 1.0) ])
Genderator's parser can receive some arguments to control its behaviour. Those arguments are:
- force_combinations=Boolean: Force combinations during classification.
- force_split=Boolean: Force name split if no surnames are detected.
- normalize=Boolean: Enable or disable normalization.
- normalizer_options=Dictionary: Normalizer options to be applied.
Normalizer options are a dictionary to control what normalization rules are applied to each name. Possible options are:
- hyphens: Boolean option to enable or disable hyphens removal.
- symbols: Boolean option to enable or disable symbols removal.
- whitespaces: Boolean option to enable or disable extra whitespaces removal.