This project has been superseded by another one of my projects:
Please use that instead as it has better functionality.

- The new tool does a much better job at identifying and replacing `PERSON` entities.
- This project's code always had problems with hyphenated names and possessive forms, marked by an apostrophe followed by `s`.
- The new tool also contains an API as well as a command-line tool, which is also named `deidentify`.
- The new tool has better debugging built in with the `-d` option.
- With this code, the `--tokens.json` file was always created. With the new tool, this is a command-line option: `-t`.
This command-line tool automatically identifies and replaces personal information in text documents using Natural Language Processing (NLP) techniques. It focuses on finding and replacing person names and gender-specific pronouns while maintaining the text's readability and structure.
Natural Language Processing is a field of artificial intelligence that enables computers to understand, interpret, and manipulate human language. This tool specifically uses Named Entity Recognition (NER), an NLP technique that locates and classifies named entities (like person names, organizations, locations) in text. NER helps identify person names even in complex contexts, making it more reliable than simple word matching.
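For example, here is a minimal spaCy NER sketch (illustrative only; it assumes the `en_core_web_trf` model from the installation steps below is already downloaded):

```python
# Minimal NER sketch: print each named entity and its label.
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Dr. Maria Garcia-Lopez reviewed the report before she left.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Maria Garcia-Lopez PERSON"
```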
Key Features:
- Automatic detection of person names using spaCy's transformer model
- Gender-specific pronoun replacement with neutral alternatives
- Intelligent encoding detection and Unicode handling
- Optional HTML output with color-coded replacements
- Detection of potentially missed names (possessives, hyphenated names)
- Efficient metadata caching for quick reprocessing
- Clone the repository:

```bash
git clone https://github.com/jftuga/deidentify.git
cd deidentify
```

- Create and activate a Python virtual environment:

```bash
# On Windows
python -m venv venv
venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```
Note: As of 2024-12-15, spaCy is not yet supported on macOS with Python 3.13.
- Download the spaCy model:

```bash
python -m spacy download en_core_web_trf
```
Note: The transformer model is large (~500MB) but provides superior accuracy.
Basic usage with output to STDOUT:

```bash
python deidentify.py input.txt -r "PERSON"
```

Generate color-coded HTML output:

```bash
python deidentify.py input.txt -r "[REDACTED]" -H -o output.html
```
Command-line options:

```
usage: deidentify.py [-h] -r REPLACEMENT [-o OUTPUT_FILE] [-H] [-v] input_file

positional arguments:
  input_file            text file to deidentify

options:
  -h, --help            show this help message and exit
  -r REPLACEMENT, --replacement REPLACEMENT
                        a word/phrase to replace identified names with
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        output file
  -H, --html            output in HTML format
  -v, --version         display program version and then exit
```
- Yellow: Gender-specific pronouns replaced with neutral alternatives
- Turquoise: Person names replaced with the specified text, given by the `-r` switch
These are listed as `possible_misses` in an intermediate JSON file named `input--tokens.json` when using `input.txt` as the input file name.
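To illustrate why these forms are easy to miss, a simple pattern-based check (hypothetical, not this project's actual logic) might flag possessives and hyphenated capitalized names like this:

```python
# Hypothetical sketch: flag possessive and hyphenated forms that
# entity recognition can overlook.
import re

text = "Smith-Jones's memo cited Garcia-Lopez and O'Brien's team."

possessives = re.findall(r"\b[A-Z][\w'-]*'s\b", text)
hyphenated = re.findall(r"\b[A-Z]\w+-[A-Z]\w+\b", text)

print(possessives)  # ["Smith-Jones's", "O'Brien's"]
print(hyphenated)   # ['Smith-Jones', 'Garcia-Lopez']
```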
Input:

```
John Smith's report was excellent. He clearly understands the topic.
```

Output:

```
PERSON's report was excellent. HE/SHE clearly understands the topic.
```
The tool processes text in two stages (a compact sketch follows the list):

- Identification Stage: Uses spaCy's transformer model to identify:
  - Person names through Named Entity Recognition
  - Gender-specific pronouns through part-of-speech tagging
- Replacement Stage: Replaces identified items while maintaining text integrity:
  - Processes text from end to beginning to preserve character positions
  - Handles gender-specific pronouns with neutral alternatives
  - Supports optional HTML output with color-coded replacements
  - Handles various Unicode punctuation variants
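As a rough illustration of how these two stages fit together, here is a compact sketch (a simplification for readability, not the project's actual source; the pronoun map is abbreviated):

```python
# Illustrative two-stage sketch: identify spans first, then replace them
# from end to beginning so character positions remain valid.
import spacy

PRONOUN_MAP = {"he": "HE/SHE", "she": "HE/SHE", "him": "HIM/HER",
               "her": "HIM/HER", "his": "HIS/HER", "hers": "HIS/HERS"}

nlp = spacy.load("en_core_web_trf")
text = "John Smith's report was excellent. He clearly understands the topic."
doc = nlp(text)

# Stage 1: identification -- NER for names, POS tagging for pronouns.
spans = [(ent.start_char, ent.end_char, "PERSON")
         for ent in doc.ents if ent.label_ == "PERSON"]
spans += [(tok.idx, tok.idx + len(tok), PRONOUN_MAP[tok.lower_])
          for tok in doc if tok.pos_ == "PRON" and tok.lower_ in PRONOUN_MAP]

# Stage 2: replacement, from the last span backward so earlier
# offsets are not shifted by edits made later in the string.
for start, end, repl in sorted(spans, reverse=True):
    text = text[:start] + repl + text[end:]

print(text)  # e.g. "PERSON's report was excellent. HE/SHE clearly understands the topic."
```

Replacing from the end backward means each edit happens after every span still left to process, so the character offsets computed in stage one never go stale.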
- Intelligent encoding detection using the `chardet` third-party Python module (sketched below)
- Unicode punctuation normalization
- Safe handling of mixed encodings
- Metadata caching for efficient reprocessing
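Encoding detection with `chardet` works roughly like this (a minimal sketch, assuming the input file may arrive in an unknown encoding):

```python
# Minimal sketch: guess the file's encoding before decoding it.
import chardet

with open("input.txt", "rb") as f:      # read raw bytes, not text
    raw = f.read()

guess = chardet.detect(raw)             # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
text = raw.decode(guess["encoding"] or "utf-8", errors="replace")
```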
The `en_core_web_trf` (transformer-based) model is used because:
- Highest accuracy for most NLP tasks, especially for named entity recognition and dependency parsing
- Best performance on complex or ambiguous sentences
- Most robust handling of modern language and edge cases
However, be aware of these shortcomings compared to other spaCy models:
- Much slower than statistical models
- Higher memory requirements (~200MB+)
- Not suitable for real-time processing of large volumes of text
- Requires a GPU for optimal performance, but remains usable on CPU-only systems (see the snippet below)
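spaCy exposes `spacy.prefer_gpu()`, which tries to activate a GPU and quietly falls back to CPU otherwise; calling it before `spacy.load()` is the usual pattern:

```python
import spacy

# Returns True if spaCy activated a GPU, False if it fell back to CPU.
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_trf")
```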
This tool relies on several excellent open-source projects: