This project has been superseded by another one of my projects:
Please use that instead as it has better functionality.

- The new tool does a much better job at identifying and replacing `PERSON` entities.
- This project's code always had problems with hyphenated names and possessive forms, marked by an apostrophe followed by `s`.
- The new tool also contains an API as well as a command-line tool, which is also named `deidentify`.
- The new tool has better debugging built in with the `-d` option.
- With this code, the `--tokens.json` file was always created. With the new tool, this is a command-line option: `-t`.
This command-line tool automatically identifies and replaces personal information in text documents using Natural Language Processing (NLP) techniques. It focuses on finding and replacing person names and gender-specific pronouns while maintaining the text's readability and structure.
Natural Language Processing is a field of artificial intelligence that enables computers to understand, interpret, and manipulate human language. This tool specifically uses Named Entity Recognition (NER), an NLP technique that locates and classifies named entities (like person names, organizations, locations) in text. NER helps identify person names even in complex contexts, making it more reliable than simple word matching.
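For example, here is a minimal spaCy NER sketch (illustrative only; it assumes the `en_core_web_trf` model from the installation steps below is already downloaded):

```python
# Minimal NER sketch: print each named entity and its label.
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Dr. Maria Garcia-Lopez reviewed the report before she left.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Maria Garcia-Lopez PERSON"
```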
Key Features:
- Automatic detection of person names using spaCy's transformer model
- Gender-specific pronoun replacement with neutral alternatives
- Intelligent encoding detection and Unicode handling
- Optional HTML output with color-coded replacements
- Detection of potentially missed names (possessives, hyphenated names)
- Efficient metadata caching for quick reprocessing
- Clone the repository:

```bash
git clone https://github.com/jftuga/deidentify.git
cd deidentify
```

- Create and activate a Python virtual environment:

```bash
# On Windows
python -m venv venv
venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```
Note: As of 2024-12-15, spaCy is not yet supported on macOS with Python 3.13.
- Download the spaCy model:

```bash
python -m spacy download en_core_web_trf
```
Note: The transformer model is large (~500MB) but provides superior accuracy.
Basic usage with output to STDOUT:

```bash
python deidentify.py input.txt -r "PERSON"
```

Generate color-coded HTML output:

```bash
python deidentify.py input.txt -r "[REDACTED]" -H -o output.html
```
Command-line options:

```
usage: deidentify.py [-h] -r REPLACEMENT [-o OUTPUT_FILE] [-H] [-v] input_file

positional arguments:
  input_file            text file to deidentify

options:
  -h, --help            show this help message and exit
  -r REPLACEMENT, --replacement REPLACEMENT
                        a word/phrase to replace identified names with
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        output file
  -H, --html            output in HTML format
  -v, --version         display program version and then exit
```
- Yellow: Gender-specific pronouns replaced with neutral alternatives
- Turquoise: Person names replaced with the specified text, given by the `-r` switch
These are listed as `possible_misses` in an intermediate JSON file named `input--tokens.json` when using `input.txt` as the input file name.
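To illustrate why these forms are easy to miss, a simple pattern-based check (hypothetical, not this project's actual logic) might flag possessives and hyphenated capitalized names like this:

```python
# Hypothetical sketch: flag possessive and hyphenated forms that
# entity recognition can overlook.
import re

text = "Smith-Jones's memo cited Garcia-Lopez and O'Brien's team."

possessives = re.findall(r"\b[A-Z][\w'-]*'s\b", text)
hyphenated = re.findall(r"\b[A-Z]\w+-[A-Z]\w+\b", text)

print(possessives)  # ["Smith-Jones's", "O'Brien's"]
print(hyphenated)   # ['Smith-Jones', 'Garcia-Lopez']
```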
Input:

```
John Smith's report was excellent. He clearly understands the topic.
```

Output:

```
PERSON's report was excellent. HE/SHE clearly understands the topic.
```
The tool processes text in two stages (a compact sketch follows the list):

- Identification Stage: Uses spaCy's transformer model to identify:
  - Person names through Named Entity Recognition
  - Gender-specific pronouns through part-of-speech tagging
- Replacement Stage: Replaces identified items while maintaining text integrity:
  - Processes text from end to beginning to preserve character positions
  - Handles gender-specific pronouns with neutral alternatives
  - Supports optional HTML output with color-coded replacements
  - Handles various Unicode punctuation variants
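As a rough illustration of how these two stages fit together, here is a compact sketch (a simplification for readability, not the project's actual source; the pronoun map is abbreviated):

```python
# Illustrative two-stage sketch: identify spans first, then replace them
# from end to beginning so character positions remain valid.
import spacy

PRONOUN_MAP = {"he": "HE/SHE", "she": "HE/SHE", "him": "HIM/HER",
               "her": "HIM/HER", "his": "HIS/HER", "hers": "HIS/HERS"}

nlp = spacy.load("en_core_web_trf")
text = "John Smith's report was excellent. He clearly understands the topic."
doc = nlp(text)

# Stage 1: identification -- NER for names, POS tagging for pronouns.
spans = [(ent.start_char, ent.end_char, "PERSON")
         for ent in doc.ents if ent.label_ == "PERSON"]
spans += [(tok.idx, tok.idx + len(tok), PRONOUN_MAP[tok.lower_])
          for tok in doc if tok.pos_ == "PRON" and tok.lower_ in PRONOUN_MAP]

# Stage 2: replacement, from the last span backward so earlier
# offsets are not shifted by edits made later in the string.
for start, end, repl in sorted(spans, reverse=True):
    text = text[:start] + repl + text[end:]

print(text)  # e.g. "PERSON's report was excellent. HE/SHE clearly understands the topic."
```

Replacing from the end backward means each edit happens after every span still left to process, so the character offsets computed in stage one never go stale.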
- Intelligent encoding detection using the `chardet` third-party Python module (sketched below)
- Unicode punctuation normalization
- Safe handling of mixed encodings
- Metadata caching for efficient reprocessing
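Encoding detection with `chardet` works roughly like this (a minimal sketch, assuming the input file may arrive in an unknown encoding):

```python
# Minimal sketch: guess the file's encoding before decoding it.
import chardet

with open("input.txt", "rb") as f:      # read raw bytes, not text
    raw = f.read()

guess = chardet.detect(raw)             # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
text = raw.decode(guess["encoding"] or "utf-8", errors="replace")
```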
The `en_core_web_trf` (transformer-based) model is used because:
- Highest accuracy for most NLP tasks, especially for named entity recognition and dependency parsing
- Best performance on complex or ambiguous sentences
- Most robust handling of modern language and edge cases
However, be aware of these shortcomings compared to other spaCy models:
- Much slower than statistical models
- Higher memory requirements (~200MB+)
- Not suitable for real-time processing of large volumes of text
- Requires a GPU for optimal performance, but remains usable on CPU-only systems (see the snippet below)
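spaCy exposes `spacy.prefer_gpu()`, which tries to activate a GPU and quietly falls back to CPU otherwise; calling it before `spacy.load()` is the usual pattern:

```python
import spacy

# Returns True if spaCy activated a GPU, False if it fell back to CPU.
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_trf")
```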
This tool relies on several excellent open-source projects: