This repository has been archived by the owner on Jan 3, 2025. It is now read-only.

Deidentify people's names along with pronoun substitution

jftuga/deidentify

DEPRECATION NOTICE

This project has been superseded by another one of my projects:

deidentification

Please use that instead as it has better functionality.

Reasons behind this change:

  • The new tool does a much better job of identifying and replacing PERSON entities.
    • This project's code always had problems with hyphenated names and with possessive forms (a name followed by an apostrophe and s).
  • The new tool provides an API as well as a command-line tool, which is also named deidentify.
  • The new tool has better debugging built in with the -d option.
  • This code always creates the --tokens.json file. In the new tool, creating it is a command-line option: -t.

Text De-identification Tool

INTRODUCTION

This command-line tool automatically identifies and replaces personal information in text documents using Natural Language Processing (NLP) techniques. It focuses on finding and replacing person names and gender-specific pronouns while maintaining the text's readability and structure.

Natural Language Processing is a field of artificial intelligence that enables computers to understand, interpret, and manipulate human language. This tool specifically uses Named Entity Recognition (NER), an NLP technique that locates and classifies named entities (like person names, organizations, locations) in text. NER helps identify person names even in complex contexts, making it more reliable than simple word matching.
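
To make the NER step concrete, here is a minimal sketch of how person names and gender-specific pronouns could be located with spaCy. It is not this project's code: the function names and the GENDERED_PRONOUNS set are illustrative assumptions. Only spacy.load, doc.ents, ent.label_, token.pos_, and the character offsets are standard spaCy API.

```python
# Hypothetical sketch; not the code used by deidentify.py.

# Assumed list of gender-specific pronouns; the tool's real list may differ.
GENDERED_PRONOUNS = {"he", "him", "his", "himself", "she", "her", "hers", "herself"}


def find_person_spans(doc):
    """Return (start_char, end_char, text) for each PERSON entity in a spaCy Doc."""
    return [
        (ent.start_char, ent.end_char, ent.text)
        for ent in doc.ents
        if ent.label_ == "PERSON"
    ]


def find_gendered_pronouns(doc):
    """Return (start_char, end_char, text) for each gender-specific pronoun token."""
    return [
        (tok.idx, tok.idx + len(tok.text), tok.text)
        for tok in doc
        if tok.pos_ == "PRON" and tok.text.lower() in GENDERED_PRONOUNS
    ]


def analyze(text, model="en_core_web_trf"):
    """Run the spaCy pipeline and collect both kinds of spans."""
    import spacy  # imported lazily so the helpers above work without spaCy installed

    nlp = spacy.load(model)
    doc = nlp(text)
    return find_person_spans(doc), find_gendered_pronouns(doc)
```

With spaCy and the model installed, analyze("John Smith's report was excellent.") would return the two span lists. The exact spans depend on the model's predictions.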

Key Features:

  • Automatic detection of person names using spaCy's transformer model
  • Gender-specific pronoun replacement with neutral alternatives
  • Intelligent encoding detection and Unicode handling
  • Optional HTML output with color-coded replacements
  • Detection of potentially missed names (possessives, hyphenated names)
  • Efficient metadata caching for quick reprocessing

INSTALLATION

  1. Clone the repository:
git clone https://github.com/jftuga/deidentify.git
cd deidentify
  2. Create and activate a Python virtual environment:
# On Windows
python -m venv venv
venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate
  3. Install dependencies:
pip install -r requirements.txt

Note: As of 2024-12-15, spaCy is not yet supported on macOS with Python 3.13.

  4. Download the spaCy model:
python -m spacy download en_core_web_trf

Note: The transformer model is large (~500MB) but is more accurate than spaCy's smaller English models.

USAGE

Basic usage with output to STDOUT:

python deidentify.py input.txt -r "PERSON"

Generate color-coded HTML output:

python deidentify.py input.txt -r "[REDACTED]" -H -o output.html

Command line options:

usage: deidentify.py [-h] -r REPLACEMENT [-o OUTPUT_FILE] [-H] [-v] input_file

positional arguments:
  input_file            text file to deidentify

options:
  -h, --help            show this help message and exit
  -r REPLACEMENT, --replacement REPLACEMENT
                        a word/phrase to replace identified names with
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        output file
  -H, --html            output in HTML format
  -v, --version         display program version and then exit
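
The option set above maps naturally onto Python's argparse module. The following is a sketch that mirrors the help text, not the parser from deidentify.py. The VERSION value is a placeholder, since this README does not state the real version string.

```python
import argparse

VERSION = "0.0.0"  # placeholder; the real version string is not given in this README


def build_parser():
    """Build a parser that mirrors the options in the help text above."""
    parser = argparse.ArgumentParser(prog="deidentify.py")
    parser.add_argument("input_file", help="text file to deidentify")
    parser.add_argument("-r", "--replacement", required=True,
                        help="a word/phrase to replace identified names with")
    parser.add_argument("-o", "--output_file", help="output file")
    parser.add_argument("-H", "--html", action="store_true",
                        help="output in HTML format")
    parser.add_argument("-v", "--version", action="version",
                        version=f"%(prog)s {VERSION}",
                        help="display program version and then exit")
    return parser


args = build_parser().parse_args(
    ["input.txt", "-r", "[REDACTED]", "-H", "-o", "output.html"]
)
print(args.input_file, args.replacement, args.html, args.output_file)
# prints: input.txt [REDACTED] True output.html
```

Because -r is declared with required=True, omitting it makes argparse exit with a usage error, which matches the non-bracketed -r REPLACEMENT in the usage line.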

HTML Output Colors:

  • Yellow: Gender-specific pronouns replaced with neutral alternatives
  • Turquoise: Person names replaced with specified text, given by the -r switch
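
One way such color coding could be produced is by wrapping each replacement in a styled span. This is only a sketch: the colors come from the list above, but the markup, the COLORS table, and the highlight function are assumptions, and the HTML the tool really emits may look different.

```python
import html

# Colors follow the README; the exact CSS and markup the tool emits are assumptions.
COLORS = {"pronoun": "yellow", "name": "turquoise"}


def highlight(replacement, kind):
    """Wrap a replacement string in a <span> with the background color for its kind."""
    color = COLORS[kind]
    return f'<span style="background-color: {color}">{html.escape(replacement)}</span>'


print(highlight("PERSON", "name"))
# prints: <span style="background-color: turquoise">PERSON</span>
```

Escaping the replacement with html.escape keeps a replacement such as "<REDACTED>" from being interpreted as a tag.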

Possible Misses

Names the tool may have missed, such as possessives and hyphenated names, are listed as possible_misses in an intermediate JSON file. With input.txt as the input file, this file is named input--tokens.json.
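
The possible misses can be reviewed with a few lines of Python. This sketch assumes that possible_misses is a top-level key holding a list and that the JSON file sits next to the input file. Neither detail is confirmed by this README, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def tokens_path(input_file):
    """Derive the intermediate file name, e.g. input.txt -> input--tokens.json."""
    p = Path(input_file)
    # Assumption: the JSON file is written to the same directory as the input file.
    return p.with_name(f"{p.stem}--tokens.json")


def load_possible_misses(input_file):
    """Return the possible_misses entries, or an empty list if none are recorded."""
    with open(tokens_path(input_file), encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumption: possible_misses is a top-level key holding a list.
    return data.get("possible_misses", [])
```

After running the tool on input.txt, load_possible_misses("input.txt") would return the entries to check by hand.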

Example

Input:

John Smith's report was excellent. He clearly understands the topic.

Output:

PERSON's report was excellent. HE/SHE clearly understands the topic.

TECHNICAL DETAILS

The tool processes text in two stages:

  1. Identification Stage: Uses spaCy's transformer model to identify:
    • Person names through Named Entity Recognition
    • Gender-specific pronouns through part-of-speech tagging
  2. Replacement Stage: Replaces identified items while maintaining text integrity:
    • Processes text from end to beginning to preserve character positions
    • Replaces gender-specific pronouns with neutral alternatives
    • Supports optional HTML output with color-coded replacements
    • Handles various Unicode punctuation variants
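
The end-to-beginning strategy in the replacement stage can be sketched as follows. This is an illustration of the idea rather than the tool's own code: apply_replacements is a hypothetical helper, the spans are written by hand instead of coming from spaCy, and they are assumed not to overlap.

```python
def apply_replacements(text, spans):
    """Replace each (start, end, replacement) span, working from the end of the text.

    Editing right-to-left means earlier offsets stay valid even when a
    replacement is longer or shorter than the text it replaces.
    """
    for start, end, replacement in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


text = "John Smith's report was excellent. He clearly understands the topic."
spans = [
    (0, 10, "PERSON"),   # "John Smith", as NER might report it
    (35, 37, "HE/SHE"),  # "He"
]
print(apply_replacements(text, spans))
# prints: PERSON's report was excellent. HE/SHE clearly understands the topic.
```

Replacing from the start instead would shift every later offset by the length difference of each substitution, so the offsets would need to be recomputed after every edit.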

Text Processing Features:

  • Intelligent encoding detection using the chardet third-party Python module
  • Unicode punctuation normalization
  • Safe handling of mixed encodings
  • Metadata caching for efficient reprocessing
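
As an illustration of Unicode punctuation normalization, the sketch below maps a few common variants to ASCII with str.translate. The mapping table is an assumption, since the variants the tool actually handles are not listed here, and encoding detection with chardet is not covered.

```python
# Assumed mapping; the tool's actual table of punctuation variants is not documented here.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (common in possessives)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def normalize_punctuation(text):
    """Map common Unicode punctuation variants to their ASCII equivalents."""
    return text.translate(PUNCTUATION_MAP)


print(normalize_punctuation("John Smith\u2019s report"))
# prints: John Smith's report
```

Normalizing the apostrophe first means a possessive written with a typographic apostrophe looks the same to later steps as one written with an ASCII apostrophe.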

spaCy NER model

The en_core_web_trf (Transformer-based) model is used because:

  • Highest accuracy for most NLP tasks, especially for named entity recognition and dependency parsing
  • Best performance on complex or ambiguous sentences
  • Most robust handling of modern language and edge cases

However, be aware of these drawbacks compared with other spaCy models:

  • Much slower than the smaller (sm, md, lg) models
  • Higher memory requirements (~200MB+)
  • Not suitable for real-time processing of large volumes of text
  • Runs best on a GPU, but remains usable on CPU only

ACKNOWLEDGEMENTS

This tool relies on several excellent open-source projects:

  • spaCy - Industrial-strength Natural Language Processing
  • chardet - Universal character encoding detector

LICENSE

MIT LICENSE
