This repository has been archived by the owner on Oct 29, 2023. It is now read-only.

Text PII Anonymization Power Skill

Presidio is an open-source tool to recognize, analyze and anonymize personally identifiable information (PII). Using trained ML models, Presidio was built to ensure sensitive text is properly managed and governed.

This Power Skill uses the Presidio Analyzer and Anonymizer to find and remove PII entities. Although Presidio supports several anonymization methods (hash, encrypt, redact, replace, mask), this Power Skill uses only redact, which removes the PII completely from the text.

This skill is ideal for finding and removing PII entities from the search text.
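To make the redact behaviour concrete, here is a minimal standard-library sketch of what removing detected spans from a string looks like. It does not call Presidio: the `Span` type and the `redact_spans` helper are hypothetical names used only for illustration, and the offsets are hand-written stand-ins for the start and end positions an analyzer would report. It assumes the spans do not overlap.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for one detected PII entity (not a Presidio type)."""
    start: int  # index of the first character of the entity
    end: int    # index one past the last character of the entity


def redact_spans(text: str, spans: List[Span]) -> str:
    """Remove every span from the text, leaving nothing in its place."""
    # Work right to left so earlier offsets stay valid after each removal.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + text[span.end:]
    return text


sample = "Call John Smith on 555-0100."
detected = [Span(5, 15), Span(19, 27)]  # hand-written offsets, not analyzer output
redacted = redact_spans(sample, detected)
```

Because redact leaves no placeholder, the surrounding whitespace and punctuation remain, which is why the anonymized text can contain doubled spaces.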

This PII detection custom skill exposes only some of the features Presidio offers. Presidio can be customized for specific needs, either by adding PII recognizers or custom anonymizers.

⚠️ Presidio can help identify sensitive/PII data in structured and unstructured text. However, because Presidio uses trained ML models, there is no guarantee that it will find all sensitive information. Consequently, additional systems and protections should be employed.

Deploy with docker

To run this Power Skill you will need:

  • Docker
  • An Azure Blob storage container
  • A provisioned Azure Cognitive Search (ACS) instance
  • A provisioned Azure Container Registry
  • A Cognitive Services key in the region you deploy ACS to

Below is a full working example that you can run end to end (E2E) on sample data.

How to implement

This section describes how to get this working on sample data and how it can be amended for your data.

  1. Data

    The first step is to view the sample data. Link to sample data.

  2. The next step is to run the API locally and test the model against a test record. Create a local python environment and install the requirements:

    python -m pip install -r powerskill/requirements.txt

    In addition to the common requirements described in the root README.md file, this Power Skill requires the spaCy en_core_web_lg model to be downloaded:

    python -m spacy download en_core_web_lg

    Activate your environment and run the API locally by executing the following:

    python app.py

    Run the cell 'Test PII anonymization on our local running API'. Make sure you rename the sample_env file to .env and populate it with the relevant values. Use the variable URL_LOCAL as the URL.
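For readers testing outside the notebook, the sketch below shows how such a test request could be assembled with the standard library. It follows the custom skill interface that Azure Cognitive Search uses, where records are sent in a `values` array with a `recordId` and a `data` object. The input field name `text`, the header name `x-api-key`, and the placeholder URL are assumptions for illustration and are not taken from this repository; check the notebook and your .env file for the real values.

```python
import json
import urllib.request
from typing import List


def build_skill_payload(texts: List[str], field: str = "text") -> dict:
    """Wrap raw strings in the custom skill request envelope."""
    return {
        "values": [
            {"recordId": str(i), "data": {field: t}}
            for i, t in enumerate(texts)
        ]
    }


def build_request(url: str, key: str, payload: dict,
                  key_header: str = "x-api-key") -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST request carrying the secret key.

    The default header name is hypothetical; use whatever the API expects.
    """
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", key_header: key},
        method="POST",
    )


payload = build_skill_payload(["My name is Jane Doe."])
# Placeholder value; in practice URL_LOCAL comes from your .env file.
URL_LOCAL = "http://localhost:5000/"
request = build_request(URL_LOCAL, "YourSecretKeyCanBeAnything", payload)
# To send it: urllib.request.urlopen(request).read()
```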

  3. Build the docker image

    Now build the Docker image and upload it to your container registry.
    For this step you will need Docker running so that we can build and test our Presidio inference API locally. You will also need a container registry for the build.

    Run the following command to build the inference API container image:

    docker build -t [container_registry_name].azurecr.io/pii_anonymization:[your_tag] .

    The container requires the following variables to be set at runtime:

    KEY=[YourSecretKeyCanBeAnything]    # This is a secret key - only requests with this key will be allowed
    DEBUG=True   # This enables verbose logging

    See the file sample_env for the .env format.
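How the API itself reads these variables is not shown in this README. As a hedged illustration only, the sketch below shows one way a service could load `KEY` and `DEBUG` from its environment; the `Settings` type and `load_settings` function are hypothetical names, and the lenient parsing of `DEBUG` is a design choice of this sketch rather than documented behaviour of the container.

```python
import os
from typing import Mapping, NamedTuple


class Settings(NamedTuple):
    """Hypothetical container for the two runtime variables."""
    key: str
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read KEY and DEBUG, failing fast when the secret key is missing."""
    key = env.get("KEY", "")
    if not key:
        raise ValueError("KEY must be set")
    # This sketch treats "True", "true" and "1" as enabled; anything else is off.
    debug = env.get("DEBUG", "false").strip().lower() in ("true", "1")
    return Settings(key=key, debug=debug)


settings = load_settings({"KEY": "YourSecretKeyCanBeAnything", "DEBUG": "True"})
```

Failing fast on a missing key avoids starting an API that would accept unauthenticated requests by accident.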

    Now we can test the container by running it locally with our variables:

    docker run -it --rm -p 5000:5000 -e DEBUG=true -e KEY=[YourSecretKeyCanBeAnything] \
        [container_registry_name].azurecr.io/pii_anonymization:[your_tag]

    Upon starting, you will see the server initializing:

    INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)

    We are now ready to send a request. Run the cell 'Test PII anonymization on local running API' to test the running container.

    The response should show the anonymized text.
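A response in the Azure Cognitive Search custom skill format carries one entry per record in a `values` array, each with `recordId`, `data`, and optional `errors` and `warnings`. The sketch below shows one way to pull the anonymized text out of such a response and surface errors. The output field name `text` and the sample response are assumptions written by hand for illustration, not captured from a real run of this container.

```python
from typing import Dict


def anonymized_by_record(response: dict, field: str = "text") -> Dict[str, str]:
    """Map each recordId to its anonymized text, raising on reported errors."""
    results = {}
    for record in response.get("values", []):
        errors = record.get("errors") or []
        if errors:
            raise RuntimeError(f"record {record.get('recordId')}: {errors}")
        results[record["recordId"]] = record["data"][field]
    return results


# Hand-written example response; the field name "text" is hypothetical.
sample_response = {
    "values": [
        {"recordId": "0", "data": {"text": "My name is ."},
         "errors": None, "warnings": None}
    ]
}
texts = anonymized_by_record(sample_response)
```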

  4. Deploy the container to an Azure Web App.

    We will deploy this as an Azure App Service Web App running a container.

    First we need to push our newly built image to our container registry.

    Run the following command:

    docker push [container_registry_name].azurecr.io/pii_anonymization:[your_tag]

    In the deployment folder there are two Terraform files to deploy the inference API to an App Service Web App for Linux.

    The simplest approach is to open a cloud shell in the portal and upload the main and variables files to your cloud shell storage, as this avoids the need for any installation.

    Set the following values in the main file:

    backend "azurerm" {
      storage_account_name = "[your storage account name]"
      container_name       = "[your storage container name]"
      key                  = "[your state file name]"
      resource_group_name  = "[your storage account resource group name]"
    }
    

    Set the following values in the variables file:

    variable "app_service_sku" {
      description = "The SKU (size - cpu/mem) of the app plan hosting the container. See: https://azure.microsoft.com/en-us/pricing/details/app-service/linux/"
      default = "P2V2"
    }
    
    variable "docker_registry_url" {
      description = "[your container registry].azurecr.io"
      default = ""
    }
    
    variable "docker_registry_username" {
      description = "[your container registry username]"
      default = ""
    }
    
    variable "docker_registry_password" {
      description = "[your container registry password]"
      default = ""
    }
    
    variable "docker_image" {
      description = "[your docker image name]:[your tag]"
      default = ""
    }
    
    variable "resource_group" {
      description = "This is the name of an existing resource group to deploy to"
      default = ""
    }
    
    variable "location" {
      description = "This is the region of an existing resource group you want to deploy to"
      default = "eastus2"
    }
    
    variable "debug" {
      description = "API logging - set to True for verbose logging"
      default = false
    }
    

    Navigate to the directory containing the files and enter:

    terraform init

    Then enter:

    terraform apply

    You will be prompted with:

    Do you want to perform these actions?
      Terraform will perform the actions described above.
      Only 'yes' will be accepted to approve.

    Type yes

    Once deployed, copy the Azure Web App URL, which can be found in the overview section of the portal. We will need it to plug into Azure Search.

  5. Deploy the datasource, index, skillset and indexer

    Data source

    Populate your values in the data source file or use the 'Create the data source' script.

    Index

    Populate your values in the index file or use the 'Create the index' script.

    Skillset

    Populate the values in the skillset file or use the 'Create the SkillSet' script.

    Note: you need an already deployed ACS instance in the same region as your Cognitive Services instance.

    You will need your ACS API Key and the URL for your ACS instance.
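As a rough guide to what the skillset entry for this Power Skill might contain, the sketch below builds a custom Web API skill definition as a Python dictionary. The `@odata.type` and property names follow the Azure Cognitive Search custom Web API skill schema; the input and output names, the `anonymized_text` target name, the header name, and the placeholder URI are assumptions for illustration, so prefer the values in the repository's skillset file.

```python
import json


def web_api_skill(uri: str, key: str, key_header: str = "x-api-key") -> dict:
    """Build a custom Web API skill definition pointing at the deployed API.

    Field names under inputs/outputs and the header name are hypothetical.
    """
    return {
        "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
        "description": "Remove PII entities from document text",
        "uri": uri,
        "httpMethod": "POST",
        "httpHeaders": {key_header: key},
        "timeout": "PT30S",
        "batchSize": 1,
        "context": "/document",
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "text", "targetName": "anonymized_text"}],
    }


skill = web_api_skill(
    "https://[your web app].azurewebsites.net/",  # placeholder, not a real URL
    "YourSecretKeyCanBeAnything",
)
skill_json = json.dumps(skill, indent=2)
```

The `uri` is where the Azure Web App URL copied after the Terraform deployment goes, and the header value must match the `KEY` the container was started with.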

  6. Run the ACS indexer

    Populate the values in the indexer file or use the 'Create/Run your indexer' script.

    The indexer will automatically run. You should see requests coming in if you look at the Web App logs.
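Besides watching the Web App logs, the indexer's progress can be checked through the Azure Cognitive Search REST API. The sketch below prepares a status request and summarizes the `lastResult` section of the response. The service name, indexer name, and key are placeholders, and the API version shown is one assumed for this example; use the version your service supports.

```python
import json
import urllib.request


def indexer_status_request(service: str, indexer: str, api_key: str,
                           api_version: str = "2020-06-30") -> urllib.request.Request:
    """Prepare (but do not send) a GET request for the indexer status."""
    url = (f"https://{service}.search.windows.net/indexers/{indexer}/status"
           f"?api-version={api_version}")
    return urllib.request.Request(url, headers={"api-key": api_key}, method="GET")


def summarize(status: dict) -> str:
    """Condense the lastResult section of a status response into one line."""
    last = status.get("lastResult") or {}
    return (f"{last.get('status', 'unknown')}: "
            f"{last.get('itemsProcessed', 0)} processed, "
            f"{last.get('itemsFailed', 0)} failed")


# Placeholder names; substitute your own service, indexer and admin key.
request = indexer_status_request("my-search-service", "my-indexer", "my-admin-key")
# To send it: status = json.load(urllib.request.urlopen(request))
```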

  7. Test the index

    When looking at your data, you will now see the imported text data without PII entities in it.

    We are now in a position to search the anonymized data. Navigate to 'Let's go and test the ACS index' to view the anonymized text.
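If you prefer to query the index directly, the sketch below prepares a search request against the Azure Cognitive Search REST API using only the standard library. The service name, index name, and key are placeholders, and the API version is an assumption for this example. Matching documents are returned under the `value` property of the JSON response.

```python
import json
import urllib.request


def search_request(service: str, index: str, api_key: str, query: str = "*",
                   top: int = 5, api_version: str = "2020-06-30") -> urllib.request.Request:
    """Prepare (but do not send) a POST search request against an index."""
    url = (f"https://{service}.search.windows.net/indexes/{index}/docs/search"
           f"?api-version={api_version}")
    body = {"search": query, "top": top}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Placeholder names; substitute your own service, index and query key.
request = search_request("my-search-service", "my-index", "my-query-key")
# To send it: documents = json.load(urllib.request.urlopen(request))["value"]
```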