Trying to use presidio analyzer in pyspark. #669
Unanswered
Ratnakara-Sarma
asked this question in Q&A
Replies: 1 comment 11 replies
-
There is an issue with serializing the …
-
Hello,
I've been trying to use the analyzer in PySpark. My idea is to apply the Presidio analyzer to a Spark dataframe that holds file paths (50,000 .txt files from the IMDB reviews dataset on Kaggle) and their contents. I was able to create the dataframe (50,000 × 2). It looks like this: (screenshot omitted)
Then I wrote this Python function:
```python
def pii_apply_pandas(file):
    text = file
    response = analyzer.analyze(correlation_id=0, text=text, entities=[], language='en')
    if len(response) == 0:
        return "no"
    else:
        return "yes"
```
This function works fine on a single text of any length. I then converted it into a Spark UDF, but when I apply that UDF to the Spark dataframe it takes more than 30 minutes and the Spark job does not even start. Please help.
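The reply above points at the likely cause: a Spark UDF pickles everything it closes over, and a Presidio `AnalyzerEngine` (which holds loaded NLP models) is generally not serializable. The usual workaround is to build the analyzer lazily inside the UDF, so each executor process constructs its own instance instead of receiving a pickled one. A minimal sketch of that lazy-initialization pattern, using a hypothetical stand-in class in place of the real `AnalyzerEngine` so the sketch runs without Presidio or Spark installed:

```python
class FakeAnalyzer:
    """Stand-in for presidio_analyzer.AnalyzerEngine (assumption: real engine is
    expensive to build and cannot be pickled). Pretends any digit is PII."""
    def analyze(self, text, language="en"):
        return [c for c in text if c.isdigit()]

_analyzer = None  # one instance per worker process, created on first use

def get_analyzer():
    """Build the analyzer on first call inside the worker, not on the driver."""
    global _analyzer
    if _analyzer is None:
        # In the real UDF this would be:
        #   from presidio_analyzer import AnalyzerEngine
        #   _analyzer = AnalyzerEngine()
        _analyzer = FakeAnalyzer()
    return _analyzer

def pii_apply(text):
    # The UDF body closes over nothing heavy; the engine is fetched lazily,
    # so Spark only has to pickle this small function.
    results = get_analyzer().analyze(text=text, language="en")
    return "yes" if results else "no"
```

In the real job, `pii_apply` would be wrapped with `pyspark.sql.functions.udf` and applied to the content column; since the engine is created per executor rather than shipped from the driver, the serialization failure should disappear. This is a sketch of the pattern, not a tested Presidio-on-Spark recipe.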