Trying to use presidio analyzer in pyspark. #669
Unanswered
Ratnakara-Sarma
asked this question in Q&A
Replies: 1 comment 11 replies
-
There is an issue with serializing the …
-
Hello,
I've been trying to use the analyzer in PySpark. My idea is to apply the Presidio analyzer to a Spark dataframe that holds file paths (50,000 .txt files from the IMDB reviews dataset on Kaggle) and their contents. I was able to create the dataframe (50,000 × 2). It looks like this: (screenshot omitted)
Then I wrote this Python function:
```python
def pii_apply_pandas(file):
    text = file
    response = analyzer.analyze(correlation_id=0, text=text, entities=[], language='en')
    if len(response) == 0:
        return "no"
    else:
        return "yes"
```
This function works fine on a single text of any length. I then converted it into a Spark UDF, but when I apply that UDF to the Spark dataframe it takes more than 30 minutes and the Spark job does not even start. Please help.
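The reply above points at the likely cause: a Spark UDF pickles everything it closes over, and a Presidio `AnalyzerEngine` (which holds loaded NLP models) is generally not serializable. The usual workaround is to build the analyzer lazily inside the UDF, so each executor process constructs its own instance instead of receiving a pickled one. A minimal sketch of that lazy-initialization pattern, using a hypothetical stand-in class in place of the real `AnalyzerEngine` so the sketch runs without Presidio or Spark installed:

```python
class FakeAnalyzer:
    """Stand-in for presidio_analyzer.AnalyzerEngine (assumption: real engine is
    expensive to build and cannot be pickled). Pretends any digit is PII."""
    def analyze(self, text, language="en"):
        return [c for c in text if c.isdigit()]

_analyzer = None  # one instance per worker process, created on first use

def get_analyzer():
    """Build the analyzer on first call inside the worker, not on the driver."""
    global _analyzer
    if _analyzer is None:
        # In the real UDF this would be:
        #   from presidio_analyzer import AnalyzerEngine
        #   _analyzer = AnalyzerEngine()
        _analyzer = FakeAnalyzer()
    return _analyzer

def pii_apply(text):
    # The UDF body closes over nothing heavy; the engine is fetched lazily,
    # so Spark only has to pickle this small function.
    results = get_analyzer().analyze(text=text, language="en")
    return "yes" if results else "no"
```

In the real job, `pii_apply` would be wrapped with `pyspark.sql.functions.udf` and applied to the content column; since the engine is created per executor rather than shipped from the driver, the serialization failure should disappear. This is a sketch of the pattern, not a tested Presidio-on-Spark recipe.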