How to use LanguageDetectorDL to remove non-english words in python? #2737
-
Hi, I think you either need to approach this problem differently or reframe it so that removing non-English words is not the only solution. In short, LanguageDetectorDL, like all other language detectors (even online ones), works best on longer sequences, not on a single token or word.
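As one illustration of "another approach", here is a minimal sketch that filters tokens against a known vocabulary instead of running language detection on each word. It does not use LanguageDetectorDL at all; the function name keep_known_words and the tiny hand-picked ENGLISH_VOCAB set are hypothetical and exist only for this example. A real vocabulary would have to come from an English word list of your choosing.

```python
def keep_known_words(words, vocabulary):
    """Return the tokens found in `vocabulary` (case-insensitive), preserving order."""
    vocab = {w.lower() for w in vocabulary}
    return [w for w in words if w.lower() in vocab]


# Hypothetical, hand-picked vocabulary for illustration only.
ENGLISH_VOCAB = {"protection", "outlook", "com", "prod"}

# Example sentence taken from the question in this thread.
sentence = "protection outlook com cyprmb namprd prod outlook com"
filtered = " ".join(keep_known_words(sentence.split(), ENGLISH_VOCAB))
print(filtered)
```

On a PySpark array column, a function like this could probably be applied row by row (for example through a UDF), but the quality of the result depends entirely on the vocabulary you supply.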
-
Hi,
I am working with a PySpark DataFrame.
I have a df that looks like this:
I need to use LanguageDetectorDL from Spark NLP on the words column, which is of type array<string>, so that it detects English, keeps only the English words, and removes the others.
I have already used DocumentAssembler() to transform the data to annotation format:
documentAssembler = DocumentAssembler().setInputCol('words').setOutputCol('document')
But I am not sure how to use LanguageDetectorDL on the column to get rid of the non-English words.
Or, if I convert array<string> to string, is it possible then? Say there is a sentence "protection outlook com cyprmb namprd prod outlook com"; is it possible to get "protection outlook com prod outlook com"?