
Unable to load pretrained pipeline for ContextSpellCheckerModel; serialVersionUID error #7340

Open

geowynn opened this issue Mar 17, 2022 · 8 comments

Labels: bug, models_hub (pretrained models and pipelines)

Comments

geowynn commented Mar 17, 2022

I'm new to using Spark NLP and am facing this error when adding the pretrained ContextSpellCheckerModel to my pipeline.

Description

I'm also working on my project in CDSW (Cloudera Data Science Workbench). I've referred to the issues linked below, but I'm not entirely sure how to point to the correct JAR in a cloud environment.

References:
#2562
#5984

Expected Behavior

The pretrained ContextSpellCheckerModel should load and run without a serialVersionUID InvalidClassException.

Current Behavior

Code for pipeline:
spellChecker = ContextSpellCheckerModel.pretrained("spellcheck_dl") \
    .setInputCols("tokenized") \
    .setOutputCol("checked")

Summarised error:
spellcheck_dl download started this may take some time.
Approximate size to download 112.2 MB
Download done! Loading the resource.
22/03/17 07:52:54 052 ERROR Executor: Exception in task 0.0 in stage 90.0 (TID 892)
java.io.InvalidClassException: com.johnsnowlabs.nlp.annotators.spell.context.parser.MainVocab; local class incompatible: stream classdesc serialVersionUID = 2150722227907329010, local class serialVersionUID = 7050539942427507052

Full error message:

spellcheck_dl download started this may take some time.
Approximate size to download 112.2 MB
Download done! Loading the resource.
22/03/17 07:52:54 052 ERROR Executor: Exception in task 0.0 in stage 90.0 (TID 892)
java.io.InvalidClassException: com.johnsnowlabs.nlp.annotators.spell.context.parser.MainVocab; local class incompatible: stream classdesc serialVersionUID = 2150722227907329010, local class serialVersionUID = 7050539942427507052
	at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1885)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1975)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
	at org.apache.spark.util.Utils$.deserialize(Utils.scala:173)
	at org.apache.spark.SparkContext$$anonfun$objectFile$1$$anonfun$apply$16.apply(SparkContext.scala:1316)
	at org.apache.spark.SparkContext$$anonfun$objectFile$1$$anonfun$apply$16.apply(SparkContext.scala:1316)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:967)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:967)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
22/03/17 07:52:54 055 ERROR TaskSetManager: Task 0 in stage 90.0 failed 1 times; aborting job

An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 892, localhost, executor driver): java.io.InvalidClassException: com.johnsnowlabs.nlp.annotators.spell.context.parser.MainVocab; local class incompatible: stream classdesc serialVersionUID = 2150722227907329010, local class serialVersionUID = 7050539942427507052
	at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1885)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1975)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
	at org.apache.spark.util.Utils$.deserialize(Utils.scala:173)
	at org.apache.spark.SparkContext$$anonfun$objectFile$1$$anonfun$apply$16.apply(SparkContext.scala:1316)
	at org.apache.spark.SparkContext$$anonfun$objectFile$1$$anonfun$apply$16.apply(SparkContext.scala:1316)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:967)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:967)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1914)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1913)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1913)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:951)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:951)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:951)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2147)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2096)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2085)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:762)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2081)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2102)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2121)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2146)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:967)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:966)
	at com.johnsnowlabs.nlp.serialization.TransducerFeature.deserializeObject(Feature.scala:293)
	at com.johnsnowlabs.nlp.serialization.Feature.deserialize(Feature.scala:61)
	at com.johnsnowlabs.nlp.FeaturesReader$$anonfun$load$1.apply(ParamsAndFeaturesReadable.scala:31)
	at com.johnsnowlabs.nlp.FeaturesReader$$anonfun$load$1.apply(ParamsAndFeaturesReadable.scala:30)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:30)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:406)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:400)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:546)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.InvalidClassException: com.johnsnowlabs.nlp.annotators.spell.context.parser.MainVocab; local class incompatible: stream classdesc serialVersionUID = 2150722227907329010, local class serialVersionUID = 7050539942427507052
	at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1885)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1975)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
	at org.apache.spark.util.Utils$.deserialize(Utils.scala:173)
	at org.apache.spark.SparkContext$$anonfun$objectFile$1$$anonfun$apply$16.apply(SparkContext.scala:1316)
	at org.apache.spark.SparkContext$$anonfun$objectFile$1$$anonfun$apply$16.apply(SparkContext.scala:1316)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:967)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:967)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
[OK!]

Possible Solution

Steps to Reproduce

import sparknlp
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import LanguageDetectorDL, Tokenizer, ContextSpellCheckerModel
from pyspark.ml import Pipeline

spark = sparknlp.start(spark24=True)

text_col = "text"  # name of the input text column (illustrative; the original value was not shown)

documentAssembler = DocumentAssembler() \
    .setInputCol(text_col) \
    .setOutputCol("document")

languageDetector = LanguageDetectorDL.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("language")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokenized")

spellChecker = ContextSpellCheckerModel.pretrained("spellcheck_dl") \
    .setInputCols("tokenized") \
    .setOutputCol("checked")

finisher = Finisher().setInputCols(["checked"])  # assumed; the finisher definition was not in the original snippet

pipeline = Pipeline() \
    .setStages([documentAssembler,
                languageDetector,
                tokenizer,
                spellChecker,
                finisher])
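For completeness, a minimal sketch of running this pipeline; the sample sentence and the fit/transform calls are illustrative additions, not part of the original report:

data = spark.createDataFrame([["Plese correc this sentense."]]).toDF("text")
model = pipeline.fit(data)           # stages are pretrained, so fit() is inexpensive
result = model.transform(data)
result.select("finished_checked").show(truncate=False)  # Finisher's default output column naming

The error above is raised during the pretrained() download step, before fit() is ever reached.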

Context

Unable to make use of the contextual spell checker in the pipeline.

Your Environment

  • Spark NLP version sparknlp.version(): '3.4.2'

  • Apache Spark version spark.version: '2.4.0-cdh6.3.4'

  • Java version java -version: java version "1.8.0_181"
    Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
    Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

  • Setup and installation (Pypi, Conda, Maven, etc.): pip install spark-nlp

  • Operating System and version: Cloudera Data Science Workbench Python 3

  • Link to your project (if any):

@maziyarpanahi (Member) commented

Hi @geowynn

Could you please share how you start the SparkSession in your Cloudera cluster with the spark-nlp Maven package? (Since you are on Spark 2.4, the name of that package must be spark-nlp-spark24: https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet)

Just double-checking here: it could well be the model itself that needs a separate copy for Spark 2.4, but before that, I want to be sure you are using the correct Spark NLP Maven package.

geowynn (Author) commented Mar 17, 2022

Hi @maziyarpanahi,

Are you referring to this line in the Steps to Reproduce section? I'm using spark = sparknlp.start(spark24=True) directly in the CDSW IDE to start the Spark session.

Separately, I've tried configuring the session directly, but the issue persists.

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[4]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2") \
    .getOrCreate()

Thanks for the prompt response!

@maziyarpanahi (Member) commented

Yes, that is the line, thanks @geowynn.

I can confirm it's not your setup; it's actually the spellcheck_dl model. More precisely, the exact model spellcheck_dl_en_2.7.2_2.4_1611394065565, which is the one that gets downloaded and crashes.

We will fix this model, re-upload it for Spark 2.4, and I'll keep you updated.

@albertoandreottiATgmail (Contributor) commented

Hello @geowynn, have you tried using the Kryo serializer?

geowynn (Author) commented Mar 29, 2022

> Hello @geowynn, have you tried using the Kryo serializer?

Hi @albertoandreottiATgmail, what do you mean by the Kryo serializer? Do you have any examples? I'm not too familiar with it.
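For reference, Kryo is Spark's alternative object serializer, enabled through standard Spark configuration keys. A minimal sketch of turning it on when building the session, mirroring the builder shown earlier in this thread; whether this resolves the serialVersionUID mismatch is not confirmed anywhere in this thread:

from pyspark.sql import SparkSession

# spark.serializer and spark.kryoserializer.buffer.max are standard Spark settings.
spark = SparkSession.builder \
    .appName("Spark NLP with Kryo") \
    .master("local[4]") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2") \
    .getOrCreate()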

@maziyarpanahi (Member) commented

@geowynn

We have fixed the English version of the spellcheck_dl model. Could you please try it one more time to see if it works? (It is fixed for 3.4.2.)
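A note for anyone re-testing: Spark NLP caches downloaded pretrained models locally (by default under ~/cache_pretrained), so the previously downloaded broken copy may need to be removed first or the fixed model will not be fetched again. A minimal sketch, using the broken model's folder name from the comment above; adjust the path if a custom cache folder was configured:

import shutil
from pathlib import Path

# Default local cache for Spark NLP pretrained models.
cached = Path.home() / "cache_pretrained" / "spellcheck_dl_en_2.7.2_2.4_1611394065565"
if cached.exists():
    shutil.rmtree(cached)  # force a fresh download of the fixed model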

@sharathjapa commented

I am trying to use ContextSpellCheckerModel in my NLP pipeline and I am facing a Py4JJavaError. My runtime environment is Spark 3.2.1 and Spark NLP 4.0.0. The notebook is running on a Databricks environment.

spellChecker = ContextSpellCheckerModel.load("dbfs:path/to/pretrained/model") \
    .setInputCols("tokenized") \
    .setOutputCol("checked")

Any help is appreciated

@maziyarpanahi (Member) commented

> I am trying to use ContextSpellCheckerModel in my NLP pipeline and I am facing a Py4JJavaError. My runtime environment is Spark 3.2.1 and Spark NLP 4.0.0. The notebook is running on a Databricks environment.
>
> spellChecker = ContextSpellCheckerModel.load("dbfs:path/to/pretrained/model") \
>     .setInputCols("tokenized") \
>     .setOutputCol("checked")
>
> Any help is appreciated

Could you please create a new issue? We need all the info in the issue template, especially what exactly that path to the pretrained model is (a link to the actual model on Models Hub).

Thank you
