
structured streaming #120

Open · joques opened this issue Sep 21, 2024 · 45 comments

@joques

joques commented Sep 21, 2024

Hi all, how do I get Spark.jl to read a stream from and write to Kafka? I need help finding documentation on that.

@dfdx
Owner

dfdx commented Sep 21, 2024

As a general rule, you can follow the official documentation with a few context-specific modifications. I don't have a proper setup to test it myself, but I'd start with this doc. According to it, you first need to add the spark-sql-kafka library to the Spark session. Something like this:

spark = SparkSession.builder()
        .appName("...")
        .master("spark://ip:7077")
        .config("spark.jars", "/path/,/path/to/another.jar")
        .getOrCreate()

then just follow the examples on the linked page (Python or Scala API):

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1") \
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

Please, let me know if it worked for you.

@joques
Author

joques commented Sep 23, 2024

Thanks for your response. I followed your suggestion but I am still getting the same error. This is what I did.

I downloaded spark-sql-kafka into a folder. Let's call it spark-jars. Then I updated the spark session as follows

spark = SparkSession.builder.appName("Main").master("spark://IP:7077").config("spark.jars", "/absolute/path/to/spark-jars/spark-sql-kafka-0-10_2.12-3.5.2.jar").getOrCreate()

Next, I added the streaming code as follows

stream_df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "st-streaming-session").load()

But I am still getting the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of Structured Streaming + Kafka Integration Guide.

Note that for kafka.bootstrap.servers, I tried the domain name, the IP address, and localhost. All three throw the same exception. What am I doing wrong?
Regards

@dfdx
Owner

dfdx commented Sep 23, 2024

Can you post the full stacktrace?

Edit: or, even better, the full log or the parts of it related to shipping the JAR to the workers.

@joques
Author

joques commented Sep 24, 2024

Below is the full stack trace of the error I get

Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of Structured Streaming + Kafka Integration Guide.
	at org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindKafkaDataSourceError(QueryCompilationErrors.scala:1567)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:645)
	at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:158)
	at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:145)

and below is the log from the master for my latest run:

24/09/24 05:59:24 INFO Master: Start scheduling for app app-20240924055923-0004 with rpId: 0
24/09/24 06:02:27 INFO Master: 196.216.167.103:36600 got disassociated, removing it.
24/09/24 06:02:27 INFO Master: COMPSCSRV04.nust.na:33649 got disassociated, removing it.
24/09/24 06:02:27 INFO Master: Removing app app-20240924055923-0004
24/09/24 06:02:27 WARN Master: Got status update for unknown executor app-20240924055923-0004/0
24/09/24 06:02:27 WARN Master: Got status update for unknown executor app-20240924055923-0004/1
24/09/24 06:04:53 INFO Master: Registering app Main
24/09/24 06:04:53 INFO Master: Registered app Main with ID app-20240924060453-0005
24/09/24 06:04:53 INFO Master: Start scheduling for app app-20240924060453-0005 with rpId: 0
24/09/24 06:04:53 INFO Master: Launching executor app-20240924060453-0005/0 on worker worker-20240923111146-196.216.167.102-45805
24/09/24 06:04:53 INFO Master: Launching executor app-20240924060453-0005/1 on worker worker-20240923111223-196.216.167.105-36589
24/09/24 06:04:54 INFO Master: Start scheduling for app app-20240924060453-0005 with rpId: 0
24/09/24 06:04:54 INFO Master: Start scheduling for app app-20240924060453-0005 with rpId: 0

@joques
Author

joques commented Sep 25, 2024

Stack trace
Here is what happened, the most recent locations are first:

geterror() @ core.jl:544
_jcall(::JavaCall.JavaObject{Symbol("org.apache.spark.sql.streaming.DataStreamReader")}, ::Ptr{Nothing}, ::Type, ::Tuple{}; callmethod::typeof(JavaCall.JNI.CallObjectMethodA)) @ core.jl:482
_jcall(::JavaCall.JavaObject{Symbol("org.apache.spark.sql.streaming.DataStreamReader")}, ::Ptr{Nothing}, ::Type, ::Tuple{}) @ core.jl:475
jcall(::JavaCall.JavaObject{Symbol("org.apache.spark.sql.streaming.DataStreamReader")}, ::String, ::Type, ::Tuple{}) @ core.jl:371
load @ streaming.jl:54
DotChainer @ chainable.jl:13

@dfdx
Owner

dfdx commented Sep 25, 2024

For some reason, the added jar file is not propagated to your workers. Debugging this via GitHub issues isn't easy, but as a shot in the dark, let's try adding the Kafka connector as a package instead of a plain jar:

spark = SparkSession.builder()
        .appName("...")
        .master("spark://ip:7077")
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1")   # <-- this line changed
        .getOrCreate()

@joques
Author

joques commented Sep 25, 2024

It keeps throwing the same error. Suspecting that the added jar was not propagated, I created the same folder on each worker node and added the jar. But I am still getting the same error.
Another thing I tried was to add the jar file to the spark-defaults.conf on each node as follows

spark.jars path/to/spark-sql-kafka.jar

But the error persists. I am just confused now.

@dfdx
Owner

dfdx commented Sep 25, 2024

What's your cluster config, e.g. cluster manager, remote or local workers, etc.? Usually, Spark creates project directories on workers dynamically, so putting jar files into a directory on the workers beforehand has no effect.

@joques
Author

joques commented Sep 25, 2024

I am running Spark in standalone mode on a 3-node cluster. I have one master and two workers. Each Spark instance is on a different node. The Kafka instance is running on the same node as the master. That's pretty much it.

@joques
Author

joques commented Sep 25, 2024

I made a few changes. I reverted the session back to spark.jars as follows:

spark = SparkSession.builder.appName("SoftwareTools").config("spark.jars", "/path/to/spark-sql-kafka-0-10_2.12-3.5.2.jar").master("spark://IP:7077").getOrCreate()

I also upgraded the version of Scala to 2.12.2 to match the version of the jar file. But I am now getting the error below. Any hint?

Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.kafka010.KafkaSourceProvider$
	at org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:338)
	at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:71)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:233)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:118)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:118)
	at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:36)
	at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:169)
	at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:145)

@dfdx
Owner

dfdx commented Sep 25, 2024

It looks like the Kafka jar is now being picked up, but the library versions still don't match. Let's try to align them. In your local installation of Spark.jl, find the file called pom.xml and change this:

    <java.version>1.11</java.version>
    <scala.version>2.13</scala.version>
    <scala.binary.version>2.13.6</scala.binary.version>

    <spark.version>[3.2.0,3.2.1]</spark.version>

to this:

    <java.version>YOUR ACTUAL JAVA VERSION</java.version>
    <scala.version>2.12</scala.version>
    <scala.binary.version>2.12.2</scala.binary.version>

    <spark.version>3.5.2</spark.version>

Then, in the Julia console, type:

] build

or

] build Spark

Once the build is complete (and successful), try to test your code again.
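If you prefer to stay in plain Julia code (e.g. in a script), the same rebuild can be triggered with the standard Pkg API; a minimal sketch:

using Pkg
Pkg.build("Spark")   # re-runs Spark.jl's build step, which should pick up the edited pom.xml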

@joques
Author

joques commented Sep 25, 2024

I made the suggested change. But my code is in a Pluto notebook. Do you think it will rebuild automatically if I restart the notebook? Spark is not available in the global space in Julia.

@joques
Author

joques commented Sep 25, 2024

After following the suggestion above and successfully building the Spark package, I am now observing two different errors:

1 - When I use the spark.jars.packages config option, I get this error:

Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.kafka010.KafkaSourceProvider$
	at org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:338)
	at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:71)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:233)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:118)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:118)
	at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:36)
	at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:169)
	at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:145)

2 - When I revert to the configuration using spark.jars with the path to the jar file, I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
	at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:599)
	at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
	at org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:338)
	at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:71)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:233)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:118)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:118)
	at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:36)
	at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:169)
	at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:145)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArraySerializer
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 10 more

@joques
Author

joques commented Sep 25, 2024

I added the Kafka clients package and it finally worked. Thanks for your assistance on that.
Now I am getting an error with the df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") code. I looked it up in the source, but the function is not there. Is it inherited? I'd like to extract the keys and values from the topic.

PS: This is the error

UndefVarError: `selectExpr` not defined
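As the later comments show, a workaround is to use the select and Column wrappers that Spark.jl does provide:

rec_evts = stream_df.select(Column("key"), Column("value"))   # returns the raw binary key/value columns

Note that this yields the raw binary columns rather than the CAST-to-STRING values of the selectExpr call.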

@joques
Author

joques commented Sep 25, 2024

May I please have a concrete example of how to read from and write to the Kafka stream with the Spark.jl implementation? I can't find any example in the package docs.

I did the following but it is not working.

rec_evts = stream_df.select(Column("key"), Column("value"))
query = rec_evts.writeStream().outputMode("complete").format("console").start()
query.awaitTermination()

Thanks

@dfdx
Owner

dfdx commented Sep 25, 2024

I added the Kafka clients package and it finally worked. Thanks for your assistance on that.

Great! Could you please add a bit more detail on what helped you resolve the issue? E.g. which exact package did you add, and how (jars, packages, manually)? This way, others will be able to resolve similar issues more quickly. Thanks in advance!

I did the following but it is not working.

Can you post the error you are getting? I haven't tested the Spark/Kafka integration thoroughly, but I don't immediately see any red flags in your code.

@dfdx
Owner

dfdx commented Sep 26, 2024

I made the suggested change. But my code is in a Pluto notebook. Do you think it will rebuild automatically if I restart the notebook? Spark is not available in the global space in Julia.

Sorry, I forgot to address this. You can also control the Spark and Scala versions using the BUILD_SPARK_VERSION and BUILD_SCALA_VERSION environment variables. For example:

ENV["BUILD_SPARK_VERSION"] = "3.5.2" 
ENV["BUILD_SCALA_VERSION"] = "2.12.2"
] build Spark

I guess this approach is easier to maintain in a notebook.
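For example, a notebook cell for the whole rebuild could look like this (a sketch; Pkg.build is the programmatic equivalent of ] build Spark):

ENV["BUILD_SPARK_VERSION"] = "3.5.2"
ENV["BUILD_SCALA_VERSION"] = "2.12.2"

using Pkg
Pkg.build("Spark")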

@joques
Author

joques commented Sep 26, 2024

Thanks for that. Could you also provide documentation on how to read and write a stream using the library? I've struggled to get it to access the data.

@dfdx
Owner

dfdx commented Sep 26, 2024

As I mentioned above, I don't have a proper environment to test it myself, but your code looks correct to me. If you can post the error you encounter, I may be able to suggest a fix.

@joques
Author

joques commented Sep 30, 2024

I've updated the session as follows:

spark = SparkSession.builder.appName("SoftwareTools").master("spark://196.216.167.103:7077").config("spark.jars","/home/sysdev/spark-3.5.2-bin-hadoop3/jars/commons-pool2-2.11.1.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/spark-sql-kafka-0-10_2.12-3.5.2.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/spark-streaming_2.12-3.5.2.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/kafka-clients-3.5.2.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/spark-token-provider-kafka-0-10_2.12-3.5.2.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/spark-tags-2_12-3.5.2.jar").getOrCreate()

Then when I do

stream_df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "st-streaming-session").option("includeHeaders", "true").option("startingOffsets", "earliest").load()

rec_evts = stream_df.select(Column("key"), Column("value"))

query = rec_evts.writeStream().outputMode("append").format("console").start().awaitTermination()

I get the following error

java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$
	at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.<init>(KafkaBatchPartitionReader.scala:53)
	at org.apache.spark.sql.kafka010.KafkaBatchReaderFactory$.createReader(KafkaBatchPartitionReader.scala:41)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.$anonfun$run$1(WriteToDataSourceV2Exec.scala:441)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1397)
	at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run(WriteToDataSourceV2Exec.scala:486)
	at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run$(WriteToDataSourceV2Exec.scala:425)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:491)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:388)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

What am I missing?

@joques
Author

joques commented Oct 1, 2024

Is it normal that the console print rec_evts.writeStream().outputMode("append").format("console").start().awaitTermination() returns an empty result when there is data in Kafka? I've tried this in the REPL and it just prints the header, no content. I tried similar code in Scala and it prints the data. How does the Spark.jl one work?

@dfdx
Owner

dfdx commented Oct 1, 2024

I have an idea of what may have gone wrong. Could you please try the streaming-fixes branch, i.e.:

] add Spark#streaming-fixes

and then run your code again?
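The non-REPL equivalent, in case that's easier from a notebook, would be a sketch like this using the standard Pkg API:

using Pkg
Pkg.add(PackageSpec(name="Spark", rev="streaming-fixes"))   # install the streaming-fixes branch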

@joques
Author

joques commented Oct 1, 2024

It still doesn't print anything. I get the following output:


Batch: 0

+-----+
|value|
+-----+
+-----+

It does run a micro-batch execution, though, and prints out the details below:

24/10/01 17:55:43 INFO MicroBatchExecution: Streaming query made progress: {
  "id" : "f5550c88-a331-4f4d-bef6-a8059ce42b03",
  "runId" : "323c29a5-5be2-4099-bf3c-bbe70c9484ba",
  "name" : null,
  "timestamp" : "2024-10-01T15:55:38.385Z",
  "batchId" : 0,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "addBatch" : 2754,
    "commitOffsets" : 71,
    "getBatch" : 42,
    "latestOffset" : 1334,
    "queryPlanning" : 693,
    "triggerExecution" : 5014,
    "walCommit" : 44
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[st-streaming-session]]",
    "startOffset" : null,
    "endOffset" : {
      "st-streaming-session" : {
        "0" : 15
      }
    },
    "latestOffset" : {
      "st-streaming-session" : {
        "0" : 15
      }
    },
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0,
    "metrics" : {
      "avgOffsetsBehindLatest" : "0.0",
      "maxOffsetsBehindLatest" : "0",
      "minOffsetsBehindLatest" : "0"
    }
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ConsoleTable$@2169e18e",
    "numOutputRows" : 0
  }
}

@dfdx
Owner

dfdx commented Oct 1, 2024

Can you show the complete Julia and Scala code?

Spark.jl is nothing more than a shallow wrapper around the Spark API. For instance, the format function is implemented as:

function format(writer::DataStreamWriter, fmt::String)
    jwriter = jcall(writer.jwriter, "format", JDataStreamWriter, (JString,), fmt)
    return DataStreamWriter(jwriter)
end

Here, writer is a Julia wrapper around the Java pointer jwriter, which points to an object of Spark's DataStreamWriter. The line:

jcall(writer.jwriter,   "format", JDataStreamWriter,     (JString,),        fmt       )
#            ^              ^               ^               ^                ^ 
#      java-object       fn-name       return-type        arg-types        arg-val

is thus just a JNI call to DataStreamWriter.format(String fmt), which returns a DataStreamWriter. So there's no added magic.

Yet, some of the Julia wrappers may be broken; as I said earlier, I never had a chance to test streaming extensively. If there's really no difference between the Scala and Julia versions, you can call the Spark API directly, e.g.:

using JavaCall
using Spark
import Spark: JDataStreamWriter

...
jwriter = rec_evts.writeStream().jwriter
jwriter = jcall(jwriter, "format", JDataStreamWriter, (JString,), fmt)
jwriter = jcall(jwriter, "outputMode", JDataStreamWriter, (JString,), m)
...

@joques
Author

joques commented Oct 1, 2024

Please find below both the Julia and the Scala versions. The Scala version shows the messages in spark-shell.
1 - Julia

using Spark

spark = SparkSession.builder.appName("SoftwareTools").master("spark://IP:7077").config("saprk.jars", "/home/sysdev/spark-3.5.2-bin-hadoop3/jars/commons-pool2-2.12.0.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/spark-streaming_2.12-3.5.2.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/spark-sql-kafka-0-10_2.12-3.5.2.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/kafka-clients-3.5.2.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/spark-token-provider-kafka-0-10_2.12-3.5.2.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/spark-tags-2.12-3.5.2.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/spark-streaming_2.12-3.5.2.jar").getOrCreate()

stream_df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "st-streaming-session").option("includeHeaders", "true").option("startingOffsets", "latest").load()

rec_evts = stream_df.select(Column("value"))

rec_evts.writeStream().format("console").outputMode("append").start().awaitTermination()

2 - Scala

import org.apache.spark.sql.SparkSession
import org.apache.commons.pool2.impl._

val sp_session = SparkSession.builder().appName("Test App").master("spark://IP:7077").config("spark.jars","/home/sysdev/spark-3.5.2-bin-hadoop3/jars/spark-sql-kafka-0-10_2.12-3.5.2.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/kafka-clients-3.5.2.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/spark-token-provider-kafka-0-10_2.12-3.5.2.jar,/home/sysdev/spark-3.5.2-bin-hadoop3/jars/spark-tags-2_12-3.5.2.jar").getOrCreate()

val stream_df = sp_session.readStream.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("subscribe","st-streaming-session").option("startingOffsets", "earliest").load()

val values = stream_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String,String)]

values.writeStream.outputMode("append").format("console").start().awaitTermination()

@dfdx
Owner

dfdx commented Oct 1, 2024

You use the latest offset in the Julia version and the earliest in the Scala version. Is that intended? How many messages do you add to the topic during the test?

@joques
Author

joques commented Oct 1, 2024

I've actually tried both offsets. But I get no output.

@dfdx
Owner

dfdx commented Oct 1, 2024

Here I'm trying to set up a dev environment to check this issue, but today I've run out of time. I'll try to allocate more time in the coming days.

@dfdx
Owner

dfdx commented Oct 1, 2024

By the way, what version of Java are you running?

@joques
Author

joques commented Oct 2, 2024

openjdk version "1.8.0_412"
OpenJDK Runtime Environment (build 1.8.0_412-b08)
OpenJDK 64-Bit Server VM (build 25.412-b08, mixed mode)

@dfdx
Owner

dfdx commented Oct 2, 2024

java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$

How did you manage to fix this issue after all? I'm getting a similar error in both the Julia and Scala versions.

@joques
Author

joques commented Oct 2, 2024

I downloaded the spark-token-provider-kafka-0-10_2.12-3.5.2.jar, commons-pool2-2.12.0.jar and spark-tags-2.12-3.5.2.jar

@joques
Author

joques commented Oct 2, 2024

Please tell me, based on the jwriter example, how do I start writing to the console? I am a bit lost on that part.

@dfdx
Owner

dfdx commented Oct 2, 2024

Please tell me, based on the jwriter example, how do I start writing to the console? I am a bit lost on that part.

It should be something like this:

using JavaCall
import Spark: JDataStreamWriter, JStreamingQuery

rec_evts = ...   # same as previously

jwriter = rec_evts.writeStream().jwriter
jwriter = jcall(jwriter, "format", JDataStreamWriter, (JString,), "console")
jwriter = jcall(jwriter, "outputMode", JDataStreamWriter, (JString,), "append")
jquery = jcall(jwriter, "start", JStreamingQuery, ())
jcall(jquery, "awaitTermination", Nothing, ())

If it works, then the problem is in the Spark.jl wrappers. If it doesn't, there may be a misconfiguration on the Apache Spark side.


I downloaded the spark-token-provider-kafka-0-10_2.12-3.5.2.jar, commons-pool2-2.12.0.jar and spark-tags-2.12-3.5.2.jar

It didn't work for me in a container, so I'm digging deeper.

@joques
Author

joques commented Oct 6, 2024

Any luck with this so far?

@dfdx
Owner

dfdx commented Oct 6, 2024

Unfortunately, no. Apparently, years of not using Spark washed away my experience of setting up the environment. I will give it another try next week.

Meanwhile, did you have a chance to test the jcall-based example from my previous message? If so, did you see any difference from the previous examples?

@joques
Author

joques commented Oct 6, 2024

The jcall-based example is still not printing. An almost identical example in Scala works, though.

@dfdx
Owner

dfdx commented Oct 6, 2024

So the problem must be in Spark's setup. One possible issue is a conflict between the versions of Julia / JavaCall / Java / Scala / Spark. In my Docker env, I noticed that some other pretty basic functions don't work either. Could you please run this simple code and tell me if it works?

using JavaCall

JavaCall.init()
listmethods(JString)

For me (Julia 1.10.5, JavaCall 0.8.0, OpenJDK 1.8.0_422) this gives:

Error calling Java: java.lang.NoSuchMethodError: forName

which is very unexpected. Do you experience it as well?

@joques
Author

joques commented Oct 6, 2024

I just tried it in the REPL but got no error. I got a list of methods. I also checked the versions of Julia, JavaCall and the JDK; they are the same as yours.
Below is a snapshot of the output:

java.lang.String toString()
 int hashCode()
 int compareTo(java.lang.Object)
 int compareTo(java.lang.String)
 int indexOf(java.lang.String, int)
 int indexOf(int)
 int indexOf(java.lang.String)
 int indexOf(int, int)
 java.lang.String valueOf(char)
 java.lang.String valueOf(java.lang.Object)
 java.lang.String valueOf(boolean)
 java.lang.String valueOf(char[], int, int)
 java.lang.String valueOf(char[])
 java.lang.String valueOf(double)
 java.lang.String valueOf(float)
...

@dfdx
Owner

dfdx commented Oct 6, 2024

Perhaps it's due to the underlying OSX. I will try it on a Linux VM a bit later.

@joques
Author

joques commented Oct 6, 2024

Okay! I am using CentOS 7.
Spark's behaviour with Julia intrigues me: the same configuration and almost the same code work in Scala but not in Julia. I can't fathom that.

@dfdx
Owner

dfdx commented Oct 6, 2024

Spark was not designed to be extended from outside. A good example is the SparkSubmit class, which is 1200+ lines of convoluted logic and special cases for Java, Scala and Python.

In the main branch, there's also a bug: we override the provided "spark.jars" with our own sparkjl.jar. So I'm actually surprised you were able to make the Kafka source visible to the Spark executors. Is there any chance you could put together a reproducible Dockerfile? If not, could you please share more details about your Spark installation and jar placement?


By the way, moving to Linux indeed fixed the issue with JavaCall.

@joques
Author

joques commented Oct 7, 2024

I will look into that later this week. I shall get back to you

@joques
Author

joques commented Oct 8, 2024

@dfdx is there a way I can contact you directly by email? I'd like to suggest some ideas regarding a PR.

@dfdx
Owner

dfdx commented Oct 8, 2024

@joques Please check out your GitHub notifications for a comment with my email (I've deleted it from the issue itself).
