
Empty array filtering issue #1268

Open
vokuk-walkme opened this issue Jul 23, 2024 · 1 comment
vokuk-walkme commented Jul 23, 2024

I have a piece of code that reads the same data from either GCS or BigQuery. A simplified schema for the data looks like this:

root
 |-- time: timestamp (nullable = true)
 |-- name: string (nullable = true)
 |-- ids: array (nullable = true)
 |    |-- element: string (containsNull = true)

I need to read only the rows where the `ids` array is not empty, and this seems to work fine with the GCS connector:

spark.read.format("parquet")
  .load("gs://some-path")
  .filter($"ids" =!= Array.empty[String])
  ... // other filters

However, when I try to do the same with the BigQuery connector, I get an error:

spark.read.format("bigquery")
  .option("viewsEnabled", "true")
  .load("internal-table-name")
  .filter($"ids" =!= Array.empty[String])
  ... // other filters

24/07/23 11:00:30 INFO ReadSessionCreator: |creation a read session for table null, parameters: |selectedFields=[name,ids],|filter=[(NOT (`ids` = ArraySeq())) AND (`ids` IS NOT NULL) AND (`name` = 'John') AND (`name` IS NOT NULL) AND (`time` <= TIMESTAMP '2024-07-18T23:59:59.999999Z') AND (`time` >= TIMESTAMP '2024-07-18T00:00:00Z') AND (`time` IS NOT NULL)]
com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.InvalidArgumentException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Query error: Function not found: ArraySeq at [1:189]
        at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:92)
        at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
        at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
        at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
        at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
...

It seems like predicate pushdown doesn't work correctly for this case: judging by the log above, the Scala empty-array literal is serialized into the pushed-down filter verbatim as `ArraySeq()`, which BigQuery does not recognize as a function.
Changing the filter condition to .filter(array_size($"ids") > 0) works, but that condition is most likely not going to be pushed down.
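One possible workaround (a sketch only, not verified against this connector version; the project, dataset, and table names below are placeholders) is to sidestep Spark-side pushdown entirely and express the non-empty-array condition in BigQuery SQL via the connector's `query` option, using BigQuery's ARRAY_LENGTH function:

```scala
// Sketch: filter empty arrays inside BigQuery itself, so Spark never has
// to push down an array literal. Requires viewsEnabled=true and a
// materialization dataset per the connector's query-option docs.
val df = spark.read.format("bigquery")
  .option("viewsEnabled", "true")
  .option("materializationDataset", "tmp_dataset") // placeholder dataset
  .option("query",
    """SELECT time, name, ids
      |FROM `my-project.my_dataset.internal_table`
      |WHERE ARRAY_LENGTH(ids) > 0""".stripMargin)
  .load()
```

Remaining Spark-side filters can then be chained on `df` as before; only the array condition needs to live in the SQL string.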

Dependencies used:

"org.apache.spark" %% "spark-sql" % "3.5.0",
"com.google.cloud.bigdataoss" % "gcs-connector" % "3.0.0",
"com.google.cloud.spark" % "[spark-3.5-bigquery / spark-bigquery-with-dependencies]" % "0.39.1"
vishalkarve15 self-assigned this Jul 29, 2024
vokuk-walkme (author) commented:

Any updates on this?
