
v1 streaming read test #652

Closed
wants to merge 4 commits into from

Conversation

@pjfanning (Collaborator) commented on Oct 2, 2022:

This test is based on @christianknoepfle's work in #651 - I just felt it useful to verify that the v1 data source handles streaming OK.

@pjfanning (Collaborator, Author):

@christianknoepfle is there a reason not to use the v1 data source? It feels like the v2 data source needs a lot of work to fix the streaming support - the v2 code is just far too eager to close all the underlying resources before they are even iterated over.
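
To illustrate the failure mode (a minimal, hypothetical sketch in plain Scala, not the actual spark-excel v2 code): if a reader closes its source as soon as it builds a lazy row iterator, the data is gone before anyone consumes it.

import java.io.InputStream
import scala.io.Source

// Hypothetical sketch of the eager-close pattern described above, not the
// real v2 code: the stream is closed before the lazy iterator returned to
// the caller has been consumed.
def brokenRows(open: () => InputStream): Iterator[String] = {
  val in   = open()
  val rows = Source.fromInputStream(in).getLines()
  in.close() // far too eager: the lazy iterator still needs the open stream
  rows       // consuming this later fails once the stream is really closed
             // (e.g. IOException: Stream closed for a FileInputStream)
}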

@christianknoepfle (Contributor):

v2 gives me folder reading and partitioning and works in the same way as csv/json etc. It would allow us to simplify our code base.
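
For context, a minimal sketch of what that enables (the directory layout, path, and option values here are illustrative, not from this PR):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("excel-v2-demo").getOrCreate()

// Like csv/json, the v2 "excel" format can read a whole directory and
// discover partition columns from the folder layout, e.g.
//   /data/reports/year=2022/part-0001.xlsx
val df = spark.read
  .format("excel")
  .option("header", "true")
  .load("/data/reports") // illustrative path

df.printSchema() // would include the discovered "year" partition column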


class MaxNumRowsSuite extends AnyWordSpec with DataFrameSuiteBase with Matchers {

  "excel v2 and maxNumRows" can {

@christianknoepfle (Contributor) commented on the diff:

should be excel v1 ;)

@pjfanning (Collaborator, Author) replied:

thanks - fixed

@nightscape (Owner):

@pjfanning the integration tests actually do test streaming:
https://github.com/crealytics/spark-excel/blob/main/src/test/scala/com/crealytics/spark/excel/IntegrationSuite.scala#L349
Do you feel that something is missing in the integration tests?
I'm a little wary of adding large binary files to the repo if we can handle this another way.

@pjfanning (Collaborator, Author):

@nightscape so the related PRs (#651, #653) try to fix the v2 data source streaming, and the existing tests didn't catch the issue. This PR uses the same xlsx that triggers the v2 data source problem, just for completeness. PRs #651 and #653 need this new 6.7 MB file anyway. Do you think 6.7 MB is going to be an issue?

@nightscape (Owner):

It would be interesting to see if the integration tests would have caught the issue in v2.
Would you mind locally changing the

val reader = spark.read.excel(dataAddress = s"'$sheetName'!A1", header = header)

to

val reader = spark.read.format("excel").option("dataAddress", s"'$sheetName'!A1").option("header", header)

here and seeing if it catches the issue as well?
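
Both forms only configure a DataFrameReader, so nothing is actually read until load plus an action; a sketch of how the changed line gets exercised (sheet name, header flag, and path are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("v2-check").getOrCreate()

val sheetName = "Sheet1" // placeholders; the integration suite generates these
val header    = true

val reader = spark.read
  .format("excel")
  .option("dataAddress", s"'$sheetName'!A1")
  .option("header", header)

// The action is what forces the read and so exercises the streaming code
// path that differs between v1 and v2.
val df = reader.load("/tmp/integration-test.xlsx") // placeholder path
df.count()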

@nightscape (Owner):

Generally I don't think the 6.7 MB themselves are going to hurt, but they go in a different direction than I had intended for the project:
I wrote the integration tests so that they test the full cycle of writing arbitrary data to disk and reading it back, all with spark-excel, and with various combinations of options: streaming yes/no, maxByteArraySize values, v1/v2 (which I unfortunately have not added yet).
That way, whenever we encounter a bug, we just adapt the random data generator to produce data that exhibits the bug, and it will be exposed in all implementations.
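
A minimal sketch of that round-trip idea (the data, paths, and option matrix are illustrative; maxRowsInMemory is the spark-excel option that switches on the streaming reader):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("roundtrip").getOrCreate()
import spark.implicits._

val original = (1 to 1000).map(i => (i, s"row-$i")).toDF("id", "value")

for {
  header          <- Seq(true, false)
  maxRowsInMemory <- Seq(None, Some(200)) // Some(_) selects the streaming reader
} {
  val path = s"/tmp/roundtrip-$header-${maxRowsInMemory.getOrElse("all")}.xlsx"
  original.write.format("excel").option("header", header).mode("overwrite").save(path)

  val reader = spark.read.format("excel").option("header", header).schema(original.schema)
  maxRowsInMemory.foreach(n => reader.option("maxRowsInMemory", n.toString))

  // A bug in any single combination of options surfaces right here.
  assert(reader.load(path).count() == original.count())
}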

I fear that if we add Excel files, this will become the standard way to test bugs (as already seems to be happening); one then has to take extra care to apply each file to all combinations of options, and it doesn't take the writing side into account.

@pjfanning (Collaborator, Author):

I'll close this because it is part of #653 anyway

@pjfanning closed this on Oct 4, 2022.
@nightscape deleted the v1-streaming-read-test branch on October 4, 2022 at 15:41.