
v1 streaming read test #652

Closed
wants to merge 4 commits into from

Conversation

@pjfanning (Collaborator) commented on Oct 2, 2022:

This test is based on @christianknoepfle's work in #651 - I just felt it useful to verify that the v1 data source handles streaming OK.

@pjfanning (Collaborator, Author):

@christianknoepfle is there a reason not to use the v1 data source? It feels like the v2 data source needs a lot of work to fix the streaming support - the v2 code is just far too eager to close all the underlying resources before they are even iterated over.
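
To illustrate the failure mode (a minimal, hypothetical sketch in plain Scala, not the actual spark-excel v2 code): if a reader closes its source as soon as it builds a lazy row iterator, the data is gone before anyone consumes it.

import java.io.InputStream
import scala.io.Source

// Hypothetical sketch of the eager-close pattern described above, not the
// real v2 code: the stream is closed before the lazy iterator returned to
// the caller has been consumed.
def brokenRows(open: () => InputStream): Iterator[String] = {
  val in   = open()
  val rows = Source.fromInputStream(in).getLines()
  in.close() // far too eager: the lazy iterator still needs the open stream
  rows       // consuming this later fails once the stream is really closed
             // (e.g. IOException: Stream closed for a FileInputStream)
}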

@christianknoepfle (Contributor):

v2 gives me folder reading and partitioning and works in the same way as csv/json etc. It would allow us to simplify our code base.
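
For context, a minimal sketch of what that enables (the directory layout, path, and option values here are illustrative, not from this PR):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("excel-v2-demo").getOrCreate()

// Like csv/json, the v2 "excel" format can read a whole directory and
// discover partition columns from the folder layout, e.g.
//   /data/reports/year=2022/part-0001.xlsx
val df = spark.read
  .format("excel")
  .option("header", "true")
  .load("/data/reports") // illustrative path

df.printSchema() // would include the discovered "year" partition column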


class MaxNumRowsSuite extends AnyWordSpec with DataFrameSuiteBase with Matchers {

  "excel v2 and maxNumRows" can {

@christianknoepfle (Contributor) commented on the diff:

should be excel v1 ;)

@pjfanning (Collaborator, Author) replied:

thanks - fixed

@nightscape (Owner):

@pjfanning the integration tests actually do test streaming:
https://github.com/crealytics/spark-excel/blob/main/src/test/scala/com/crealytics/spark/excel/IntegrationSuite.scala#L349
Do you feel that something is missing in the integration tests?
I'm a little wary of adding large binary files to the repo if we can handle this another way.

@pjfanning (Collaborator, Author):

@nightscape so the related PRs (#651, #653) try to fix the v2 data source streaming, and the existing tests didn't catch the issue. This PR uses the same xlsx that triggers the v2 data source problem, just for completeness. PRs #651 and #653 need this new 6.7 MB file anyway. Do you think 6.7 MB is going to be an issue?

@nightscape (Owner):

It would be interesting to see if the integration tests would have caught the issue in v2.
Would you mind locally changing the

val reader = spark.read.excel(dataAddress = s"'$sheetName'!A1", header = header)

to

val reader = spark.read.format("excel").option("dataAddress", s"'$sheetName'!A1").option("header", header)

here and seeing if it catches the issue as well?
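
Both forms only configure a DataFrameReader, so nothing is actually read until load plus an action; a sketch of how the changed line gets exercised (sheet name, header flag, and path are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("v2-check").getOrCreate()

val sheetName = "Sheet1" // placeholders; the integration suite generates these
val header    = true

val reader = spark.read
  .format("excel")
  .option("dataAddress", s"'$sheetName'!A1")
  .option("header", header)

// The action is what forces the read and so exercises the streaming code
// path that differs between v1 and v2.
val df = reader.load("/tmp/integration-test.xlsx") // placeholder path
df.count()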

@nightscape (Owner):

Generally I don't think the 6.7 MB themselves are going to hurt, but they go in a different direction than I had intended for the project:
I wrote the integration tests so that they test the full cycle of writing arbitrary data to disk and reading it back, all with spark-excel, and with various combinations of options: streaming yes/no, maxByteArraySize values, v1/v2 (which I unfortunately have not added yet).
That way, whenever we encounter a bug, we just adapt the random data generator to produce data that exhibits the bug, and it will be exposed in all implementations.
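
A minimal sketch of that round-trip idea (the data, paths, and option matrix are illustrative; maxRowsInMemory is the spark-excel option that switches on the streaming reader):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("roundtrip").getOrCreate()
import spark.implicits._

val original = (1 to 1000).map(i => (i, s"row-$i")).toDF("id", "value")

for {
  header          <- Seq(true, false)
  maxRowsInMemory <- Seq(None, Some(200)) // Some(_) selects the streaming reader
} {
  val path = s"/tmp/roundtrip-$header-${maxRowsInMemory.getOrElse("all")}.xlsx"
  original.write.format("excel").option("header", header).mode("overwrite").save(path)

  val reader = spark.read.format("excel").option("header", header).schema(original.schema)
  maxRowsInMemory.foreach(n => reader.option("maxRowsInMemory", n.toString))

  // A bug in any single combination of options surfaces right here.
  assert(reader.load(path).count() == original.count())
}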

I fear that if we add Excel files, this will become the standard way to test bugs (as already seems to be happening); one then has to take extra care to apply each file to all combinations of options, and it doesn't take the writing side into account.

@pjfanning (Collaborator, Author):

I'll close this because it is part of #653 anyway

@pjfanning closed this on Oct 4, 2022.
@nightscape deleted the v1-streaming-read-test branch on October 4, 2022 at 15:41.