closed stream issue (issue #650) #651
Conversation
Generally, this looks like a good way to go. If you can get your test to compile and work in all the CI builds, I think this is worth merging.
"excel v2 and maxNumRows" can { | ||
|
||
s"read with maxNumRows=200" in { |
nit: the s interpolation is not needed
removed
@@ -135,6 +135,7 @@ class ExcelHelper private (options: ExcelOptions) {
  def getRows(conf: Configuration, uri: URI): Iterator[Vector[Cell]] = {
    val workbook = getWorkbook(conf, uri)
    val excelReader = DataLocator(options)
+   // todo this does not work with streaming reader
are you in a position to continue with the 'todo'?
not yet, I removed the TODO for now...
build.sbt
Outdated
@@ -80,6 +80,9 @@ libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % testSparkVersion.value % "provided",
  "org.apache.spark" %% "spark-sql" % testSparkVersion.value % "provided",
  "org.apache.spark" %% "spark-hive" % testSparkVersion.value % "provided",
+ // added hadoop libs to test allowing to execute the tests locally (from within intellij)
+ "org.apache.hadoop" % "hadoop-common" % "3.3.1" % Test,
these lib additions are causing the CI (Github Actions) build to fail
oh yeah, just found out ;) Do you know why? I had to add those for local testing.
I'm afraid I don't know the niceties of this CI - I presume the different parallel runs are deliberately using different versions of Spark and Hadoop - so any of the jobs that need an older version of Hadoop libs are affected by this change
The libs are now only used when not running in CI. This allows testing locally without modifying build.sbt.
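For illustration, a minimal build.sbt sketch of that kind of guard, assuming GitHub Actions' standard CI environment variable is used as the switch (the exact condition and dependency list in the PR may differ):

```scala
// Sketch only: add the extra Hadoop test dependency when not running under CI,
// so local runs (e.g. from IntelliJ) get it while the CI matrix keeps its own
// Hadoop version. The "CI" env var check is an assumption, not the PR's exact code.
libraryDependencies ++= (
  if (sys.env.contains("CI")) Seq.empty
  else Seq("org.apache.hadoop" % "hadoop-common" % "3.3.1" % Test)
)
```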
There are also compile issues in some of the CI builds.
Heck, it works fine locally and fails on CI with the xml stream error message that should have been fixed by now :( Will look into it later.
@pjfanning the CI passes, but I could only achieve this by removing workbook.close from ExcelHelper.getRows() for streaming workbooks. So this method still gets called in the V2 implementation and could cause resource issues (potentially on all machines that are not my local Windows box ;) ). I am not familiar with the code, but I assume getRows is called to determine the header etc., so it needs some overhaul. But I saw that you are also working on an alternative approach. Great :)
I'm just trying out something - feel free to review it, but it's still WIP. My attempt may come to nought, but I think I'm making a little progress.
I will close this PR, replaced by #653.
The issue this PR tries to fix is #650
I introduced ExcelPartitionReaderFromIterator and moved the problematic Workbook.close() to the close() method of this class. Now spark takes care of closing the file.
The code change is pretty straightforward. Basically I moved the code from ExcelHelper.getRows and ExcelPartitionReaderFactory.readFile to the apply() of the newly introduced class.
In the ExcelPartitionReaderFactory.buildReader() we create an instance of the new class and pass it to PartitionReaderWithPartitionValues, which is used by spark for reading the data.
val fileReader = ExcelPartitionReaderFromIterator(conf, parsedOptions, file, parser, headerChecker, readDataSchema)
new PartitionReaderWithPartitionValues(fileReader, readDataSchema, partitionSchema, file.partitionValues)
When Spark finishes reading, it calls close() on PartitionReaderWithPartitionValues, which in turn calls close() on the fileReader. There we call close() on the workbook, and the issue is solved.
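To illustrate the shape of such a reader wrapper, here is a minimal sketch assuming a simplified constructor (the actual class in this PR takes conf, parsedOptions, file, parser, headerChecker and readDataSchema, as shown above, and its internals may differ):

```scala
// Hypothetical, simplified sketch of a partition reader that owns the workbook
// and closes it only when Spark signals it is done with the partition.
import org.apache.poi.ss.usermodel.Workbook
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.PartitionReader

class ExcelPartitionReaderFromIterator(
    workbook: Workbook,            // kept so the underlying file can be released later
    rows: Iterator[InternalRow]    // rows parsed from that workbook
) extends PartitionReader[InternalRow] {

  private var current: InternalRow = _

  override def next(): Boolean =
    if (rows.hasNext) { current = rows.next(); true } else false

  override def get(): InternalRow = current

  // Spark calls close() (via PartitionReaderWithPartitionValues) once the partition
  // has been fully read; closing the workbook here instead of inside
  // ExcelHelper.getRows avoids reading from an already-closed stream.
  override def close(): Unit = workbook.close()
}
```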
I am not really satisfied with the method signatures, but that was the best I could come up with in the given time. If someone has an idea on how to improve it, please let me know.
The PR doesn't address ExcelHelper.getRows(). I think this function could still cause issues, because when accessing the returned iterator we are reading from a closed workbook.
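To make that concern concrete, a simplified sketch of the getRows flow; everything beyond the lines shown in the diff above (in particular the readFrom call and the close placement) is an assumption for illustration, not the actual spark-excel code:

```scala
// If the workbook is closed before the lazily evaluated iterator is consumed,
// a streaming-backed iterator ends up reading from a closed stream.
def getRows(conf: Configuration, uri: URI): Iterator[Vector[Cell]] = {
  val workbook = getWorkbook(conf, uri)
  val excelReader = DataLocator(options)
  val rows = excelReader.readFrom(workbook) // assumed lazy: rows are pulled from the workbook on demand
  workbook.close()                          // closing here invalidates `rows` for streaming workbooks
  rows
}
```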