closed stream issue (issue #650) #651
Conversation
Generally, this looks like a good way to go. If you can get your test to compile and work in all the CI builds, I think this is worth merging.
"excel v2 and maxNumRows" can { | ||
|
||
s"read with maxNumRows=200" in { |
nit: the s interpolation is not needed
removed
@@ -135,6 +135,7 @@ class ExcelHelper private (options: ExcelOptions) {
  def getRows(conf: Configuration, uri: URI): Iterator[Vector[Cell]] = {
    val workbook = getWorkbook(conf, uri)
    val excelReader = DataLocator(options)
+   // todo this does not work with streaming reader
are you in a position to continue with the 'todo'?
not yet, I removed the TODO for now...
build.sbt
Outdated
@@ -80,6 +80,9 @@ libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % testSparkVersion.value % "provided",
  "org.apache.spark" %% "spark-sql" % testSparkVersion.value % "provided",
  "org.apache.spark" %% "spark-hive" % testSparkVersion.value % "provided",
+ // added hadoop libs to test allowing to execute the tests locally (from within intellij)
+ "org.apache.hadoop" % "hadoop-common" % "3.3.1" % Test,
these lib additions are causing the CI (Github Actions) build to fail
oh yeah, just found out ;) Do you know why? I had to add those for local testing.
I'm afraid I don't know the niceties of this CI - I presume the different parallel runs are deliberately using different versions of Spark and Hadoop - so any of the jobs that need an older version of Hadoop libs are affected by this change
The libs are now only used when not running in CI. This allows testing locally without modifying build.sbt.
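For illustration, a minimal build.sbt sketch of that kind of guard, assuming GitHub Actions' standard CI environment variable is used as the switch (the exact condition and dependency list in the PR may differ):

```scala
// Sketch only: add the extra Hadoop test dependency when not running under CI,
// so local runs (e.g. from IntelliJ) get it while the CI matrix keeps its own
// Hadoop version. The "CI" env var check is an assumption, not the PR's exact code.
libraryDependencies ++= (
  if (sys.env.contains("CI")) Seq.empty
  else Seq("org.apache.hadoop" % "hadoop-common" % "3.3.1" % Test)
)
```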
There are also compile issues in some of the CI builds.
Heck, it works fine locally and fails on CI with the xml stream error message that should have been fixed by now :( Will look into it later.
@pjfanning the CI passes, but I could only achieve this by removing workbook.close from ExcelHelper.getRows() for streaming workbooks. So this method still gets called in the V2 implementation and could cause resource issues (potentially on all machines that are not my local Windows box ;) ). I am not familiar with the code, but I assume getRows is called to determine the header etc., so it needs some overhaul. But I saw that you are also working on an alternative approach. Great :)
I'm just trying out something - feel free to review it, but it's still WIP. My attempt may come to nought, but I think I'm making a little progress.
I will close this PR, replaced by #653.
The issue this PR tries to fix is #650
I introduced ExcelPartitionReaderFromIterator and moved the problematic Workbook.close() to the close() method of this class. Now spark takes care of closing the file.
The code change is pretty straightforward. Basically I moved the code from ExcelHelper.getRows and ExcelPartitionReaderFactory.readFile to the apply() of the newly introduced class.
In the ExcelPartitionReaderFactory.buildReader() we create an instance of the new class and pass it to PartitionReaderWithPartitionValues, which is used by spark for reading the data.
val fileReader = ExcelPartitionReaderFromIterator(conf, parsedOptions, file, parser, headerChecker, readDataSchema)
new PartitionReaderWithPartitionValues(fileReader, readDataSchema, partitionSchema, file.partitionValues)
When Spark finishes reading, it calls close() on PartitionReaderWithPartitionValues, which in turn calls close() on the fileReader. There we call close() on the workbook, and the issue is solved.
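To illustrate the shape of such a reader wrapper, here is a minimal sketch assuming a simplified constructor (the actual class in this PR takes conf, parsedOptions, file, parser, headerChecker and readDataSchema, as shown above, and its internals may differ):

```scala
// Hypothetical, simplified sketch of a partition reader that owns the workbook
// and closes it only when Spark signals it is done with the partition.
import org.apache.poi.ss.usermodel.Workbook
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.PartitionReader

class ExcelPartitionReaderFromIterator(
    workbook: Workbook,            // kept so the underlying file can be released later
    rows: Iterator[InternalRow]    // rows parsed from that workbook
) extends PartitionReader[InternalRow] {

  private var current: InternalRow = _

  override def next(): Boolean =
    if (rows.hasNext) { current = rows.next(); true } else false

  override def get(): InternalRow = current

  // Spark calls close() (via PartitionReaderWithPartitionValues) once the partition
  // has been fully read; closing the workbook here instead of inside
  // ExcelHelper.getRows avoids reading from an already-closed stream.
  override def close(): Unit = workbook.close()
}
```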
I am not really satisfied with the method signatures, but that was the best I could come up with in the given time. If someone has an idea on how to improve it, please let me know.
The PR doesn't address ExcelHelper.getRows(). I think this function could still cause issues, because when accessing the returned iterator we are reading from a closed workbook.
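To make that concern concrete, a simplified sketch of the getRows flow; everything beyond the lines shown in the diff above (in particular the readFrom call and the close placement) is an assumption for illustration, not the actual spark-excel code:

```scala
// If the workbook is closed before the lazily evaluated iterator is consumed,
// a streaming-backed iterator ends up reading from a closed stream.
def getRows(conf: Configuration, uri: URI): Iterator[Vector[Cell]] = {
  val workbook = getWorkbook(conf, uri)
  val excelReader = DataLocator(options)
  val rows = excelReader.readFrom(workbook) // assumed lazy: rows are pulled from the workbook on demand
  workbook.close()                          // closing here invalidates `rows` for streaming workbooks
  rows
}
```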