
closed stream issue (issue #650) #651

Conversation

@christianknoepfle (Contributor) commented Oct 2, 2022

The issue this PR tries to fix is #650

I introduced ExcelPartitionReaderFromIterator and moved the problematic Workbook.close() into the close() method of this class. Now Spark takes care of closing the file.

The code change is pretty straightforward. Basically I moved the code from ExcelHelper.getRows and ExcelPartitionReaderFactory.readFile to the apply() of the newly introduced class.

In ExcelPartitionReaderFactory.buildReader() we create an instance of the new class and pass it to PartitionReaderWithPartitionValues, which Spark uses for reading the data:

val fileReader = ExcelPartitionReaderFromIterator(conf, parsedOptions, file, parser, headerChecker, readDataSchema)
new PartitionReaderWithPartitionValues(fileReader, readDataSchema, partitionSchema, file.partitionValues)

When Spark finishes reading, it calls close() on PartitionReaderWithPartitionValues, which in turn calls close() on the fileReader. There we call close() on the workbook, and the issue is solved.
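[Editor's sketch] The delegation chain described above can be sketched as follows. The class name matches the PR, but the constructor parameters and row plumbing here are simplified assumptions, not the actual spark-excel code:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.poi.ss.usermodel.Workbook

// Sketch: a PartitionReader that owns the workbook and closes it only
// when Spark signals it is done with the partition.
private[excel] class ExcelPartitionReaderFromIterator(
    workbook: Workbook,         // assumption: opened by a companion apply()
    rows: Iterator[InternalRow] // assumption: rows parsed from the workbook
) extends PartitionReader[InternalRow] {

  private var current: InternalRow = _

  override def next(): Boolean =
    if (rows.hasNext) { current = rows.next(); true } else false

  override def get(): InternalRow = current

  // Called via PartitionReaderWithPartitionValues.close(); deferring
  // Workbook.close() to this point avoids the closed-stream error (#650).
  override def close(): Unit = workbook.close()
}
```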

I am not really satisfied with the method signatures, but that was the best I could come up with in the given time. If someone has an idea on how to improve it, please let me know.

The PR doesn't address ExcelHelper.getRows(). I think this function could still cause issues, because when accessing the iterator we are reading from a closed workbook.
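[Editor's sketch] To illustrate the getRows() concern: a Scala Iterator is lazy, so closing the underlying resource before the caller consumes the iterator means every element is read from an already-closed stream. A minimal analogue with an ordinary file (hypothetical names, not spark-excel code):

```scala
import java.io.{BufferedReader, FileReader}

// Hypothetical analogue of the ExcelHelper.getRows hazard.
def lines(path: String): Iterator[String] = {
  val reader = new BufferedReader(new FileReader(path))
  val it = Iterator.continually(reader.readLine()).takeWhile(_ != null)
  reader.close() // like workbook.close() inside getRows: too early
  it             // consuming this later fails: the stream is already closed
}
```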

@pjfanning (Collaborator) commented Oct 2, 2022

Generally, this looks like a good way to go. If you can get your test to compile and work in all the CI builds, I think this is worth merging.



"excel v2 and maxNumRows" can {

s"read with maxNumRows=200" in {
@pjfanning (Collaborator):

nit: the s interpolation is not needed

@christianknoepfle (Contributor, Author):

removed

@@ -135,6 +135,7 @@ class ExcelHelper private (options: ExcelOptions) {
def getRows(conf: Configuration, uri: URI): Iterator[Vector[Cell]] = {
val workbook = getWorkbook(conf, uri)
val excelReader = DataLocator(options)
// todo this does not work with streaming reader
@pjfanning (Collaborator) commented Oct 2, 2022:

are you in a position to continue with the 'todo'?

@christianknoepfle (Contributor, Author):

not yet, I removed the TODO for now...

build.sbt Outdated
@@ -80,6 +80,9 @@ libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % testSparkVersion.value % "provided",
"org.apache.spark" %% "spark-sql" % testSparkVersion.value % "provided",
"org.apache.spark" %% "spark-hive" % testSparkVersion.value % "provided",
// added hadoop libs to test allowing to execute the tests locally (from within intellij)
"org.apache.hadoop" % "hadoop-common" % "3.3.1" % Test,
@pjfanning (Collaborator):

these lib additions are causing the CI (GitHub Actions) build to fail

@christianknoepfle (Contributor, Author):

Oh yeah, just found out ;) Do you know why? For local testing I had to add those.

@pjfanning (Collaborator):

I'm afraid I don't know the niceties of this CI. I presume the different parallel runs deliberately use different versions of Spark and Hadoop, so any of the jobs that need an older version of the Hadoop libs are affected by this change.

@christianknoepfle (Contributor, Author) commented Oct 2, 2022:

The libs are now only used when not running in CI. This allows testing locally without modifying build.sbt.
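[Editor's sketch] One way to make test-only dependencies conditional in build.sbt is to key off the CI environment variable, which GitHub Actions sets automatically; whether the PR used exactly this condition is an assumption:

```scala
// Sketch: add the Hadoop test lib only for local runs; GitHub Actions
// sets CI=true, so CI builds keep their own provided Hadoop versions.
libraryDependencies ++= {
  if (sys.env.contains("CI")) Seq.empty
  else Seq("org.apache.hadoop" % "hadoop-common" % "3.3.1" % Test)
}
```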

@pjfanning (Collaborator):

There are also compile issues in some of the CI builds:

[error] /home/runner/work/spark-excel/spark-excel/src/main/3.x/scala/com/crealytics/spark/v2/excel/ExcelPartitionReaderFromIterator.scala:17:55: type parameter InternalRow defined in class ExcelPartitionReaderFromIterator shadows class InternalRow defined in package catalyst. You may want to rename your type parameter, or possibly remove it.
[error] private[excel] class ExcelPartitionReaderFromIterator[InternalRow] private (
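[Editor's sketch] The error above means [InternalRow] was declared as a fresh type parameter, shadowing org.apache.spark.sql.catalyst.InternalRow. The fix is to drop the type parameter and extend PartitionReader with the concrete catalyst type (sketched signature; constructor parameters and members are elided):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.PartitionReader

// Before (declares a type parameter that shadows catalyst's InternalRow):
//   private[excel] class ExcelPartitionReaderFromIterator[InternalRow] private (...)

// After (no type parameter; uses the concrete catalyst InternalRow):
private[excel] class ExcelPartitionReaderFromIterator private (/* ... */)
    extends PartitionReader[InternalRow] {
  // next()/get()/close() implementations elided
}
```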

@christianknoepfle (Contributor, Author):

Heck, it works fine locally and fails on CI with the XML stream error message that should have been fixed by now :( Will look into it later.

@pjfanning mentioned this pull request Oct 2, 2022
@christianknoepfle (Contributor, Author):

@pjfanning the CI passes, but I could only achieve this by removing the workbook.close from ExcelHelper.getRows() for the streaming workbook. So this method still gets called in the V2 implementation and could cause resource issues (potentially on all machines that are not my local Windows box ;) ). I am not familiar with the code, but I assume getRows is called to determine the header etc., so it needs some overhaul. But I saw that you are also working on some alternative approach. Great :)

@pjfanning (Collaborator):

but I saw that you are also working on some alternative approach. Great :)

I'm just trying out something - feel free to review it but it's still WIP

My attempt may come to nought, but I think I'm making a little progress.

@christianknoepfle (Contributor, Author):

I will close this PR; it is replaced by #653.
