V2 streaming read (alternative approach) #653
Conversation
@christianknoepfle this seems to be working now - have a look if you have time

cool :) I think I can kill my PR ;) (the core idea of the ExcelPartitionReaderFromIterator has made it into your approach, but with a cleaner interface, so I am happy ;) ). The only thing I didn't really like is that the CloseableIterator is not really an iterator - it is something that *contains* an iterator - so the name is a bit misleading (unfortunately I have no better idea for now). So we could find another name, or alternatively implement it as an iterator. In any case: thanks a lot for your effort. Hope the other folks are fine with your fixes and we see a new version soon (and then hopefully we can use spark-excel in our production code)
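One way to address the naming concern would be to make the wrapper a genuine `Iterator` that owns its resources. This is only a hypothetical sketch (names and structure are illustrative, not the actual spark-excel implementation):

```scala
import java.io.Closeable

// Sketch: an iterator that owns Closeable resources (e.g. workbooks)
// and releases them as soon as the underlying iterator is exhausted.
class CloseableIterator[T](underlying: Iterator[T], resources: Seq[Closeable])
    extends Iterator[T] with Closeable {

  private var closed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more) close() // nothing left to read, so release the workbooks
    more
  }

  override def next(): T = underlying.next()

  override def close(): Unit = if (!closed) {
    closed = true
    resources.foreach(r => try r.close() catch { case _: Exception => () })
  }
}
```

Implemented this way it *is* an `Iterator[T]`, so the name would no longer be misleading, and callers can still close it early via `Closeable`.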
```scala
workbook match {
  case _: StreamingWorkbook => CloseableIterator(rowIter, Seq(workbook))
  case _ => {
    workbook.close()
```
why do you close the workbook here instead of handing it to CloseableIterator()? That would be more consistent to me
this workbook doesn't need to be kept open, but I suppose it could be changed to work like the streaming workbook
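The suggestion above could be read as treating both workbook types uniformly. A hypothetical sketch (assuming a `CloseableIterator` that closes its resources once reading finishes):

```scala
// Sketch: hand ownership of the workbook to the iterator in both cases
// and let it close the workbook when the rows are fully consumed.
workbook match {
  case _: StreamingWorkbook => CloseableIterator(rowIter, Seq(workbook))
  case _                    => CloseableIterator(rowIter, Seq(workbook))
}
// ...which would collapse to a single call:
// CloseableIterator(rowIter, Seq(workbook))
```

The trade-off is that a non-streaming workbook already has all rows in memory, so closing it immediately (as the current code does) is also safe; deferring the close just makes the two branches symmetric.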
```scala
val dfExcel = spark.read
  .format("excel")
  .option("path", "src/test/resources/v2readwritetest/large_excel/largefile-wide-single-sheet.xlsx")
  .option("header", value = false)
```
should we test with inferSchema = true and header = true too, to get greater coverage? See src/test/scala/com/crealytics/spark/v2/excel/MaxNumRowsSuite.scala in my PR as an idea.
If you like that, you will need the new xlsx too (it has headers).
I agree that the name isn't great. I was thinking of renaming it to
@christianknoepfle I made the changes you asked for and copied your test changes. One issue: I get dataframe sizes that are 1 less than yours - for each of the 3 sub-tests. I haven't investigated yet whether my code strips a row that shouldn't be stripped or whether your PR overcounts.
Resolved review threads (outdated):
- src/main/2.4/scala/com/crealytics/spark/v2/excel/ExcelDataSource.scala (4 threads)
- src/main/3.0_3.1/scala/com/crealytics/spark/v2/excel/ExcelTable.scala (2 threads)
```scala
val numberOfRowToIgnore = if (options.header) (options.ignoreAfterHeader + 1) else 0
paths.tail.foreach(path => {
  val newRows = excelHelper.getSheetData(conf, path)
  sheetData = SheetData(
```
There's a lot of copied code here. I wonder at what point we DRY this up and move the code shared across the different Spark versions into common code.
```scala
val r = excelHelper.getColumnNames(headerRow)
rows = Iterator(headerRow) ++ rows
r
try {
```
Same here. I hadn't seen that there is so much overlap in the code, but we should get rid of it.
I've done some refactoring to reduce code duplication - probably more can be done
```scala
  .option("path", "src/test/resources/v2readwritetest/large_excel/largefile-wide-single-sheet.xlsx")
  .option("header", value = false)
  // .option("dataAddress", "'Sheet1'!B7:M16")
  .option("maxRowsInMemory", "200")
```
As mentioned in the other PR, the integration tests actually cover streaming reads:
https://github.com/crealytics/spark-excel/blob/main/src/test/scala/com/crealytics/spark/excel/IntegrationSuite.scala#L349
If there is anything missing there, I would prefer to extend them instead of adding new tests with large binary files.
I have been unable to get any of the existing tests to reproduce the issue - the datasets are too small
Overall this Iterator stuff is dangerous territory. It's bitten me more than once, because at some point some code was reading from an Iterator from which another Iterator was derived...

For me, test coverage is the main way to manage this. The code already relies on reading off headers and things like that, so I'm not sure how much can be done to ensure the iterators are not consumed at the wrong time. I covered some of the topics you asked me to look at, but there is more I can do. I haven't yet looked at a way to test the changes without the new file. I'll get back to that later.
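The hazard described above can be demonstrated in a few lines of plain Scala (an illustrative sketch, unrelated to the spark-excel code itself):

```scala
// Demonstrates why deriving one Iterator from another is dangerous:
// both views share the same underlying cursor.
val rows = Iterator("header", "row1", "row2")

// Derive a view to peel off the header, as header-reading code might do.
val header = rows.take(1)
header.foreach(println) // consumes the first element from `rows` too

// Per the Scala collections documentation, using `rows` after deriving
// and consuming `header` is unspecified: the header element is gone,
// and further reads may skip elements or misbehave.
```

This is exactly why consuming an iterator at the wrong time (e.g. reading headers, then handing the same iterator onward) is hard to guard against except through test coverage.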
You will need the new excel file - I just added a header row to it. Download from here:
thanks - I've updated my PR to use your latest xlsx file
force-pushed from 240d8d7 to fbb6a1a
Hey @pjfanning, I saw you already picked up my changes migrating from spark-testing-base to spark-fast-tests 👍 👍 I created a branch which applies the integration tests to v2 as well.
Is there anything holding up this PR? I can say that it finally fixed one of our production issues and is key to getting V2 stable. If you need help, please let me know.
If possible, I'd like to expose the bug without adding a huge
WDYT?
Co-authored-by: Martin Mauch <[email protected]>
force-pushed from ad77586 to 7f7f0dc
I've removed the large xlsx and the tests that use it
@pjfanning thanks a lot for your efforts and continued support!!
alternative to #651