Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to read XLSX from file and not wait it's entire download to process it #103

Open
oviniciuslara opened this issue Mar 15, 2022 · 8 comments

Comments

@oviniciuslara
Copy link

oviniciuslara commented Mar 15, 2022

Hey, I have a question. How can we process one XLSX file stored remotely, in this example as URL, without waiting the code to entire download it first? For large files it takes a long time.

Example:

InputStream is = new URL("https://filebin.net/qe5ynsl7ikmzap8a/LINEITEM_6M.xlsx").openStream();
Workbook workbook = StreamingReader
		.builder()
		.rowCacheSize(100)
		.bufferSize(4096) 
		.open(is); 
for (Sheet sheet : workbook) {
	System.out.println(sheet.getSheetName());
	for (Row r : sheet) {
		for (Cell c : r) {
			System.out.println(c.getStringCellValue());
		}
	}
}
@pjfanning
Copy link
Owner

isn't this the same as #38 ?

@pjfanning
Copy link
Owner

pjfanning commented Mar 15, 2022

One suggestion would be to try https://github.com/dhatim/fastexcel. excel-streaming-reader uses Apache POI code to read the initial zip file (xlsx files are basically zip files) and this POI code reads the full file before making the inner data available to parse. fastexcel may use a different approach that might make the sheet data available earlier.

The way that excel file is split across multiple files inside the zip probably makes it pretty unlikely that most excel files can be read without buffering the whole file. Sheet XML references shared string data in a separate sharedStrings.xml file and the style data is in another file.

One scenario that might be fully streamable would be an xlsx that was not password protected and where sharedStrings.xml is not used. POI's SXSSFWorkbook can produce a file in this format but that is just one way to produce xlsx and other approaches would not be fully streamable. A variant of the existing excel-streaming-reader code could probably be produced to handle this case but I'm not volunteering to do it.

If you truly want to parse the data in a fully streaming way, I would suggest using a different data format - CSV, XML, JSON - these formats are much more streamable than xlsx.

@oviniciuslara
Copy link
Author

Thanks for the quick response. Unfortunately the file format is not possible to be changed. Do you have any idea how can I change the library to work accordingly?

@pjfanning
Copy link
Owner

pjfanning commented Mar 15, 2022

If you read my comment above - in particular, the paragraph beginning with 'One scenario that might be fully streamable'. Only in this scenario, this algorithm will work.

I would like to reiterate that Excel is a really bad format for streaming large data over the internet - the format was not designed for this use case.

@oviniciuslara
Copy link
Author

oviniciuslara commented Mar 15, 2022

In my use case I only need to read the first 1000 rows of the sheet and I don't need the exact values of the cells (just the non formulated content), this would ease things, right?

@pjfanning
Copy link
Owner

In my use case I only need to read the first 1000 rows of the sheet and I don't need the exact values of the cells (just the non formulated content), this would ease things, right?

no - excel usually puts all the strings in a different file from the sheet data - this is not streaming friendly - if the xlsx only has numbers, then you might be ok

@oviniciuslara
Copy link
Author

I see, well I will have to look a little deeper. I will try to post the solution here.

In the meantime, do you have any clue how to do this same stream fashion with XLS files? Or know if it’s even possible?

@pjfanning
Copy link
Owner

xls is not streamable at all - Microsoft did not write their formats with your use case in mind - there are much much better data formats out there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants