excessive memory usage when using embedded base64 images #228

lucribas · 2024-11-18T18:25:14Z

I'm running into a memory issue when generating PDFs with many embedded images using html-to-pdfmake. My backend uses html-to-pdfmake to convert an HTML document with images into JSON, which pdfmake then uses to generate a PDF document.

Currently, I embed the images in the HTML as base64 strings. However, holding all of those base64 image strings in memory at once uses so much memory that my Node.js server often runs out of it.
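As a rough illustration of where the memory goes, here is a minimal sketch of the base64 overhead on its own. The 3 MiB buffer is a made-up stand-in for an image file, not a measurement from a real document; base64 encodes every 3 bytes as 4 characters, so the string is about a third larger than the binary data before any other copies are made.

```javascript
// Illustration only: the buffer size is an assumption, not real image data.
const raw = Buffer.alloc(3 * 1024 * 1024); // stand-in for a 3 MiB image file
const encoded = raw.toString('base64');    // what ends up in <img src="data:...;base64,...">

console.log(`raw bytes:    ${raw.length}`);     // 3145728
console.log(`base64 chars: ${encoded.length}`); // 4194304
console.log(`ratio:        ${(encoded.length / raw.length).toFixed(2)}`); // 1.33
```

With many images, every one of these strings has to exist in memory before the conversion starts, in addition to whatever the PDF generation itself allocates.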

Questions:

Are there any best practices for handling such cases that I might not be aware of?
Is there an alternative approach to handle images more efficiently with html-to-pdfmake?

Improvement Proposal:

I was considering opening a PR to improve how images are handled. My idea is to use streams so that pdfmake could consume a stream of bytes instead of relying on base64 strings. This could potentially reduce memory usage significantly.
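To show what "consuming a stream of bytes" means in terms of memory, here is a minimal sketch that reads a file through one fixed-size buffer instead of loading it whole. The temporary file, the 64 KiB chunk size, and the idea of handing each chunk to a PDF writer are assumptions for illustration; nothing here is an existing pdfmake or html-to-pdfmake API.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Throwaway 1 MiB file standing in for an image (assumption for the demo).
const file = path.join(os.tmpdir(), `chunk-demo-${process.pid}.bin`);
fs.writeFileSync(file, Buffer.alloc(1024 * 1024, 1));

// Only this 64 KiB buffer is held at any time, whatever the file size.
const chunk = Buffer.alloc(64 * 1024);
const fd = fs.openSync(file, 'r');
let total = 0;
let bytesRead;
while ((bytesRead = fs.readSync(fd, chunk, 0, chunk.length, null)) > 0) {
  // A streaming consumer would process chunk.subarray(0, bytesRead) here.
  total += bytesRead;
}
fs.closeSync(fd);
fs.unlinkSync(file);

console.log(`processed ${total} bytes using a ${chunk.length}-byte buffer`);
```

Whether a PDF writer can actually embed an image this way depends on the image format and on the library, which is the open question for the proposal.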

Would this be a viable improvement for the library? If so, I'd be happy to contribute.

Looking forward to your feedback.
Thanks for your work on this project!

Aymkdn · 2024-11-18T18:34:45Z

> My idea is to use streams so that pdfmake could consume a stream of bytes instead of relying on base64 strings. This could potentially reduce memory usage significantly.

In this case, the pull request must be made to the PDFMake project, which I don't own or manage. I'm using base64 because that's the format required by PDFMake. The PDFMake documentation does not provide any alternative methods for handling images in Node.js, apart from the base64 approach.

I'm closing this issue as there’s nothing further I can do on my side. You’ll need to reach out to the PDFMake creator for assistance.

Suggestion: You might consider generating your PDF one page at a time and then merging the pages into a single document. Alternatively, you could convert your HTML to a Canvas and export it to a PDF using the dropflow library.

lucribas · 2024-11-19T22:07:00Z

Hey, thanks a lot for the response and the clarification! Your input really helped me understand the situation better.

After digging into this further, I realized that adding support for image streams will actually require changes across three levels (one PR for each, at least):

PDFKit: Add native support for handling image streams.
PDFMake: Update it to use the new PDFKit feature and pass streams along.
html-to-pdfmake: Extend the library to handle image streams with a new mechanism.

By the way, do you think there are other types of embedded objects that could also benefit from a streaming approach? If so, I could try to design a more generic solution that covers these cases too.

For the html-to-pdfmake part, I’m thinking of adding a data-stream-id attribute to tags, so you can mark images for streaming. There would also be a new { imageStreams } parameter to pass an object of streams. Here’s an example of what it could look like:

Example
HTML:

<img src="" data-stream-id="stream1" />
<img src="" data-stream-id="stream2" />

Code:

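const htmlToPdfMake = require('html-to-pdfmake');
const fs = require('fs');

// HTML from the example above: images are marked with data-stream-id
const html = `
  <img src="" data-stream-id="stream1" />
  <img src="" data-stream-id="stream2" />
`;

// Example image streams, keyed by their data-stream-id
const imageStreams = {
  stream1: fs.createReadStream('./path/to/image1.jpg'),
  stream2: fs.createReadStream('./path/to/image2.png'),
};

// Proposed option; it does not exist in html-to-pdfmake today
const pdfContent = htmlToPdfMake(html, { imageStreams });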

This way, the library could process streams smoothly while building the PDF without excessive memory usage.
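As a sketch of the matching step only, the following shows one way the data-stream-id attributes could be paired with the entries of an imageStreams object. The helper name `matchImageStreams`, the regular expression, and the in-memory demo streams are all hypothetical; none of this is part of html-to-pdfmake, and a real implementation would work on the parsed DOM rather than on the raw HTML string.

```javascript
const { Readable } = require('stream');

// Hypothetical helper, not an html-to-pdfmake API: pairs each
// data-stream-id found in the HTML with a stream from imageStreams.
function matchImageStreams(html, imageStreams) {
  const ids = [];
  const pattern = /<img\b[^>]*\bdata-stream-id="([^"]+)"[^>]*>/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    ids.push(match[1]);
  }

  // Fail early if an image is marked for streaming but no stream was passed.
  const missing = ids.filter(
    (id) => !Object.prototype.hasOwnProperty.call(imageStreams, id)
  );
  if (missing.length > 0) {
    throw new Error(`No stream provided for: ${missing.join(', ')}`);
  }
  return ids.map((id) => ({ id, stream: imageStreams[id] }));
}

// Demo input; in-memory streams stand in for fs.createReadStream(...).
const demoHtml =
  '<img src="" data-stream-id="stream1" />' +
  '<img src="" data-stream-id="stream2" />';
const demoStreams = {
  stream1: Readable.from([Buffer.from('fake jpeg bytes')]),
  stream2: Readable.from([Buffer.from('fake png bytes')]),
};

const matched = matchImageStreams(demoHtml, demoStreams);
console.log(matched.map((m) => m.id)); // [ 'stream1', 'stream2' ]
```

Throwing on a missing id is a design choice for the sketch: it surfaces a mismatch between the HTML and the streams object before any PDF generation starts.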

Let me know what you think of this approach or if you have any ideas to make it better. Thanks again for the guidance—it’s been super helpful! 😊

Aymkdn · 2024-11-20T08:38:07Z

> do you think there are other types of embedded objects that could also benefit from a streaming approach?

I don't think so.

> For the html-to-pdfmake part, I’m thinking of adding a data-stream-id attribute to tags, so you can mark images for streaming. There would also be a new { imageStreams } parameter to pass an object of streams.

Before considering how to implement this in html-to-pdfmake, check if it’s even possible with the other libraries first 🙂

Keep in mind, it might take months before the feature is deployed in those libraries. It’s better to focus on them for now, and once the feature becomes available, let me know — I’ll work on integrating it into my library then 👍🏻

Aymkdn closed this as completed Nov 18, 2024
