excessive memory usage when using embedded base64 images #228

Closed
lucribas opened this issue Nov 18, 2024 · 3 comments

@lucribas

I'm encountering a memory issue when generating PDFs with many embedded images using html-to-pdfmake. My backend uses html-to-pdfmake to convert an HTML document with images into the JSON document definition that pdfmake uses to generate a PDF.

Currently, my approach is to embed images in the HTML as base64 strings. However, storing the base64 image strings takes so much memory that my Node server often runs out of memory.
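
To put a rough number on the overhead described above: base64 encodes every 3 bytes as 4 characters, so the encoded string is about a third larger than the image itself, and the whole string has to exist in memory before it can be embedded in the HTML. The sketch below only measures that size increase. `base64Overhead` is a hypothetical helper written for illustration and is not part of html-to-pdfmake or pdfmake.

```javascript
// Hypothetical helper (not part of html-to-pdfmake): reports how much
// larger a payload becomes once it is base64-encoded.
function base64Overhead(byteLength) {
  const raw = Buffer.alloc(byteLength); // zero-filled stand-in for real image bytes
  const encoded = raw.toString('base64');
  return {
    rawBytes: byteLength,
    base64Chars: encoded.length,
    ratio: encoded.length / byteLength,
  };
}

// Example: a 3 MiB image
console.log(base64Overhead(3 * 1024 * 1024));
```

With many images per document, this overhead applies to each image, on top of whatever copies the PDF generation step makes while decoding them.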

Questions:

  • Are there any best practices for handling such cases that I might not be aware of?
  • Is there an alternative approach to handle images more efficiently with html-to-pdfmake?

Improvement Proposal:

I was considering opening a PR to improve how images are handled. My idea is to use streams so that pdfmake could consume a stream of bytes instead of relying on base64 strings. This could potentially reduce memory usage significantly.

Would this be a viable improvement for the library? If so, I'd be happy to contribute.

Looking forward to your feedback.
Thanks for your work on this project!

@Aymkdn
Owner

Aymkdn commented Nov 18, 2024

> My idea is to use streams so that pdfmake could consume a stream of bytes instead of relying on base64 strings. This could potentially reduce memory usage significantly.

In this case, the pull request must be made to the PDFMake project, which I don't own or manage. I'm using base64 because that's the format required by PDFMake. The PDFMake documentation does not provide any alternative methods for handling images in Node.js, apart from the base64 approach.

I'm closing this issue as there’s nothing further I can do on my side. You’ll need to reach out to the PDFMake creator for assistance.

Suggestion: You might consider generating your PDF one page at a time and then merging the pages into a single document. Alternatively, you could convert your HTML to a Canvas and export it to a PDF using the dropflow library.
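
A minimal sketch of the "one page at a time, then merge" idea, assuming the document can be split into independent HTML chunks. Both `renderChunk` (HTML chunk to PDF bytes) and `mergePdfs` (list of PDF buffers to one PDF) are hypothetical callbacks that the caller would have to implement with a PDF generator and a PDF merging tool; neither is provided by html-to-pdfmake. The point is only the sequencing: each chunk is fully rendered before the next one starts, so only one chunk's base64 images need to be alive at a time.

```javascript
// Render chunks strictly one after another, then merge the results.
// `renderChunk` and `mergePdfs` are hypothetical, caller-supplied functions.
async function renderInChunks(htmlChunks, renderChunk, mergePdfs) {
  const parts = [];
  for (const html of htmlChunks) {
    // Awaiting inside the loop keeps the work sequential, so the
    // previous chunk's images can be garbage-collected first.
    parts.push(await renderChunk(html));
  }
  return mergePdfs(parts);
}

// Usage with stand-in functions that only mimic the data flow
const fakeRender = async (html) => Buffer.from(`[pdf:${html}]`);
const fakeMerge = (parts) => Buffer.concat(parts);

renderInChunks(['<p>page 1</p>', '<p>page 2</p>'], fakeRender, fakeMerge)
  .then((merged) => console.log(merged.toString()));
```

The rendered parts are still collected in memory here; writing each part to a temporary file before merging would be a further step if the PDFs themselves are large.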

@Aymkdn closed this as completed Nov 18, 2024

@lucribas
Author

lucribas commented Nov 19, 2024

Hey, thanks a lot for the response and the clarification! Your input really helped me understand the situation better.

After digging into this further, I realized that adding support for image streams will actually require changes across three levels (one PR for each, at least):

  • PDFKit: Add native support for handling image streams.
  • PDFMake: Update it to use the new PDFKit feature and pass streams along.
  • html-to-pdfmake: Extend the library to handle image streams with a new mechanism.

By the way, do you think there are other types of embedded objects that could also benefit from a streaming approach? If so, I could try to design a more generic solution that covers these cases too.

For the html-to-pdfmake part, I’m thinking of adding a data-stream-id attribute to tags, so you can mark images for streaming. There would also be a new { imageStreams } parameter to pass an object of streams. Here’s an example of what it could look like:

Example
HTML:

<img src="" data-stream-id="stream1" />
<img src="" data-stream-id="stream2" />

Code:

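const htmlToPdfMake = require('html-to-pdfmake');
const fs = require('fs');

// HTML from the example above: each image is marked with a data-stream-id
const html = `
  <img src="" data-stream-id="stream1" />
  <img src="" data-stream-id="stream2" />
`;

// Example image streams, keyed by the data-stream-id they belong to
const imageStreams = {
  stream1: fs.createReadStream('./path/to/image1.jpg'),
  stream2: fs.createReadStream('./path/to/image2.png'),
};

// Proposed option only: html-to-pdfmake does not support `imageStreams` today
const pdfContent = htmlToPdfMake(html, { imageStreams });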

This way, the library could process streams smoothly while building the PDF without excessive memory usage.
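
One piece such an option would probably need is a check that every `data-stream-id` in the HTML has a matching entry in the streams object. The sketch below illustrates that check only; `findMissingStreams` is a hypothetical helper, not an existing html-to-pdfmake function, and the regular expression is a stand-in for real HTML parsing.

```javascript
// Hypothetical helper: list the data-stream-id values found on <img> tags
// that have no matching key in the imageStreams object.
function findMissingStreams(html, imageStreams) {
  const pattern = /<img\b[^>]*\bdata-stream-id="([^"]*)"/gi;
  const ids = [];
  for (const match of html.matchAll(pattern)) {
    ids.push(match[1]);
  }
  // Own-property check, so inherited names such as "toString" do not count
  return ids.filter(
    (id) => !Object.prototype.hasOwnProperty.call(imageStreams, id)
  );
}

const sampleHtml =
  '<img src="" data-stream-id="stream1" /><img src="" data-stream-id="stream2" />';
console.log(findMissingStreams(sampleHtml, { stream1: {} }));
```

Failing early on a missing id would avoid starting PDF generation for a document that cannot be completed.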

Let me know what you think of this approach or if you have any ideas to make it better. Thanks again for the guidance—it’s been super helpful! 😊

@Aymkdn
Owner

Aymkdn commented Nov 20, 2024

> do you think there are other types of embedded objects that could also benefit from a streaming approach?

I don't think so.

> For the html-to-pdfmake part, I’m thinking of adding a data-stream-id attribute to tags, so you can mark images for streaming. There would also be a new { imageStreams } parameter to pass an object of streams.

Before considering how to implement this in html-to-pdfmake, check if it’s even possible with the other libraries first 🙂

Keep in mind, it might take months before the feature is deployed in those libraries. It’s better to focus on them for now, and once the feature becomes available, let me know — I’ll work on integrating it into my library then 👍🏻
