Write images to PDF as soon as possible (otherwise easy to run out of memory) #291

Open
andrzejbl opened this issue Nov 1, 2022 · 8 comments

Comments

@andrzejbl

andrzejbl commented Nov 1, 2022

Hi!

I'm having a strange issue with a large number of images.
It looks like streaming waits for .end() before pushing any data to the file.
As a result, I'm running out of memory. output.pdf stays empty the whole time until .end() is called.
What am I missing?

import { createWriteStream, readdirSync } from 'fs';
import * as pdf from 'pdfjs';
import sharp from 'sharp';

const doc = new pdf.Document({
  font: require('pdfjs/font/Helvetica'),
  width: 9 * 72,
  height: 6 * 72,
});
const writeStream = createWriteStream('output.pdf');
doc.pipe(writeStream);

const files = readdirSync('images');
const handleFiles = async () => {
  // Iterates over the indices 0..29 and adds one resized image per iteration.
  for (const image in new Array(30).fill(1)) {
    const img = new pdf.Image(
      await sharp(`images/${files[image]}`)
        .resize({ width: 2217, height: 2217, fit: 'inside' })
        .toBuffer()
    );
    doc.image(img);
  }
};
handleFiles().then(() => {
  doc.end();
});
@rkusa
Owner

rkusa commented Nov 1, 2022

The issue is probably that the rendering does not progress until you've read all images into memory. A workaround might be the following:

  for (const image in new Array(30).fill(1)) {
      const img = new pdf.Image(
        await sharp(`images/${files[image]}`)
          .resize({ width: 2217, height: 2217, fit: 'inside' })
          .toBuffer()
      );
      doc.image(img);
+     await doc._next();
  }

This should write each image to the PDF before continuing with the next image. However, _next() is an internal API, so even if it works, I'd keep the issue open until there is a public API for this.

@andrzejbl
Author

Hi,

I already tried this.
I also tried the following:
doc.on('data', (chunk) => { appendFile('output2.pdf', chunk, (err) => {}); });
The result was output2.pdf growing by only a few KB. It reached its full size only after the .end() call, and it ended up being a broken PDF with the same size as output.pdf.
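
One way to check when bytes actually leave the document is to put a counting stream between the document and the file and record how much has passed through at each point in time. The following is only a sketch using Node's built-in stream module: `createByteMeter`, `fakeDocument`, `discardSink`, and `measure` are hypothetical names, and `fakeDocument` is a stand-in for the real document, not pdfjs's API.

```javascript
const { Readable, Transform, Writable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Hypothetical helper: a pass-through stream that records the cumulative
// byte count and elapsed time each time a chunk flows through it.
function createByteMeter() {
  const startedAt = Date.now();
  const samples = [];
  let total = 0;
  const meter = new Transform({
    transform(chunk, _encoding, callback) {
      total += chunk.length;
      samples.push({ ms: Date.now() - startedAt, total });
      callback(null, chunk); // forward the chunk unchanged
    },
  });
  return { meter, samples };
}

// Stand-in for the PDF document (NOT pdfjs): emits three 1 KiB chunks.
function fakeDocument() {
  return Readable.from([Buffer.alloc(1024), Buffer.alloc(1024), Buffer.alloc(1024)]);
}

// Stand-in for the output file: accepts and discards everything.
function discardSink() {
  return new Writable({
    write(_chunk, _encoding, callback) {
      callback();
    },
  });
}

// Pipes source -> meter -> sink and resolves with the recorded samples.
async function measure(source, sink) {
  const { meter, samples } = createByteMeter();
  await pipeline(source, meter, sink);
  return samples;
}

measure(fakeDocument(), discardSink()).then((samples) => {
  for (const { ms, total } of samples) console.log(`${ms} ms: ${total} bytes`);
});
```

In the original script this would presumably be wired as doc.pipe(meter).pipe(writeStream), assuming doc behaves like a regular readable stream, which its use of doc.pipe() suggests. If the samples only show growth after doc.end(), the data is being held before it reaches the file stream.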

@rkusa
Owner

rkusa commented Nov 1, 2022

Do you also run out of memory if you remove the pdfjs-related parts from handleFiles?

  for (const image in new Array(30).fill(1)) {
-     const img = new pdf.Image(
        await sharp(`images/${files[image]}`)
          .resize({ width: 2217, height: 2217, fit: 'inside' })
          .toBuffer()
-     );
-     doc.image(img);
  }

Just to make sure it is pdfjs-related and not sharp-related.
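
To compare the two variants with numbers rather than by waiting for a crash, memory can be sampled after every iteration. This is a minimal sketch built on Node's process.memoryUsage(); `memorySnapshot` and `profileLoop` are hypothetical helper names, and the demo step merely retains buffers as a stand-in for the real image work.

```javascript
// Snapshot of process memory in MiB. Node.js Buffers live outside the V8
// heap, so `external` and `rss` say more about image data than `heapUsed`.
function memorySnapshot() {
  const { rss, heapUsed, external } = process.memoryUsage();
  const toMiB = (bytes) => Math.round(bytes / (1024 * 1024));
  return {
    rssMiB: toMiB(rss),
    heapUsedMiB: toMiB(heapUsed),
    externalMiB: toMiB(external),
  };
}

// Runs `step` once per item, in order, and records memory after each one.
async function profileLoop(items, step) {
  const log = [];
  for (const item of items) {
    await step(item);
    log.push({ item, ...memorySnapshot() });
  }
  return log;
}

// Demo: a stand-in step that keeps a 4 MiB buffer alive per item,
// mimicking data that is retained instead of being written out.
const retained = [];
profileLoop(['a.jpg', 'b.jpg', 'c.jpg'], async () => {
  retained.push(Buffer.alloc(4 * 1024 * 1024));
}).then((log) => console.table(log));
```

Running the loop once with only the sharp call as the step, and once with the sharp call plus doc.image(), should show which of the two makes the numbers climb.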

@andrzejbl
Author

andrzejbl commented Nov 1, 2022

After running this file:

import { createWriteStream, readdirSync, readFileSync } from 'fs';
import * as pdf from 'pdfjs';

const doc = new pdf.Document({
  font: require('pdfjs/font/Helvetica'),
  width: 9 * 72,
  height: 6 * 72,
});
const writeStream = createWriteStream('output.pdf', {});
doc.pipe(writeStream);
const files = readdirSync('D:/small');
const handleFiles = async () => {
  for (const image in new Array(3000).fill(1)) {
    console.log(files[image]);
    const img = new pdf.Image(readFileSync(`D:/small/${files[image]}`));
    doc.image(img);
  }
};
handleFiles().then(() => {
  console.log('finish');
  doc.end();
});

I end up running out of memory, with output.pdf at a size of 145 MB.

@andrzejbl
Author

Hi!

Do you have any idea what might be causing this?

Happy New Year :)

@rkusa
Owner

rkusa commented Jan 11, 2023

Thanks for the example above. I took a more detailed look, and it is pdfjs's fault, not your code's. When an image is added, it cannot be written directly to the PDF file, because there is usually a page already open and being written to. The current implementation therefore postpones writing images until the end of the file (upon doc.end()). This should probably be changed to write each image as soon as possible (e.g. as soon as the current page is closed). That is something I don't have the time to change right now, so I am afraid pdfjs doesn't solve your use case at the moment.
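
The difference between the two strategies can be illustrated with a toy model. This is not pdfjs's actual code: `ToyPdfWriter`, its `flushOnPageEnd` option, and `simulate` are hypothetical names, and the model only counts bytes to show how much image data is held in memory at its peak.

```javascript
// Toy model (NOT pdfjs): images are queued when added and "written" either
// when the page ends or only when the document ends.
class ToyPdfWriter {
  constructor({ flushOnPageEnd }) {
    this.flushOnPageEnd = flushOnPageEnd;
    this.pendingImages = [];   // image buffers still held in memory
    this.pendingBytes = 0;     // total size of the held buffers
    this.peakPendingBytes = 0; // high-water mark of held image data
    this.bytesWritten = 0;     // bytes handed to the output so far
  }

  addImage(buffer) {
    this.pendingImages.push(buffer);
    this.pendingBytes += buffer.length;
    this.peakPendingBytes = Math.max(this.peakPendingBytes, this.pendingBytes);
  }

  endPage() {
    if (this.flushOnPageEnd) this.flushImages();
  }

  end() {
    this.flushImages();
  }

  flushImages() {
    this.bytesWritten += this.pendingBytes;
    this.pendingImages = [];
    this.pendingBytes = 0;
  }
}

// Adds one image per page and reports peak held bytes and total output.
function simulate(flushOnPageEnd, pages, imageSize) {
  const writer = new ToyPdfWriter({ flushOnPageEnd });
  for (let i = 0; i < pages; i++) {
    writer.addImage(Buffer.alloc(imageSize));
    writer.endPage();
  }
  writer.end();
  return {
    peakPendingBytes: writer.peakPendingBytes,
    bytesWritten: writer.bytesWritten,
  };
}

console.log('flush at end():   ', simulate(false, 30, 1000));
console.log('flush at page end:', simulate(true, 30, 1000));
```

Both runs write the same total, but when flushing is deferred to end() the peak grows with the number of images, whereas flushing at each page end keeps it at the size of a single page's images.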

rkusa changed the title from "Streaming / out of memory" to "Write images to PDF as soon as possible (otherwise easy to run out of memory)" on Jan 11, 2023
@andrzejbl
Author

Thanks! Could you point me to a place where I should dig into pdfjs? Maybe I can help with this.

@rkusa
Owner

rkusa commented Jan 11, 2023

The image data is written into the file here: https://github.com/rkusa/pdfjs/blob/main/lib/document.js#L563
But should probably be written as part of _endPage: https://github.com/rkusa/pdfjs/blob/main/lib/document.js#L339
There might be more to it; this is just the part that I can point you to off the top of my head. Something to keep in mind when considering contributing: pdfjs is very low on my personal list of priorities, so I am not working on it very actively. Depending on what requirements you have for a PDF library, a more actively maintained project might be more worth your time.
