[BUG] UniPDF corrupts Unicode text when merging searchable PDF documents and applying PDF/A-1a #550

sagar-kalburgi-ripcord · 2024-04-04T14:42:50Z

Description

When I merge two searchable PDF documents (produced by an OCR engine) using unipdf, I also specify that the PDF/A-1a standard should be applied before the merged result is written to an output PDF file. When I open the output PDF file, the Unicode text is corrupted: I cannot search for any word that appears in it, and when I copy some of the text from the output PDF and paste it into a text editor, I get this unrecognizable text:

􀀀ı􀀀ˇ􀀀􀀀􀀀 􀀀􀀀 􀀀 􀀀 􀀀
􀀀􀀀􀀀 􀀀􀀀 ˇ􀀀􀀀
􀀀􀀀 􀀀􀀀 􀀀 ı􀀀 􀀀 􀀀􀀀ˇ􀀀􀀀ı 􀀀 􀀀􀀀 􀀀 􀀀􀀀􀀀 􀀀􀀀 􀀀􀀀 􀀀􀀀􀀀􀀀

􀀀􀀀 􀀀􀀀􀀀􀀀
􀀀􀀀􀀀􀀀 ı􀀀
􀀀
􀀀 􀀀 􀀀 􀀀 􀀀 􀀀 􀀀
􀀀 􀀀 􀀀􀀀􀀀 􀀀􀀀
􀀀 􀀀􀀀􀀀
􀀀􀀀 􀀀􀀀

Expected Behavior

When I open the output PDF file and search for a word that appears in it, the word should be found. When I copy text from the output PDF file and paste it into a text editor, the pasted text should exactly match the text in the document.
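The expectation above can also be checked programmatically. The following is a minimal sketch, not part of unipdf: `garbledRatio` is a hypothetical helper that measures what fraction of a string's non-space characters are private-use code points or the Unicode replacement character. The pasted sample in the description appears to consist mostly of such code points, so a high ratio on text extracted from merged.pdf would suggest the same corruption. How the text is extracted from the PDF is assumed and not shown here.

```go
package main

import (
	"fmt"
	"unicode"
)

// garbledRatio returns the fraction of non-space runes in text that are
// private-use code points (Unicode category Co) or the replacement
// character U+FFFD. It returns 0 when text has no non-space runes.
// Hypothetical helper for illustration; it is not a unipdf function.
func garbledRatio(text string) float64 {
	total, bad := 0, 0
	for _, r := range text {
		if unicode.IsSpace(r) {
			continue
		}
		total++
		if unicode.Is(unicode.Co, r) || r == unicode.ReplacementChar {
			bad++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(bad) / float64(total)
}

func main() {
	// Hypothetical samples; in practice the strings would come from a
	// text extractor run on the input files and on the merged output.
	clean := "Invoice number 12345"
	garbled := "\U00100000\U00100000 \U00100000"

	fmt.Printf("clean:   %.2f\n", garbledRatio(clean))
	fmt.Printf("garbled: %.2f\n", garbledRatio(garbled))
}
```

Comparing the ratio for text taken from each input file against the ratio for the merged output would show whether the merge step introduced the corruption. The threshold for "garbled" is a judgment call and depends on the documents.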

Actual Behavior

Steps to reproduce the behavior:
Please run the sample program below with a valid UniDoc license, using the attached PDF files.

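package main

import (
	"fmt"
	"log"
	"os"

	"github.com/unidoc/unipdf/v3/model"
	"github.com/unidoc/unipdf/v3/model/pdfa"
)

func main() {
	// A valid UniDoc license must be loaded before any of the calls below.

	// Open both OCRed input documents; keep the file handles so they can be closed.
	reader1, f1, err := model.NewPdfReaderFromFile("Page 1 of OCRed image0065.pdf", nil)
	if err != nil {
		log.Fatalf("Error opening first input file: %v", err)
	}
	defer f1.Close()

	reader2, f2, err := model.NewPdfReaderFromFile("Page 1 of OCRed image0069.pdf", nil)
	if err != nil {
		log.Fatalf("Error opening second input file: %v", err)
	}
	defer f2.Close()

	writer := model.NewPdfWriter()

	// Append every page of each input document to the writer.
	for _, reader := range []*model.PdfReader{reader1, reader2} {
		numPages, err := reader.GetNumPages()
		if err != nil {
			log.Fatalf("Error getting page count: %v", err)
		}
		for i := 1; i <= numPages; i++ {
			page, err := reader.GetPage(i)
			if err != nil {
				log.Fatalf("Error getting page %d: %v", i, err)
			}

			err = writer.AddPage(page)
			if err != nil {
				log.Fatalf("Error adding page %d: %v", i, err)
			}
		}
	}

	// Apply the PDF/A-1a profile before writing the merged document.
	writer.ApplyStandard(model.StandardApplier(pdfa.NewProfile1A(pdfa.DefaultProfile1Options())))
	err = writer.WriteToFile("merged.pdf")
	if err != nil {
		log.Fatalf("Error writing merged file: %v", err)
	}

	// Validate the merged output against PDF/A-1a.
	err = checkCompliancePdfA1a("merged.pdf")
	if err != nil {
		panic(err)
	}

	fmt.Println("The document is compliant with the standard PDF/A-1a")
}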

func checkCompliancePdfA1a(fileName string) error {
	// Open up the file with given name.
	f, err := os.Open(fileName)
	if err != nil {
		return err
	}
	defer f.Close()

	// Prepare compliant document reader.
	r, err := model.NewCompliancePdfReader(f)
	if err != nil {
		return err
	}

	// Define to which standard we want to check document compliance.
	profile1A := pdfa.NewProfile1A(pdfa.DefaultProfile1Options())

	// Verify the standard.
	if err = profile1A.ValidateStandard(r); err != nil {
		return err
	}

	return nil
}

Attachments

2 PDF files have been attached


sagar-kalburgi-ripcord · 2024-04-04T16:43:02Z

Also, in case it helps: the two documents attached to this ticket claim to be PDF/A-1a compliant, but validating either of them with unipdf's PDF/A-1a validation function fails with some errors. And when I enforce the PDF/A-1a standard on the merged document, the Unicode text gets corrupted and I am unable to search for any text.

sagar-kalburgi-ripcord · 2024-04-04T19:20:10Z

Page 1 of OCRed image0069.pdf
Delta.pdf

Please find the two PDF files I used for the sample program

anovik · 2024-04-15T05:57:40Z

Hello @sagar-kalburgi-ripcord, we have fixed your issue. The fix has been merged into the development branch of the unipdf source repository (I believe you have access to it and can test it).

It will also be included in the next release of UniPDF; we will let you know when it is out.

sagar-kalburgi-ripcord · 2024-04-15T14:06:12Z

Hi @anovik,

Thanks! Sure, I will test it. Any idea when the next release of UniPDF will be?

anovik · 2024-04-15T14:29:01Z

@sagar-kalburgi-ripcord It is planned for the end of April.

sagar-kalburgi-ripcord · 2024-04-15T18:20:39Z

@anovik I tested the fix from your development branch and it looks good!
I'm afraid the end of April is too late for us. This issue is critically impacting production, and we are losing revenue every day it remains active. Would it be possible for you to provide a hotfix release as soon as possible?

anovik · 2024-04-16T09:12:27Z

@sagar-kalburgi-ripcord We completely understand the urgency of your situation and are prioritizing the release of a hotfix to address this issue as quickly as possible.

We'll keep you updated on the progress and notify you as soon as the release is completed.

sagar-kalburgi-ripcord · 2024-04-16T09:44:52Z

ok thanks!

anovik · 2024-04-17T06:19:47Z

@sagar-kalburgi-ripcord The new release of UniPDF is available at https://github.com/unidoc/unipdf/releases/tag/v3.57.0 and it includes the fix for this issue.

Closing the current ticket, feel free to re-open it in case of any problems.

anovik closed this as completed Apr 17, 2024
