[BUG] Unipdf corrupts the Unicode text when merging searchable PDF documents and applying PDF/A-1a #550

Closed
sagar-kalburgi-ripcord opened this issue Apr 4, 2024 · 9 comments

Comments

@sagar-kalburgi-ripcord

sagar-kalburgi-ripcord commented Apr 4, 2024

Description

When I merge two searchable PDF documents (produced by an OCR engine) with unipdf and apply the PDF/A-1a standard before writing the merged result to an output file, the Unicode text in the output is corrupted. I cannot search for any word that appears in the document, and when I copy text from the output PDF and paste it into a text editor, I get unrecognizable characters:

􀀀ı􀀀ˇ􀀀􀀀􀀀   􀀀􀀀  􀀀 􀀀 􀀀
􀀀􀀀􀀀   􀀀􀀀 ˇ􀀀􀀀 
􀀀􀀀 􀀀􀀀 􀀀 ı􀀀 􀀀 􀀀􀀀ˇ􀀀􀀀ı 􀀀   􀀀􀀀  􀀀 􀀀􀀀􀀀   􀀀􀀀  􀀀􀀀  􀀀􀀀􀀀􀀀 

 􀀀􀀀  􀀀􀀀􀀀􀀀
   􀀀􀀀􀀀􀀀   ı􀀀
􀀀
􀀀 􀀀 􀀀 􀀀 􀀀 􀀀 􀀀 
􀀀 􀀀 􀀀􀀀􀀀   􀀀􀀀 
􀀀 􀀀􀀀􀀀
  􀀀􀀀  􀀀􀀀 

Expected Behavior

When I open the output PDF file and search for an existing word, the search should find it. When I copy the text from the output PDF file and paste it into a text editor, the pasted text should match the text in the document exactly.
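The symptom above can be checked automatically once the text has been extracted from the merged file by whatever means is available. The following is a minimal sketch using only the Go standard library, not part of unipdf: the helper names `unmappedRatio` and `looksSearchable`, the 5% threshold, and the sample strings are all hypothetical, and it assumes the corrupted output consists mostly of private-use code points or U+FFFD, as the pasted sample appears to.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// unmappedRatio returns the fraction of non-space runes in text that are
// private-use code points (category Co) or U+FFFD. A high ratio suggests
// that glyphs could not be mapped back to real Unicode characters.
// Hypothetical helper, not a unipdf function.
func unmappedRatio(text string) float64 {
	total, bad := 0, 0
	for _, r := range text {
		if unicode.IsSpace(r) {
			continue
		}
		total++
		if r == unicode.ReplacementChar || unicode.Is(unicode.Co, r) {
			bad++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(bad) / float64(total)
}

// looksSearchable reports whether text contains word and at most
// maxUnmapped of its non-space runes are unmapped.
func looksSearchable(text, word string, maxUnmapped float64) bool {
	return strings.Contains(text, word) && unmappedRatio(text) <= maxUnmapped
}

func main() {
	// Illustrative strings only; real input would be the text extracted
	// from merged.pdf.
	good := "hello world"
	bad := "\U00100000\U00100000 \U00100000\u0131\U00100000"

	fmt.Println(looksSearchable(good, "world", 0.05))
	fmt.Println(looksSearchable(bad, "world", 0.05))
	fmt.Println(unmappedRatio(bad))
}
```

A check like this could run after the merge step in a test suite, so that a regression in the text mapping is caught before the output reaches users.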

Actual Behavior

Steps to reproduce the behavior:
Run the sample program below with a valid UniDoc license and the attached PDF files.

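package main

import (
	"fmt"
	"log"
	"os"

	"github.com/unidoc/unipdf/v3/model"
	"github.com/unidoc/unipdf/v3/model/pdfa"
)

// A valid UniDoc license must be configured before running (omitted here).
func main() {
	// Open both OCR-produced input files and check each error separately.
	reader1, f1, err := model.NewPdfReaderFromFile("Page 1 of OCRed image0065.pdf", nil)
	if err != nil {
		log.Fatalf("Error opening first file: %v", err)
	}
	defer f1.Close()

	reader2, f2, err := model.NewPdfReaderFromFile("Page 1 of OCRed image0069.pdf", nil)
	if err != nil {
		log.Fatalf("Error opening second file: %v", err)
	}
	defer f2.Close()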

	writer := model.NewPdfWriter()

	for _, reader := range []*model.PdfReader{reader1, reader2} {
		numPages, err := reader.GetNumPages()
		if err != nil {
			log.Fatalf("Error getting page count: %v", err)
		}
		for i := 1; i <= numPages; i++ {
			page, err := reader.GetPage(i)
			if err != nil {
				log.Fatalf("Error getting page %d: %v", i, err)
			}

			err = writer.AddPage(page)
			if err != nil {
				log.Fatalf("Error adding page %d: %v", i, err)
			}
		}
	}

	writer.ApplyStandard(model.StandardApplier(pdfa.NewProfile1A(pdfa.DefaultProfile1Options())))
	err = writer.WriteToFile("merged.pdf")
	if err != nil {
		log.Fatalf("Error writing merged file: %v", err)
	}

	err = checkCompliancePdfA1a("merged.pdf")
	if err != nil {
		panic(err)
	}

	fmt.Printf("The document is compliant with the standard PDF/A-1a\n")

}

func checkCompliancePdfA1a(fileName string) error {
	// Open up the file with given name.
	f, err := os.Open(fileName)
	if err != nil {
		return err
	}
	defer f.Close()

	// Prepare compliant document reader.
	r, err := model.NewCompliancePdfReader(f)
	if err != nil {
		return err
	}

	// Define to which standard we want to check document compliance.
	profile1A := pdfa.NewProfile1A(pdfa.DefaultProfile1Options())

	// Verify the standard.
	if err = profile1A.ValidateStandard(r); err != nil {
		return err
	}

	return nil
}

Attachments

2 PDF files have been attached

@sagar-kalburgi-ripcord
Author

Also, in case it helps: the two documents attached to this ticket claim to be PDF/A-1a compliant, but validating either of them with the unipdf PDF/A-1a validation function fails with errors. And when I enforce the PDF/A-1a standard on the merged document, the Unicode text gets corrupted and I cannot search for any text.

@sagar-kalburgi-ripcord
Author

Page 1 of OCRed image0069.pdf
Delta.pdf

Please find the two PDF files I used for the sample program

@anovik

anovik commented Apr 15, 2024

Hello @sagar-kalburgi-ripcord, we have fixed this issue. The fix has been merged into the development branch of the unipdf source repository (I believe you have access to it and can test it).

It will also be included in the next release of UniPDF; we will let you know when it is out.

@sagar-kalburgi-ripcord
Author

Hi @anovik,

Thanks! Sure I will test it. Any idea when the next release of UniPDF is going to be?

@anovik

anovik commented Apr 15, 2024

@sagar-kalburgi-ripcord It is planned for the end of April.

@sagar-kalburgi-ripcord
Author

@anovik I tested the fix from your development branch and it looks good!
I'm afraid the end of April is too late for us. This issue is critically impacting production, and we are losing revenue every day it remains unresolved. Would it be possible for you to provide a hotfix release as early as possible?

@anovik

anovik commented Apr 16, 2024

@sagar-kalburgi-ripcord We completely understand the urgency of your situation and are prioritizing the release of a hotfix to address this issue as quickly as possible.

We'll keep you updated on the progress and notify you as soon as the release is completed.

@sagar-kalburgi-ripcord
Author

ok thanks!

@anovik

anovik commented Apr 17, 2024

@sagar-kalburgi-ripcord The new release of UniPDF is available at https://github.com/unidoc/unipdf/releases/tag/v3.57.0 and it includes the fix for this issue.

Closing the current ticket, feel free to re-open it in case of any problems.

anovik closed this as completed Apr 17, 2024