Feature: CSV output #75

Open
AbelLykens opened this issue May 17, 2023 · 6 comments

Comments

@AbelLykens

AbelLykens commented May 17, 2023

Would love CSV output like this:

page,type,author,created,text
1,Highlight,John,2023-05-17T11:38:17,Text

Sounds like that should be possible but not sure how. Great tool, thanks!

@0xabu
Owner

0xabu commented May 31, 2023

You can certainly write a printer to do that -- take a look at the JSON output for an example:
https://github.com/0xabu/pdfannots/blob/658984edebb6bb8409e9ce8bb49ac85ded8f8675/pdfannots/printer/json.py

If you don't want to do that, you could take the JSON output from pdfannots and convert it to CSV:
https://stackoverflow.com/questions/32960857/how-to-convert-arbitrary-simple-json-to-csv-using-jq
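For readers who prefer Python to jq, the conversion suggested above might look roughly like the sketch below. It assumes the JSON output is an array of flat objects and that the keys match the column names requested in the opening post (page, type, author, created, text); neither is confirmed in this thread, and `annots_json_to_csv` is a hypothetical helper name, not part of pdfannots.

```python
import csv
import io
import json

# Columns from the CSV header requested in the opening post. Whether the
# JSON output uses exactly these keys is an assumption.
COLUMNS = ["page", "type", "author", "created", "text"]


def annots_json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat annotation objects into CSV text."""
    records = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=COLUMNS,
        restval="",             # missing keys become empty cells
        extrasaction="ignore",  # keys that are not in COLUMNS are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return out.getvalue()
```

To use it, read the file that pdfannots wrote with `-f json`, pass its contents to the function, and write the result to a file opened with `encoding="utf-8"` and `newline=""`.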

@AbelLykens
Author

Thanks -- yeah took a quick look, seems possible. Might look at it indeed, thanks for the pointer.

@Proeliorr

[Beginner's level question]

I would like to ask whether there is an option (or rather, how to set it) to use an encoding that supports Polish and German special characters.
I want to use your tool for learning German. The problem is that the output .txt (JSON) file does not show any Polish or German special characters.

[Screenshot: notepad_uDvH1LirmZ]

Correct version
text: Lösung
contents: rozwiązanie

I tried to modify the JSON file but I got stuck. :/

Console line:

pdfannots "path" -f json > directories\json_to_csv.txt

Some additional information:

  1. The PDF file is set in the Goethe FF Clan font. When I copy a word from the file and paste it into, e.g., Notepad++/WordPad/a browser, the special characters are copied, too.
  2. Currently I can create the .csv file from the .json output, but it still contains no German or Polish characters.
  3. The same thing happens when I try to create a Markdown (.md) file.

Best regards

@0xabu
Owner

0xabu commented Aug 18, 2023

@Proeliorr this has nothing to do with CSV. Why are you commenting on this issue?

In any case, pdfannots always outputs UTF-8, and 00f6 is indeed the Unicode code point for ö (https://codepoints.net/U+00F6) -- I think perhaps you need to tell your text editor to use the UTF-8 encoding.
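A small Python sketch of the two effects mentioned here, using only the standard library. It assumes the JSON output escapes non-ASCII characters the way Python's `json.dumps` does by default; that is inferred from the 00f6 remark, not confirmed for pdfannots.

```python
import json

word = "L\u00f6sung"  # "Lösung", the word from the example above

# By default json.dumps escapes non-ASCII characters as \uXXXX sequences.
escaped = json.dumps({"text": word})

# The escape is lossless: parsing the JSON gives the original word back.
round_trip = json.loads(escaped)["text"]

# With ensure_ascii=False the character is written literally instead.
plain = json.dumps({"text": word}, ensure_ascii=False)

# UTF-8 bytes read with a legacy code page (here cp1252) turn into mojibake,
# which is what an editor shows when it guesses the wrong encoding.
mojibake = word.encode("utf-8").decode("cp1252")

print(escaped)
print(plain)
print(mojibake)
```

Either way the data is intact: a JSON parser restores the escaped form, and an editor set to UTF-8 displays the literal form correctly.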

@Proeliorr

@0xabu After some consideration I agree.

The file was re-saved in UTF-8 encoding, and Notepad++ recognizes it as a UTF-8 encoded file. That is where the problem lies.

[Screenshot: notepad++_jlfvWWnyN5]

Nevertheless, I will not take this topic further off track. I no longer think this is a pdfannots issue.
Cheers

@0xabu
Owner

0xabu commented Aug 20, 2023

@Proeliorr I took another look at this, and there is something fishy going on with output redirection on Windows. I've created #84 to track it. Luckily it has a pretty simple workaround -- use -o to write output to a file, rather than redirection.
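The cause is not stated in this thread. One plausible explanation, offered here as an assumption, is that a Python program's redirected stdout on Windows can use the locale code page instead of UTF-8, whereas a file the program opens itself can be given an explicit encoding. The sketch below only illustrates how the same text produces different bytes under the two encodings; `encode_via_stream` is a hypothetical helper, not pdfannots code.

```python
import io


def encode_via_stream(text: str, encoding: str) -> bytes:
    """Write text through a text stream with the given encoding; return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # leave the underlying buffer open when the wrapper goes away
    return data


# Same text, two encodings: UTF-8 (explicitly chosen for a file) versus
# cp1252 (a common Windows locale code page, assumed here for illustration).
utf8_bytes = encode_via_stream("L\u00f6sung", "utf-8")
cp1252_bytes = encode_via_stream("L\u00f6sung", "cp1252")

print(utf8_bytes)
print(cp1252_bytes)
```

A file that holds the cp1252 bytes but is opened as UTF-8 would show a replacement character where the ö should be, which would be consistent with the symptoms described earlier.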
