.mso www.tmac
.TH PDFTOEDN "1" "August 2016" "pdftoedn" "User Commands"
.SH NAME
pdftoedn \- extract the contents of a PDF document to EDN format
.SH SYNOPSIS
.B pdftoedn
[\fI\,options\/\fR] \fI\,{\-o <output file>} {filename}\/\fR
.SH DESCRIPTION
.B pdftoedn
is a tool for extracting the contents of a PDF document and saving them
to a file in Extensible Data Notation (EDN) format. The written output
will contain a hash with two entries:
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\fI:meta\fR, a hash containing the PDF format version, linked library
versions, and the document's outline.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\fI:pages\fR, an array with each index carrying a hash of resources
(fonts, image blob references, colors), text, and graphics page data.
.RE
.PP
.B pdftoedn
saves text data in sorted order (top-left to bottom-right) for easier
extraction. Spans are split on either glyph position or a change in
font. However, the application supports font and glyph replacement via
a JSON configuration map file that can be passed as an argument. When
assembling text spans, it gives priority to the replacement fonts,
treating fonts mapped to the same replacement as equivalent. For
example, a document
carrying a line with the text "Total cost is " using a font named
\fIQUDIXD-Custom1-1\fR followed by the text "$20" using a font named
\fISJFHE-Swissb\fR, with both fonts mapped to \fIHelvetica\fR in the
config map, produces output with the entire span assembled using
\fIHelvetica\fR.
.PP
.B pdftoedn
extracts document images and saves them to disk in PNG format with any
transformations specified in the document. Files are saved in a
subdirectory named after the document's base name.
.PP
This man page covers basic usage of the application. For more
information, including output format and font map configuration
format, please refer to the online documentation
.URL https://github.com/edporras/pdftoedn/wiki wiki.
.SH USAGE
.PP
Process the document file1.pdf:
.sp
.if n \{\
.RS 4
.\}
.nf
$ pdftoedn \-o file1.edn file1.pdf
.fi
.if n \{\
.RE
.\}
.PP
Process only the first page of the document file1.pdf. Note that the
page value is 0-indexed and that the number of pages can't be
determined until the document's \fI:meta\fR has been read. Also, the
output will contain only one entry in the \fI:pages\fR array, at
index 0; the entry's hash, however, contains a \fI:pgnum\fR entry
indicating the document's page number (1-indexed):
.sp
.if n \{\
.RS 4
.\}
.nf
$ pdftoedn \-p 0 \-o file1.edn file1.pdf
.fi
.if n \{\
.RE
.\}
.PP
Process the document file1.pdf using the font map file
fontmap1.json:
.sp
.if n \{\
.RS 4
.\}
.nf
$ pdftoedn \-m fontmap1.json \-o file1.edn file1.pdf
.fi
.if n \{\
.RE
.\}
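.PP
The flags listed under OPTIONS can be combined with the basic
invocation. As an illustrative sketch (the file names are placeholders,
and combining these two flags is assumed to work as their descriptions
suggest), overwrite an existing file1.edn while extracting only link
data from file1.pdf:
.sp
.if n \{\
.RS 4
.\}
.nf
$ pdftoedn \-f \-l \-o file1.edn file1.pdf
.fi
.if n \{\
.RE
.\}
.PP
Similarly, an encrypted document can be processed by supplying its
user password. Here "secret" stands in for the real password:
.sp
.if n \{\
.RS 4
.\}
.nf
$ pdftoedn \-u secret \-o file1.edn file1.pdf
.fi
.if n \{\
.RE
.\}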
.PP
.SH OPTIONS
.TP
\fB\-a\fR [ \fB\-\-use_page_crop_box\fR ]
Use page crop box instead of media box when
reading page content.
.TP
\fB\-D\fR [ \fB\-\-debug_meta\fR ]
Include additional debug metadata in output.
.TP
\fB\-F\fR [ \fB\-\-show_font_map_list\fR ]
Display the configured font substitution list and exit.
.TP
\fB\-f\fR [ \fB\-\-force_output\fR ]
Overwrite output file if it exists.
.TP
\fB\-i\fR [ \fB\-\-invisible_text\fR ]
Include invisible text in output (for use with
OCR'd documents).
.TP
\fB\-l\fR [ \fB\-\-links_only\fR ]
Extract only link data.
.TP
\fB\-m\fR [ \fB\-\-font_map_file\fR ] filename.json
JSON font mapping configuration file to use for this run.
A relative path can be specified. Alternatively,
.B pdftoedn
will look for it in ~/.pdftoedn.
.TP
\fB\-O\fR [ \fB\-\-omit_outline\fR ]
Don't extract outline data.
.TP
\fB\-p\fR [ \fB\-\-page_number\fR ] arg
Extract data for only this page.
.TP
\fB\-t\fR [ \fB\-\-owner_password\fR ] arg
PDF owner password if document is encrypted.
.TP
\fB\-u\fR [ \fB\-\-user_password\fR ] arg
PDF user password if document is encrypted.
.TP
\fB\-v\fR [ \fB\-\-version\fR ]
Display version information and exit.
.TP
\fB\-h\fR [ \fB\-\-help\fR ]
Display this message.
.PP
.SH FILES
.B pdftoedn
searches for the specified font map configuration files under the
~/.pdftoedn directory.
.SH EXIT STATUS
.B pdftoedn
exits 0 on success and >0 if an error occurs. See
.URL https://github.com/edporras/pdftoedn/wiki/Returned-Exit-Error-Codes Returned-Exit-Error-Codes
for more information.
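.PP
Because a non-zero status signals an error, a shell script can test
the result directly. A minimal sketch, assuming a POSIX shell and
placeholder file names:
.sp
.if n \{\
.RS 4
.\}
.nf
$ pdftoedn \-o file1.edn file1.pdf || echo "pdftoedn failed with status $?"
.fi
.if n \{\
.RE
.\}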
.SH BUGS
Please report all issues via
.URL https://github.com/edporras/pdftoedn/issues GitHub.
.SH AUTHOR
Ed Porras. Based on an initial implementation by Jack Rusher.