<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<title>Doc⚡split</title>
<style>
body {
font-size: 16px;
line-height: 24px;
background: #fffff5;
color: #333300;
font-family: Arial;
font-family: "Palatino Linotype", "Book Antiqua", Palatino, FreeSerif, serif;
}
div.container {
width: 720px;
margin: 50px 0 50px 50px;
}
p, li {
margin: 16px 0 16px 0;
width: 550px;
}
p.break {
margin-top: 35px;
}
a, a:visited {
padding: 0 2px;
text-decoration: none;
background: #f7f7bb;
color: #333300;
}
a:active, a:hover {
color: #000;
background: #ffff88;
}
h1, h2, h3, h4, h5, h6 {
margin-top: 40px;
}
b.header {
font-size: 18px;
}
span.alias {
font-size: 14px;
font-style: italic;
margin-left: 20px;
}
table {
margin: 16px 0; padding: 0;
}
tr, td {
margin: 0; padding: 0;
}
td {
padding: 9px 15px 9px 0;
}
td.definition {
line-height: 18px;
font-size: 14px;
}
code, pre, tt {
font-family: Monaco, Consolas, "Lucida Console", monospace;
font-size: 12px;
line-height: 18px;
color: #444;
}
code {
margin-left: 20px;
}
pre {
font-size: 12px;
padding: 2px 0 2px 12px;
border-left: 6px solid #da304d;
margin: 0px 0 10px;
}
li pre {
padding: 0;
border-left: 0;
margin: 6px 0 6px 0;
}
#diagram {
margin: 20px 0 0 0;
}
</style>
</head>
<body>
<div class="container">
<h1>Doc<sub style=""><img style="width:24pt" src="noto_bolt.svg"></sub>split</h1>
<p>
<a href="http://github.com/documentcloud/docsplit/">Docsplit</a>
is a command-line utility and Ruby library for splitting apart
documents into their component parts: searchable UTF-8 <b>plain text</b>
via OCR if necessary, page <b>images</b> or thumbnails in any format,
<b>PDFs</b>, single <b>pages</b>, and document <b>metadata</b>
(title, author, number of pages...).
</p>
<p>Docsplit is currently at <a href="http://rubygems.org/gems/docsplit">version 0.7.6</a>.</p>
<p>
<i>Docsplit is an open-source component of <a href="http://documentcloud.org/">DocumentCloud</a>.</i>
</p>
<p>
<a href="#installation">Installation &amp; Dependencies</a> |
<a href="#usage">Usage</a> |
<a href="#internals">Internals</a> |
<a href="#changes">Change Log</a>
</p>
<h2 id="installation">Installation &amp; Dependencies</h2>
<ol>
<li>
Grab the gem:<br />
<tt>gem install docsplit</tt>
</li>
<li>
Install <a href="http://www.graphicsmagick.org/">GraphicsMagick</a>.
Its ‘<b>gm</b>’ command is used to generate images.<br />
Either compile it from
<a href="http://sourceforge.net/projects/graphicsmagick/files/">source</a>,
or use a package manager:
<pre>
[aptitude | port | brew] install graphicsmagick</pre>
</li>
<li>
Install <a href="http://poppler.freedesktop.org/">Poppler</a>.
On Linux, use <b>aptitude</b>, <b>apt-get</b> or <b>yum</b>:<br />
<tt>aptitude install poppler-utils poppler-data</tt><br />
On the Mac, you can install from source or use <b>MacPorts</b> or <b>Homebrew</b>:<br />
<tt>sudo port install poppler</tt> or <tt>brew install poppler</tt><br />
</li>
<li>
(Optional) Install <a href="http://www.ghostscript.com/">Ghostscript</a>:<br />
<tt>[aptitude | port | brew] install ghostscript</tt><br />
Ghostscript is required to convert PDF and PostScript files.
</li>
<li>
(Optional) Install <a href="http://code.google.com/p/tesseract-ocr/">Tesseract</a>:<br />
<tt>[aptitude | port | brew] install [tesseract | tesseract-ocr]</tt><br />
Without Tesseract installed, you'll still be able to extract text from
documents, but you won't be able to automatically OCR them.
</li>
<li>
(Optional) Install <a href="http://www.accesspdf.com/pdftk/">pdftk</a>.
On Linux, use <b>aptitude</b>, <b>apt-get</b> or <b>yum</b>:<br />
<tt>aptitude install pdftk</tt><br />
On the Mac, you can <a href="https://www.pdflabs.com/tools/pdftk-server/">download a recent installer</a> for the binary.
Without <b>pdftk</b> installed, you can use Docsplit, but won't be able
to split apart a multi-page PDF into single-page PDFs.
</li>
<li>
(Optional) Install <a href="http://www.libreoffice.org/">LibreOffice</a>.
On Linux, use <b>aptitude</b>, <b>apt-get</b> or <b>yum</b>:<br />
<tt>aptitude install libreoffice</tt><br />
On the Mac, download and install <a href="http://www.libreoffice.org/download">the latest release</a>.
</li>
<li>
(Optional) Install fonts to process documents that use <a href="https://help.ubuntu.com/community/Fonts#Chinese.2C_Japanese.2C_and_Korean_Fonts">Chinese, Japanese, and Korean Fonts</a>.
On Linux, use <b>aptitude</b>, <b>apt-get</b> or <b>yum</b>:<br />
<tt>aptitude install ttf-wqy-microhei ttf-wqy-zenhei ttf-kochi-gothic ttf-kochi-mincho fonts-nanum</tt><br />
On the Mac, the fonts should already be present. However, you can always download the TTF files and install them using <a href="http://support.apple.com/en-us/HT201749">Font Book</a>.
</li>
</ol>
<p><i>
Note: the gem will take a minute to download — the
JODConverter jar file tips the scales at 2MB.
</i></p>
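<p>
Because most of the dependencies above are optional, it can help to check
which ones are actually on your <tt>PATH</tt> before running Docsplit. The
following is a minimal sketch, not part of Docsplit itself: the helper names
<tt>find_executable</tt> and <tt>missing_tools</tt> are hypothetical, and
the binary names other than <tt>gm</tt> (<tt>pdftotext</tt>, <tt>pdfinfo</tt>,
<tt>gs</tt>, <tt>tesseract</tt>, <tt>pdftk</tt>, <tt>soffice</tt>) are
assumptions about how these packages are typically installed.
</p>

```ruby
# Hypothetical helpers, not part of the Docsplit API.

# Return the full path of an executable found in the given PATH string, or nil.
def find_executable(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Binary names are assumptions; adjust them to match your system.
TOOLS = {
  'gm'        => 'GraphicsMagick: page images',
  'pdftotext' => 'Poppler: text extraction',
  'pdfinfo'   => 'Poppler: metadata',
  'gs'        => 'Ghostscript: PDF and PostScript conversion',
  'tesseract' => 'Tesseract: OCR',
  'pdftk'     => 'pdftk: single-page PDFs',
  'soffice'   => 'LibreOffice: office document conversion'
}.freeze

# Return the subset of tools that could not be found on the PATH.
def missing_tools(tools = TOOLS, path = ENV.fetch('PATH', ''))
  tools.reject { |binary, _purpose| find_executable(binary, path) }
end

missing_tools.each do |binary, purpose|
  puts "#{binary} not found on PATH (#{purpose})"
end
```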
<h2 id="usage">Usage</h2>
<p>
The Docsplit gem includes both the <tt>docsplit</tt> command-line utility
and a Ruby API. The available commands and options are identical in both.<br />
<tt>--output</tt> or <tt>-o</tt> can be passed to any command to
store the generated files in a directory of your choosing.
</p>
<p>
<b class="header">images</b><code>--size --format --pages --density</code>
<span class="alias">Ruby: <b>extract_images</b></span>
<br />
Generates an image for each page in the document at the specified size
and format. Pass <tt>--pages</tt> or <tt>-p</tt> to choose the specific pages to
image. Pass <tt>--size</tt> or <tt>-s</tt> to specify the desired
image dimensions (for example, <tt>700x</tt>), <tt>--density</tt> or <tt>-d</tt> to specify the DPI
at which GraphicsMagick rasterizes the images during conversion, and <tt>--format</tt> or <tt>-f</tt>
to select the format of the final images.
</p>
<pre>
docsplit images example.pdf
docsplit images docs/*.pdf --size 700x,50x50 --format gif --pages 3,10-15,42</pre>
<pre>
Docsplit.extract_images('example.doc', :size => '1000x', :format => [:png, :jpg])</pre>
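<p>
To make the <tt>--pages</tt> syntax concrete, here is a small sketch of how
a specification such as <tt>3,10-15,42</tt> could be expanded into a list of
page numbers. The <tt>expand_pages</tt> helper is hypothetical and is not
how Docsplit necessarily parses the option internally; it does not handle
the special value <tt>all</tt>.
</p>

```ruby
# Hypothetical helper, not part of the Docsplit API.
# Expands a page specification like "3,10-15,42" into a sorted list of
# unique page numbers. Raises ArgumentError on non-numeric input.
def expand_pages(spec)
  pages = spec.split(',').flat_map do |part|
    part = part.strip
    if part.include?('-')
      first, last = part.split('-', 2).map { |n| Integer(n.strip) }
      (first..last).to_a
    else
      [Integer(part)]
    end
  end
  pages.uniq.sort
end

p expand_pages('3,10-15,42')
# => [3, 10, 11, 12, 13, 14, 15, 42]
```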
<p class="break">
<b class="header">text</b><code>--pages --ocr --no-ocr --no-clean --language --no-orientation-detection</code>
<span class="alias">Ruby: <b>extract_text</b></span>
<br />
Extract the complete <b>UTF-8</b>-encoded plain text of a document to a
single file. If you'd like to extract the text for each page separately,
pass <tt>--pages all</tt>. You can use the <tt>--ocr</tt> and <tt>--no-ocr</tt>
flags to force OCR, or disable it, respectively. By default (if Tesseract is installed)
Docsplit will OCR the text of each page for which it fails to extract text
directly from the document. Docsplit will also attempt to clean up garbage
characters in the OCR'd text — to disable this, pass the
<tt>--no-clean</tt> flag.
</p>
<p>
By default, Tesseract ships with only English extraction data.
If <a href="https://code.google.com/p/tesseract-ocr/downloads/list">
any additional language models</a> are installed, you can select one using
the <tt>--language</tt> flag.
If <a href="https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.01.osd.tar.gz&amp;can=2&amp;q=">
Tesseract's orientation detection model</a> is installed, Docsplit will automatically use it
unless you pass the <tt>--no-orientation-detection</tt> flag.
</p>
<pre>
docsplit text path/to/doc.pdf --pages all --language deu</pre>
<pre>
docs = Dir['storage/originals/*.doc']
Docsplit.extract_text(docs, :ocr => false, :output => 'storage/text')</pre>
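<p>
The OCR fallback described above can be pictured as a per-page decision. The
sketch below is an illustration only, under the assumption that a page
"fails" direct extraction when its text is empty or whitespace; the
<tt>pages_needing_ocr</tt> helper and its <tt>mode</tt> values are
hypothetical and do not mirror Docsplit's internals.
</p>

```ruby
# Hypothetical helper, not part of the Docsplit API.
# page_texts maps page numbers to the text extracted directly from the PDF.
# mode mirrors the command-line flags: :force (--ocr), :never (--no-ocr),
# or :auto (the default fallback behaviour).
def pages_needing_ocr(page_texts, mode: :auto)
  case mode
  when :force then page_texts.keys
  when :never then []
  else
    # Assumption: a page needs OCR when no usable text was extracted.
    page_texts.select { |_page, text| text.to_s.strip.empty? }.keys
  end
end

extracted = { 1 => 'Disorder in the Court', 2 => "  \n", 3 => nil }
p pages_needing_ocr(extracted)
# => [2, 3]
```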
<p class="break">
<b class="header">pages</b><code>--pages</code>
<span class="alias">Ruby: <b>extract_pages</b></span>
<br />
Burst apart a document into single-page PDFs. Use <tt>--pages</tt> to
specify the individual pages (or ranges of pages) you'd like to generate.
</p>
<pre>
docsplit pages path/to/doc.pdf --pages 1-10</pre>
<pre>
Docsplit.extract_pages('path/to/presentation.ppt')
Docsplit.extract_pages('doc.pdf', :pages => 1..10)</pre>
<p class="break">
<b class="header">pdf</b>
<span class="alias">Ruby: <b>extract_pdf</b></span>
<br />
Convert documents into PDFs. Any type of document that LibreOffice can read
may be converted. These include the Microsoft Office formats: <b>doc</b>, <b>docx</b>, <b>ppt</b>,
<b>xls</b> and so on, as well as <b>html</b>, <b>odf</b>, <b>rtf</b>, <b>swf</b>, <b>svg</b>, and <b>wpd</b>.
The first time that you convert a new file type, LibreOffice will lazy-load
the code that processes it — subsequent conversions will be much faster.
</p>
<pre>
docsplit pdf documentation/*.html</pre>
<pre>
Docsplit.extract_pdf('expense_report.xls')</pre>
<p class="break">
<b class="header">author, date, creator, keywords, producer, subject, title, length</b><br />
<small><i>Ruby: <b>extract_...</b></i></small>
<br />
Retrieve a piece of metadata about the document. The <tt>docsplit</tt>
utility prints the value to <b>stdout</b>; the Ruby API returns it.
</p>
<pre>
docsplit title path/to/stooges.pdf
=> Disorder in the Court</pre>
<pre>
Docsplit.extract_length('path/to/stooges.pdf')
=> 36</pre>
<h2 id="internals">Internals</h2>
<p>
Under the hood, Docsplit is a thin wrapper around the excellent
<a href="http://www.graphicsmagick.org/">GraphicsMagick</a>,
<a href="http://poppler.freedesktop.org/">Poppler</a>,
<a href="http://www.accesspdf.com/pdftk/">PDFTK</a>,
<a href="http://code.google.com/p/tesseract-ocr/">Tesseract</a>, and
<a href="http://www.libreoffice.org/">LibreOffice</a> libraries.
Poppler is used to extract text and metadata from PDF documents,
PDFTK is used to split them apart into pages, and GraphicsMagick is used to generate
the page images (internally, it's rendering them with
<a href="http://pages.cs.wisc.edu/~ghost/doc/GPL/index.htm">Ghostscript</a>).
LibreOffice and GraphicsMagick convert documents and images to PDF.
Tesseract provides the transparent OCR fallback support, if the document
is a simple scan, and the file doesn't contain any embedded text.
</p>
<p>
Because documents need to be in PDF format before any metadata, text,
or images are extracted, it's faster to use <tt>docsplit pdf</tt>
to convert a document up front if you're planning to run more than one extraction.
Otherwise, Docsplit will write out the PDF version to a temporary file before
proceeding with each command.
</p>
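<p>
Deciding whether a file still needs converting comes down to whether it is
already a PDF. The change log below notes that Docsplit detects PDFs by
magic number rather than by extension; the following sketch shows the
general idea and is not Docsplit's actual implementation. The <tt>pdf?</tt>
helper is hypothetical, and checking only the first five bytes is a
simplifying assumption.
</p>

```ruby
# Hypothetical helper, not part of the Docsplit API.
# PDF files begin with the ASCII signature "%PDF-".
PDF_MAGIC = '%PDF-'.b.freeze

# Return true when the file at +path+ starts with the PDF signature,
# regardless of its extension. An empty file returns false.
def pdf?(path)
  File.open(path, 'rb') do |file|
    file.read(PDF_MAGIC.bytesize) == PDF_MAGIC
  end
end
```

<p>
A caller could use a check like this to skip the conversion step for files
that are already PDFs, whatever they happen to be named.
</p>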
<h2 id="changes">Change Log</h2>
<p>
<b class="header">0.7.6</b><small> – Nov. 16, 2014</small><br />
Docsplit will now automatically use Tesseract's orientation detection model
if it is installed.
</p>
<p>
<b class="header">0.7.5</b><small> – May 28, 2014</small><br />
Docsplit will detect PDFs regardless of extension using magic number-based
detection.
</p>
<p>
<b class="header">0.7.2</b><small> – Feb. 23, 2013</small><br />
Bug fixes for LibreOffice support.
</p>
<p>
<b class="header">0.7.0</b><small> – Feb. 23, 2013</small><br />
Docsplit now expresses a preference for LibreOffice over OpenOffice, with
an eye to removing JODConverter and OpenOffice support in future versions
(direct LibreOffice support is substantially faster than JODConverter).
Improved Unicode support now correctly collects non-ASCII characters from
pdfinfo.
</p>
<p>
<b class="header">0.6.4</b><small> – Nov. 12, 2012</small><br />
Added a language flag for the Docsplit command line, fixed several bugs,
and began preparations for the deprecation of pdftk.
</p>
<p>
<b class="header">0.6.2</b><small> – Nov. 22, 2011</small><br />
Bugfix to escape document names during file type detection.
</p>
<p>
<b class="header">0.6.1</b><small> – Nov. 18, 2011</small><br />
Docsplit now supports converting documents using LibreOffice
as well as OpenOffice, through JODConverter 3.0 beta4.
</p>
<p>
<b class="header">0.6.0</b><small> – Sept. 13, 2011</small><br />
Docsplit should now handle shelling out for documents with arbitrary
characters in their filenames correctly, thanks to a series of
epic patches from Vladimir Rybas.
A <tt>--density</tt> option was added for specifying the resolution of
rasterization when generating images from documents.
The image resolution for OCR has been doubled from 200 to 400 DPI —
this shouldn't make a noticeable difference for normal docs, but will make
a world of difference for the fine print.
Docsplit now uses GraphicsMagick's <tt>--despeckle</tt> before OCR.
</p>
<p>
<b class="header">0.5.2</b><small> – May 13, 2011</small><br />
For transparent conversion to PDF, made Docsplit prefer GraphicsMagick
over OpenOffice, when the file format is one that GraphicsMagick is able
to read: (png, gif, jpg, jpeg, tif, tiff, bmp, pnm, ppm, svg, eps).
</p>
<p>
<b class="header">0.5.1</b><small> – April 26, 2011</small><br />
Minor tweaks to the <tt>TextCleaner</tt> to be more lenient about acronyms
with hyphens, and words with four vowels in a row.
</p>
<p>
<b class="header">0.5.0</b><br />
Added a <tt>Docsplit::TextCleaner</tt> class which is used to post-process
OCR'd text, and remove garbage characters that are created when Tesseract
encounters non-English text. To disable the cleanup, pass <tt>--no-clean</tt>.
</p>
<p>
<b class="header">0.4.1</b><br />
Upgraded the JODConverter dependency for PDF conversion via OpenOffice to
3.0 beta. Added PNG, GIF, TIF, JPG, and BMP to the list of supported
formats.
</p>
<p>
<b class="header">0.3.4</b><br />
Adding a suggested optimization from the GraphicsMagick list -- only ever
generate one page image per GraphicsMagick call. Saves large amounts of
disk space for tempfiles on long documents.
</p>
<p>
<b class="header">0.3.3</b><br />
Start using the MAGICK_TMPDIR environment variable to prevent parallel
Docsplit runs from having the potential to clobber each other's temporary
image files.
</p>
<p>
<b class="header">0.3.1</b><br />
Added a memory limit to GraphicsMagick while generating the TIFFs for
Tesseract OCR -- prevents <tt>gm</tt> from gobbling up all available memory
on large files.
</p>
<p>
<b class="header">0.3.0</b><br />
OCR support added via Tesseract, and the <tt>--ocr</tt> and <tt>--no-ocr</tt>
flags. PDFBox is no longer a dependency, and the gem is many megabytes
lighter for it.
</p>
<p>
<b class="header">0.2.0</b><br />
Moving to Poppler's <tt>pdftotext</tt>. PDFBox had issues with Unicode in PDFs
and incorrectly split individual pages of text.
</p>
<p>
<b class="header">0.1.3</b><br />
Fixing a bug with specifying explicit page ranges for image extraction.
</p>
<p>
<b class="header">0.1.2</b><br />
Limiting the memory usage of GraphicsMagick to avoid out of memory errors
on very large PDFs.
</p>
<p>
<b class="header">0.1.1</b><br />
Upgraded for compatibility with GraphicsMagick 1.3.11.
</p>
<p>
<b class="header">0.1.0</b><br />
Initial Docsplit release.
</p>
<p>
<br />
<a href="http://documentcloud.org/" title="A DocumentCloud Project" style="background:none;">
<img src="http://jashkenas.s3.amazonaws.com/images/a_documentcloud_project.png" alt="A DocumentCloud Project" style="position:relative;left:-10px;" />
</a>
</p>
</div>
</body>
</html>