Release 1.13 - ???
* Upgrade to POI 3.14-beta1 (TIKA-1799).
* Upgrade to PDFBox 1.8.11 (TIKA-1830).
Release 1.12 - 01/24/2016
* Slide notes are now linked to the slide XHTML in the PPT output
(TIKA-1840).
* JSON tests in Tika server were updated to remove impossible casts
(GitHub-73).
* Fix bug in GeoTopicParser where NER is reused instead of instantiated
with each request (TIKA-1834).
* Upgrade Rome to 1.5.1 and downgrade the Rome dependency to 0.9 to avoid
a nasty NPE (TIKA-1820, TIKA-1516).
* The NamedEntityParser was enhanced to generate text content
in addition to metadata (TIKA-1815, TIKA-1816).
* A significant speed-up is made to the GeoTopicParser by
using the new REST server capabilities from Lucene Geo
Gazetteer (TIKA-1803).
* A parser to compute motion properties in Videos, e.g.,
Histogram of Oriented Gradients and Histogram of Optical Flows
using the Pooled Time Series algorithm, was added (TIKA-1798).
* Provide NamedEntityParser which exposes Named Entity Recognition
from OpenNLP and Stanford NER providers (TIKA-1787, GitHub-61,
GitHub-62).
* Allow XHTMLContentHandler to pass attributes of html element
via Markus Jelsma (TIKA-1782).
* Fix regression with spacing in PPT via Andreas Beeker (TIKA-1777).
* Tika Facade parse methods for Path and File added which take a
Metadata object, to mirror the existing InputStream one (GitHub-60)
* GeoParser fix for loading the NER model from a jar file (TIKA-1791)
Release 1.11 - 10/18/2015
* Java7 API support for allowing java.nio.file.Path as method arguments
was added to Tika and to ParsingReader, TikaFileTypeDetector, and to
Tika Config (TIKA-1745, TIKA-1746, TIKA-1751).
* MIME support was added for WebVTT: The Web Video Text Tracks Format
files (TIKA-1772).
* MIME magic improved to ensure emails are detected as message/rfc822
(TIKA-1771).
* Upgrade to Jackcess Encrypt 2.1.1 to avoid binary incompatibility
with Bouncy Castle (TIKA-1736).
* Make div and other markup more consistent between PPT and
PPTX (TIKA-1755).
* Parse multiple authors from MSOffice's semi-colon delimited
author field (TIKA-1765).
* Include CTAKESConfig.properties within tika-parsers resources
by default (TIKA-1741).
* Prevent infinite recursion when processing inline images
in PDF files by limiting extraction of duplicate images
within the same page (TIKA-1742).
* Upgrade to POI 3.13-final (via Andreas Beeker) (TIKA-1707).
* Upgraded tika-batch to use Path throughout (TIKA-1747 and
TIKA-1754).
* Upgraded to Path in TikaInputStream (via Yaniv Kunda) (TIKA-1744).
* Changed default content handler type for "/rmeta" in tika-server
to "xml" to align with "-J" option in tika-app.
Clients can now specify handler types via PathParam. (TIKA-1716).
* GROBID (GeneRation Of BIbliographic Data), a machine learning library
for extracting bibliographic data from PDF files, is now integrated as a
Tika parser (TIKA-1699, TIKA-1712).
* The ability to specify the Tesseract Config Path was added
to the OCR Parser (TIKA-1703).
* Upgraded to ASM 5.0.4 (TIKA-1705).
* Corrected Tika Config XML detector definition explicit loading
of MimeTypes (TIKA-1708)
* In Tika Parsers, Batch, Server, App and Examples, use Apache
Commons IO instead of inlined ex-Commons classes, and the Java 7
Standard Charset definitions (TIKA-1710)
* Upgraded to Commons Compress 1.10, which enables zlib compressed
archives support (TIKA-1718)
Release 1.10 - 8/1/2015
* Tika Config XML can now be used to create composite detectors,
and exclude detectors that DefaultDetector would otherwise
have used. This brings support in line with Parsers. (TIKA-1702)
* Reverted to legacy sort order of parsers that was
mistakenly reversed in Tika 1.9 (TIKA-1689).
* Upgrade to POI 3.13-beta1 (TIKA-1667).
* Upgrade to PDFBox 1.8.10 (TIKA-1588).
* MimeTypes now tries to find a registered type with and
without parameters (TIKA-1692).
* Added more robust error handling for encoding detection
of .MSG files (TIKA-1238).
* Fixed bug in Tika's use of the Jackcess parser that
prevented reading of v97 Access files (TIKA-1681).
* Upgrade xerial.org's sqlite-jdbc to 3.8.10.1. NOTE:
as of Tika 1.9, this jar is "provided." Make sure
to upgrade your provided jar! (TIKA-1687).
* Add header/footer extraction to xls (via Aeham Abushwashi)
(TIKA-1400).
* Drop the source file name from the embedded file path in
RecursiveParserWrapper's "X-TIKA:embedded_resource_path"
(TIKA-1673).
* Upgraded to Java 7 (TIKA-1536).
* Non-standards compliant emails are now correctly detected
as message/rfc822 (TIKA-1602).
* Added parser for MS Access files via Jackcess. Many thanks
to Health Market Science, Brian O'Neill and James Ahlborn
for relicensing Jackcess to Apache v2! (TIKA-1601)
* GDALParser now correctly sets "nitf" as a supported
MediaType (TIKA-1664).
* Added DigestingParser to calculate digest hashes
and record them in metadata. Integrated with
tika-app and tika-server (TIKA-1663).
* Fixed ZipContainerDetector to detect all IPA files
(TIKA-1659).
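As a rough illustration of the DigestingParser entry above (TIKA-1663), the sketch below shows what "calculate digest hashes and record them in metadata" can look like using only the JDK. It is not Tika's implementation: the class name DigestSketch, the method hexDigest and the metadata key are hypothetical, and a plain Map stands in for Tika's Metadata object.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: DigestSketch is a hypothetical name, not a Tika class.
final class DigestSketch {

    // Reads the stream to the end and returns the digest as lowercase hex.
    static String hexDigest(InputStream in, String algorithm) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = "hello".getBytes(StandardCharsets.UTF_8);
        // A plain map stands in for a metadata object; the key is made up.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("sketch:digest:MD5", hexDigest(new ByteArrayInputStream(doc), "MD5"));
        System.out.println(metadata);
    }
}
```

Reading the stream in fixed-size chunks keeps memory use flat for large files, which is the main reason to digest a stream rather than a fully loaded byte array.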
Release 1.9 - 6/6/2015
* The ability to use the cTAKES clinical text
knowledge extraction system for biomedical data is
now included as a Tika parser (TIKA-1645, TIKA-1642).
* Tika-server allows a user to specify the Tika config
from the command line (TIKA-1652, TIKA-1426).
* Matlab file detection has been improved (TIKA-1634).
* The EXIFTool was added as an External parser
(TIKA-1639).
* If FFMPEG is installed and on the PATH, it is a
usable Parser in Tika now (TIKA-1510).
* Fixes have been applied to the ExternalParser to make
it functional (TIKA-1638).
* Tika service loading can now be more verbose with the
org.apache.tika.service.error.warn system property (TIKA-1636).
* Tika Server now allows for metadata extraction from remote
URLs and in addition it outputs the detected language as a
metadata field (TIKA-1625).
* Fixed OUTPUT_FILE_TOKEN not being replaced in ExternalParser,
contributed by Pascal Essiembre (TIKA-1620).
* Tika REST server now supports language identification
(TIKA-1622).
* All of the example code from the Tika in Action book has
been donated to Tika and added to tika-examples (TIKA-1562).
* Tika server now logs errors determining ContentDisposition
(TIKA-1621).
* An algorithm for using Byte Histogram frequencies to construct
a Neural Network and to perform MIME detection was added
(TIKA-1582).
* A Bayesian algorithm for MIME detection by probabilistic
means was added (TIKA-1517).
* Tika now incorporates the Apache Spatial Information
System capability of parsing Geographic ISO 19139
files (TIKA-443). It can also detect those files.
* Update the MimeTypes code to support inheritance
(TIKA-1535).
* Provide ability to parse and identify Global Change
Master Directory Interchange Format (GCMD DIF)
scientific data files (TIKA-1532).
* Improvements to detect CBOR files by extension (TIKA-1610).
* Change xerial.org's sqlite-jdbc jar to "provided" (TIKA-1511).
Users will now need to add sqlite-jdbc to their classpath for
the Sqlite3Parser to work.
* ExternalParser.check now catches (suppresses) SecurityException
and returns false, so it's OK to run Tika with a security policy
that does not allow execution of external processes (TIKA-1628).
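The byte histogram entry above (TIKA-1582) feeds byte frequencies to a neural network. The sketch below covers only the input side, under assumptions: it counts how often each of the 256 byte values occurs and scales the counts to relative frequencies. The class name ByteHistogram and the choice to normalize by input length are illustrative and may differ from what Tika does. No network is shown.

```java
// Illustrative only: ByteHistogram is a hypothetical name, not a Tika class.
final class ByteHistogram {

    // Returns 256 relative frequencies, one per byte value 0..255.
    // An empty input yields an all-zero histogram.
    static double[] frequencies(byte[] data) {
        double[] freq = new double[256];
        if (data.length == 0) {
            return freq;
        }
        for (byte b : data) {
            freq[b & 0xff] += 1.0; // mask so negative bytes map to 128..255
        }
        for (int i = 0; i < freq.length; i++) {
            freq[i] /= data.length;
        }
        return freq;
    }

    public static void main(String[] args) {
        double[] h = frequencies(new byte[] {0, 0, 1, (byte) 255});
        System.out.println(h[0] + " " + h[1] + " " + h[255]);
    }
}
```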
Release 1.8 - 4/13/2015
* Fix null pointer when processing ODT footer styles (TIKA-1600).
* Upgrade com.drewnoakes' metadata-extractor to 2.0 and
add parser for webp metadata (TIKA-1594).
* Duration extracted from MP3s with no ID3 tags (TIKA-1589).
* Upgraded to PDFBox 1.8.9 (TIKA-1575).
* Tika now supports the IsaTab data standard for bioinformatics
both in terms of MIME identification and in terms of parsing
(TIKA-1580).
* Tika server can now enable CORS requests with the command line
"--cors" or "-C" option (TIKA-1586).
* Update jhighlight dependency to avoid using the LGPL license. Thanks to
@kkrugler for his great contribution (TIKA-1581).
* Updated HDF and NetCDF parsers to output file version in
metadata (TIKA-1578 and TIKA-1579).
* Upgraded to POI 3.12-beta1 (TIKA-1531).
* Added tika-batch module for directory to directory batch
processing. This is a new, experimental capability, and the API will
likely change in future releases (TIKA-1330).
* Translator.translate() Exceptions are now restricted to
TikaException and IOException (TIKA-1416).
* Tika now supports MIME detection for Microsoft Enhanced
Metafile (EMF) files (TIKA-1554).
* Tika has improved delineation in XML and HTML MIME detection
(TIKA-1365).
* Upgraded the Drew Noakes metadata-extractor to version 2.7.2
(TIKA-1576).
* Added basic style support for ODF documents, contributed by
Axel Dörfler (TIKA-1063).
* Move Tika server resources and writers to separate
org.apache.tika.server.resource and writer packages (TIKA-1564).
* Upgrade UCAR dependencies to 4.5.5 (TIKA-1571).
* Fix Paths in Tika server welcome page (TIKA-1567).
* Fixed infinite recursion while parsing some PDFs (TIKA-1038).
* XHTMLContentHandler now properly passes along body attributes,
contributed by Markus Jelsma (TIKA-995).
* TikaCLI option --compare-file-magic to report mime types known to
the file(1) tool but not known / fully known to Tika.
* MediaTypeRegistry support for returning known child types.
* Support for excluding (blacklisting) certain Parsers from being
used by DefaultParser via the Tika Config file, using the new
parser-exclude tag (TIKA-1558).
* Detect Global Change Master Directory (GCMD) Directory
Interchange Format (DIF) files (TIKA-1561).
* Tika's JAX-RS server can now return stacktraces for
parse exceptions (TIKA-1323).
* Added MockParser for testing handling of exceptions, errors
and hangs in code that uses parsers (TIKA-1553).
* The ForkParser service was removed from Activator (rollback of TIKA-1354).
* Increased the speed of language identification by
a factor of two -- contributed by Toke Eskildsen (TIKA-1549).
* Added parser for Sqlite3 db files. Some users will need to
exclude the dependency on xerial.org's sqlite-jdbc because
it contains native libs (TIKA-1511).
* Use POST instead of PUT for tika-server form methods
(TIKA-1547).
* A basic wrapper around the UNIX file command was
added to extract Strings. In addition, a parser to
handle Strings parsing from octet-streams using Latin1
charsets was added (TIKA-1541, TIKA-1483).
* Add test files and detection mechanism for Gridded
Binary (GRIB) files (TIKA-1539).
* The RAR parser was updated to handle Chinese characters
using the functionality provided by allowing encoding to
be used within ZipArchiveInputStream (TIKA-936).
* Fix out of memory error in surefire plugin (TIKA-1537).
* Build a parser to extract data from GRIB formats (TIKA-1423).
* Upgrade to Commons Compress 1.9 (TIKA-1534).
* Include media duration in metadata parsed by MP4Parser (TIKA-1530).
* Support password protected 7zip files (using a PasswordProvider,
in keeping with the other password supporting formats) (TIKA-1521).
* Password protected Zip files should not trigger an exception (TIKA-1028).
Release 1.7 - 1/9/2015
* Fixed resource leak in OutlookPSTParser that caused TikaException
when invoked via AutoDetectParser on Windows (TIKA-1506).
* HTML tags are properly stripped from content by FeedParser
(TIKA-1500).
* Tika Server support for selecting a single metadata key;
wrapped MetadataEP into MetadataResource (TIKA-1499).
* Tika Server support for JSON and XMP views of metadata (TIKA-1497).
* Tika Parent uses dependency management to keep duplicate
dependencies in different modules the same version (TIKA-1384).
* Upgraded slf4j to version 1.7.7 (TIKA-1496).
* Tika Server support for RecursiveParserWrapper's JSON output
(endpoint=rmeta) equivalent to (TIKA-1451's) -J option
in tika-app (TIKA-1498).
* Tika Server support for providing the password for files on a
per-request basis through the Password http header (TIKA-1494).
* Simple support for the BPG (Better Portable Graphics) image format
(TIKA-1491, TIKA-1495).
* Prevent exceptions from being thrown for some malformed
mp3 files (TIKA-1218).
* Reformat pom.xml files to use two spaces per indent (TIKA-1475).
* Fix warning of slf4j logger on Tika Server startup (TIKA-1472).
* Tika CLI and GUI now have option to view JSON rendering of output
of RecursiveParserWrapper (TIKA-1451).
* Tika now integrates the Geospatial Data Abstraction Library
(GDAL) for parsing hundreds of geospatial formats (TIKA-605,
TIKA-1503).
* ExternalParsers can now use Regexs to specify dynamic keys
(TIKA-1441).
* Thread safety issues in ImageMetadataExtractor were resolved
(TIKA-1369).
* The ForkParser service is now registered in Activator
(TIKA-1354).
* The Rome Library was upgraded to version 1.5 (TIKA-1435).
* Add markup for files embedded in PDFs (TIKA-1427).
* Extract files embedded in annotations in PDFs (TIKA-1433).
* Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442).
* Add RecursiveParserWrapper (aka Jukka's and Nick's
RecursiveMetadataParser) (TIKA-1329).
* Add example for how to dump TikaConfig to XML (TIKA-1418).
* Allow users to specify a tika config file for tika-app (TIKA-1426).
* PackageParser includes the last-modified date from the archive
in the metadata, when handling embedded entries (TIKA-1246)
* Created a new Tesseract OCR Parser to extract text from images.
Requires installation of Tesseract before use (TIKA-93).
* Basic parser for older Excel formats, such as Excel 4, 5 and 95,
which can get simple text, and metadata for Excel 5+95 (TIKA-1490)
Release 1.6 - 08/31/2014
* Parse output should indicate which Parser was actually used
(TIKA-674).
* Use the forbidden-apis Maven plugin to check for unsafe Java
operations (TIKA-1387).
* Created an ExternalTranslator class to interface with command
line Translators (TIKA-1385).
* Created a MosesTranslator as a subclass of ExternalTranslator
that calls the Moses Decoder machine translation program (TIKA-1385).
* Created the tika-example module. It will have examples of how to
use the main Tika interfaces (TIKA-1390).
* Upgraded to Commons Compress 1.8.1 (TIKA-1275).
* Upgraded to POI 3.11-beta1 (TIKA-1380).
* Tika now extracts SDTCell content from tables in .docx files (TIKA-1317).
* Tika now supports detection of the Persian/Farsi language.
(TIKA-1337)
* The Tika Detector interface is now exposed through the JAX-RS
server (TIKA-1336).
* Tika now has support for parsing binary Matlab files as part of
our larger effort to increase the number of scientific data formats
supported. (TIKA-1327)
* The Tika Server URLs for the unpacker resources have been changed,
to bring them under a common prefix (TIKA-1324). The mapping is
/unpacker/{id} -> /unpack/{id}
/all/{id} -> /unpack/all/{id}
* Added module and core Tika interface for translating text between
languages and added a default implementation that calls Microsoft's
translate service (TIKA-1319)
* Added a Translator implementation that calls Lingo24's Premium
Machine Translation API (TIKA-1381)
* Made RTFParser's list handling slightly more robust against corrupt
list metadata (TIKA-1305)
* Fixed bug in CLI json output (TIKA-1291/TIKA-1310)
* Added ability to turn off image extraction from PDFs (TIKA-1294).
Users must now turn on this capability via the PDFParserConfig.
* Upgrade to PDFBox 1.8.6 (TIKA-1290, TIKA-1231, TIKA-1233, TIKA-1352)
* Zip Container Detection for DWFX and XPS formats, which are OPC
based (TIKA-1204, TIKA-1221)
* Added a user facing welcome page to the Tika Server, which
says what it is, and a very brief summary of what is available.
(TIKA-1269)
* Added Tika Server endpoints to list the available mime types,
Parsers and Detectors, similar to the --list-<foo> methods on
the Tika CLI App (TIKA-1270)
* Improvements to NetCDF and HDF parsing to mimic the output of
ncdump and extract text dimensions and spatial and variable
information from scientific data files (TIKA-1265)
* Extract attachments from RTF files (TIKA-1010)
* Support Outlook Personal Folders File Format *.pst (TIKA-623)
* Added mime entries for additional Ogg based formats (TIKA-1259)
* Updated the Ogg Vorbis plugin to v0.4, which adds detection for a wider
range of Ogg formats, and parsers for more Ogg Audio ones (TIKA-1113)
* PDF: Images in PDF documents can now be extracted as embedded resources.
(TIKA-1268)
* Fixed RuntimeException thrown for certain Word Documents (TIKA-1251).
* CLI: TikaCLI now has another option: --list-parser-details-apt, which outputs
the list of supported parsers in APT format. This is used to generate the list
on the formats page (TIKA-411).
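For clients affected by the unpacker URL change in this release (TIKA-1324), the mapping listed above can be sketched as a small path rewrite. This is a hypothetical client-side helper, not part of Tika: the class name UnpackerPathMigration and the sample file name are made up, and only the two prefix mappings come from the entry itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a hypothetical client-side helper, not a Tika class.
final class UnpackerPathMigration {

    // Old prefix -> new prefix, as listed in the TIKA-1324 entry.
    private static final Map<String, String> PREFIXES = new LinkedHashMap<>();
    static {
        PREFIXES.put("/unpacker/", "/unpack/");
        PREFIXES.put("/all/", "/unpack/all/");
    }

    // Rewrites a pre-1.6 resource path to the new layout. Paths that do not
    // start with one of the old prefixes are returned unchanged.
    static String migrate(String path) {
        for (Map.Entry<String, String> e : PREFIXES.entrySet()) {
            if (path.startsWith(e.getKey())) {
                return e.getValue() + path.substring(e.getKey().length());
            }
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(migrate("/unpacker/report.docx")); // prints /unpack/report.docx
        System.out.println(migrate("/all/report.docx"));      // prints /unpack/all/report.docx
    }
}
```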
Release 1.5 - 02/04/2014
* Fixed bug in handling of embedded file processing in PDFs (TIKA-1228).
* Added SourceCodeParser to support Java, Groovy, C++ files (TIKA-1224).
* Updated Tika Server to support multipart/form-data payloads (TIKA-1198).
* Updated Tika Server to CXF 2.7.8 (TIKA-1197).
* Updated Tika Server to accept requests over wildcard addresses (TIKA-1196).
* Added option to use alternate NonSequentialPDFParser (TIKA-1201).
* Content from PDF AcroForms is now extracted (TIKA-973).
* Fixed invalid asterisks from master slide in PPT (TIKA-1171).
* Added test cases to confirm handling of auto-date in PPT and PPTX (TIKA-817).
* Text from tables in PPT files is once again extracted correctly (TIKA-1076).
* Text is extracted from text boxes in XLSX (TIKA-1100).
* Tika no longer hangs when processing Excel files with custom fraction format (TIKA-1132).
* Disconcerting stacktrace from missing beans no longer printed for some DOCX files (TIKA-792).
* Upgraded POI to 3.10-beta2 (TIKA-1173).
* Upgraded PDFBox to 1.8.4 (TIKA-1230).
* Made HtmlEncodingDetector more flexible in finding meta
header charset (TIKA-1001).
* Added sanitized test HTML file for local file test (TIKA-1139).
* Fixed bug that prevented attachments within a PDF from being processed
if the PDF itself was an attachment (TIKA-1124).
* Text from paragraph-level structured document tags in DOCX files is now extracted (TIKA-1130).
* RTF: Fixed ArrayIndexOutOfBoundsException when parsing list override (TIKA-1192).
* CLI: TikaCLI now escapes invalid filename characters as hex
characters (TIKA-1078).
Release 1.4 - 06/15/2013
* Removed a test HTML file with a poorly chosen GPL text in it (TIKA-1129).
* Improvements to tika-server to allow it to produce text/html and
text/xml content (TIKA-1126, TIKA-1127).
* Improvements were made to the Compressor Parser to handle gzipped files
that require the decompressConcatenated option set to true (TIKA-1096).
* Addressed a typographic error that was preventing detection of
awk files (TIKA-1081).
* Added a new end-point to Tika's JAX-RS REST server that only detects
the media-type based on a small portion of the document submitted
(TIKA-1047).
* RTF: Ordered and unordered lists are now extracted (TIKA-1062).
* MP3: Audio duration is now extracted (TIKA-991)
* Java .class files: upgraded from ASM 3.1 to ASM 4.1 for parsing
the Java bytecodes (TIKA-1053).
* Mime Types: Definitions extended to optionally include Link (URL) and
UTI, along with details for several common formats (TIKA-1012 / TIKA-1083)
* Exceptions when parsing OLE10 embedded documents, when parsing
summary information from Office documents, and when saving
embedded documents in TikaCLI are now logged instead
of aborting extraction (TIKA-1074)
* MS Word: line tabular character is now replaced with newline
(TIKA-1128)
* XML: ElementMetadataHandlers can now optionally accept duplicate
and empty values (TIKA-1133)
Release 1.3 - 01/19/2013
* Mimetype definitions added for more common programming languages,
including common extensions, but not magic patterns. (TIKA-1055)
* MS Word: When a Word (.doc) document contains embedded files or
links to external documents, Tika now places a <div
class="embedded" id="_XXX"/> placeholder into the XHTML so you can
see where in the main text the embedded document occurred
(TIKA-956, TIKA-1019). Embedded Wordpad/RTF documents are now
recognized (TIKA-982).
* PDF: Text from pop-up annotations is now extracted (TIKA-981).
Text from bookmarks is now extracted (TIKA-1035).
* PKCS7: Detached signatures no longer throw NullPointerException
(TIKA-986).
* iWork: The chart name for charts embedded in Numbers documents is
now extracted (TIKA-918).
* CLI: TikaCLI -m now handles multi-valued metadata keys correctly
(previously it only printed the first value). (TIKA-920)
* MS Word (.docx): When a Word (.docx) document contains embedded
files, Tika now places a <div class="embedded" id="XXX"/> into the
XHTML so you can see where in the main text the embedded document
occurred. The id (rId) is included in the Metadata of each
embedded document as the new Metadata.EMBEDDED_RELATIONSHIP_ID
key, and TikaCLI prepends the rId (if present) onto the filename
it extracts (TIKA-989). Fixed NullPointerException when style is
null (TIKA-1006). Text inside text boxes is now extracted
(TIKA-1005).
* RTF: Page, word, character count and creation date metadata are
now extracted for RTF documents (TIKA-999).
* MS PowerPoint (.pptx): When a PowerPoint (.pptx) document contains
embedded files, Tika now places a <div class="embedded" id="XXX"/> into the
XHTML so you can see where in the main text the embedded document
occurred. The id (rId) is included in the Metadata of each
embedded document as the new Metadata.EMBEDDED_RELATIONSHIP_ID
key, and TikaCLI prepends the rId (if present) onto the filename
it extracts (TIKA-997, TIKA-1032).
* MS PowerPoint (.ppt): When a PowerPoint (.ppt) document contains
embedded files, Tika now places a <div class="embedded" id="XXX"/> into the
XHTML so you can see where in the main text the embedded document
occurred (TIKA-1025). Text from the master slide is now extracted
(TIKA-712).
* MHTML: fixed Null charset name exception when a mime part has an
unrecognized charset (TIKA-1011).
* MP3: if an ID3 tag was encoded in UTF-16 with only the BOM then on
certain JVMs this would incorrectly extract the BOM as the tag's
value (TIKA-1024).
* ZIP: placeholders (<div class="embedded" id="<entry name>"/>) are
now left in the XHTML so you can see where each archive member
appears (TIKA-1036). TikaCLI would hit FileNotFoundException when
extracting files that were under sub-directories from a ZIP
archive, because it failed to create the parent directories first
(TIKA-1031).
* XML: a space character is now added before each element
(TIKA-1048)
Release 1.2 - 07/10/2012
---------------------------------
* Tika's JAX-RS based Network server now is based on Apache CXF,
which is available in Maven Central and now allows the server
module to be packaged and included in our release
(TIKA-593, TIKA-901).
* Tika: parseToString now lets you specify the max string length
per-call, in addition to per-Tika-instance. (TIKA-870)
* Tika now has the ability to detect FITS (Flexible Image Transport System)
files (TIKA-874).
* Images: Fixed file handle leak in ImageParser. (TIKA-875)
* iWork: Comments in Pages files are now extracted (TIKA-907).
Headers, footers and footnotes in Pages files are now extracted
(TIKA-906). Don't throw NullPointerException on password
protected iWork files, even though we can't parse their contents
yet (TIKA-903). Text extracted from Keynote text boxes and bullet
points no longer runs together (TIKA-910). Also extract text for
Pages documents created in layout mode (TIKA-904). Table names
are now extracted in Numbers documents (TIKA-924). Content added
to master slides is also extracted (TIKA-923).
* Archive and compression formats: The Commons Compress dependency was
upgraded from 1.3 to 1.4.1. With this change Tika can now also parse
Unix dump archives and documents compressed using the XZ and Pack200
compression formats. (TIKA-932)
* KML: Tika now has basic support for Keyhole Markup Language documents
(KML and KMZ) used by tools like Google Earth. See also
http://www.opengeospatial.org/standards/kml/. (TIKA-941)
* CLI: You can now use the TIKA_PASSWORD environment variable or the
--password=X command line option to specify the password that Tika CLI
should use for opening encrypted documents (TIKA-943).
* Character encodings: Tika's character encoding detection mechanism was
improved by adding integration to the juniversalchardet library that
implements Mozilla's universal charset detection algorithm. The slower
ICU4J algorithms are still used as a fallback thanks to their wider
coverage of custom character encodings. (TIKA-322, TIKA-471)
* Charset parameter: Related to the character encoding improvements
mentioned above, Tika now returns the detected character encoding as
a "charset" parameter of the content type metadata field for text/plain
and text/html documents. For example, instead of just "text/plain", the
returned content type will be something like "text/plain; charset=UTF-8"
for a UTF-8 encoded text document. Character encoding information is still
present also in the content encoding metadata field for backwards
compatibility, but that field should be considered deprecated. (TIKA-431)
* Extraction of embedded resources from OLE2 Office Documents, where
the resource isn't another office document, has been fixed (TIKA-948)
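For the charset parameter entry above (TIKA-431), a client that previously read the content encoding field can take the charset from the content type value instead. The sketch below is one way to do that with only the JDK. The class and method names are hypothetical. The parsing is deliberately simple and assumes parameter values do not contain semicolons.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: ContentTypeCharset is a hypothetical name, not a Tika class.
final class ContentTypeCharset {

    // Returns the media type without parameters, e.g. "text/plain".
    static String baseType(String contentType) {
        int semi = contentType.indexOf(';');
        return (semi < 0 ? contentType : contentType.substring(0, semi)).trim();
    }

    // Returns the charset named by the "charset" parameter, or the fallback
    // when the parameter is missing or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim();
            String value = param.substring(eq + 1).trim();
            if (!name.equalsIgnoreCase("charset")) {
                continue;
            }
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Charset.forName(value);
            } catch (IllegalArgumentException e) {
                // Covers both illegal and unsupported charset names.
                return fallback;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String contentType = "text/plain; charset=UTF-8";
        System.out.println(baseType(contentType));
        System.out.println(charsetOf(contentType, StandardCharsets.ISO_8859_1));
    }
}
```

Passing an explicit fallback keeps the caller in control of what happens for documents where no charset was detected.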
Release 1.1 - 3/7/2012
---------------------------------
* Link Extraction: The rel attribute is now extracted from
links per the LinkContentHandler. (TIKA-824)
* MP3: Fixed handling of UTF-16 (two byte) ID3v2 tags (previously
the last character in a UTF-16 tag could be corrupted) (TIKA-793)
* Performance: Loading of the default media type registry is now
significantly faster. (TIKA-780)
* PDF: Allow controlling whether overlapping duplicated text should
be removed. Disabling this (the default) can give big
speedups to text extraction and may workaround cases where
non-duplicated characters were incorrectly removed (TIKA-767).
Allow controlling whether text tokens should be sorted by their x/y
position before extracting text (TIKA-612); this is necessary for
certain PDFs. Fixed cases where too many </p> tags appear in the
XHTML output, causing NPE when opening some PDFs with the GUI
(TIKA-778).
* RTF: Fixed case where a font change would result in processing
bytes in the wrong font's charset, producing bogus text output
(TIKA-777). Don't output whitespace in ignored group states,
avoiding excessive whitespace output (TIKA-781). Binary embedded
content (using \bin control word) is now skipped correctly;
previously it could cause the parser to incorrectly extract binary
content as text (TIKA-782).
* CLI: New TikaCLI option "--list-detectors", which displays the
mimetype detectors that are available, similar to the existing
"--list-parsers" option for parsers. (TIKA-785).
* Detectors: The order of detectors, as supplied via the service
registry loader, is now controlled. User supplied detectors are
preferred, then Tika detectors (such as the container aware ones),
and finally the core Tika MimeTypes is used as a backup. This
allows for specific, detailed detectors to take preference over
the default mime magic + filename detector. (TIKA-786)
* Microsoft Project (MPP): Filetype detection has been fixed,
and basic metadata (but no text) is now extracted. (TIKA-789)
* Outlook: fixed NullPointerException in TikaGUI when messages with
embedded RTF or HTML content were filtered (TIKA-801).
* Ogg Vorbis and FLAC: Parser added for Ogg Vorbis and FLAC audio
files, which extract audio metadata and tags (TIKA-747)
* MP4: Improved mime magic detection for MP4 based formats (including
QuickTime, MP4 Video and Audio, and 3GPP) (TIKA-851)
* MP4: Basic metadata extracting parser for MP4 files added, which includes
limited audio and video metadata, along with the iTunes media metadata
(such as Artist and Title) (TIKA-852)
* Document Passwords: A new ParseContext object, PasswordProvider,
has been added. This provides a way to supply the password for
a document during processing. Currently, only password protected
PDFs and Microsoft OOXML Files are supported. (TIKA-850)
Release 1.0 - 11/4/2011
---------------------------------
The most notable changes in Tika 1.0 over previous releases are:
* API: All methods, classes and interfaces that were marked as
deprecated in Tika 0.10 have been removed to clean up the API
(TIKA-703). You may need to adjust and recompile client code
accordingly. The declared OSGi package versions are now 1.0, and
will thus not resolve for client bundles that still refer to 0.x
versions (TIKA-565).
* Configuration: The context class loader of the current thread is
no longer used as the default for loading configured parser and
detector classes. You can still pass an explicit class loader
to the configuration mechanism to get the previous behaviour.
(TIKA-565)
* OSGi: The tika-core bundle will now automatically pick up and use
any available Parser and Detector services when deployed to an OSGi
environment. The tika-parsers bundle provides such services for all
the supported file formats for which the upstream parser library
is available. If you don't want to track all the parser libraries as
separate OSGi bundles, you can use the tika-bundle bundle that packages
tika-parsers together with all its upstream dependencies. (TIKA-565)
* RTF: Hyperlinks in RTF documents are now extracted as an <a
href=...>...</a> element (TIKA-632). The RTF parser is also now
more robust when encountering too many closing }'s vs. opening {'s
(TIKA-733).
* MS Word: From Word (.doc) documents we now extract optional hyphen
as Unicode zero-width space (U+200B), and non-breaking hyphen as
Unicode non-breaking hyphen (U+2011). (TIKA-711)
* Outlook: Tika can now also process attachments in Outlook messages.
(TIKA-396)
* MS Office: Performance of extracting embedded office docs was improved.
(TIKA-753)
* PDF: The PDF parser now extracts paragraphs within each page
(TIKA-742) and can now optionally extract text from PDF
annotations (TIKA-738). There's also an option to enable (the
default) or disable auto-space insertion (TIKA-724).
* Language detection: Tika can now detect Belarusian, Catalan,
Esperanto, Galician, Lithuanian (TIKA-582), Romanian, Slovak,
Slovenian, and Ukrainian (TIKA-681).
* Java: Tika no longer ships retrotranslated Java 1.4 binaries along
with the normal ones that work with Java 5 and higher. (TIKA-744)
* OpenOffice documents: header/footer text is now extracted for text,
presentation and spreadsheet documents (TIKA-736)
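Client code that indexes or compares text extracted from Word documents
may want to account for the characters introduced by TIKA-711. The
following is a minimal sketch under stated assumptions, not part of
Tika: the class WordHyphenSketch and its normalize method are
hypothetical, and dropping U+200B while mapping U+2011 to an ASCII
hyphen is only one possible policy.

```java
final class WordHyphenSketch {

    // Characters that the Word parser emits according to the TIKA-711 entry.
    static final char ZERO_WIDTH_SPACE = '\u200B';    // optional hyphen
    static final char NON_BREAKING_HYPHEN = '\u2011'; // non-breaking hyphen

    // Hypothetical normalisation for search or comparison: remove the
    // optional-hyphen marker and turn non-breaking hyphens into '-'.
    static String normalize(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            if (c == ZERO_WIDTH_SPACE) {
                continue; // invisible break opportunity, not content
            }
            out.append(c == NON_BREAKING_HYPHEN ? '-' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "co" + ZERO_WIDTH_SPACE + "operate non"
                + NON_BREAKING_HYPHEN + "breaking";
        System.out.println(normalize(sample));
    }
}
```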
Tika 1.0 relies on the following set of major dependencies (generated using
mvn dependency:tree from tika-parsers):
org.apache.tika:tika-parsers:bundle:1.0
+- org.apache.tika:tika-core:jar:1.0:compile
+- edu.ucar:netcdf:jar:4.2-min:compile
| \- org.slf4j:slf4j-api:jar:1.5.6:compile
+- org.apache.james:apache-mime4j-core:jar:0.7:compile
+- org.apache.james:apache-mime4j-dom:jar:0.7:compile
+- org.apache.commons:commons-compress:jar:1.3:compile
+- commons-codec:commons-codec:jar:1.5:compile
+- org.apache.pdfbox:pdfbox:jar:1.6.0:compile
| +- org.apache.pdfbox:fontbox:jar:1.6.0:compile
| +- org.apache.pdfbox:jempbox:jar:1.6.0:compile
| \- commons-logging:commons-logging:jar:1.1.1:compile
+- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
+- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
+- org.apache.poi:poi:jar:3.8-beta4:compile
+- org.apache.poi:poi-scratchpad:jar:3.8-beta4:compile
+- org.apache.poi:poi-ooxml:jar:3.8-beta4:compile
| +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta4:compile
| | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
| \- dom4j:dom4j:jar:1.6.1:compile
+- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
+- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
+- asm:asm:jar:3.1:compile
+- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
+- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
+- rome:rome:jar:0.9:compile
\- jdom:jdom:jar:1.0:compile
The following people have contributed to Tika 1.0 by submitting or commenting
on the issues resolved in this release:
Andrzej Bialecki
Antoni Mylka
Benson Margulies
Chris A. Mattmann
Cristian Vat
Dave Meikle
David Smiley
Dennis Adler
Erik Hetzner
Ingo Renner
Jeremias Maerki
Jeremy Anderson
Jeroen van Vianen
John Bartak
Jukka Zitting
Julien Nioche
Ken Krugler
Mark Butler
Maxim Valyanskiy
Michael Bryant
Michael McCandless
Nick Burch
Pablo Queixalos
Uwe Schindler
Žygimantas Medelis
See http://s.apache.org/Zk6 for more details on these contributions.
Release 0.10 - 09/25/2011
-------------------------
The most notable changes in Tika 0.10 over previous releases are:
* A parser for CHM help files was added. (TIKA-245)
* TIKA-698: Invalid characters are now replaced with the Unicode
replacement character (U+FFFD), whereas before such characters were
replaced with spaces, so you may need to change your processing of
Tika's output to now handle U+FFFD.
* The RTF parser was rewritten to perform its own direct shallow
parse of the RTF content, instead of using RTFEditorKit from
javax.swing. This fixes several issues in the old parser,
including doubling of Unicode characters in certain cases
(TIKA-683), exceptions on malformed RTF docs (TIKA-666), and
missing text from some elements (header/footer, hyperlinks,
footnotes, text inside pictures).
* Handling of temporary files within Tika was much improved
(TIKA-701, TIKA-654, TIKA-645, TIKA-153)
* The Tika GUI got a facelift and some extra features (TIKA-635)
* The apache-mime4j dependency of the email message parser was upgraded
from version 0.6 to 0.7 (TIKA-716). The parser also now accepts a
MimeConfig object in the ParseContext as configuration (TIKA-640).
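As the TIKA-698 entry notes, downstream processing may need to handle
U+FFFD in Tika's output. One way to do that is sketched below. The class
ReplacementCharSketch and its methods are hypothetical and use only the
Java standard library; mapping U+FFFD back to a space mimics the
pre-0.10 output described above and is just one possible policy.

```java
final class ReplacementCharSketch {

    // Unicode replacement character emitted for invalid characters.
    static final char REPLACEMENT = '\uFFFD';

    // Counts replacement characters, e.g. to flag documents whose
    // extracted text contained many invalid characters.
    static int countReplacements(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == REPLACEMENT) {
                count++;
            }
        }
        return count;
    }

    // Maps each replacement character to a space, approximating the
    // output that earlier releases produced.
    static String replacementsToSpaces(String text) {
        return text.replace(REPLACEMENT, ' ');
    }

    public static void main(String[] args) {
        String extracted = "caf" + REPLACEMENT + " menu";
        System.out.println(countReplacements(extracted));
        System.out.println(replacementsToSpaces(extracted));
    }
}
```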
Tika 0.10 relies on the following set of major dependencies (generated using
mvn dependency:tree from tika-parsers):
org.apache.tika:tika-parsers:bundle:0.10
+- org.apache.tika:tika-core:jar:0.10:compile
+- edu.ucar:netcdf:jar:4.2-min:compile
| \- org.slf4j:slf4j-api:jar:1.5.6:compile
+- org.apache.james:apache-mime4j-core:jar:0.7:compile
+- org.apache.james:apache-mime4j-dom:jar:0.7:compile
+- org.apache.commons:commons-compress:jar:1.1:compile
+- commons-codec:commons-codec:jar:1.4:compile
+- org.apache.pdfbox:pdfbox:jar:1.6.0:compile
| +- org.apache.pdfbox:fontbox:jar:1.6.0:compile
| +- org.apache.pdfbox:jempbox:jar:1.6.0:compile
| \- commons-logging:commons-logging:jar:1.1.1:compile
+- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
+- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
+- org.apache.poi:poi:jar:3.8-beta4:compile
+- org.apache.poi:poi-scratchpad:jar:3.8-beta4:compile
+- org.apache.poi:poi-ooxml:jar:3.8-beta4:compile
| +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta4:compile
| | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
| \- dom4j:dom4j:jar:1.6.1:compile
+- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
+- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
+- asm:asm:jar:3.1:compile
+- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
+- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
+- rome:rome:jar:0.9:compile
\- jdom:jdom:jar:1.0:compile
The following people have contributed to Tika 0.10 by submitting or commenting
on the issues resolved in this release:
Alain Viret
Alex Ott
Alexander Chow
Andreas Kemkes
Andrew Khoury
Babak Farhang
Benjamin Douglas
Benson Margulies
Chris A. Mattmann
chris hudson
Chris Lott
Cristian Vat
Curt Arnold
Cynthia L Wong
Dave Brosius
David Benson
Enrico Donelli
Erik Hetzner
Erna de Groot
Gabriele Columbro
Gavin
Geoff Jarrad
Gregory Kanevsky
gunter rombauts
Henning Gross
Henri Bergius
Ingo Renner
Ingo Wiarda
Izaak Alpert
Jan Høydahl
Jens Wilmer
Jeremy Anderson
Joseph Vychtrle
Joshua Turner
Jukka Zitting
Julien Nioche
Karl Heinz Marbaise
Ken Krugler
Kostya Gribov
Luciano Leggieri
Mads Hansen
Mark Butler
Matt Sheppard