News: Apache Tika 1.12 released, content extraction toolkit download

漂亮的石头 · 2016-02-16

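Apache Tika 1.12 has been released. Tika is a content extraction toolkit (a toolkit for text extraction). It integrates POI and PDFBox and provides a unified interface for text extraction. Tika also offers a convenient extension API for adding support for third-party file formats.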
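To make the "unified interface plus extension API" idea concrete, here is a minimal, self-contained Java sketch of the pattern. It is not Tika's API: `UnifiedExtractorSketch`, `Extractor`, `register`, and `extract` are hypothetical names invented for illustration, and the two toy formats stand in for the real libraries (such as POI and PDFBox) that Tika wraps.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: none of these names are Tika classes.
public class UnifiedExtractorSketch {

    /** Hypothetical per-format extractor, standing in for a wrapped library. */
    public interface Extractor {
        String extract(byte[] data);
    }

    // Registry of extractors keyed by lowercase file extension.
    private static final Map<String, Extractor> REGISTRY = new HashMap<>();

    static {
        // Plain text: decode the bytes as UTF-8.
        register("txt", data -> new String(data, StandardCharsets.UTF_8));
        // Toy markup format: drop anything between '<' and '>'.
        register("html", data -> new String(data, StandardCharsets.UTF_8)
                .replaceAll("<[^>]*>", "")
                .trim());
    }

    /** Extension point: callers can add support for further formats. */
    public static void register(String extension, Extractor extractor) {
        REGISTRY.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Single entry point: picks an extractor from the file extension. */
    public static String extract(String fileName, byte[] data) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Extractor extractor = REGISTRY.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.extract(data);
    }

    public static void main(String[] args) {
        byte[] html = "<p>Hello <b>Tika</b></p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(extract("page.html", html)); // prints "Hello Tika"
    }
}
```

The caller only ever sees `extract`, and a new format is supported by calling `register` without touching existing code. Tika itself detects formats from content rather than relying on file extensions alone, so treat the extension lookup here as a simplification.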

This release includes a number of improvements and bug fixes, including:

* Slide notes are now linked to the slide XHTML in the PPT output
(TIKA-1840).
* JSON tests in Tika server were updated to remove impossible casts
(Github-73).
* Fix bug in GeoTopicParser where NER is reused instead of instantiated
with each request (TIKA-1834).
* Upgrade Rome to 1.5.1 and downgrade Rome dependency to 0.9 to avoid
nasty NPE (TIKA-1820, TIKA-1516).
* The NamedEntityParser was enhanced to generate text content
in addition to metadata (TIKA-1815, TIKA-1816).
* A significant speed-up is made to the GeoTopicParser by
using the new REST server capabilities from Lucene Geo
Gazetteer (TIKA-1803).
* A parser to compute motion properties in Videos, e.g.,
Histogram of Oriented Gradients and Histogram of Optical Flows
using the Pooled Time Series algorithm, was added (TIKA-1798).
* Provide NamedEntityParser which exposes Named Entity Recognition
from OpenNLP and Stanford NER providers (TIKA-1787, GitHub-61,
GitHub-62).
* Allow XHTMLContentHandler to pass attributes of html element
via Markus Jelsma (TIKA-1782).
* Fix regression with spacing in PPT via Andreas Beeker (TIKA-1777).
* Tika Facade parse methods for Path and File added which take a
Metadata object, to mirror the existing InputStream one (GitHub-60).
* GeoParser fix for loading the NER model from a jar file (TIKA-1791).