
News: Apache Tika 1.8 released, a content extraction toolkit

Posted by 漂亮的石头 on 2015-04-21. Forum: Software News

    Apache Tika 1.8 has been released. The main changes in this version are:


    • Fix null pointer when processing ODT footer styles (TIKA-1600).


    • Upgrade com.drewnoakes' metadata-extractor to 2.0 and
      add parser for webp metadata (TIKA-1594).


    • Duration extracted from MP3s with no ID3 tags (TIKA-1589).


    • Upgraded to PDFBox 1.8.9 (TIKA-1575).


    • Tika now supports the IsaTab data standard for bioinformatics
      both in terms of MIME identification and in terms of parsing
      (TIKA-1580).


    • Tika server can now enable CORS requests with the command line
      "--cors" or "-C" option (TIKA-1586).


    • Update jhighlight dependency to avoid using LGPL license. Thanks
      to @kkrugler for his great contribution (TIKA-1581).


    • Updated HDF and NetCDF parsers to output file version in
      metadata (TIKA-1578 and TIKA-1579).


    • Upgraded to POI 3.12-beta1 (TIKA-1531).


    • Added tika-batch module for directory to directory batch
      processing. This is a new, experimental capability, and the API will
      likely change in future releases (TIKA-1330).


    • Translator.translate() Exceptions are now restricted to
      TikaException and IOException (TIKA-1416).


    • Tika now supports MIME detection for Microsoft Enhanced
      Metafiles (EMF) (TIKA-1554).


    • Tika has improved delineation in XML and HTML MIME detection
      (TIKA-1365).


    • Upgraded the Drew Noakes metadata-extractor to version 2.7.2
      (TIKA-1576).


    • Added basic style support for ODF documents, contributed by
      Axel Dörfler (TIKA-1063).


    • Move Tika server resources and writers to separate
      org.apache.tika.server.resource and writer packages (TIKA-1564).


    • Upgrade UCAR dependencies to 4.5.5 (TIKA-1571).


    • Fix Paths in Tika server welcome page (TIKA-1567).


    • Fixed infinite recursion while parsing some PDFs (TIKA-1038).


    • XHTMLContentHandler now properly passes along body attributes,
      contributed by Markus Jelsma (TIKA-995).


    • TikaCLI option --compare-file-magic to report mime types known to
      the file(1) tool but not known / fully known to Tika.


    • MediaTypeRegistry support for returning known child types.


    • Support for excluding (blacklisting) certain Parsers from being
      used by DefaultParser via the Tika Config file, using the new
      parser-exclude tag (TIKA-1558).

    For details, see the release page.

    This version is now available for download:

    http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip


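    Tika is a toolkit for content extraction (text extraction). It integrates POI and PDFBox and provides a unified interface for text extraction. Tika also provides a convenient extension API for adding support for third-party file formats.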

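    Tika's API is simple to use. Its core is the Parser interface, which defines a parse method:
    public void parse(InputStream stream, ContentHandler handler, Metadata metadata)
    The stream parameter carries the file stream to be parsed, the extracted text is passed to handler, and the metadata is written into metadata.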
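
The contract described above (bytes in through the stream, text out through the handler, metadata filled in on the side) can be sketched with the JDK alone. This is only an illustration, not Tika code: SimpleParser, PlainTextParser, TextCollector and extractText are hypothetical names, and a plain Map stands in for Tika's Metadata class. Note also that in Tika 1.x the real Parser.parse method takes an additional ParseContext argument and can throw TikaException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ParseContractSketch {

    /** Hypothetical stand-in for Tika's Parser; a Map replaces Tika's Metadata (assumption). */
    interface SimpleParser {
        void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException;
    }

    /** Toy parser for UTF-8 plain text: emits the text as SAX character events. */
    static class PlainTextParser implements SimpleParser {
        @Override
        public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException {
            // Read the whole stream into memory (fine for a small demo input).
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            char[] chars = new String(buffer.toByteArray(), StandardCharsets.UTF_8).toCharArray();

            // Text content goes to the handler.
            handler.startDocument();
            handler.characters(chars, 0, chars.length);
            handler.endDocument();

            // Metadata is updated as a side effect of parsing.
            metadata.put("Content-Type", "text/plain");
            metadata.put("Content-Length", String.valueOf(buffer.size()));
        }
    }

    /** Handler that collects all character events into one string. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        String getText() {
            return text.toString();
        }
    }

    /** Runs the toy parser over in-memory bytes and returns the extracted text. */
    static String extractText(byte[] data, Map<String, String> metadata) {
        TextCollector handler = new TextCollector();
        try {
            new PlainTextParser().parse(new ByteArrayInputStream(data), handler, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String text = extractText("Hello, Tika".getBytes(StandardCharsets.UTF_8), metadata);
        System.out.println(text);
        System.out.println(metadata);
    }
}
```

The caller chooses the handler, so the same parser could feed a string collector, a file writer, or an indexer without changes.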

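    You can use Tika's ParserUtils utility to obtain a suitable Parser for a file based on its MIME type. Alternatively, Tika provides AutoDetectParser, which picks a suitable Parser based on format-specific signatures in the binary file (such as magic numbers).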
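
To make the idea of detection by magic number concrete, here is a minimal JDK-only sketch that matches the leading bytes of a file against a few well-known signatures. It is not Tika's detection logic, which is considerably more elaborate; the class name MagicSniffSketch, the detect method, and the three-entry signature table are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class MagicSniffSketch {

    // Small illustrative table: MIME type -> leading byte signature.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();

    static {
        SIGNATURES.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("image/png",
                new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A});
        SIGNATURES.put("application/zip", new byte[] {'P', 'K', 0x03, 0x04});
    }

    /** Returns the first MIME type whose signature prefixes the data, else a generic fallback. */
    static String detect(byte[] head) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] unknown = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHead));
        System.out.println(detect(unknown));
    }
}
```

A dispatcher in this style would then look up a parser by the detected type, which is roughly the role the post attributes to AutoDetectParser.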