News: Apache Nutch 1.13 Released (Web Crawler)

Posted by 漂亮的石头 on 2017-04-03. Forum: Software News

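    The Apache Nutch Project Management Committee has announced the release of Apache Nutch 1.13. All current users and developers of the 1.x series are advised to upgrade to this version.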

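    Nutch is a mature, production-ready web crawler. Nutch 1.x supports fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.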

    Changes:

    Sub-task


    • [NUTCH-2246] - Refactor /seed endpoint for backward compatibility

    Bug


    • [NUTCH-1553] - Property 'indexer.delete.robots.noindex' not working when using parser-html.


    • [NUTCH-2242] - lastModified not always set


    • [NUTCH-2291] - Fix mrunit dependencies


    • [NUTCH-2337] - urlnormalizer-basic to strip empty port


    • [NUTCH-2345] - FetchItemQueue logs are logged with wrong class name


    • [NUTCH-2349] - urlnormalizer-basic NPE for ill-formed URL "http:/"


    • [NUTCH-2357] - Index metadata throw Exception because writable object cannot be cast to Text


    • [NUTCH-2359] - Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed


    • [NUTCH-2364] - http.agent.rotate: IllegalArgumentException / last element of agent names ignored


    • [NUTCH-2366] - Deprecated Job constructor in hostdb/ReadHostDb.java

    Improvement


    • [NUTCH-1308] - Add main() to ZipParser


    • [NUTCH-2164] - Inconsistent 'Modified Time' in crawl db


    • [NUTCH-2234] - Upgrade to elasticsearch 2.3.3


    • [NUTCH-2236] - Upgrade to Hadoop 2.7.2


    • [NUTCH-2262] - Utilize parameterized logging notation across Fetcher


    • [NUTCH-2272] - Index checker server to optionally keep client connection open


    • [NUTCH-2286] - CrawlDbReader -stats to show fetch time and interval


    • [NUTCH-2287] - Indexer-elastic plugin should use Elasticsearch BulkProcessor and BackoffPolicy


    • [NUTCH-2299] - Remove obsolete properties protocol.plugin.check.*


    • [NUTCH-2300] - Fetcher to optionally save robots.txt


    • [NUTCH-2327] - Seeds injected in REST workflow must be ingested into HDFS


    • [NUTCH-2329] - Update Slf4j logging for Java 8 and upgrade miredot plugin version


    • [NUTCH-2336] - SegmentReader to implement Tool


    • [NUTCH-2352] - Log with Generic Class Name at Nutch 1.x


    • [NUTCH-2355] - Protocol plugins to set cookie if Cookie metadata field is present


    • [NUTCH-2367] - Get single record from HostDB

    New Feature


    • [NUTCH-2132] - Publisher/Subscriber model for Nutch to emit events

    Task


    Download:

    http://nutch.apache.org/downloads.html