
News: Apache Spark 2.0.0 released, with API updates

Discussion in 'Software News' started by 漂亮的石头, 2016-07-28.

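    Apache Spark 2.0.0 has been released. Apache Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in ways that make Spark perform better on certain workloads. In particular, Spark provides in-memory distributed datasets, which support interactive queries and also speed up iterative workloads.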

    This release mainly updates the APIs, adds support for SQL 2003 and R UDFs, and improves performance. About 300 developers contributed some 2,500 patches.

    The API updates in Apache Spark 2.0.0 are as follows:


    • Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface.


    • SparkSession: new entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility.


    • A new, streamlined configuration API for SparkSession


    • Simpler, more performant accumulator API


    • A new, improved Aggregator API for typed aggregation in Datasets

    The SQL updates in Apache Spark 2.0.0 are as follows:


    • A native SQL parser that supports both ANSI SQL and HiveQL


    • Native DDL command implementations


    • Subquery support, including


      • Uncorrelated Scalar Subqueries


      • Correlated Scalar Subqueries


      • NOT IN predicate Subqueries (in WHERE/HAVING clauses)


      • IN predicate subqueries (in WHERE/HAVING clauses)


      • (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)

    • View canonicalization support
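
The subquery forms listed above can be hard to tell apart by name alone. The sketch below shows one query of each shape. It is only an illustration: it runs on SQLite through Python's standard `sqlite3` module rather than on Spark, and the `dept` and `emp` tables are hypothetical. The SQL shapes are standard, but Spark SQL's exact behavior may differ in details.

```python
import sqlite3

# In-memory SQLite database, used only to illustrate the query shapes.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp (name TEXT, dept_id INTEGER, salary INTEGER);
    INSERT INTO dept VALUES (1, 'eng'), (2, 'ops'), (3, 'empty');
    INSERT INTO emp VALUES ('ann', 1, 120), ('bob', 1, 100),
                           ('cat', 2, 90), ('dan', 2, 70);
""")


def names(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# Uncorrelated scalar subquery: the inner query does not depend on the outer row.
above_avg = names("""
    SELECT name FROM emp
    WHERE salary > (SELECT AVG(salary) FROM emp)
    ORDER BY name
""")

# Correlated scalar subquery: the inner query refers to the outer row (e.dept_id).
dept_top = names("""
    SELECT name FROM emp AS e
    WHERE salary = (SELECT MAX(salary) FROM emp WHERE dept_id = e.dept_id)
    ORDER BY name
""")

# IN and NOT IN predicate subqueries in a WHERE clause.
used = names(
    "SELECT name FROM dept WHERE id IN (SELECT dept_id FROM emp) ORDER BY name"
)
unused = names(
    "SELECT name FROM dept WHERE id NOT IN (SELECT dept_id FROM emp) ORDER BY name"
)

# NOT EXISTS predicate subquery, correlated on d.id.
no_staff = names("""
    SELECT name FROM dept AS d
    WHERE NOT EXISTS (SELECT 1 FROM emp WHERE dept_id = d.id)
""")

print(above_avg, dept_top, used, unused, no_staff)
```

The difference between the correlated and uncorrelated forms is whether the inner query can be evaluated once for the whole statement or must logically be evaluated per outer row.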

    Some new features:


    • Native CSV data source, based on Databricks’ spark-csv module


    • Off-heap memory management for both caching and runtime execution


    • Hive style bucketing support


    • Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch.
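
To give a feel for what a "sketch" is in the last item, here is a minimal count-min sketch in plain Python. It is a toy written for illustration, not Spark's implementation: the class name, the default `width` and `depth`, and the SHA-256 based hashing are all assumptions chosen for readability.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: `depth` hash rows, each with `width` counters."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


words = ["spark"] * 50 + ["hadoop"] * 20 + ["hive"] * 5
sketch = CountMinSketch()
for word in words:
    sketch.add(word)

print(sketch.estimate("spark"), sketch.estimate("hadoop"), sketch.estimate("hive"))
```

The sketch uses a fixed amount of memory no matter how many distinct items it sees. The price is that estimates may be too high, though never too low.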

    Performance improvements:


    • Substantial (2-10x) performance speedups for common operators in SQL and DataFrames via a new technique called whole-stage code generation.


    • Improved Parquet scan throughput through vectorization


    • Improved ORC performance


    • Many improvements in the Catalyst query optimizer for common workloads


    • Improved window function performance via native implementations for all window functions


    • Automatic file coalescing for native data sources
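
The idea behind whole-stage code generation, mentioned in the first item above, is to collapse a chain of operators into a single function so that rows pass through one tight loop instead of one iterator per operator. The toy below contrasts the two styles in plain Python. It is a conceptual sketch only: the fused version is written by hand, every function name is hypothetical, and none of this is Spark's API.

```python
def scan(rows):
    """Leaf operator: produce rows one at a time."""
    for row in rows:
        yield row


def filter_op(child, predicate):
    """Pass through only the rows that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row


def project_op(child, fn):
    """Apply fn to every row from the child operator."""
    for row in child:
        yield fn(row)


def run_operator_at_a_time(rows):
    """One generator per operator; each row crosses every operator boundary."""
    plan = project_op(
        filter_op(scan(rows), lambda x: x % 2 == 0),
        lambda x: x * 10,
    )
    return sum(plan)


def run_fused(rows):
    """The same query fused by hand into a single loop."""
    total = 0
    for x in rows:
        if x % 2 == 0:
            total += x * 10
    return total


data = range(1000)
print(run_operator_at_a_time(data), run_fused(data))
```

Both functions compute the same result. The fused form avoids the per-row hand-offs between operators, which is the overhead that this kind of code generation aims to remove.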

    For more information about this release, see the release notes.

    Download: http://spark.apache.org/downloads.html