Top 10 Pointers in New Apache Spark 1.6 Release
To kick off 2016, the Apache Spark community has announced the release of Apache Spark 1.6, the seventh release on the 1.x line.
- Contributors – The number of contributors to Spark has crossed 1,000, doubling over the past year.
- Patches – The Apache Spark 1.6 release includes roughly 1,000 patches.
- Run SQL queries on files – This feature lets users and applications run SQL queries directly on files without first creating a table, similar to a capability in Apache Drill. For example: select id from json.`path/to/json/files` as j.
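A minimal sketch of this feature, assuming a Spark 1.6 `SQLContext` named `sqlContext` and JSON files at the hypothetical path `/data/events`:

```scala
// Query JSON files directly by path -- no CREATE TABLE needed.
// The data source name (json, parquet, ...) prefixes the backtick-quoted path.
val df = sqlContext.sql("SELECT id FROM json.`/data/events`")
df.show()

// The same pattern works for other formats, e.g. Parquet:
sqlContext.sql("SELECT COUNT(*) FROM parquet.`/data/events_parquet`").show()
```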
- Star (*) expansion for StructTypes – This feature makes it easier to nest and unnest arbitrary numbers of columns. It is common for users to do regular extractions of updated data from an external data source (e.g. MySQL or PostgreSQL). With some small improvements to the analyzer in this release, users can find the most recent record for each key with a group-by query, and then use star expansion to unnest the resulting struct back into top-level columns — in SQL as well as in the DataFrame equivalents.
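A sketch of the pattern, assuming a hypothetical table `events(key, time, a, b)` of update records:

```scala
// Step 1: pack each row into a struct and take the max per key.
// Structs compare field by field, so putting `time` first means
// max(struct(...)) picks the most recent record for each key.
val latest = sqlContext.sql(
  """SELECT key, max(struct(time, a, b)) AS latest
     FROM events
     GROUP BY key""")
latest.registerTempTable("latest_events")

// Step 2: star expansion unnests the struct back into flat columns.
sqlContext.sql("SELECT key, latest.* FROM latest_events").show()
```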
- Parquet Performance – Parquet has been one of the most commonly used data formats in Apache Spark, and Parquet scan performance has a significant impact on many large applications. Before this version, Spark depended on parquet-mr to read and decode Parquet files, and much of the time was spent in record assembly, the process that reconstructs records from Parquet columns. Spark 1.6 introduces a new Parquet reader that bypasses parquet-mr's record assembly and uses a more optimized code path for flat schemas; benchmarks show roughly a 50% improvement.
- Automatic Memory Management – Versions of Apache Spark before 1.6 split the available memory into two static regions: execution memory, used for sorting, hashing, and shuffling, and cache memory, used to cache hot data. Spark 1.6 introduces a new memory manager that automatically tunes the size of these regions: the runtime grows and shrinks each region according to the needs of the executing application. Many workloads that use operators such as joins and aggregations therefore benefit without any user optimization or tuning.
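The unified memory manager needs no tuning by default, but it exposes a few settings (shown below with their Spark 1.6 defaults), and `spark.memory.useLegacyMode` reverts to the pre-1.6 static split:

```
# spark-defaults.conf (Spark 1.6 defaults shown)
spark.memory.fraction         0.75   # share of heap for execution + storage
spark.memory.storageFraction  0.5    # part of the above immune to eviction
spark.memory.useLegacyMode    false  # true restores the old static regions
```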
- Streaming State Management – State management is a vital function in Spark streaming applications, often used to maintain aggregations or session information. Apache Spark 1.6 introduces a new mapWithState API that scales with the number of updates rather than the total number of records. mapWithState has an efficient implementation based on deltas, rather than always requiring full scans over the data, which yields large performance improvements.
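A minimal sketch of the new API: a running count per word, assuming `wordStream` is an existing `DStream[(String, Int)]` of (word, 1) pairs:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// Called once per incoming update; only this key's state is touched.
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)   // delta update, no full scan of all keys
  (word, sum)         // record emitted downstream
}

val counts = wordStream.mapWithState(StateSpec.function(mappingFunc))
```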
- Spark Datasets – Versions of Apache Spark before 1.6 lacked compile-time type safety in the DataFrame API. To solve this, the Spark 1.6 team introduced a typed extension of the DataFrame API called Datasets. The Dataset API extends the DataFrame API to support static typing and user functions that run directly on existing Scala or Java types. Compared with the traditional RDD API, Datasets provide better memory management and, in the long run, better performance.
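A short sketch of the typed API, assuming an existing `SQLContext` and a hypothetical JSON file of people:

```scala
import sqlContext.implicits._

case class Person(name: String, age: Long)

// Convert an untyped DataFrame into a typed Dataset; the schema is
// checked against the case class rather than discovered at runtime.
val ds = sqlContext.read.json("/data/people.json").as[Person]

// Lambdas operate on Person objects, not untyped Rows, so field
// access like p.age is verified at compile time.
val adultNames = ds.filter(p => p.age >= 21).map(p => p.name)
```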
- Machine Learning Pipeline Persistence – Many machine learning applications leverage Spark's ML Pipeline feature to construct learning pipelines. In the past, users had to implement custom persistence code to store a pipeline externally for use by other big data applications. In Spark 1.6, the Pipeline API offers functionality to save and reload pipelines from a previous state and to apply previously built models to new data later.
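A sketch of the save/load round trip, assuming `pipeline` is an existing `ml.Pipeline` whose stages support persistence in 1.6, `training` is a DataFrame, and the output path is hypothetical:

```scala
import org.apache.spark.ml.PipelineModel

// Fit and persist the whole fitted pipeline (stages plus parameters).
val model = pipeline.fit(training)
model.save("/models/my-pipeline")

// ...later, possibly in a different application:
val restored = PipelineModel.load("/models/my-pipeline")
val predictions = restored.transform(newData)
```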
- Addition of New Algorithms – The Apache Spark 1.6 release increases algorithm coverage in machine learning, adding univariate and bivariate statistics in DataFrames, survival analysis, the normal equation for least squares, bisecting k-means clustering, online hypothesis testing, latent Dirichlet allocation (LDA), R-like statistics, feature interactions in R formulas, instance weights, a LIBSVM data source, and support for non-standard JSON data.
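As one example from the list, a minimal sketch of the new bisecting k-means, assuming an existing `SparkContext` named `sc` and toy two-dimensional points:

```scala
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.1),
  Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 9.0)))

// Top-down divisive clustering: start with one cluster and
// repeatedly bisect until k clusters remain.
val model = new BisectingKMeans().setK(2).run(data)
model.clusterCenters.foreach(println)
```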
References – databricks.com, issues.apache.org, Big Data Analytics Community.
Interested? Please subscribe to our blogs at www.dataottam.com to keep yourself current on Big Data, Analytics, Apache Spark, and IoT.
And as always please feel free to suggest or comment.