The Top 15 Bullet Points of Apache Spark!
- Apache Spark is a fast, general-purpose engine for big data processing, with libraries for SQL, streaming, and advanced analytics.
- The RDD, which stands for Resilient Distributed Dataset, is Spark's central abstraction: an immutable, distributed collection of data.
- In Spark, all work is expressed as one of three things: creating new RDDs, transforming existing RDDs, or calling operations on RDDs (e.g. val lines = sc.textFile("data.txt") creates an RDD).
- An RDD is a fault-tolerant, distributed collection of objects, partitioned across the cluster; it is the core concept in Apache Spark.
- Transformations and actions are the two types of RDD operations.
- An RDD transformation (filter, map, union, distinct) yields another RDD and is lazily evaluated.
- An RDD action (count, reduce, first, take, collect) triggers computation and returns a result to the driver or writes it to a storage system.
- A key capability of Spark is persisting/caching an RDD in cluster memory.
- With RDDs, data is loaded in parallel into distributed collections.
- RDDs are objects that expose a rich set of methods.
- Special RDD types (pair RDDs and double RDDs) and shared variables (broadcast variables and accumulators) extend the core API.
- Spark Master – assigns cluster resources to applications
- Spark Worker – manages executors running on a machine
- Spark Executor – started by a worker; the workhorse of a Spark application
- RDDs can be created from various sources such as HDFS, Parquet, text files, parallelized collections, JSON, HBase, MongoDB, Cassandra, Hive, MySQL, Elasticsearch, and PostgreSQL.
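The create → transform → act workflow from the bullets above can be sketched in Scala. This is a minimal local-mode sketch, assuming Spark is on the classpath; the app name and sample data are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddBasics {
  def main(args: Array[String]): Unit = {
    // Local-mode configuration for experimentation; a real deployment
    // would point setMaster at a cluster manager instead.
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // 1. Creating a new RDD from a parallelized collection
    val nums = sc.parallelize(Seq(1, 2, 2, 3, 4, 5))

    // 2. Transformations are lazy: nothing executes yet
    val evens   = nums.filter(_ % 2 == 0)   // filter
    val squares = evens.map(n => n * n)     // map
    val unique  = squares.distinct()        // distinct

    // Persist the RDD in cluster memory so repeated actions reuse it
    unique.persist(StorageLevel.MEMORY_ONLY)

    // 3. Actions trigger computation and return results to the driver
    println(unique.count())                 // prints 2
    println(unique.collect().toList.sorted) // prints List(4, 16)

    sc.stop()
  }
}
```

Note that the two transformations build up a lineage but run nothing; only the actions at the end launch jobs, and the persist call keeps the intermediate result in memory between them.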
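The shared variables mentioned above can be sketched the same way. Again a local-mode sketch assuming Spark is on the classpath; the lookup table and country codes are made-up sample data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariables {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shared-vars").setMaster("local[*]"))

    // Broadcast variable: a read-only lookup table shipped once to each
    // executor instead of being re-sent with every task.
    val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    // Accumulator: tasks add to it on the executors; only the driver
    // reads its value back.
    val unknownCodes = sc.longAccumulator("unknown-codes")

    val codes = sc.parallelize(Seq("IN", "US", "XX"))
    val names = codes.map { code =>
      countryNames.value.getOrElse(code, { unknownCodes.add(1); "unknown" })
    }

    println(names.collect().toList) // resolved names, with "unknown" for XX
    println(unknownCodes.value)     // how many codes missed the table
    sc.stop()
  }
}
```

Because the accumulator is updated inside a transformation, its value is only reliable after the action (collect) has forced that transformation to run.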
To conclude, the goal of Apache Spark is to provide one engine for all data sources, workloads, and environments.