Self-Learn Yourself Apache Spark in 21 Blogs – #6
What are RDDs, Actions, and Transformations?
In Blog 6, we will look at the RDD and RDD inputs, with hands-on examples. Click to have a quick read of the other blogs in this learning series.
Hey, my dear friends. Before we dive deeper, let's have a look at who the Spark Core maintainers are – Matei, Reynold, Patrick, and Josh, the kernel of Spark Core. Now let's discuss the core abstraction of Apache Spark, the RDD. It supports two types of operations: transformations, which transform the data, and actions, which do something with that data.
Let's step back a little and understand how Spark works internally. Every Apache Spark application is managed through a central coordination point called the Driver, which coordinates the workers as they are tagged and configured. Within the Driver, every application starts with a SparkContext, and all Spark applications are built around this Driver/SparkContext pair. The most important responsibilities of the Driver/SparkContext are task creation, data locality, scheduling, and fault tolerance. Although it is possible to create multiple SparkContexts in the same process, as a best practice and rule of thumb we should not create more than one SparkContext per process. The SparkContext is conventionally represented as sc, making it simple and easy to use.
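To make this concrete, here is a minimal Scala sketch of creating that single SparkContext; the application name and the local[*] master URL are placeholder assumptions for illustration, not values from this series:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch: configure and create the one SparkContext per application.
val conf = new SparkConf()
  .setAppName("MySparkApp") // hypothetical application name
  .setMaster("local[*]")    // assumption: run locally on all cores; use your cluster URL instead

val sc = new SparkContext(conf) // conventionally named sc
```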
Please find below the Spark word count program, written as a standalone application instead of the interactive command-line approach.
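A minimal sketch of such a word count application in Scala; the object name and the HDFS input/output paths are hypothetical, so adjust them to your environment:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A sketch of the classic Spark word count as a standalone application.
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    sc.textFile("hdfs:///input/words.txt")          // hypothetical input path
      .flatMap(line => line.split("\\s+"))          // break each line into words
      .map(word => (word, 1))                       // pair each word with a count of 1
      .reduceByKey(_ + _)                           // sum the counts per word
      .saveAsTextFile("hdfs:///output/word-counts") // hypothetical output path

    sc.stop()
  }
}
```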
Once our jar is built, we can submit the application as below:
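For example, a submission along these lines, assuming the WordCount class and a jar name matching the sketch above (swap --master for your cluster's URL):

```
spark-submit \
  --class WordCount \
  --master local[*] \
  wordcount.jar
```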
Now let's look into RDDs. Per the Apache Spark documentation, an RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDD expands to Resilient Distributed Dataset. RDDs are built with failure in mind: if the computation on one node fails, another node can recompute and return the results, so failures are handled gracefully in Apache Spark. This is also why most functions in Spark are lazy. The instructions are stored as a DAG (Directed Acyclic Graph) for later use, and this DAG keeps growing as we apply transformations such as map and filter, all of which are lazily evaluated. Lazy is fantastic, but we also need actions such as collect, count, and reduce; these trigger execution of the DAG and produce a value from the data, which is either returned to the driver program or saved to persistent storage. And let's introduce one more dimension of the RDD: it is immutable, meaning once it is created we can no longer change it. But thanks to lineage, we can always trace back to the inception of the data.
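A small sketch that shows this laziness in practice, reusing the sc from above; the numbers are purely illustrative:

```scala
// Transformations only extend the DAG; actions force it to run.
val numbers = sc.parallelize(1 to 10)    // RDD built from a local collection

val evens   = numbers.filter(_ % 2 == 0) // transformation: lazy, nothing computed yet
val doubled = evens.map(_ * 2)           // transformation: still nothing computed

val total   = doubled.reduce(_ + _)      // action: executes the DAG, returns 60 to the driver
val howMany = doubled.count()            // action: executes the DAG again, returns 5
```

Nothing runs at the filter and map lines; only reduce and count make Spark execute the DAG and hand a value back to the driver.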
In Blog 7 – Let's look at loading RDDs, transformations, and more.
If you see something here that interests you, we’d love to have you involved.
Please subscribe at www.dataottam.com to stay current and for future reads on Big Data, Analytics, and IoT.
And as always, please feel free to comment via firstname.lastname@example.org to help make this series the best it can be.