Relationship between MapReduce, Spark, YARN, and HDFS!
In the Big Data era, Hadoop is the de facto standard for developing big data applications using the MapReduce framework. A Hadoop cluster is composed of one or more master nodes and any number of slave nodes, depending on the volume of data to be processed. Hadoop simplifies distributed applications by treating the data center as the computer, and by providing map() and reduce() functions that let application developers and programmers harness those data centers. Hadoop implements the MapReduce paradigm efficiently and is quite simple to learn; it is a powerful tool for processing large amounts of data in the range of terabytes and petabytes.
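To make those two functions concrete, here is a minimal, hedged sketch of map() and reduce() for the classic word count problem, written against Hadoop's Java MapReduce API; the class names and whitespace tokenization are illustrative choices, not requirements.

```
// Word count mapper and reducer: a minimal sketch against Hadoop's
// MapReduce Java API. Class names here are illustrative.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountFunctions {

    // map(): emit (word, 1) for every token in an input line
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // reduce(): sum all the 1s emitted for each word
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

A driver class would wire these into a Job, along with the input and output paths, before submitting it to the cluster.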
Apache Hadoop is a well-known, de facto framework for processing very large data sets through parallel and distributed computing. YARN (Yet Another Resource Negotiator) allowed Hadoop to evolve from a simple MapReduce engine into a big data ecosystem that can run heterogeneous (MapReduce and non-MapReduce) applications concurrently; in other words, YARN takes Hadoop to the next level.
Both the Hadoop and Spark frameworks are open source and enable us to perform huge volumes of computation and data processing in distributed environments.
These frameworks implement efficient scale-out methods to address the scalability problem, and they can be set up to run intensive computations in the MapReduce paradigm on thousands of servers. Spark's API is a higher-level abstraction than Hadoop's API; for this reason, we are often able to express a Spark solution in a single Java driver class.
Hadoop and Spark are two different distributed software frameworks. Hadoop is a MapReduce framework on which we run jobs by supplying functions such as map(), combine(), and reduce(). The MapReduce paradigm works well for one-pass computation, where a map() phase is followed by a reduce() phase, but it is inefficient for algorithms that require multiple passes over the data. Note that Apache Spark is not a MapReduce framework, but its API can easily express a MapReduce framework's functionality, since it provides proper handles for map() and reduce() functions. The best part of Spark's design is that it is not tied to a map phase followed by a reduce phase: a Spark job can be an arbitrary DAG (directed acyclic graph) of map, reduce, and shuffle phases. Spark programs may run with or without Hadoop, and Spark may use HDFS or other persistent storage such as Amazon S3 or the Cassandra File System (CFS).

In a nutshell, for a given Spark program or job, the Spark engine creates a DAG of task stages to be performed on the cluster, while MapReduce creates a DAG with only two predefined stages, map and reduce. DAGs created by Spark can contain any number of stages, which allows most Spark jobs to complete faster than they would in MapReduce: simple jobs finish after just one stage, and more complex jobs finish in a single run of many stages, rather than having to be split into multiple jobs. As mentioned, Spark's API is a higher-level abstraction than MapReduce's; for example, a few lines of code in Spark might be equivalent to 30-40 lines of code in MapReduce, as a comparison of the MapReduce word count sketch above with the Spark version below suggests.
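For comparison with the MapReduce sketch above, here is the same word count expressed in Spark's Java API; this is a minimal sketch, and the HDFS input and output paths are placeholders.

```
// Word count in Spark's Java API: flatMap/mapToPair play the role of
// map(), and reduceByKey plays the role of reduce(). Paths are placeholders.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///input/books");
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("hdfs:///output/wordcount");
        sc.stop();
    }
}
```

Note how the whole solution fits in a single driver class, while the equivalent MapReduce program needs separate mapper, reducer, and driver code.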
Even though frameworks such as Hadoop and Spark are built on a shared-nothing paradigm, they do support sharing immutable data structures among all cluster nodes. In Hadoop, we may pass such values to mappers and reducers via Hadoop's Configuration object; in Spark, we may share data structures among mappers and reducers by using Broadcast objects. In addition to read-only Broadcast objects, Spark supports write-only accumulators.
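Here is a minimal sketch of both shared-variable types in Spark's Java API; the stop-word list, application name, and sample data are illustrative assumptions.

```
// Broadcast (read-only, shared by all tasks) and LongAccumulator
// (write-only from the workers' point of view, read by the driver).
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;

public class SharedVariables {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SharedVariables");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Broadcast: ship an immutable stop-word set to every node once
        Set<String> stop = new HashSet<>(Arrays.asList("a", "an", "the"));
        Broadcast<Set<String>> stopWords = sc.broadcast(stop);

        // Accumulator: tasks may only add to it; the driver reads it
        LongAccumulator dropped = sc.sc().longAccumulator("droppedWords");

        JavaRDD<String> words =
            sc.parallelize(Arrays.asList("the", "quick", "brown", "fox"));
        JavaRDD<String> kept = words.filter(w -> {
            if (stopWords.value().contains(w)) {
                dropped.add(1);
                return false;
            }
            return true;
        });

        System.out.println("kept = " + kept.collect());
        System.out.println("dropped = " + dropped.value());
        sc.stop();
    }
}
```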
To summarize, the benefits that both Hadoop and Spark bring to big data processing are reliability, scalability, distributed processing, and parallelism.
Both Hadoop and Spark provide more than map() and reduce() functionality: both offer a plug-in model for custom record reading, secondary data sorting, and much more, as the partitioner sketch below illustrates.
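As one hedged illustration of that plug-in model, the sketch below shows a custom Hadoop Partitioner for secondary sorting; it assumes the map output key is a composite "naturalKey:sortKey" Text value, a convention chosen for this example rather than anything Hadoop requires (a complete secondary sort also needs matching sort and grouping comparators).

```
// A custom Partitioner, plugged in via job.setPartitionerClass(...):
// it partitions on the natural key only, so every sort key for the
// same natural key reaches the same reducer (the first step of
// secondary sorting). The "naturalKey:sortKey" layout is illustrative.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class NaturalKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String naturalKey = key.toString().split(":", 2)[0];
        // Mask the sign bit so the partition index is never negative
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```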
A 100-foot view of the relationship between MapReduce, Spark, YARN, and HDFS is depicted in the picture below. This relationship shows that there are many ways to run MapReduce and Spark with HDFS (and with non-HDFS filesystems). Here, MapReduce refers to the general MapReduce framework paradigm, while Spark refers to a specific implementation that may use HDFS as persistent storage for its compute engine. Spark can run without Hadoop in standalone cluster mode, using HDFS, NFS, or any other persistent data store, and it can run with Hadoop on top of Hadoop's YARN or MapReduce framework, as the sketch below shows.
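To make that storage flexibility concrete, here is a minimal sketch showing that the same Spark code can read from HDFS, a local or NFS-mounted filesystem, or Amazon S3 simply by changing the URI scheme; all paths are placeholders, and the S3 example assumes the s3a connector and credentials are configured.

```
// The same textFile() call works across storage systems; only the
// URI scheme changes. All paths below are placeholders.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class StorageAgnostic {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("StorageAgnostic");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // HDFS: typical when Spark runs on YARN next to Hadoop
        JavaRDD<String> fromHdfs = sc.textFile("hdfs:///data/logs");
        // Local or NFS-mounted path: common in standalone cluster mode
        JavaRDD<String> fromNfs = sc.textFile("file:///mnt/nfs/data/logs");
        // Amazon S3 via the s3a connector (assumes hadoop-aws on the classpath)
        JavaRDD<String> fromS3 = sc.textFile("s3a://my-bucket/data/logs");

        long total = fromHdfs.count() + fromNfs.count() + fromS3.count();
        System.out.println("total lines = " + total);
        sc.stop();
    }
}
```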
Finally, let's quickly list the major applications that can be built using MapReduce and Spark: query log processing; crawling, indexing, and search; analytics, text processing, and sentiment analysis; machine learning; recommendation systems; document clustering and classification; and bioinformatics and genome analysis.
Reference: Data Algorithms by Mahmoud Parsian, and the big data community.
Interesting? Please subscribe to our blogs at www.dataottam.com to stay current on Big Data, Analytics, and IoT.
And as always, please feel free to send suggestions or comments to firstname.lastname@example.org.