Top 11 Apache Hadoop YARN Frameworks
Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize modern data architecture.
YARN is the new data operating system for Hadoop.
One of its most exciting aspects is its ability to support multiple programming models and application frameworks. In Hadoop version 1, the only processing model available to users was MapReduce. In Hadoop version 2, MapReduce is separated from the resource management layer of Hadoop and placed into its own application framework. YARN forms a resource management platform that provides services such as scheduling, fault monitoring, and data locality to frameworks such as Distributed-Shell, Hadoop MapReduce, Apache Tez, Apache Giraph, Hoya, Dryad on YARN, Apache Spark, Apache Storm, REEF, Hamster, and Apache Flink.
1. Distributed-Shell:
Distributed-Shell is a simple mechanism for running shell commands and scripts in containers, in parallel, on multiple nodes of a Hadoop YARN cluster. Administrators already use several existing distributed shell implementations to manage clusters of machines; this application demonstrates how such a utility can be implemented on top of YARN. The main classes in the org.apache.hadoop.yarn.applications.distributedshell package are Client, ApplicationMaster, and DSConstants.
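To make the idea of running one command in many containers at once more concrete, here is a minimal local sketch in Python. It is only an analogy and does not use any YARN API: child processes on a single machine stand in for containers, and the names run_in_container and distributed_shell are hypothetical helpers invented for this illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_in_container(container_id, command):
    """Run one command in a child process that stands in for a YARN container."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return container_id, result.returncode, result.stdout.strip()


def distributed_shell(command, num_containers):
    """Launch the same command once per simulated container, in parallel."""
    with ThreadPoolExecutor(max_workers=num_containers) as pool:
        futures = [
            pool.submit(run_in_container, i, command)
            for i in range(num_containers)
        ]
        # Results are returned in container order, not completion order.
        return [future.result() for future in futures]


if __name__ == "__main__":
    # The current Python interpreter is used as a portable stand-in
    # for an arbitrary shell command.
    cmd = [sys.executable, "-c", "print('hello from container')"]
    for container_id, code, output in distributed_shell(cmd, num_containers=3):
        print(container_id, code, output)
```

In the real Distributed-Shell, the Client submits the application and the ApplicationMaster requests the containers from YARN; the sketch collapses both roles into one function call.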
2. Apache Tez:
Apache Tez is a great example of a new framework that exploits the power of YARN. Many Hadoop jobs execute a complex directed acyclic graph (DAG) of tasks as a series of separate MapReduce stages. Apache Tez generalizes this process and allows tasks spread across those stages to run as a single, all-encompassing job. Tez can be used as a MapReduce replacement for projects such as Apache Hive and Apache Pig. It provides them with a more natural model for their execution plans, together with faster response times and extreme throughput at petabyte scale.
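The idea of running a whole DAG of tasks as one job can be sketched in a few lines of Python. This is a toy, single-process illustration and not the Tez API: run_dag, TASKS, and DEPS are hypothetical names, and the scheduling is a plain topological traversal where each task receives the results of its upstream tasks directly.

```python
from collections import deque


def run_dag(tasks, deps):
    """Run every task once, in an order that respects its dependencies.

    tasks: dict mapping task name -> function taking a dict of upstream results
    deps:  dict mapping task name -> list of upstream task names
    """
    remaining = {name: len(deps.get(name, [])) for name in tasks}
    downstream = {name: [] for name in tasks}
    for name, upstream in deps.items():
        for up in upstream:
            downstream[up].append(name)

    # Tasks with no unmet dependencies are ready to run.
    ready = deque(sorted(n for n, count in remaining.items() if count == 0))
    results, order = {}, []
    while ready:
        name = ready.popleft()
        inputs = {up: results[up] for up in deps.get(name, [])}
        results[name] = tasks[name](inputs)
        order.append(name)
        for child in downstream[name]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    if len(order) != len(tasks):
        raise ValueError("dependency graph contains a cycle")
    return order, results


# A small query plan: two scans feed a join, which feeds an aggregation.
TASKS = {
    "scan_orders": lambda _: [("a", 2), ("b", 5), ("a", 1)],
    "scan_users": lambda _: {"a": "Ann", "b": "Bob"},
    "join": lambda inp: [(inp["scan_users"][k], v) for k, v in inp["scan_orders"]],
    "aggregate": lambda inp: {
        name: sum(v for n, v in inp["join"] if n == name) for name, _ in inp["join"]
    },
}
DEPS = {"join": ["scan_orders", "scan_users"], "aggregate": ["join"]}

if __name__ == "__main__":
    order, results = run_dag(TASKS, DEPS)
    print(order)
    print(results["aggregate"])
```

Expressed as chained MapReduce jobs, the join and the aggregation would each be a separate job with its intermediate output written to HDFS in between; in the DAG form the intermediate result is simply handed to the next task.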
3. Apache Giraph:
Apache Giraph is an iterative graph processing system built for high scalability. It is an open-source implementation based on Google's Pregel, which Google uses to calculate PageRank. Apache Giraph is used by Facebook, Twitter, and LinkedIn to create social graphs of their clients and users. Both Giraph and Pregel are based on the Bulk Synchronous Parallel (BSP) model of distributed computation, and Giraph adds several features beyond the basic Pregel model, including master computation, shared aggregators, edge-oriented input, and out-of-core computation.
Giraph was originally written to run on standard Hadoop version 1, but the native Giraph implementation under YARN provides the user with an iterative processing model. Support for YARN has been present in Giraph since its 1.0 release. Giraph's YARN-related abstraction is easy to extend or to use as a template for new projects. Giraph takes advantage of the ApplicationMaster to perform more natural job control, which includes the ability to spawn and retire tasks as part of each Bulk Synchronous Parallel step.
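The Bulk Synchronous Parallel model is easier to picture with a small example. The following Python sketch is a single-process illustration of Pregel-style supersteps, not Giraph's Java API; the function name pregel_max is hypothetical. It uses the classic maximum-value propagation example: in each superstep, vertices read the messages sent in the previous superstep, update their value, and send messages only if the value changed.

```python
def pregel_max(graph, values):
    """Propagate the maximum value through a graph using BSP supersteps.

    graph:  dict mapping vertex -> list of neighbor vertices (outgoing edges);
            every neighbor must itself be a key of the dict
    values: dict mapping vertex -> initial numeric value
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[v])
    supersteps = 1

    # Later supersteps: only vertices that received messages do any work.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if not messages:
                continue  # no incoming messages: the vertex stays halted
            best = max(messages)
            if best > values[v]:
                values[v] = best
                for n in graph[v]:
                    outbox[n].append(best)
        # Barrier: messages sent now are delivered in the next superstep.
        inbox = outbox
        supersteps += 1
    return values, supersteps


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    final_values, steps = pregel_max(chain, {"a": 3, "b": 6, "c": 2})
    print(final_values, steps)
```

The computation ends when a superstep produces no messages, which mirrors the "vote to halt" convention of the Pregel model.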
4. Hoya: HBase on YARN
Hoya (HBase on YARN) is a project that creates dynamic and elastic Apache HBase clusters on top of YARN. For a Hoya application, YARN copies all files listed in the client's application-launch request from HDFS into the local file system of the chosen server, and then executes the command to start the ApplicationMaster. When the Hoya ApplicationMaster starts, it starts an HBase Master on the local machine, which is the sole HBase Master that Hoya currently manages. In parallel with the Master start-up, Hoya asks YARN for a number of containers matching the number of HBase region servers it needs. For each of these containers, Hoya provides the commands to start the region server; it does not run any Hoya-specific code on the worker nodes. The Hoya ApplicationMaster points YARN at the files that need to be on the worker nodes and at the necessary commands, and YARN then does the rest of the work.
5. Dryad on YARN:
Similar to Apache Tez, Microsoft's Dryad provides a DAG as the abstraction of execution flow. It has been ported to run natively on YARN, and Dryad on YARN is fully compatible with its non-YARN version. The ported code for the worker nodes is written entirely in native C++ and C#. The ApplicationMaster uses a thin layer of Java to interface with the ResourceManager so that the native Dryad graph manager can schedule work. Eventually, the Java layer will be replaced by direct interaction with the protocol-buffer interfaces. As an aside, this project also demonstrates that YARN lets developers write applications in the programming language of their choice.
6. Apache Spark:
Spark was initially developed for applications where keeping data in memory helps performance, such as iterative algorithms, which are common in machine learning, and interactive data mining. Spark is often compared to MapReduce because it provides parallel processing over HDFS and other Hadoop input sources. Spark differs from MapReduce in two important ways. First, Spark holds intermediate results in memory rather than writing them to disk, an approach that drastically decreases query response times. Second, Spark supports more than just MapReduce functions, greatly expanding the set of possible analyses that can be executed over HDFS data stores: it offers a general execution model that can optimize arbitrary operator graphs. It also provides clean, concise APIs in Scala, Java, and Python, and users can work with Spark interactively from the Scala and Python shells to rapidly query big data sets.
The advantage of porting and running Spark on top of YARN is the common resource management and a single underlying data fabric. Spark users can continue to use the same data for building models and share the same physical resources with other Hadoop frameworks.
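Why holding intermediate results in memory matters for iterative algorithms can be shown with a toy model in plain Python. This is not Spark's API: LazyDataset and iterate_mean are hypothetical names invented for this sketch. The dataset is recomputed on every scan unless it has been marked as cached, and a counter records how often that recomputation happens.

```python
class LazyDataset:
    """A tiny lazily evaluated dataset with optional in-memory caching."""

    def __init__(self, compute):
        self._compute = compute   # function that (re)builds the data
        self._cached = None
        self._use_cache = False
        self.computations = 0     # how many times the data was rebuilt

    def cache(self):
        """Ask for the data to be kept in memory after its first computation."""
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = list(self._compute())
        if self._use_cache:
            self._cached = data
        return data

    def map(self, fn):
        """Return a new dataset that applies fn to each element on demand."""
        return LazyDataset(lambda: (fn(x) for x in self.collect()))


def iterate_mean(dataset, iterations=5, learning_rate=0.5):
    """Estimate the mean by gradient steps; each step scans the dataset once."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.collect()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= learning_rate * gradient
    return estimate


if __name__ == "__main__":
    raw = LazyDataset(lambda: ["1", "2", "3"])  # stands in for reading from HDFS
    uncached = raw.map(float)
    cached = raw.map(float).cache()
    iterate_mean(uncached)
    iterate_mean(cached)
    print(uncached.computations, cached.computations)
```

With five iterations, the uncached dataset is rebuilt five times while the cached one is built once, which is the effect the first of the two differences above describes.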
7. Apache Storm:
Apache Storm allows processing of unbounded streams of data in real time. It is designed to be used with any programming language. The basic Storm use cases are real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL. Storm is fast, scalable, and fault tolerant, and it gives processing guarantees. Traditional MapReduce jobs are expected to eventually finish, but Storm continuously processes messages until it is stopped. This behavior makes it ideal for a YARN cluster. A Storm cluster has two kinds of nodes, the master node and the worker nodes, and this arrangement can be fully implemented with an ApplicationMaster. The master node runs a daemon called "Nimbus" that is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. Each worker node runs a daemon called the "Supervisor," which listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Efforts are under way to run Storm directly under YARN and take advantage of the common resource management substrate.
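The contrast between a job that finishes and a computation that keeps running until it is stopped can be sketched with Python generators. This is a single-process illustration only and does not use Storm's API; sentence_stream, split_words, and running_counts are hypothetical names, and the sample sentences are made up.

```python
import itertools
from collections import Counter


def sentence_stream():
    """An unbounded source: cycles through sample sentences forever."""
    samples = [
        "yarn runs storm",
        "storm processes streams",
        "yarn schedules containers",
    ]
    yield from itertools.cycle(samples)


def split_words(sentences):
    """Turn each incoming sentence into a stream of words."""
    for sentence in sentences:
        yield from sentence.split()


def running_counts(words):
    """Emit an updated (word, count) pair for every incoming word."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


if __name__ == "__main__":
    pipeline = running_counts(split_words(sentence_stream()))
    # The pipeline never ends on its own, so the consumer decides when to stop.
    for word, count in itertools.islice(pipeline, 9):
        print(word, count)
```

Each stage handles one message at a time and emits results immediately, so updated counts are available while the stream is still flowing rather than only after a final reduce step.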
8. REEF – Retainable Evaluator Execution Framework:
YARN’s flexibility sometimes requires significant effort on the part of application implementers. Writing a custom application on YARN includes building one’s own ApplicationMaster, performing client and container management, and handling aspects of fault tolerance, execution flow, coordination, and other concerns. The REEF project by Microsoft recognizes this challenge and factors out several components that are common to many applications, such as storage management, data caching, fault detection, and checkpoints. Framework designers can build on top of REEF more easily than they can build directly on YARN, and can reuse these common services/libraries. REEF’s design makes it suitable for both MapReduce and DAG-like executions as well as iterative and interactive computations.
9. Hamster: Hadoop and MPI on the Same Cluster:
The Message Passing Interface (MPI) is widely used in high-performance computing. MPI is primarily a set of optimized message-passing library calls for C, C++, and Fortran that operate over popular server interconnects such as Ethernet and InfiniBand. Because users have full control of their YARN containers, there is no reason why MPI applications cannot run within a Hadoop cluster. The Hamster effort is a work in progress that provides a good discussion of the issues involved in mapping MPI to a YARN cluster.
10. Apache Flink:
Apache Hadoop YARN allows various distributed applications to run on top of a cluster, and Flink runs on YARN next to other applications. Users do not have to set up or install anything if a YARN setup already exists. A session starts all required Flink services (JobManager and TaskManagers) so that users can submit programs to the cluster. When starting a new Flink YARN session, the client first checks whether the requested resources (containers and memory) are available. After that, it uploads a jar that contains Flink and the configuration to HDFS. The client's next step is to request a YARN container to start the ApplicationMaster.
11. Hadoop MapReduce:
As mentioned earlier, MapReduce was the first YARN framework and drove many of YARN's requirements. One important aspect of the YARN design is the increased "user agility" in choosing different versions of MapReduce to use on a cluster. Indeed, with YARN it is possible to have production jobs using a stable MapReduce version, even as test versions of MapReduce are running concurrently.
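For readers who have not seen the MapReduce processing model itself, here is a minimal in-memory sketch of its three phases using the usual word-count example. It runs in a single Python process and does not use the Hadoop API; the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["YARN runs MapReduce", "MapReduce runs on YARN"]))
```

On a real cluster the map and reduce calls run in separate containers and the shuffle moves data between nodes; under YARN those containers are requested by the MapReduce ApplicationMaster.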
Reference – Apache Hadoop YARN, Arun C Murthy, Vinod Kumar and Big Data Community.