Big Data Stack 2.0 and Beyond!
The Google File System (GFS), MapReduce, and Bigtable sparked the Big Data revolution at Google and across the data industry, and together they form Big Data Stack 1.0. Doug Cutting integrated the concepts from these published papers into a tool called Hadoop.
GFS + MapReduce + Bigtable → HDFS + MapReduce + HBase, which together are called Hadoop.
Now let's slice and dice what Big Data Stack 2.0 and beyond looks like. GFS, MapReduce, and Bigtable made it possible for Google to scale out its infrastructure, but over the years the following problems emerged:
- MapReduce is hard to program, and recovering from failures is time consuming, because a failed job often has to be restarted as a whole rather than just its failed parts.
- The Google File System (GFS) is a great distributed storage system, but its metadata is served by a single master, which is a single point of failure (SPOF).
Hence the systems below make up Big Data Stack 2.0, which is the ruling backbone of Google's Big Data:
- Colossus – Colossus is the successor to the Google File System (GFS); sometimes it's called GFS II. BigQuery relies on Colossus, Google's latest-generation distributed file system. Each Google datacenter has its own Colossus cluster, and each Colossus cluster has enough disks to give every BigQuery user thousands of dedicated disks at a time. Colossus also handles replication, recovery (when disks crash), and distributed management (so there is no single point of failure). Colossus is fast enough to allow BigQuery to provide performance similar to many in-memory databases, while leveraging much cheaper yet highly parallelized, scalable, durable, and performant infrastructure. BigQuery leverages the ColumnIO columnar storage format and compression algorithm to store data in Colossus in the most optimal way for reading large amounts of structured data. Colossus allows BigQuery users to scale to dozens of petabytes of storage seamlessly, without paying the penalty of attaching the much more expensive compute resources typical of most traditional databases.
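To see why a columnar format like ColumnIO pays off for analytical reads, here is a toy Python sketch (not ColumnIO itself) contrasting row-oriented and column-oriented layouts of the same table; the field names are made up for illustration:

```python
# Toy illustration: row layout vs. column layout for the same records.
# An aggregation that touches one field reads far less data in the
# columnar layout, since each column is stored contiguously.

rows = [
    {"user": "a", "country": "IN", "bytes": 120},
    {"user": "b", "country": "US", "bytes": 300},
    {"user": "c", "country": "IN", "bytes": 80},
]

# Row layout: every field of every record sits together on disk.
row_store = rows

# Column layout: each column stored (and compressed) contiguously.
col_store = {
    "user": [r["user"] for r in rows],
    "country": [r["country"] for r in rows],
    "bytes": [r["bytes"] for r in rows],
}

# SELECT SUM(bytes): the columnar scan touches only the "bytes" column,
# skipping "user" and "country" entirely.
total = sum(col_store["bytes"])
print(total)  # 500
```

Contiguous columns also compress much better than mixed-type rows, which is part of why columnar formats dominate analytical storage.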
- Megastore – Megastore is a storage system developed to meet the requirements of today's interactive online services. Megastore blends the scalability of a NoSQL data store with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability. It provides fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows Megastore to synchronously replicate each write across a wide area network with reasonable latency and to support seamless failover between datacenters. Megastore leverages the Paxos replication algorithm.
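The key property Paxos gives Megastore is that a write commits only after a majority of replicas have durably accepted it, so any single datacenter can fail without losing committed data. The Python sketch below illustrates just that majority-quorum idea; it is a toy, not a real Paxos implementation, and the class names are invented for illustration:

```python
# Minimal sketch of synchronous majority replication (the guarantee
# Megastore gets from Paxos): commit a write only if a majority of
# replicas acknowledge it. Toy quorum check, NOT real Paxos.

class Replica:
    def __init__(self, alive=True):
        self.alive = alive
        self.log = []          # durable write log for this replica

    def accept(self, value):
        if self.alive:
            self.log.append(value)
        return self.alive      # ack only if the replica is reachable

def replicate_write(replicas, value):
    """Send value to every replica; commit iff a majority acknowledges."""
    acks = sum(1 for r in replicas if r.accept(value))
    return acks > len(replicas) // 2   # strict majority quorum

# Three "datacenters", one of them down: the write still commits (2 of 3).
replicas = [Replica(), Replica(), Replica(alive=False)]
print(replicate_write(replicas, "row-update"))  # True
```

With a majority quorum, any two quorums overlap in at least one replica, which is what lets a new leader recover the latest committed writes after a failover.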
- Spanner – Cloud Spanner is the first and only relational database service that is both strongly consistent and horizontally scalable. With Cloud Spanner you enjoy all the traditional benefits of a relational database: ACID transactions, relational schemas (and schema changes without downtime), SQL queries, high performance, and high availability. But unlike any other relational database service, Cloud Spanner scales horizontally to hundreds or thousands of servers, so it can handle the highest of transactional workloads. With automatic scaling, synchronous data replication, and node redundancy, Cloud Spanner delivers up to 99.999% (five 9s) availability for your mission-critical applications. In fact, Google's internal Spanner service has been handling millions of queries per second from many Google services for years.
- FlumeJava – MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. FlumeJava is a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution-plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
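FlumeJava's central trick, deferred evaluation that builds a dataflow plan and fuses stages before running them, can be sketched in a few lines of Python. The class and method names below are illustrative only, not FlumeJava's actual API:

```python
# Hedged sketch of FlumeJava's deferred-evaluation idea: operations on a
# parallel collection only extend an execution plan; run() "optimizes"
# the plan by fusing the chain of per-element functions into one pass,
# much as FlumeJava fuses ParallelDo stages into a single MapReduce.

class PCollection:
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = list(ops)      # deferred execution plan

    def parallel_do(self, fn):
        # No work happens here; we just record another stage.
        return PCollection(self._data, self._ops + [fn])

    def run(self):
        # "Optimizer": fuse all recorded stages into one function,
        # then make a single pass over the data.
        def fused(x):
            for op in self._ops:
                x = op(x)
            return x
        return [fused(x) for x in self._data]

words = PCollection(["big", "data", "stack"])
plan = words.parallel_do(str.upper).parallel_do(len)
print(plan.run())  # [3, 4, 5]
```

Because nothing executes until `run()`, the library sees the whole pipeline at once and can avoid materializing intermediate collections, which is where hand-optimized pipelines usually gain their edge.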
- Dremel – Dremel is a scalable, interactive ad hoc query system for analysis of read-only nested data. By combining multi-level execution trees and a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. It is built on a novel columnar storage representation for nested records, and the Dremel paper discusses experiments on few-thousand-node instances of the system.
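The multi-level execution tree is what turns a trillion-row aggregation into seconds of work: leaf servers compute partial aggregates over their local shards, and intermediate levels merge those partials on the way up to the root. A toy Python sketch of that serving-tree shape (illustrative only, with invented function names):

```python
# Toy sketch of Dremel's multi-level execution tree for SUM():
# leaves aggregate their local shard; each tree level merges the
# partial results of its children until the root holds the answer.

def leaf_aggregate(shard):
    # Each leaf scans only its own (columnar) shard of the table.
    return sum(shard)

def serving_tree(shards, fan_in=2):
    """Merge per-leaf partial sums level by level up to a single root."""
    level = [leaf_aggregate(s) for s in shards]
    while len(level) > 1:
        level = [sum(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]

shards = [[1, 2, 3], [4, 5], [6], [7, 8, 9]]
print(serving_tree(shards))  # 45
```

This only works directly for aggregates that decompose into partials (SUM, COUNT, MIN, MAX); the same tree shape still helps for others, but the merge step gets more involved.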
And not only at Google: the wider industry has already started solving these new big data problems with great products like Cloudera Impala, Amazon Redshift, Apache Drill, Facebook's Presto, and more.
Will Google release Colossus and the rest of its Big Data Stack 2.0 architecture at the 2017 Google I/O, creating the next wave of the big data stack?
Ref.: Google BigQuery Analytics by Jordan Tigani and Siddartha Naidu.
Interested? Questions? Feedback? Reach us at firstname.lastname@example.org!
Please subscribe to www.dataottam.com to keep yourself current on the ABCD of Data (Analytics, Big Data, Cloud Computing, and Digital).