The top 79 beautiful lines for taking big data architecture from drawing board to production!
Dear Data Community,
Instead of titling this blog “The top 79 beautiful lines for taking big data architecture from drawing board to production”, it might be more suitable to call it a book talk, inspired by Big Data: Principles and best practices of scalable real-time data systems by Nathan Marz with James Warren. I still remember that in the 2015 Christmas holiday season one of my mentors, Ranganathan Ramakrishnan, suggested I read it. I got the hard copy and read it once, but I could not get any deep architectural insight from it; I only got hold of the tools and technologies.
But currently I am fully engaged in a lot of big data analytics whiteboarding discussions with clients and other stakeholders, which urged me to take the book back off the shelf. Wow, what insightfulness about big data systems it gives me; hence I am dedicating this blog to Nathan & James. The whole book holds a lot of information that could yield a thousand-plus quotes, but for me the 79 below are everyday inspirational quotes, which I printed out and pasted in my room.
Hence, I would like to invite you to journey with me to explore how we can design and architect big data analytics systems with the agility, testability, deployability, scalability, and availability that lift us up toward data-driven digital transformation for the enterprise.
- By using the Lambda Architecture, we avoided the complexities that plague traditional architectures. By avoiding those complexities, we became dramatically more productive.
- The past decade has seen a huge amount of innovation in scalable data systems. These include large-scale computation systems like Hadoop and databases such as Cassandra and Riak. These systems can handle very large amounts of data, but with serious trade-offs.
- Hadoop, for example, can parallelize large-scale batch computations on very large amounts of data, but the computations have high latency.
- NoSQL databases like Cassandra achieve their scalability by offering us a much more limited data model than we are used to with something like SQL.
- A robust data system should answer all our questions based on information that was acquired in the past up to the present.
- The rawest information we have is information we hold to be true simply because it exists. Let’s call this information data.
- The Lambda Architecture provides a general-purpose approach to implementing an arbitrary function on an arbitrary dataset and having the function return its results with low latency.
- The Lambda Architecture does not recommend that we always use the exact same technologies every time we implement a big data analytical system.
- Desired properties of a Big Data system are robustness & fault tolerance, low latency reads & updates, scalability, generalization, extensibility, ad hoc queries, minimal maintenance, and debuggability.
- The problems with fully incremental architectures are operational complexity, the extreme complexity of achieving eventual consistency, and a lack of human-fault tolerance.
- The main idea of the Lambda Architecture is to build Big Data systems as a series of layers like Speed Layer, Serving Layer, and Batch Layer. My favorite 😊
- batch view = function(all data). My favorite 😊
- real-time view = function(real-time view, new data). My favorite 😊
- query = function(batch view, real-time view). My favorite 😊
- query = function(all data).
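The four equations above can be sketched directly in Python. This is a minimal illustration only; the word-count function and the `all_data`/`new_data` variables are hypothetical stand-ins for a real master dataset and an incoming event stream:

```python
from collections import Counter

# Hypothetical master dataset and fresh events (illustrative stand-ins).
all_data = ["login", "click", "click", "purchase"]
new_data = ["click", "login"]

def batch_view(data):
    # batch view = function(all data): precompute counts over the full dataset.
    return Counter(data)

def realtime_view(current_view, new_events):
    # real-time view = function(real-time view, new data): incremental update.
    updated = Counter(current_view)
    updated.update(new_events)
    return updated

def query(batch, realtime):
    # query = function(batch view, real-time view): merge both views.
    return batch + realtime

batch = batch_view(all_data)
speed = realtime_view(Counter(), new_data)
print(query(batch, speed)["click"])  # 2 from the batch view + 1 from the speed layer = 3
```

The point of the sketch is the shape of the composition, not the counting itself: the batch function sees everything, the speed function folds in only what the batch run has not yet seen, and the query merges the two.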
- The Batch Layer stores the master dataset and computes arbitrary views.
- The Serving Layer provides random access to the batch views and is updated by the batch layer.
- The Speed Layer compensates for the high latency of updates to the serving layer with fast, incremental algorithms; the batch layer eventually overrides the speed layer.
- The Batch and Serving layers satisfy almost all of the big data properties, like robustness, fault tolerance, scalability, generalization, extensibility, ad-hoc queries, minimal maintenance, and debuggability.
- Although SQL and NoSQL databases are often painted as opposites or as duals of each other, at a fundamental level they are really the same.
- Although the Lambda Architecture is generic and flexible, the individual components comprising the system are specialized. There is very little magic happening behind the scenes, as compared to something like a SQL query planner. This leads to more predictable performance.
- Even if we were to lose all our serving layer datasets and speed layer datasets, we could reconstruct our data application from the master dataset.
- There are two components to the master dataset: the data model we use and how we physically store it.
- Information is the general collection of knowledge relevant to our big data analytical system; it is synonymous with the colloquial usage of the word data.
- Data refers to the information that can’t be derived from anything else. Data serves as the axioms from which everything else derives.
- Queries are questions we ask of our data. For example, a query might fetch our financial transaction history to determine our current bank account balance.
- Views are information that has been derived from our base data. They are built to assist with answering specific types of queries.
- Immutable data may seem like a strange concept if we are well versed in relational databases, but it delivers human-fault tolerance and simplicity.
- Benefits of the fact-based model are that it is queryable at any point in its history, tolerates human errors, handles partial information, and has the advantages of both normalized and de-normalized forms.
- The fact-based model provides a simple yet expressive representation of our data by naturally keeping a full history of each entity over time.
- The master dataset is the source of truth within the Lambda Architecture. — My favorite 😊
- There must be an easy and effective means of transforming the data into batch views to answer actual queries.
- Any question we could ask of our dataset can be implemented as a function that takes all of our data as input.
- In the Lambda Architecture, the batch layer precomputes the master dataset into batch views so that queries can be resolved with low latency.
- The Lambda Architecture is about the trade-offs between recomputation algorithms, the style of algorithm emphasized in the batch layer, and incremental algorithms, the kind typically used with relational databases.
- The batch layer precomputes functions over the master dataset; processing the entire dataset introduces high latency. The serving layer then serves the precomputed results with low-latency reads.
- The speed layer fills the latency gap by querying recently obtained data.
- A naive strategy for computing on the batch layer would be to precompute all possible queries and cache the results in the serving layer. Because our master dataset is continually growing, we must have a strategy for updating our batch views when new data becomes available: a recomputation algorithm or an incremental algorithm.
- If our algorithm is recomputation based, all that’s required is to fix the algorithm and redeploy the code, and our batch view will be correct the next time the batch layer runs; this is because a recomputation-based algorithm recomputes the batch view from scratch.
- Recomputation algorithms are essential to supporting a robust data-processing system.
- Incremental algorithms can help us increase the efficiency of our system, but only as a supplement to recomputation algorithms.
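The contrast between the two styles can be sketched as follows. This is a minimal, illustrative example that maintains a count-per-key view; the function names are my own, not from the book:

```python
def recompute_view(master_dataset):
    # Recomputation: rebuild the view from scratch over the entire dataset.
    # Slow, but fixing a bug and rerunning the batch layer yields a correct view.
    view = {}
    for key in master_dataset:
        view[key] = view.get(key, 0) + 1
    return view

def incremental_update(view, new_records):
    # Incremental: fold only the new records into the existing view.
    # Fast, but any past error stays baked into the view forever.
    for key in new_records:
        view[key] = view.get(key, 0) + 1
    return view

master = ["a", "b", "a"]
view = recompute_view(master)           # {'a': 2, 'b': 1}
view = incremental_update(view, ["a"])  # {'a': 3, 'b': 1}
```

The human-fault-tolerance argument falls straight out of the sketch: if `incremental_update` had a bug last week, the view is silently wrong until a fresh `recompute_view` run over the master dataset replaces it.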
- The Lambda Architecture is a journey to make working with big data systems elegant, simple, and fun for all the data stakeholders.
- The key takeaway is that we must always have recomputation versions of our algorithms. This is the only way to ensure human-fault tolerance for our system, and human-fault tolerance is a non-negotiable requirement for a robust data system.
- Scalability is the ability of a system to maintain performance under increased load by adding more resources. Load in a big data system context is a combination of the total amount of data we have, how much new data we receive every day, how many requests per second our data application serves, and so forth.
- More important than a system being scalable is a system being linearly scalable; a linearly scalable system can maintain performance under increased load by adding resources in proportion to the increased load. A nonlinearly scalable system, despite being scalable, isn’t useful.
- The batch layer is the core of the Lambda Architecture. The batch layer is high latency by its nature, and we should use the high latency as an opportunity to do deep analysis and expensive calculations which we can’t do in real time.
- The way we express our computations is crucially important if we want to avoid complexity, prevent bugs, and increase productivity.
- Now the pieces of the batch layer: formulating a schema for our data, storing the master dataset, and running computations at scale with a minimum of complexity.
- The serving layer consists of databases that index and serve the results of the batch layer and it indexes the views and provides interfaces so that the precomputed data can be quickly queried.
- In the Lambda Architecture, the serving layer provides low-latency access to the results of calculations performed on the master dataset. The serving layer views are slightly out of date due to the time required for batch computation. But this is not a concern, because the speed layer will be responsible for any data not yet available in the serving layer.
- The Serving Layer is tightly tied to the batch layer because the batch layer is responsible for continually updating the serving layer views.
- When designing the indexes in the Serving Layer, we must consider two main performance metrics: throughput and latency.
- This technique of redundantly storing information to avoid joins is called de-normalization.
- The computation on the batch layer reads the master dataset in bulk, so there’s no need to design the schema to optimize for random-access reads.
- The serving layer is completely tailored to the queries it serves, so we can optimize as needed to attain maximal performance. These optimizations in the serving layer can go far beyond de-normalization.
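As a small sketch of the de-normalization idea, here is the same query answered from a normalized layout and from a de-normalized serving-layer view; the user/location tables and field names are hypothetical:

```python
# Normalized: users and locations kept separately; a query needs a join.
users = {1: {"name": "alice", "location_id": 10}}
locations = {10: {"city": "Chennai"}}

def city_of_user_normalized(user_id):
    # Join at query time: two lookups, one per table.
    return locations[users[user_id]["location_id"]]["city"]

# De-normalized serving-layer view: the city is stored redundantly with the
# user, so the query is a single random read. The batch layer rebuilds this
# view in bulk, so the redundancy never drifts out of date for long.
users_denormalized = {1: {"name": "alice", "city": "Chennai"}}

def city_of_user_denormalized(user_id):
    return users_denormalized[user_id]["city"]
```

Both functions answer the same question; the de-normalized view trades storage for the cheapest possible read path, which is exactly what a serving layer tailored to its queries should do.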
- Requirements for a serving layer database are batch writes, scalability, random reads, and fault tolerance; random writes are completely irrelevant to the serving layer because the views are only produced in bulk.
- Random writes do exist in the Lambda Architecture, but they are isolated within the speed layer to achieve low-latency updates.
- Because the serving layer doesn’t require random writes, it doesn’t require online compaction, so that complexity, along with its associated operational burden, completely vanishes in the serving layer.
- The Lambda Architecture has the capacity to store normalized data in the batch layer and de-normalized data in the serving layer.
- To lower the latency of updates as much as possible, the speed layer must take a fundamentally different approach than the batch and serving layers. As such, the speed layer is based on incremental computation instead of batch computation.
- First, the speed layer is only responsible for data yet to be included in the serving layer views; this data is at most a few hours old and is vastly smaller than the master dataset.
- Second, the speed layer views are transient. Once the data is absorbed into the serving layer views, it can be discarded from the speed layer.
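One way to sketch this transience, assuming a hypothetical scheme in which real-time views are keyed by batch-run id so that an entire view can be dropped once the batch layer has absorbed its data:

```python
# Hypothetical sketch: speed layer state partitioned by batch-run id.
realtime_views = {}  # batch_run_id -> {key: count}

def record_event(batch_run_id, key):
    # Incremental write into the real-time view for the current batch run.
    view = realtime_views.setdefault(batch_run_id, {})
    view[key] = view.get(key, 0) + 1

def expire_absorbed(batch_run_id):
    # Once the serving layer views cover this run's data, the corresponding
    # real-time view is redundant and can be discarded wholesale.
    realtime_views.pop(batch_run_id, None)

record_event(42, "click")
expire_absorbed(41)  # the older, already-absorbed view is dropped; run 42 remains
```

Dropping whole partitions, rather than deleting individual entries, is what keeps the speed layer's state small and its mistakes temporary.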
- The power of the Lambda Architecture lies in the separation of roles in the different layers; but in traditional data architectures such as those based on relational databases, all that exists is a speed layer.
- A simple strategy mirrors the batch/serving layer and computes the real-time views using all recent data as input, whereas the batch layer uses all data as input.
- The underlying storage layer for real-time views in the speed layer must therefore meet the following requirements: random reads, random writes, scalability, and fault tolerance.
- Speed layers store relatively small amounts of state because they only represent views on recent data; this is a benefit because real-time views are much more complex than serving layer views, with complexities such as online compaction and concurrency.
- It’s important to note that the speed layer is under less pressure because it stores considerably less data than the serving layer.
- The separation of roles and responsibilities within the Lambda Architecture limits complexity in the speed layer.
- The proper way to present the CAP theorem is: when a distributed data system is partitioned, it can be consistent or available but not both.
- To implement eventual consistency in the speed layer and to count correctly, we need to make use of structures called conflict-free replicated data types, commonly referred to as CRDTs.
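As a minimal sketch of one such structure, here is a grow-only counter (G-Counter), the simplest CRDT; the replica ids and the dict-based representation are illustrative, not from the book:

```python
# G-Counter: each replica increments only its own slot; merge takes the
# element-wise maximum, so replicas converge to the same total regardless
# of message ordering or duplicated merges.

def increment(counter, replica_id, amount=1):
    counter[replica_id] = counter.get(replica_id, 0) + amount

def merge(a, b):
    # Commutative, associative, and idempotent: the CRDT convergence conditions.
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

r1, r2 = {}, {}
increment(r1, "node-1")
increment(r2, "node-2")
increment(r2, "node-2")
merged = merge(r1, r2)
print(value(merged))  # 3
```

Because `merge` is idempotent, replaying the same replication message twice cannot double-count; that is what makes correct counting possible under eventual consistency.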
- Architectures without a batch layer backing up the real-time, incremental portions would suffer permanent corruption.
- The architectural advantages of asynchronous updates, better throughput and better management of load spikes, suggest implementing asynchronous updates unless we have a good reason not to.
- Incremental algorithms and random-write databases make the speed layer far more complex than the batch and serving layers, but one of the key benefits of the Lambda Architecture is the transient nature of the speed layer.
- Ideally a speed layer database would provide support to directly expire entries, but this is typically not an option with currently available databases.
- The speed layer is very different from the batch layer. Rather than compute functions on our entire dataset, we instead should compute using more complex incremental algorithms on more complex forms of storage. But the Lambda Architecture allows us to keep the speed layer small and therefore more manageable.
- Batch-local computation and stateful computation are the two core concepts of micro-batch stream processing.
- The Lambda Architecture is the result of starting from first principles: the general formulation of data problems as functions of all the data we’ve ever seen, plus mandatory requirements like human-fault tolerance, horizontal scalability, low-latency reads, and low-latency updates.
- Because the Lambda Architecture is based on functions of all data, it is by nature general-purpose, giving us the confidence to attack any big data analytical problem.
- Flow of data through the processing and serving layers of a generic lambda architecture. Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
You can reach us at email@example.com.
Please subscribe to the dataottam blogs to keep yourself up-to-the-minute on the ABC of Data (Artificial Intelligence, Big Data, Cloud, Cognitive, Chatbot).
Bye, until we see you in the next blog… Happy Data!