Understand Lambda Architecture in 2 minutes. What is Lambda Architecture? Lambda Architecture provides a combined solution for real-time data and batch data. What is the need for Lambda Architecture? Lambda Architecture was introduced mainly because of the latency of the MapReduce paradigm, where the batch views were created on […]
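To make the idea of combining batch and real-time data concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the function names (`build_batch_view`, `update_realtime_view`, `query`) and the page-view events are hypothetical and not taken from the post, and a real deployment would use a batch engine and a stream processor rather than in-memory counters.

```python
from collections import Counter


def build_batch_view(master_events):
    """Batch layer (hypothetical): recompute counts from the full master dataset."""
    return Counter(event["page"] for event in master_events)


def update_realtime_view(view, event):
    """Speed layer (hypothetical): fold in one event the last batch run has not seen."""
    view[event["page"]] += 1
    return view


def query(batch_view, realtime_view, page):
    """Serving step: answer a query by merging the batch and real-time views."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)


# Events already covered by the (slow) batch run.
master = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
batch_view = build_batch_view(master)

# Events that arrived after the batch run started.
realtime_view = Counter()
for ev in [{"page": "/home"}, {"page": "/pricing"}]:
    update_realtime_view(realtime_view, ev)

home_total = query(batch_view, realtime_view, "/home")
pricing_total = query(batch_view, realtime_view, "/pricing")
```

The batch view is complete but stale, and the real-time view covers only the recent gap, so a query reads both and adds them together.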
SPORK: Below is a snapshot from the actual post written by Mayur Rustagi of Sigmoid. As a data mashing platform, the first key initiative is to bring the power & simplicity of Apache Pig to Apache Spark, making existing ETL pipelines 100x faster than before. SPORK enables you to use Pig on Spark. We choose Apache Spark […]
RDD – Resilient Distributed Dataset. The only way to share data between MapReduce jobs is to store it on disk. When performing more than one transformation on a huge data set, the output of every transformation has to be stored on disk and replicated for fault tolerance. And if we had to […]
BlinkDB, a project being developed at the University of California, Berkeley, where Spark itself originated, is a massively parallel interactive query engine that processes tens of TB of data with response times of just the blink of an eye. BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by […]
As the size of the data increases, the usual processing time of the engines also increases. New tools keep coming up to tackle this problem, starting with MapReduce, now Spark, and probably something replacing Spark in the near future. But sticking to our title: Spark processes data in memory while MapReduce pushes the data […]
Apache Spark: 4+ years old; suited for sophisticated analytics at lightning speed; runs up to 100 times faster in memory; runs up to 10 times faster on disk; supports in-memory processing; suited for interactive computing at blazing-fast speeds; supports developers with Java, Python & Scala APIs; runs on an existing Hadoop cluster; compatible with HDFS, HBase and any […]
Apache Spark is a fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics. The RDD, which stands for Resilient Distributed Dataset, is a great abstraction for data sets: an immutable collection of data. In Spark, all work is expressed in the following ways: creating new RDDs, transforming existing RDDs, and calling operations on RDDs (e.g. val […]
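The three kinds of work listed above (creating, transforming, and calling operations on RDDs) can be illustrated with a toy stand-in written in plain Python. This is a sketch only: `ToyRDD` is a hypothetical single-machine class invented for illustration, not Spark's API, and it ignores partitioning and fault tolerance entirely. It shows just that transformations return new immutable collections lazily, while operations such as `collect` and `count` trigger the computation.

```python
class ToyRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        # Zero-argument function that yields the elements when called.
        self._compute = compute

    @classmethod
    def parallelize(cls, data):
        """Create a new ToyRDD from a local collection."""
        items = tuple(data)  # immutable snapshot of the input
        return cls(lambda: iter(items))

    def map(self, f):
        """Transformation: returns a new ToyRDD, computes nothing yet."""
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        """Transformation: returns a new ToyRDD, computes nothing yet."""
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def collect(self):
        """Operation: runs the chain and returns all elements as a list."""
        return list(self._compute())

    def count(self):
        """Operation: runs the chain and returns the number of elements."""
        return sum(1 for _ in self._compute())


numbers = ToyRDD.parallelize(range(1, 6))             # create
squares = numbers.map(lambda x: x * x)                # transform
even_squares = squares.filter(lambda x: x % 2 == 0)   # transform
result = even_squares.collect()                       # operation
total = numbers.count()                               # operation
```

Because each transformation builds a new object instead of modifying the old one, `numbers` can still be reused after `squares` and `even_squares` are derived from it.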
Spark Streaming is Spark's module for applications that benefit from acting on data as soon as it arrives from various sources, e.g. tracking page views in real time, training a machine learning model, or automatically detecting anomalies. Developers can use an API that is very similar to the one for batch jobs, so we can reuse the same API skills and […]
Many thanks for your cherished time. This time we would like to share with you the details of the 3 S's of Spark. As we all know, the 3 V's of Big Data are Volume, Variety & Velocity, later extended with additional V's like Veracity & Value. Big Data is defined as a collection of […]
Spark began life in 2009 as a project within the AMPLab at the University of California, Berkeley. More specifically, it was born out of the necessity to prove out the concept of Mesos, which was also created in the AMPLab. Spark was first discussed in the Mesos white paper titled Mesos: A Platform for Fine-Grained Resource […]
The tips below were not written by me (Kumar Chinnakali). They were learnt from mammothdata.com, and I felt they could help our big data community, where Apache Spark is currently changing the world of Analytics & Big Data. Mammothdata team, tons of thanks for sharing with us. Spark is written in Scala, so new features […]
Thanks for your time; I definitely try to value yours. In Part 1 we discussed Apache Spark libraries and Spark components like the Driver, DAG Scheduler, Task Scheduler, and Worker. Now in Part 2 we will discuss the basics of Spark concepts like Resilient Distributed Datasets, Shared Variables, SparkContext, Transformations, Actions, and Advantages of […]
Big Data Meets Microsoft Azure! For Big Data & Cloud...
How to Ingest HDFS in JSON format using Apache Sqoop?...
The 4 Key Concepts in the Anatomy of an Apache Spark Job!...
The 1-2-3-4-5-6-7-8-9 of Cognitive Computing! Dear Data...