BlinkDB, a project developed at UC Berkeley (where the evolution of Spark also started), is a massively parallel, interactive query engine that processes tens of TB of data with a response time of just a blink of an eye. BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by […]
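The accuracy-for-speed trade-off can be sketched in plain Python: answer an aggregate query from a uniform random sample instead of scanning the full table. This is only an illustration of the sampling idea, not BlinkDB's actual implementation.

```python
import random

# Illustration of sampling-based approximate querying: compute an
# average over a 1% uniform random sample instead of the full data.
random.seed(42)
full_table = [random.uniform(0, 100) for _ in range(1_000_000)]

exact_avg = sum(full_table) / len(full_table)

sample = random.sample(full_table, 10_000)  # touch only 1% of the rows
approx_avg = sum(sample) / len(sample)

print(f"exact={exact_avg:.2f} approx={approx_avg:.2f}")
# The approximate answer is typically within ~1% of the exact one
# while reading only a fraction of the data.
```

BlinkDB adds error bounds and stratified samples on top of this basic idea, so users can specify either a time budget or an accuracy target.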
As data sizes grow, the processing time of existing engines grows with them, and new tools keep coming up to tackle this problem: first MapReduce, now Spark, and likely something replacing Spark in the near future. But sticking to our title: Spark processes data in-memory, while MapReduce pushes the data […]
Framework of an Apache Spark Job Run! The big data analytics community has started using Apache Spark in full swing for big data processing. The processing could be for ad-hoc queries, prebuilt queries, graph processing, machine learning, or even data streaming. Hence, understanding Spark job submission is vital for […]
I picked this up from an atScale webinar and found it useful, so I am happy to share it with our big data & analytics community. It is very clear that Hadoop is budding from its batch-processing origins into a flexible, economical hub where enterprises store raw data, keep archival data active, and grow their options for data investigation, […]
Apache Spark: 4+ years old. Suited for sophisticated analytics at lightning speed. Runs 100 times faster in memory and 10 times faster on disk. Supports in-memory processing. Suits interactive computing at blazing-fast speeds. Supports developers with Java, Python & Scala APIs. Runs on existing Hadoop clusters. Compatible with HDFS, HBase and any […]
Apache Spark is a fast, general engine for big data processing, with libraries for SQL, streaming, and advanced analytics. The RDD, which stands for Resilient Distributed Dataset, is a great abstraction: an immutable, distributed collection of data. In Spark, all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs (e.g. val […]
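The create/transform/act workflow above can be sketched without Spark at all: Python generators stand in for the laziness of RDD transformations, and a reducing call plays the role of an action that forces evaluation. This is a conceptual sketch, not the Spark API.

```python
# Non-Spark sketch of the RDD workflow: create a dataset, chain lazy
# transformations, then trigger computation with an action.

data = range(1, 6)                           # "create" a dataset

squared = (x * x for x in data)              # transformation: lazy, nothing runs yet
evens = (x for x in squared if x % 2 == 0)   # another lazy transformation

result = sum(evens)                          # "action": forces evaluation
print(result)  # 4 + 16 = 20
```

In real Spark the same pipeline would be `sc.parallelize(...).map(...).filter(...).reduce(...)`, with the transformations deferred until the action runs, exactly as the generators are here.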
Spark SQL is Spark's interface for both structured and semi-structured data. It loads data from a variety of structured sources such as Hive tables, JSON, and Parquet columnar storage. Spark SQL allows you to query data using SQL, both from within a Spark program and from tools external to the Spark core engine. It provides robust integration between SQL and Python/Java/Scala code. Spark SQL […]
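The pattern Spark SQL enables, loading semi-structured records and querying them with SQL, can be sketched with the standard library's `sqlite3` (this is not the Spark API; table and field names here are made up for illustration):

```python
import json
import sqlite3

# Load JSON records, register them as a table, then query with SQL --
# the same load/register/query pattern Spark SQL uses at scale.
records = [json.loads(s) for s in (
    '{"name": "alice", "visits": 12}',
    '{"name": "bob",   "visits": 7}',
)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, visits INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(r["name"], r["visits"]) for r in records])

rows = conn.execute("SELECT name FROM users WHERE visits > 10").fetchall()
print(rows)  # [('alice',)]
```

In Spark the equivalent would be reading the JSON into a DataFrame, registering it as a temp view, and running the same SQL against it.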
Spark Streaming is Spark's module for applications that benefit from acting on data as soon as it lands/arrives from various sources, e.g. counting page views in real time, training a machine learning model, or automatically detecting anomalies. Developers can use an API that is very similar to batch jobs, so we can reuse the same API skills, and […]
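Spark Streaming achieves its batch-like API through micro-batching: incoming data is chopped into small batches and each batch is processed with ordinary batch logic. A toy sketch of that idea, with a fixed batch size standing in for Spark's time interval:

```python
from collections import deque

# Micro-batching sketch: drain the incoming queue a few records at a
# time and run normal batch logic (here, a sum) on each micro-batch.
incoming = deque([3, 1, 4, 1, 5, 9, 2, 6])
batch_size = 3  # stands in for a batch interval, e.g. "every 1 second"

batches = []
while incoming:
    batch = [incoming.popleft()
             for _ in range(min(batch_size, len(incoming)))]
    batches.append(sum(batch))  # same code you would run on a batch job

print(batches)  # [8, 15, 8]
```

This is why the streaming API feels like the batch API: each micro-batch is, in effect, a tiny batch job.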
Team, this time I go with the title "Top 3 methods of skipping big data's bad data using Hadoop!", which describes how to get corrupt records out of large data sets that contain different formats of data. During analysis, if the corrupt records are a small percentage, we can ignore them, or […]
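One common way to skip bad records, independent of Hadoop's built-in skip features, is to parse each record defensively and count the rejects so a high corruption rate is not silently ignored. A minimal sketch with hypothetical field names:

```python
import csv
import io

# Defensive parsing: keep records that validate, count the ones that
# fail instead of letting one bad row kill the whole job.
raw = io.StringIO("id,amount\n1,10.5\n2,not-a-number\n3,7.25\n")

good, bad = [], 0
for row in csv.DictReader(raw):
    try:
        good.append((int(row["id"]), float(row["amount"])))
    except ValueError:
        bad += 1  # corrupt record: log and skip rather than fail

print(len(good), bad)  # 2 1
```

In a MapReduce job the same try/except pattern goes in the mapper, often paired with a counter so the bad-record rate shows up in the job metrics.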
Team, thanks for reading & engaging! This time I planned to share my learning on Hadoop schedulers, titled "Simplified Hadoop Schedulers Overview!" By choosing a suitable scheduler, we can make response times faster for all the smaller jobs, while all the production jobs are guaranteed SLAs (Service […]
Thank you for your valuable time; it's much appreciated. This time I would like to share the blog "Quick Card On – Apache Hive Joins!", a handy Apache Hive joins reference card or cheat sheet. A SQL JOIN clause is used to combine rows from two or more tables, based on a common […]
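The JOIN clause described above works the same way in HiveQL as in standard SQL. A runnable sketch using `sqlite3`, with made-up table and column names:

```python
import sqlite3

# An inner JOIN combining rows from two tables on a common column;
# the same SELECT ... JOIN ... ON syntax works in HiveQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp  (id INTEGER, name TEXT, dept_id INTEGER);
    CREATE TABLE dept (id INTEGER, dept_name TEXT);
    INSERT INTO emp  VALUES (1, 'asha', 10), (2, 'ravi', 20);
    INSERT INTO dept VALUES (10, 'sales'), (30, 'hr');
""")

rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM emp e JOIN dept d ON e.dept_id = d.id
""").fetchall()
print(rows)  # [('asha', 'sales')] -- ravi has no matching department
```

Swapping `JOIN` for `LEFT OUTER JOIN` would keep ravi's row with a NULL department, which is the main distinction the cheat sheet covers.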
Hadoop compression techniques bring us benefits in Hadoop I/O operations, such as space savings and faster processing. We have many compression formats and algorithms, each with pros and cons. Nothing new is added here; it is just consolidated to have handy for production implementations. All techniques exhibit a space/time trade-off. We have options from […]
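The space/time trade-off can be demonstrated with the standard library's `gzip`: higher compression levels produce smaller output but take longer. (Hadoop-specific concerns such as codec splittability are not shown here.)

```python
import gzip
import time

# Same input, two gzip levels: level 1 favours speed, level 9 favours
# size -- the space/time trade-off in miniature.
data = b"hadoop compression trade-off " * 20_000

sizes = {}
for level in (1, 9):
    t0 = time.perf_counter()
    out = gzip.compress(data, compresslevel=level)
    dt = time.perf_counter() - t0
    sizes[level] = len(out)
    print(f"level={level} size={len(out)} time={dt:.4f}s")
```

The same trade-off shapes codec choice in Hadoop: fast codecs like Snappy or LZO for hot intermediate data, heavier ones like gzip or bzip2 for cold archival data.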
Big Data Meets Microsoft Azure! For Big Data & Cloud...
How to Ingest HDFS in JSON format using Apache Sqoop?...
The 4 Key Concepts in the Anatomy of an Apache Spark Job!...
The 1-2-3-4-5-6-7-8-9 of Cognitive Computing! Dear Data...