The Data Movement in Big Data Ecosystem
Thanks for your valuable time. In this post we would like to share the details of how data movement happens in the big data ecosystem, hence the title "The Data Movement in Big Data Ecosystem".
Ingesting data into Hadoop from systems such as RDBMSs, mainframes, logs, machine-generated data and event streams is vital, and extracting data back out of Hadoop matters just as much. Data ingestion deserves the same attention as data storage, modeling and computing in the big data space.
Broadly, we have to weigh the following considerations for any kind of data movement in the Hadoop ecosystem.
- Data Ingestion and Access Timelines
- Data Updates for Incremental Events
- Data Access and Processing
- Data Structure and Source System
- Data Transformation, Partitioning and Splitting
- Data Storage Format
Data Ingestion and Access Timelines: The timeline is the time it takes for data to become available in the Hadoop ecosystem once the source system is ready to hand it over. When designing the ingestion architecture we need to consider the following latency classes: macro batch (15 min+), micro batch (2 min+), NRT decision support (2 s+), NRT event processing (~100 ms range) and real time (<100 ms). HDFS commands, Sqoop, Flume, Kafka, Storm and Spark Streaming are a few of the options available for moving data from non-Hadoop systems into Hadoop and vice versa; a small batch-ingest sketch follows below.
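As one illustration of the macro-batch end of that spectrum, here is a minimal PySpark sketch of a Sqoop-style pull from an RDBMS into HDFS. The hostname, database, table, column names and paths are hypothetical placeholders, not a prescription.

```python
# Macro-batch pull from an RDBMS into HDFS, in the spirit of a Sqoop import.
# All connection details and paths below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdbms-macro-batch-ingest").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://db-host:3306/sales")   # hypothetical source DB
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "etl_password")
          .option("numPartitions", 8)              # parallel readers, like Sqoop mappers
          .option("partitionColumn", "order_id")
          .option("lowerBound", 1)
          .option("upperBound", 1000000)
          .load())

# Land the pull in a staging path on HDFS; downstream jobs read from here.
orders.write.mode("overwrite").parquet("hdfs:///staging/sales/orders")
```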
Data Updates for Incremental Events: Once the data is ingested, we need to consider how increments arrive, that is, whether records are only appended or also modified. For append-only data, HDFS is the best fit. For modifications, the usual workaround is to keep a delta file containing the changes that need to be applied to the existing data, and then run a compaction job to merge them; a sketch of such a compaction job follows the next paragraph.
Data Access and Processing: Data layout and processing are designed based on the underlying requirements and how the information has to be delivered.
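Below is a minimal sketch of the compaction job mentioned above, assuming each record carries a primary key and a last-updated timestamp; the paths and column names (customer_id, updated_at) are hypothetical.

```python
# Merge a delta file into the existing base data, keeping only the newest
# version of each record. Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("delta-compaction").getOrCreate()

base = spark.read.parquet("hdfs:///data/customers/base")     # existing data
delta = spark.read.parquet("hdfs:///data/customers/delta")   # incoming changes

# Union base and delta, then keep only the latest row per key.
merged = base.unionByName(delta)
latest = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())
compacted = (merged
             .withColumn("rn", F.row_number().over(latest))
             .filter(F.col("rn") == 1)
             .drop("rn"))

# Write to a new base directory, then swap it in for the old one.
compacted.write.mode("overwrite").parquet("hdfs:///data/customers/base_compacted")
```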
Data Structure and Source System: While ingesting data from a file system we need to consider read speed (disk I/O), the original file type (delimited, JSON, XML, Avro, fixed length, variable length, copybooks, etc.), compression (Gzip, LZO, Snappy, BZip2), streaming data (Twitter feeds, JMS queues, events, typically handled with Flume or Kafka) and log files; a few read examples are sketched below.
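To make the point concrete, here is a short PySpark sketch reading a few of the source shapes listed above. The paths are hypothetical; Spark decompresses Gzip text files transparently on read, and the Avro line assumes the spark-avro package is on the classpath.

```python
# Reading a handful of common source formats; all paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("source-format-reads").getOrCreate()

# Delimited file with a header row and a pipe separator
csv_df = (spark.read.option("header", True).option("sep", "|")
          .csv("hdfs:///landing/feed.psv"))

# JSON event data, gzip-compressed (decompressed automatically on read)
json_df = spark.read.json("hdfs:///landing/events/*.json.gz")

# Avro container files (requires the spark-avro package)
avro_df = spark.read.format("avro").load("hdfs:///landing/clickstream.avro")

# Raw log files as lines of text, to be parsed with custom logic downstream
logs_df = spark.read.text("hdfs:///landing/app-logs/")
```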
Data Transformation, Partitioning and Splitting: Transformation normally refers to modifying the incoming data, distributing it into partitions or buckets, and sending it to more than one location or store. For example, XML or JSON is converted into delimited data, and the data is partitioned (say, by date) and split across stores (HBase vs. HDFS) according to the use case. The decision on how to transform the data depends on the timelines of the requirement; a small transform-and-partition sketch follows.
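Here is a minimal sketch of that transform-and-partition step: JSON events are flattened into a delimited layout and written partitioned by event date. The column names (event_ts, event_id, user_id, event_type) and paths are hypothetical.

```python
# Flatten JSON events to a delimited layout and partition the output by date.
# Column names and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transform-partition").getOrCreate()

events = spark.read.json("hdfs:///staging/events/")

# Derive a date column to partition on and keep only the fields needed downstream.
shaped = (events
          .withColumn("event_date", F.to_date("event_ts"))
          .select("event_id", "user_id", "event_type", "event_date"))

# Partitioning by date means a query for one day touches only that directory;
# delimited text is written for consumers that expect CSV.
(shaped.write
       .mode("overwrite")
       .partitionBy("event_date")
       .option("header", True)
       .csv("hdfs:///warehouse/events_delimited"))
```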
Data Storage Format: Here we need to decide what format the data will be stored in. The main choices are the file format (plain text, or Hadoop-specific formats such as SequenceFile, Avro and Parquet), the compression codec, and the underlying data store (HDFS, GlusterFS, Quantcast File System, Isilon OneFS, NetApp); a short comparison sketch follows.
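As a small illustration of the file-format choice, the sketch below writes the same data set two ways: plain delimited text and Snappy-compressed Parquet. The paths are hypothetical.

```python
# Writing the same data as plain text and as Snappy-compressed Parquet.
# Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-format-choice").getOrCreate()

df = spark.read.parquet("hdfs:///staging/sales/orders")

# Plain text: human-readable and splittable, but larger and schema-less.
df.write.mode("overwrite").option("header", True).csv("hdfs:///warehouse/orders_text")

# Parquet with Snappy: columnar and compressed, with the schema stored in the
# file, generally a better fit for analytical scans.
(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .parquet("hdfs:///warehouse/orders_parquet"))
```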
Reference: open source community documentation and the book Hadoop Application Architectures.
It’s true that the list is not complete; please feel free to comment and suggest.