Is HDFS the Heart of Hadoop?
Team, tons and thousands of thanks for reading and engaging!
This time it is my pleasure to share with you all my learnings on importing and exporting data with Hadoop's file system, the core component that pumps data to databases, warehouses, analytics, and the business. We titled this piece "Heart of the Hadoop is HDFS".
There is no doubt that data for a Hadoop system is collected from many diverse systems with the help of ingestion/import techniques. Once processing is done, the filtered, transformed, and aggregated data is exported to multiple external systems. The sources and destinations can therefore include the local file system, relational databases, NoSQL databases, NewSQL databases, distributed databases, MPP databases, and even other Hadoop clusters.
Export & Import Techniques:
- Shell Command Prompt: These commands are built on top of the HDFS FileSystem API and ship with a shell script called hadoop. They help load ad hoc data. E.g., hadoop fs -copyFromLocal or -put (import), and hadoop fs -copyToLocal or -get (export). A CRC (Cyclic Redundancy Check) checksum is used to verify the copied data (see the first sketch after this list).
- Distributed Copy (distcp): A tool that helps import or export very large amounts of data between clusters. Its typical real-world use is moving data between environments such as development, research, and production. For distcp to work correctly, speculative execution should be disabled on the cluster running the copy, and the copy should be run from the destination cluster. E.g., hadoop distcp, hadoop distcp -overwrite, hadoop distcp -update (see the distcp sketch after this list).
- Sqoop: The name is just "SQL-to-Hadoop", and the tool is similar to distcp in that it is built on top of MapReduce to gain the benefits of fault tolerance and parallelism. It helps move data from an RDBMS to Hadoop and vice versa, by leveraging a JDBC connector or a native RDBMS connector. E.g., sqoop import -m 1 --connect jdbc:mysql://<host>:<port>/logs --username hdp_usr --password test1 --table weblogs --target-dir /data/weblogs/import. By default the split is based on the table's key, but we can tailor it with the --split-by argument. Sqoop uses metadata from a dynamically generated class that implements DBWritable for each field. A matching sqoop export command reverses the flow (see the Sqoop sketch after this list).
- MongoOutputFormat: This adapter helps export data from HDFS to MongoDB, while its counterpart MongoInputFormat imports data from MongoDB into HDFS. The package com.mongodb.hadoop.* lets us do the import and export job via MapReduce programming (see the job-submission sketch after this list).
- Pig <-> MongoDB: The Mongo Hadoop adapter ships with the Mongo Java Driver and leverages the MongoInputFormat & MongoOutputFormat classes. We write a Pig script instead of a MapReduce job (though in turn it still runs as MapReduce) that reads the data and stores it into a MongoDB collection (see the Pig run sketch after this list). E.g.,
define MongoStorage com.mongodb.hadoop.pig.MongoStorage();
weblogs = load '/data/weblogs/' as
    (md5:chararray, url:chararray, date:chararray);
store weblogs into 'mongodb://<host>:<port>/test.weblogs_from_pig' using MongoStorage;
- Flume: This project is designed to efficiently load streaming data from many different sources into Hadoop's file system. Its most common usage is loading weblog data. E.g., flume dump 'text("/path/to/file")'. It uses Source and Sink abstractions linked by a pipe-like data flow. It ships with predefined sources such as null, stdin, rpcSource, text, and tail, and sinks including null, collectorSink, console, formatDfs, and rpcSink (see the Flume sketch after this list).
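A minimal sketch of the shell round trip described in the first bullet; the local file name weblogs.txt is a placeholder, not from the original post:
    hadoop fs -mkdir /data/weblogs                              # create the target directory in HDFS
    hadoop fs -copyFromLocal weblogs.txt /data/weblogs/         # import (same effect as -put); CRC verified on write
    hadoop fs -copyToLocal -crc /data/weblogs/weblogs.txt .     # export (same effect as -get), keeping the .crc checksum file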
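A hedged distcp sketch; the NameNode addresses nn1/nn2 and port 8020 are illustrative assumptions:
    # run from the destination cluster, with map-side speculative execution disabled
    # (on MR1 clusters the property is mapred.map.tasks.speculative.execution)
    hadoop distcp -Dmapreduce.map.speculative=false \
        hdfs://nn1:8020/data/weblogs hdfs://nn2:8020/data/weblogs
    hadoop distcp -overwrite hdfs://nn1:8020/data/weblogs hdfs://nn2:8020/data/weblogs  # replace existing files
    hadoop distcp -update hdfs://nn1:8020/data/weblogs hdfs://nn2:8020/data/weblogs     # copy only missing/changed files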
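A hedged Sqoop sketch that completes the truncated export example from the Sqoop bullet; the host, port, credentials, and the weblogs_export table are placeholders:
    # import: RDBMS -> HDFS, with 4 parallel mappers split on the md5 column
    sqoop import --connect jdbc:mysql://<host>:<port>/logs \
        --username hdp_usr --password test1 \
        --table weblogs --target-dir /data/weblogs/import \
        --split-by md5 -m 4
    # export: HDFS -> RDBMS, reading the files Sqoop imported above
    sqoop export -m 1 --connect jdbc:mysql://<host>:<port>/logs \
        --username hdp_usr --password test1 \
        --table weblogs_export --export-dir /data/weblogs/import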
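A hedged sketch of submitting a mongo-hadoop MapReduce job from the shell; the jar names, the driver class ExportToMongoDB, and the argument layout are assumptions for illustration, not part of the actual adapter:
    # ship the adapter and driver jars with the job (names are placeholders)
    hadoop jar my-mongo-jobs.jar ExportToMongoDB \
        -libjars mongo-hadoop-core.jar,mongo-java-driver.jar \
        /data/weblogs/import 'mongodb://<host>:<port>/test.weblogs'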
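A hedged sketch of running the Pig script from the Pig <-> MongoDB bullet; the script file name and jar names are placeholders:
    # put the adapter, its Pig module, and the Java driver on Pig's classpath
    pig -Dpig.additional.jars=mongo-hadoop-core.jar:mongo-hadoop-pig.jar:mongo-java-driver.jar \
        weblogs_to_mongo.pig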
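A hedged Flume sketch using two of the predefined sources named in the Flume bullet; the file paths are illustrative:
    flume dump 'text("/path/to/file")'        # read a text file once and print its events to the console
    flume dump 'tail("/var/log/messages")'    # follow a growing log file, printing new events as they arrive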
Once again, thanks for your time and engagement. Do we have any other alternatives for import/export with Hadoop's file system?
To conclude: yes, from a storage perspective, HDFS is the core and heart of Hadoop.