Top 16 Hadoop Built-in Ingress and Egress Tools!
Hadoop has revolutionized data ingestion, data processing, and enterprise data warehousing, but its explosive growth has come with a large amount of uncertainty, hype, and confusion. In this blog, enterprise decision makers will get quick insights into 16 Hadoop built-in ingress and egress tools.
- Command Line: It’s easy to copy data to and from HDFS using the command-line interface (CLI). The put and get options perform these tasks for us. The put option is more useful than the copyFromLocal option because it supports multiple file sources and can also work with standard input. There are also moveFromLocal and moveToLocal options, which are useful for ingress / egress operations where we want to remove the source after the copy succeeds.
- Java API: Hadoop has an org.apache.hadoop.fs package that contains the filesystem classes. The FileSystem class is an abstract class with several implementations, including DistributedFileSystem for HDFS. It exposes basic filesystem operations such as create, open, and delete.
- Python / Perl / Ruby with Thrift: Apache Thrift is an open source client-server RPC protocol library. Hadoop has a contrib project that contains a Thrift server and bindings for various client languages, including Python, Ruby, and Perl. The disadvantage of using the Thrift server is that it adds another layer of indirection on top of HDFS, which means that reads and writes won’t be as fast as they could be. Also, because the Thrift server performs all the interactions with HDFS, we lose any client-side data locality that we would otherwise get when a Thrift client runs on a DataNode.
- Hadoop FUSE: Hadoop comes with a component called FuseDFS, which allows HDFS to be mounted as a Linux volume via Filesystem in Userspace (FUSE). Because FUSE is a user-space filesystem, there are quite a number of hops and layers between the client application and the eventual HDFS operations, and the main issues are around performance and consistency. To conclude, although Hadoop FUSE sounds like an interesting idea, it’s not ready for production environments.
- NameNode embedded HTTP: The advantage of using HTTP to access HDFS is that it removes the need to have the HDFS client code installed on every host that requires access. Further, HTTP is ubiquitous, and many tools and most programming languages support it out of the box, which makes HDFS that much more accessible. The NameNode has an embedded Jetty HTTP / HTTPS web server, which is used by the SecondaryNameNode to read images and merge them back. It also supports utilities such as distCp to enable cross-cluster copies.
- HDFS Proxy: The HDFS Proxy is a component in the Hadoop project that provides a web app proxy front end to HDFS. Its advantages over the embedded HTTP server are an access control layer and support for multiple Hadoop versions.
- Hoop: Hoop is a REST, JSON-based HTTP / HTTPS server that provides access to HDFS. Its advantage over the current Hadoop HTTP interface is that it supports writes as well as reads. The project was created by Cloudera.
- WebHDFS: WebHDFS is an API included in Hadoop that provides REST / HTTP read / write access to HDFS. It can be used, for example, to create a directory, write a file to that directory, and finally remove the file. WebHDFS might be turned off by default; to enable it, set dfs.webhdfs.enabled to true in hdfs-site.xml and restart HDFS.
- Distributed Copy: Hadoop has a command-line tool called distCp for copying data between Hadoop clusters. It performs the copy in a MapReduce job, where the mappers copy from one filesystem to another. One useful characteristic of distCp is that it can copy between multiple versions of Hadoop. distCp does support FTP as a source, but unfortunately not HTTP.
- WebDAV: Web-based Distributed Authoring and Versioning (WebDAV) is a series of HTTP methods that offer file collaboration facilities.
- MapReduce: MapReduce is a great mechanism for getting data into HDFS. A few notes on the implementation discussed in the reference: it is safe under speculative execution, as opposed to distCp, and its connection and read timeouts can be controlled via the httpdownload.connect.timeout.millis and httpdownload.read.timeout.millis configuration settings respectively.
- Apache Flume: Apache Flume is a distributed system for collecting streaming data, originally developed by Cloudera. It is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has four primary components: nodes, agents, collectors, and masters.
- Chukwa: Chukwa is an Apache subproject of Hadoop that also offers a large-scale mechanism to collect and store data in HDFS. It supports two levels of reliability: end-to-end reliability, and fast-path delivery, which minimizes latency. After writing data into HDFS, Chukwa runs a MapReduce job to demultiplex the data into separate streams. It also offers a tool called Hadoop Infrastructure Care Center (HICC), which is a web interface for visualizing system performance.
- Scribe: Scribe is a rudimentary streaming log distribution service, developed and used heavily by Facebook. A Scribe server that collects logs runs on every node and forwards them to a central Scribe server. Scribe supports multiple data sinks, including HDFS, regular filesystems, and NFS.
- HDFS File Slurper: The HDFS File Slurper is an open source project that can copy files of any format in and out of HDFS. It’s a simple utility that supports copying files from a local directory into HDFS and vice versa. The Slurper reads any files that exist in a source directory and optionally consults a script to determine the file placement in the destination directory. A key feature of the Slurper’s design is that it doesn’t work with partially written files.
- Sqoop: Sqoop stands for SQL-to-Hadoop. Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
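
To make the WebHDFS item above a little more concrete, here is a minimal Python sketch of the three REST calls it mentions (create a directory, write a file, remove the file). It only builds the request URLs and does not contact a cluster. The host name, the port 50070 (the usual NameNode HTTP port on older releases; newer ones tend to use 9870), the user, and the paths are all assumptions for illustration, and the helper names webhdfs_url and plan_requests are hypothetical, not part of Hadoop.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS URL: http://host:port/webhdfs/v1/<path>?op=<OP>&..."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute, got: %r" % path)
    query = {"op": op}
    query.update(params or {})
    return "http://%s:%s/webhdfs/v1%s?%s" % (host, port, quote(path), urlencode(query))


def plan_requests(host, port, directory, filename, user):
    """Return (HTTP method, URL) pairs: make a directory, create a file, delete it."""
    file_path = directory.rstrip("/") + "/" + filename
    auth = {"user.name": user}  # simple (non-Kerberos) authentication
    return [
        # MKDIRS is sent as an HTTP PUT.
        ("PUT", webhdfs_url(host, port, directory, "MKDIRS", auth)),
        # CREATE is also a PUT; the NameNode answers with a redirect to a
        # DataNode, and the file content is then PUT to that location.
        ("PUT", webhdfs_url(host, port, file_path, "CREATE", dict(auth, overwrite="true"))),
        # DELETE is sent as an HTTP DELETE.
        ("DELETE", webhdfs_url(host, port, file_path, "DELETE", auth)),
    ]


if __name__ == "__main__":
    # Hypothetical cluster details, for illustration only.
    for method, url in plan_requests("namenode.example.com", 50070, "/tmp/demo", "hello.txt", "alice"):
        print(method, url)
```

Each pair can be handed to any HTTP client (curl, urllib, and so on). Keeping URL construction separate from the network call makes the requests easy to inspect before they are sent to a real cluster.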
Apart from the above, there are many other technologies, such as Kafka, Storm, and Spark Streaming, which will be discussed in future blogs.
Reference – Hadoop in Practice, Alex Holmes and Big Data Communities.