Sqoop Use Cases
- Sqoop was originally developed by Cloudera
- You can import data from a relational database into HDFS and export it from HDFS back to the relational database
- Sqoop supports many RDBMSs, not just MySQL; it also supports legacy systems such as mainframe DB2
Sqoop Use cases:
ELT: Extract Load Transform
- Extract operational data from the RDBMS, process it in Hadoop, and optionally send the end result back to the RDBMS
- E.g.: if billing needs to be processed every week, we can run it as a batch job in Hadoop, taking advantage of parallel processing, and then send the summarized billing data back to the RDBMS
- Sqoop handles copying the data from the RDBMS to Hadoop and back to the RDBMS after processing, as sketched below
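- A minimal sketch of this ELT flow with Sqoop; the host, database, table, and HDFS path names are placeholders for illustration:

      # Pull the weekly billing data from MySQL into HDFS for batch processing
      sqoop import \
        --connect jdbc:mysql://dbhost:3306/billing \
        --username etl_user -P \
        --table invoices \
        --target-dir /data/billing/invoices \
        --num-mappers 4

      # After the Hadoop job produces the summary, push it back to the RDBMS
      sqoop export \
        --connect jdbc:mysql://dbhost:3306/billing \
        --username etl_user -P \
        --table weekly_billing_summary \
        --export-dir /data/billing/summary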
ETL: Extract Transform and Load
- Extract the data from the RDBMS, transform it, and load it into the data warehouse (again a relational database, but used for business intelligence and analytics); in some organizations tools like Informatica/Talend are already in place for this. A relational data warehouse requires structured data and special schema models, and ETL jobs are executed at sparse intervals to reduce the load on the source databases
- Sqoop can be used to import the data from the RDBMS into Hadoop and use Hadoop as an intermediate parallel processing engine within the overall ETL process; Hadoop effectively becomes the T (Transform) of ETL, and the end results can then be copied to the traditional data warehouse
- Both use cases above use Hadoop to transform the data and copy the results back to relational databases
- The difference is that in ELT we copy the operational data from the RDBMS to Hadoop for processing and copy the result back to the same RDBMS
- In ETL we use Hadoop as the data warehouse transformation engine, copying the results into the relational data warehouse
- All of this data movement to and from the RDBMS is handled by Sqoop; a sketch of the load into the warehouse follows below
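- A sketch of loading the transformed results into the relational data warehouse; the connection string, table, and path are placeholders, and upsert behaviour (--update-mode allowinsert) depends on the connector in use:

      # Load the transformed results from HDFS into the warehouse fact table,
      # updating existing rows by key and inserting new ones where supported
      sqoop export \
        --connect jdbc:oracle:thin:@dwhost:1521:DWH \
        --username dwh_user -P \
        --table SALES_FACT \
        --export-dir /data/etl/sales_fact_out \
        --update-key SALE_ID \
        --update-mode allowinsert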
- Organizations with business intelligence applications already deployed can keep working with their current system and simply offload the processing and transformation jobs to Hadoop, without breaking that system
- An RDBMS should usually be kept as small and lean as possible, since RDBMSs do not scale well; we can keep recent operational data in the RDBMS and move the historical data to Hadoop
- But for most data analysis we need to combine the current data in the RDBMS with the historical/raw data in Hadoop. We can use Sqoop to copy data at regular intervals, say every 15 minutes, from the RDBMS to Hadoop and perform analytics on both current and historical data, making it available to business analysts, data scientists, data engineers, etc.; an incremental-import sketch follows below
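- A sketch of such a periodic copy using Sqoop's incremental import; the table, column, and timestamp are illustrative, and in practice a saved Sqoop job would track --last-value automatically:

      # Every 15 minutes, pull only the rows changed since the last run
      sqoop import \
        --connect jdbc:mysql://dbhost:3306/sales \
        --username etl_user -P \
        --table orders \
        --target-dir /data/sales/orders \
        --incremental lastmodified \
        --check-column updated_at \
        --last-value "2016-01-01 00:00:00" \
        --append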
- RDBMSs are not really scalable and are limited to SQL; in Hadoop we can scale virtually without limit and use various APIs for data processing
- A common question is why we cannot store the raw data in Hadoop itself instead of copying it over at regular intervals with Sqoop. The reason is that Hadoop is a back-end data platform, not a front-end database like an RDBMS: it does not support the transactions that many critical front-end applications need, and some applications can only work with an RDBMS, so making them work directly against Hadoop requires some effort
- Moving expired or historical data that is rarely accessed from the RDBMS to Hadoop
- Unlike archiving to tape, moving the data to Hadoop keeps it accessible, so we can still run analytics on top of it
- Infrequently accessed data is moved to Hadoop using Sqoop, keeping the RDBMS small and lean, as in the sketch below
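- A sketch of archiving cold rows to Hadoop; the table, column, and cutoff date are placeholders, and purging the archived rows from the RDBMS would be a separate step:

      # Copy orders older than the cutoff date into HDFS, where they stay queryable
      sqoop import \
        --connect jdbc:mysql://dbhost:3306/sales \
        --username etl_user -P \
        --table orders \
        --where "order_date < '2015-01-01'" \
        --target-dir /archive/sales/orders_historical \
        --num-mappers 4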
- Hadoop, being the central data lake for many organizations, requires data from various sources in order to perform data analytics for business insights
- An organization's data can be spread widely across various RDBMSs, e.g. an Oracle database for ERP, a MySQL database for web data, SQL Server for CRM data, and even legacy systems like mainframe DB2 for billing data
- Sqoop helps import all of this data into the Hadoop data lake so that analytics can be performed on top of it
- This in turn reduces the batch load on the operational databases, which also have to serve mission-critical transactions, while the data copied into Hadoop enables analytics across the different sources (see the sketch below)
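- A sketch of pulling data from several source systems into the same data lake; connection strings, credentials, and table names are placeholders, and the exact JDBC URL and driver depend on each vendor:

      # Oracle (ERP)
      sqoop import --connect jdbc:oracle:thin:@erphost:1521:ERP \
        --username erp_ro -P --table CUSTOMERS --target-dir /lake/erp/customers

      # MySQL (web)
      sqoop import --connect jdbc:mysql://webhost:3306/webapp \
        --username web_ro -P --table page_views --target-dir /lake/web/page_views

      # SQL Server (CRM)
      sqoop import --connect "jdbc:sqlserver://crmhost:1433;databaseName=crm" \
        --username crm_ro -P --table leads --target-dir /lake/crm/leads

      # DB2 (billing)
      sqoop import --connect jdbc:db2://db2host:50000/BILLING \
        --username bill_ro -P --table INVOICES --target-dir /lake/billing/invoices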
- Traditionally, data warehousing and reporting ran on an RDBMS: a big data warehouse on Oracle, MySQL, etc., where data analysts would connect their business intelligence tools like Informatica/Talend to the warehouse for reporting
- These BI tools depend on the schema model and rely on the RDBMS and SQL queries
- When the reporting workload becomes so demanding that the relational database simply cannot scale any further, and the BI reports need access to historical data for trend analysis, we need to move some of the workload to Hadoop without breaking the current BI solution, since these tools by design require an RDBMS
- Sqoop can copy the data from the RDBMS and create Hive tables on Hadoop directly, and we can then run SQL statements to query that data; the BI tools can connect to Hadoop using a connection string and perform the same activity, so BI users are abstracted from the change and get the same feel of working on an RDBMS even though they are connected to Hadoop
- Using Sqoop this way lets the BI team run part of its workload on Hadoop without breaking compatibility; a Hive-import sketch follows below
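- A sketch of importing straight into a Hive table so the existing SQL/BI layer can query it; the connection string, table, and Hive names are placeholders:

      # Import the fact table into Hive; Sqoop creates the Hive table and loads the data,
      # after which it can be queried with SQL (HiveQL) over a standard Hive connection
      sqoop import \
        --connect jdbc:oracle:thin:@dwhost:1521:DWH \
        --username dwh_ro -P \
        --table SALES_FACT \
        --hive-import \
        --hive-table analytics.sales_fact \
        --num-mappers 8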