Top 3 methods of skipping big data's bad data using Hadoop!
Team, this time I go with the title "Top 3 methods of skipping big data's bad data using Hadoop!", which describes how to get corrupt records out of large data sets that contain data in different formats.
While doing our analysis, if the corrupt records are only a small percentage we can ignore or skip them; otherwise Hadoop retries the task, on the assumption that the failure was caused by hardware or network problems. After four retries, the task is marked as failed if it still does not succeed.
Method One: TextInputFormat. We can use Hadoop's default input format, TextInputFormat, to set the maximum expected length of a record. This is controlled by the mapred.linerecordreader.maxlength parameter, in bytes. A good value is one slightly greater than the length of the longest valid record you expect. This lets us skip oversized (likely corrupt) records without the task failing.
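As a rough illustration, here is a minimal driver-side sketch of Method One in Java. The 10 KB limit is an assumption for illustration; in newer Hadoop releases the equivalent property is named mapreduce.input.linerecordreader.line.maxlength.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MaxLengthExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assume valid records are at most ~10 KB; any longer line is
        // treated as corrupt and skipped instead of failing the task.
        conf.setInt("mapred.linerecordreader.maxlength", 10 * 1024);

        Job job = Job.getInstance(conf, "skip-long-lines");
        job.setInputFormatClass(TextInputFormat.class);
        // ... set mapper, reducer, input/output paths as usual ...
    }
}
```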
Method Two: Writing the MapReduce code to handle the corrupt records is one of the best methods in practice. On either the mapper side or the reducer side we can detect a bad record and ignore it, or we can terminate the job by throwing an exception. Within the MapReduce program we can also use counters to record the total number of bad or skipped records, which tells us how widespread the problem is and how much it may impact our outcome. Hadoop's relevant built-in counters are MAP_SKIPPED_RECORDS, REDUCE_SKIPPED_GROUPS, REDUCE_SKIPPED_RECORDS, and FAILED_SHUFFLE.
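Below is a minimal sketch of Method Two: the mapper catches parse failures, increments a custom counter, and simply skips the bad line. The tab-separated record format and the counter name are illustrative assumptions, not a fixed convention.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SkipBadRecordsMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Custom counter; its total shows up in the job's counter report.
    enum Quality { BAD_RECORDS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        try {
            int amount = Integer.parseInt(fields[1]);   // may throw on corrupt data
            context.write(new Text(fields[0]), new IntWritable(amount));
        } catch (RuntimeException e) {
            // Bad record: count it and move on instead of failing the task.
            context.getCounter(Quality.BAD_RECORDS).increment(1);
        }
    }
}
```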
Method Three: Hadoop's skipping mode. While working with vast amounts of data, bad data can easily cause application errors if failures are not handled properly. Hadoop's built-in skipping mechanism can help us pinpoint the bad data and log it for review and validation. To enable the skipping of 'n' bad records in a map and reduce job, we add SkipBadRecords.setMapperMaxSkipRecords(conf, n) and SkipBadRecords.setReducerMaxSkipGroups(conf, n) respectively to the run() method where the job configuration is set up, as sketched below. Please note that skipping mode is off by default.
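A minimal sketch of Method Three, assuming a driver built on the older org.apache.hadoop.mapred API (which is where skipping mode is supported); the class name and the limit of one skipped record/group per attempt are illustrative.

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;
import org.apache.hadoop.util.Tool;

public class SkippingModeDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), SkippingModeDriver.class);

        // Allow up to 1 bad record to be skipped per map task attempt ...
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
        // ... and up to 1 bad key group per reduce task attempt.
        SkipBadRecords.setReducerMaxSkipGroups(conf, 1);

        // (Set mapper, reducer, input/output paths as usual before submitting.)
        JobClient.runJob(conf);
        return 0;
    }
}
```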
Map/Reduce tasks report the records being processed back to the tasktracker if the skipping mode is enabled. When the task fails, the tasktracker retries the task, skipping the records that caused the failure. The skipping mode is turned on for a task only after it has failed twice. Thus, for a task consistently failing on a bad record, the tasktracker runs the following
task attempts with these outcomes:
1. Task fails.
2. Task fails.
3. Skipping mode is enabled. Task fails, but failed record is stored by the tasktracker.
4. Skipping mode is still enabled. Task succeeds by skipping the bad record that failed in the previous attempt.
Please feel free to comment or suggest, and if you know of any other method for skipping bad data in the Hadoop ecosystem, please do share it.
Thank you for your valuable time!