Archiving your HDFS data in AWS Cloud for disaster recovery
To take a backup of our cluster's HDFS data for disaster recovery, we are going to use the Glacier storage service provided by AWS.
About Glacier Storage
Glacier is designed to address the shortcomings of traditional archive solutions such as tape and disk archiving, neither of which is completely satisfactory. It leverages the AWS infrastructure to provide archival storage that avoids the drawbacks of both:
- It’s inexpensive: Glacier costs start at less than $0.02 per gigabyte of archival storage. That’s significantly less expensive than disk archives, and even cheaper than tape, the previous low-cost archive option.
- It’s durable: Glacier uses the S3 infrastructure, so it offers the same 99.999999999 percent durability as the S3 service. That is far more reliable than previous archive solutions.
- It’s convenient: You simply send and retrieve archive files over the Internet, making it easy to extend your current backup solution to Glacier. Many of today’s newer commercial backup solutions provide deduplication functionality, and those products are likely to offer an archive-to-Glacier option.
- It’s highly scalable: A single archive can be as large as 40 TB, which should be big enough for anyone.
- It’s secure: Data is transmitted to and from Glacier over SSL, and the archives themselves are encrypted while in storage.
- It’s fast: Data can be retrieved from Glacier in as little as five hours, significantly faster than tape archive solutions, which require a trip out to the archive storage facility.
So how do you send huge amounts of data over the Internet? AWS has a couple of solutions to this issue:
- AWS Import/Export is a service that lets you ship Amazon physical disk drives with your data on them.
- AWS Direct Connect is a service offered by Amazon in partnership with network service providers that establishes a high-bandwidth connection between their facilities (or, indeed, your own data center) and AWS.
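To see why shipping drives or a dedicated link matters, a back-of-the-envelope calculation shows how long a large upload takes at different bandwidths. The numbers below (50 TB, 100 Mbps, 80 percent link efficiency) are illustrative assumptions, not figures from the text:

```python
# Rough estimate of upload time for a large archive (illustrative numbers).

def upload_days(data_tb: float, bandwidth_mbps: float, efficiency: float = 0.8) -> float:
    """Days to upload `data_tb` terabytes at `bandwidth_mbps` megabits/second,
    assuming only `efficiency` of the link's nominal bandwidth is usable."""
    bits = data_tb * 1e12 * 8                          # terabytes -> bits
    seconds = bits / (bandwidth_mbps * 1e6 * efficiency)
    return seconds / 86400                             # seconds -> days

# 50 TB over a typical 100 Mbps Internet line vs. a 10 Gbps dedicated link:
print(round(upload_days(50, 100), 1))       # ~58 days
print(round(upload_days(50, 10_000), 1))    # well under a day
```

At ordinary Internet speeds the transfer takes weeks, which is exactly the gap Import/Export and Direct Connect are meant to close.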
Amazon S3 allows users to manage object lifecycles in order to lower storage costs: objects that are only needed for a certain period can be removed automatically, and log files kept for backup or auditing can be archived. For example, you can configure a lifecycle policy to delete objects after a week when they are intermediate data used to create reports and are not needed afterwards. Alternatively, you can archive objects to Amazon Glacier after a month when they are system log files that need not be examined immediately but may be needed for auditing later.
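The two policies just described can be expressed as an S3 lifecycle configuration. Here is a minimal sketch in Python; the bucket prefixes (`reports/`, `logs/`) are hypothetical names chosen for illustration:

```python
import json

# Lifecycle configuration matching the two examples above:
# delete report data after 7 days, archive log files to Glacier after 30 days.
lifecycle = {
    "Rules": [
        {
            "ID": "expire-report-data",
            "Filter": {"Prefix": "reports/"},   # hypothetical prefix
            "Status": "Enabled",
            "Expiration": {"Days": 7},          # delete after one week
        },
        {
            "ID": "archive-logs-to-glacier",
            "Filter": {"Prefix": "logs/"},      # hypothetical prefix
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        },
    ]
}

print(json.dumps(lifecycle, indent=2))
```

With boto3 installed, this dictionary could be applied to a bucket via `put_bucket_lifecycle_configuration` (bucket name hypothetical): `s3.put_bucket_lifecycle_configuration(Bucket="my-backup-bucket", LifecycleConfiguration=lifecycle)`.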
When the archived data is needed again, it takes around 3 to 5 hours to restore it from Glacier back to S3.
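Because archived objects are not immediately readable, a restore must be requested (for example via boto3's `restore_object`) and then the object's `Restore` header polled until the retrieval completes. Below is a small helper, sketched against plain dictionaries so it runs standalone; the response values mimic the `ongoing-request` header format S3 returns:

```python
# Objects archived to Glacier must be restored before they can be read.
# After a restore request, HEAD-object responses carry a "Restore" header
# whose ongoing-request flag flips to "false" once the copy is readable.

def restore_complete(head_response: dict) -> bool:
    """True once a Glacier restore has finished for this object."""
    restore = head_response.get("Restore", "")
    return 'ongoing-request="false"' in restore

# Simulated HEAD responses in the shape S3 uses:
in_progress = {"Restore": 'ongoing-request="true"'}
done = {"Restore": 'ongoing-request="false", expiry-date="Fri, 23 Dec 2012 00:00:00 GMT"'}

print(restore_complete(in_progress))  # False
print(restore_complete(done))         # True
```

In practice you would poll like this every few minutes, since the 3 to 5 hour restore window means the data will not appear right away.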