monthly offsite backup to Amazon glacier #3

Open
jdimatteo opened this issue Oct 7, 2013 · 8 comments

@jdimatteo (Member Author)

It would probably make sense to plan on transferring the data to the Amazon cloud in such a way that it is available both for backups (e.g. in Amazon Glacier) and also for computation (e.g. Amazon Elastic MapReduce).
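
One way this split could look (just a sketch; the bucket name, prefix, and the 30-day window are placeholders, not anything we've decided): keep the data as plain S3 objects so Elastic MapReduce can read them directly, and attach a lifecycle rule that archives older objects to Glacier.

# Sketch: data stays in S3 (readable by EMR); objects older than 30 days
# in the given prefix are transitioned to Glacier for cheap archival.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-old-data-to-glacier",
      "Filter": { "Prefix": "grail/" },
      "Status": "Enabled",
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ]
    }
  ]
}
EOF
# Apply the rule to the (placeholder) bucket
aws s3api put-bucket-lifecycle-configuration \
    --bucket bradnerlab_backup \
    --lifecycle-configuration file://lifecycle.json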

@charlesylin (Member)

Agreed. I will get the ball rolling.

-Charles

@jdimatteo (Member Author)

We might want to consider duply (http://duply.net), which is supposed to ease duplicity use.

This might be a useful overview: http://blog.phusion.nl/2013/11/11/duplicity-s3-easy-cheap-encrypted-automated-full-disk-backups-for-your-servers/

After we get duplicity working and see how well it works, we might want to consider making that our sole backup solution so we don't have to maintain both.
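
If we do try duply, the workflow from that overview boils down to roughly this (sketch only; the profile name grail_backup is a placeholder):

# Create a new profile; this writes ~/.duply/grail_backup/conf for us to edit
duply grail_backup create
# After filling in GPG_KEY, TARGET, TARGET_USER and TARGET_PASS in that conf,
# run the backup (duply does a full run first, then incrementals):
duply grail_backup backup
# List what duplicity has stored at the target so far
duply grail_backup status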

@charlesylin (Member)

This sounds like an awesome idea.

Do you want me to set up an S3 bucket?

-Charles

@jdimatteo (Member Author)

@bradnerComputation: would you like to create a bradnerlab_backup bucket?

I'm testing out duplicity now. I just installed duplicity/duply, and I think that was successful, but it seems like the unrelated libpam-systemd package is in a broken state. Have you seen this error on TOD before?

jdm@tod:~$ sudo apt-get install duplicity duply python-boto
...
Setting up libpam-systemd:amd64 (204-5ubuntu20.6) ...
Can't locate Debconf/Client/ConfModule.pm in @INC (you may need to install the Debconf::Client::ConfModule module) (@INC contains: /usr/local/lib/perl5/site_perl/5.18.1/x86_64-linux /usr/local/lib/perl5/site_perl/5.18.1 /usr/local/lib/perl5/5.18.1/x86_64-linux /usr/local/lib/perl5/5.18.1 .) at /usr/sbin/pam-auth-update line 28.
BEGIN failed--compilation aborted at /usr/sbin/pam-auth-update line 28.
dpkg: error processing package libpam-systemd:amd64 (--configure):
 subprocess installed post-installation script returned error exit status 2
Setting up python-lockfile (1:0.8-2ubuntu2) ...
Setting up duplicity (0.6.23-1ubuntu4.1) ...
Setting up duply (1.5.10-1) ...
Setting up python-boto (2.20.1-2ubuntu2) ...
Errors were encountered while processing:
 libpam-systemd:amd64
E: Sub-process /usr/bin/dpkg returned an error code (1)
jdm@tod:~$

I saw this related thread. I guess we can just ignore it for now, since it doesn't seem to be fixed in Ubuntu Trusty yet.
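
For what it's worth, the packages we actually need do appear to have configured successfully; a quick sanity check (not a fix for libpam-systemd, just a verification):

# Confirm the packages we care about are installed and configured
dpkg -l duplicity duply python-boto
# duplicity should print its version if the install is functional
duplicity --version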

@jdimatteo (Member Author)

@bradnerComputation I finally got duplicity running with a small sample.

I don't think a full backup to S3 of something as big as tod:/grail is feasible. I just measured tod's upload bandwidth to be about 30 megabits/second. Grail is about 12TB, and at that rate it would take about 37 days of uninterrupted upload. Your IT department might not appreciate you using all this upload bandwidth for 37 days either.
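
For reference, the 37 days comes straight from the size and the measured upload rate:

# 12 TB expressed in bits, divided by a 30 megabit/s upload, converted to days
echo "12 * 10^12 * 8 / (30 * 10^6) / 86400" | bc -l
# => ~37 days of uninterrupted upload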

We might want to consider instead copying /grail to 6 or 7 internal 2TB disks and mailing them to Amazon via their import program (http://calculator.s3.amazonaws.com/index.html?s=importexport). You could probably buy the drives for about $500, and I think the Amazon import cost would be about $200, so around $700 total (and they would return the drives so you could use them for something else). This would let us get the initial backup done, and then we could do incremental backups over TOD's internet connection. This seems like a significant hassle -- are we sure Dana-Farber IT doesn't have some other offsite solution (e.g. maybe an internal network to storage in another building that could effectively count as offsite)? If going forward we anticipate processing bams on EC2, then maybe the hassle is worth it.
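
The rough arithmetic behind the drive option, using the estimates above:

# ~12 TB of data onto 2 TB drives: 6 drives, or 7 with one spare for overhead
echo "12 / 2" | bc
# One-time cost estimate: ~$500 for drives + ~$200 Amazon import fees
echo "500 + 200" | bc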

Please let me know your thoughts. I think last time we discussed this you suggested just trying the backup over the internet and seeing how it goes -- let me know if you'd like me to just start the ~37-day transfer.


I successfully followed the instructions at http://blog.phusion.nl/2013/11/11/duplicity-s3-easy-cheap-encrypted-automated-full-disk-backups-for-your-servers/ after figuring out the following:

  1. The TARGET_USER and TARGET_PASS correspond to my jdimatteo IAM "Access Key ID" and "Secret Access Key"
  2. The target should be specified in the following format (see the conf sketch below):
TARGET='s3://s3.amazonaws.com/bradnerlab_private/jdimatteo/gunk/backup-test/'
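
Putting those two points together, the relevant part of the duply profile conf ends up looking roughly like this (the GPG key, passphrase, and SOURCE path are illustrative placeholders; only the TARGET format and the credential mapping come from the notes above):

# ~/.duply/<profile>/conf -- excerpt
GPG_KEY='_KEY_ID_'                    # key used to encrypt the backup volumes
GPG_PW='_GPG_PASSPHRASE_'
TARGET='s3://s3.amazonaws.com/bradnerlab_private/jdimatteo/gunk/backup-test/'
TARGET_USER='_IAM_ACCESS_KEY_ID_'     # per point 1 above
TARGET_PASS='_IAM_SECRET_ACCESS_KEY_'
SOURCE='/grail'                       # what to back up (placeholder path)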

@charlesylin (Member)

I think I can get 40TB of storage from them. Would that be sufficient?

-Charles

@jdimatteo (Member Author)

40TB sounds good. You're talking about 40TB of storage in another building at Dana-Farber, right? Have you tested the upload bandwidth to that storage?
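
One quick way to measure the upload rate to that storage once there's a host on the other side (hostname is a placeholder; this just streams 1 GB of zeros over SSH and lets dd report the throughput):

# Stream 1 GB of zeros to the remote host; dd prints the effective rate at the end
dd if=/dev/zero bs=1M count=1024 | ssh backup-host 'cat > /dev/null'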
