If you're developing a web application you probably have multiple environments that closely resemble your production environment. Recently I had one of those requirements that seems simple at first glance and turns out to be much more complicated once you get down to the nitty-gritty: a customer asked us for a simple way to periodically create a clone of their production RDS instance and restore it to their staging environment, replacing whatever is there.
We decided to use the Python boto3 library to wrap all of the following in a script.
Best practices make things harder
Now, of course we follow most (if not all) best practices when building customer environments. This means that the RDS instances are encrypted and that the production environment runs in an AWS account completely separate from the one that hosts the staging environment. Unfortunately, this complicates things quite a bit. The normal workflow the customer used in their old on-premises environment was:
- Back up production database
- Drop staging database
- Restore backup from step 1 to staging
However, now the process is more involved. We need to split our activities between the source AWS account and the target AWS account.
In the source AWS account:
- Make a snapshot of the production instance. The production instance is encrypted with the default KMS key that RDS uses when you turn on encryption at rest. When you create a snapshot of this instance, you can't choose another KMS key directly, so you need to
- Create a new KMS customer managed key and share it with the AWS account that we're going to share the snapshot with
- Create a copy of the snapshot and re-encrypt it with the key we created in the previous step
- Share the snapshot with the target AWS account
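The source-account steps above could be sketched roughly like this, assuming boto3 RDS and KMS clients for the source account. The account IDs, instance name, and snapshot name are placeholders, and the KMS key policy is a deliberately minimal example that you would want to tighten before real use:

```python
import json


def share_encrypted_snapshot(rds, kms, source_account_id, target_account_id,
                             instance_id, snapshot_id):
    """Snapshot the instance, re-encrypt a copy with a shareable customer
    managed key, and share that copy with the target account."""
    # 1. Snapshot the production instance (still encrypted with the
    #    default RDS key at this point).
    rds.create_db_snapshot(DBInstanceIdentifier=instance_id,
                           DBSnapshotIdentifier=snapshot_id)
    rds.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier=snapshot_id)

    # 2. Create a customer managed key whose policy lets the target
    #    account use it (minimal example policy).
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {  # keep full control in the source account
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{source_account_id}:root"},
                "Action": "kms:*",
                "Resource": "*",
            },
            {  # let the target account use the key for decryption
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{target_account_id}:root"},
                "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:CreateGrant",
                           "kms:ReEncrypt*", "kms:GenerateDataKey*"],
                "Resource": "*",
            },
        ],
    }
    key_id = kms.create_key(Policy=json.dumps(policy))["KeyMetadata"]["KeyId"]

    # 3. Copy the snapshot, re-encrypting it with the shareable key.
    shared_id = f"{snapshot_id}-shared"
    rds.copy_db_snapshot(SourceDBSnapshotIdentifier=snapshot_id,
                         TargetDBSnapshotIdentifier=shared_id,
                         KmsKeyId=key_id)
    rds.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier=shared_id)

    # 4. Share the re-encrypted copy with the target account.
    rds.modify_db_snapshot_attribute(DBSnapshotIdentifier=shared_id,
                                     AttributeName="restore",
                                     ValuesToAdd=[target_account_id])
    return shared_id
```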
In the target AWS account:
- Make a copy of the shared encrypted snapshot in this account. Even though both the source KMS key and the source (encrypted) snapshot are shared with the target account, we can't restore an RDS instance directly from a shared encrypted snapshot owned by another AWS account.
- Remove the currently running RDS instance (without making a final snapshot as we simply don't care)
- Restore our copied snapshot into a running RDS instance
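A rough sketch of the target-account side, again with hypothetical identifiers. Note that the shared snapshot has to be referenced by its full ARN, and that the local copy has to be encrypted with a KMS key owned by the target account:

```python
def restore_into_staging(rds, shared_snapshot_arn, local_kms_key_id,
                         instance_id, subnet_group,
                         snapshot_id="staging-restore"):
    """Copy the shared snapshot into this account, drop the current
    staging instance, and restore the copy in its place."""
    # 1. Copy the shared snapshot locally, re-encrypting it with a key
    #    that lives in the target account.
    rds.copy_db_snapshot(SourceDBSnapshotIdentifier=shared_snapshot_arn,
                         TargetDBSnapshotIdentifier=snapshot_id,
                         KmsKeyId=local_kms_key_id)
    rds.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier=snapshot_id)

    # 2. Drop the current staging instance, skipping the final snapshot.
    rds.delete_db_instance(DBInstanceIdentifier=instance_id,
                           SkipFinalSnapshot=True)
    rds.get_waiter("db_instance_deleted").wait(
        DBInstanceIdentifier=instance_id)

    # 3. Restore the local copy; the DB subnet group pins down the VPC.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=instance_id,
        DBSnapshotIdentifier=snapshot_id,
        DBSubnetGroupName=subnet_group)
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier=instance_id)
```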
Now, along the way there's a number of gotchas:
- The boto3 library is just a wrapper around the AWS API. Many of the RDS API calls return instantly, making you think the action has been executed. However, making a snapshot of a larger instance is obviously a task that might take a while. In addition to making sure the API call for creating a snapshot executes successfully, you then need to monitor the snapshot's status. Boto3 implements so-called `waiters` for this, but they don't let you print a status update or some kind of progress while the waiter is waiting. Watching a script execute with no visible activity for more than about a minute is (imho) bad UX, so we implemented a simple loop that polls for the status every x seconds
- We use terraform to deploy our environments, and part of that is creating a VPC for resources to run in. That means you need to be able to specify which VPC in the target account the RDS instance is created in. Unfortunately, this is not directly possible; you need to specify something called a DB subnet group when creating the instance. The DB subnet group essentially identifies the subnets an RDS instance lives in, which implicitly also determines the VPC.
- To make sure the instance is only reachable from the places it needs to be accessed from, we need to specify the security groups for the instance. Somehow it is not possible to do this when restoring an instance from a snapshot, so we need to restore the instance first and then modify it to replace the security groups
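The polling loop and the post-restore security group swap could look something like the following. The progress loop is a stand-in for the `db_snapshot_available` waiter (`PercentProgress` is part of the `describe_db_snapshots` response), and the function and parameter names are our own:

```python
import time


def wait_with_progress(rds, snapshot_id, poll_seconds=30):
    """Like the 'db_snapshot_available' waiter, but prints progress so the
    script shows visible activity while a long snapshot is in flight."""
    while True:
        snap = rds.describe_db_snapshots(
            DBSnapshotIdentifier=snapshot_id)["DBSnapshots"][0]
        print(f"{snapshot_id}: {snap['Status']} "
              f"({snap.get('PercentProgress', 0)}% complete)")
        if snap["Status"] == "available":
            return
        time.sleep(poll_seconds)


def replace_security_groups(rds, instance_id, security_group_ids):
    """Security groups can't be set while restoring from a snapshot, so
    swap them in with a modify call afterwards."""
    rds.modify_db_instance(DBInstanceIdentifier=instance_id,
                           VpcSecurityGroupIds=security_group_ids,
                           ApplyImmediately=True)
```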
After quite a bit of testing we came up with the following Python script that takes all of the above into account: