Last weekend, we took our entire service offline for our largest ever database restructure. It was a sluggish process – requiring an extra 16 hours on top of the 14 we had predicted – and represents the biggest change we’ve made since creating the Memory database itself!
It’s opened up a lot of new opportunities from a development perspective, but has also helped us mature as a company, so we’d like to share a few thoughts on exactly what went down and what it means for Timely’s future.
What we did
As an automatic time tracking tool, Timely processes a ton of data every day; a number which has mushroomed since we added AI into the tool. To give you an idea, over the past six months alone, our database has increased in size by 300%. It’s a giant.
So we decided to nip it in the bud before it became unmanageable. First we tried to archive data to S3, which would reduce our database to around 1/10 of its size, but since Amazon RDS does not allow you to decrease the storage of an existing RDS instance, it was clear we needed to move and restore the entire database.
We also needed to make a few structural updates. We use the Rails framework on our backend, which creates all database IDs as “int” by default on our version. To make the database “future ready”, we migrated table primary keys and foreign keys into “bigint”. We also migrated our Redis and Memcached servers to to make the whole communication with our application server more secure, and to decrease latency.
As a result of the migration, we’ve been able to cut our Memory database down to 1/7 of its original size. We also made a few updates in the process, encoding utf8mb4 to enable Timely to support and store more unicode characters – including emojis ✌🏼.
For the technically interested, here’s a full run-down on what we did:
- Backed up existing data to S3 to decrease size of the database
- Restored the data to new database instance with 1/7 storage size
- Dropped all foreign keys in the database
- Migrated all string/text columns, table level and database level encoding to utf8mb4
- Migrated int columns to bigint
- Recreated all foreign keys
- Set up all user access and grants for the database
- Created a read replica instance
- Migrated application servers, Memcached and Redis
- Moved our table row format from antelope to barracuda which provides better compression with a few extra features, like the large index key prefix.
What it enables
This restructure will help stabilize Timely’s processing speed as we continue to add more Timely features and expand existing ones. It’s all about preparing our database for the future; being able to handle rapid increases in data creation, while still offering a seamlessly smooth user experience.
But we’ve also added the basis for supporting all languages and characters in Timely – as well as any emoji you could ever dream of 👍🏽! We want to get to a position where you can use Timely naturally in any language you want, using any system of symbols you want. And now we’re a little closer to achieving that.
Why it took so long
Transferring such a huge amount of data was never going to be a walk in the park, but we also hit an unforeseen issue during the migration. These two activities required more time than we’d put aside:
- Restoring the database to a much smaller storage space
- Migrating database IDs to bigint and encoding utf8mb4
The main issue was the slow network speed over the Amazon network. While restoring our database to the smaller storage space, our input/output operations per second (IOPS) was extremely slow (3000 IOPS), and we encountered limits on IOPS speed when migrating int and text columns.
The technical explanation:
We wanted our database to enable Multi AZ, but it requires the data to be written to all availability zones within the region being used for failsafe. When we restored our database, the Multi AZ was enabled and, due the immense size of the data being restored, we exhausted all our IOPS at a rate which was higher than the replenishment rate. Writing data to all the availability zones also slowed down the transfer, taking almost 17 hours to complete.
Learnings for other developers:
Switch off Multi-AZ when restoring the mysql dump, and then enable it once the restore is completed.
- Pros: the restore will be faster as the data writes will happen to one availability zone and not all the zones
- Cons: if a failover happens in that zone then you would have to start the restore again from the beginning.
- The whole migration process speed (including the restore) is directly proportional to the storage size.
- IOPS can be a bottleneck to speeds even if you have a good bandwidth available between the database instance and the EBS storage.
The most important bit
On a personal note, this migration represents something more important for us at Timely: that we need to get a whole lot better at communicating disruptive maintenance work with you.
We are sincerely sorry that so many of you were left confused and shut out over the weekend. We know you rely on us and that your work takes a massive hit whenever we limit our service – especially when you aren’t sure when you’ll be able to access your data again.
From now on, we’re going to post thorough background articles like this ahead of any scheduled maintenance work, so you have a clear idea of what we’re doing, why we’re doing it and when we’ll be back on our feet again. We’re also going to get a lot better at just being available when something like this happen, so your questions and concerns aren’t left hanging in the ether.
Thank you so much for being completely straight with us this weekend. We’re still figuring out how to communicate changes like this with you, and your honest feedback is the best thing we have to get better at it.
If there’s anything you think we can do to support you through future maintenance work, please let us know at [email protected]
In the meantime, enjoy your data! Keep your Memory Tracker running, open hundreds of tabs in your browser, pull as many reports you can think of, log your time in hieroglyphs, add as many people to your account… safe in the knowledge Timely can take it 💪🏽!