TL;DR by removing redundant data, we reduced our Amazon Redshift costs by over 80%.
We use Redshift to store information about our websites, our social media accounts, and related performance over time. This information is exposed via a business intelligence (BI) tool for analytics and reporting purposes.
At a certain point we realized that our Redshift cluster, composed of several powerful dc2.8xlarge instances, was very expensive. In order to reduce the cost of our database, we needed to reduce the number of instances in the cluster. Since the capacity of the database grows with the number of instances, a reduction in the amount of data stored in the database leads to a reduction in the number of instances required to store that data.
We were storing a lot of redundant data, and removing that data would make it possible to shed instances from our cluster.
Why We Have Redundant Data
Much of the data that we store in our Redshift database comes from various APIs (e.g. Facebook’s Graph API, Twitter’s REST API, and the Kinja core API). In order to keep the data up-to-date, we make intermittent requests to these APIs for new data. For instance, every hour we make a request to each of our Facebook page’s /posts feed to gather all posts published within the previous day. Instead of using UPDATE/INSERT/DELETE operations which are expensive and error-prone in Redshift, we simply add an accessed_at timestamp to each record, and append to the appropriate table via the very efficient COPY operation. Our BI tool typically exposes only the latest version of each piece of data by using the latest value of the accessed_at timestamp.
Though our database can efficiently ingest data from many sources, the append-only nature of the ingestion process means that we are storing older copies of data that are no longer useful, and the size of our database grows larger than it needs to be.
How We Removed Redundant Data
In order to remove the redundant data, we devised an operation which we call the MOP operation. The term MOP is not an acronym, the name is a nod to the VACUUM routine which is a built-in operation used to clean up tables in Redshift (we run VACUUM on our tables nightly). Here is an outline of the MOP operation:
- For any table <tablename> which includes an accessed_at timestamp column, we define a set of grouping columns or column expressions: c1, c2, ..., cn.
- We make a SELECT statement on the table which selects only the most recent data. The selected data is exported to S3 using the efficient UNLOAD operation:
SELECT DISTINCT t.*FROM <tablename> AS t INNER JOIN (SELECT c1, c2, ..., cn, MAX(accessed_at) AS last_update FROM <tablename> GROUP BY c1, c2, ..., cn) AS latest ON (t.c1 = latest.c1 AND ... AND t.cn = latest.cn)
- We create a temporary _mop_new_<tablename> table with the same structure as the original table, and use the COPY operation to copy the data from S3 into the temporary table.
- In a single transaction, we rename the original <tablename> table to _mop_old_<tablename>, and we rename the _mop_new_<tablename> table to <tablename>.
- Finally we drop the _mop_old_<tablename> table which contains the redundant data.
We schedule this job to run on all of our tables on a weekly basis. We also ensure that no data is appended to the original table while this operation is occurring, otherwise we’d lose that new data.
By removing the redundant data, and also removing some sources of data that were not being used, we were able to reduce the amount of data in the database by about 70%.
This reduction in the size of our data enabled us to remove 75% of the instances in our cluster. We reduced costs further by switching from on-demand to reserved pricing for the remaining instances. The removal of instances, and use of reserved instances resulted in a cost reduction of about 83%.