Howdy folks, and welcome to another clypd technical blog post. Today’s topic: bigger data (than before) and Redshift. clypd has grown a lot in the past few years, and it’s high time we shared our story of handling datasets that outgrew our ability to use RDS and Postgres for storage and analysis. We’ll start by going back a few years to our first bigger dataset, to document our starting point. No engineering process is a straight line, and the failures are as important as the successes as learning material. We’ll take a brief detour through a couple of implementations that worked but were not ideal. Finally, we’ll arrive at our Redshift infrastructure, and I’ll share some of the techniques we implemented to get the most out of it. I hope you’ll be able to learn from our mistakes as well as our accomplishments.
Our story begins back in 2017. As our business grew, the clypd platform needed to crunch through larger and larger datasets, such as Nielsen Respondent Level Data (which records when Nielsen panelists watched TV) or the MRI Fusion dataset (additional attributes about those panelists), in order to support the capabilities our customers wanted. At first, we stored these in our PostgreSQL database alongside the rest of our customer data. We already had database capabilities in house, and it seemed prudent to keep leveraging them as long as we could. The first major attempt to store a large dataset was our historical impression dataset, which captures the average audience size in any given hour, broken down by attributes such as network or demographic. We only used this in background analysis, so peak performance wasn’t essential. Nevertheless, it’s over 100M rows, which takes PostgreSQL some time to sort through and aggregate. It also increased the size of our database backups and the time it took to restore scrubbed versions of those backups onto developer sandboxes. We knew it was workable for the time being, but not a good long-term solution.
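To give a flavor of the workload, here’s a minimal sketch of the kind of background rollup involved; the table and column names are hypothetical stand-ins, not our actual schema:

```sql
-- Hypothetical rollup over an hourly impression table with 100M+ rows.
-- Table and column names are illustrative only.
SELECT network_id,
       demographic_id,
       avg(audience_size) AS avg_audience
FROM   hourly_impressions
WHERE  aired_hour >= DATE '2017-01-01'
  AND  aired_hour <  DATE '2017-04-01'
GROUP  BY network_id, demographic_id;
```

Each query like this has to scan and aggregate a sizable slice of the table, which is where Postgres started to take its time.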
With what we learned from the average audience work, we took a different approach the next time we encountered a lot of data. That came when we started posting advanced target impression estimates for our clients’ campaigns. These couldn’t use aggregated numbers, since we needed to look at the programming that aired at the exact times their advertisement spots ran. However, we did know that we were looking at relatively small windows of data at a time; we only needed to post new spots. The problem seemed like a bad fit for Postgres: because Postgres stores data on disk in a row-oriented fashion, columnar aggregations like the ones this feature needed are costly. We decided to do a little ETL work to make the raw data more readable and leave the transformed data on S3. When it came time to post, we would gather the times we needed to post for, and then scan the relevant files for those times. It worked… but not particularly well. As we rapidly discovered, we’re not always posting only the most recent spots. In addition to the inevitable errors that every system should be robust to, advanced targets get posted multiple times as new data become available. This meant we were frequently crawling through multiple raw files of data, and the process took longer and longer as campaign budgets climbed. The vast majority of the bytes retrieved from S3 and read over were irrelevant. And the longer the system stuck around, the more locked in we became to the format we had chosen for S3. We started looking for a real solution.
Unfortunately, the next time came before we found it. Nevertheless, we were determined not to repeat past mistakes. Since the attempt at average audiences, we had spent some time researching Postgres partitioning. When it came time to forecast audience reach (the number of unique viewers of an advertisement over the life of a campaign), we partitioned our Postgres database. We broke the exposure probabilities up by broadcast quarter and stored each quarter in its own partition, which came out to about 50M rows per quarter. To alleviate some of the problems with database backups, we excluded these tables from the backups and loaded them separately. However, the pain built rapidly on the developer side. The feature quickly became hard to performance-test: the production Postgres instance would have cached the relevant portions of the data, but developer sandboxes would start cold, which pushed the developer scheduler runtime past 20 minutes for the first run. Developers stopped updating their sandbox databases as frequently and started making their own workarounds. We attempted to pre-warm that data via a variety of means, but we can’t really call any of them a success. While this solution left something to be desired, it was still superior to both of the previous attempts.
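For the curious, here’s a minimal sketch of what that layout could look like with Postgres declarative partitioning. The table, columns, and calendar-aligned quarter boundaries are illustrative; our real schema and the broadcast calendar differ:

```sql
-- Illustrative only: exposure probabilities partitioned by quarter.
CREATE TABLE exposure_probabilities (
    respondent_id  bigint       NOT NULL,
    network_id     integer      NOT NULL,
    broadcast_date date         NOT NULL,
    probability    numeric(8,6) NOT NULL
) PARTITION BY RANGE (broadcast_date);

-- One partition per quarter, roughly 50M rows each.
CREATE TABLE exposure_probabilities_2018q1
    PARTITION OF exposure_probabilities
    FOR VALUES FROM ('2018-01-01') TO ('2018-04-01');

CREATE TABLE exposure_probabilities_2018q2
    PARTITION OF exposure_probabilities
    FOR VALUES FROM ('2018-04-01') TO ('2018-07-01');
```

Queries that filter on a single quarter only touch that quarter’s partition, and a quarter’s worth of data can be loaded or dropped without rewriting the whole table.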
The next feature to come around was our advanced audience planning tool (ADAPT), which crawls through the attributes attached to Nielsen panel respondents in order to build audience segments for other features to use. In addition to being a major value-add for our business, it required crunching through an unprecedented (by our standards) 2 billion records. This time, we were ready. We had spent time researching Amazon’s Redshift product, and it seemed like the next logical step in the direction we were already heading. Redshift automatically distributes data across however many compute nodes are in the cluster, a natural extension of our partitioning work on reach. Furthermore, it’s smart about how it retrieves that data, so that (if configured well) it runs parts of the query in parallel on the compute nodes themselves. We spent most of our time figuring out the infrastructure side of running another, more expensive database, so we ran with a pretty much stock configuration. This time, we were really happy with our solution from a performance perspective, especially considering that we didn’t have to invest in tuning it. Where reach took minutes, we were able to crunch through an order of magnitude more attributes in under 30 seconds in most cases.
I mentioned that we spent a lot of time on infrastructure, and it’s worth saying a little more about how that is set up. The data on our Redshift cluster changes fairly infrequently. Most changes are on a weekly cadence, some are quarterly. Furthermore, many operations are read-only. We have ETL processes to load new data, but most of the client-facing features and the vast majority of development on the clypd platform can use a read-only copy of the database. This has enabled a significant cost- and time-saving opportunity: most uses can share a Redshift instance. By default, all developers share the same Redshift instance with our staging environment. The staging environment, as a production replica, is responsible for loading new data into the Redshift cluster, and all developer sandboxes get that data for free. This has dramatically cut down on the launch time and total cost of developer sandboxes, and we’re a more nimble organization for it.
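One way to arrange that kind of sharing (the schema and user names below are made up, and not necessarily how we configured it) is a read-only group that sandbox connections belong to, while only the staging ETL user can load new data:

```sql
-- Illustrative only: developer sandboxes connect as read-only users.
CREATE GROUP sandbox_readers;
CREATE USER dev_sandbox PASSWORD DISABLE IN GROUP sandbox_readers;

GRANT USAGE ON SCHEMA panel TO GROUP sandbox_readers;
GRANT SELECT ON ALL TABLES IN SCHEMA panel TO GROUP sandbox_readers;
```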
As pleased as we were with our Redshift success, we’re never content to rest on our laurels. In addition to powering new forecasting capabilities with Redshift, we’ve been hard at work converting the previous attempts detailed above to run on our new Redshift infrastructure. ADAPT was relatively isolated from the rest of our day-to-day business, but as we started to host more and more data on Redshift, we found that we often needed it to interact with data stored in Postgres, whether that’s audiences defined by clients or the definitions of the broadcast calendar. Redshift is not a low-latency database, so we chose not to let our API server talk to Redshift directly. However, our jobs system can and does interface with both databases. To that end, we rely heavily on temporary tables to stage data in Redshift whose source of truth is Postgres. This neatly sidesteps any problems with data getting out of sync between the two systems. It’s cheap to load small tables into Redshift for these purposes, and we’ve built tooling around converting them into permanent tables for debugging purposes.

These tools combine into an even more powerful debugging aid: intermediate states. We’ve so far resisted implementing complicated queries in Redshift. Instead, we use chains of simple queries that transform data sets one step at a time. These intermediate states can be dropped on S3 via the UNLOAD command, and then installed back into Redshift via COPY. From that, we can reconstruct a long-running process and dissect the results to find the specific simple query that introduced a bug. We’ve found this capability to be a very powerful observability tool. If we don’t engineer it in from the start, we usually end up adding something similar haphazardly every time we have to debug. It’s a necessary debugging/QA tool, and we’ve found it worth paying that price upfront rather than every time.
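Putting those two ideas together, one of these jobs looks roughly like the sketch below. The table names, S3 path, and IAM role are hypothetical placeholders; the UNLOAD and COPY usage is the part that matters:

```sql
-- Stage a small Postgres-owned dataset for this job only. The jobs system reads
-- these rows out of Postgres (the source of truth) and inserts them here, so the
-- two systems can't drift out of sync.
CREATE TEMP TABLE campaign_audiences (
    campaign_id bigint,
    audience_id bigint
);

-- Drop an intermediate state of the query chain onto S3 for later inspection
-- (default pipe-delimited text on both sides)...
UNLOAD ('SELECT * FROM step3_reach_estimates')
TO 's3://example-bucket/debug/step3_reach_estimates_'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role';

-- ...and install it back into a permanent table when we need to dissect a run.
CREATE TABLE step3_reach_estimates_debug (LIKE step3_reach_estimates);
COPY step3_reach_estimates_debug
FROM 's3://example-bucket/debug/step3_reach_estimates_'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role';
```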
We’ve also become more proficient with the idiosyncrasies of Redshift. We settled on a distribution strategy keyed on dimensions (networks, in most cases) that rarely interact with each other. Our understanding of our use cases guides our choice of sort keys and other table configuration. When we develop new capabilities based on data hosted on Redshift, we usually begin by prototyping the process and drafting many of the queries we eventually intend to string together as a feature. In doing so, we not only vet that the process does what we expect, but we also gain valuable insight into how the tables we’re going to create will be used. That insight feeds back into the design process, and we create or tweak our table definitions with it in mind. Overall, the extra time spent on prototyping and design for these features has been richly rewarded with increased stability and performance.
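As a concrete (but hypothetical) example of what comes out of that design loop, a fact table distributed and sorted around networks might look something like this; the columns and keys are illustrative, not our production definitions:

```sql
-- Illustrative only: distribute on the dimension that rarely crosses itself in
-- joins (network), and sort by network plus time, since most queries filter on both.
CREATE TABLE viewing_events (
    network_id     integer   NOT NULL,
    respondent_id  bigint    NOT NULL,
    viewed_at      timestamp NOT NULL,
    minutes_viewed smallint  NOT NULL
)
DISTSTYLE KEY
DISTKEY (network_id)
SORTKEY (network_id, viewed_at);
```

With a layout like this, each network’s rows live together on a single slice, so per-network aggregations run in parallel across the cluster without shuffling data between nodes.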
These performance gains get passed on to our customers as faster responses from the platform, and to engineers, who have to intervene in the system less often and have a richer set of tools when they do. On the performance side, we saw improvements across the board:
| Feature | Average Duration Before Redshift | Current Average Duration |
| --- | --- | --- |
| Advanced Target Upload | 50 minutes | 2 minutes |
| Advanced Target Posting | 30 minutes per week processed (20 minutes for engineers) | 1 minute total |
As far as developer debugging goes, I think Marge Colberg (Principal Data Scientist) put it best:
If you’re only able to compare the end results, untangling the differences becomes a herculean task. By comparing the results starting at the first stage, you can immediately identify any divergences between the two systems you’re comparing and have the ability to identify multiple differences in the systems.
As you can see, we’re tremendously pleased with how this journey turned out. It roughly chronicles our evolution from scrappy media start-up to the market-leading media services company we are today. Along the way, we learned how to manage bigger datasets and how to build the right infrastructure around them so that we can support new and better features for our customers. We look forward to further development of our capabilities and new and exciting challenges in the realm of bigger-er data!