Taboola is the leading recommendation engine in the open web marketplace, serving close to 1 million requests per second at peak while maintaining a p99 response time below one second. During the recommendation flow, caches are used heavily to increase performance, avoid repeating complicated calculations, and reduce the load on external data stores. At Taboola, we build cache services on top of memcached and have one memcached cluster per data center dedicated to the recommendation flow. These memcached clusters serve more than 10 million requests per second.
In this article, we discuss a particular case where our memcached cluster network became saturated and unstable.
The problem
As Taboola continues to grow (partnerships with Yahoo, Apple News and Stocks, and many more), so does our recommendation request traffic. In turn, the memcached cluster serving the recommendation flow starts to face a skewed access pattern, as illustrated in the following graph.
Each line represents a memcached node: positive values are transmit traffic and negative values are receive traffic. In particular, the orange line is a biased node with much larger network traffic than the others, saturated to the point where clients connected to it begin to experience occasional, unpredictable packet loss.
Observability in cache distribution
Before we dive into the issue, we need to understand how caching works at Taboola. In a simplified view, clients see a single read-through cache, while underneath there are two cache layers: an in-memory cache and memcached.
- A cache access hits the in-memory cache first.
- If the in-memory cache misses, memcached is queried.
- If both the in-memory cache and memcached miss, the data is fetched from the data stores and/or any necessary calculations are performed, as sketched below.
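A minimal sketch of this read path, assuming hypothetical `InMemoryCache` and `MemcachedClient` interfaces and a loader function standing in for the data stores (these names are illustrative, not our actual framework):

```java
import java.util.Optional;
import java.util.function.Function;

// Illustrative interfaces; names are hypothetical, not Taboola's actual framework.
interface InMemoryCache<K, V>   { Optional<V> get(K key); void put(K key, V value); }
interface MemcachedClient<K, V> { Optional<V> get(K key); void set(K key, V value); }

final class ReadThroughCache<K, V> {
    private final InMemoryCache<K, V> local;
    private final MemcachedClient<K, V> memcached;
    private final Function<K, V> loader; // data store lookup and/or calculation

    ReadThroughCache(InMemoryCache<K, V> local, MemcachedClient<K, V> memcached, Function<K, V> loader) {
        this.local = local;
        this.memcached = memcached;
        this.loader = loader;
    }

    V get(K key) {
        // 1. Try the in-memory cache first.
        Optional<V> hit = local.get(key);
        if (hit.isPresent()) return hit.get();

        // 2. On a miss, fall back to the memcached cluster.
        Optional<V> remote = memcached.get(key);
        if (remote.isPresent()) {
            local.put(key, remote.get());
            return remote.get();
        }

        // 3. On a double miss, load from the data stores / recompute, then populate both layers.
        V value = loader.apply(key);
        memcached.set(key, value);
        local.put(key, value);
        return value;
    }
}
```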
In a memcached cluster, each memcached node operates independently and is not aware of the others. Cache entries are partitioned across the nodes by our clients using consistent hashing. New nodes are added from time to time, but the number of nodes is generally stable.
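The client-side node selection can be pictured roughly like the simplified hash ring below; this is an illustration of the technique, not our production client:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified consistent-hash ring: each memcached node is mapped to many points on a ring,
// and a key is routed to the first node point at or after the key's hash.
final class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();

    void addNode(String node, int virtualPoints) {
        for (int i = 0; i < virtualPoints; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    String nodeFor(String cacheKey) {
        long h = hash(cacheKey);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            // Fold the first 8 bytes of the digest into a 64-bit position on the ring.
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With enough virtual points per node, keys spread roughly evenly, and adding or removing a node only remaps a small fraction of the keys.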
Our first theory is the celebrity (hot key) problem: certain entries are hot and accessed much more frequently than others, leading to the saturation. But this is quickly ruled out, since hot entries most likely reside in the in-memory cache all the time and are served from memory without any memcached retrieval. We also confirm that there are no suspiciously high evictions in the in-memory caches.
Our next theory is a biased distribution of cache entries among the memcached nodes. An initial memory check shows some discrepancies between the memcached nodes' memory usage, although far less extreme than the network traffic. However, since our memcached cluster serves dozens of different caches, it is likely that a skewed entry distribution in one or a few caches is hidden by the coarse-grained memory view.
To find the possibly biased entry distribution, we need to be able to analyze the distribution per cache. Although memcached has tools to dump cache statistics, they give an aggregate view and have no knowledge of which entry belongs to which cache. To better understand entry distribution, we first add a new convention to our caching framework and gradually migrate all of our caches to prefix cache keys with their cache names. Second, we use memcached's lru_crawler to dump the cache keys and, together with the cache name prefix, collect the distribution of cache entries per cache. We automate this process and feed the statistics into our continuous monitoring pipeline powered by Prometheus.
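A rough sketch of that collection step is shown below. It assumes keys are prefixed as `<cacheName>:<key>` (the real separator in our framework may differ) and uses memcached's `lru_crawler metadump` command over the plain text protocol; note that keys in the dump may be URL-encoded and need decoding before splitting:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Dump key metadata from one memcached node and count entries per cache-name prefix.
public final class CacheKeyDistribution {
    public static Map<String, Long> countByCachePrefix(String host, int port) throws Exception {
        Map<String, Long> counts = new HashMap<>();
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII))) {
            out.print("lru_crawler metadump all\r\n");
            out.flush();
            String line;
            while ((line = in.readLine()) != null && !line.startsWith("END")) {
                // Each line looks like: key=<key> exp=... la=... cls=... size=...
                if (!line.startsWith("key=")) continue;
                String key = line.split(" ", 2)[0].substring("key=".length());
                // Assumes the "<cacheName>:" prefix convention; adjust to the real separator,
                // and URL-decode the key first if the dump encodes it.
                String cacheName = key.contains(":") ? key.substring(0, key.indexOf(':')) : "<no-prefix>";
                counts.merge(cacheName, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        countByCachePrefix(args[0], Integer.parseInt(args[1]))
                .forEach((cache, n) -> System.out.println(cache + " " + n));
    }
}
```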
Once the statistics are available, unfortunately none of the dozens of caches shows a skewed entry distribution. On the bright side, we have dispelled another theory and gained another useful observability signal on our memcached cluster.
Find the bad guy
Then something catches our eye: the biased node moves from one memcached node to another at a specific point in time, as the graph below illustrates, after having been stuck on one particular node for a long time. Looking deeper, we notice that the time of the change correlates with the migration that added the cache name prefix to the cache keys. While the root cause is still unknown, we now have an easy way to a) validate whether it is one or a few caches causing the problem, and b) find exactly which caches are causing it.
From here the investigation is a simple binary search. We divide the dozens of caches into two groups and perform another cache key renaming migration on each group separately, observing which group's migration moves the biased node. Starting from the group that has the effect, we repeat the process, dividing it further into two subgroups. After a few iterations, we pinpoint exactly one cache that is causing the skew.
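Conceptually, the narrowing-down step looks like the tiny helper below, where the "oracle" stands in for a real key-renaming migration observed in production over time; this is purely illustrative and assumes exactly one culprit:

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative bisection: given a list of suspect caches and an oracle that tells us whether
// migrating a given group moves the biased node, narrow the suspects down to a single culprit.
final class Bisect {
    static <T> T findCulprit(List<T> suspects, Predicate<List<T>> groupMovesBiasedNode) {
        List<T> current = suspects;
        while (current.size() > 1) {
            List<T> firstHalf = current.subList(0, current.size() / 2);
            List<T> secondHalf = current.subList(current.size() / 2, current.size());
            // Keep whichever half reproduces the effect and repeat.
            current = groupMovesBiasedNode.test(firstHalf) ? firstHalf : secondHalf;
        }
        return current.get(0);
    }
}
```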
So now we have found the bad guy, but the puzzle is still not solved. The hit rate of the problematic cache is 83%, not great but not terrible, and its eviction rate is reasonable. Why is it causing the problem, then?
The root cause
We make several attempts to address the problem, including reducing the size of the largest entries (up to hundreds of kilobytes) and some experiments to reduce cache access, but nothing helps and the saturation persists. Then another anomaly is noticed: while the problematic cache has a reasonable in-memory hit rate, its stale access ratio is above 90%.
Stale access is an optional feature at Taboola that helps with latency. The basic idea is simple: in scenarios where there is tension between latency and data freshness, we can choose to favor latency by postponing data fetching and serving slightly stale data. When stale access is enabled, as is the case for the problematic cache, the flow becomes:
- The in-memory cache now has two TTL values, as shown in the following graphic:
  - A freshness TTL that determines whether an entry is considered fresh or stale.
  - An entry TTL that determines when an entry should be flushed from the cache.
  - The invariant is freshness TTL < entry TTL.
- When an in-memory cache access returns an entry whose age exceeds the freshness TTL but is still within the entry TTL, it is flagged as a stale access and the entry is returned as usual.
- Meanwhile, a background task is started to fetch the data from memcached (or from the data sources if memcached misses) and then update the in-memory cache with the new copy, as sketched below.
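A minimal sketch of this stale-access flow, assuming a hypothetical `StaleAwareCache` with the two TTLs described above (not our actual caching framework):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Sketch of the stale-access flow: each in-memory entry carries two TTLs.
// An entry past its freshness TTL (but before its entry TTL) is still returned,
// while a background task refreshes it from memcached / the data stores.
final class StaleAwareCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final Instant writtenAt;
        Entry(V value, Instant writtenAt) { this.value = value; this.writtenAt = writtenAt; }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Duration freshnessTtl;  // after this, an entry is stale
    private final Duration entryTtl;      // after this, an entry is dropped (freshness TTL < entry TTL)
    private final Function<K, V> refreshFromMemcachedOrStore;
    private final Executor background = Executors.newSingleThreadExecutor();

    StaleAwareCache(Duration freshnessTtl, Duration entryTtl, Function<K, V> refresh) {
        this.freshnessTtl = freshnessTtl;
        this.entryTtl = entryTtl;
        this.refreshFromMemcachedOrStore = refresh;
    }

    Optional<V> get(K key) {
        Entry<V> e = entries.get(key);
        if (e == null) return Optional.empty();          // miss
        Duration age = Duration.between(e.writtenAt, Instant.now());
        if (age.compareTo(entryTtl) > 0) {               // expired: treat as a miss
            entries.remove(key);
            return Optional.empty();
        }
        if (age.compareTo(freshnessTtl) > 0) {
            // Stale access: return the old value now, refresh asynchronously.
            background.execute(() -> put(key, refreshFromMemcachedOrStore.apply(key)));
        }
        return Optional.of(e.value);
    }

    void put(K key, V value) { entries.put(key, new Entry<>(value, Instant.now())); }
}
```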
In general, we expect the stale access ratio to be low, since subsequent accesses to an entry that triggered a stale access are likely to get fresh data thanks to the background update. A high stale access ratio therefore indicates that something unexpected is happening. After some digging, it turns out that this is exactly where the bug is!
By convention, we assign the same value to both the freshness TTL and the memcached TTL. But for the problematic cache, a bug slipped in that makes the memcached TTL much larger than the freshness TTL. As a result, once memcached entries outlive the freshness TTL, the background task that refreshes stale data still succeeds, but it replaces a stale entry with another stale entry. In effect, stale entries are never truly refreshed until they pass the memcached TTL. It also explains why the in-memory hit rate is high: a stale access still counts as a hit.
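One way to picture the convention (and a guard that would have caught the mismatch at configuration time) is the illustrative check below; the class and field names are hypothetical, not our actual framework:

```java
import java.time.Duration;

// Illustrative guard for the convention described above: the memcached TTL is supposed to equal
// the in-memory freshness TTL. If memcached entries outlive the freshness TTL, the background
// refresh can read back data that is already stale and "refresh" a stale entry with another stale one.
final class CacheTtlConfig {
    final Duration freshnessTtl;
    final Duration entryTtl;
    final Duration memcachedTtl;

    CacheTtlConfig(Duration freshnessTtl, Duration entryTtl, Duration memcachedTtl) {
        if (freshnessTtl.compareTo(entryTtl) >= 0) {
            throw new IllegalArgumentException("freshness TTL must be smaller than entry TTL");
        }
        if (!memcachedTtl.equals(freshnessTtl)) {
            throw new IllegalArgumentException(
                "memcached TTL must equal freshness TTL (the mismatch behind the bug described here)");
        }
        this.freshnessTtl = freshnessTtl;
        this.entryTtl = entryTtl;
        this.memcachedTtl = memcachedTtl;
    }
}
```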
In short, stale entries are not refreshed, resulting in massive stale access and, in turn, massive memcached traffic. This essentially recreates the celebrity problem we ruled out in the first place: hot entries are not effectively served from memory, and most accesses end up hitting memcached.
With the root cause found, the fix is fairly trivial. After applying it, the network traffic of the biased node is reduced by one-fifth and the asymmetry is resolved, as shown in the following graph.
Conclusion
This exploration shows how a simple problem becomes a challenge when it is live in production with millions of accesses per second and dozens of different clients involved. We hope you enjoyed reading this troubleshooting journal with us and find value in the techniques we used to solve this particular problem.