Refreshing Of Almost Expired Records: Keeping The Cache Hot
We’re planning a series of posts to highlight some of the features we have been working on in recent releases of the Recursor. We also plan to discuss some C++ programming techniques we used in the Recursor that might be interesting. The first post in this series describes a technique that was introduced in PowerDNS Recursor 4.5: Refresh Almost Expired.
When the Recursor receives a query from a client, it answers it by first finding the name server that is authoritative for the domain in question. It does that by starting at a root name server and then walking DNS delegations to find the name servers (and their addresses) that are authoritative for the question asked. Once it knows which name servers are authoritative for the domain, it picks a specific name server address and sends the question to that name server. After the Recursor receives the answer, it passes it on to the client that asked the question.
To be able to answer fast, the Recursor caches information it receives from authoritative name servers. Information about delegations as well as answers to specific queries is stored in the record cache, which can be seen as a local in-memory copy of parts of the global DNS tree.
The Recursor also has a few other caches, for example the Packet Cache and the Negative Cache, but these are not the subject of this post.
The Record Cache
The Record Cache contains entries of the following form:
name, type, time to die, record set, ...
The entries are indexed and searchable by a name and type combination; time to die (TTD) is the time the entry will expire and record set is the set of DNS records associated with the name and type. When inserting an entry into the Record Cache the TTD is computed by taking the time to live (TTL) of the record set received from the authoritative name server and adding the current time to it. That way, it becomes easy to see if a record in the Record Cache is expired: just compare the TTD to the current time.
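The TTD computation described above can be sketched in a few lines. This is only an illustration in Python, not the Recursor's C++ code; the names `CacheEntry`, `make_entry` and `is_expired` are hypothetical, and treating an entry as expired once the TTD is less than or equal to the current time is an assumption.

```python
import time
from dataclasses import dataclass


@dataclass
class CacheEntry:
    # Hypothetical layout mirroring "name, type, time to die, record set".
    name: str
    qtype: str
    ttd: float       # absolute expiry time, in seconds since the epoch
    records: list


def make_entry(name, qtype, ttl, records, now=None):
    """Build an entry whose TTD is the current time plus the received TTL."""
    now = time.time() if now is None else now
    return CacheEntry(name, qtype, now + ttl, list(records))


def is_expired(entry, now=None):
    """An entry is expired once the current time has reached its TTD (assumption)."""
    now = time.time() if now is None else now
    return entry.ttd <= now
```

Storing an absolute time means the expiry check is a single comparison and nothing has to be counted down while the entry sits in the cache.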
With the Record Cache, query processing becomes a bit more complicated:
1. Check to see if the Record Cache contains an entry (name, type).
2. If an entry is present and not expired: return the answer using the record set.
3. If the entry is not present or is expired, continue by finding the authoritative server (using the Record Cache or the internet) as before and asking it the question.
4. If an answer is received from an authoritative server, insert it into the Record Cache and return it to the client.
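The steps above can be sketched as follows. This is a simplified Python illustration, not the Recursor's implementation: `RecordCache`, `resolve` and the `ask_authoritative` callback are hypothetical names, and the callback stands in for the whole process of finding and querying an authoritative server.

```python
import time


class RecordCache:
    """Toy record cache keyed by (name, type); values are (ttd, records)."""

    def __init__(self):
        self._entries = {}

    def get(self, name, qtype, now):
        hit = self._entries.get((name, qtype))
        if hit is None or hit[0] <= now:
            return None          # not present, or present but expired
        return hit[1]

    def put(self, name, qtype, ttl, records, now):
        self._entries[(name, qtype)] = (now + ttl, records)


def resolve(cache, name, qtype, ask_authoritative, now=None):
    """Answer from the cache when possible, otherwise fetch and store."""
    now = time.time() if now is None else now
    records = cache.get(name, qtype, now)            # steps 1 and 2
    if records is not None:
        return records
    ttl, records = ask_authoritative(name, qtype)    # step 3 (hypothetical callback)
    cache.put(name, qtype, ttl, records, now)        # step 4
    return records
```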
Clients will have a better experience for cached record sets: the Recursor will be able to answer cached queries much faster than when it has to go to the internet to contact one or more authoritative servers to get an answer.
When using a cache, it is always important to keep consistency in mind: the cache should reflect the contents of the records of the authoritative name servers on the internet. Here we have an issue: we keep a record set in the cache for a maximum period equal to the TTL. If an authoritative server changes a record’s content in between, we do not know that until we re-fetch the record set, and we won’t do that until the record expires. The general consensus is that we accept this issue, since caching DNS records makes the Recursor very fast compared to always asking authoritative name servers. This caching period is the main reason why it takes time for an updated DNS record to appear in the caches of resolvers and to be returned to clients.
Another issue is making sure the cache is effective: in an ideal world, we only want to store records that are queried again within the TTL period. Otherwise we waste resources storing data that just expires after a while without being used. Sadly we do not have a way to know beforehand if a record is going to be retrieved from the cache after we have stored it. We somewhat solve this issue by making sure that if records need to be evicted from the cache because it is too full, we pick the records using a least recently used (LRU) method in addition to having a preference to evict expired records.
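One way to picture this eviction policy is the sketch below. It is an assumption-laden illustration, not how the Recursor's cache is actually organised: the class `BoundedCache` is hypothetical, the cache size is counted in entries, and the scan for an expired entry is linear for the sake of brevity.

```python
from collections import OrderedDict


class BoundedCache:
    """Toy cache that evicts an expired entry if one exists, else the LRU entry."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()   # key -> (ttd, records), least recently used first

    def get(self, key, now):
        hit = self._entries.get(key)
        if hit is None or hit[0] <= now:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return hit[1]

    def put(self, key, ttd, records, now):
        self._entries[key] = (ttd, records)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._evict_one(now)

    def _evict_one(self, now):
        # Prefer an entry that is already expired ...
        expired = next((k for k, (ttd, _) in self._entries.items() if ttd <= now), None)
        if expired is not None:
            del self._entries[expired]
        else:
            # ... otherwise drop the least recently used one.
            self._entries.popitem(last=False)

    def keys(self):
        return list(self._entries)
```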
The Client Experience
When a client asks a “popular” question, it will often get an answer from the cache, as popular record sets tend to be present in the cache. But once in a while an unlucky client has to wait a bit, as the record in the cache is expired and the Recursor has to go out on the net to find the answer. The chart below illustrates what happens when querying a record set with a short TTL 60 times with a one second interval, starting with a recently filled cache:
Of the 60 queries done, most are answered within about 1ms, but three take significantly longer. The peaks are 20s apart, which is the TTL of the record set in question.
Wouldn’t it be nice if there was a way to avoid some clients having to wait much longer for an answer than others? What if we were able to fetch an updated record set before it expires from the record cache?
Keeping the Cache Hot
The jargon word for a cache that contains data you are looking for is a hot cache.
To try to make sure that entries in the cache are up to date, the Recursor can be configured to issue queries to refresh them in the background when it sees that a record set is popular and almost expired. The Recursor uses a rather simple way to decide if a record is popular: if a query comes in and it has an entry in the record cache that is almost expired, it is marked as popular and a task is scheduled to refresh it. The client asking the question still gets its answer quickly from the cache, and when a later client asks for the same record set, it will find it refreshed (if the scheduled task has run in the meantime).
Below we show a new graph adding the query times with the “Refresh Almost Expired” feature enabled by setting
refresh-on-ttl-perc = 10. This means that if a cache entry is queried again while less than 10% of its original TTL remains, it will be re-fetched. This feature has been available since Recursor 4.5.
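The percentage check can be written as a small predicate. The Python sketch below is illustrative only: the function name and parameters are hypothetical, and treating a percentage of 0 as "feature disabled" and using a strict "less than" comparison are assumptions.

```python
def is_almost_expired(ttd, orig_ttl, now, refresh_perc):
    """True if the entry is still valid but has less than refresh_perc
    percent of its original TTL left."""
    if refresh_perc <= 0:
        return False                      # assumption: 0 disables refreshing
    remaining = ttd - now
    return 0 < remaining < orig_ttl * refresh_perc / 100
```

For a record set with a TTL of 20 seconds and a setting of 10, the threshold is 2 seconds: a query arriving 1.5 seconds before expiry triggers a refresh, while a query arriving 3 seconds before expiry does not.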
What we see is that the yellow bars of the graph do not have the high peaks that the orange ones have. The relatively small price we have to pay is that we do slightly more queries to authoritative name servers.
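The effect can be reproduced with a toy simulation. The sketch below is not a measurement of the Recursor; it assumes a cache filled at t = 0, one query per second starting at t = 1, and a background refresh that always completes before the next query arrives. The function `simulate` is hypothetical.

```python
def simulate(ttl, n_queries, refresh_perc):
    """Return (queries that had to wait for a fetch, background refreshes)."""
    ttd = ttl                     # cache was filled at t = 0
    misses = refreshes = 0
    for now in range(1, n_queries + 1):
        if ttd <= now:
            # Entry expired: this client waits for the authoritative server.
            misses += 1
            ttd = now + ttl
            continue
        remaining = ttd - now
        if refresh_perc > 0 and remaining < ttl * refresh_perc / 100:
            # Answer from cache, refresh in the background (assumed instant).
            refreshes += 1
            ttd = now + ttl
    return misses, refreshes
```

Under these assumptions, `simulate(20, 60, 0)` gives three waiting clients and `simulate(20, 60, 10)` gives none, with three background refreshes instead. Over a longer run the refreshes happen every 19 seconds rather than every 20, which is the slight increase in queries to authoritative servers: `simulate(20, 380, 0)` performs 19 fetches, `simulate(20, 380, 10)` performs 20.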
To be able to decide when to refresh a Record Cache entry, the Record Cache has been extended to include the original TTL. That way, the Recursor can decide if a cache entry is almost expired: only a set percentage of the original TTL remains. We have chosen this simple method because it does not require the Recursor to keep statistics on which questions are popular, yet it still avoids refreshing cache entries that are not queried again within the final part of their lifetime, the window defined by that percentage of the TTL. Refreshing such entries would be a waste of time and resources.
An asynchronous task subsystem was built to implement background refreshing. This subsystem is also used for other tasks in the Recursor that do not require the Recursor to wait for them. We will likely see the task subsystem in future posts again.
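To give a rough idea of what such a subsystem does, here is a minimal Python sketch of a queue of refresh tasks. It is not the Recursor's task subsystem: the class `TaskQueue`, its methods, and the choice to ignore a task that is already queued for the same name and type are all assumptions made for illustration.

```python
from collections import deque


class TaskQueue:
    """Toy queue of refresh tasks, processed outside the query path."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()

    def push(self, name, qtype):
        """Schedule a refresh; return False if one is already queued (assumption)."""
        key = (name, qtype)
        if key in self._pending:
            return False
        self._pending.add(key)
        self._queue.append(key)
        return True

    def run_one(self, refresh):
        """Run the oldest task, if any, by calling refresh(name, qtype)."""
        if not self._queue:
            return False
        key = self._queue.popleft()
        self._pending.discard(key)
        refresh(*key)
        return True
```

The query path only calls `push`, which is cheap, so the client never waits for the refresh; something else (a worker or an idle loop, not shown here) calls `run_one` when there is time.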
Short TTL values are becoming more and more popular, as they allow for quick updates of record sets, but they decrease cache effectiveness, put more load on resolvers and authoritative name servers, and cause more clients to wait while resolvers fetch record sets. The Refresh Almost Expired functionality mostly solves the issue of clients having to wait for answers, but we would prefer domain administrators to use large TTL values, as that reduces load on the whole of the DNS infrastructure.