Better statistical regression tests: Release your inner German!

Jun 13, 2010

Hi everybody,

The PowerDNS Recursor 3.2 release is holding up well for almost all users, but a few minor issues have still cropped up. One of them involved a workaround we needed for an artefact/issue/quirk/oddity/bug in the UltraDNS servers (depending on who you talk to), and it turned out to be... a faulty workaround.
The workaround looked like it should have caused a lot of problems in production, but apparently did not. The PowerDNS Recursor is very well tested before each release (by replaying billions of anonymized packets donated by large-scale Recursor users). Such testing catches large-scale problems, but small-scale problems can get lost in the noise of the internet – huge numbers of DNS queries fail not because of PowerDNS, but because the domains themselves are broken.
To fix this, and also to determine the exact impact of the failed workaround, we now have an automated test tool that tries to resolve all 1 million domains which Alexa regards as the most important. There is a strong WWW bias in their domain names, but we can still be reasonably confident that any important regression in PowerDNS will be reflected in the success rate of resolving these 1 million domains.
The testing tool we wrote, ‘dnsbulktest’, behaved as expected, and immediately uncovered bugs in our parallel packet sending/receiving infrastructure (part of the PowerDNS Authoritative Server prereleases). In addition, the amount of traffic generated blew away several firewalls, leading to network downtime. Way to go!
After those issues were addressed, the numbers from the regression tests turned out not to add up. To have any confidence in the numbers produced, it helps if the number of timeouts plus the number of received packets eventually equals the number of packets sent. Getting everything to match up took quite some time, but along the way it again fixed some bugs here and there that were unrelated to the testing tool.
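The bookkeeping rule above (timeouts plus received packets must eventually equal packets sent) can be sketched as a small consistency check. This is only an illustration in Python, not the actual ‘dnsbulktest’ code; the `BulkStats` class and its field names are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class BulkStats:
    """Hypothetical counters for one bulk test run (not dnsbulktest's own types)."""
    sent: int = 0
    received: int = 0
    timeouts: int = 0

    def outstanding(self) -> int:
        # Queries that are neither answered nor timed out yet.
        # A negative value means more outcomes than queries were counted,
        # e.g. a duplicate reply that was tallied twice.
        return self.sent - self.received - self.timeouts

    def balanced(self) -> bool:
        # The numbers "add up" only when every sent query has exactly one outcome.
        return self.outstanding() == 0


if __name__ == "__main__":
    stats = BulkStats(sent=10, received=8, timeouts=2)
    print(stats.outstanding(), stats.balanced())
```

Any non-zero `outstanding()` at the end of a run points at a counting bug somewhere, either in the tool or in the code being tested.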
And now, my “inner German” is satisfied, and all the numbers match up perfectly:
In this case, all 1 million Alexa domains were queried once with ‘www.’ prepended, and once without. Quite a number of domains return ‘No Data’ without the ‘www.’. The NXDOMAIN number is truly odd, but when ‘dnsbulktest’ is run against BIND, a similar number pops up. Apparently, quite a few of the ‘one million most popular domain names’ are unavailable after 24 hours. Makes you wonder.
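To make the categories concrete, here is a rough Python sketch of how each domain could be turned into two query names and how responses could be sorted into buckets. The function names and bucket labels are assumptions for illustration, not what ‘dnsbulktest’ does internally. The only protocol facts relied on are that RCODE 0 means no error and RCODE 3 means NXDOMAIN; treating "no error with an empty answer section" as ‘No Data’ is a simplification.

```python
from collections import Counter

NOERROR = 0   # DNS RCODE 0
NXDOMAIN = 3  # DNS RCODE 3


def query_names(domains):
    """Yield each domain once bare and once with 'www.' prepended."""
    for raw in domains:
        name = raw.strip().rstrip(".").lower()
        if not name:
            continue  # skip blank lines in the input list
        yield name
        yield "www." + name


def classify(rcode, answer_count):
    """Sort a response into a coarse bucket (labels are hypothetical)."""
    if rcode == NOERROR and answer_count > 0:
        return "ok"
    if rcode == NOERROR:
        return "nodata"      # the name exists, but not with the requested type
    if rcode == NXDOMAIN:
        return "nxdomain"    # the name does not exist at all
    return "error"           # SERVFAIL, REFUSED and everything else


if __name__ == "__main__":
    print(list(query_names(["example.com"])))
    # Made-up (rcode, answer_count) pairs, purely to show the tallying.
    tally = Counter(classify(r, n) for r, n in [(0, 1), (0, 0), (3, 0), (2, 0)])
    print(dict(tally))
```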
Next up is scripting this tool so that it runs frequently, and graphing the results, which will give us a good indication of the state of the DNS as well as of the state of the PowerDNS Recursor!
Oh, and on a final note, fixing up the workaround mentioned earlier caused a repeatable 1.6% decrease in the number of ‘errors’. So that fix has been applied, now with the feeling that it actually fixes more than a single ‘broken domain’!

About the author

Bert Hubert

Principal, PowerDNS

