5 minutes of BGP instability after leap second (posted 2015-07-06)
This July 30th, at 23:59:60, a leap second was added to Coordinated Universal Time (UTC). Dyn Research posted the following graph on Twitter that shows there was significant BGP update instability for five minutes after the leap second occurred:
Unfortunately, it's not clear why this happened. However, leap seconds have triggered all kinds of mishaps in the past. They're basically miniature Y2K problems. Time and time again, software engineers show that they can't be trusted to take corner cases into account properly.
This does remind me of a situation about a decade ago, where I had a customer that experienced BGP instability every night at the same time. They used Quagga running on Linux machines. We couldn't figure out what the problem was, until we realized that at that very moment, the ntpdate command was run from the cron. ntpdate synchronizes the system clock with an NTP server. As the machine in question had a very poor system clock, this meant that the system's time was adjusted a lot every night, I think a minute or more, but definitely more than 30 seconds.
Which meant that if Quagga had gotten a BGP keepalive message 8 seconds earlier, it now thought that was 38 seconds ago. If BGP is configured with a hold time of 30 seconds, this means that Quagga now thinks the other side has been quiet for longer than the hold time and it'll tear down the BGP session. This is what happened every night for a bunch of BGP sessions. We solved this by running the NTP daemon continuously, so there was never a big adjustment in system time. (Alternatively, just letting the time drift would also have worked.)
The minimum BGP hold time is 3 seconds, so adjusting for an (improperly handled) leap second shouldn't be able to make BGP think the hold time for a session is expired. However, there could be bug somewhere else that impacted BGP.
I'm not sure whether these kinds of issues are a good argument in favor of abandoning leap seconds, as the bugs won't go away, they'll just show up at a less predictable time. But I don't like the current leap second practice, as they're unpredictable, and you can't calculate the time difference in seconds between two dates without taking the entire list of leap seconds into account. I think it would be better to save the leap seconds up and apply them all at the end of a century.