plus/minus epsilon
Certificate Revocation
23 Oct 2019
There are two standard protocols for revoking certificates on the Internet: CRLs and OCSP. Neither of them works or is even widely implemented, which can make revocation a difficult task. Chrome and Firefox use proprietary mechanisms instead: Chrome's is called CRLSets and Firefox's is OneCRL, though Firefox is also currently experimenting with CRLite.
You may notice that both of the proprietary mechanisms are based on CRLs: they essentially just distribute the list of recently revoked certificates, with some filtering. However, this pattern has obvious scaling challenges when the number of revoked certificates is large. (And in fact, the size of standard CRLs is one of the reasons they're no longer used.) After Heartbleed was discovered, Cloudflare needed to revoke and reissue hundreds of thousands of certificates. Rather than add that much bloat to CRLSets or risk some of the revocations being omitted, Chrome elected to modify its source code and ship an update that would look for any certificate for "cloudflaressl.com" issued before the mitigation and reject it.
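To make the shape of that kind of hard-coded rule concrete, here is a minimal Python sketch. It is not Chrome's actual code: the `Cert` type, the function names, and especially the cutoff date are placeholders invented for illustration, since all we know from the above is "issued before the mitigation".

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List

# ASSUMPTION: placeholder cutoff for illustration only; the real
# mitigation date used by Chrome is not given in this post.
CUTOFF = datetime(2014, 4, 8, tzinfo=timezone.utc)
BLOCKED_SUFFIX = "cloudflaressl.com"


@dataclass
class Cert:
    """Hypothetical, stripped-down view of a parsed certificate."""
    dns_names: List[str]
    not_before: datetime  # start of the validity period


def matches_suffix(name: str, suffix: str) -> bool:
    """True if name is the suffix itself or a subdomain of it."""
    name = name.lower().rstrip(".")
    return name == suffix or name.endswith("." + suffix)


def is_blocked(cert: Cert) -> bool:
    """Reject certificates for the blocked domain issued before the cutoff."""
    if cert.not_before >= CUTOFF:
        return False
    return any(matches_suffix(n, BLOCKED_SUFFIX) for n in cert.dns_names)


old = Cert(["ssl123.cloudflaressl.com"], datetime(2014, 1, 1, tzinfo=timezone.utc))
new = Cert(["ssl123.cloudflaressl.com"], datetime(2014, 5, 1, tzinfo=timezone.utc))
print(is_blocked(old), is_blocked(new))
```

The point of the sketch is that one rule covers an arbitrarily large batch of certificates, which is exactly what a list-based mechanism cannot do cheaply.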
How do we get out of this? One possible option is to make certificate lifetimes so short that revocation is a non-issue. But this puts an unreasonable burden on the Certificate Transparency ecosystem, where CT log operators host public archives of every certificate issued for the past few years. These archives allow site operators to look for mis-issued certificates for their domain name, and if they were 100x larger, almost nobody would be able to store or search them in their entirety.
A much more creative idea is something called OCSP Must-Staple. OCSP Stapling is already a well-known, optional optimization where a server proactively sends proof that its certificate isn't revoked in the TLS handshake, meaning the client doesn't have to do any out-of-band validation. OCSP Must-Staple is exactly that, except the certificate has an extension telling browsers to refuse to accept it without a valid staple.
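The difference between the two modes can be sketched as a small client-side decision function. This is a simplification under stated assumptions: the `Staple` type and `accept_certificate` are hypothetical names, signature and freshness checks are reduced to booleans, and the no-staple fallback is modelled as a plain "accept" (soft-fail). In real certificates the Must-Staple flag is carried in the TLS Feature extension defined by RFC 7633; here it is just a boolean.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Staple:
    """Hypothetical, pre-validated summary of a stapled OCSP response."""
    status: str            # "good", "revoked", or "unknown"
    signature_valid: bool  # response is correctly signed for this cert
    fresh: bool            # current time falls within the validity window


def accept_certificate(must_staple: bool, staple: Optional[Staple]) -> bool:
    """Decide whether to accept a certificate given its (optional) staple."""
    usable = staple is not None and staple.signature_valid and staple.fresh
    if usable and staple.status == "good":
        return True
    if usable and staple.status == "revoked":
        return False
    # Missing, stale, badly signed, or "unknown" staple:
    # Must-Staple turns this into a hard failure; without it we
    # model the client as soft-failing (ASSUMPTION for illustration).
    return not must_staple


good = Staple(status="good", signature_valid=True, fresh=True)
print(accept_certificate(True, good))   # valid staple: accepted
print(accept_certificate(True, None))   # Must-Staple, no staple: refused
print(accept_certificate(False, None))  # plain stapling, no staple: soft-fail
```

The only branch that changes between plain stapling and Must-Staple is the last one, which is why the reliability of staple delivery suddenly matters so much.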
Must-Staple has lots of little problems that need to be ironed out before it can be deployed, though. For example, there just aren't a lot of good OCSP stapling implementations in web servers, and the OCSP endpoints run by CAs can be unreliable. Historically, neither of these things has ever needed to be reliable, so they're not. And because of the way staples are generated by CAs, when things go wrong, you may only have two or three days to fix it before it takes your website down. Contrast this with the leisurely, scheduled, manual renewal of a 2-year certificate that's still somehow the source of countless outages.
Must-Staple also creates its own fair share of difficult scaling challenges. Cloudflare previously offered reliable OCSP stapling, but no longer does because our internal systems don't have the throughput to support it. If you take a system that's good at managing lots and lots of 1-year certificates and you want to also fetch OCSP staples for all of them, you have to scale that system by a factor of at least 50x (= 1 year per certificate / 7 days per staple). Ideally, though, you would scale it by 100-200x, which realistically requires rebuilding the system in a completely different way.
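The arithmetic behind those factors is easy to check. The sketch below uses a hypothetical helper, `fetch_multiplier`, and assumes that the 100-200x range corresponds to refreshing each staple every 2 to 3.5 days rather than waiting the full 7; that reading is an assumption, not something stated above.

```python
CERT_LIFETIME_DAYS = 365  # one issuance per certificate per year


def fetch_multiplier(refresh_interval_days: float,
                     cert_lifetime_days: float = CERT_LIFETIME_DAYS) -> float:
    """Staple fetches needed per certificate issuance."""
    return cert_lifetime_days / refresh_interval_days


# Fetch a new staple only when the old 7-day one expires: about 52x.
print(round(fetch_multiplier(7)))

# ASSUMPTION: refresh early (every 2 to 3.5 days) to leave room for
# retries when a CA's OCSP endpoint is down: roughly 100-180x.
print(round(fetch_multiplier(3.5)), round(fetch_multiplier(2)))
```

Refreshing early is what buys time to recover from a failed fetch, so the larger multiplier is the price of not taking a site down when an OCSP responder has a bad day.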
Fig. Flowchart explaining the process of protocol development on the Internet.