Do 12% of WebRTC calls really fail?

I was talking with our old friend Philipp Hancke, and we were discussing how it could be possible that 12% of WebRTC calls were failing. This number came as a surprise to us because, based on our reports, the failure rate is significantly lower for OpenTok calls, although the exact numbers depend on your specific use case.

So, we decided to grab some data and try to prove that WebRTC, at least on our platform, is doing a much better job.

WebRTC connections are based on the ICE (Interactive Connectivity Establishment) framework. Using ICE, browsers gather their IP addresses (using the STUN and TURN protocols to obtain public and relay addresses) and try to establish a connection with the remote WebRTC endpoint. They do this by sharing their addresses (ICE candidates) with each other and testing whether any combination of local and remote candidates works for that specific network configuration.
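The pairing step can be pictured with a minimal Python sketch. It is not how a browser implements ICE: real agents prioritize candidate pairs and probe them with STUN connectivity checks. The `Candidate` type and the `can_connect` callback are hypothetical stand-ins, and the addresses are documentation examples.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    kind: str      # "host", "srflx" (public address via STUN) or "relay" (via TURN)
    address: str   # "ip:port"


def first_working_pair(
    local: Iterable[Candidate],
    remote: Iterable[Candidate],
    can_connect: Callable[[Candidate, Candidate], bool],
) -> Optional[Tuple[Candidate, Candidate]]:
    """Return the first local-remote pair that passes the check, or None."""
    for pair in product(local, remote):
        if can_connect(*pair):
            return pair
    return None  # no pair works: the situation reported as an ICE failure


# Hypothetical network: only the TURN relay can reach the remote endpoint.
local = [
    Candidate("host", "192.168.1.10:50000"),
    Candidate("relay", "198.51.100.7:3478"),
]
remote = [Candidate("host", "203.0.113.20:40000")]


def relay_only(local_candidate: Candidate, remote_candidate: Candidate) -> bool:
    # Assumed stand-in for a real connectivity check.
    return local_candidate.kind == "relay"


pair = first_working_pair(local, remote, relay_only)
```

In this made-up network the host candidate is unusable, so the only working pair goes through the relay. Remove the relay candidate and the function returns `None`.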

Analyzing WebRTC Data Sets

The Data Set

First, you need a large enough number of calls to look at.  We looked into 100,000 calls taking place only in our public demo apps.  This is web-only traffic and a combination of both 1:1 and multi-party sessions.

We included only the calls that had completed the ICE establishment either with success (ICE status is ‘connected’ or ‘completed’) or with failure.  We ignored the calls where the establishment was not completed (roughly a couple of thousand) as many of these are caused by users closing the browser during the establishment.  That said, this is something that we will analyze in future posts.
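As a rough illustration of this filter, the sketch below keeps only the calls whose logged ICE connection states include a final outcome. The log layout (a call id mapped to an ordered list of states) is an assumption made for the example, not the format we actually store.

```python
from typing import Dict, List

# States that mean ICE establishment finished, one way or the other.
FINAL_STATES = {"connected", "completed", "failed"}


def establishment_finished(states: List[str]) -> bool:
    """True if the logged ICE connection states include a final outcome."""
    return any(state in FINAL_STATES for state in states)


# Hypothetical log: call id -> ICE connection states in the order reported.
calls: Dict[str, List[str]] = {
    "a": ["new", "checking", "connected"],
    "b": ["new", "checking", "failed"],
    "c": ["new", "checking"],  # e.g. the browser was closed during establishment
}

# Only calls "a" and "b" are kept for the analysis.
analyzed = {cid: states for cid, states in calls.items() if establishment_finished(states)}
```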

The Results

Success: 74,455

Failure: 1,541

Of the data set that we looked at, 2.0% of calls “failed”. This is a far more acceptable number than the 12% given in the article.

Let’s try to dig into the reasons or types for these failures.

Our definition of ICE failure is ‘does the ICE connection state ever change to failed’. Now welcome to the land of WebRTC bugs… We saw a big increase in ICE failures last December when Chrome 47 rolled out. This issue has still not been resolved, but since we know about it, we can take it into account by selecting only those calls where the ICE connection state went to failed without ever going to connected or completed.

But wait… we need to distinguish between the calls that never worked and those where the ICE connection state changed to ‘failed’ after being connected or completed.
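A minimal sketch of that distinction, assuming each call is logged as a list of ICE connection states; the category names mirror the ones used in this post, and the function name is only illustrative. It checks whether a state was ever seen and does not look at ordering.

```python
from typing import List

ESTABLISHED = {"connected", "completed"}


def classify(states: List[str]) -> str:
    """Classify one call from the ICE connection states it went through."""
    seen = set(states)
    if "failed" not in seen:
        # Never failed: a success if it was established, otherwise unfinished.
        return "success" if seen & ESTABLISHED else "incomplete"
    if seen & ESTABLISHED:
        return "disconnection"       # the call worked at some point, then failed
    return "connection failure"      # failed without ever being established
```

For example, `classify(["checking", "connected", "disconnected", "failed"])` is counted as a disconnection, while `classify(["checking", "failed"])` is a real connection failure.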

Success: 74,455

Disconnections: 1,349

Connection Failures: 192

So, only about 0.25% of OpenTok calls suffered real connectivity failures for web clients, a small fraction of the 12% quoted in the article.
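The percentages follow directly from the counts reported in this post. The short check below only redoes that arithmetic; the variable names are ours.

```python
success = 74_455
disconnections = 1_349
connection_failures = 192

failed_ever = disconnections + connection_failures  # calls that reached 'failed'
total = success + failed_ever                        # calls included in the analysis


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


print(f"calls analyzed:      {total}")
print(f"ever failed:         {percent(failed_ever, total):.2f}%")
print(f"connection failures: {percent(connection_failures, total):.2f}%")
```

The first rate is the 2.0% figure and the second is the share of calls that never connected at all.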

But still… how many of those ICE failures can be explained, for example by determining that a client was on a network that blocked access to the TURN servers? At the API level, where we gather data, this can be determined by looking at whether relay candidates were gathered.

Given that most of our sessions are SFU based and the SFU has a public address, one of the most relevant aspects is to figure out if the browser was able to gather TURN candidates or not.
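One way to check for relay candidates is to look at the `typ` field of each gathered candidate string. The sketch below assumes the candidates are available as plain SDP-style strings; it is an illustration, not the code we run in production, and the addresses are documentation examples.

```python
from typing import Iterable, Optional


def candidate_type(candidate: str) -> Optional[str]:
    """Return the value after the 'typ' token (host, srflx, prflx or relay)."""
    tokens = candidate.split()
    if "typ" in tokens:
        index = tokens.index("typ")
        if index + 1 < len(tokens):
            return tokens[index + 1]
    return None  # not a well-formed candidate string


def gathered_relay(candidates: Iterable[str]) -> bool:
    """True if at least one TURN (relay) candidate was gathered."""
    return any(candidate_type(c) == "relay" for c in candidates)


# Example candidates from a client that could not reach the TURN servers.
gathered = [
    "candidate:1 1 udp 2122260223 192.168.1.10 50000 typ host",
    "candidate:2 1 udp 1686052607 203.0.113.5 50000 typ srflx raddr 192.168.1.10 rport 50000",
]

# Example relay candidate, as a TURN allocation would produce.
relay = "candidate:3 1 udp 41885439 198.51.100.7 61000 typ relay raddr 203.0.113.5 rport 50000"
```

Here `gathered_relay(gathered)` is false, which is the pattern behind the "without TURN candidates" group; adding `relay` to the list makes it true.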

Failures without TURN candidates: 147

Failures with TURN candidates: 45

So we can tell that most of the failures (147 of 192) occurred when the browser was not able to gather TURN candidates. One common scenario is a proxy that requires authentication, which can be an issue in enterprise environments. Chrome does not fully implement this, and we suspect that fixing issue 439560 would help in some of those cases. In the case of Firefox there is an additional limitation due to the lack of TURN over TLS support, which is required to pass some enterprise proxies. It is interesting that in all of these cases WebSockets over TLS were working. This is because the TURN and WebSockets stacks use different codebases with different limitations. It is also possible, albeit unlikely, that the lack of TURN candidates is down to a bug in the browser or in our infrastructure.

There are still 45 cases (roughly 0.06%) of failure without an explanation. This could be due to incorrect logs, bugs in our SFU or corner cases that we are not aware of. While we could debug these individually, they are so uncommon that this is not a priority.

These were the numbers and analysis for some of our calls, most of them SFU based and with 2-4 participants (with some reaching 20 participants). One thing to keep in mind is that we analyzed only ICE failures; there are other potential failures not covered in this post (for example at the device, encoding or transport level) that can prevent end-to-end audio or video from working.