Early December saw the roll-out of Chrome 47. When doing anything with WebRTC, this is always an interesting time. A release may bring new features or break things, such as the removal of getUserMedia for insecure origins.
Our metrics clearly track such roll-outs, as seen below:
The number of calls using Chrome 46 (green, released in mid-October) dropped significantly within a few days while Chrome 47 (purple) was ramping up. We did not hear of many issues with the security policy change (getUserMedia is no longer allowed on insecure origins), which we had communicated to our partner ecosystem.
However, we noticed a strong increase in the number of ICE failures in Chrome 47 (orange) starting on December 1st:
ICE failures happen when two browsers (or the browser and the media server) are unable to establish a connection using the ICE protocol, which provides the lowest level of WebRTC connectivity. Typically this happens when a firewall is blocking the traffic. See this TechTok talk for more information.
When we started looking into the issue, we noticed something interesting. Failure rates were up for both peer-to-peer sessions (established directly between browsers) and sessions routed via the OpenTok Media Router (SFU). To make things worse, an SDK update rolled out the same week. So what was causing this problem: a platform update, a browser update, or both?
Fortunately, Chrome recently started tracking success rates for ICE in the M46 release. If the increase was related to a change in Chrome, this should be visible in those statistics as well, independent of the OpenTok platform.
After checking with Google, we heard back that there was indeed a threefold increase in the number of transitions to the ‘failed’ ICE state, along with a theory for what had happened.
The theory was that a change back in August caused the following transition of ICE connection states in Chrome 47:
connected -> disconnected -> failed (after ~10 seconds)
In Chrome 46 and earlier versions, the final transition to failed never happened and the sequence ended at disconnected:
connected -> disconnected
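To tell this new pattern apart from a classic setup failure in metrics, one could classify the sequence of ICE connection states observed for each call. The following is only a minimal sketch: the function name and the four labels are made up for illustration and are not part of any SDK or of Chrome itself.

```javascript
// Hypothetical helper: classify a recorded list of iceConnectionState values.
// The returned labels are illustrative, not an official taxonomy.
function classifyIceHistory(states) {
  const failedAt = states.indexOf('failed');
  if (failedAt === -1) {
    // Never reached 'failed': either a clean call or a plain interruption.
    return states.includes('disconnected') ? 'interrupted' : 'ok';
  }
  // Was the connection ever up before it failed?
  const wasConnected = states
    .slice(0, failedAt)
    .some((s) => s === 'connected' || s === 'completed');
  return wasConnected ? 'failed-after-connect' : 'failed-during-setup';
}

// Chrome 47 pattern described above:
console.log(classifyIceHistory(['checking', 'connected', 'disconnected', 'failed']));
// Chrome 46 pattern described above:
console.log(classifyIceHistory(['checking', 'connected', 'disconnected']));
```

Counting 'failed-after-connect' separately from 'failed-during-setup' would keep a behavior change like this one from being mistaken for a rise in firewall-related failures.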
After pondering whether to increase the logging to include more details about the state transition, we finally found a simple way to reproduce the behavior:
- Go to: https://webrtc.github.io/samples/src/content/peerconnection/pc1/
- Make a call
- Call pc2.close() to simulate peer going away without signaling.
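To make the transitions visible while following these steps, a small recorder can be attached to the other peer connection before closing pc2. This is a sketch under assumptions: it presumes the sample page exposes pc1 and pc2 as globals in the JavaScript console, and recordIceStates is a hypothetical helper, not part of the sample.

```javascript
// Hypothetical helper: record every ICE connection state change together
// with the elapsed time in milliseconds. `pc` is expected to be an
// RTCPeerConnection; any object with addEventListener and an
// iceConnectionState property behaves the same way.
function recordIceStates(pc, now = () => Date.now()) {
  const start = now();
  const history = [{ state: pc.iceConnectionState, ms: 0 }];
  pc.addEventListener('iceconnectionstatechange', () => {
    history.push({ state: pc.iceConnectionState, ms: now() - start });
  });
  return history;
}

// In the console of the sample page (assumes pc1 and pc2 are globals):
//   const history = recordIceStates(pc1);
//   pc2.close();
//   // wait a little more than ten seconds, then:
//   console.table(history);
```

Comparing the recorded history between Chrome 46 and Chrome 47 should show whether the extra transition to failed appears.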
Being able to reproduce an issue with one of the official WebRTC samples is always helpful. It reduces the amount of time that Google's team or our internal development team has to spend trying to understand your code, which means your bug gets fixed faster.
This showed that the theory was indeed correct (see the issue for more details).
This is a significant change in behavior.
Fortunately, this change had very little impact on the actual user experience for our platform. The state transition only happens ten seconds after the connection is interrupted, so by then users were hopefully already seeing a UI that explained the issue, such as those shown here.
So yes, it is an issue that somehow rolled into the stable Chrome version. That’s bad. But isn’t this supposed to be prevented by the staged Chrome release process? Can this not be detected in the Canary or Beta versions?
Well, let us look at the data from the first graph again. In mid-November, when Chrome 46 was stable:
- 0.5% of calls happened on Chrome 47 (then beta and affected by this issue)
- 0.4% of calls happened on Chrome 48 and 49 (dev and canary editions)
Taken together, roughly 1% of calls ran on a pre-release version. Since the bug affects less than one in a hundred calls (roughly; the actual ratio seems even lower), only about one in 10,000 calls was both made on a pre-release version and affected. To get a statistically relevant sample size (say 100 affected calls) and recognize a trend, about one million calls have to be made.
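The estimate above can be written out as a short calculation. This is a back-of-the-envelope sketch: the function name is made up for illustration, and the 1% failure rate is the rough upper bound from the text, not a measured value.

```javascript
// Hypothetical helper: how many calls in total are needed before
// `wantedSamples` affected calls show up on pre-release browsers?
function callsNeeded(preReleaseShare, failureRate, wantedSamples) {
  // Fraction of all calls that are on a pre-release browser AND hit the bug.
  const affectedFraction = preReleaseShare * failureRate;
  return wantedSamples / affectedFraction;
}

// Rounded figures from the text: ~1% pre-release share, ~1% affected.
console.log(Math.round(callsNeeded(0.01, 0.01, 100)));

// Unrounded share: 0.5% (beta) + 0.4% (dev and canary) = 0.9%.
console.log(Math.round(callsNeeded(0.005 + 0.004, 0.01, 100)));
```

With the unrounded 0.9% share the result lands slightly above one million calls, which is consistent with the rough figure quoted above.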
In hindsight, this is more of an “artifact in the data”, as Eric, our data science expert, suspected. Still, this is just the kind of hard data we need for analyzing certain trends in the WebRTC landscape. If you are curious, we will cover this topic in a webinar in January 2016 – stay tuned for more details.
This episode demonstrates not only how important it is for platform providers such as ourselves to work closely with browser vendors, but also the need for planned and well-documented updates that enable everyone in the ecosystem to react in a timely fashion.