This post was co-authored by Gustavo Garcia Bernardo, Philipp Hancke and Charley Robinson.
When WebRTC stuff is really broken, it gets fixed very quickly.
Early in December 2015, shortly after the release of Chrome 47 to the general public, we started to notice a subtle and strange behavior in the audio and video streams during our many daily meetings using WebRTC: the video occasionally wouldn’t stay in sync with the corresponding audio. As with many bugs noticed internally by developers, it took a while for any of us to believe that what we were seeing was a real issue. We call this the inverse of productive dogfooding: rather than assume we are just like our users, we can just as easily decide we are nothing like them.
Does this sound familiar? When there are so many moving pieces, it is very easy to fall into the belief that you, the developer, are seeing just an anomaly of the developer’s environment, nothing to worry about. As engineers, we can fabricate arrays of reasons why some problem should be ignored: maybe the VPN is having a hard day, maybe someone is checking out the Chromium codebase. No matter the rationale, the end belief is always the same: that the problem will either shake out or get horribly worse before it could ever show up as an insidious production bug. This is a much safer and more comfortable belief, so long as the believer is lucky enough to be correct.
We weren’t so lucky. Customer reports began to flow mere days later.
The audio and video components (“streams”) of a WebRTC call are treated mostly independently. They are encoded separately, treated differently on the network (because who cares how good you look on a call, if you cannot be heard?), and are only brought back together to represent the same moment of time at the very last moment before a human sees and hears them. We call the behavior that brings these two streams back together in a sensible manner the “lip-sync” mechanism. We slowly began to accept the reality: lip-sync was very broken. We had already seen similar issues back in 2012, but they had been fixed. So why was it broken again?
It would take a few more weeks to fully understand the why. In that time, we pursued entire shoals of red herrings. We checked datacenter distributions, browser versions, JS SDK versions, and many numerical statistics. For all of our millions of lines of logged data, we couldn’t find a single factor that captured the issue. With the increasing pressure of customer reports suggesting that the impact of this issue could be quite large, we became desperate for data.
Analyzing the issue
We turned to some (super awesome) customers for help: our friends at SuiteBox. Gathering WebRTC statistics from an errant endpoint feels like equal parts black magic and luck. There are many steps that can go wrong. It’s not the prettiest process. And even once you get the data back to an engineer who can look at it, it takes time to figure out what to look for. This is not something you want to hear as someone whose business may depend on it.
After a few failed attempts, we were able to get an intact data dump. Digging through the amount of data exposed by Chrome (several megabytes) is difficult. We ended up visualizing the dump’s contents to see if anything looked amiss.
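Before anything could be charted, the series had to be pulled out of the dump. Here is a minimal sketch of that extraction step, assuming the legacy webrtc-internals layout (a top-level "PeerConnections" object whose stats entries store their samples as JSON-encoded strings); the exact shape is our assumption and may differ between Chrome versions:

```python
import json

def extract_series(dump_path, stat_name="googCurrentDelayMs"):
    """Pull every time series for a given stat out of a webrtc-internals dump.

    Assumes the dump is JSON with a "PeerConnections" object, each entry
    holding a "stats" dict whose keys look like
    "ssrc_12345_recv-googCurrentDelayMs" and whose "values" field is a
    JSON-encoded array stored as a string.
    """
    with open(dump_path) as f:
        dump = json.load(f)
    series = {}
    for pc_id, pc in dump.get("PeerConnections", {}).items():
        for key, entry in pc.get("stats", {}).items():
            # only keep the stat we care about, e.g. ...-googCurrentDelayMs
            if key.endswith("-" + stat_name):
                series[(pc_id, key)] = [float(v) for v in json.loads(entry["values"])]
    return series
```

From there, each series can be handed to any plotting tool to look for trends.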
In this, we were lucky; amiss the data did look. The next morning, Gustavo showed this worrisome graph:
What we see in this graph is an aptly-named metric, googCurrentDelayMs. Growing. Linearly. Without even going into the specifics of what we are looking at, critical minds should at this point be suspicious of a thing called “delay” that grows over time, to 1.5 seconds after ten minutes. And we had heard of worse cases.
As luck would have it, we had been looking at googCurrentDelayMs for a separate customer report of “bad quality”. For comparison, their webrtc-internals dump looked like this:
In the periods when the quality was perceived as good, googCurrentDelayMs closely followed another interesting metric, the jitter buffer value. After 50 seconds, the googCurrentDelayMs value starts to diverge. This corresponds to a ~200ms desync between audio and video and was likely caused by packet loss. But then the value goes down again.
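Telling these two traces apart can even be automated: a healthy googCurrentDelayMs hovers near the jitter buffer value and recovers, while the pathological one climbs steadily. A rough sketch of such a drift detector, fitting a least-squares slope to the samples; the one-second sampling interval and the 5 ms/s threshold are our own illustrative choices, not anything Chrome defines:

```python
def delay_is_drifting(samples, interval_s=1.0, threshold_ms_per_s=5.0):
    """Fit a least-squares line to googCurrentDelayMs samples and flag a
    sustained positive slope, i.e. audio/video desync that keeps growing."""
    n = len(samples)
    if n < 2:
        return False
    xs = [i * interval_s for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den  # extra milliseconds of delay accumulated per second
    return slope > threshold_ms_per_s
```

The linearly growing trace from the first graph trips this check within a minute; the healthy trace, which oscillates around the jitter buffer value, does not.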
Digging a little deeper into the webrtc.org source code, we found this metric tied to the minimum playout delay used for synchronization:
// Set the minimum playout delay required to sync video with audio.
void set_min_playout_delay(uint32_t min_playout_delay);
This means the video will be delayed to be in sync with audio when necessary. Sounds reasonable. It is an interesting metric to expose; in our case, it could enable us to tell the user that there is something wrong with the connection. We might even try to fix things by turning off video.
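For instance, a monitoring layer could map the observed delay to an escalating response. A minimal sketch; the thresholds and the actions themselves are purely illustrative inventions of ours, not part of any browser or SDK API:

```python
def sync_action(current_delay_ms, warn_ms=500, disable_ms=2000):
    """Map an observed googCurrentDelayMs sample to an escalating response.

    The 500ms and 2000ms thresholds are illustrative, not tuned values.
    """
    if current_delay_ms >= disable_ms:
        return "disable-video"  # desync too large; drop video, keep audio
    if current_delay_ms >= warn_ms:
        return "warn-user"      # tell the user the connection is degraded
    return "ok"
```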
This also means googCurrentDelayMs should never grow linearly as we have seen in the first graph. So we went ahead and filed a bug with the Chrome team. We did not have a way to reproduce this yet, just a single set of data.
Filing a Chrome bug
We explained what we had found, why we were concerned, and that we had not been able to reproduce it yet. The Chrome team quickly responded with a theory about what could have caused this:
Just an idea, could be related to that we switched over to using camera driver timestamps for video packets.
But they had no luck reproducing the issue either.
In order to confirm the theory, we needed a Wireshark dump of the network traffic (see here if you are interested). While we were trying to get one (more on this later), another bug was filed and this time, the Chrome team managed to reproduce the faulty behavior in their lab with a certain camera model. Then it was easy to identify the change that introduced this behavior. The hunch turned out to be correct.
A change to improve the smoothness of the video on Windows was responsible. Justin Uberti, Google’s Technical Lead for WebRTC, talked about this change last year at the KrankyGeek event in San Francisco.
This had some unexpected consequences. To quote another Google engineer:
Sounds to me like this breaks AV sync completely for an unknown number of clients on our largest platform. I think we should request reverts for this all the way into stable, because it’s way scarier to me than previous behavior.
Well, we agree. Also, our customers would agree. Reverting a change all the way into the stable version of Chrome is only done for very serious issues. And this turned out to be one.
So a rollback was released globally. Soon after the change rolled out, SuiteBox (an original reporter) sent us this positive feedback:
I’ve had my first test meeting in circumstances that would produce it 100% of the time. Video and audio remained completely in sync. That’s great news.
Which we passed on to the Chrome team. Because that is exactly the kind of feedback you want to hear when making the hard decision to revert a big change.
The first revision solved the issue on Windows, but we had also gotten some reports involving the FaceTime HD cameras typically used in MacBooks. In fact, we had already identified another problem affecting Macs. Isn’t it lovely when you get similar behavior caused by two unrelated issues on different platforms?
The OS X issue was also fixed in Chrome Canary (the newest version of Chrome), and has been merged into the current beta version of Chrome (M49) as well. Stopping there is a reasonable decision: this issue was much less likely to occur, so the risk of merging the fix all the way into the stable version outweighed the benefit.
Is there anything that can be done better?
The way the Chrome team handled the bug was perfect. Our initial report was quite vague. We had little data. We did not know how to reproduce it. All we had was some data and an initial analysis that indicated something looked odd. Working directly with Google paid off, as usual. The more concrete a bug report is, the easier it is to reproduce the issue and fix it.
Visualization helps, too. The graph we provided is easy to reproduce from the data, but the impact from a visual tool is much louder. Almost intrusively loud. Certainly not the first thing you want to be shown by your co-worker from Europe when you wake up. Or the last thing you want to see before logging out.
We ran into some problems. It was very difficult for us to get Wireshark dumps from users affected by this. Fortunately, the Chrome team had run into the same issue, so they had already been working on a feature for the chrome://webrtc-internals page that allows downloading some of the network traffic Chrome receives. In the next version of Chrome, instead of having to explain to someone how to install Wireshark (good luck with that!), all you have to explain is how to generate such a dump. We put together a short blog post describing how to get the dump as well as some of the tooling we used to view and analyze those logs.
On Windows, this issue happened only with certain cameras, and the Logitech C930 was not used in the automatic regression testing done by the Chrome team. This is going to change, and metrics tracking the sync behavior will be added in order to discover such regressions in the future. And hopefully we can get back the smoother video that was promised!
Why do we care?
This issue affected our customers in a way that threatens the success of their business. We take this very seriously. The desync for some customers was huge: over ten seconds in certain tests. None of their customers would tolerate this. On the flip side, once the problem was solved, they appreciated having a vendor to help them navigate the issue, regardless of the root cause.
TokBox prides itself on being our partner’s representative to the WebRTC developers at Google and Mozilla, and we are thankful to have a productive working relationship with them. WebRTC remains some of the best technology on the market to enable real-time communications, and we are tasked with balancing an entire ecosystem of browser vendors, platform providers, developers and customers deeply invested in this technology. This is no small task, as the ecosystem is in flux and condemned to experience growing pains – the concrete needs to cure and settle while not exposing any cracks!