TokBox builds its own internal messaging infrastructure.

At TokBox, we believe in providing a high-quality video experience by constantly upgrading our server infrastructure. To that end, TokBox built its own lightweight, scalable, raw-socket-based messaging framework called Rumor.

One might wonder why OpenTok, a video streaming API, needs its own messaging infrastructure. The concept of an OpenTok session is similar to that of people in a room (the session) talking to each other (publishers and subscribers). When someone new enters the room, those already there acknowledge their presence. Similarly, when a new client joins an OpenTok session, the current participants are unaware of that client’s presence until the server notifies them that someone else has joined. Along the same lines, any action performed by that client (such as publishing their camera) needs to be relayed via the server to all the other participants in that session. Not only must everyone be assured of receiving these messages, but delivery also needs to happen in a timely manner. This is where our scalable messaging architecture, Rumor, comes into play.
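The room model above can be sketched as a tiny in-memory fan-out. This is purely a hypothetical illustration of the relay semantics, not Rumor’s actual implementation; all names here are invented:

```python
# Minimal sketch of the "room" model: the server relays join events
# and client actions to every other participant. Hypothetical code,
# not Rumor's implementation.

class Session:
    def __init__(self):
        self.clients = {}  # client_id -> message callback

    def join(self, client_id, on_message):
        # Notify existing participants that someone new has joined,
        # then register the newcomer.
        self._broadcast(sender=client_id, message=f"{client_id} joined")
        self.clients[client_id] = on_message

    def publish(self, client_id, message):
        # Relay a client's action (e.g. publishing a camera) to the others.
        self._broadcast(sender=client_id, message=message)

    def _broadcast(self, sender, message):
        for cid, callback in self.clients.items():
            if cid != sender:
                callback(message)

received = []
session = Session()
session.join("alice", received.append)
session.join("bob", lambda m: None)
session.publish("bob", "bob published camera")
# alice is notified of bob's join and of bob's publish
```

In the real system these callbacks would be writes to raw TCP sockets, but the relay logic is the same: the server, not the clients, is the source of truth about who is in the room.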

Because it handles raw TCP sockets directly, Rumor’s hallmark is speed. To test it, we set up a controlled environment with 8 Rumor server nodes in a mesh configuration and 2,000 clients, each of which sent one message to every connected client. The expected result was 4 million (2,000 × 2,000) messages sent through the system. We delivered every message within 14 seconds, an average of roughly 285K messages/second.
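For the record, the arithmetic behind those numbers:

```python
# Load-test arithmetic from the Rumor benchmark described above.
clients = 2000
deliveries = clients * clients   # each client's message reaches every client
seconds = 14

print(deliveries)                # 4,000,000 total message deliveries
print(deliveries // seconds)     # ~285,714 messages/second on average
```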

To provide higher-quality media, it was essential that the media server delegate its messaging responsibilities and focus on delivering media. We did exactly that by handling all messaging in Rumor and only the media on the media server. In doing so, we saw a great conservation of system resources on our media servers, enabling us to handle more streams per server. To prove this, we ran a scale test in a controlled environment on the old (RTMP) and new (Rumor) architectures, using a single 8-core server running the media server and Rumor processes.

As you can see in the graph, there is a huge difference in CPU consumption between the two tests. In the first scenario, the media server handles the RTMP connections responsible for sending messages; usage grows linearly as connections to the server increase, peaking at about 500% around the 19,000-connection mark. The spikes in the graph occurred when messages were being transmitted over the network to the connected clients. In the Rumor test, although the spikes were still prominent, there was hardly any CPU consumption to maintain the connections to the server.

Another reason we designed our own messaging infrastructure was to ensure cross-platform compatibility, so that we could write our own client implementations in our various SDKs (Adobe Flash, JavaScript, iOS, Java, etc.). Initially, Flash was the only platform OpenTok supported, and RTMP (Real Time Messaging Protocol), in the form of shared objects, was used for messaging. That made a lot of sense at the time, since it came with Flash at no additional cost. But aside from the fact that shared objects proved less effective at scale, once mobile platforms came into the picture, and later with the advent of WebRTC, using RTMP for messaging was pretty much useless. We needed something that could talk to clients irrespective of the protocol they use for their media.

Rumor was designed to solve these issues for us. It not only provides scale but is also a well-thought-out messaging fabric with the following features:

Lightweight: The module is abstracted away from any application logic and hence lightweight.
Fast: With an 8-node Rumor server mesh, we can send around 20,000 255-byte messages per second.
Push based notifications: A change in the state triggers an event notification and the message goes out to the clients.
Pub-Sub Mechanism: To further optimize Rumor’s messaging capability and make the mesh architecture effective, we introduced a subscription mechanism. Peer Rumor nodes receive only the messages that they are interested in and have subscribed to. This saves a lot of bandwidth, making message delivery even faster.
Scalable: The fabric can be scaled incrementally since handling scale is just a matter of adding more hardware to the mix.
Auto Discovery: Additional nodes can be added to the network, and existing servers will automagically detect the presence of the newer nodes and load balance accordingly.
Interoperability: With the simplicity of the used protocol, it is very easy to write Rumor clients in different programming languages.
Durable: System resources are very well handled and a node is very rarely required to be taken down for maintenance.
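The pub-sub mechanism above could look roughly like this. This is a simplified sketch under invented names and APIs; Rumor’s wire protocol and node internals are not shown here:

```python
# Simplified sketch of inter-node pub-sub: a node forwards a message
# only to peers that have subscribed to its topic. All names are
# hypothetical illustrations, not Rumor's API.

from collections import defaultdict

class Node:
    def __init__(self, name):
        self.name = name
        self.inbox = []

    def deliver(self, topic, message):
        self.inbox.append((topic, message))

class Mesh:
    def __init__(self):
        self.subscriptions = defaultdict(set)  # topic -> subscribed nodes

    def subscribe(self, node, topic):
        self.subscriptions[topic].add(node)

    def publish(self, topic, message):
        # Only interested peers receive the message, saving bandwidth
        # compared to flooding every node in the mesh.
        for node in self.subscriptions[topic]:
            node.deliver(topic, message)

a, b = Node("a"), Node("b")
mesh = Mesh()
mesh.subscribe(a, "session-42")
mesh.publish("session-42", "stream created")
# node a receives the message; node b, which never subscribed, gets nothing
```

The design choice here is the one the list describes: filtering at the publishing node rather than at the receiver means uninterested peers never see the traffic at all.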

If you have any questions or comments, post them here.
You can also follow me on Twitter here.