Build an AR App with the Frame Metadata API


If you had to choose one memorable thing about the upcoming iOS 12 when it was unveiled at WWDC’18, it would probably be the inclusion of ARKit 2.0, Apple’s Augmented Reality toolkit for iOS. I bet you still remember the cool demo the Lego folks gave on stage.

In fact, ARKit is probably the component which is going to grow the most in the new version of iOS, with many new features, improvements and even a new app to easily perform real-world measurements.

At TokBox, we released OpenTok version 2.14 a few weeks ago. Included in that release is a new API that will help push ARKit a little bit further. I’m talking of course about the Frame Metadata API.

Here we’re going to look at this new API and how it can help to improve AR use cases. To demonstrate, we’ll see an example in which this new TokBox API plays a decisive role in empowering a complete solution. We’ll also see code so that you can build an AR app using this API.

An Introduction to the Frame Metadata API

We first introduced the API in a previous post, but let’s start with a quick introduction to this API. At the most straightforward level, it allows us to develop some use cases that weren’t previously possible using other methods.

The most important part of a video conference is the video itself. A video stream is composed of thousands of video frames that are constantly flowing from a publisher to some number of subscribers. The Frame Metadata API allows you to insert some (small) information in each video frame as metadata of that specific frame.

There is no better way to get “real-time” information than this: the metadata forms part of the video frame itself, so the WebRTC engine ensures that the data and the video remain fully synchronized, since they are delivered together.

Given that this information is sent along with every video frame packet, which we don’t want to make too large, the available metadata size is just 32 bytes. That’s not much space, but it’s enough to bundle things like timestamps, data hashes, histograms, or positional and angular data from a phone’s sensors.

The good news is that this API is one of the simplest things in OpenTok to use: you just call the setMetadata method on an OTVideoFrame instance. The API is available in the iOS, Android and Windows SDKs. If you want more details, please visit the samples linked in the 2.14 blog entry.

Build an AR app using signals

Now that we know the tools, let’s introduce the idea that we want to explore in this post. The first idea was to create an application that would allow a remote participant (a subscriber) to set 3D annotations in the AR view of a video publisher.

One example application for this use-case could be in the insurance sector. An agent can set annotations in the “world” of the car-owner publisher who is streaming video of their car which has been involved in some kind of accident. Other kinds of remote expert support applications also fit this pattern.

The steps for our simple first attempt were as follows:

  1. The car owner is the video publisher
  2. Their app publishes from ARKit allowing annotations and other objects to be placed into the car owner’s view
  3. The agent is a subscriber to the car owner’s video stream and watches this
  4. The agent taps their screen in their app when they want to annotate or highlight something important
  5. The agent’s app then sends an OpenTok signal requesting that an annotation be placed in the car owner’s view at a particular screen location
  6. The car owner’s app receives the signal and uses ARKit to add the annotation. This is then seen by the agent within the video stream that the car owner is continuing to publish

However, the challenge with this approach is the timing between steps 4 and 6. By the time the car owner’s app places the annotation its camera view may have changed so that their screen is now different from the agent’s screen at the moment they tapped. This results in the annotation often being misplaced. This is a simple result of network delay and the fact that OpenTok signals are delivered in a different “channel” with no guaranteed synchronization with the video frames being sent.

Using Frame Metadata to create real-time annotations

Here is where Frame Metadata comes to the rescue. We can embed the right information about the publisher’s state in each frame, which means the subscriber has real-time position information from the publisher. Then, when the remote participant wants to create an annotation, they can accurately calculate its 3D location before signaling the remote end.

The only piece missing from this puzzle is: what is “the right information” that should be provided? We want the subscriber to be able to create a 3D object positioned in front of the image that the publisher is seeing. To do that, we are going to send the 3D coordinates of the publisher’s ARKit camera, in the form of a transform, to the subscriber. We will see how to embed the transform of the publisher’s ARKit camera in the section below where we explore our sample app code.
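Concretely, a camera pose can be flattened into eight Floats, which exactly fills the 32-byte metadata budget. Here is a plain-Swift sketch of such a layout: position and Euler angles match the sample, while the last two camera parameters are our reading of it, so treat the exact field order as illustrative.

```swift
// Illustrative camera pose layout: position (3), Euler angles (3),
// plus two camera parameters — eight Floats, i.e. 32 bytes of metadata.
struct CameraPose {
    var position: (x: Float, y: Float, z: Float)
    var eulerAngles: (x: Float, y: Float, z: Float)
    var zNear: Float
    var fieldOfView: Float

    // Flatten the pose into the array that will be packed into the frame metadata.
    var packed: [Float] {
        [position.x, position.y, position.z,
         eulerAngles.x, eulerAngles.y, eulerAngles.z,
         zNear, fieldOfView]
    }
}

let pose = CameraPose(position: (0, 1.5, 0),
                      eulerAngles: (0, .pi / 2, 0),
                      zNear: 0.001, fieldOfView: 60)
print(pose.packed.count * MemoryLayout<Float>.size)  // 8 floats × 4 bytes = 32
```

Any other payload that fits in eight Floats would work the same way; what matters is that publisher and subscriber agree on the field order.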

The steps for our enhanced application are now as follows:

  1. The car owner is still the video publisher
  2. Their app still publishes from ARKit allowing annotations and other objects to be placed into the car owner’s view
  3. Now in addition, the car owner app adds the continuously changing 3D camera information to every frame it publishes using the Frame Metadata API
  4. The agent is still a subscriber to this video stream and watches this
  5. The agent taps their screen in their app when they want to annotate or highlight something important
  6. The agent app now takes the 3D metadata from the frame the agent is watching at the moment they tap and uses this to calculate the correct 3D position for the annotation within the car owner’s view
  7. The agent’s app then sends an OpenTok signal which now includes the correct 3D position for the annotation, which is independent of whether or not the car owner’s camera view changes
  8. The car owner’s app receives the signal and uses ARKit to add the annotation at the correct 3D position. This is then seen by the agent within the video stream that the car owner is continuing to publish.

With this approach any delay in receiving the OpenTok signal no longer has any impact. The agent app has fully synchronized camera data for every video frame, so it can position the annotation exactly, even if it then takes a fraction of a second for the signal to arrive and the annotation to actually be created by the car owner’s app.
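The signal sent in step 7 is just a small colon-separated string. A minimal encode/decode sketch of that payload (the function names here are ours, not the sample’s):

```swift
import Foundation

// Encode the annotation's 3D position plus the 2D tap location as
// "x:y:z:tapX:tapY" — the same shape of payload the sample sends as a signal.
func encodeSignal(x: Float, y: Float, z: Float, tapX: Float, tapY: Float) -> String {
    return "\(x):\(y):\(z):\(tapX):\(tapY)"
}

// Decode it back; returns nil if the payload is malformed.
func decodeSignal(_ payload: String) -> (x: Float, y: Float, z: Float, tapX: Float, tapY: Float)? {
    let parts = payload.split(separator: ":").compactMap { Float($0) }
    guard parts.count == 5 else { return nil }
    return (parts[0], parts[1], parts[2], parts[3], parts[4])
}

let payload = encodeSignal(x: 0.1, y: 0.2, z: -1.5, tapX: 120, tapY: 340)
print(payload)  // "0.1:0.2:-1.5:120.0:340.0"
```

Guarding on the component count on the receiving side keeps a malformed or truncated signal from crashing the publisher’s app.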

As always, we have created a sample app that puts everything described in this blog post into practice. If you want to see it in action, please get this sample app from GitHub and follow the discussion below.

AR App Architecture

The sample app is an iOS app and the two main elements we are going to use are ARKit and the OpenTok Frame Metadata API. The graphic below shows a simple diagram of how the application works by using elements from both SDKs.

[Diagram: Frame Metadata AR app architecture]


On the publisher side we use an ARSCNView, a SceneKit view with AR capabilities powered by ARKit. That view feeds the back-camera image, composited with the AR scene, to a custom capturer that our publisher uses to send frames to the subscriber. The custom capturer bundles the camera’s 3D position and rotation into each frame’s metadata and sends it to the subscriber using the OpenTok SDK.

On the subscriber side, the frame is rendered. When the subscriber taps the view to create an annotation, the view captures the x and y position of the touch. Using the publisher’s 3D camera pose for the current frame, which the publisher bundled as metadata, it can calculate the 3D position of the annotation. Once that position is calculated, the subscriber sends a signal with it to the publisher using the OpenTok SDK.

When the publisher receives the signal it adds the annotation to the AR world, so it can be seen both by the publisher and subscriber. Since the subscriber is sending a complete 3D position based on the frame metadata at the moment the screen was touched, it does not matter how the publisher’s video view may have changed since then (unlike our original simplistic “annotate now” signal approach).
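The subscriber-side calculation boils down to simple geometry: place the annotation a fixed depth in front of the camera pose recovered from the frame metadata. Here is a simplified sketch in plain Swift, using a toy vector type of our own rather than SceneKit’s simd types, and with the camera position added explicitly as a slight generalization of the sample’s fixed-depth calculation:

```swift
// Minimal 3D vector for illustration (the real app uses simd/SceneKit types).
struct Vec3 { var x, y, z: Float }

func + (a: Vec3, b: Vec3) -> Vec3 { Vec3(x: a.x + b.x, y: a.y + b.y, z: a.z + b.z) }
func * (v: Vec3, s: Float) -> Vec3 { Vec3(x: v.x * s, y: v.y * s, z: v.z * s) }

// Place the annotation a fixed distance in front of the publisher's camera,
// whose position and forward direction were recovered from the frame metadata.
func annotationPosition(cameraPosition: Vec3, cameraForward: Vec3, depth: Float) -> Vec3 {
    return cameraPosition + cameraForward * depth
}

// A camera at the origin looking down -z places the annotation 2 m in front of it.
let p = annotationPosition(cameraPosition: Vec3(x: 0, y: 0, z: 0),
                           cameraForward: Vec3(x: 0, y: 0, z: -1),
                           depth: 2)
print(p.z)  // -2.0
```

The publisher then refines this approximate point against the actual tap location, as we will see in the code walkthrough.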

Code walkthrough

The sample has two main view controllers, PublisherViewController and SubscriberViewController, which control the two roles of the app.


PublisherViewController

The role of this view controller is to hold the AR session and render the world using SceneKit.

For the Frame Metadata API part, we use a custom capturer similar to the custom video driver Swift sample. The most important modifications to that sample are the capability of capturing the SceneKit frame along with the camera input, and the addition of a delegate that is called just before the frame is handed to the underlying OpenTok SDK.

In the PublisherViewController class we implement the delegate and pack the camera information in the frame metadata.

Since the metadata is limited to 32 bytes, we pack the float numbers into a Data value using this code:

extension PublisherViewController: SCNViewVideoCaptureDelegate {
    func prepare(videoFrame: OTVideoFrame) {
        // Find the node that holds the scene camera.
        let cameraNode = sceneView.scene.rootNode.childNodes.first {
            $ != nil
        }
        if let node = cameraNode, let cam = {
            // Eight Floats (32 bytes): position, Euler angles, zNear and field of view.
            let data = Data(fromArray: [
                node.simdPosition.x, node.simdPosition.y, node.simdPosition.z,
                node.eulerAngles.x, node.eulerAngles.y, node.eulerAngles.z,
                Float(cam.zNear), Float(cam.fieldOfView)
            ])
            var err: OTError?
            videoFrame.setMetadata(data, error: &err)
            if let e = err {
                print("Error adding frame metadata: \(e.localizedDescription)")
            }
        }
    }
}
If you want to see how we convert an array of Float to data, please take a look at Data+fromArray.swift file from the sample.
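An extension of this kind takes only a few lines of Foundation code. The following is our own sketch of such a conversion, not necessarily the exact contents of Data+fromArray.swift:

```swift
import Foundation

extension Data {
    // Copy the raw bytes of an array of trivial values (e.g. Float) into a Data blob.
    init<T>(fromArray values: [T]) {
        self = values.withUnsafeBufferPointer { Data(buffer: $0) }
    }

    // Reinterpret the blob back as an array of the given element type.
    func toArray<T>(type: T.Type) -> [T] {
        return withUnsafeBytes { rawBuffer in
            Array(rawBuffer.bindMemory(to: T.self))
        }
    }
}

let bytes = Data(fromArray: [Float](repeating: 1.5, count: 8))
print(bytes.count)  // 8 × 4 bytes = 32
```

This only works for trivial value types such as Float; reference types or structs with padding would need real serialization.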

Also in PublisherViewController, we have the code that adds elements when the subscriber signals for them.

The content of the signal from the subscriber is a colon-separated string holding the annotation’s approximate 3D position and the 2D tap location: worldX:worldY:worldZ:tapX:tapY.
With that information, we use two utility methods from SceneKit, projectPoint and unprojectPoint, to combine the 2D position of the tap with the approximate 3D position and calculate the final position of the annotation.

let nodePos = signal.split(separator: ":")
if nodePos.count == 5,
    let newNodeX = Float(nodePos[0]),
    let newNodeY = Float(nodePos[1]),
    let newNodeZ = Float(nodePos[2]),
    let x = Float(nodePos[3]),
    let y = Float(nodePos[4]) {
    newNode.simdPosition.x = newNodeX
    newNode.simdPosition.y = newNodeY
    newNode.simdPosition.z = newNodeZ
    // Project the approximate position to get its screen-space depth, then
    // unproject the tap location at that depth to get the final world position.
    let z = sceneView.projectPoint(newNode.position).z
    let p = sceneView.unprojectPoint(SCNVector3(x, y, z))
    newNode.position = p
}



SubscriberViewController

The main role of this class is to receive the frames from the publisher, render them and, when a tap is made, signal the publisher with the position of the annotation. For the rendering we use the same renderer as the custom video driver sample, but with the addition of a delegate so that it can expose the frame metadata.

The delegate saves the publisher scene camera with this code:

guard let metadata = videoFrame.metadata else { return }

let arr = metadata.toArray(type: Float.self)
let cameraNode = SCNNode()
cameraNode.simdPosition.x = arr[0]
cameraNode.simdPosition.y = arr[1]
cameraNode.simdPosition.z = arr[2]

cameraNode.eulerAngles.x = arr[3]
cameraNode.eulerAngles.y = arr[4]
cameraNode.eulerAngles.z = arr[5]

cameraNode.camera = SCNCamera()
cameraNode.camera?.zFar = CAMERA_DEFAULT_ZFAR
cameraNode.camera?.zNear = Double(arr[6])
cameraNode.camera?.fieldOfView = CGFloat(arr[7])

self.lastNode = cameraNode

Please take a look at the section above where the Publisher bundled this information in the frame.

When the view is tapped, we calculate the annotation position with this code:

guard let lastCamera = lastNode else { return }
let loc = recognizer.location(in: view)
let nodePos = lastCamera.simdWorldFront * FIXED_DEPTH

otSession.signal(withType: "newNode",
                 string: "\(nodePos.x):\(nodePos.y):\(nodePos.z):\(loc.x):\(loc.y)",
                 connection: nil, error: nil)

So there you have the main elements of this app. See the full GitHub code for how this all fits together into a complete sample app.

Conclusion and Further References

We have seen in this blog how our new Frame Metadata API can be used to build an AR app with features like real-time annotations. This is achieved by allowing fully-synchronized data to be sent with every video frame. In the simple example above we just had one stream, with the publisher giving the remote subscriber the right 3D perspective to calculate the placement of annotations. There are plenty more complex AR scenarios possible with multiple streams in various directions with multiple participants.

In addition, this metadata API can be used in other scenarios where it is very important to have synchronized real-time information, such as within computer vision applications. For example, our CTO discussed video quality improvement using face detection followed by image transformation in his presentation at Kranky Geek last year. In that example, fully-synchronized metadata is used to ensure that varying image transformations by the publisher are correctly reversed by each subscriber.

If you want to see other uses of ARKit and OpenTok, please take a look at our sample code on GitHub.

If you build an AR app using OpenTok and any of these samples, we’d love to hear from you!