Build a live video app with ARKit and OpenTok

Build a live video app with ARKit

If you have an iPhone, chances are that you have already upgraded to the latest version of iOS. In its 11th version, Apple has introduced many new things. As usual, some of them are related to the new hardware, while others improve and polish the well-known iOS formula. But there is one thing that is completely new and will bring a type of application that has never existed before at this scale. We are talking about Augmented Reality (or AR) applications, and Apple's ARKit SDK.

An intro to AR

AR applications simulate virtual elements existing in our world. On the phone, that is achieved by overlaying those virtual elements on a view that displays the video stream of our camera. At the same time, the app uses the phone's sensors, such as the accelerometer, and Core Motion data to track our movements, so when we move around a virtual object, it appears rotated just as a real object in the same place would.

These kinds of applications existed long before Apple introduced its own SDK. Many of them are based on the well-known OpenCV library, whose live image processing capabilities make AR apps possible. However, ARKit will be a strong push for this kind of app, since the right hardware, in the form of dual cameras, is now available, letting AR objects be positioned in the real world with greater accuracy. It is also well integrated with the iOS developer tools and other SDKs like SpriteKit and SceneKit.

OpenTok and ARKit

If you’ve seen the recent post from our CTO Badri Rajasekar about his thoughts and predictions around AI, you’ll know that here at TokBox, we think there is a whole host of opportunities to be had by combining the power of AR with cameras and microphones. So when Apple unveiled ARKit at WWDC back in June, we immediately thought how cool it would be to put the video of the participants of a conference in a virtual room, combining it with virtual elements. Using that inspiration as a starting point, we started investigating the new SDK and how we could integrate it with the OpenTok SDK.

In this blog post we will describe how you can use ARKit to show the video of your session participants.

All the code in this blog post comes from an ARKit sample that you can find here.

ARKit primer

The core of ARKit is its ARSession class. Whenever you start a session, it begins capturing images from your camera and reading data from your phone's sensors in order to perform the calculations needed to show your virtual elements as if they were real.

In many cases, you will want to place virtual objects on the ground or “hanging” on the walls. For that purpose, the ARSession class can also inform you when a flat surface has been detected.

Before jumping to the image rendering part, we need to know that the ARSession class also provides a way to add elements at a given position via the ARAnchor class. For example, if you want to place a virtual sphere in the AR world, you will use an ARAnchor to position it. As described in the previous paragraph, when a flat surface is detected, ARSession will provide you with an ARAnchor in case you want to place an object at that position.

Creating Virtual Elements

We have been talking about placing virtual elements, but we haven’t talked about how we are going to create those elements. ARSession class and the rest of AR* classes will help us with world tracking, but we need a way to draw the virtual objects.

For this purpose, ARKit's design is open, so it can be used in conjunction with many 3D and 2D rendering engines, such as Unity or Unreal Engine. However, one of the easiest ways to use ARKit is to combine it with Apple’s own rendering frameworks: SpriteKit or SceneKit, depending on whether you want to show 2D or 3D elements.

If you prefer to follow the Apple way, like we did for our sample, the ARKit SDK offers you two classes, ARSCNView and ARSKView. In both, you will have a typical SCNView or SKView where you can render sprites or meshes, and you will have an ARSession ready to move and place them as if they were real. In our sample, we decided to use SceneKit to show the video of the people in a conference as if they were in a real frame, so we will continue exploring SceneKit and ARSCNView.

ARSession and ARSessionConfiguration


OpenTok and SceneKit

SceneKit is a high-level 3D API that Apple introduced a couple of years ago, whose main aim is creating games or 3D visualization apps. You could say that SceneKit is Apple's response to Unity or other high-level game engines out there.

So, our plan was to create a scene in SceneKit (Xcode has a nice 3D scene editor built in) and render the video content of an OpenTok session on a 3D plane (the purple one in the image below), placed inside a bigger cube that acts as its frame.

OpenTok and SceneKit for ARKit app

SceneKit can use either an OpenGL or a Metal backend to render 3D objects; however, not everything works the same in both. For example, the Metal backend only works on real devices and not in the simulator, and some features are only available with Metal. This makes sense, given that Apple is moving toward Metal for its better API and better performance.

Using Metal was one of the first problems we hit when we tried to use the OpenTok SDK. Our SDK uses OpenGL to render the video of an OpenTok session in a UIView element. Since we were having a tough time trying to use our default OpenGL renderer in SceneKit scenes, we decided to create a Metal video renderer that we could use in this sample and anywhere else Metal is preferred.

Metal Rendering in OpenTok

In some ways, Metal design is close to OpenGL and the concepts we used to build our OpenGL renderer are valid for building the Metal one. We just need to reformulate how we render the video.

In OpenGL, we use 2D textures as the input of a fragment shader. That fragment shader is a small program that runs on the GPU (that means high parallelization, or in other words, hundreds of mathematical operations done in parallel) and converts the input format of the video, YUV, to another 2D texture in RGB. Then we assign that texture to a 2D UIView, and the video is rendered on the screen.

In Metal, we will do something similar. In this case we use a compute shader that will also perform hundreds of matrix multiplications on the GPU. That compute shader will take three 2D textures, one each for the Y, U and V planes, and will write into another 2D texture, which happens to be an MTLTexture.

Guess why? SceneKit objects use SCNMaterial instances to give them a realistic appearance, and yes, an SCNMaterial can use an MTLTexture as its contents.

So our path is clear. We will need to:

  1. Create a custom OpenTok renderer that will receive video frames.
  2. Feed those video frames to the Metal compute shader.
  3. Have the shader convert them to RGB in an MTLTexture.
  4. Assign that texture to an SCNPlane in our scene.

Easy, right?

Metal rendering in OpenTok for ARKit app

YUV to RGB Metal compute shader

Once we have a clear view of what we want to achieve, let’s see the code we used to make it real. We will start by showing the metal shader:

#include <metal_stdlib>
using namespace metal;

kernel void YUVColorConversion(
       texture2d<uint, access::read> yTexture [[texture(0)]],
       texture2d<uint, access::read> uTexture [[texture(1)]],
       texture2d<uint, access::read> vTexture [[texture(2)]],
       texture2d<float, access::write> outTexture [[texture(3)]],
       uint2 gid [[thread_position_in_grid]])
{
    // Offsets applied to the normalized Y, U and V values before the matrix
    float3 colorOffset = float3(-(16.0/255.0), -0.5, -0.5);
    // float3x3 is built from columns: one column each for Y, U and V
    float3x3 colorMatrix = float3x3(
                                float3(1.164,  1.164, 1.164),
                                float3(0.000, -0.392, 2.017),
                                float3(1.596, -0.813, 0.000)
                                );

    uint2 uvCoords = uint2(gid.x / 2, gid.y / 2); // Due to UV subsampling
    // Read one byte per plane and normalize it to the 0..1 range
    float y = float(yTexture.read(gid).r) / 255.0;
    float u = float(uTexture.read(uvCoords).r) / 255.0;
    float v = float(vTexture.read(uvCoords).r) / 255.0;
    float3 yuv = float3(y, u, v);
    float3 rgb = colorMatrix * (yuv + colorOffset);
    outTexture.write(float4(rgb, 1.0), gid);
}

As you can see, the code of the shader is quite simple: it just reads the pixel data from the three input textures and performs some operations to convert it from the YUV colorspace to RGB. The idea behind this is very similar to what we do in our OpenGL renderer.
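To make the shader's arithmetic easier to check, here is a minimal CPU-side sketch of the same per-pixel conversion in plain Swift. It is not part of the sample: the function name yuvToRGB is hypothetical, and it assumes the coefficients are the usual video-range BT.601 ones, which is what the offsets and matrix values in the shader appear to be.

```swift
/// Converts one YUV sample to RGB using the same offsets and coefficients
/// as the shader. Results are roughly in 0...1 and are not clamped.
/// (Hypothetical helper for illustration; not part of the sample.)
func yuvToRGB(y: UInt8, u: UInt8, v: UInt8) -> (r: Double, g: Double, b: Double) {
    // Normalize to 0...1 and apply the same offsets as the shader's colorOffset.
    let yn = Double(y) / 255.0 - 16.0 / 255.0
    let un = Double(u) / 255.0 - 0.5
    let vn = Double(v) / 255.0 - 0.5

    // The Metal matrix is built from columns, so each output channel is a
    // weighted sum of the offset Y, U and V values.
    let r = 1.164 * yn + 1.596 * vn
    let g = 1.164 * yn - 0.392 * un - 0.813 * vn
    let b = 1.164 * yn + 2.017 * un
    return (r: r, g: g, b: b)
}

// Video-range black (Y=16) and white (Y=235) with neutral chroma (U=V=128).
let black = yuvToRGB(y: 16, u: 128, v: 128)
let white = yuvToRGB(y: 235, u: 128, v: 128)
```

With neutral chroma, black and white land very close to (0, 0, 0) and (1, 1, 1), which is a quick sanity check that the offsets and the matrix columns are in the right order.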

Custom Metal renderer

In order to create a custom OpenTok renderer, we need to create a class that conforms to the OTVideoRender protocol. That protocol has just one function, func renderVideoFrame(_ frame: OTVideoFrame). As you can imagine, that function will be called around 10 to 30 times per second (depending on the fps of the received video), and every time it will pass a YUV video frame in its frame parameter.

We need to extract the video frame information from that frame, and send it to our shader. In our Swift class we will have 4 instances of MTLTexture, 3 for YUV inputs and 1 for RGB output. Starting with the RGB output, we create it using:

let textureDesc = MTLTextureDescriptor.texture2DDescriptor(
                 pixelFormat: .rgba16Float,
                 width: Int(format.imageWidth),
                 height: Int(format.imageHeight),
                 mipmapped: false)
// The shader writes to this texture and SceneKit reads from it
textureDesc.usage = [.shaderWrite, .shaderRead]
// device is the MTLDevice instance
let outTexture = device.makeTexture(descriptor: textureDesc)

The Y, U and V input textures are created in the same way. The difference is the pixelFormat: the RGB output uses rgba16Float, while the Y input uses r8Uint, since the Y plane has just 1 byte per pixel.

In order to fill the yTexture with the data coming from the OTVideoFrame, we will do:

guard let planes = frame.planes else { return }
// Copy the Y plane (plane 0) of the frame into yTexture
yTexture.replace(
                   region: MTLRegionMake2D(0, 0,
                                         Int(format.imageWidth), Int(format.imageHeight)),
                   mipmapLevel: 0,
                   withBytes: planes.pointer(at: 0)!,
                   bytesPerRow: (format.bytesPerRow.object(at: 0) as! Int))

We will do the same for the U and V textures, taking into account that our image format subsamples the U and V planes by 2 in both dimensions, meaning that those textures have half the width and half the height of the Y texture. (This is why the shader code divides the coordinates by 2.)
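As a small illustration of the plane sizes involved, the following sketch computes the dimensions of the three planes for a given frame size. The PlaneSize type and planeSizes function are hypothetical names introduced here, and the sketch assumes chroma planes that are halved in each dimension, rounding up for odd frame sizes.

```swift
/// Width and height, in pixels, of one plane of a planar YUV frame.
/// (Hypothetical type for illustration; not part of the sample.)
struct PlaneSize: Equatable {
    let width: Int
    let height: Int
}

/// Returns the sizes of the Y, U and V planes, assuming the U and V planes
/// are subsampled by 2 both horizontally and vertically.
func planeSizes(frameWidth: Int, frameHeight: Int) -> (y: PlaneSize, u: PlaneSize, v: PlaneSize) {
    // Round up so odd dimensions still get a chroma sample for the last
    // column and row, matching the shader's gid / 2 lookup.
    let chroma = PlaneSize(width: (frameWidth + 1) / 2,
                           height: (frameHeight + 1) / 2)
    let luma = PlaneSize(width: frameWidth, height: frameHeight)
    return (y: luma, u: chroma, v: chroma)
}

// A 640x480 frame has a 640x480 Y plane and 320x240 U and V planes.
let sizes = planeSizes(frameWidth: 640, frameHeight: 480)
```

These are the widths and heights you would pass to the texture descriptors and to MTLRegionMake2D when filling the U and V textures.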

Once we have the texture data in our custom OpenTok renderer, we need to pass the three textures to the shader and actually run it. To do so, we need to create a Metal command buffer, encode the commands and add the encoded commands to the command buffer. If you want to know more about how this works, please read Apple's official documentation on the Command Organization and Execution Model.

Although it could sound a little bit intimidating, as you can see, the code is not that complex:

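let defaultLibrary = device.makeDefaultLibrary()
let commandQueue = device.makeCommandQueue()
let commandBuffer = commandQueue?.makeCommandBuffer()
let commandEncoder = commandBuffer?.makeComputeCommandEncoder()
let kernelFunction = defaultLibrary?.makeFunction(name: "YUVColorConversion")
let pipelineState =
                         try! device.makeComputePipelineState(function: kernelFunction!)

commandEncoder?.setComputePipelineState(pipelineState)
commandEncoder?.setTexture(yTexture, index: 0)
commandEncoder?.setTexture(uTexture, index: 1)
commandEncoder?.setTexture(vTexture, index: 2)
commandEncoder?.setTexture(outTexture, index: 3)

// threadgroupsPerGrid (name assumed here) and threadsPerThreadgroup are
// MTLSize values chosen so that the grid covers the whole output texture
commandEncoder?.dispatchThreadgroups(threadgroupsPerGrid,
                        threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder?.endEncoding()
commandBuffer?.commit()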


Once we call commit, the shader will do its job, and will start processing our textures.
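The dispatch needs to know how many threadgroups to launch so that every output pixel gets a thread. A common way to compute that, sketched below in plain Swift, is to divide the texture size by the threadgroup size and round up. The function name threadgroupCount and the 16x16 threadgroup size are assumptions made for illustration, not values taken from the sample.

```swift
/// Number of threadgroups needed along one axis to cover `length` pixels
/// when each threadgroup handles `threadsPerGroup` pixels on that axis.
/// (Hypothetical helper for illustration; not part of the sample.)
func threadgroupCount(covering length: Int, threadsPerGroup: Int) -> Int {
    precondition(length >= 0 && threadsPerGroup > 0)
    // Integer ceiling division, so a partial last group is still launched.
    return (length + threadsPerGroup - 1) / threadsPerGroup
}

// For a 640x480 output texture and an assumed 16x16 threadgroup:
let groupsWide = threadgroupCount(covering: 640, threadsPerGroup: 16)
let groupsHigh = threadgroupCount(covering: 480, threadsPerGroup: 16)
```

The two results would then go into an MTLSize with a depth of 1. When the texture size is not a multiple of the threadgroup size, some threads fall outside the texture, so a kernel used this way would typically need a bounds check on gid before reading or writing.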

Linking everything together

If you remember our checklist:

  1. Create a custom OpenTok renderer that will receive video frames.
  2. Feed those video frames to the Metal compute shader.
  3. Have the shader convert them to RGB in an MTLTexture.
  4. Assign that texture to an SCNPlane in our scene.

There is just one thing left: assigning the output of the Metal shader to the SCNMaterial of our SCNPlane in the scene.

To do that, we need to get a reference to the plane, which we do in our UIViewController:

let scene = SCNScene(named: "art.scnassets/opentok.scn")!
let node = scene.rootNode.childNode(withName: "plane", recursively: false)!

Once we have the node, we pass it to the custom renderer we have built, which assigns the texture to the node's material by executing:

node.geometry?.firstMaterial?.diffuse.contents = outTexture 
// outTexture from our custom renderer.

After this long journey around SceneKit and Metal rendering, we haven’t forgotten that this blog post is about ARKit. We are using an ARSCNView, so it has the session bundled in: we just need to run the session (and pause it!).

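override func viewWillAppear(_ animated: Bool) {
  super.viewWillAppear(animated)
  let configuration = ARWorldTrackingConfiguration()
  sceneView.session.run(configuration) // sceneView: our ARSCNView (name assumed)
}
override func viewWillDisappear(_ animated: Bool) {
  super.viewWillDisappear(animated)
  sceneView.session.pause()
}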

Also, we need to create an OpenTok session and connect to it to start viewing the video of a given subscriber on that plane, which, thanks to ARKit, will be floating in space in our room (or wherever we run the sample). We can walk away from it or walk around it, and the video will appear to be living in our world.

The code snippets in this blog post are excerpts. If you want to see the whole sample in action, don’t miss its repo on GitHub. Using this sample as a starting point, it is very easy to add more video participants to the virtual room, or even to model a complete virtual living room where the video of your OpenTok session participants will be hanging around.