How do you create a video chat application?

MONDAY, JUNE 15, 2020    

Common video applications use a technology called WebRTC. WebRTC is a fairly recent technology that lets developers exchange data and even media across the web using nothing more than your everyday internet browser. It has been one of the technological transformations of the decade, given that such functions were previously only available through plugins such as Flash.

Even if you have not heard of WebRTC, you have probably used it before. It is used in WhatsApp, Facebook, and even Google Hangouts. WebRTC stands for Web Real-Time Communication, and it allows you to have a direct connection between browsers. This technology has been around for at least 7 years already but I finally caught up! I created my own WebRTC application to get my hands dirty and understand the tool better.

Here’s my WebRTC application as my new play toy! Check it out here

In this article, I will be covering some of the broad topics:

1. How does it work?

Magic is not actually magic. It is just a bunch of intensive work done underneath the surface. Ducks look like they are just gliding elegantly over the surface of the water, but when you get to see the work behind that, it’s a ton of plain old hard kicking.

2. What are the components required to get a WebRTC application running

A web application has different components. Most, such as the server, are unseen and out of sight. But they are like your car engine. You don’t see it when you drive a car (unless you are driving a Dodge Charger R/T) but that doesn’t mean the engine is not there.


3. What is a STUN server used for

In this section we will go a little into the details of how and when you need a STUN server in the entire process.

4. What is a TURN server used for

In this section we will briefly touch on when you need a TURN server. It is not needed in all situations, but it is a required component if you have to deal with slightly less straightforward use cases.

5. What are some of the possible WebRTC architectures for different use cases and optimized room capacity

There are mainly 4 types of architectures when we talk about WebRTC. When you use Zoom, have you wondered how a server is able to handle a meeting of up to 160 members? This isn’t magic, and it actually requires a HUGE amount of data to be transmitted: the stress test described in section 6 moved 52GB of media traffic in under 10 minutes.

Getting the right architecture for the business use case is half the battle won.

6. How to test your WebRTC application and free resources to try

As I was playing with my new toy, I found some interesting tools and resources that fit the different components of a simple WebRTC application.

Let’s dive in!

#1. How does it work?

When you first start a video chat session, you need to signal to a room that you have arrived. This way of telling other people you have arrived is called signalling. If you’re the first to arrive at a party venue, you will technically still say hi, but there will be no one to respond to you.

In the technical sense, signalling is the process of finding each other and then coordinating communication through an exchange of media information. It works hand in hand with the Interactive Connectivity Establishment (ICE) framework, which gathers network information such as Internet Protocol addresses (IP addresses) and port numbers used for media exchange, while the Session Description Protocol (SDP) describes the session itself.

The Session Description Protocol (SDP) is used as a standard method of announcing and managing session invitations. It represents the browser’s capabilities and preferences in a text-based format.

For your browser to connect to others, the information exchanged during signalling contains the following:

  • Session control messages used to open or close communication.
  • Error messages.
  • Media metadata such as codecs and codec settings, bandwidth and media types.
  • Key data, used to establish secure connections.
  • Network data, such as a host’s IP address and port as seen by the outside world.
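To make the "text-based format" idea concrete, here is a minimal sketch that pulls the media types, ports, addresses and codecs out of an SDP description. The sample SDP below is a shortened, made-up example (the address is a documentation address and the ports are arbitrary), and the parser only handles the few line types it needs. A real browser-generated SDP carries many more lines than this.

```python
from dataclasses import dataclass, field

# Illustrative only: a trimmed-down SDP with invented values.
EXAMPLE_SDP = """\
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 49170 UDP/TLS/RTP/SAVPF 111
c=IN IP4 203.0.113.5
a=rtpmap:111 opus/48000/2
m=video 51372 UDP/TLS/RTP/SAVPF 96
c=IN IP4 203.0.113.5
a=rtpmap:96 VP8/90000
"""


@dataclass
class MediaSection:
    kind: str                                   # "audio" or "video"
    port: int                                   # port from the m= line
    address: str = ""                           # address from the c= line
    codecs: dict = field(default_factory=dict)  # payload type -> codec name


def parse_sdp(sdp: str) -> list:
    """Collect one MediaSection per m= line, with its c= and a=rtpmap lines."""
    sections = []
    for raw in sdp.strip().splitlines():
        line = raw.strip()
        if len(line) < 2 or line[1] != "=":
            continue  # every SDP line has the shape <letter>=<value>
        key, value = line[0], line[2:]
        if key == "m":
            kind, port, _proto, *_formats = value.split()
            sections.append(MediaSection(kind=kind, port=int(port)))
        elif key == "c" and sections:
            # c=<network type> <address type> <address>
            sections[-1].address = value.split()[2]
        elif key == "a" and value.startswith("rtpmap:") and sections:
            payload, codec = value[len("rtpmap:"):].split(" ", 1)
            sections[-1].codecs[int(payload)] = codec
    return sections


if __name__ == "__main__":
    for section in parse_sdp(EXAMPLE_SDP):
        print(section)
```

Each media section ends up with the media metadata (codecs) and the network data (address and port) from the list above.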

What the architecture of your WebRTC application will look like

#2. What components are needed to get a WebRTC application up and running?

1. You need a user interface.

You need to have a web application that is able to establish the connection upon load or a button click and then display the video data that are sent by your peers in the same chat room.

In a more technical sense, you need an RTCPeerConnection offer to be created and sent over to your remote peers, who store it on their own machines. Once you have the web application done, you need to host it on a web server so it can be publicly accessed by your friends.

2. You need a server for signalling.

Remember what we mentioned about signalling? It is used to indicate to all the members in a particular room that you have arrived. When you arrive, each existing member of the room will take note of your remote information and details via SDP and store them on their machine, ready to receive the media data from you.

From the technical aspect, WebRTC handles the creation and handling of SDP information, but not the sending and receiving of it. Therefore a server is needed for this transmission. It is quite common to use a WebSocket server for this transmission and initialization.
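As a rough illustration of what the signalling server does, here is an in-memory sketch of a room that announces arrivals and relays offer and answer messages between peers. Everything in it is hypothetical: the class name, the message shapes and the peer ids are my own, and a real server would push these messages over WebSocket connections instead of storing them in lists.

```python
class SignallingRoom:
    """In-memory stand-in for a signalling server (no networking involved)."""

    def __init__(self):
        self.inboxes = {}  # peer id -> list of pending messages

    def join(self, peer_id):
        """Announce the newcomer to existing members and return who is already here."""
        for inbox in self.inboxes.values():
            inbox.append({"type": "joined", "from": peer_id})
        existing = list(self.inboxes)
        self.inboxes[peer_id] = []
        return existing

    def send(self, sender, recipient, kind, payload):
        """Relay one message (for example an SDP offer or answer) to a single peer."""
        if recipient not in self.inboxes:
            raise KeyError(f"{recipient} is not in the room")
        self.inboxes[recipient].append(
            {"type": kind, "from": sender, "payload": payload}
        )

    def receive(self, peer_id):
        """Hand over and clear everything waiting for this peer."""
        messages, self.inboxes[peer_id] = self.inboxes[peer_id], []
        return messages


if __name__ == "__main__":
    room = SignallingRoom()
    room.join("alice")
    room.join("bob")
    # The payloads are placeholders for the SDP text the browsers would create.
    room.send("bob", "alice", "offer", "<SDP offer text>")
    room.send("alice", "bob", "answer", "<SDP answer text>")
    print(room.receive("alice"))
    print(room.receive("bob"))
```

Note that the room never looks inside the payload. It only passes the SDP text along, which is exactly the part WebRTC leaves to you.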

3. You need a STUN server.

This server is used to discover your public IP address as seen by the outside world. Without going into the technical details, every one of us has a public IP address, and a STUN server is used to get this information. This information will then be part of the SDP information you need to send over to your peers when you have just arrived.

4. You may need a TURN server.

This server is used as a middle man to pass messages for you when you are unable to reach your remote peer directly. Someone has asked this question in Quora, and you can check it out here.

#3. What is a STUN server used for?

STUN stands for Session Traversal Utilities for NAT. Both parties need to know at least their peer’s IP address and the assigned UDP port.

Whatever architecture is used, we will always need a signalling server for registration and presence. We will also need a STUN server for network traversal, to ensure that internal IP addresses can be mapped to external public IP addresses.

How does STUN work?

STUN server


Now you can see that a STUN server is used by your browser to request your public IP address. This is all done under the hood, out of sight. One thing I’ve learnt: magic is never really magic. There are many things happening behind the scenes. If you want to learn the magic, you’ve got to learn the steps behind the scenes.
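For the curious, the request and response under the hood are small binary messages. Below is a minimal sketch, based on the STUN message layout in RFC 5389, that builds a Binding Request and reads the public address out of a Binding Success response. It only handles IPv4 and the XOR-MAPPED-ADDRESS attribute, and the function names are my own. Actually sending the bytes over UDP to a STUN server is left out, so treat it as an illustration of the format, not a full client.

```python
import ipaddress
import os
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value present in every STUN message
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id=None):
    """20-byte header: type, body length (0), magic cookie, 12-byte transaction id."""
    transaction_id = transaction_id or os.urandom(12)
    return struct.pack("!HHI12s", BINDING_REQUEST, 0, MAGIC_COOKIE, transaction_id)


def parse_binding_response(data):
    """Return (ip, port) from the XOR-MAPPED-ADDRESS attribute of a success response."""
    msg_type, length, cookie = struct.unpack("!HHI", data[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN binding success response")
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", data[offset:offset + 4])
        value = data[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS:
            _reserved, family, xport = struct.unpack("!BBH", value[:4])
            if family != 0x01:
                raise ValueError("only IPv4 is handled in this sketch")
            # Port and address are XOR-ed with the magic cookie on the wire.
            port = xport ^ (MAGIC_COOKIE >> 16)
            (xaddr,) = struct.unpack("!I", value[4:8])
            return str(ipaddress.IPv4Address(xaddr ^ MAGIC_COOKIE)), port
        # Attribute values are padded to a multiple of 4 bytes.
        offset += 4 + attr_len + (-attr_len % 4)
    raise ValueError("no XOR-MAPPED-ADDRESS attribute found")
```

The address that comes back is the one the STUN server saw your request arrive from, which is the public IP address and port your peers need to know.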

#4. What is a TURN server used for?

TURN stands for Traversal Using Relays around NAT, and it is a protocol for relaying network traffic. Sometimes, even after you have used a STUN server and retrieved the public IP address of your remote peer, firewalls and other networking tools (NAT, if you are interested to find out) still block a direct connection. In that case you need a TURN server that is publicly accessible on the internet to relay the media for you. It’s kind of like a middle man helping you to pass your message.

TURN server
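One way to see why TURN is the fallback and not the default is the priority that ICE gives to each kind of candidate address. The sketch below uses the priority formula and the recommended type preference values from RFC 8445. Real browsers may pick different local preference values, so take the exact numbers as an illustration; the function names are my own.

```python
# Recommended type preferences from RFC 8445: higher means "try this first".
TYPE_PREFERENCE = {
    "host": 126,   # address taken directly from a local network interface
    "prflx": 110,  # peer reflexive, discovered during connectivity checks
    "srflx": 100,  # server reflexive, the public address learnt via STUN
    "relay": 0,    # relayed address allocated on a TURN server
}


def candidate_priority(candidate_type, local_preference=65535, component_id=1):
    """priority = 2^24 * type preference + 2^8 * local preference + (256 - component id)."""
    return (
        (TYPE_PREFERENCE[candidate_type] << 24)
        + (local_preference << 8)
        + (256 - component_id)
    )


def order_candidates(candidate_types):
    """Sort candidate types from most preferred to least preferred."""
    return sorted(candidate_types, key=candidate_priority, reverse=True)


if __name__ == "__main__":
    print(order_candidates(["relay", "srflx", "host"]))
```

A direct address is tried first, then the STUN-discovered address, and the TURN relay comes last because relaying all media through a server costs bandwidth and adds latency.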

#5. How to scale: 4 key architectures you can consider

  • Mesh

Client to client. Where there are 3 clients in a conference call, each client has 2 connections to encrypt and establish. Everyone connects with everyone, as you can see in the screenshot below:

  • Forwarding

A selective forwarding unit (SFU) acts as an intelligent media relay in the middle of a session. Every client connects to the SFU once to send media, and then once more for every other client to receive media, resulting in each client managing n unidirectional connections, where n is the number of connected clients. Although the total number of connections for this architecture is n^2, each client only encrypts once, upon connection initialization, which alleviates pressure on the device itself, especially on mobile.

Forwarding requires additional server infrastructure such as the SFUs, but is highly efficient. An SFU does not attempt to decode the data, and does not decrypt the packets unless recording is activated.

  • Mixing

Mixing depends on a multipoint control unit (MCU) which acts as a high-powered media mixer in the middle of a session. Each client connects to the MCU once, having to deal with just one bidirectional connection regardless of number of clients connected. The connection will be used to send and receive media from the server.

Like forwarding, encryption and uploading are performed only once, but now downloading and decryption are performed only once as well. This approach is the most efficient for the clients, but the least efficient from the server’s perspective. The burden of depacketizing, decoding, mixing, encoding, and packetizing falls entirely on the server. These are fairly heavy operations that require significant server resources to complete in real time.

Consider this approach for applications with large numbers of active participants, such as a virtual classroom, or for cases where devices are particularly constrained in resources like bandwidth. This architecture comes at a server cost.

  • Hybrid

As indicated by its name, the hybrid architecture is a combination of mesh, forwarding, and/or mixing. So you create a session for a participant based on whatever makes the most sense.

  • For simple 2 party calls, a mesh setup is simple and requires minimal server resources.
  • For small group sessions, broadcasting, and live events, forwarding will be most suitable to meet needs.
  • For larger group sessions or telephony integrations, mixing is often the only practical option to consider.
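The connection counts described for mesh, forwarding and mixing can be put side by side with a few lines of arithmetic. This is a small sketch with a function name of my own. Keep in mind that mesh and mixing connections are bidirectional while forwarding connections are unidirectional, so the totals are not a like-for-like comparison.

```python
def connection_counts(architecture, n):
    """Return (connections per client, total connections) for n clients."""
    if n < 2:
        raise ValueError("a call needs at least 2 clients")
    if architecture == "mesh":
        per_client = n - 1            # one bidirectional link to every other client
        total = n * (n - 1) // 2      # each link is shared by two clients
    elif architecture == "forwarding":
        per_client = n                # 1 upload to the SFU + (n - 1) downloads
        total = n * n                 # the n^2 mentioned in the forwarding section
    elif architecture == "mixing":
        per_client = 1                # a single bidirectional link to the MCU
        total = n
    else:
        raise ValueError(f"unknown architecture: {architecture}")
    return per_client, total


if __name__ == "__main__":
    for name in ("mesh", "forwarding", "mixing"):
        print(name, connection_counts(name, 10))
```

What matters for the devices is the per-client number and how many times each client has to encrypt and upload its media: n - 1 times in a mesh, but only once with forwarding or mixing.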

#6. How do you test your WebRTC application using testRTC?

Time and again, the issues with WebRTC applications happen at scale. They consume a huge amount of data. So if you intend to scale your WebRTC application, choosing the appropriate architecture is the first battle won. The second battle is data traffic.

In a stress test of 500 participants, split into multiparty calls of 5 participants each and running for only 6.5 minutes, the server transmitted 52GB of media traffic in both directions. That is in less than 10 minutes.
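To get a feel for what that number means for a server, here is the same figure turned into average throughput. I am assuming that 1GB means 10^9 bytes and that the 52GB is the combined total for both directions. Neither is stated in the test result, so read the output as a ballpark.

```python
def average_throughput(total_bytes, seconds, participants):
    """Return (server bits per second, bits per second per participant)."""
    server_bps = total_bytes * 8 / seconds
    return server_bps, server_bps / participants


# Figures from the stress test: 52GB, 6.5 minutes, 500 participants.
# Assumption: 1GB = 10**9 bytes.
server_bps, per_participant_bps = average_throughput(
    total_bytes=52e9, seconds=6.5 * 60, participants=500
)

if __name__ == "__main__":
    print(f"server: {server_bps / 1e9:.2f} Gbit/s")
    print(f"per participant: {per_participant_bps / 1e6:.2f} Mbit/s")
```

Under those assumptions the server sustains roughly 1 Gbit/s on average, or about 2 Mbit/s per participant.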

#6b. Resources

There are free STUN servers for you to use in your WebRTC implementation

Free tier for WebRTC

I’ve found a comprehensive YouTube video that helps one understand how a WebRTC application works. It is interesting to find out that all the WebRTC-related YouTube videos are years old. WebRTC has been around for a good 7 years. I guess Joshua here has finally caught up, and I hope I managed to get you up to date with this age-old technology today! (: