Why is networked multiplayer so difficult to implement?
It's often said that "networked games are hard to make", but why is that? In an age of game engines with built-in networking capabilities, why is this still considered a problem? Why is networking different from other features that have been massively simplified by modern engines, such as rendering, physics, and audio? Let's find out.
Player Expectations
Imagine you're playing a split-screen cart racing game at home on the couch with a friend. The race starts, you push the stick forward, and your cart immediately starts pulling away. The game is responsive to your input. But you take the first corner too slowly, and your friend speeds past you. You both see the overtaking manoeuvre, your race position drops to 2nd, and theirs switches to 1st. The game is consistent for all players. You and your friend control your vehicles with split-second reflexes, until eventually one of you crosses the finish line just a moment before the other. The game plays in real-time.
These are the three key expectations that we have of most (not all!) video games:
- They will be responsive to our inputs, acting quickly to take the action that we have asked the game to take.
- They will be consistent, meaning that every player experiences the same game state, even if they only see different parts of it, from different angles, and maybe even at slightly different times.
- They will run in real-time, giving us a constant input-process-output loop that allows us to respond to the challenges set by the game and by our opponents.
Information moves slowly
This works well when everyone is sitting in front of the same machine. Within the tiny duration of a single rendered 'frame' of the game - maybe 17 milliseconds for a 60Hz game - the software is able to query all the gamepads, apply each player's input to their own cart, perform any AI routines for non-player drivers, run the physics simulation to move all the carts, calculate any game-specific logic (such as on-track collectables, boosts, etc.), create all the visuals and sounds, and send them to the appropriate output devices.
The fact that this can be done literally faster than the blink of an eye is a technological marvel, and it's made possible by all the hard work being done within a single console or computer where everything is connected and can send billions of bits of information around the computer every second.
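As a rough sketch of that single-machine loop, here's a toy Python version - the helper functions are purely illustrative stand-ins for real input, AI, physics, and rendering systems, not any engine's actual API:

```python
import time

# A toy, runnable sketch of a purely local game loop: everything happens on
# one machine, within a single frame.

FRAME_DURATION = 1 / 60                       # roughly 17 milliseconds at 60Hz

def poll_gamepads():          return {"p1": 1.0, "p2": 0.8}     # throttle per player
def run_ai(world, dt):        world["cpu"] += 9.0 * dt          # non-player driver
def step_physics(world, inputs, dt):
    for player, throttle in inputs.items():
        world[player] += throttle * 10.0 * dt                   # move each cart
def render(world):            print(world)                      # draw and play audio

world = {"p1": 0.0, "p2": 0.0, "cpu": 0.0}    # cart positions along the track
for _ in range(3):                            # just a few frames, for illustration
    frame_start = time.monotonic()
    inputs = poll_gamepads()                  # query all the gamepads
    step_physics(world, inputs, FRAME_DURATION)
    run_ai(world, FRAME_DURATION)
    render(world)
    # sleep away whatever is left of the ~17ms budget before the next frame
    time.sleep(max(0.0, FRAME_DURATION - (time.monotonic() - frame_start)))
```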
But what if we put those players in front of separate computers, miles apart from each other, connected only by the internet? You'd be right in thinking that they're still technically 'connected', but the problem is that these connections are slow - glacially so, compared to the insides of a computer. Internet bandwidth continues to increase over time, but that is mostly about making the virtual pipes 'wider' so that more data can travel at once. The actual speed at which any one given message travels has not increased much, and cannot get much quicker. Network traffic - and indeed, all information - is limited by the speed of light, and that is not changing any time soon.
We see people talking about 20ms or 30ms being a good 'ping' for online gaming, and that means it takes 20ms to get a message - even an absolutely tiny one - to the recipient and back. But if it takes 20ms to get a reply, that's too long for the 17ms frame times we mentioned above, so we can't process the input in the same frame it was generated. Even if we accepted a 33ms frame duration (for a 30Hz game), the response to your input would arrive about two-thirds of the way through the frame, leaving little time to do the actual calculations and rendering.
This led to one of the earliest lessons learned by online game developers. The simplest way to implement networked multiplayer would have been to run the game on a server, and have each player's computer run a much simpler program (sometimes called a 'thin client') that only handles input and rendering, along with the network messages required to send the input to the server and to get the rendering data back from that server. This is the model used by online text games such as MUDs.
However, those text games do not run at 30 or 60 frames per second, and are not concerned with complex physics and rendering either. For modern action game developers, even with messages travelling at close to light speed, it still took too long to get them across the internet to allow the usual input-process-output loop within a single frame duration. This meant the thin client model simply could not work for a typical action-oriented game. A different approach was needed.
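To make that concrete, here's a tiny Python illustration - nothing is really sent anywhere, the delay just simulates a good connection - of why a thin client can't keep up with an action game's frame rate:

```python
import time

# A toy illustration of the thin client problem: the client can't draw
# anything new until the round trip completes, so almost the entire frame
# budget is swallowed by waiting. The 20ms sleep simulates a good ping.

FRAME_DURATION = 1 / 60            # ~17ms budget per frame at 60Hz
ROUND_TRIP = 0.020                 # a 'good' 20ms ping

def ask_server_what_to_draw(pad_input):
    time.sleep(ROUND_TRIP)         # input travels to the server, reply travels back
    return {"cart_position": pad_input * 10.0}

for frame in range(3):
    start = time.monotonic()
    frame_data = ask_server_what_to_draw(pad_input=1.0)   # blocks for the whole round trip
    waited = (time.monotonic() - start) * 1000
    print(f"frame {frame}: waited {waited:.0f}ms of a {FRAME_DURATION * 1000:.0f}ms budget")
```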
Shared simulations
The answer was to have the 'client' - that is, the program running on the player's computer, rather than on some server somewhere - run the actual game, rather than just handling the input and output. This allows it to do all the usual calculations that a game normally has to make, without waiting 20ms or longer to get the necessary information.
The interesting thing is that while this was new for developers of racing games and shooting games, the strategy gaming world had already been doing this. A classic article named "1500 Archers on a 28.8" describes how Age of Empires was able to have massive battles playing out across the internet back when a typical modem connection could only transfer 28.8 kilobits - around 3.6 kilobytes - of data per second. The trick they used was to have each player run the whole game separately, and exchange network messages to ensure the simulations all play out the same way, i.e. 'in sync'. In their words:
the expectation was to run the exact same simulation on each machine, passing each an identical set of commands that were issued by the users at the same time. The PCs would basically synchronize their game watches in best war-movie tradition, allow players to issue commands, and then execute in exactly the same way at the same time and have identical games.
Problem solved! Or was it?
Keeping the simulations perfectly synchronised required some very careful programming, including working to a schedule of "communication turns", each 200ms long. The idea here is that each of those turns could contain orders from each player, specifying which units they wanted to move and where to, and those orders could be sent out to all players and processed at the same time. However, for them all to be processed at the same time, every player had to wait a little while for everyone else's orders for that turn to arrive. The end result was several hundred milliseconds of latency between a player clicking to give an order and seeing that order start to play out on their screen. In terms of our earlier expectations, this was real-time (in that the game world keeps marching on as we watch it) and it was consistent (in that all players were seeing that same world), but it was not especially responsive, at least not in the way that you would need from a shooter or racing game.
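A heavily simplified, single-process Python sketch of that lockstep scheme might look like this - it is purely illustrative, not the real Age of Empires code. Orders are scheduled a couple of turns into the future so they have time to reach every player, a turn only executes once everyone's orders for it have arrived, and because the simulation is deterministic, each player's copy of the world stays identical:

```python
PLAYERS = ["alice", "bob"]

def simulate_turn(world, orders_by_player):
    # Deterministic: same orders in, same world out, on every machine.
    for player in sorted(orders_by_player):          # fixed processing order matters!
        for unit, destination in orders_by_player[player]:
            world[unit] = destination

def lockstep_demo(turns=4):
    worlds = {p: {"knight": 0, "archer": 0} for p in PLAYERS}   # one copy per player
    mailbox = {}                                     # turn number -> {player: orders}

    for turn in range(turns):
        # Each player issues orders for turn + 2, giving them two whole
        # 200ms turns to cross the internet before anyone executes them.
        for p in PLAYERS:
            mailbox.setdefault(turn + 2, {})[p] = [("knight", turn * 10)]

        # Only execute this turn once orders from ALL players have arrived;
        # in a real game, this is where everyone might have to wait.
        arrived = mailbox.get(turn, {})
        if len(arrived) == len(PLAYERS):
            for p in PLAYERS:
                simulate_turn(worlds[p], arrived)

    assert worlds["alice"] == worlds["bob"]          # the simulations stayed in sync
    print(worlds["alice"])

lockstep_demo()
```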
The Core of the Problem
This starts to show us that the concept of information taking time to travel has unavoidable implications for our games. We essentially have these three ideas - responsiveness, consistency, and the need for each player to run the game in real-time - and it turns out that it's basically impossible to achieve all three simultaneously without any compromises.
- A real-time game that is instantly responsive to local player input is making changes for each player that will take some time to reach the other players, meaning the simulations can't be kept consistent - events could happen in different orders or an event that made sense on one player's machine might not make sense on another player's by the time that information arrives there.
- A real-time game that guarantees consistency must wait for consensus about the result of any action before it advances, whether that consensus is in the form of synchronised input (as with the RTS example) or in the form of authoritative game state broadcast out from a server. This means the simulations cannot be instantly responsive to local inputs.
- A game that is responsive and yet also needs to be consistent must allow the input made by one player to reach and be applied by other players before the game can proceed, to ensure that actions taken by the other players don't somehow contradict the action we already permitted to happen. This means it cannot run in real-time.
Each of these permutations has its place. We saw above how real-time strategy games (at least in the early days of the internet) would choose to be real-time and consistent at the expense of responsiveness. Compare that to a turn-based strategy game, anything from Magic: The Gathering to online chess. By giving up the real-time requirement, and only moving the game on when a player submits their turn, they can be responsive and instantly show the results of a player's move locally.
There are also online games that are real-time and responsive but less concerned about consistency. Many RPGs - especially the 'massively multiplayer' variety - are like this. It rarely matters if players are slightly out of position or playing slightly different animations as it has little effect on the actual gameplay, which is usually resolved by game rules on the server that do not have to be responsive. Many MMORPG players have seen enemies hit by a spell when they appeared to be safely outside the area of effect, or had their character die even though they just drank a healing potion, because the server had resolved the death before it received the input about the potion. While frustrating at times, it's a reasonable technical tradeoff that the genre can make.
Another category that makes concessions on consistency is that of physics-based games like Fall Guys, which inevitably end up processing client-side physics slightly differently for all players. This results in discrepancies between player positions that are usually minor, but occasionally visible and game-changing. It also includes some racing games which extrapolate vehicle positions to give the impression of zero network latency, but which means cars can 'teleport' when that extrapolated position turns out to be wrong due to player input or collisions. Sometimes it even means revising the rankings at the end of a tightly-contested race.
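That extrapolation idea - often called 'dead reckoning' - can be sketched in a few lines of Python. This is purely illustrative: between the infrequent updates we receive for a remote car, we keep it moving along its last known velocity so it never appears frozen, at the risk of a visible snap when the next real update contradicts the guess:

```python
def extrapolate(last_update, now):
    """last_update is (timestamp, position, velocity) from the most recent packet."""
    timestamp, position, velocity = last_update
    return position + velocity * (now - timestamp)   # assume it kept going as before

# The last packet said: at t=10.0s the car was at 50m, doing 40m/s...
print(extrapolate((10.0, 50.0, 40.0), now=10.1))     # ...so at t=10.1s we draw it at 54m.
# If the real driver braked or crashed, the next packet forces a correction.
```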
In practice, it's rare that any game totally commits to two of the three values while fully sacrificing the third. It's more realistic to compromise on all three to a greater or lesser extent, or even to make different trade-offs for different features.
Predicting the Future
So, we know that we have to compromise on at least one of our three core expectations, but we'll obviously want to mitigate that as far as possible. Just what sort of mitigation would a game need, in each case?
Real-time strategy games tend to 'fake' responsiveness by playing visual and audio cues to let you know immediately that your order has been accepted by the system, even if that order doesn't really start being executed until quite some time later.
Similarly, games that compromise on consistency still attempt to minimise the discrepancies, often by having a central server which acts as a tie-breaker, instructing clients to blend or teleport any out-of-position players back towards the spot that the server thinks they should be at.
What about games like our cart racer, or most shooters, or indeed fighting games? We find ourselves in a position where the genre dictates the requirement for the game to be real-time, and the competitive nature means consistency is important. But these are fast-paced action games where responsiveness is essential too. How do we solve that contradiction?
The usual answer is that such games live in a twilight zone where they essentially fake responsiveness and consistency, in such a way that it seems like the player is experiencing both, while arguably experiencing neither. This is done in the following way:
- Each player's inputs are immediately processed locally to give the appearance of responsiveness, while also being sent to a server or to the other clients;
- When they arrive on the server or other clients, the inputs are processed, and the results are sent back to the original player;
- Upon receiving the results, the original player's client checks to see if they are consistent with what happened locally (usually a few frames earlier), and if not, has to make corrections to bring the client's simulation in line with what the 'real' state should be.
The first step, where inputs are applied locally without waiting to see the exact effect, is often called "client prediction". It might be better to think of it as client assumption, because the input is applied locally on the assumption that things will play out exactly as the local simulation shows. Often, that is exactly what happens. But, when a response comes from the network and disproves the assumption, a correction needs to be made before the game proceeds, or the inconsistency between the simulations will grow to the point where the game is unplayable.
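Here's a minimal sketch of the 'apply locally, remember, and send' side of this, using a toy 1D world where an input is just a velocity - the class, the message format, and the FakeNetwork are all illustrative, not from any particular engine:

```python
import itertools

class PredictingClient:
    def __init__(self):
        self.position = 0.0            # our locally predicted state
        self.pending = []              # inputs sent but not yet confirmed remotely
        self.sequence = itertools.count()

    def handle_local_input(self, velocity, dt, network):
        seq = next(self.sequence)
        self.position += velocity * dt                 # 1. apply immediately: responsive
        self.pending.append((seq, velocity, dt))       # 2. remember it for later corrections
        network.send({"seq": seq, "velocity": velocity, "dt": dt})   # 3. send it onwards

class FakeNetwork:                     # stand-in for a real connection
    def send(self, message): print("sent", message)

client = PredictingClient()
client.handle_local_input(velocity=5.0, dt=1 / 60, network=FakeNetwork())
print(client.position)                 # already moved (~0.083) without waiting for a reply
```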
So, how are those corrections made?
Reconciliation
Some sources refer to this as "server reconciliation", the idea being that a client has to reconcile its idea of what it assumed would happen with the actual events that the server has told it about. However, the broader concept doesn't actually require a server, as it can apply in a peer-to-peer game as well.
This is arguably the most complex part of the whole process, for two reasons. Firstly, it potentially has significant effects on how the whole game is implemented because it can become necessary to store past world states, to store past inputs, to be able to do arbitrary resimulation, maybe to be able to blend entity states (or rendered states), and so on. Because the correction is usually responding to something that happened a few frames ago (since it takes time to send the inputs across the internet and get a response), it's not as simple as just overwriting some data with the 'correct' values - the local effects of the client prediction, which have been building upon the incorrect values, also have to be taken into account.
Secondly, much of this is constrained by the rules of the game and a player's specific expectations for the genre, because there is not necessarily a single correct way of resolving an inconsistency.
For example, a modern fighting game is likely to "roll back" and resimulate the whole game every frame, starting from the most recent verified state and re-applying the players' inputs received since then. The players expect extremely precise hit calculations and will not tolerate being hit when their screen clearly shows them out of range of their opponent. For that reason, most fighting games historically used what was known as 'delay-based netcode', not too dissimilar to the real-time strategy example above, giving full consistency at the price of reduced responsiveness. In recent years this has given way to 'rollback netcode', which gives that responsiveness back, at the price of a little visual consistency when corrections have to be made, plus the extra CPU power needed to perform several simulation steps within a single rendering frame.
First-person shooter games have been doing the rollback-and-resimulate approach for much longer, since at least the late 1990s, because that was the only way to get the shared simulations working in a responsive way, once the 'thin client' approach was abandoned. Unlike fighting games, they don't necessarily need to do the resimulation every frame, because direct contact between entities is very rare and the exact positioning only really matters when resolving shots fired or projectile trajectories. Similarly, even in cases of resimulation, it's not always necessary to resimulate every single entity as most will be unaffected. (However, determining whether this is true or not is a hard problem, and sometimes it's better just to err on the side of a complete resimulation to be on the safe side.)
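Continuing the toy client from the earlier prediction sketch, a rollback-and-resimulate correction might look something like this - the message format is, again, purely illustrative:

```python
# Roll back to the authoritative state the server reported (which corresponds
# to an input from a few frames ago), then re-apply every input it hasn't yet
# acknowledged, so the prediction continues from the corrected state rather
# than from the values we now know were wrong.

def reconcile(client, server_message):
    # server_message: {"last_processed_seq": int, "position": float}
    acknowledged = server_message["last_processed_seq"]

    # Rollback: adopt the server's state as the new baseline.
    client.position = server_message["position"]

    # Discard the inputs the server has already accounted for...
    client.pending = [p for p in client.pending if p[0] > acknowledged]

    # ...and resimulate the ones still in flight on top of the corrected state.
    for _seq, velocity, dt in client.pending:
        client.position += velocity * dt
```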
In most cases, this process of performing a rollback and a resimulation restores consistency - every client ends up back in the same state as all the others, excluding the local prediction of course. But that's not the end of the story.
Arbitrary decision-making
The first-person perspective used for these games, in conjunction with the 'prediction' aspect, means that there is often a very visible lack of consistency. The classic example is where Player 1 has quickly ducked behind cover, but the information about that movement hasn't reached Player 2 yet, who has just pulled the trigger with Player 1 right in the crosshairs. Should the game award the kill, in a way that Player 2 would think fair? Or should the game say the shot missed, in a way that Player 1 would think fair? Games choose to handle this differently - the Overwatch developers chose to "favor the shooter" and explained this in their video. This is controversial to some, but is not a new concept - Valve were doing it in their Source engine over two decades ago.
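One common way to 'favor the shooter' is server-side lag compensation: the server keeps a short history of where everyone has been, and judges each shot against the moment the shooter actually saw. The sketch below is an illustration of that general idea, not Overwatch's or Valve's actual code:

```python
HISTORY_LENGTH = 60                    # roughly one second of 60Hz snapshots

def record_snapshot(history, server_time, positions):
    history.append((server_time, dict(positions)))
    if len(history) > HISTORY_LENGTH:
        history.pop(0)

def resolve_shot(history, shooter_latency, server_time, target_id, hit_test):
    shooter_view_time = server_time - shooter_latency    # when the shooter actually fired
    _, positions = min(history, key=lambda snap: abs(snap[0] - shooter_view_time))
    return hit_test(positions[target_id])                # judge against the old position

history = []
record_snapshot(history, 1.00, {"p1": 0.0})    # Player 1 out in the open
record_snapshot(history, 1.05, {"p1": 5.0})    # Player 1 now behind cover
# A shot report arrives at t=1.05s from a shooter with 50ms of latency: they fired
# at t=1.00s, when Player 1 was still visible, so the server awards the hit.
print(resolve_shot(history, 0.05, 1.05, "p1", lambda pos: pos == 0.0))   # True
```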
Other types of game can be more lenient and don't need to attempt the rollback-and-resimulate approach. It was mentioned above that RPGs might be real-time and responsive but less concerned about consistency, and one effect of this is that their approach to applying corrections might just be to teleport a character back into position, or to blend its position over time. Positional blending is also popular with some racing games (though some with simpler physics models might also opt for the rollback approach if practical), and can be almost invisible to the player due to everything already being in rapid motion. But since such corrections take time to blend the entity towards the expected position, it trades off even more consistency during the blending period than if a resimulation had happened.
A hybrid approach is also not uncommon - some games might perform the rollback-and-resimulate to calculate the 'official' corrected position of an object, but apply a slow blend towards that position when rendering the object. This looks better than an abrupt correction, at the price of potentially visible glitches if the rendered position is clearly inconsistent with the gameplay (e.g. characters temporarily intersecting visually, despite their collision volumes being correctly separate).
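The blending step itself can be as simple as moving the rendered position a fraction of the way towards the corrected one each frame. A toy sketch, with an arbitrary tuning constant:

```python
BLEND_RATE = 10.0      # higher closes the gap faster, but corrections are more visible

def blend_towards(rendered, corrected, dt):
    # Move a fraction of the remaining error each frame (exponential smoothing).
    return rendered + (corrected - rendered) * min(1.0, BLEND_RATE * dt)

rendered, corrected = 0.0, 1.0         # the correction says we're really a metre ahead
for frame in range(5):
    rendered = blend_towards(rendered, corrected, 1 / 60)
    print(f"frame {frame}: drawn at {rendered:.2f}")   # creeps towards 1.0 over time
```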
Finally, any game which runs a physics model on each client has to decide how the information received from the network should affect the outcome of the local physics simulation. It's not uncommon for incoming network data to position entities in a way that causes overlaps, producing a powerful (and unwanted) repulsion force, or to cause instability when a physics constraint that is usually solved stably in one iteration of the physics engine instead gets iterated, unstably, four times a second across the internet.
What about the rest of the netcode?
The interesting thing about everything covered above is how little of it is about the low-level technical implementation, or the things that programmers tend to talk about. Writing the code to have the player clients communicate with each other across the network is deep and complex, potentially involving any or all of the following:
- socket programming
- event loops
- reliable vs unreliable messaging
- ordered vs unordered messaging
- congestion avoidance and management
- latency measurement
- clock synchronisation
- packet fragmentation and reassembly
- message serialization
- state snapshots or actor replication
- remote procedure calls (RPCs)
- ...and more!
The good news is that modern engines and frameworks have done a lot to make these lower-level topics easier, which is why this article hasn't needed to dwell on them.
The bad news is that a developer will usually still need to understand a fair bit about some of these areas, unless they are lucky enough to be working with a networking framework designed specifically for the type of game they are making. And chances are, they will still need to consider some aspects of the real-time/responsiveness/consistency trade-offs when they come to implement new features or attempt to improve the player experience.
So while it's true to say that making a networked multiplayer game is easier than ever, it is still far from easy. To those attempting it, good luck!