Extreme Bevy 3.5: Detecting Desyncs

In this part, we’ll look at how to be completely certain all of our players are seeing the exact same game state.

This post is part of a series on making a p2p web game with Rust and Bevy.

At the very end of the previous chapter, we introduced a very subtle bug that will lead to diverging game states for each player. Before I explain why, and how to fix it, we'll look at some tools from bevy_ggrs that make such bugs much easier to find: synctest sessions and desync detection through checksums.

Optional chapter

If you just want to make a proper game ASAP, you can skip this chapter, but you'll have a much better time making a game on your own if you use the techniques here to make sure your game isn't de-syncing.

If you skip it, make sure to at least fix the bug as explained in the "Fixing the bug" section.

Synctest sessions

bevy_ggrs comes with an awesome mode called a "synctest session".

It's a mode where the game runs on a single machine instead of p2p.

Every time the game advances a frame, it first rolls back a couple of frames, and then it re-simulates them and checks if the game state is the same as the previous time it was simulated.
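To make the idea a bit more concrete, here is a tiny std-only sketch of what a synctest loop does conceptually. It is not how ggrs implements it; State, World, step and run_synctest are all made-up names for illustration, and the "bug" is simulated by letting the step depend on a counter that isn't part of the snapshot.

```rust
/// Toy game state; everything in here is saved and restored on rollback.
#[derive(Clone, Copy)]
struct State {
    x: i64,
}

/// Hypothetical world: `state` is snapshotted, `unsaved_calls` is not.
struct World {
    state: State,
    unsaved_calls: i64,
}

/// Trivial checksum for the toy state.
fn checksum(state: &State) -> u64 {
    state.x as u64
}

/// Advances one frame. With `buggy == true` the result depends on data that
/// is not part of the snapshot, so re-simulating a frame gives another result.
fn step(world: &mut World, input: i64, buggy: bool) {
    world.unsaved_calls += 1;
    world.state.x += input;
    if buggy {
        world.state.x += world.unsaved_calls;
    }
}

/// Runs `frames` frames. After each one, rolls back `check_distance` frames,
/// re-simulates them, and compares checksums with the first simulation.
/// Returns the first frame whose checksum differed, if any.
fn run_synctest(frames: usize, check_distance: usize, buggy: bool) -> Option<usize> {
    let mut world = World { state: State { x: 0 }, unsaved_calls: 0 };
    // snapshots[f] is the state at the start of frame f
    let mut snapshots: Vec<State> = vec![world.state];
    // checksums[f] is the checksum after simulating frame f the first time
    let mut checksums: Vec<u64> = Vec::new();
    // deterministic fake input, so both simulations see the same input
    let input_for = |frame: usize| (frame % 3) as i64;

    for frame in 0..frames {
        step(&mut world, input_for(frame), buggy);
        checksums.push(checksum(&world.state));
        snapshots.push(world.state);

        if frame + 1 >= check_distance {
            let start = frame + 1 - check_distance;
            world.state = snapshots[start]; // roll back
            for f in start..=frame {
                step(&mut world, input_for(f), buggy); // re-simulate
                if checksum(&world.state) != checksums[f] {
                    return Some(f);
                }
            }
        }
    }
    None
}

fn main() {
    println!("deterministic step: {:?}", run_synctest(100, 2, false));
    println!("buggy step:         {:?}", run_synctest(100, 2, true));
}
```

The deterministic step never produces a mismatch, while the buggy one is caught as soon as the first rollback happens. That's the whole trick: anything that affects gameplay but isn't restored on rollback shows up as a checksum mismatch.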

To use it, we need to only add local players, and we need to use session_builder.start_synctest_session() instead of start_p2p_session().

Since it's running locally there is also no point in starting a Matchbox socket.

Arguments

So we need to be able to run our game in two quite different ways. For this, we will use clap, which is an amazing crate for reading command line arguments. Unfortunately, it doesn't work on web out of the box. We'll get back to that, but for now, since all our other dependencies support native, we'll simply run the game in native mode.

Going native (temporarily)

To run the game in native mode you could either specify the target:

# linux
cargo run --target x86_64-unknown-linux-gnu
# windows
cargo run --target x86_64-pc-windows-msvc

Or you can simply remove

[build]
target = "wasm32-unknown-unknown"

from .cargo/config.toml that we added back in part 1.

And then just do cargo run. The game should now open in a normal window again, and everything should work as before. You can even play with other players on the web.

Back to clap!

Add clap to your dependencies in Cargo.toml, and include the derive feature:

clap = { version = "4.4", features = ["derive"] }

And create a new module, args.rs, with the following in it:

use clap::Parser;

#[derive(Parser, Debug)]
pub struct Args {
    /// runs the game in synctest mode
    #[clap(long)]
    pub synctest: bool,
}

It's a very simple struct. We derive clap::Parser for it, which makes it possible to parse it from command line args.

Now, in main.rs, we first add the module:

mod args; // NEW
mod components;
mod input;

Then we import our new struct, and clap::Parser:

use args::Args; // NEW
use bevy::{prelude::*, render::camera::ScalingMode};
use bevy_asset_loader::prelude::*;
use bevy_ggrs::*;
use bevy_matchbox::prelude::*;
use clap::Parser; // NEW
use components::*;
use input::*;

And then we can use it at the top of main to parse the command arguments:

fn main() {
    let args = Args::parse();
    eprintln!("{args:?}");
    // ...
}

If you run the game now, you should see:

$ cargo run
   Compiling extreme_bevy v0.1.0 (C:\Users\Johan\dev\extreme_bevy)
    Finished dev [unoptimized + debuginfo] target(s) in 7.42s
     Running `target\x86_64-pc-windows-msvc\debug\extreme_bevy.exe`
Args { synctest: false }

In order to pass command line arguments to our app through cargo run, we need to add -- so cargo knows they're not meant for cargo itself. We can now run:

$ cargo run -- --help
    Finished dev [unoptimized + debuginfo] target(s) in 0.42s
     Running `target\x86_64-pc-windows-msvc\debug\extreme_bevy.exe --help`
Usage: extreme_bevy.exe [OPTIONS]

Options:
      --synctest  runs the game in synctest mode
  -h, --help      Print help

And it shows our shiny new command line option. Let's try it:

cargo run -- --synctest
    Finished dev [unoptimized + debuginfo] target(s) in 0.43s
     Running `target\x86_64-pc-windows-msvc\debug\extreme_bevy.exe --synctest`
Args { synctest: true }

It works!

Back to synctest sessions

Ok, so now we have a bool we can easily use to tell our program to do something differently, depending on how we launch it.

We will start with the easy part, which is just making sure we don't start a socket or a p2p session when we want to start a synctest session.

While we could just add if statements around where we add the wait_for_players and start_matchbox_socket systems to the schedule, the better way is to use Bevy's run conditions.

Run conditions are special systems that return a bool that can be used to determine whether other systems should run. Let's define some simple conditions that determine whether we are running in synctest or p2p mode:

fn synctest_mode(args: Res<Args>) -> bool {
    args.synctest
}

fn p2p_mode(args: Res<Args>) -> bool {
    !args.synctest
}

For this to work, we also need to derive Resource for Args:

#[derive(Parser, Resource, Debug, Clone)] // changed
pub struct Args {

...and add it to our app's resources:

    App::new()
        .insert_resource(args) // NEW
        .init_state::<GameState>()

Now we can use our run condition to make sure a socket is only created when running in p2p mode:

        .add_systems(
            OnEnter(GameState::Matchmaking),
            (setup, start_matchbox_socket.run_if(p2p_mode)), // changed
        )

And that we only wait for players (and start p2p sessions) in p2p mode:

                wait_for_players.run_if(in_state(GameState::Matchmaking).and_then(p2p_mode)),

wait_for_players already had a run condition (in_state), so we use and_then to chain the run conditions, so it will only run if both are true.

If you run the game now, it should start and show the grid, but be stuck in the Matchmaking state.

Now we just need to write the system that starts a synctest session:

fn start_synctest_session(mut commands: Commands, mut next_state: ResMut<NextState<GameState>>) {
    info!("Starting synctest session");
    let num_players = 2;

    let mut session_builder = ggrs::SessionBuilder::<Config>::new().with_num_players(num_players);

    for i in 0..num_players {
        session_builder = session_builder
            .add_player(ggrs::PlayerType::Local, i)
            .expect("failed to add player");
    }

    let ggrs_session = session_builder
        .start_synctest_session()
        .expect("failed to start session");

    commands.insert_resource(bevy_ggrs::Session::SyncTest(ggrs_session));
    next_state.set(GameState::InGame);
}

As you can see, it's quite similar to the last part of wait_for_players where we start the p2p session. The biggest difference is that all the players are now PlayerType::Local, and we use start_synctest_session() instead of start_p2p_session(socket).

Now we just need to add the system to our schedule, behind the synctest_mode run condition:

        .add_systems(
            Update,
            (
                (
                    wait_for_players.run_if(p2p_mode),
                    start_synctest_session.run_if(synctest_mode),
                )
                    .run_if(in_state(GameState::Matchmaking)),
                camera_follow.run_if(in_state(GameState::InGame)),
            ),
        )

If you now start the game with cargo run -- --synctest, you'll see that the game starts, and you control both players at the same time.

If you run it without --synctest the game should still run as before.

So where's the desync?

If you shoot one player, they will die, and the other will continue to run around.

Now, I told you at the start of this chapter that we'd introduced a desync when players die, and now we run a synctest session and kill players, so why don't we get an error?

The reason is that we need to tell bevy_ggrs how to determine whether the game state is the same or not. By default, it only does some very basic checks on which entities exist, but not the state of the components.

To include the state of components, we need to specify to bevy_ggrs which components should be used to calculate a frame checksum, and how to do so.

Our most important state is the positions of players and bullets, which are stored in the Transform component, so we'll tell bevy_ggrs how to create checksums for Transforms:

        .rollback_component_with_clone::<Transform>()
        .rollback_component_with_copy::<BulletReady>()
        .rollback_component_with_copy::<MoveDir>()
        .checksum_component::<Transform>(checksum_transform) // new

The .checksum_component method takes a function as its argument that takes a reference to a single component, and returns a u64 checksum for it. Let's implement it in components.rs.

pub fn checksum_transform(transform: &Transform) -> u64 {
    // todo: produce some u64 based on the value of transform
    todo!()
}

Now, the easiest way to produce such a number is to use a hash function.

Hashing in Rust normally looks like this:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    let value = "data to hash";
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    let hash: u64 = hasher.finish();

However, there are a few problems with this approach. Suppose we try it for Bevy Transforms:

pub fn checksum_transform(transform: &Transform) -> u64 {
    let mut hasher = DefaultHasher::new();
    transform.hash(&mut hasher);
    hasher.finish()
}

Most obviously, we get a compile error:

error[E0599]: the method `hash` exists for reference `&Transform`, but its trait bounds were not satisfied
  --> src\components.rs:23:15
   |
23 |     transform.hash(&mut hasher);
   |               ^^^^ method cannot be called on `&Transform` due to unsatisfied trait bounds
   |
  ::: C:\Users\Johan\.cargo\registry\src\index.crates.io-6f17d22bba15001f\bevy_transform-0.12.0\src\components\transform.rs:41:1
   |
41 | pub struct Transform {
   | -------------------- doesn't satisfy `bevy::prelude::Transform: Hash`
   |
   = note: the following trait bounds were not satisfied:
           `bevy::prelude::Transform: Hash`
           which is required by `&bevy::prelude::Transform: Hash`

For more information about this error, try `rustc --explain E0599`.

The problem is that Transform doesn't implement the Hash trait, which is needed for the hasher to know how to hash the value.

The reason for this is that Transform is built up of various f32 values, which don't implement Hash either. This is by design, for complicated reasons, mostly to do with NaN values and whether -0.0 == 0.0 or not. For our use case it's fine, though. We can still implement hashing ourselves by running hash on the underlying bits of the f32s:

    let mut hasher = DefaultHasher::new();

    assert!(
        transform.is_finite(),
        "Hashing is not stable for NaN f32 values."
    );

    transform.translation.x.to_bits().hash(&mut hasher);
    transform.translation.y.to_bits().hash(&mut hasher);
    transform.translation.z.to_bits().hash(&mut hasher);

    transform.rotation.x.to_bits().hash(&mut hasher);
    transform.rotation.y.to_bits().hash(&mut hasher);
    transform.rotation.z.to_bits().hash(&mut hasher);
    transform.rotation.w.to_bits().hash(&mut hasher);

    // skip transform.scale as it's not used for gameplay

    hasher.finish()

For good measure, we put in an assert! that the transform .is_finite(), which means it doesn't contain NaN or infinite values, as we don't want those in our game anyway.

While it now compiles, there is one more problem with our code: it's using DefaultHasher. If we look at its documentation, we see:
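If the float quirks mentioned above feel a bit abstract, here is a small std-only sketch of them. The same_bits helper is a made-up name; it just compares the bit patterns, which is effectively what a checksum built from to_bits() looks at.

```rust
/// Returns true if two floats have the exact same bit pattern, which is what
/// a checksum built from `to_bits()` effectively compares.
fn same_bits(a: f32, b: f32) -> bool {
    a.to_bits() == b.to_bits()
}

fn main() {
    // 0.0 and -0.0 compare equal as floats, but their bits differ (sign bit),
    // so they would contribute differently to a bit-based checksum.
    assert!(0.0_f32 == -0.0_f32);
    assert!(!same_bits(0.0, -0.0));

    // NaN never compares equal to itself, and neither NaN nor infinity
    // passes an `is_finite()` check.
    let nan = f32::NAN;
    assert!(nan != nan);
    assert!(!nan.is_finite());
    assert!(!f32::INFINITY.is_finite());

    // Ordinary finite values behave as you'd expect.
    assert!(same_bits(1.5, 1.5));
    println!("all float checks passed");
}
```

For a rollback game, comparing bits is arguably what we want: if two peers end up with 0.0 and -0.0 respectively, their simulations did something different, and we'd like to know about it.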

/// The internal algorithm is not specified, and so it and its hashes should
/// not be relied upon over releases.

This is not ideal. Instead, we'll use bevy_ggrs's checksum_hasher function to create our hasher:

use bevy_ggrs::checksum_hasher;

pub fn checksum_transform(transform: &Transform) -> u64 {
    let mut hasher = checksum_hasher();

Phew! That's quite a lot just to ensure our transforms are the same.

Now finally, if you run the game in a synctest session, you should get a warning just as one player kills the other:
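If you're curious what a hasher with a fully specified algorithm looks like, here is a minimal sketch using FNV-1a (64-bit). To be clear, this is a stand-in chosen for illustration: I'm not claiming it's what checksum_hasher uses internally, and Fnv1a and checksum_position are made-up names. It also feeds the float bits in as little-endian bytes, so the result doesn't depend on the platform's byte order.

```rust
use std::hash::Hasher;

/// FNV-1a, 64-bit. The algorithm and its constants are fixed, so the same
/// bytes always give the same hash, regardless of Rust release or platform.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &byte in bytes {
            self.0 ^= u64::from(byte);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Hypothetical checksum for a 2D position, hashing the raw float bits.
fn checksum_position(x: f32, y: f32) -> u64 {
    let mut hasher = Fnv1a::new();
    // Explicit little-endian bytes instead of `.hash()`, which would use
    // the native byte order of the machine.
    hasher.write(&x.to_bits().to_le_bytes());
    hasher.write(&y.to_bits().to_le_bytes());
    hasher.finish()
}

fn main() {
    println!("{:X}", checksum_position(1.0, 2.0));
}
```

In the actual game we'll stick with checksum_hasher, since it's what bevy_ggrs provides for this purpose.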

2023-11-16T10:14:29.575124Z  WARN bevy_ggrs::schedule_systems: Detected checksum mismatch during rollback on frame 269.

This tells us that the Transform components were not the same the first and second (or third) time frame 269 was simulated.

So we have a desync bug, and we should look into it!

Fixing the bug

So if we look into kill_players:

fn kill_players(
    mut commands: Commands,
    players: Query<(Entity, &Transform), (With<Player>, Without<Bullet>)>,
    bullets: Query<&Transform, With<Bullet>>,
) {
    for (player, player_transform) in &players {
        for bullet_transform in &bullets {
            let distance = Vec2::distance(
                player_transform.translation.xy(),
                bullet_transform.translation.xy(),
            );
            if distance < PLAYER_RADIUS + BULLET_RADIUS {
                commands.entity(player).despawn_recursive();
            }
        }
    }
}

We see that what happens when one player kills another is that we despawn the player using despawn_recursive. It's reasonable to assume the desync is related to this call. And in fact, if we do an experiment and remove the despawn_recursive call temporarily, we see that we no longer desync when we fire at the other player... but we also don't really kill them. We effectively just removed killing. It might seem stupid, but it can be good to do such sanity checking just to make sure we're looking in the right place.

Okay, so why does it desync when despawning?

If we look at the queries for our player gameplay systems, we see that they operate on entities with components like Transform, MoveDir, BulletReady and... Player!

While we registered the other components for rollback, we didn't register Player. It might seem unnecessary to register a component such as Player which doesn't change during the duration of the game. However, when bevy_ggrs restores entities we despawned, it only restores the components we've registered. So when it restores the player we've killed, Player isn't added to it, and the next time it's looking for players to kill (With<Player>), it doesn't find the player, and the player is not killed, and survives, creating a desync.
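Here is a toy model of that behavior, using nothing but std. It has nothing to do with how Bevy or bevy_ggrs actually store components; Entity, snapshot, restore_after_despawn and has_player are made-up names, and an "entity" is just a map from component name to a value.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy entity: component name -> value. Not how Bevy stores components.
type Entity = BTreeMap<&'static str, i32>;

/// Saves only the components whose names are registered for rollback.
fn snapshot(entity: &Entity, registered: &BTreeSet<&'static str>) -> Entity {
    entity
        .iter()
        .filter(|(name, _)| registered.contains(*name))
        .map(|(name, value)| (*name, *value))
        .collect()
}

/// Spawns a player, snapshots it, "despawns" it, and returns what a rollback
/// would restore from the snapshot.
fn restore_after_despawn(registered: &[&'static str]) -> Entity {
    let registered: BTreeSet<&'static str> = registered.iter().copied().collect();

    let mut player = Entity::new();
    player.insert("Transform", 7);
    player.insert("MoveDir", 1);
    player.insert("Player", 0);

    let saved = snapshot(&player, &registered);
    drop(player); // the kill: the entity is despawned
    saved // the rollback: only what was in the snapshot comes back
}

/// Stand-in for a `With<Player>` query filter.
fn has_player(entity: &Entity) -> bool {
    entity.contains_key("Player")
}

fn main() {
    let without = restore_after_despawn(&["Transform", "MoveDir"]);
    let with = restore_after_despawn(&["Transform", "MoveDir", "Player"]);
    println!("Player not registered -> found by query: {}", has_player(&without));
    println!("Player registered     -> found by query: {}", has_player(&with));
}
```

With Player missing from the registered set, the restored entity still has its Transform, but a query filtering on Player no longer finds it.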

We could easily solve this desync by adding Player for rollback as well:

        .rollback_component_with_copy::<Player>()

And we also need to make sure Player derives Clone and Copy:

#[derive(Component, Clone, Copy)]
pub struct Player {
    pub handle: usize,
}

And if you run it now, our desync is gone.

Problem solved?

However, while we fixed the desync, we're still left with another subtle bug.

Remember that only registered components are restored when rollback entities are restored? This also goes for all the other components on the player, which are not necessarily important for gameplay, but are important for the players playing the game.

If we look at spawn_players we see that we add a SpriteBundle, which contains a bunch of components necessary to render the entity on the screen (Sprite, Visibility etc.). This means that if a rollback happens right after a kill, which turns out not to be a kill after all, due to a last minute evasion by the other player, the other player would lose their sprite components and be invisible!

That's quite an unfair advantage, and not something we want in our game.

So how do we solve it?

The easy way out is to simply register all the other components from SpriteBundle as well:

        .rollback_component_with_clone::<Sprite>()
        .rollback_component_with_clone::<GlobalTransform>()
        .rollback_component_with_clone::<Handle<Image>>()
        .rollback_component_with_clone::<Visibility>()
        .rollback_component_with_clone::<InheritedVisibility>()
        .rollback_component_with_clone::<ViewVisibility>()

There are other, perhaps better, solutions to this problem, like simply avoiding despawning, but we'll settle for this for now to get on with the desync detection chapter :)

Are we safe?

So we've fixed one nasty, hard-to-find bug, using a synctest session. It would have been really hard to spot by running the game in p2p mode on one machine, since it would only happen if there was a rollback exactly when a player was killed.

Furthermore, rollbacks only happen if the latency is greater than the input_delay, which we've currently set to 2 frames. So for sessions on the same machine, this is rarely the case.
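As a back-of-the-envelope model of that rule, consider the sketch below. It assumes a fixed latency measured in whole frames, which real networks don't give you, and it treats "arrives exactly on the frame it applies to" as in time, following the statement above. All function names are made up for illustration.

```rust
/// Frame on which an input pressed on `pressed_frame` is actually applied.
fn applied_frame(pressed_frame: u32, input_delay: u32) -> u32 {
    pressed_frame + input_delay
}

/// Frame on which the remote peer receives that input, assuming a fixed
/// latency measured in whole frames (a simplification).
fn arrival_frame(pressed_frame: u32, latency_frames: u32) -> u32 {
    pressed_frame + latency_frames
}

/// True if the remote peer has to simulate the frame before the real input
/// arrives. It then has to predict the input, and roll back if it guessed wrong.
fn must_predict(latency_frames: u32, input_delay: u32) -> bool {
    let pressed = 100; // any frame works, only the difference matters
    arrival_frame(pressed, latency_frames) > applied_frame(pressed, input_delay)
}

fn main() {
    // Same machine, input delay of 2: inputs arrive well in time.
    println!("latency 1, delay 2: {}", must_predict(1, 2));
    // Slow connection: the input is late, so the peer has to predict.
    println!("latency 3, delay 2: {}", must_predict(3, 2));
    // No input delay at all: any latency forces a prediction.
    println!("latency 1, delay 0: {}", must_predict(1, 0));
}
```

Note that predicting isn't the same as rolling back. A rollback is only needed when the prediction turns out to be wrong, which is why changing input is what triggers it.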

We could intentionally provoke rollbacks on p2p sessions by temporarily lowering the input_delay to 0. In fact, let's expose the input_delay as an argument:

// args.rs
#[derive(Parser, Resource, Debug, Clone)]
pub struct Args {
    /// runs the game in synctest mode
    #[clap(long)]
    pub synctest: bool,
    /// sets a custom input delay
    #[clap(long, default_value = "2")] // new
    pub input_delay: usize, // new
}

// main.rs
fn wait_for_players(
    mut commands: Commands,
    mut socket: ResMut<MatchboxSocket<SingleChannel>>,
    mut next_state: ResMut<NextState<GameState>>,
    args: Res<Args>, // new
) {
    // ...

    let mut session_builder = ggrs::SessionBuilder::<Config>::new()
        .with_num_players(num_players)
        .with_input_delay(args.input_delay); // changed

Now we can run the game in p2p mode with zero input delay using:

cargo run -- --input-delay 0

Now every input change (pressing or releasing a button) should cause a rollback on the other client.

However, even with this in place, and if you're lucky enough to trigger a rollback on the exact right (wrong) frame, it can be hard to tell that a desync really happened.

To help with that, it's also possible to make p2p sessions exchange checksums for their confirmed frames (frames where we have complete information about inputs for all players).

We can enable it simply by calling a method on the builder:

    // create a GGRS P2P session
    let mut session_builder = ggrs::SessionBuilder::<Config>::new()
        .with_num_players(num_players)
        .with_desync_detection_mode(DesyncDetection::On { interval: 1 }) // new
        .with_input_delay(args.input_delay);

This tells ggrs to send checksums on every frame. Checksums are quite small (128 bits), and it's nice to be on the safe side.

This sends the checksums across peers, but to actually get notified when a mismatch is detected, we need to listen for GGRS events. Let's add a tiny system for that:

        .add_systems(
            Update,
            (
                (
                    wait_for_players.run_if(p2p_mode),
                    start_synctest_session.run_if(synctest_mode),
                )
                    .run_if(in_state(GameState::Matchmaking)),
                (camera_follow, handle_ggrs_events).run_if(in_state(GameState::InGame)), // changed
            ),
        )

// ...

fn handle_ggrs_events(mut session: ResMut<Session<Config>>) {
    match session.as_mut() {
        Session::P2P(s) => {
            for event in s.events() {
                match event {
                    GgrsEvent::Disconnected { .. } | GgrsEvent::NetworkInterrupted { .. } => {
                        warn!("GGRS event: {event:?}")
                    }
                    GgrsEvent::DesyncDetected {
                        local_checksum,
                        remote_checksum,
                        frame,
                        ..
                    } => {
                        error!("Desync on frame {frame}. Local checksum: {local_checksum:X}, remote checksum: {remote_checksum:X}");
                    }
                    _ => info!("GGRS event: {event:?}"),
                }
            }
        }
        _ => {}
    }
}

For good measure, we put in some warnings for other kinds of ggrs events.

If you run the game, everything should be fine (we did fix the bug after all).

We can temporarily re-introduce it by commenting out the Player rollback:

        // .rollback_component_with_copy::<Player>()

If you run the game a few times with --input-delay 0, you should eventually be able to trigger a rollback across the despawn, and you'll get an error:

2023-11-16T11:58:28.885165Z ERROR extreme_bevy: Desync on frame 176. Local checksum: D908637C0CE1597F, remote checksum: DE670F6C96BB4E38

So we've detected the bug through an alternative way as well.
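Conceptually, the p2p desync detection boils down to comparing per-frame checksums between peers. Here is a std-only sketch of that idea, where the interval parameter plays the same role as the one in DesyncDetection::On. It is not ggrs's implementation; checksums_for and first_desync are made-up names, and the checksums are just fake numbers.

```rust
use std::collections::BTreeMap;

/// Builds fake per-frame checksums for a peer. If `diverge_from` is set, the
/// checksums differ from the "correct" ones from that frame onwards.
fn checksums_for(frames: u32, diverge_from: Option<u32>) -> BTreeMap<u32, u64> {
    (0..frames)
        .map(|frame| {
            let base = u64::from(frame) * 3;
            let sum = match diverge_from {
                Some(start) if frame >= start => base + 1,
                _ => base,
            };
            (frame, sum)
        })
        .collect()
}

/// Compares local and remote checksums for every `interval`-th frame and
/// returns (frame, local, remote) for the first mismatch, if any.
fn first_desync(
    local: &BTreeMap<u32, u64>,
    remote: &BTreeMap<u32, u64>,
    interval: u32,
) -> Option<(u32, u64, u64)> {
    let interval = interval.max(1); // guard against division by zero
    for (&frame, &local_sum) in local {
        if frame % interval != 0 {
            continue; // no checksum was exchanged for this frame
        }
        if let Some(&remote_sum) = remote.get(&frame) {
            if local_sum != remote_sum {
                return Some((frame, local_sum, remote_sum));
            }
        }
    }
    None
}

fn main() {
    let local = checksums_for(10, None);
    let remote = checksums_for(10, Some(5)); // remote goes wrong on frame 5
    println!("interval 1: {:?}", first_desync(&local, &remote, 1));
    println!("interval 2: {:?}", first_desync(&local, &remote, 2));
}
```

With an interval of 1, the mismatch is reported on the frame where it first occurs. With a larger interval it's still caught, just a bit later, which is the trade-off you make for sending fewer checksums.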

Done?

While synctest sessions seemed superior in this case, it can be really nice to use both modes regularly. Not all desync bugs can be detected through synctest sessions. Also, it's really nice to be able to detect these issues during gameplay, so you could apologize to the user and/or resync the game through other means. That's a topic for later, though.

So our bug was that rollback restored the player entity without the Player marker component, and that caused players that rolled back to desync with players that didn't...

One lesson we learned is that we should always register marker components for things that can be despawned. So... what about the bullets?

Bullets also have marker components, which we didn't register back when we started despawning them just the same way as we started despawning players.

This means rolling back will restore bullets without the Bullet component. That means if we're unlucky with when a rollback occurs, bullets may sometimes be "stuck" and remain after a new round starts.

Unfortunately, synctest sessions don't currently catch this specific desync. This may be fixed in the future, though. Desync detection in p2p sessions would catch it, but the desync would only happen if a rollback happened just as a new round starts, so it's hard to catch.

This bug doesn't actually lead to a gameplay desync, but we don't want old bullets littering the screen.

In any case, the fix is the same: register the Bullet component for rollback:

// components.rs
#[derive(Component, Clone, Copy)]
pub struct Bullet;

// main.rs
        .rollback_component_with_copy::<Bullet>()

Conclusion

We've seen some of the ugly sides of working with (non-)deterministic rollback.

Hopefully, you've learned a lot and now have the tools you need to more easily solve problems whenever something breaks (because it definitely will).

We can now be happy that the bug is gone, feel a little bit more confident our game is working, and we'll continue with more exciting stuff.

In the next part we'll add scoring, respawning, and some basic UI.

Reference implementation

Diff for this part

Also, a big thanks to ConnorBP for spotting the bug in part 3 and leaving a comment about it!
