How webcam-based cube timing works

Published June 18, 2026

For decades, the canonical way to time a Rubik’s cube solve has been a stackmat — a pressure-sensitive pad you slap your hands on to start and stop. It’s accurate, official, and costs about fifty bucks. If you have one, great. If you don’t, swift is what happens when you ask: can a webcam do this?

It turns out it can, and the moving parts are simpler than you’d think.

What the model actually sees

The brain of the operation is Google’s MediaPipe HandLandmarker, a small machine-learning model that ships as ~10 MB of WASM and weights. You feed it a frame; it tells you where up to two hands are and, for each hand, the 3D position of 21 specific points — the tips of each finger, the knuckles, the wrist. That’s it. No image segmentation, no skeleton inference, no “what is this person doing” — just landmarks.

The model is good. Not perfect, but the kind of good where you stop thinking about edge cases unless you’re hunting for them. It runs at 60 frames per second on a modest laptop and the WebGL backend means the GPU does the work.

What it doesn’t do is interpret. “Palms down, getting ready to inspect” is not a concept it has. That’s the second half of the pipeline.

Turning points into gestures

From the 21 landmarks per hand, you can derive everything you need with a few lines of vector math. The key signals:

  • Finger bend angles at the PIP joint (the middle knuckle of each finger). Straight finger → 180°. Curled around a cube → 90° or less.
  • Palm normal — the vector pointing out of the back of the hand, computed from the wrist, index-base, and pinky-base landmarks. When your palm is flat on the table, the normal points up; when you grip the cube, it points roughly forward.
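
Here’s a rough sketch of both signals in TypeScript. It’s illustrative, not swift’s actual source. The landmark indices follow MediaPipe’s hand model (0 is the wrist, 5 the index base, 17 the pinky base), while the helper names (jointAngleDeg, pipBendAngles, palmNormal) are made up for this example. The thumb has no PIP joint, so treating its middle joint as the equivalent is an assumption.

```typescript
// Minimal 3D point; MediaPipe landmarks carry x, y, z fields like this.
interface Vec3 { x: number; y: number; z: number }

const sub = (a: Vec3, b: Vec3): Vec3 => ({ x: a.x - b.x, y: a.y - b.y, z: a.z - b.z });
const dot = (a: Vec3, b: Vec3): number => a.x * b.x + a.y * b.y + a.z * b.z;
const cross = (a: Vec3, b: Vec3): Vec3 => ({
  x: a.y * b.z - a.z * b.y,
  y: a.z * b.x - a.x * b.z,
  z: a.x * b.y - a.y * b.x,
});
const norm = (a: Vec3): number => Math.sqrt(dot(a, a));

// Landmark indices from the MediaPipe hand model.
const WRIST = 0;
const INDEX_MCP = 5;
const PINKY_MCP = 17;

// [joint before, middle joint, joint after] for each finger.
// The first triple is the thumb (assumption: its middle joint stands in for a PIP).
const FINGER_JOINTS: ReadonlyArray<readonly [number, number, number]> = [
  [2, 3, 4],    // thumb
  [5, 6, 7],    // index
  [9, 10, 11],  // middle
  [13, 14, 15], // ring
  [17, 18, 19], // pinky
];

// Interior angle at `mid` in degrees: 180 for a straight finger, smaller when curled.
function jointAngleDeg(prev: Vec3, mid: Vec3, next: Vec3): number {
  const u = sub(prev, mid);
  const v = sub(next, mid);
  const denom = norm(u) * norm(v);
  if (denom === 0) return 180; // degenerate input; arbitrarily treated as straight
  const cosine = Math.min(1, Math.max(-1, dot(u, v) / denom));
  return (Math.acos(cosine) * 180) / Math.PI;
}

// One bend angle per finger, in the order listed in FINGER_JOINTS.
function pipBendAngles(hand: Vec3[]): number[] {
  return FINGER_JOINTS.map(([before, mid, after]) =>
    jointAngleDeg(hand[before], hand[mid], hand[after]),
  );
}

// Unit vector perpendicular to the palm plane spanned by wrist, index base, pinky base.
function palmNormal(hand: Vec3[]): Vec3 {
  const n = cross(sub(hand[INDEX_MCP], hand[WRIST]), sub(hand[PINKY_MCP], hand[WRIST]));
  const len = norm(n) || 1; // avoid dividing by zero on degenerate input
  return { x: n.x / len, y: n.y / len, z: n.z / len };
}
```

The sign of the normal flips between left and right hands in this sketch, which is one reason to compare its absolute y component rather than the raw value.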

Two gestures cover the whole timer:

  • Palms-down: all five fingers extended (mean bend > 170°) AND the palm normal pointing vertically (|ny| > 0.55 in world coordinates).
  • Grip: at least one finger curled enough to suggest holding the cube (mean bend < 166°).
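
As a sketch, the two rules come down to a few comparisons. The thresholds (170°, 166°, 0.55) are the ones quoted above. Everything else is assumed for illustration: the function name classifyHand, the per-hand scope, and reading “mean bend” as the average over whatever per-finger angles you pass in.

```typescript
type Gesture = "palms-down" | "grip" | "none";

// Thresholds quoted in the text.
const PALMS_DOWN_MIN_BEND_DEG = 170;
const GRIP_MAX_BEND_DEG = 166;
const PALM_VERTICAL_MIN_ABS_NY = 0.55;

function meanOf(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// bendAnglesDeg: one bend angle per finger, in degrees (180 = straight).
// normalY: y component of the unit palm normal in world coordinates.
function classifyHand(bendAnglesDeg: number[], normalY: number): Gesture {
  if (bendAnglesDeg.length === 0) return "none";
  const meanBend = meanOf(bendAnglesDeg);
  if (meanBend > PALMS_DOWN_MIN_BEND_DEG && Math.abs(normalY) > PALM_VERTICAL_MIN_ABS_NY) {
    return "palms-down";
  }
  if (meanBend < GRIP_MAX_BEND_DEG) {
    return "grip";
  }
  // Flat fingers with a tilted palm, or a mean bend between the two thresholds.
  return "none";
}
```

In this sketch a mean bend between 166° and 170° matches neither gesture, so a hand hovering near one threshold can’t flip straight from one gesture to the other.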

Everything else — what state to enter, whether to start the inspection countdown, whether to stop the solve — is just bookkeeping on top of those two signals.

Why world coordinates matter

Early versions of swift used image-space landmarks for everything. The problem: the same physical hand pose reads differently depending on where in the camera frame your hand sits. Move your hand left and the bend angles shift slightly. Move it up and the reading can swing 10-15° on some fingers, enough to flip palms-down off and on at random.

MediaPipe also gives you world landmarks, anchored to the hand itself regardless of where it sits in the frame. swift uses world coordinates for orientation (the palm normal) and image coordinates for relative distances (finger curl). That split fixed almost all the flicker.
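
One way to picture the split, as a sketch only: orientation is read from the world landmarks, and a relative-distance curl measure is read from the image landmarks. The HandObservation shape and the function names are hypothetical, and the reach ratio is a stand-in for “relative distance”, not necessarily the measure swift uses. Image landmarks are normalized to the frame (x and y in 0..1), so the sketch takes the frame’s aspect ratio as a parameter before comparing distances.

```typescript
interface Landmark3 { x: number; y: number; z: number }

// One hand with both landmark sets (hypothetical wrapper type).
interface HandObservation {
  image: Landmark3[]; // normalized to the frame: x and y in [0, 1]
  world: Landmark3[]; // metric coordinates anchored to the hand
}

const WRIST_IDX = 0;
const INDEX_MCP_IDX = 5;
const INDEX_TIP_IDX = 8;
const PINKY_MCP_IDX = 17;

// Orientation from world landmarks: y component of the unit palm normal.
function palmNormalY(world: Landmark3[]): number {
  const w = world[WRIST_IDX];
  const a = world[INDEX_MCP_IDX];
  const b = world[PINKY_MCP_IDX];
  const ax = a.x - w.x, ay = a.y - w.y, az = a.z - w.z;
  const bx = b.x - w.x, by = b.y - w.y, bz = b.z - w.z;
  const nx = ay * bz - az * by;
  const ny = az * bx - ax * bz;
  const nz = ax * by - ay * bx;
  const len = Math.hypot(nx, ny, nz);
  return len === 0 ? 0 : ny / len;
}

// Relative distance from image landmarks: how far the index tip sits from the
// wrist, measured in units of the wrist-to-index-base length. Because it is a
// ratio, the hand's apparent size in the frame cancels out.
function indexReachRatio(image: Landmark3[], aspect: number): number {
  const dist = (p: Landmark3, q: Landmark3): number =>
    Math.hypot((p.x - q.x) * aspect, p.y - q.y); // aspect = frame width / height
  const palmLength = dist(image[WRIST_IDX], image[INDEX_MCP_IDX]);
  if (palmLength === 0) return 0;
  return dist(image[WRIST_IDX], image[INDEX_TIP_IDX]) / palmLength;
}

function readSignals(hand: HandObservation, aspect: number) {
  return {
    normalY: palmNormalY(hand.world),
    indexReach: indexReachRatio(hand.image, aspect),
  };
}
```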

The debounce window

Even a good model misfires. A frame where your hands cross might briefly report a curled finger that’s actually fully extended. Without filtering, the state machine would chatter — inspecting, ready, inspecting, ready — multiple times a second.

swift waits for two consecutive frames of the same target gesture before transitioning between states. At 60 fps that’s about 33 ms. You don’t notice the delay; the misclassifications get absorbed.

Two frames is a pragmatic compromise. Three or four would smooth out even more noise but make the timer feel sluggish; one would be back to chatter. Two felt right after about a dozen rounds of dogfooding.
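
A minimal sketch of that filter, assuming one classified gesture arrives per frame. GestureDebouncer and debounceDelayMs are names invented for this example. The two-frame default and the 60 fps figure come from the text above.

```typescript
// Holds a stable value and only switches after `requiredFrames`
// consecutive observations of the same new value.
class GestureDebouncer<T> {
  private stable: T;
  private candidate: T | null = null;
  private streak = 0;
  private readonly requiredFrames: number;

  constructor(initial: T, requiredFrames = 2) {
    this.stable = initial;
    this.requiredFrames = requiredFrames;
  }

  // Feed one observation per frame; returns the debounced value.
  update(observed: T): T {
    if (observed === this.stable) {
      // Back to the current value: any pending candidate was a blip.
      this.candidate = null;
      this.streak = 0;
      return this.stable;
    }
    if (observed === this.candidate) {
      this.streak += 1;
    } else {
      this.candidate = observed;
      this.streak = 1;
    }
    if (this.streak >= this.requiredFrames) {
      this.stable = observed;
      this.candidate = null;
      this.streak = 0;
    }
    return this.stable;
  }
}

// Time spanned by the debounce window, in milliseconds.
function debounceDelayMs(frames: number, fps: number): number {
  return (frames * 1000) / fps;
}
```

A single stray frame never changes the output, because the streak resets as soon as the old value reappears. At two frames and 60 fps, debounceDelayMs gives roughly 33 ms, matching the figure above.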

Where it falls down

A few things webcam timing genuinely can’t match a stackmat on:

  • Reaction-time consistency. A stackmat senses your hand within milliseconds. The webcam pipeline adds the inference latency (~10 ms on a decent GPU) plus the debounce window (~33 ms). For competitive averages this matters; for practice it doesn’t.
  • Lighting. Bad lighting hurts the model. Bright direct sun, deep shadow, or extreme backlighting can drop landmark confidence. Indoor lighting at a desk is fine.
  • Hands out of frame. If your inspection setup has your hands resting outside the camera’s view, the timer can’t see you. You have to keep both hands roughly in frame.

For most practice contexts — sitting at a laptop, decent room lighting, just wanting to time some solves — none of these matter.

What you actually need

A laptop, a webcam, a modern browser, and a cube. That’s the entire hardware list. The model downloads once and caches. Everything else is JavaScript.

If you want to try it: swift is free and there’s nothing to install. If you want the bigger picture, the about page walks through the architecture in more detail.
