How swift works

swift replaces the Stackmat with a webcam. Here's the actual pipeline, from raw camera frame to a recorded solve time, and what the app does (and doesn't do) with your video.

The pipeline at a glance

  1. Camera capture. The browser opens your webcam via the standard getUserMedia API. The video stream is rendered into a hidden <video> element and never sent over the network.
  2. Hand landmark detection. Each frame is passed to Google's MediaPipe HandLandmarker running in your browser (WASM + WebGL). The model returns up to two hands, each as 21 3D landmark points.
  3. Gesture classification. A small TypeScript classifier reads finger bend angles at the PIP joints and the palm-normal vector from the world-space landmarks to decide if each hand is palms-down, gripping a cube, or neither.
  4. State machine. A 5-state machine (idle → inspecting → ready → solving → stopped) reacts to gesture changes — with a 2-frame debounce on every transition so the timer doesn't twitch on a single bad frame.
  5. Session logging. When a solve ends, the time, scramble, and any +2 / DNF penalty are written to localStorage. Stats (best, ao5, ao12, session mean) are recomputed.
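As an illustration of the stats in step 5, here is a minimal sketch of an ao5/ao12 calculation. It assumes the usual speedcubing convention (drop the single best and single worst solve, average the rest, treat a DNF as the slowest possible time). The Solve shape, field names, and the averageOfN helper are hypothetical and not taken from swift's source.

```typescript
// Hypothetical solve record; the field names are assumptions, not swift's schema.
type Penalty = "none" | "+2" | "DNF";

interface Solve {
  timeMs: number; // raw timer reading in milliseconds
  penalty: Penalty;
}

// Effective time: +2 adds two seconds, a DNF counts as infinitely slow.
function effectiveMs(solve: Solve): number {
  if (solve.penalty === "DNF") return Infinity;
  return solve.penalty === "+2" ? solve.timeMs + 2000 : solve.timeMs;
}

// Trimmed average over the last n solves: drop the best and the worst,
// then take the mean of what remains.
// Returns null when fewer than n solves exist, and Infinity (a DNF average)
// when more than one of the n solves is a DNF.
function averageOfN(solves: Solve[], n: number): number | null {
  if (n < 3 || solves.length < n) return null;
  const times = solves
    .slice(-n)
    .map(effectiveMs)
    .sort((a, b) => (a === b ? 0 : a < b ? -1 : 1));
  const counted = times.slice(1, -1); // remove one best and one worst
  const sum = counted.reduce((acc, t) => acc + t, 0);
  return sum / counted.length;
}
```

Calling averageOfN(session, 5) and averageOfN(session, 12) after each solve would give the two rolling averages. A single DNF inside the window is simply dropped as the worst time.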

Why no video leaves the device

There's no upload step. The MediaPipe model is downloaded once from a CDN and cached; inference runs locally on every frame. The only data that crosses the network is the page itself, the model files (once), and anonymous product analytics that capture page interactions — never the camera feed. The <video> element is also explicitly excluded from session replay so even the UI surrounding the video isn't captured visually.

There's a practical reason beyond privacy: streaming raw video would be slow and expensive, and it would add latency to the gesture pipeline. Running inference client-side is the right architecture for this kind of app.

Why a debounce window matters

Hand-landmark inference is noisy. A single frame can drop one of your hands, mis-classify a curled finger, or briefly mis-orient the palm normal — especially as your hands cross or rotate. Without debouncing, the state machine would flicker between inspecting and ready several times a second during the natural setup before a solve.

swift waits for 2 consecutive frames of the same target gesture before transitioning. At 60 fps that's about 33 ms (about 67 ms at 30 fps): invisible to you, but long enough to absorb a single misclassification. The same debounce applies on the way out (palms-down to stop the solve), so a single noisy frame doesn't end a solve early.
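A consecutive-frame debounce like this can be sketched in a few lines. The GestureDebouncer class below is a hypothetical illustration, not swift's actual code. It only assumes the three gesture labels from the classification step and a configurable frame count that defaults to 2.

```typescript
// Gesture labels from the classification step; everything else is illustrative.
type Gesture = "palms-down" | "gripping" | "none";

class GestureDebouncer {
  private stable: Gesture;    // last gesture accepted by the debouncer
  private candidate: Gesture; // gesture currently being counted
  private streak = 0;         // consecutive frames seen for the candidate
  private readonly requiredFrames: number;

  constructor(initial: Gesture, requiredFrames = 2) {
    this.stable = initial;
    this.candidate = initial;
    this.requiredFrames = requiredFrames;
  }

  // Feed one per-frame classification; returns the debounced gesture.
  update(observed: Gesture): Gesture {
    if (observed === this.stable) {
      // Back on the accepted gesture: discard any pending candidate.
      this.candidate = observed;
      this.streak = 0;
    } else if (observed === this.candidate) {
      this.streak += 1;
    } else {
      // A different gesture starts a fresh count.
      this.candidate = observed;
      this.streak = 1;
    }
    if (this.candidate !== this.stable && this.streak >= this.requiredFrames) {
      this.stable = this.candidate;
      this.streak = 0;
    }
    return this.stable;
  }
}
```

The state machine would then react only when the value returned by update changes, so one stray frame in either direction never triggers a transition.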

The world-landmark trick for palm orientation

Image-space landmarks shift around when your hand moves across the frame, so the same physical pose can produce different finger-bend readings depending on where your hand sits. MediaPipe also returns world landmarks: a coordinate system anchored to the hand itself, independent of camera position. swift uses the world-space palm normal to detect "palms-down" (|ny|, the magnitude of the normal's vertical component, above a threshold) and falls back to image-space bend angles when world landmarks are unavailable.
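One way to get a palm normal from the 21 world landmarks is to take the cross product of two vectors lying in the palm. The sketch below uses MediaPipe's landmark indices for the wrist (0), index-finger MCP (5), and pinky MCP (17). The choice of those three points, the function names, and the 0.7 threshold are assumptions made for illustration, not values from swift.

```typescript
interface Point3 { x: number; y: number; z: number; }

// MediaPipe hand landmark indices.
const WRIST = 0;
const INDEX_MCP = 5;
const PINKY_MCP = 17;

// Unit normal of the plane through wrist, index MCP, and pinky MCP.
// Returns null if landmarks are missing or the three points are collinear.
function palmNormal(world: Point3[]): Point3 | null {
  if (world.length < 21) return null;
  const w = world[WRIST];
  const a = { x: world[INDEX_MCP].x - w.x, y: world[INDEX_MCP].y - w.y, z: world[INDEX_MCP].z - w.z };
  const b = { x: world[PINKY_MCP].x - w.x, y: world[PINKY_MCP].y - w.y, z: world[PINKY_MCP].z - w.z };
  const n = {
    x: a.y * b.z - a.z * b.y,
    y: a.z * b.x - a.x * b.z,
    z: a.x * b.y - a.y * b.x,
  };
  const len = Math.hypot(n.x, n.y, n.z);
  if (len === 0) return null;
  return { x: n.x / len, y: n.y / len, z: n.z / len };
}

// The absolute value is used because the sign of this cross product flips
// between left and right hands. The threshold is an assumed example value.
function looksPalmsDown(world: Point3[], threshold = 0.7): boolean {
  const n = palmNormal(world);
  return n !== null && Math.abs(n.y) > threshold;
}
```

When palmNormal returns null, a caller would move on to the image-space fallback described above.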

What's deliberately not in swift

If you need any of those, csTimer is a much deeper app and is what we'd point you to.