Gesture-based Web AR
Touchless input system for browsers. MSc thesis with usability study and SUS scoring.
- Role
- Researcher and designer
- Type
- Research · MSc Thesis
- Period
- 2023
- Status
- Completed
- Stack
- MediaPipe · OpenCV · WebAR
- Links
- Research paper ↗

Context
Web AR is moving from novelty to plausible interface. The browser stack (WebXR, MediaPipe in JavaScript, GPU-accelerated camera capture) makes it possible to build mixed-reality interfaces without an app install or specialist hardware. But the input layer is unsolved. Mouse and keyboard assume a desk. Touchscreens assume a phone. Voice assumes privacy and a quiet room. Gesture is the obvious primitive for hands-free immersive interfaces, and it is also the easiest one to do badly.
The thesis question was narrow. Can a desktop Web AR experience use hand gestures for primary input (typing, selection, navigation) and feel learnable to first-time users, or does gesture input always carry the "fun for thirty seconds, frustrating after" problem that has plagued AR demos for fifteen years?
The artefact was a working gesture-controlled virtual keyboard, deployed in browser, evaluated through structured usability testing.
Approach
Two constraints anchored the design work.
First, the gesture vocabulary needed to be small and physically distinct. Most consumer AR demos overload gesture sets (pinch to select, fist to grab, splay to release, two-finger swipe to dismiss), and users cannot remember any of them by minute three. The thesis kept the vocabulary to four primitives and made each one biomechanically distinct, so users were not asked to remember subtle wrist orientations.
Second, on-screen feedback had to be immediate and visible at the gestural target. AR gesture systems typically rely on a small text or icon affordance in a separate UI region to confirm an action. At an effective 12 frames per second after camera latency, that round trip is too slow. The keyboard was designed so the gesture-active state appeared at the key being targeted, removing the saccade between hand and confirmation.
Process
Picking the gesture vocabulary
A pilot study with five participants tested ten candidate gestures across a range of postures (seated, standing, raised arm, resting). Three findings narrowed the set.
Pinch-and-hold was the most accurate selection gesture, ahead of tap-in-air, which had a high false-positive rate from incidental hand movement. Fist-and-release for delete was preferred over swipe-left, which was repeatedly confused with navigation. Anything requiring two hands had a steep learning curve and was dropped from the vocabulary entirely.
The shipped vocabulary: hover (cursor positioning), pinch (select), fist (delete), neutral hand (no input).
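As a sketch of how small the recognition layer can stay: the classifier below distinguishes the four primitives from MediaPipe Hands landmarks (21 points per hand, in normalised image coordinates). The landmark indices are MediaPipe's; the thresholds, and the choice to treat an untracked hand as neutral, are illustrative assumptions rather than the thesis values.

```ts
// Gesture primitives: hover (cursor), pinch (select), fist (delete), neutral (no input).
type Gesture = "hover" | "pinch" | "fist" | "neutral";

// MediaPipe Hands landmark, in normalised image coordinates.
interface Landmark { x: number; y: number; z: number }

const THUMB_TIP = 4;
const INDEX_TIP = 8;
const TIPS = [8, 12, 16, 20];  // index, middle, ring, pinky fingertips
const PIPS = [6, 10, 14, 18];  // corresponding PIP joints

const PINCH_DIST = 0.05;       // illustrative threshold in normalised units

function dist2d(a: Landmark, b: Landmark): number {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

function classify(hand: Landmark[] | undefined): Gesture {
  // No tracked hand is treated as the neutral, no-input state (an assumption).
  if (!hand) return "neutral";

  // Fist: every fingertip curled past its PIP joint (upright hand assumed;
  // y grows downward in image coordinates).
  const fist = TIPS.every((tip, i) => hand[tip].y > hand[PIPS[i]].y);
  if (fist) return "fist";

  // Pinch: thumb tip and index tip nearly touching.
  if (dist2d(hand[THUMB_TIP], hand[INDEX_TIP]) < PINCH_DIST) return "pinch";

  // Anything else is an open hand positioning the cursor.
  return "hover";
}
```

Because each branch tests a different joint relationship, no two primitives compete over the same degrees of freedom, which is what "biomechanically distinct" buys in practice.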
Designing the keyboard layout
The keyboard followed a QWERTY grid because participants type fastest on the layout they already know, regardless of whether that layout is theoretically optimal. Each key had a hit area larger than its visible footprint, to compensate for hand tremor at distance. A per-key visual state pulsed when the cursor was within the hit area but the key had not yet been pinch-selected, so the user could see the system was tracking their intent before they committed.
The visible-versus-effective gap was tuned through pilot tests until input felt forgiving without becoming sloppy.
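A minimal sketch of the hit-testing behind that gap, reusing the `Gesture` type from above and assuming rectangular keys in screen pixels; the padding constant stands in for the pilot-tuned gap, whose value is not reproduced here.

```ts
interface Key { label: string; x: number; y: number; w: number; h: number } // visible rect, px

const HIT_PADDING = 14; // px added per side; placeholder for the pilot-tuned gap

// Returns the key whose *effective* (padded) area contains the cursor, or null.
// If padded areas overlap, the first match wins; a real implementation would
// pick the key with the nearest centre.
function keyUnderCursor(keys: Key[], cx: number, cy: number): Key | null {
  return keys.find(k =>
    cx >= k.x - HIT_PADDING && cx <= k.x + k.w + HIT_PADDING &&
    cy >= k.y - HIT_PADDING && cy <= k.y + k.h + HIT_PADDING
  ) ?? null;
}

// Per-frame key state drives the pulse: feedback renders on the key itself,
// not in a separate UI region.
type KeyState = "idle" | "targeted" | "selected";

function keyState(key: Key, target: Key | null, gesture: Gesture): KeyState {
  if (key !== target) return "idle";
  return gesture === "pinch" ? "selected" : "targeted"; // "targeted" = pulsing
}
```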
Running the usability study
Twelve participants completed a structured task: type a five-word phrase using the gesture keyboard, then complete the System Usability Scale (SUS) questionnaire and a short semi-structured interview. Each session was screen-recorded with hand-pose data captured alongside, so post-hoc analysis could correlate hesitations and corrections with specific gesture states.
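Pairing the recording with pose data only requires a timestamped trace. A sketch of the per-frame log, reusing the types from the earlier sketches; the field names are illustrative.

```ts
interface FrameLog {
  t: number;                 // ms since session start (performance.now())
  gesture: Gesture;          // classifier output for this frame
  targetKey: string | null;  // label of the key under the cursor, if any
}

const trace: FrameLog[] = [];

function logFrame(gesture: Gesture, target: Key | null): void {
  trace.push({ t: performance.now(), gesture, targetKey: target?.label ?? null });
}

// After the session, long runs of "targeted" with no following "selected"
// mark the hesitations that were matched against the screen recording.
```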
Common errors clustered around pinch confirmation: quick pinch-and-release movements, released too early to be deliberate selections, were registering as input. The fix was to extend the confirmation window from 200 ms to 350 ms, which closed most of those false-positive cases without making the input feel laggy.
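One plausible implementation of that window, reading it as a minimum hold: a pinch must be sustained over the same key for the full window before the key fires, so quick accidental pinch-releases never commit. The 350 ms figure is from the study; the state machine around it is an assumed sketch.

```ts
const CONFIRM_MS = 350; // extended from the original 200 ms during the study

let pinchedKey: Key | null = null; // key targeted when the current pinch began
let pinchStart = 0;
let fired = false;                 // prevents repeat-fire while the pinch is held

function updateSelection(gesture: Gesture, target: Key | null, now: number): void {
  if (gesture !== "pinch") {
    // Release (or any other gesture) before the window elapses cancels cleanly.
    pinchedKey = null;
    fired = false;
    return;
  }
  if (target !== pinchedKey) {
    // Fresh pinch, or the cursor drifted onto another key: restart the window.
    pinchedKey = target;
    pinchStart = now;
    fired = false;
    return;
  }
  if (!fired && pinchedKey !== null && now - pinchStart >= CONFIRM_MS) {
    commitKey(pinchedKey); // the key fires only after the full hold
    fired = true;          // a new pinch is needed for the next character
  }
}

declare function commitKey(key: Key): void; // inserts the character (assumed hook)
```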
[Findings, mean SUS score, and top qualitative quotes: review against thesis examiner-approved version before publishing.]
Outcome
The artefact shipped as a deployed Web AR demo, and the methodology was documented in a research paper. The SUS score landed in the acceptable-to-good range for first-time touchless input, which was the strongest available evidence that gesture input in browser AR can be designed to a usable standard rather than left as a tech demo.
The keyboard itself is a research artefact, not a shipping product. What transfers is the set of design heuristics: a small, biomechanically distinct vocabulary; feedback at the target rather than in a separate region; and a gesture-confirmation window long enough to absorb human variability.
Reflection
If I were doing this again, I would test on more diverse hand sizes earlier. The pilot recruited from the design school cohort, which skewed toward similar handedness and similar arm geometry. Several findings looked stronger than they probably are. A more diverse pilot would have caught the confirmation-window issue in week two rather than week six.