Controlling Spotify From the Wall

I wanted to change the music without picking up my phone. Not for any good reason — the phone was right there — but the idea of pointing at a record on the wall and having it play felt like the right kind of useless.

So I printed out the album art for a handful of records, stuck the prints on the wall, and pointed a webcam at them. The plan: look at where my hand is, figure out which album I’m pointing at, read the gesture I’m making, and send the right command to Spotify.

How it fits together

There are three problems hiding in “point at a record and play it”, and they need different tools.

Which album am I pointing at? This is object detection. The camera sees the wall of prints; I run detection over the frame to locate each album and pick the one my hand is nearest to. I leaned on an existing object-detection API rather than training my own detector — the album covers are distinctive enough that off-the-shelf detection was fine.
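
The "nearest album" step could look roughly like this. It is a sketch, not the repo's code: the `Detection` class, the `nearest_album` function and the pixel-coordinate convention are all assumptions made for illustration, and the boxes would come from whatever detection API is in use.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple


@dataclass
class Detection:
    """Hypothetical detector output: a label plus a bounding box in pixels."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def nearest_album(
    hand_xy: Tuple[float, float],
    detections: List[Detection],
    max_distance: Optional[float] = None,
) -> Optional[Detection]:
    """Return the detection whose box centre is closest to the hand.

    Returns None when there are no detections, or when the closest one
    is further away than max_distance (if a limit is given).
    """
    best, best_dist = None, float("inf")
    for det in detections:
        cx, cy = det.center
        dist = hypot(cx - hand_xy[0], cy - hand_xy[1])
        if dist < best_dist:
            best, best_dist = det, dist
    if best is None:
        return None
    if max_distance is not None and best_dist > max_distance:
        return None
    return best


# Made-up boxes for two prints side by side on the wall.
wall = [
    Detection("album_a", 0, 0, 100, 100),
    Detection("album_b", 200, 0, 300, 100),
]
picked = nearest_album((240, 60), wall)
```

The optional `max_distance` is there so that a hand hovering nowhere near the wall selects nothing instead of snapping to the closest print.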

What am I doing with my hand? This is the interesting part. A single frame tells you the hand’s position but not the gesture — a swipe and a hold look identical in a still image. Gestures live in time. So I fed a short stack of frames into a 3D convolutional network, where the third convolution dimension is time. That lets the model pick up motion: a swipe left, a swipe right, an open palm to pause. Hand-landmark detection gave a cleaner signal to work from than raw pixels.
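
To make the "gestures live in time" point concrete, here is a toy 3D convolution in plain Python. It is not the network from the project: the clip layout, the helper names and the hand-written frame-difference kernel are all assumptions for illustration. A real 3D CNN learns many such kernels rather than having one written by hand.

```python
from typing import List

Clip = List[List[List[float]]]  # indexed as clip[t][y][x]


def conv3d_valid(clip: Clip, kernel: Clip) -> Clip:
    """Naive 'valid' 3D cross-correlation over (time, height, width)."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def motion_energy(clip: Clip, kernel: Clip) -> float:
    """Sum of absolute filter responses over the whole clip."""
    response = conv3d_valid(clip, kernel)
    return sum(abs(v) for plane in response for row in plane for v in row)


def make_clip(positions: List[int], width: int = 5) -> Clip:
    """One-row frames with a single bright pixel (the 'hand') at each x."""
    return [[[1.0 if x == p else 0.0 for x in range(width)]] for p in positions]


# Kernel spans 2 frames and 1x1 pixels: next frame minus current frame.
TEMPORAL_DIFF: Clip = [[[-1.0]], [[1.0]]]

hold = make_clip([2, 2, 2, 2])   # hand stays put
swipe = make_clip([0, 1, 2, 3])  # hand moves right one pixel per frame

hold_energy = motion_energy(hold, TEMPORAL_DIFF)    # 0.0
swipe_energy = motion_energy(swipe, TEMPORAL_DIFF)  # 6.0
```

The third frame of the swipe is pixel-for-pixel identical to every frame of the hold, so no single-frame classifier could separate them. The kernel that spans two frames gives zero response for the hold and a nonzero one for the swipe.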

Then do the thing. Once you have (album, gesture), the rest is the Spotify Web API. Point and open palm → play that record. Swipe → skip. It’s a thin layer over a REST API and it’s the least interesting 5% of the project, which is usually how these things go.
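
As a sketch of how thin that layer can be, the mapping below turns a gesture into a Spotify Web API request description without sending anything. The gesture names, the `build_command` function, the left/right mapping and the placeholder album URI are assumptions for illustration; the endpoints are the Web API's player endpoints.

```python
import json
from typing import Optional, Tuple

API_BASE = "https://api.spotify.com/v1/me/player"

# (HTTP method, URL, JSON body or None)
Command = Tuple[str, str, Optional[str]]


def build_command(gesture: str, album_uri: Optional[str] = None) -> Command:
    """Translate a recognised gesture into a request description.

    Gesture names here are hypothetical labels, not the project's own.
    """
    if gesture == "open_palm":
        if album_uri is None:
            raise ValueError("open_palm needs an album to play")
        # Start playback of a specific album via its context URI.
        return ("PUT", f"{API_BASE}/play", json.dumps({"context_uri": album_uri}))
    if gesture == "swipe_right":
        return ("POST", f"{API_BASE}/next", None)
    if gesture == "swipe_left":
        return ("POST", f"{API_BASE}/previous", None)
    raise ValueError(f"unknown gesture: {gesture!r}")


# Placeholder URI; a real one would be looked up from the detected album.
play = build_command("open_palm", "spotify:album:EXAMPLE_ID")
skip = build_command("swipe_right")
```

Actually sending these needs an OAuth access token with the `user-modify-playback-state` scope in an `Authorization: Bearer` header, plus an active playback device on the account.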

Running it

The repo (wall-music) is a small Python package. It predates uv, so setup is plain pip:

git clone https://github.com/willleeney/wall-music
cd wall-music
pip install -r requirements.txt
pip install -e .
python music_ai/main.py

What I took away

Temporal models are underrated for interface problems. The moment I stopped trying to classify single frames and started feeding the model a window of time, the gesture recognition went from “frustrating demo” to “actually works”. A lot of real-world signals are like this — the information is in how things change, not in any one snapshot.

It’s a toy. The lighting has to be decent, the gestures have to be deliberate, and it’s slower than just tapping a screen. But it taught me how to wire object detection, a 3D CNN and a third-party API into a single loop that does something physical, and that’s a skill that transfers to far less silly projects.