The AI that physically uses your phone.
Order food, hail rides, check your bank, reply to messages.
Any app, any phone, no APIs, no OAuth.
Pizza for dinner on Uber Eats
On it. I'll open Uber Eats on your phone, find a pizza place nearby, and place the order.
The entire flow runs autonomously. You just watch.
Top Camera ──→ AI Agent ──→ 3-Axis Arm ──→ Side Camera ──→ Aligned?
(read screen)  (decide)     (move stylus)  (check pos)    Yes│  │No
     ▲                                                       │  │
     │         Touch Phone ◄─────────────────────────────────┘  │
     │              │              adjust & retry ◄─────────────┘
     └──────────────┘
       (next action)

Stylus retracts out of frame. The top camera takes a clean screenshot. The AI sees exactly what's on your phone screen.
The agent (Claude, GPT, etc.) analyzes the screen and picks a direction and distance — like "move down-right, large" — no pixel coordinates needed.
The arm moves the stylus. Side camera checks alignment from 45°. Not aligned? The AI adjusts. Aligned? It taps.
Stylus touches the screen, then retracts. Top camera verifies the result. The loop continues for the next action.
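The "direction and distance" action space can be sketched as a simple lookup; the direction names and step sizes below are illustrative assumptions, not the project's actual values:

```python
# Sketch of the coarse action space: the agent names a direction and a
# magnitude, and the server turns that into an X/Y delta in millimeters.
# Screen convention assumed here: +y points down. Step sizes are made up.
DIRECTIONS = {
    "up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0),
    "up-left": (-1, -1), "up-right": (1, -1),
    "down-left": (-1, 1), "down-right": (1, 1),
}
STEP_MM = {"small": 2.0, "medium": 8.0, "large": 25.0}

def step_vector(direction: str, magnitude: str) -> tuple[float, float]:
    """Turn e.g. ("down-right", "large") into a (dx, dy) move in mm."""
    dx, dy = DIRECTIONS[direction]
    step = STEP_MM[magnitude]
    return (dx * step, dy * step)
```

Because the agent only ever picks from a small discrete menu, it never has to guess pixel coordinates; the side camera closes the loop on any residual error.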
Top Camera: looks straight down at the screen and reads what's on it.
Side Camera: views from ~45° and checks the stylus tip position before tapping.
3-Axis Arm: moves X/Y to reach any point, up/down (Z) to touch or release.
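How an X/Y move might be translated into GRBL G-code, sketched with an assumed phone-sized workspace (the bounds, park position, and function names are illustrative, not PhysiClaw's actual code):

```python
# Sketch: turn an X/Y target into a GRBL rapid move, clamped to an
# assumed workspace so the stylus can't overshoot the screen.
X_MAX_MM, Y_MAX_MM = 70.0, 150.0   # assumed reachable screen area
PARK_X, PARK_Y = 0.0, 0.0          # assumed parking spot out of frame

def move_gcode(x_mm: float, y_mm: float) -> str:
    """Emit a G0 rapid move with the target clamped into the workspace."""
    x = min(max(x_mm, 0.0), X_MAX_MM)
    y = min(max(y_mm, 0.0), Y_MAX_MM)
    return f"G0 X{x:.1f} Y{y:.1f}"

def park_gcode() -> str:
    """Retract the stylus to the parking spot for a clean screenshot."""
    return f"G0 X{PARK_X:.1f} Y{PARK_Y:.1f}"
```

Clamping is a cheap safety net: even if the vision loop misjudges a position, the arm stays over the phone.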
Today's AI agents can control your computer, but they hit walls everywhere: missing APIs, OAuth flows, and apps that detect and block automation.
The insight: Instead of building software bridges to every service, give the AI agent physical presence. A camera sees the screen. A robotic finger taps it. No app can detect or block it — because to the phone, it's just a finger.
┌───────────────────────────────────────┐
│           AI Agent (Brain)            │
│   Claude Desktop / OpenClaw / etc.    │
│  Sees screen → decides → calls tools  │
└──────────────────┬────────────────────┘
                   │ MCP Protocol
                   ▼
┌───────────────────────────────────────┐
│     PhysiClaw MCP Server (Python)     │
│                                       │
│  Tools:                               │
│   · screenshot_top  (top camera)      │
│   · screenshot_side (side camera)     │
│   · move            (X/Y plane)       │
│   · tap / swipe     (Z down + move)   │
│   · park            (retract)         │
└──────────┬────────────────┬───────────┘
           │                │
     USB Cameras       USB Serial (GRBL)
           │                │
           ▼                ▼
    ┌────────────┐   ┌───────────────┐
    │ Top Camera │   │  GRBL Board   │
    │  (screen)  │   │  (embedded)   │
    ├────────────┤   │  X/Y gantry   │
    │ Side Camera│   │   Z stylus    │
    │  (stylus)  │   └──────┬────────┘
    └────────────┘          │ touch
                            ▼
                   ┌─────────────────┐
                   │      Phone      │
                   │   (unlocked)    │
                   └─────────────────┘

Everything you need to build one. Total cost: ~$127.
| Component | Price |
|---|---|
| GRBL Arm | ~$80 |
| Top Camera | ~$14 |
| Side Camera | ~$14 |
| Stylus | ~$1.50 |
| Camera Mounts | ~$4 |
| Phone Mount | ~$1.20 |
| USB Hub | ~$13 |
pip install pyserial opencv-python mcp
git clone https://github.com/echosprint/PhysiClaw.git
cd PhysiClaw

physiclaw/
├── physiclaw_server.py   # MCP Server entry point
├── hand.py               # Serial G-code control
├── eyes.py               # USB camera capture
└── config.py             # Configuration constants

Any app, any phone, iOS or Android.
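One plausible shape for the `config.py` listed in the tree; every value below is an assumption to adapt to your own hardware, not the repo's actual settings:

```python
# Illustrative configuration constants for a PhysiClaw-style rig.
SERIAL_PORT = "/dev/ttyUSB0"   # GRBL board's USB serial device
BAUD_RATE = 115200             # GRBL's standard baud rate
TOP_CAMERA_INDEX = 0           # OpenCV device index for the top camera
SIDE_CAMERA_INDEX = 1          # OpenCV device index for the side camera
PEN_DOWN_CMD = "M3 S12"        # servo PWM value that lowers the stylus
PEN_UP_CMD = "M3 S0"           # servo PWM value that retracts it
SWIPE_FEED_RATE = 3000         # mm/min feed for G1 swipe moves
```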
| Gesture | Motion | G-code sequence |
|---|---|---|
| Tap | Move to target → down 50-100ms → up | `G0` → `M3 S12` → `G4 P0.05` → `M3 S0` |
| Long press | Move to target → down 800ms → up | `G0` → `M3 S12` → `G4 P0.8` → `M3 S0` |
| Swipe | Down at start → linear move → up at end | `M3 S12` → `G1 F3000` → `M3 S0` |
| Double tap | Two taps with <300ms interval | tap → `G4 P0.1` → tap |
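The gesture sequences above can be generated from a few small helpers; a sketch assuming the `M3`/`G4` commands listed (function names are mine, and positioning via `G0` is assumed to happen beforehand):

```python
# Build the touch portion of each gesture as a list of G-code lines,
# mirroring the gesture table: M3 S12 lowers the stylus, M3 S0 raises it,
# G4 P<seconds> dwells, G1 moves linearly while the pen is down.
def tap_seq(dwell_s: float = 0.05) -> list[str]:
    """Stylus down, short dwell, stylus up."""
    return ["M3 S12", f"G4 P{dwell_s}", "M3 S0"]

def long_press_seq() -> list[str]:
    """Same as a tap but with an 800 ms dwell."""
    return tap_seq(dwell_s=0.8)

def swipe_seq(x_mm: float, y_mm: float, feed: int = 3000) -> list[str]:
    """Pen down at the start point, linear move to (x, y), pen up."""
    return ["M3 S12", f"G1 X{x_mm:.1f} Y{y_mm:.1f} F{feed}", "M3 S0"]

def double_tap_seq() -> list[str]:
    """Two taps separated by 100 ms, well under the 300 ms limit."""
    return tap_seq() + ["G4 P0.1"] + tap_seq()
```

Keeping gestures as pure G-code builders makes them easy to test without hardware: the serial layer just streams whatever list they return.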