2026-06-24
Automate Video Editing With OpenClaw (2026)
Automate video editing with OpenClaw and video-use: AI-driven cuts, color grading, subtitles, and animation overlays from raw footage to final.mp4.
2026-06-23
A practical guide to the video-use skill: drop raw footage into a folder, tell your agent to edit, and get a polished final.mp4 back — cuts, color grade, subtitles, and animation overlays included.
Video editing is the kind of task that should not take hours. You record a talking head, a tutorial, a product demo. The raw footage has filler words, dead air, inconsistent color, no subtitles. Turning it into something publishable means: review every take, mark cut points, apply color correction, generate and burn subtitles, add lower-thirds or animations, render, review, fix, render again. A 10-minute video eats 2-3 hours of manual work.
video-use is an open-source skill that flips this: you drop raw footage into a folder, tell your agent what you want, and the agent produces final.mp4. It works with Claude Code, Codex, and OpenClaw. Animation overlays are built with HyperFrames, Remotion, Manim, or PIL, each one generated by its own sub-agent running in parallel.
This tutorial shows how to install the skill in OpenClaw, run a real editing workflow, and tune the output for different content types — talking heads, tutorials, social clips, and course material.
What video-use does
video-use is a skill, not a GUI editor. There is no timeline, no preview window, no drag-and-drop. You talk to your agent; the agent reads the footage, decides where to cut, renders the result, self-evaluates, and hands you the file.
The pipeline has six stages:
| Stage | What happens | Tool |
|---|---|---|
| Transcribe | One ElevenLabs Scribe call per source file → word-level timestamps, speaker diarization, audio events | ElevenLabs API |
| Pack | All takes compressed into a single ~12KB takes_packed.md, the agent's reading view | Python helper |
| Reason | The LLM reads the transcript, identifies filler words, dead space, retakes, and proposes a cut strategy | LLM (you approve) |
| EDL | The agent generates an Edit Decision List — cut points, color grade per segment, subtitle chunks | Python helper |
| Render | ffmpeg executes the EDL: cuts, color grading, 30ms audio fades at every cut, burned subtitles | ffmpeg |
| Self-eval | The agent runs timeline_view on the rendered output at every cut boundary to catch visual jumps, audio pops, hidden subtitles | Python helper |
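The 30ms fades in the Render stage map naturally onto ffmpeg's `afade` filter. The sketch below is only an illustration: it assumes each kept segment is rendered on its own with timestamps starting at zero, and `fade_filter` is a hypothetical helper name, not a function from the video-use repository.

```python
def fade_filter(segment_duration: float, fade: float = 0.03) -> str:
    """Build an ffmpeg afade chain: fade in at the start, fade out at the end."""
    if segment_duration < 2 * fade:
        raise ValueError("segment is too short to hold both fades")
    fade_out_start = segment_duration - fade
    return (
        f"afade=t=in:st=0:d={fade:.3f},"
        f"afade=t=out:st={fade_out_start:.3f}:d={fade:.3f}"
    )


# A 4.5-second segment gets a 30ms fade at each edge.
print(fade_filter(4.5))
```

The resulting string can be passed to ffmpeg's `-af` option for that segment.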
The self-eval loop is the part that matters. The agent does not hand you a first draft and call it done. It renders, inspects the output at every cut boundary, and if something is wrong — a visual jump, an audio pop, a subtitle that bleeds — it fixes and re-renders. Maximum three iterations. You see the preview only after it passes.
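The render, inspect, and fix cycle described above can be sketched as a bounded loop. This is a minimal illustration, not the skill's code: `render`, `inspect`, and `fix` are hypothetical callables, and raising an error after the third failed render is an assumption, since the article does not say what happens in that case.

```python
from typing import Callable, List, Tuple

MAX_ITERATIONS = 3  # the cap stated in the article


def render_with_self_eval(
    render: Callable[[], str],
    inspect: Callable[[str], List[str]],
    fix: Callable[[List[str]], None],
) -> Tuple[str, int]:
    """Render, inspect every cut boundary, and fix until clean or the cap is hit."""
    issues: List[str] = []
    for attempt in range(1, MAX_ITERATIONS + 1):
        output = render()
        issues = inspect(output)
        if not issues:
            return output, attempt  # only a clean render is returned
        if attempt < MAX_ITERATIONS:
            fix(issues)  # adjust the edit before the next render
    raise RuntimeError(f"still failing after {MAX_ITERATIONS} renders: {issues}")
```

Returning the attempt count alongside the path makes it easy to log how many passes a given edit needed.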
Why transcript-driven, not frame-dumping
A naive approach to AI video editing would dump every frame and feed it to the LLM. A 10-minute video at 30fps is 18,000 frames. At ~1,500 tokens per frame, that is 27 million tokens of noise. Expensive, slow, and unreliable.
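The arithmetic behind that figure is easy to reproduce. The sketch below uses the article's numbers for the frame dump; the transcript estimate assumes roughly four bytes per token, which is a common rule of thumb rather than a measured value.

```python
def frame_dump_tokens(minutes: float, fps: int = 30, tokens_per_frame: int = 1_500) -> int:
    """Tokens needed to feed every frame of a video to an LLM."""
    frames = int(minutes * 60 * fps)
    return frames * tokens_per_frame


def transcript_tokens(size_bytes: int, bytes_per_token: int = 4) -> int:
    """Rough token count for a text file (bytes_per_token is an assumption)."""
    return size_bytes // bytes_per_token


dump = frame_dump_tokens(10)          # 18,000 frames at 1,500 tokens each
packed = transcript_tokens(12_000)    # a ~12KB packed transcript
print(f"frame dump: {dump:,} tokens")
print(f"transcript: {packed:,} tokens")
```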
video-use does the opposite. The LLM never watches the video. It reads it — through two layers:
- Audio transcript (always loaded). One ElevenLabs Scribe call per source gives word-level timestamps, speaker diarization, and audio events such as (laughter), (applause), and (sigh). All takes pack into a single ~12KB takes_packed.md, the LLM's primary reading view.
- Visual composite (on demand). timeline_view produces a filmstrip + waveform + word labels PNG for any time range. It is called only at decision points: ambiguous pauses, retake comparisons, cut-point sanity checks.
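To make the packed reading view concrete, here is one way word-level timestamps could be collapsed into compact, phrase-level lines. The input shape (dicts with `text`, `start`, and `end` keys), the output line format, and the `pack_words` name are all assumptions for illustration; the real takes_packed.md layout may differ.

```python
from typing import Dict, List


def _format(phrase: List[Dict]) -> str:
    """Render one phrase as '[start-end] words'."""
    text = " ".join(word["text"] for word in phrase)
    return f"[{phrase[0]['start']:.2f}-{phrase[-1]['end']:.2f}] {text}"


def pack_words(words: List[Dict], max_gap: float = 0.5) -> List[str]:
    """Group words into phrases, starting a new line when a pause exceeds max_gap."""
    lines: List[str] = []
    phrase: List[Dict] = []
    for word in words:
        if phrase and word["start"] - phrase[-1]["end"] > max_gap:
            lines.append(_format(phrase))
            phrase = []
        phrase.append(word)
    if phrase:
        lines.append(_format(phrase))
    return lines


words = [
    {"text": "Welcome", "start": 0.00, "end": 0.42},
    {"text": "back", "start": 0.45, "end": 0.80},
    {"text": "um", "start": 1.90, "end": 2.10},
    {"text": "today", "start": 2.15, "end": 2.50},
]
for line in pack_words(words):
    print(line)
```

Splitting on pauses keeps dead air visible as gaps between lines, which is the information a cut decision needs.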
The result: 12KB of text plus a handful of PNGs instead of millions of tokens. The same principle that browser-use applied to web automation (structured DOM instead of screenshots), applied to video.
Setup: install the video-use skill in OpenClaw
Total time: under 10 minutes. You need an OpenClaw workspace, ffmpeg, and an ElevenLabs API key.
Step 1 — Clone and link the skill
git clone https://github.com/browser-use/video-use ~/skills/video-use
ln -sfn ~/skills/video-use ~/.openclaw/workspace/.agents/skills/video-use
If your OpenClaw skills directory lives elsewhere, adjust the link path (the second argument to ln). The key is that OpenClaw discovers the SKILL.md file inside the video-use directory.
Step 2 — Install dependencies
cd ~/skills/video-use
uv sync # or: pip install -e .
You also need ffmpeg on the system:
# Ubuntu/Debian
sudo apt install ffmpeg
# macOS
brew install ffmpeg
Optional but useful: yt-dlp for downloading source material from YouTube or other platforms.
brew install yt-dlp # macOS
pip install yt-dlp # Linux
Step 3 — Add your ElevenLabs API key
cp ~/skills/video-use/.env.example ~/skills/video-use/.env
Edit .env:
ELEVENLABS_API_KEY=your_key_here
Get a key at elevenlabs.io/app/settings/api-keys. The free tier includes transcription minutes; for regular use, the Creator plan ($22/month) covers most editing workflows.
Step 4 — Verify the skill is loaded
Start an OpenClaw session and check that the skill appears:
openclaw skills check
You should see video-use listed with a green check. If not, verify the symlink and restart OpenClaw.
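If the skill does not show up, a small preflight script can narrow down which of the setup steps failed. This is a sketch under the paths used in this tutorial; `preflight` and `read_env_file` are hypothetical helper names, and the `.env` parsing is deliberately simplistic (no quoting or export handling).

```python
import shutil
from pathlib import Path
from typing import Dict


def read_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=value lines, ignoring blanks and comments."""
    values: Dict[str, str] = {}
    if not path.is_file():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def preflight(skill_dir: Path) -> Dict[str, bool]:
    """Check the three things the setup steps are supposed to leave in place."""
    key = read_env_file(skill_dir / ".env").get("ELEVENLABS_API_KEY", "")
    return {
        "skill_md": (skill_dir / "SKILL.md").is_file(),
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "elevenlabs_key": bool(key) and key != "your_key_here",
    }


link = Path.home() / ".openclaw/workspace/.agents/skills/video-use"
for name, ok in preflight(link).items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```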
The quick setup prompt
If you want the agent to handle the install for you, paste this into an OpenClaw session:
Set up https://github.com/browser-use/video-use for me.
Read install.md first to install this repo, wire up ffmpeg, register the skill with whichever agent you're running under, and set up the ElevenLabs API key — ask me to paste it when you need it. Then read SKILL.md for daily usage, and always read helpers/ because that's where the editing scripts live. After install, don't transcribe anything on your own — just tell me it's ready and wait for me to drop footage into a folder.
The agent will clone the repo, install deps, prompt you once for the ElevenLabs key, and confirm when ready.
Workflow: from raw footage to final.mp4
Here is a real session. We will edit a 5-minute talking-head video — the kind you record for a product update or a LinkedIn post.
Step 1 — Drop your footage
Put your raw recordings in a folder:
mkdir -p ~/videos/product-update
cp ~/Downloads/take_01.mp4 ~/Downloads/take_02.mp4 ~/videos/product-update/
The skill works with any video format ffmpeg can read: MP4, MOV, MKV, WebM. Multiple takes are fine — the agent will identify the best segments across all sources.
Step 2 — Start the session
cd ~/videos/product-update
openclaw
In the session:
edit these into a launch video
That is it. The agent takes over from here.
Step 3 — What the agent does
First, it inventories the sources:
Found 2 sources:
- take_01.mp4 (4:32, 1920×1080, 30fps)
- take_02.mp4 (2:48, 1920×1080, 30fps)
Transcribing take_01.mp4...
Transcribing take_02.mp4...
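An inventory line like the ones above can be produced from ffprobe's JSON output. The sketch below is an assumption about how such a listing could be built, not the skill's actual helper; `probe` and `summarize` are hypothetical names, and `probe` requires ffprobe on the PATH.

```python
import json
import subprocess
from fractions import Fraction


def probe(path: str) -> dict:
    """Ask ffprobe for duration, resolution, and frame rate as JSON."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "format=duration:stream=width,height,r_frame_rate",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def summarize(name: str, report: dict) -> str:
    """Format one inventory line from an ffprobe report."""
    stream = report["streams"][0]
    seconds = int(float(report["format"]["duration"]))
    fps = round(float(Fraction(stream["r_frame_rate"])))  # e.g. "30/1" or "30000/1001"
    size = f"{stream['width']}×{stream['height']}"
    return f"{name} ({seconds // 60}:{seconds % 60:02d}, {size}, {fps}fps)"


# Example with a canned report, so it runs without a video file:
sample = {
    "streams": [{"width": 1920, "height": 1080, "r_frame_rate": "30/1"}],
    "format": {"duration": "272.4"},
}
print(summarize("take_01.mp4", sample))
```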
Then it proposes a strategy:
Strategy proposal:
- Remove filler words (um, uh, false starts): ~23 instances
- Cut dead air between takes: ~40 seconds total
- Use take_02 for the opening hook (stronger energy)
- Merge take_01 middle section (clearer explanation)
- Close with take_02 ending (cleaner CTA)
- Color grade: warm cinematic
- Subtitles: 2-word UPPERCASE, white on semi-transparent black
- Estimated final length: ~5:30
Approve? (y/n)
You approve. The agent generates the EDL, renders with ffmpeg, and self-evaluates.
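For readers who have not worked with an Edit Decision List, the sketch below shows the general idea: an ordered list of source ranges plus per-segment settings. The field names, the example timestamps, and the JSON layout are hypothetical and chosen for illustration; the real edl.json schema is defined by the skill.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Segment:
    source: str           # file the segment is cut from
    start: float          # seconds into the source
    end: float            # seconds into the source
    grade: str = "warm cinematic"

    def __post_init__(self) -> None:
        if self.end <= self.start:
            raise ValueError(f"empty segment in {self.source}")


def total_duration(segments: List[Segment]) -> float:
    """Length of the final cut, ignoring any transitions."""
    return sum(s.end - s.start for s in segments)


edl = [
    Segment("take_02.mp4", 3.2, 41.0),     # opening hook
    Segment("take_01.mp4", 55.5, 230.0),   # middle explanation
    Segment("take_02.mp4", 120.0, 165.5),  # closing CTA
]
print(json.dumps([asdict(s) for s in edl], indent=2))
print(f"final length: {total_duration(edl):.1f}s")
```

Because the list is plain data, it can be reviewed or hand-edited before a re-render.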
Step 4 — Self-evaluation
The agent runs timeline_view on the rendered output at every cut boundary. It checks for:
- Visual jumps — a hard cut where the speaker's position shifts noticeably
- Audio pops — a cut that lands mid-syllable or on a plosive
- Subtitle overflow — text that bleeds past the cut point
- Color mismatches — adjacent segments with noticeably different grades
If any issue is found, the agent fixes and re-renders. Maximum three iterations. You see the preview only after it passes.
Step 5 — Output
The final video lands in the edit/ directory next to your sources:
~/videos/product-update/
├── take_01.mp4
├── take_02.mp4
└── edit/
    ├── final.mp4 ← your finished video
    ├── takes_packed.md ← transcript (for reference)
    ├── edl.json ← edit decision list
    └── timeline/ ← visual composites from self-eval
The skill directory stays clean. All outputs live in <videos_dir>/edit/.
Tuning the output
The default settings produce a clean, publishable video. But video-use is not a black box — every aspect is configurable through the session prompt.
Color grading
Three built-in looks, or any custom ffmpeg chain:
edit these into a launch video, color grade: neutral punch
Options:
- warm cinematic — warmer tones, slightly crushed blacks, gentle highlight roll-off (default)
- neutral punch — accurate colors, higher contrast, crisp highlights
- custom — pass any ffmpeg filter chain:
color grade: eq=brightness=0.05:contrast=1.1:saturation=1.2,hue=h=2
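A custom chain like the one above is an ordinary ffmpeg video filter string, so you can preview a grade on a single clip before handing it to the agent. The sketch below only builds the command; `grade_command` is a hypothetical helper, and how video-use passes the chain to ffmpeg internally is not something this example claims to reproduce.

```python
import shlex
from typing import List


def grade_command(src: str, dst: str, filter_chain: str) -> List[str]:
    """Return ffmpeg arguments that apply a video filter chain and copy the audio."""
    return ["ffmpeg", "-y", "-i", src, "-vf", filter_chain, "-c:a", "copy", dst]


chain = "eq=brightness=0.05:contrast=1.1:saturation=1.2,hue=h=2"
cmd = grade_command("take_01.mp4", "grade_preview.mp4", chain)
print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```

Building the command as a list avoids shell-quoting problems with the commas and colons in the filter string.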
Subtitles
Default: 2-word UPPERCASE chunks, white text on a semi-transparent black bar, centered bottom-third. To change:
edit these into a tutorial, subtitles: 3-word lowercase, position: bottom-center, font: Inter, size: 48
The subtitle system uses ffmpeg's subtitles filter with ASS formatting. Any ASS-style parameter works.
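The N-word chunking can be sketched as follows. The word shape (dicts with `text`, `start`, and `end`) and both function names are assumptions for illustration; the timestamp formatter follows the ASS convention of H:MM:SS.cc with centisecond precision.

```python
from typing import Dict, List, Tuple


def chunk_words(words: List[Dict], size: int = 2, upper: bool = True) -> List[Tuple[float, float, str]]:
    """Group words into fixed-size subtitle chunks of (start, end, text)."""
    chunks = []
    for i in range(0, len(words), size):
        group = words[i:i + size]
        text = " ".join(word["text"] for word in group)
        text = text.upper() if upper else text.lower()
        chunks.append((group[0]["start"], group[-1]["end"], text))
    return chunks


def ass_time(seconds: float) -> str:
    """Format seconds as an ASS timestamp, e.g. 0:01:15.50."""
    cs = round(seconds * 100)
    hours, cs = divmod(cs, 360_000)
    minutes, cs = divmod(cs, 6_000)
    secs, cs = divmod(cs, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


words = [
    {"text": "Welcome", "start": 0.0, "end": 0.4},
    {"text": "back", "start": 0.45, "end": 0.8},
    {"text": "everyone", "start": 0.85, "end": 1.3},
]
for start, end, text in chunk_words(words):
    print(f"Dialogue: 0,{ass_time(start)},{ass_time(end)},Default,,0,0,0,,{text}")
```

Switching to the tutorial style from the prompt above would be `chunk_words(words, size=3, upper=False)`.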
Animation overlays
This is where video-use gets interesting for product videos and content marketing. The skill can spawn parallel sub-agents to generate animation overlays — one per animation — using four backends:
| Backend | Best for | Output |
|---|---|---|
| HyperFrames | Animated text, lower-thirds, callouts, branded intros | HTML/CSS/JS → PNG sequence or MP4 |
| Remotion | Data-driven visualizations, charts, countdown timers | React → MP4 |
| Manim | Mathematical animations, diagrams, step-by-step explanations | Python → MP4 |
| PIL | Static overlays, watermarks, simple text cards | Python → PNG |
Example prompt:
edit these into a product demo
- add a lower-third with my name "Max Jafar" and title "Head of Growth" at 0:15
- add an animated callout "100k users" at 2:30 using hyperframes
- add a countdown timer overlay for the last 10 seconds
The agent spawns one sub-agent per animation. Each sub-agent generates its overlay independently, then the main agent composites them onto the video during the render stage.
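The fan-out pattern described here, one worker per overlay with results collected before compositing, can be sketched with a thread pool. This only illustrates the concurrency shape: the request dicts, `render_overlays`, and `fake_worker` are hypothetical, and real sub-agents are agent sessions rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def render_overlays(requests: List[Dict], worker: Callable[[Dict], str]) -> List[str]:
    """Run one worker per overlay in parallel; results keep the request order."""
    if not requests:
        return []
    with ThreadPoolExecutor(max_workers=len(requests)) as pool:
        return list(pool.map(worker, requests))


def fake_worker(request: Dict) -> str:
    """Stand-in for a sub-agent: pretend to render and return the overlay path."""
    return f"overlays/{request['name']}.mp4"


requests = [
    {"name": "lower_third", "backend": "hyperframes", "at": "0:15"},
    {"name": "callout_100k", "backend": "hyperframes", "at": "2:30"},
    {"name": "countdown", "backend": "remotion", "at": "last 10s"},
]
print(render_overlays(requests, fake_worker))
```

Keeping results in request order means the compositing step can pair each overlay with its timestamp without extra bookkeeping.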
Session memory
video-use persists a project.md file in the edit/ directory. This file remembers:
- The strategy you approved
- The color grade and subtitle settings
- Which takes were used and which were discarded
- Any custom instructions you gave
Next time you drop footage in the same folder, the agent picks up where it left off:
edit these new takes, same style as last time
No re-explaining. No re-configuring.
Real-world use cases
Talking head / LinkedIn video
Raw: 3 takes of a 60-second LinkedIn thought leadership clip. Agent removes filler words, picks the strongest take for each segment, adds a branded lower-third via HyperFrames, burns subtitles, color grades warm cinematic. Output: 55 seconds, publish-ready.
Tutorial / course material
Raw: 4 screen recordings + 1 webcam track. Agent cuts dead air between steps, merges the best explanations across takes, adds Manim animations for key concepts (e.g., architecture diagrams that build step-by-step), burns step-numbered subtitles. Output: 12-minute lesson, ready for a course platform.
Social media clip
Raw: 1 longer interview. Agent identifies the most quotable 45-second segment, crops to 9:16 vertical, adds animated captions with HyperFrames, applies a punchy color grade. Output: TikTok/Reels-ready clip.
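The 9:16 crop is a small calculation on top of ffmpeg's `crop` filter (`crop=w:h:x:y`). The sketch below assumes a centered crop and rounds the width down to an even number, since many encoders reject odd dimensions; `vertical_crop` is a hypothetical helper, and the agent may well pick an off-center window that follows the speaker.

```python
def vertical_crop(width: int, height: int) -> str:
    """Build a centered 9:16 crop filter for a landscape frame."""
    crop_w = int(height * 9 / 16)
    crop_w -= crop_w % 2  # keep the width even for encoder compatibility
    if crop_w > width:
        raise ValueError("frame is already narrower than 9:16")
    x = (width - crop_w) // 2
    return f"crop={crop_w}:{height}:{x}:0"


print(vertical_crop(1920, 1080))
```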
Product demo
Raw: 2 screen recordings + voiceover. Agent synchronizes the voiceover to the screen actions, cuts redundant clicks, adds a callout animation at the key feature reveal, burns subtitles for accessibility. Output: 90-second demo, ready for a landing page.
How video-use compares to traditional editors
| | video-use + OpenClaw | CapCut / Premiere / DaVinci |
|---|---|---|
| Cuts | Agent identifies filler words, dead air, and retakes automatically from transcript | Manual review, mark in/out points |
| Color grading | One prompt, applied uniformly or per-segment | Manual per-clip grading or LUT application |
| Subtitles | Auto-generated from transcript, burned in, style via prompt | Auto-generate then manually fix and restyle |
| Animations | Sub-agents generate overlays in parallel | Manual creation in After Effects or similar |
| Iteration | Self-evaluates and fixes before showing you | You review, fix, re-render, repeat |
| Batch editing | Point at a folder, walk away | One project at a time |
| Creative control | 12 hard rules + artistic freedom; you guide via prompt | Full manual control |
| Speed | ~3-5 minutes for a 10-minute video (after transcription) | 1-3 hours for the same |
The tradeoff: traditional editors give you frame-level control. video-use gives you speed and consistency. For most content — talking heads, tutorials, social clips, product demos — the agent's cuts are as good as a human editor's first pass. You trade the last 10% of polish for 90% of the time saved.
Requirements and pricing
| Component | Requirement | Cost |
|---|---|---|
| OpenClaw | Running agent with shell access | Free (self-hosted) or GolemWorkers plan |
| video-use skill | Cloned and linked | Free (MIT license) |
| ffmpeg | System binary | Free |
| ElevenLabs API | Scribe transcription | Free tier (10 min/month), Creator $22/month (1,000 min) |
| HyperFrames (optional) | For animation overlays | Free (open source) |
| Remotion (optional) | For data-driven animations | Free (dev), $50/mo (company) |
| Manim (optional) | For math/diagram animations | Free (open source) |
The only hard cost is ElevenLabs Scribe. Everything else is free or optional.
FAQ
Do I need to watch the video before editing?
No. The agent reads the transcript, not the video. You describe what you want ("edit into a launch video", "cut filler words and add subtitles"), approve the strategy, and the agent handles the rest. You review the final output.
Can I use video-use without OpenClaw?
Yes. video-use works with Claude Code, Codex, and any agent with shell access. OpenClaw adds the benefit of running as an always-on agent — you can trigger edits from Telegram, Slack, or a cron schedule without opening a terminal.
How accurate are the AI cuts?
The transcript-driven approach means cut precision is at the word level — typically within 30ms. The 30ms audio fades at every cut eliminate pops. The self-eval loop catches visual issues that transcript-only editing would miss (e.g., a cut where the speaker's head position jumps).
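Word-level cutting can be illustrated with a few lines of code: drop the filler words and keep the spans in between. The word shape, the filler list, and `keep_ranges` are assumptions for this sketch; the actual skill lets the LLM reason over the transcript rather than applying a fixed word list.

```python
from typing import Dict, FrozenSet, List, Tuple

FILLERS: FrozenSet[str] = frozenset({"um", "uh"})


def keep_ranges(words: List[Dict], fillers: FrozenSet[str] = FILLERS) -> List[Tuple[float, float]]:
    """Return (start, end) spans covering consecutive non-filler words."""
    ranges: List[Tuple[float, float]] = []
    current = None
    for word in words:
        if word["text"].lower().strip(".,!?") in fillers:
            if current is not None:
                ranges.append((current[0], current[1]))  # close the span before the filler
                current = None
            continue
        if current is None:
            current = [word["start"], word["end"]]
        else:
            current[1] = word["end"]  # extend the span to this word
    if current is not None:
        ranges.append((current[0], current[1]))
    return ranges


words = [
    {"text": "So", "start": 0.0, "end": 0.2},
    {"text": "um,", "start": 0.3, "end": 0.6},
    {"text": "today", "start": 0.7, "end": 1.1},
    {"text": "we", "start": 1.15, "end": 1.3},
    {"text": "uh", "start": 1.4, "end": 1.7},
    {"text": "ship", "start": 1.8, "end": 2.2},
]
print(keep_ranges(words))
```

Each span boundary sits on a word timestamp, which is why cut precision tracks the transcription's timing accuracy.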
Can I edit 4K footage?
Yes. ffmpeg handles 4K natively. Transcription time scales with audio length, not resolution. Render time scales with resolution and effects — a 10-minute 4K video with color grading and subtitles takes ~5-8 minutes on a modern machine.
What if I do not like the agent's cut strategy?
Reject it. The agent proposes a strategy and waits for your approval before executing. You can also give specific instructions: "use only take_02", "cut everything before the word 'welcome'", "keep all the bloopers in a separate outtakes file."
Does video-use work with multiple speakers?
Yes. ElevenLabs Scribe provides speaker diarization. The agent can cut to a specific speaker, balance audio levels between speakers, and label each speaker in the subtitles.
Common pitfalls
- Skipping the strategy approval step — The agent always proposes before executing. If you auto-approve without reading, you may get cuts you did not want. Take 30 seconds to review the proposal.
- Using low-quality source audio — Scribe is good, but garbage in means garbage out. Record in a quiet room with a decent microphone. Background music is fine; background HVAC noise degrades transcription quality.
- Expecting Hollywood-level color grading — The built-in looks are clean and professional. They are not a substitute for a colorist. For most content marketing, they are more than enough.
- Forgetting to install yt-dlp — If you want to pull source material from YouTube or Vimeo, you need yt-dlp. It is optional but commonly needed.
- Not using session memory — The project.md file is there for a reason. If you edit the same type of video regularly (weekly podcast, daily social clip), let the agent remember your settings instead of re-specifying them every time.