2026-06-24

Automate Video Editing With OpenClaw (2026)

Automate video editing with OpenClaw and video-use: AI-driven cuts, color grading, subtitles, and animation overlays from raw footage to final.mp4.

2026-06-23

A practical guide to the video-use skill: drop raw footage into a folder, tell your agent to edit, and get a polished final.mp4 back — cuts, color grade, subtitles, and animation overlays included.

Video editing is the kind of task that should not take hours. You record a talking head, a tutorial, a product demo. The raw footage has filler words, dead air, inconsistent color, no subtitles. Turning it into something publishable means: review every take, mark cut points, apply color correction, generate and burn subtitles, add lower-thirds or animations, render, review, fix, render again. A 10-minute video eats 2-3 hours of manual work.

video-use is an open-source skill that flips this: you drop raw footage into a folder, tell your agent what you want, and the agent produces final.mp4. It works with Claude Code, Codex, and OpenClaw. Animation overlays are built with HyperFrames, Remotion, Manim, or PIL, each one by its own sub-agent running in parallel.

This tutorial shows how to install the skill in OpenClaw, run a real editing workflow, and tune the output for different content types — talking heads, tutorials, social clips, and course material.

What video-use does

video-use is a skill, not a GUI editor. There is no timeline, no preview window, no drag-and-drop. You talk to your agent; the agent reads the footage, decides where to cut, renders the result, self-evaluates, and hands you the file.

The pipeline has six stages:

| Stage | What happens | Tool |
| --- | --- | --- |
| Transcribe | One ElevenLabs Scribe call per source file → word-level timestamps, speaker diarization, audio events | ElevenLabs API |
| Pack | All takes compressed into a single ~12KB `takes_packed.md`, the agent's reading view | Python helper |
| Reason | The LLM reads the transcript, identifies filler words, dead space, retakes, and proposes a cut strategy | LLM (you approve) |
| EDL | The agent generates an Edit Decision List: cut points, color grade per segment, subtitle chunks | Python helper |
| Render | ffmpeg executes the EDL: cuts, color grading, 30ms audio fades at every cut, burned subtitles | ffmpeg |
| Self-eval | The agent runs `timeline_view` on the rendered output at every cut boundary to catch visual jumps, audio pops, hidden subtitles | Python helper |

The self-eval loop is the part that matters. The agent does not hand you a first draft and call it done. It renders, inspects the output at every cut boundary, and if something is wrong — a visual jump, an audio pop, a subtitle that bleeds — it fixes and re-renders. Maximum three iterations. You see the preview only after it passes.
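The control flow of that loop can be sketched in a few lines. This is an illustration only: `render`, `inspect`, and `fix` are hypothetical callables standing in for the skill's real helpers, and the only detail taken from the article is the three-iteration limit.

```python
from typing import Callable

MAX_ITERATIONS = 3  # the limit stated in the article


def render_with_self_eval(
    render: Callable[[dict], str],
    inspect: Callable[[str], list[str]],
    fix: Callable[[dict, list[str]], dict],
    edl: dict,
) -> tuple[str, list[str]]:
    """Render, inspect, and repair up to MAX_ITERATIONS times.

    Returns the last rendered path and any issues still open.
    """
    output, issues = "", []
    for _ in range(MAX_ITERATIONS):
        output = render(edl)
        issues = inspect(output)
        if not issues:
            break  # clean render: stop early
        edl = fix(edl, issues)
    return output, issues


# Stub callables (hypothetical) so the sketch runs without ffmpeg.
def fake_render(edl: dict) -> str:
    return f"final_v{edl['version']}.mp4"


def fake_inspect(path: str) -> list[str]:
    return ["audio pop at 00:42"] if path == "final_v1.mp4" else []


def fake_fix(edl: dict, issues: list[str]) -> dict:
    return {**edl, "version": edl["version"] + 1}


path, open_issues = render_with_self_eval(
    fake_render, fake_inspect, fake_fix, {"version": 1}
)
print(path, open_issues)
```

Passing the three steps in as callables keeps the loop testable on its own; with the stubs above, the first render fails inspection and the second passes.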

Why transcript-driven, not frame-dumping

A naive approach to AI video editing would dump every frame and feed it to the LLM. A 10-minute video at 30fps is 18,000 frames. At ~1,500 tokens per frame, that is 27 million tokens of noise. Expensive, slow, and unreliable.
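The arithmetic behind that estimate is easy to reproduce. The 1,500 tokens-per-frame figure is the article's rough number, not a measured value, and the function name is only illustrative.

```python
# Back-of-the-envelope cost of feeding every frame to an LLM.
# tokens_per_frame is the article's rough figure, not a measurement.

def frame_dump_tokens(minutes: float, fps: int, tokens_per_frame: int) -> int:
    """Total tokens if every frame of the video were sent to the model."""
    frames = int(minutes * 60 * fps)
    return frames * tokens_per_frame


total = frame_dump_tokens(minutes=10, fps=30, tokens_per_frame=1_500)
print(f"{total:,} tokens")  # 27,000,000 tokens
```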

video-use does the opposite. The LLM never watches the video. It reads it — through two layers:

Audio transcript (always loaded). One ElevenLabs Scribe call per source gives word-level timestamps, speaker diarization, and audio events ((laughter), (applause), (sigh)). All takes pack into a single ~12KB takes_packed.md — the LLM's primary reading view.
Visual composite (on demand). timeline_view produces a filmstrip + waveform + word labels PNG for any time range. Called only at decision points — ambiguous pauses, retake comparisons, cut-point sanity checks.

The result: 12KB of text plus a handful of PNGs instead of millions of tokens. This is the same principle that browser-use applied to web automation (structured DOM instead of screenshots), now applied to video.
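To make the packing idea concrete, here is a minimal sketch that collapses word-level timestamps into one timestamped line per phrase. The article does not show the real layout of takes_packed.md, so the `Word` type, the `pack_take` function, the line format, and the 0.5-second phrase break are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def _format_phrase(phrase: list[Word]) -> str:
    text = " ".join(w.text for w in phrase)
    return f"[{phrase[0].start:.2f}-{phrase[-1].end:.2f}] {text}"


def pack_take(name: str, words: list[Word], max_gap: float = 0.5) -> str:
    """Collapse word-level timestamps into one timestamped line per phrase.

    A new phrase starts whenever the silence between two words exceeds max_gap.
    """
    lines = [f"## {name}"]
    phrase: list[Word] = []
    for word in words:
        if phrase and word.start - phrase[-1].end > max_gap:
            lines.append(_format_phrase(phrase))
            phrase = []
        phrase.append(word)
    if phrase:
        lines.append(_format_phrase(phrase))
    return "\n".join(lines)


words = [
    Word("So", 0.00, 0.20), Word("um", 0.25, 0.50),
    Word("welcome", 1.40, 1.90), Word("back", 1.95, 2.30),
]
packed = pack_take("take_01.mp4", words)
print(packed)
```

A few dozen bytes per phrase is what keeps several minutes of footage within a text budget the model can read in one pass.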

Setup: install the video-use skill in OpenClaw

Total time: under 10 minutes. You need an OpenClaw workspace, ffmpeg, and an ElevenLabs API key.

Step 1 — Clone and link the skill

git clone https://github.com/browser-use/video-use ~/skills/video-use
ln -sfn ~/skills/video-use ~/.openclaw/workspace/.agents/skills/video-use

If your OpenClaw skills directory lives elsewhere, adjust the link path (the second argument to ln); the first argument stays pointed at the cloned repo. What matters is that OpenClaw can discover the SKILL.md file inside the video-use directory.
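A quick way to confirm the link resolves is to check for SKILL.md through it. This is a sketch, not part of the skill: the function name is hypothetical, and the skills path is simply the one used in the ln command above.

```python
from pathlib import Path


def skill_is_discoverable(skills_dir: Path, skill_name: str) -> bool:
    """True if <skills_dir>/<skill_name>/SKILL.md resolves to a real file.

    Path.is_file() follows symlinks, so a dangling link returns False.
    """
    return (skills_dir / skill_name / "SKILL.md").is_file()


# Path taken from the ln command above; adjust if your workspace differs.
skills_dir = Path.home() / ".openclaw" / "workspace" / ".agents" / "skills"
print(skill_is_discoverable(skills_dir, "video-use"))
```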

Step 2 — Install dependencies

cd ~/skills/video-use
uv sync          # or: pip install -e .

You also need ffmpeg on the system:

# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

Optional but useful: yt-dlp for downloading source material from YouTube or other platforms.

brew install yt-dlp    # macOS
pip install yt-dlp     # Linux

Step 3 — Add your ElevenLabs API key

cp ~/skills/video-use/.env.example ~/skills/video-use/.env

Edit .env:

ELEVENLABS_API_KEY=your_key_here

Get a key at elevenlabs.io/app/settings/api-keys. The free tier includes transcription minutes; for regular use, the Creator plan ($22/month) covers most editing workflows.
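Before starting a session, it can help to confirm that the key was actually filled in rather than left as the placeholder. The sketch below parses simple KEY=VALUE lines by hand; the function names are hypothetical, and it assumes the .env file uses no multi-line values or shell expansion.

```python
from pathlib import Path
from typing import Optional


def read_env_value(env_text: str, key: str) -> Optional[str]:
    """Return the value for key from simple KEY=VALUE lines, or None if absent."""
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip() == key:
            return value.strip().strip('"').strip("'")
    return None


def key_looks_configured(env_text: str) -> bool:
    """False when the key is missing, empty, or still the placeholder."""
    value = read_env_value(env_text, "ELEVENLABS_API_KEY")
    return bool(value) and value != "your_key_here"


env_file = Path.home() / "skills" / "video-use" / ".env"
if env_file.exists():
    print(key_looks_configured(env_file.read_text()))
```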

Step 4 — Verify the skill is loaded

Start an OpenClaw session and check that the skill appears:

openclaw skills check

You should see video-use listed with a green check. If not, verify the symlink and restart OpenClaw.

The quick setup prompt

If you want the agent to handle the install for you, paste this into an OpenClaw session:

Set up https://github.com/browser-use/video-use for me.

          Read install.md first to install this repo, wire up ffmpeg, register the skill with whichever agent you're running under, and set up the ElevenLabs API key — ask me to paste it when you need it. Then read SKILL.md for daily usage, and always read helpers/ because that's where the editing scripts live. After install, don't transcribe anything on your own — just tell me it's ready and wait for me to drop footage into a folder.

The agent will clone the repo, install deps, prompt you once for the ElevenLabs key, and confirm when ready.

Workflow: from raw footage to final.mp4

Here is a real session. We will edit a 5-minute talking-head video — the kind you record for a product update or a LinkedIn post.

Step 1 — Drop your footage

Put your raw recordings in a folder:

mkdir -p ~/videos/product-update
cp ~/Downloads/take_01.mp4 ~/Downloads/take_02.mp4 ~/videos/product-update/

The skill works with any video format ffmpeg can read, including MP4, MOV, MKV, and WebM. Multiple takes are fine; the agent identifies the best segments across all sources.
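If you want to see which files in a folder count as sources before you start, a small listing helper is enough. This is an assumption-laden sketch: the extension set covers only the four formats named above, and the skill's own inventory logic is not shown in the article.

```python
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".webm"}  # the formats named above


def find_sources(folder: Path) -> list[Path]:
    """Top-level video files in folder, sorted by name.

    Subfolders such as edit/ are skipped because only files are kept.
    """
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
    )


for path in find_sources(Path.cwd()):
    print(path.name)
```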

Step 2 — Start the session

cd ~/videos/product-update
openclaw

In the session:

edit these into a launch video

That is it. The agent takes over from here.

Step 3 — What the agent does

First, it inventories the sources:

Found 2 sources:
  - take_01.mp4 (4:32, 1920×1080, 30fps)
  - take_02.mp4 (2:48, 1920×1080, 30fps)

Transcribing take_01.mp4...
Transcribing take_02.mp4...

Then it proposes a strategy:

Strategy proposal:
  - Remove filler words (um, uh, false starts): ~23 instances
  - Cut dead air between takes: ~40 seconds total
  - Use take_02 for the opening hook (stronger energy)
  - Merge take_01 middle section (clearer explanation)
  - Close with take_02 ending (cleaner CTA)
  - Color grade: warm cinematic
  - Subtitles: 2-word UPPERCASE, white on semi-transparent black

  - Estimated final length: ~5:30

Approve? (y/n)

You approve. The agent generates the EDL, renders with ffmpeg, and self-evaluates.
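To show how an EDL can turn into an ffmpeg invocation, the sketch below builds a -filter_complex graph that trims each segment, applies a 30ms audio fade on both sides of every cut, and concatenates the result. The `Segment` structure and the example timings are hypothetical (the article does not show the edl.json schema); the ffmpeg filters used (trim, atrim, setpts, asetpts, afade, concat) are standard ones. The command is only assembled here, not executed.

```python
from dataclasses import dataclass

FADE = 0.03  # 30 ms audio fade at each cut, per the pipeline table


@dataclass
class Segment:
    source: int   # index of the input file on the ffmpeg command line
    start: float  # seconds into that source
    end: float    # seconds into that source


def build_filter_complex(segments: list[Segment]) -> str:
    """Build an ffmpeg -filter_complex string that trims, fades, and concatenates."""
    parts: list[str] = []
    labels: list[str] = []
    for i, seg in enumerate(segments):
        duration = seg.end - seg.start
        fade_out_at = max(duration - FADE, 0.0)
        parts.append(
            f"[{seg.source}:v]trim=start={seg.start}:end={seg.end},"
            f"setpts=PTS-STARTPTS[v{i}]"
        )
        parts.append(
            f"[{seg.source}:a]atrim=start={seg.start}:end={seg.end},"
            f"asetpts=PTS-STARTPTS,"
            f"afade=t=in:st=0:d={FADE},"
            f"afade=t=out:st={fade_out_at:.3f}:d={FADE}[a{i}]"
        )
        labels.append(f"[v{i}][a{i}]")
    joined = "".join(labels)
    parts.append(f"{joined}concat=n={len(segments)}:v=1:a=1[v][a]")
    return ";".join(parts)


# Hypothetical EDL: open with take_02 (input 1), continue with take_01 (input 0).
edl = [Segment(source=1, start=0.0, end=12.5), Segment(source=0, start=30.0, end=95.0)]
graph = build_filter_complex(edl)
command = [
    "ffmpeg", "-i", "take_01.mp4", "-i", "take_02.mp4",
    "-filter_complex", graph, "-map", "[v]", "-map", "[a]", "final.mp4",
]
print(graph)
```

Resetting timestamps with setpts and asetpts after each trim is what lets the fade offsets be expressed relative to the start of the segment rather than the source file.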

Step 4 — Self-evaluation

The agent runs timeline_view on the rendered output at every cut boundary. It checks for:

Visual jumps — a hard cut where the speaker's position shifts noticeably
Audio pops — a cut that lands mid-syllable or on a plosive
Subtitle overflow — text that bleeds past the cut point
Color mismatches — adjacent segments with noticeably different grades

If any issue is found, the agent fixes and re-renders. Maximum three iterations. You see the preview only after it passes.
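One of these checks, subtitle overflow, reduces to interval arithmetic once cut times and subtitle windows share the output timeline. The sketch below is an assumption about how such a check could work, not the skill's implementation; `Chunk` and `chunks_crossing_cuts` are hypothetical names.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    start: float  # seconds on the output timeline
    end: float    # seconds on the output timeline
    text: str


def chunks_crossing_cuts(chunks: list[Chunk], cut_times: list[float]) -> list[Chunk]:
    """Subtitle chunks whose display window strictly contains a cut time."""
    return [c for c in chunks if any(c.start < t < c.end for t in cut_times)]


chunks = [
    Chunk(0.0, 1.2, "WELCOME BACK"),
    Chunk(1.2, 2.6, "TO THE"),
    Chunk(2.6, 3.4, "LAUNCH"),
]
overflowing = chunks_crossing_cuts(chunks, cut_times=[1.2, 3.0])
print(overflowing)
```

A chunk that ends exactly on a cut is fine; only a chunk that is still on screen when the cut lands is flagged.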

Step 5 — Output

The final video lands in the edit/ directory next to your sources:

~/videos/product-update/
  ├── take_01.mp4
  ├── take_02.mp4
  └── edit/
      ├── final.mp4          ← your finished video
      ├── takes_packed.md    ← transcript (for reference)
      ├── edl.json           ← edit decision list
      └── timeline/          ← visual composites from self-eval

The skill directory stays clean. All outputs live in <videos_dir>/edit/.

Tuning the output

The default settings produce a clean, publishable video. But video-use is not a black box — every aspect is configurable through the session prompt.

Color grading

Three built-in looks, or any custom ffmpeg chain:

edit these into a launch video, color grade: neutral punch

Options:

warm cinematic — warmer tones, slightly crushed blacks, gentle highlight roll-off (default)
neutral punch — accurate colors, higher contrast, crisp highlights
custom — pass any ffmpeg filter chain: color grade: eq=brightness=0.05:contrast=1.1:saturation=1.2,hue=h=2

Subtitles

Default: 2-word UPPERCASE chunks, white text on a semi-transparent black bar, centered bottom-third. To change:

edit these into a tutorial, subtitles: 3-word lowercase, position: bottom-center, font: Inter, size: 48

The subtitle system uses ffmpeg's subtitles filter with ASS formatting. Any ASS-style parameter works.
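The default chunking (two words, uppercase) is straightforward to sketch from word-level timestamps. For brevity this example emits SRT rather than ASS, which is my substitution and not necessarily what the skill writes; the word tuples and function names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def chunk_words(words, size=2, uppercase=True):
    """Group (text, start, end) words into (start, end, text) subtitle chunks."""
    chunks = []
    for i in range(0, len(words), size):
        group = words[i:i + size]
        text = " ".join(w[0] for w in group)
        chunks.append((group[0][1], group[-1][2], text.upper() if uppercase else text))
    return chunks


def to_srt(chunks) -> str:
    """Serialize chunks as numbered SRT cues."""
    blocks = []
    for n, (start, end, text) in enumerate(chunks, start=1):
        blocks.append(f"{n}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"


words = [("welcome", 1.40, 1.90), ("back", 1.95, 2.30), ("everyone", 2.35, 2.90)]
print(to_srt(chunk_words(words)))
```

Switching to three-word lowercase chunks, as in the prompt above, is `chunk_words(words, size=3, uppercase=False)`.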

Animation overlays

This is where video-use gets interesting for product videos and content marketing. The skill can spawn parallel sub-agents to generate animation overlays — one per animation — using four backends:

| Backend | Best for | Output |
| --- | --- | --- |
| HyperFrames | Animated text, lower-thirds, callouts, branded intros | HTML/CSS/JS → PNG sequence or MP4 |
| Remotion | Data-driven visualizations, charts, countdown timers | React → MP4 |
| Manim | Mathematical animations, diagrams, step-by-step explanations | Python → MP4 |
| PIL | Static overlays, watermarks, simple text cards | Python → PNG |

Example prompt:

edit these into a product demo
- add a lower-third with my name "Max Jafar" and title "Head of Growth" at 0:15
- add an animated callout "100k users" at 2:30 using hyperframes
- add a countdown timer overlay for the last 10 seconds

The agent spawns one sub-agent per animation. Each sub-agent generates its overlay independently, then the main agent composites them onto the video during the render stage.
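The fan-out and gather pattern described here can be approximated with a thread pool. This is only an analogy for how independent overlay jobs run side by side: `render_overlay` is a placeholder, the job dictionaries and output paths are hypothetical, and real sub-agents are separate agent sessions rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def render_overlay(job: dict) -> str:
    """Placeholder for one sub-agent; a real one would call its backend here."""
    return f"overlays/{job['backend']}_{job['at'].replace(':', '-')}.mp4"


# One job per requested animation, mirroring the example prompt above.
jobs = [
    {"backend": "hyperframes", "at": "0:15", "text": "Max Jafar, Head of Growth"},
    {"backend": "hyperframes", "at": "2:30", "text": "100k users"},
    {"backend": "remotion", "at": "end-0:10", "text": "countdown"},
]

with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    # map() returns results in input order even though jobs finish independently
    overlay_paths = list(pool.map(render_overlay, jobs))

print(overlay_paths)
```

Because results come back in job order, the compositing step can pair each overlay with its timestamp without extra bookkeeping.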

Session memory

video-use persists a project.md file in the edit/ directory. This file remembers:

The strategy you approved
The color grade and subtitle settings
Which takes were used and which were discarded
Any custom instructions you gave

Next time you drop footage in the same folder, the agent picks up where it left off:

edit these new takes, same style as last time

No re-explaining. No re-configuring.

Real-world use cases

Talking head / LinkedIn video

Raw: 3 takes of a 60-second LinkedIn thought leadership clip. Agent removes filler words, picks the strongest take for each segment, adds a branded lower-third via HyperFrames, burns subtitles, color grades warm cinematic. Output: 55 seconds, publish-ready.

Tutorial / course material

Raw: 4 screen recordings + 1 webcam track. Agent cuts dead air between steps, merges the best explanations across takes, adds Manim animations for key concepts (e.g., architecture diagrams that build step-by-step), burns step-numbered subtitles. Output: 12-minute lesson, ready for a course platform.

Social media clip

Raw: 1 longer interview. Agent identifies the most quotable 45-second segment, crops to 9:16 vertical, adds animated captions with HyperFrames, applies a punchy color grade. Output: TikTok/Reels-ready clip.
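The 9:16 crop is simple arithmetic on the source dimensions. The sketch below computes a centered, full-height crop window and formats it as an ffmpeg crop filter (crop=w:h:x:y). It assumes the source is wider than 9:16 and that a centered crop is acceptable; the function name is hypothetical.

```python
def center_crop_9x16(width: int, height: int) -> str:
    """ffmpeg crop filter for a centered 9:16 window at full height."""
    crop_w = int(height * 9 / 16)
    crop_w -= crop_w % 2  # round down to an even width, which many encoders need
    x = (width - crop_w) // 2
    return f"crop={crop_w}:{height}:{x}:0"


print(center_crop_9x16(1920, 1080))  # crop=606:1080:657:0
```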

Product demo

Raw: 2 screen recordings + voiceover. Agent synchronizes the voiceover to the screen actions, cuts redundant clicks, adds a callout animation at the key feature reveal, burns subtitles for accessibility. Output: 90-second demo, ready for a landing page.

How video-use compares to traditional editors

| Aspect | video-use + OpenClaw | CapCut / Premiere / DaVinci |
| --- | --- | --- |
| Cuts | Agent identifies filler words, dead air, and retakes automatically from transcript | Manual review, mark in/out points |
| Color grading | One prompt, applied uniformly or per-segment | Manual per-clip grading or LUT application |
| Subtitles | Auto-generated from transcript, burned in, style via prompt | Auto-generate then manually fix and restyle |
| Animations | Sub-agents generate overlays in parallel | Manual creation in After Effects or similar |
| Iteration | Self-evaluates and fixes before showing you | You review, fix, re-render, repeat |
| Batch editing | Point at a folder, walk away | One project at a time |
| Creative control | 12 hard rules + artistic freedom; you guide via prompt | Full manual control |
| Speed | ~3-5 minutes for a 10-minute video (after transcription) | 1-3 hours for the same |

The tradeoff: traditional editors give you frame-level control. video-use gives you speed and consistency. For most content — talking heads, tutorials, social clips, product demos — the agent's cuts are as good as a human editor's first pass. You trade the last 10% of polish for 90% of the time saved.

Requirements and pricing

| Component | Requirement | Cost |
| --- | --- | --- |
| OpenClaw | Running agent with shell access | Free (self-hosted) or GolemWorkers plan |
| video-use skill | Cloned and linked | Free (MIT license) |
| ffmpeg | System binary | Free |
| ElevenLabs API | Scribe transcription | Free tier (10 min/month), Creator $22/month (1,000 min) |
| HyperFrames (optional) | For animation overlays | Free (open source) |
| Remotion (optional) | For data-driven animations | Free (dev), $50/mo (company) |
| Manim (optional) | For math/diagram animations | Free (open source) |

The only unavoidable cost is ElevenLabs Scribe once you exceed the free tier. Everything else is free or optional.

FAQ

Do I need to watch the video before editing?

No. The agent reads the transcript, not the video. You describe what you want ("edit into a launch video", "cut filler words and add subtitles"), approve the strategy, and the agent handles the rest. You review the final output.

Can I use video-use without OpenClaw?

Yes. video-use works with Claude Code, Codex, and any agent with shell access. OpenClaw adds the benefit of running as an always-on agent — you can trigger edits from Telegram, Slack, or a cron schedule without opening a terminal.

How accurate are the AI cuts?

The transcript-driven approach means cut precision is at the word level — typically within 30ms. The 30ms audio fades at every cut eliminate pops. The self-eval loop catches visual issues that transcript-only editing would miss (e.g., a cut where the speaker's head position jumps).
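Word-level precision follows from how cuts are derived: each kept range starts and ends on a word boundary. The sketch below turns word timestamps into keep ranges by dropping filler words and splitting on long silences. The filler list, the 0.6-second gap threshold, and the function name are assumptions; the skill's real cut logic is decided by the LLM and is not shown in the article.

```python
FILLERS = {"um", "uh"}  # hypothetical filler list


def keep_ranges(words, max_gap=0.6):
    """Merge (text, start, end) words into (start, end) ranges worth keeping.

    A new range starts after any filler word or any silence longer than max_gap.
    """
    ranges = []
    force_new = True
    for text, start, end in words:
        if text.lower().strip(",.!?") in FILLERS:
            force_new = True  # the filler itself must fall outside every range
            continue
        if not force_new and start - ranges[-1][1] <= max_gap:
            ranges[-1][1] = end  # close enough: extend the current range
        else:
            ranges.append([start, end])
        force_new = False
    return [(s, e) for s, e in ranges]


words = [
    ("So", 0.0, 0.2), ("um", 0.25, 0.5), ("welcome", 0.6, 1.0),
    ("back", 1.05, 1.4), ("today", 3.0, 3.4),
]
print(keep_ranges(words))  # [(0.0, 0.2), (0.6, 1.4), (3.0, 3.4)]
```

Forcing a new range after a filler matters: without it, two kept words on either side of a short "um" would merge and the filler would survive the cut.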

Can I edit 4K footage?

Yes. ffmpeg handles 4K natively. Transcription time scales with audio length, not resolution. Render time scales with resolution and effects — a 10-minute 4K video with color grading and subtitles takes ~5-8 minutes on a modern machine.

What if I do not like the agent's cut strategy?

Reject it. The agent proposes a strategy and waits for your approval before executing. You can also give specific instructions: "use only take_02", "cut everything before the word 'welcome'", "keep all the bloopers in a separate outtakes file."

Does video-use work with multiple speakers?

Yes. ElevenLabs Scribe provides speaker diarization. The agent can cut to a specific speaker, balance audio levels between speakers, and label each speaker in the subtitles.

Common pitfalls

Skipping the strategy approval step — The agent always proposes before executing. If you auto-approve without reading, you may get cuts you did not want. Take 30 seconds to review the proposal.
Using low-quality source audio — Scribe is good, but garbage in means garbage out. Record in a quiet room with a decent microphone. Background music is fine; background HVAC noise degrades transcription quality.
Expecting Hollywood-level color grading — The built-in looks are clean and professional. They are not a substitute for a colorist. For most content marketing, they are more than enough.
Forgetting to install yt-dlp — If you want to pull source material from YouTube or Vimeo, you need yt-dlp. It is optional but commonly needed.
Not using session memory — The project.md file is there for a reason. If you edit the same type of video regularly (weekly podcast, daily social clip), let the agent remember your settings instead of re-specifying them every time.
