2026-06-24

Automate Video Editing With OpenClaw (2026)

Automate video editing with OpenClaw and video-use: AI-driven cuts, color grading, subtitles, and animation overlays from raw footage to final.mp4.

2026-06-23

A practical guide to the video-use skill: drop raw footage into a folder, tell your agent to edit, and get a polished final.mp4 back — cuts, color grade, subtitles, and animation overlays included.


Video editing is the kind of task that should not take hours. You record a talking head, a tutorial, a product demo. The raw footage has filler words, dead air, inconsistent color, no subtitles. Turning it into something publishable means: review every take, mark cut points, apply color correction, generate and burn subtitles, add lower-thirds or animations, render, review, fix, render again. A 10-minute video eats 2-3 hours of manual work.

video-use is an open-source skill that flips this: you drop raw footage into a folder, tell your agent what you want, and the agent produces final.mp4. It works with Claude Code, Codex, and OpenClaw — and it uses HyperFrames, Remotion, Manim, and PIL for animation overlays, all spawned as parallel sub-agents.

This tutorial shows how to install the skill in OpenClaw, run a real editing workflow, and tune the output for different content types — talking heads, tutorials, social clips, and course material.

What video-use does

video-use is a skill, not a GUI editor. There is no timeline, no preview window, no drag-and-drop. You talk to your agent; the agent reads the footage, decides where to cut, renders the result, self-evaluates, and hands you the file.

The pipeline has six stages:

| Stage | What happens | Tool |
|---|---|---|
| Transcribe | One ElevenLabs Scribe call per source file → word-level timestamps, speaker diarization, audio events | ElevenLabs API |
| Pack | All takes compressed into a single ~12KB takes_packed.md, the agent's reading view | Python helper |
| Reason | The LLM reads the transcript, identifies filler words, dead space, and retakes, and proposes a cut strategy | LLM (you approve) |
| EDL | The agent generates an Edit Decision List: cut points, color grade per segment, subtitle chunks | Python helper |
| Render | ffmpeg executes the EDL: cuts, color grading, 30ms audio fades at every cut, burned subtitles | ffmpeg |
| Self-eval | The agent runs timeline_view on the rendered output at every cut boundary to catch visual jumps, audio pops, and hidden subtitles | Python helper |

The self-eval loop is the part that matters. The agent does not hand you a first draft and call it done. It renders, inspects the output at every cut boundary, and if something is wrong — a visual jump, an audio pop, a subtitle that bleeds — it fixes and re-renders. Maximum three iterations. You see the preview only after it passes.
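The control flow of that loop can be sketched in a few lines of Python. This is an illustration only, not the skill's actual code: render, inspect, and fix are hypothetical callables standing in for the ffmpeg render, the timeline_view checks, and the EDL correction.

```python
from typing import Callable, List, Tuple

MAX_ITERATIONS = 3  # the cap described in the article


def render_until_clean(
    render: Callable[[], str],
    inspect: Callable[[str], List[str]],
    fix: Callable[[List[str]], None],
) -> Tuple[str, List[str]]:
    """Render, inspect, and fix until clean or the iteration cap is hit.

    Returns the last rendered path and any issues that are still open.
    All three callables are hypothetical placeholders.
    """
    output, issues = "", []
    for attempt in range(1, MAX_ITERATIONS + 1):
        output = render()
        issues = inspect(output)
        if not issues:
            break
        if attempt < MAX_ITERATIONS:
            fix(issues)  # only fix when another render will follow
    return output, issues
```

Returning the open issues alongside the path lets the caller decide what to do if the third render still is not clean.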

Why transcript-driven, not frame-dumping

A naive approach to AI video editing would dump every frame and feed it to the LLM. A 10-minute video at 30fps is 18,000 frames. At ~1,500 tokens per frame, that is 27 million tokens of noise. Expensive, slow, and unreliable.
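The arithmetic is easy to reproduce. The snippet below uses the article's own estimates for frame rate and per-frame cost; the bytes-per-token ratio used for the ~12KB packed transcript (described next) is an assumption for a rough comparison, not a measured value.

```python
# Back-of-envelope token budget; the inputs are the article's estimates.
MINUTES = 10
FPS = 30
TOKENS_PER_FRAME = 1_500  # rough per-frame cost quoted above

frames = MINUTES * 60 * FPS
frame_tokens = frames * TOKENS_PER_FRAME

# Assumption: roughly 4 bytes of English text per token.
PACKED_BYTES = 12 * 1024
BYTES_PER_TOKEN = 4
transcript_tokens = PACKED_BYTES // BYTES_PER_TOKEN

print(f"{frames:,} frames -> {frame_tokens:,} tokens")
print(f"packed transcript -> about {transcript_tokens:,} tokens")
```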

video-use does the opposite. The LLM never watches the video. It reads it — through two layers:

  1. Audio transcript (always loaded). One ElevenLabs Scribe call per source gives word-level timestamps, speaker diarization, and audio events such as (laughter), (applause), and (sigh). All takes pack into a single ~12KB takes_packed.md, which is the LLM's primary reading view.

  2. Visual composite (on demand). timeline_view produces a filmstrip + waveform + word labels PNG for any time range. Called only at decision points — ambiguous pauses, retake comparisons, cut-point sanity checks.

The result: 12KB of text plus a handful of PNGs instead of millions of tokens. The same principle that browser-use applied to web automation (structured DOM instead of screenshots), applied to video.
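To make the packing idea concrete, here is a minimal sketch that turns word-level timestamps into compact phrase lines. The word record shape (text, start, end) and the output line format are assumptions for illustration; the real takes_packed.md layout may differ.

```python
from typing import Dict, List


def pack_words(words: List[Dict], max_gap: float = 0.5) -> List[str]:
    """Group timed words into phrase lines, splitting on silences.

    Each word is assumed to look like {"text": str, "start": float, "end": float}.
    """
    lines: List[str] = []
    phrase: List[Dict] = []
    for word in words:
        # A pause longer than max_gap closes the current phrase.
        if phrase and word["start"] - phrase[-1]["end"] > max_gap:
            lines.append(_format(phrase))
            phrase = []
        phrase.append(word)
    if phrase:
        lines.append(_format(phrase))
    return lines


def _format(phrase: List[Dict]) -> str:
    """Render one phrase as '[start-end] text' (hypothetical format)."""
    text = " ".join(w["text"] for w in phrase)
    return f"[{phrase[0]['start']:.2f}-{phrase[-1]['end']:.2f}] {text}"
```

One line per phrase keeps the timestamps the agent needs for cut points while dropping the per-word overhead.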

Setup: install the video-use skill in OpenClaw

Total time: under 10 minutes. You need an OpenClaw workspace, ffmpeg, and an ElevenLabs API key.

Step 1 — Clone and link the skill

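git clone https://github.com/browser-use/video-use ~/skills/video-use
# Create the skills directory first; ln fails if the parent is missing
mkdir -p ~/.openclaw/workspace/.agents/skills
ln -sfn ~/skills/video-use ~/.openclaw/workspace/.agents/skills/video-use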

If your OpenClaw skills directory lives elsewhere, change the link path (the second argument to ln), not the clone location. What matters is that OpenClaw can find the SKILL.md file inside the video-use directory.
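A quick way to confirm the link resolves is to look for SKILL.md through it. This is a small sketch, assuming the default link path used above; it says nothing about how OpenClaw itself performs discovery.

```python
from pathlib import Path


def skill_is_discoverable(skill_dir: Path) -> bool:
    """True if skill_dir (or the directory it links to) contains SKILL.md."""
    return (skill_dir / "SKILL.md").is_file()


if __name__ == "__main__":
    # Assumed default location; adjust if your skills directory differs.
    link = Path.home() / ".openclaw/workspace/.agents/skills/video-use"
    print("video-use discoverable:", skill_is_discoverable(link))
```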

Step 2 — Install dependencies

cd ~/skills/video-use
uv sync          # or: pip install -e .

You also need ffmpeg on the system:

# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

Optional but useful: yt-dlp for downloading source material from YouTube or other platforms.

brew install yt-dlp    # macOS
pip install yt-dlp     # Linux

Step 3 — Add your ElevenLabs API key

cp ~/skills/video-use/.env.example ~/skills/video-use/.env

Edit .env:

ELEVENLABS_API_KEY=your_key_here

Get a key at elevenlabs.io/app/settings/api-keys. The free tier includes transcription minutes; for regular use, the Creator plan ($22/month) covers most editing workflows.
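Before starting a session, it can help to check that the placeholder was actually replaced. The sketch below parses a simple KEY=value file; it is not a full dotenv parser, and the helper names are made up for this example.

```python
from pathlib import Path


def read_env_value(env_path: Path, name: str) -> str:
    """Return the value of NAME=value from a simple .env file, or ""."""
    if not env_path.is_file():
        return ""
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip("'\"")  # drop optional quotes
    return ""


def key_looks_set(env_path: Path) -> bool:
    """True if the key exists and is not the template placeholder."""
    value = read_env_value(env_path, "ELEVENLABS_API_KEY")
    return bool(value) and value != "your_key_here"
```

Calling key_looks_set(Path.home() / "skills/video-use/.env") returns False until a real key is in place.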

Step 4 — Verify the skill is loaded

Start an OpenClaw session and check that the skill appears:

openclaw skills check

You should see video-use listed with a green check. If not, verify the symlink and restart OpenClaw.

The quick setup prompt

If you want the agent to handle the install for you, paste this into an OpenClaw session:

Set up https://github.com/browser-use/video-use for me.

Read install.md first to install this repo, wire up ffmpeg, register the skill with whichever agent you're running under, and set up the ElevenLabs API key; ask me to paste it when you need it. Then read SKILL.md for daily usage, and always read helpers/ because that's where the editing scripts live. After install, don't transcribe anything on your own; just tell me it's ready and wait for me to drop footage into a folder.

The agent will clone the repo, install deps, prompt you once for the ElevenLabs key, and confirm when ready.

Workflow: from raw footage to final.mp4

Here is a real session. We will edit a 5-minute talking-head video — the kind you record for a product update or a LinkedIn post.

Step 1 — Drop your footage

Put your raw recordings in a folder:

mkdir -p ~/videos/product-update
cp ~/Downloads/take_01.mp4 ~/Downloads/take_02.mp4 ~/videos/product-update/

The skill works with any video format ffmpeg can read: MP4, MOV, MKV, WebM. Multiple takes are fine — the agent will identify the best segments across all sources.
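The inventory step amounts to listing the video files in the folder. A minimal sketch, assuming sources are matched by file extension (the skill may well probe files with ffmpeg instead):

```python
from pathlib import Path
from typing import List

# Formats named in the article; ffmpeg reads many more.
VIDEO_SUFFIXES = {".mp4", ".mov", ".mkv", ".webm"}


def find_sources(folder: Path) -> List[Path]:
    """Return video files directly inside folder, sorted by name."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in VIDEO_SUFFIXES
    )
```

Only the top level is scanned, so renders written to a subdirectory such as edit/ are not picked up as new sources.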

Step 2 — Start the session

cd ~/videos/product-update
openclaw

In the session:

edit these into a launch video

That is it. The agent takes over from here.

Step 3 — What the agent does

First, it inventories the sources:

Found 2 sources:
  - take_01.mp4 (4:32, 1920×1080, 30fps)
  - take_02.mp4 (2:48, 1920×1080, 30fps)

Transcribing take_01.mp4...
Transcribing take_02.mp4...

Then it proposes a strategy:

Strategy proposal:
  - Remove filler words (um, uh, false starts): ~23 instances
  - Cut dead air between takes: ~40 seconds total
  - Use take_02 for the opening hook (stronger energy)
  - Merge take_01 middle section (clearer explanation)
  - Close with take_02 ending (cleaner CTA)
  - Color grade: warm cinematic
  - Subtitles: 2-word UPPERCASE, white on semi-transparent black
  - Estimated final length: ~5:30

Approve? (y/n)

You approve. The agent generates the EDL, renders with ffmpeg, and self-evaluates.
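For readers who have not worked with an Edit Decision List, the sketch below shows the general idea: an ordered list of source ranges plus per-segment settings. The field names and timestamps are hypothetical; the schema of the skill's real edl.json is not documented here and may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Segment:
    """One kept range of a source file (hypothetical schema)."""
    source: str                    # file the segment is cut from
    start: float                   # in-point in the source, seconds
    end: float                     # out-point in the source, seconds
    grade: str = "warm cinematic"  # look applied to this segment


def total_duration(segments: List[Segment]) -> float:
    """Length of the assembled video, ignoring transitions."""
    return sum(s.end - s.start for s in segments)


# Made-up cut points that follow the strategy above.
edl = [
    Segment("take_02.mp4", 3.2, 41.0),    # opening hook
    Segment("take_01.mp4", 58.5, 190.0),  # middle section
    Segment("take_02.mp4", 120.0, 165.5), # closing CTA
]
print(json.dumps([asdict(s) for s in edl], indent=2))
print(f"estimated length: {total_duration(edl):.1f}s")
```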

Step 4 — Self-evaluation

The agent runs timeline_view on the rendered output at every cut boundary. It checks for:

  • Visual jumps — a hard cut where the speaker's position shifts noticeably
  • Audio pops — a cut that lands mid-syllable or on a plosive
  • Subtitle overflow — text that bleeds past the cut point
  • Color mismatches — adjacent segments with noticeably different grades

If any check fails, the agent fixes the problem and re-renders, up to the three-iteration limit. You see the preview only after the checks pass.
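Checking "every cut boundary" means converting segment lengths into timestamps on the output timeline. A minimal sketch of that bookkeeping follows; the one-second padding is an assumed value, and the function names are illustrative rather than part of the skill.

```python
from typing import List, Tuple


def cut_boundaries(durations: List[float]) -> List[float]:
    """Output-timeline timestamps where one segment ends and the next begins."""
    boundaries: List[float] = []
    elapsed = 0.0
    for duration in durations[:-1]:  # no cut after the final segment
        elapsed += duration
        boundaries.append(elapsed)
    return boundaries


def inspection_windows(
    durations: List[float], pad: float = 1.0
) -> List[Tuple[float, float]]:
    """A (start, end) window around each cut, clamped to the video length."""
    total = sum(durations)
    return [
        (max(0.0, t - pad), min(total, t + pad))
        for t in cut_boundaries(durations)
    ]
```

Each window is a time range that a tool like timeline_view could then be pointed at.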

Step 5 — Output

The final video lands in the edit/ directory next to your sources:

~/videos/product-update/
  ├── take_01.mp4
  ├── take_02.mp4
  └── edit/
      ├── final.mp4          ← your finished video
      ├── takes_packed.md    ← transcript (for reference)
      ├── edl.json           ← edit decision list
      └── timeline/          ← visual composites from self-eval

The skill directory stays clean. All outputs live in <videos_dir>/edit/.

Tuning the output

The default settings produce a clean, publishable video. But video-use is not a black box — every aspect is configurable through the session prompt.

Color grading

Three built-in looks, or any custom ffmpeg chain:

edit these into a launch video, color grade: neutral punch

Options:

  • warm cinematic — warmer tones, slightly crushed blacks, gentle highlight roll-off (default)
  • neutral punch — accurate colors, higher contrast, crisp highlights
  • custom — pass any ffmpeg filter chain: color grade: eq=brightness=0.05:contrast=1.1:saturation=1.2,hue=h=2
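To preview a custom chain outside the agent, you can hand the same filter string to ffmpeg's -vf option yourself. The helper below is a sketch, not how the skill invokes ffmpeg; file names are placeholders.

```python
from typing import List


def grade_command(src: str, dst: str, chain: str) -> List[str]:
    """Build an ffmpeg argv that applies a video filter chain and copies audio."""
    return ["ffmpeg", "-y", "-i", src, "-vf", chain, "-c:a", "copy", dst]


custom = "eq=brightness=0.05:contrast=1.1:saturation=1.2,hue=h=2"
cmd = grade_command("take_01.mp4", "graded.mp4", custom)
print(" ".join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the commas and equals signs in the filter chain.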

Subtitles

Default: 2-word UPPERCASE chunks, white text on a semi-transparent black bar, centered bottom-third. To change:

edit these into a tutorial, subtitles: 3-word lowercase, position: bottom-center, font: Inter, size: 48

The subtitle system uses ffmpeg's subtitles filter with ASS formatting. Any ASS-style parameter works.
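The "N-word chunk" styles come down to grouping timed words. Here is a minimal sketch of that grouping, assuming word records shaped like {"text", "start", "end"}; writing the chunks out as ASS events is left aside.

```python
from typing import Dict, List


def chunk_words(words: List[Dict], size: int = 2, upper: bool = True) -> List[Dict]:
    """Group timed words into fixed-size subtitle chunks.

    Each chunk spans from its first word's start to its last word's end.
    """
    chunks: List[Dict] = []
    for i in range(0, len(words), size):
        group = words[i:i + size]
        text = " ".join(w["text"] for w in group)
        chunks.append({
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": text.upper() if upper else text.lower(),
        })
    return chunks
```

The defaults mirror the 2-word UPPERCASE style; chunk_words(words, size=3, upper=False) corresponds to the 3-word lowercase prompt above.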

Animation overlays

This is where video-use gets interesting for product videos and content marketing. The skill can spawn parallel sub-agents to generate animation overlays — one per animation — using four backends:

| Backend | Best for | Output |
|---|---|---|
| HyperFrames | Animated text, lower-thirds, callouts, branded intros | HTML/CSS/JS → PNG sequence or MP4 |
| Remotion | Data-driven visualizations, charts, countdown timers | React → MP4 |
| Manim | Mathematical animations, diagrams, step-by-step explanations | Python → MP4 |
| PIL | Static overlays, watermarks, simple text cards | Python → PNG |

Example prompt:

edit these into a product demo
- add a lower-third with my name "Max Jafar" and title "Head of Growth" at 0:15
- add an animated callout "100k users" at 2:30 using hyperframes
- add a countdown timer overlay for the last 10 seconds

The agent spawns one sub-agent per animation. Each sub-agent generates its overlay independently, then the main agent composites them onto the video during the render stage.
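The fan-out pattern is the familiar one-job-per-item parallel map. The sketch below uses a thread pool as a stand-in; render_overlay is a placeholder for a sub-agent, and the real skill spawns agents rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def render_overlay(spec: Dict) -> str:
    """Placeholder for one sub-agent; returns the file it would produce."""
    return f"overlay_{spec['name']}.mp4"


def render_all(specs: List[Dict]) -> List[str]:
    """Run one job per overlay in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(specs))) as pool:
        return list(pool.map(render_overlay, specs))
```

Because the results come back in input order, the compositing step can pair each overlay with its timestamp without extra bookkeeping.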

Session memory

video-use persists a project.md file in the edit/ directory. This file remembers:

  • The strategy you approved
  • The color grade and subtitle settings
  • Which takes were used and which were discarded
  • Any custom instructions you gave

Next time you drop footage in the same folder, the agent picks up where it left off:

edit these new takes, same style as last time

No re-explaining. No re-configuring.

Real-world use cases

Talking head / LinkedIn video

Raw: 3 takes of a 60-second LinkedIn thought leadership clip. Agent removes filler words, picks the strongest take for each segment, adds a branded lower-third via HyperFrames, burns subtitles, color grades warm cinematic. Output: 55 seconds, publish-ready.

Tutorial / course material

Raw: 4 screen recordings + 1 webcam track. Agent cuts dead air between steps, merges the best explanations across takes, adds Manim animations for key concepts (e.g., architecture diagrams that build step-by-step), burns step-numbered subtitles. Output: 12-minute lesson, ready for a course platform.

Social media clip

Raw: 1 longer interview. Agent identifies the most quotable 45-second segment, crops to 9:16 vertical, adds animated captions with HyperFrames, applies a punchy color grade. Output: TikTok/Reels-ready clip.
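The 9:16 crop is simple geometry. A minimal sketch, assuming a landscape source and a centered window (a real edit might track the speaker instead), that produces an ffmpeg crop filter string in the form crop=w:h:x:y:

```python
def vertical_crop(width: int, height: int) -> str:
    """ffmpeg crop filter for a centered 9:16 window at full height."""
    crop_w = int(height * 9 / 16)
    crop_w -= crop_w % 2          # keep the width even for common encoders
    x = (width - crop_w) // 2     # center the window horizontally
    return f"crop={crop_w}:{height}:{x}:0"


print(vertical_crop(1920, 1080))  # crop=606:1080:657:0
```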

Product demo

Raw: 2 screen recordings + voiceover. Agent synchronizes the voiceover to the screen actions, cuts redundant clicks, adds a callout animation at the key feature reveal, burns subtitles for accessibility. Output: 90-second demo, ready for a landing page.

How video-use compares to traditional editors

| Aspect | video-use + OpenClaw | CapCut / Premiere / DaVinci |
|---|---|---|
| Cuts | Agent identifies filler words, dead air, and retakes automatically from transcript | Manual review, mark in/out points |
| Color grading | One prompt, applied uniformly or per-segment | Manual per-clip grading or LUT application |
| Subtitles | Auto-generated from transcript, burned in, style via prompt | Auto-generate then manually fix and restyle |
| Animations | Sub-agents generate overlays in parallel | Manual creation in After Effects or similar |
| Iteration | Self-evaluates and fixes before showing you | You review, fix, re-render, repeat |
| Batch editing | Point at a folder, walk away | One project at a time |
| Creative control | 12 hard rules + artistic freedom; you guide via prompt | Full manual control |
| Speed | ~3-5 minutes for a 10-minute video (after transcription) | 1-3 hours for the same |

The tradeoff: traditional editors give you frame-level control. video-use gives you speed and consistency. For most content — talking heads, tutorials, social clips, product demos — the agent's cuts are as good as a human editor's first pass. You trade the last 10% of polish for 90% of the time saved.

Requirements and pricing

| Component | Requirement | Cost |
|---|---|---|
| OpenClaw | Running agent with shell access | Free (self-hosted) or GolemWorkers plan |
| video-use skill | Cloned and linked | Free (MIT license) |
| ffmpeg | System binary | Free |
| ElevenLabs API | Scribe transcription | Free tier (10 min/month), Creator $22/month (1,000 min) |
| HyperFrames (optional) | For animation overlays | Free (open source) |
| Remotion (optional) | For data-driven animations | Free (dev), $50/mo (company) |
| Manim (optional) | For math/diagram animations | Free (open source) |

The only hard cost is ElevenLabs Scribe. Everything else is free or optional.

FAQ

Do I need to watch the video before editing?

No. The agent reads the transcript, not the video. You describe what you want ("edit into a launch video", "cut filler words and add subtitles"), approve the strategy, and the agent handles the rest. You review the final output.

Can I use video-use without OpenClaw?

Yes. video-use works with Claude Code, Codex, and any agent with shell access. OpenClaw adds the benefit of running as an always-on agent — you can trigger edits from Telegram, Slack, or a cron schedule without opening a terminal.

How accurate are the AI cuts?

The transcript-driven approach means cut precision is at the word level — typically within 30ms. The 30ms audio fades at every cut eliminate pops. The self-eval loop catches visual issues that transcript-only editing would miss (e.g., a cut where the speaker's head position jumps).
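The fade at each cut can be expressed with ffmpeg's afade filter, which takes a type (t), start time (st), and duration (d). The helper below is a sketch of how such a filter string might be built for one segment; it is an assumption about the approach, not the skill's actual render code.

```python
FADE = 0.03  # 30 ms, as described above


def fade_filter(duration: float, fade: float = FADE) -> str:
    """afade in at the segment start and out at its end."""
    out_start = max(0.0, duration - fade)
    return (
        f"afade=t=in:st=0:d={fade:g},"
        f"afade=t=out:st={out_start:.3f}:d={fade:g}"
    )


print(fade_filter(4.5))
# afade=t=in:st=0:d=0.03,afade=t=out:st=4.470:d=0.03
```

Fading each segment in and out means two adjacent segments never meet at full amplitude, which is what suppresses the click at a hard cut.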

Can I edit 4K footage?

Yes. ffmpeg handles 4K natively. Transcription time scales with audio length, not resolution. Render time scales with resolution and effects — a 10-minute 4K video with color grading and subtitles takes ~5-8 minutes on a modern machine.

What if I do not like the agent's cut strategy?

Reject it. The agent proposes a strategy and waits for your approval before executing. You can also give specific instructions: "use only take_02", "cut everything before the word 'welcome'", "keep all the bloopers in a separate outtakes file."

Does video-use work with multiple speakers?

Yes. ElevenLabs Scribe provides speaker diarization. The agent can cut to a specific speaker, balance audio levels between speakers, and label each speaker in the subtitles.

Common pitfalls

  • Skipping the strategy approval step — The agent always proposes before executing. If you auto-approve without reading, you may get cuts you did not want. Take 30 seconds to review the proposal.
  • Using low-quality source audio — Scribe is good, but garbage in means garbage out. Record in a quiet room with a decent microphone. Background music is fine; background HVAC noise degrades transcription quality.
  • Expecting Hollywood-level color grading — The built-in looks are clean and professional. They are not a substitute for a colorist. For most content marketing, they are more than enough.
  • Forgetting to install yt-dlp — If you want to pull source material from YouTube or Vimeo, you need yt-dlp. It is optional but commonly needed.
  • Not using session memory — The project.md file is there for a reason. If you edit the same type of video regularly (weekly podcast, daily social clip), let the agent remember your settings instead of re-specifying them every time.
