2026-06-18

Automate YouTube with OpenClaw: 5-Stage Agent Pipeline (2026)

Automate YouTube with OpenClaw: a 5-stage agent pipeline (research, script, voice, thumbnail, upload) with prompts, tools, and real output examples.

YouTube is the most "agent-shaped" content workflow on the internet. Every step — research, script, voice, thumbnail, upload — is a discrete task with a clear input, a clear output, and a clear handoff to the next step. The whole pipeline is a textbook multi-agent system, and in 2026 it is the single cleanest way to show what OpenClaw is for.

This guide walks through the full five-stage pipeline on OpenClaw, from "what should I make a video about?" to "video is live on the channel", with the actual prompts, the actual tools, and the actual cost. Every stage is one agent. The pipeline is one workflow. By the end, you can run it end-to-end, hands-off, on a schedule.

This article is for the creator who has more ideas than hours, the operator who runs a faceless channel, the marketer who wants a content engine instead of a content calendar, and the engineer who would rather wire the pipeline once than write the next 200 videos by hand.

Why YouTube is the cleanest demo for OpenClaw

OpenClaw is, at its core, a runtime for AI agents. A great demo of any runtime is a workflow where many agents, each good at one thing, hand work to each other. YouTube has five such steps. None of them depend on each other in a way that forces a single monolithic agent. Each one has a real, off-the-shelf tool behind it:

Stage	Job	Tool (off-the-shelf)
1. Research	Find a topic, prove demand, write a brief	Serper, Perplexity, YouTube Data API
2. Script	Turn brief into a tight, hook-driven script	Claude, GPT, Gemini
3. Voice	Render the script as a natural-sounding voiceover	ElevenLabs, MiniMax, OpenAI TTS
4. Thumbnail	Generate a click-worthy still image	Flux, nano-banana, GPT-Image
5. Upload	Publish to YouTube with metadata	YouTube Data API v3

Five stages, five tools, one workflow. OpenClaw sits in the middle: it owns the schedule, the state, the secret keys, the budget, the retry logic, and the handoffs. The agents do the work. OpenClaw makes them do the work together.

This is the part that most "AI YouTube automation" content misses. The typical guide shows you a single GPT prompt that writes a script. It does not show you the pipeline that takes the script all the way to a published video, with metadata, on schedule, every week. We will.

The architecture in one diagram


                        ┌────────────────┐
                        │   OpenClaw     │
                        │   Workflow     │
                        │   (cron: Mon)  │
                        └──────┬─────────┘
                               │
        ┌───────────┬──────────┼──────────┬───────────┐
        ▼           ▼          ▼          ▼           ▼
   ┌────────┐  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
   │Research│  │ Script │ │ Voice  │ │Thumbnail│ │Upload │
   │ Agent  │─▶│ Agent  │▶│ Agent  │▶│ Agent  │▶│ Agent │
   └────┬───┘  └────┬───┘ └────┬───┘ └────┬───┘ └────┬───┘
        │           │          │          │           │
   Serper,    Claude/    ElevenLabs,   Flux,      YouTube
   YouTube,   GPT        MiniMax       Nano-      Data API
   Perplexity             TTS          banana

Each box is one agent with its own prompt, its own model, its own secret keys, and its own budget. OpenClaw wires them. The arrows are handoffs: each agent's output becomes the next agent's input, in a shared state.json that lives in the workflow's workspace.

Setup: what you need before you start

You do not need much. About twenty minutes of setup, then you can run the pipeline end-to-end.

OpenClaw. Hosted or self-hosted. The hosted version at golemworkers.com is the fastest path; the self-hosted image is in the OpenClaw docs. You need a Gateway running, a default agent configured, and the dashboard accessible.

Five API keys. Each agent needs a credential. Store them in OpenClaw's secret store, not in the prompts:

SERPER_API_KEY (or GOLEMWORKERS_SERPER_PROXY_BASE_URL) for the research agent
ANTHROPIC_API_KEY or OPENAI_API_KEY for the script agent
ELEVENLABS_API_KEY (or GOLEMWORKERS_FAL_PROXY_BASE_URL for TTS) for the voice agent
FAL_KEY for the thumbnail agent
YOUTUBE_REFRESH_TOKEN and YOUTUBE_CLIENT_ID/SECRET for the upload agent
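Before the first run it can help to check that every stage has a credential. The sketch below is a hypothetical preflight check, not part of OpenClaw: it assumes the secrets are exposed to the process as environment variables, that either name in an "or" pair is enough for its stage, and that "YOUTUBE_CLIENT_ID/SECRET" expands to YOUTUBE_CLIENT_ID and YOUTUBE_CLIENT_SECRET.

```python
import os
from typing import Mapping

# Secret names come from the list above. Treating each tuple as
# "any one of these is enough" is an assumption of this sketch, and
# YOUTUBE_CLIENT_SECRET is an assumed expansion of "YOUTUBE_CLIENT_ID/SECRET".
REQUIRED_SECRETS = {
    "research": [("SERPER_API_KEY", "GOLEMWORKERS_SERPER_PROXY_BASE_URL")],
    "script": [("ANTHROPIC_API_KEY", "OPENAI_API_KEY")],
    "voice": [("ELEVENLABS_API_KEY", "GOLEMWORKERS_FAL_PROXY_BASE_URL")],
    "thumbnail": [("FAL_KEY",)],
    "upload": [
        ("YOUTUBE_REFRESH_TOKEN",),
        ("YOUTUBE_CLIENT_ID",),
        ("YOUTUBE_CLIENT_SECRET",),
    ],
}


def missing_secrets(env: Mapping[str, str]) -> dict[str, list[str]]:
    """Return, per stage, the secret groups for which no variable is set."""
    missing: dict[str, list[str]] = {}
    for stage, groups in REQUIRED_SECRETS.items():
        for group in groups:
            if not any(env.get(name) for name in group):
                missing.setdefault(stage, []).append(" or ".join(group))
    return missing


if __name__ == "__main__":
    for stage, names in missing_secrets(os.environ).items():
        print(f"{stage}: missing {', '.join(names)}")
```

An empty result means all five stages have what they need; anything else tells you which stage would fail before you spend a run finding out.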

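One YouTube channel. Set up the YouTube Data API OAuth flow once. The refresh token normally lasts until you revoke it; the exception to plan for is an OAuth app left in "Testing" publishing status, whose refresh tokens Google expires after 7 days. The agent uses the token to upload on your behalf.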

A shared workspace folder. OpenClaw writes intermediate artifacts to disk so the agents can hand off files. The default ~/.openclaw/workspaces/youtube-pipeline/ is fine. Each run gets a timestamped subfolder.
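To make the "timestamped subfolder per run" idea concrete, here is a minimal sketch of how such a folder could be created. The function name new_run_dir and the UTC folder-name format are assumptions for illustration; OpenClaw's own naming scheme may differ.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def new_run_dir(base: Path, now: Optional[datetime] = None) -> Path:
    """Create and return a timestamped subfolder for one pipeline run."""
    now = now or datetime.now(timezone.utc)
    # Hypothetical name format, e.g. 20260618T090000Z; sorts chronologically.
    run_dir = base / now.strftime("%Y%m%dT%H%M%SZ")
    # exist_ok=False: two runs in the same second should fail loudly
    # instead of silently sharing artifacts.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


if __name__ == "__main__":
    base = Path.home() / ".openclaw" / "workspaces" / "youtube-pipeline"
    print(new_run_dir(base))
```

Keeping one folder per run means a failed or rejected video never overwrites the artifacts of the previous one.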

Once the five keys are in place and the YouTube OAuth is done, you have a five-agent pipeline that produces a video per run. Let us walk through each stage.

Stage 1: Research — what should the video be about?

The research agent's job is to take a seed (a niche, a channel, a topic) and produce a brief: a one-page document that says what the video is, who it is for, what angle makes it different, what the working title is, and what the three best references are.

Inputs:

seed_topic (e.g. "AI agents for small business")
target_audience (e.g. "operations leads at 10-50 person SaaS companies")
target_length_seconds (e.g. 480 for an 8-minute long-form video)

Agent prompt:

You are a YouTube research analyst. Given a seed topic, find the 5 best-performing related videos from the last 90 days, identify the angle each one used, and propose 3 differentiated angles for a new video. Output a JSON brief with: working_title (max 60 chars), angle (one sentence), hook (first 15 seconds, written), outline (3-5 bullets), references (list of 3-5 video URLs with one-line notes on what they did well), search_terms (5 YouTube search terms that a viewer would type to find this), and thumbnail_concept (a 2-sentence visual idea for the thumbnail).

Tools the agent uses:

web_search (Serper.dev) to find related videos
web_fetch to read the description and the top comments of each
Optional: youtube_data_api.search.list for hard numbers (views, age, channel authority)

Output: state.brief.json in the workspace, plus a human-readable brief.md for review.

The handoff: the brief is the contract between stage 1 and stage 2. The script agent reads state.brief.json and never has to know how the research was done. This is the cleanest part of the pipeline: research output is a single JSON file, and every downstream stage treats it as read-only input.

A real brief, lightly redacted:


{
  "working_title": "Why Your AI Agent Forgets Everything (And How To Fix It)",
  "angle": "Memory is the silent failure mode of every AI agent; here are 3 patterns that work.",
  "hook": "Every AI agent you have ever used has amnesia. By the time it gets to step 4, it has forgotten what step 1 said. Today I am going to show you the three memory patterns that fix this — and the one that almost always makes it worse.",
  "outline": [
    "Why context windows are not enough",
    "The 3 memory patterns that actually work",
    "The anti-pattern: stuffing everything into the prompt"
  ],
  "references": [
    "https://youtube.com/watch?v=… (top result for 'AI memory'; 410k views; did the hook well but skipped the anti-pattern)",
    "https://youtube.com/watch?v=… (engineering deep dive; too long, loses viewers at 6 min)"
  ],
  "thumbnail_concept": "A robot looking at its own hand, which is full of scribbled notes that are smudging. Bold text: 'AGENT AMNESIA'.",
  "search_terms": ["ai agent memory", "llm memory patterns", "context window vs memory"]
}
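Because the brief is the contract between stages, it is worth checking it before the script agent runs. The following is a minimal sketch of such a check, assuming the field names from the research prompt; validate_brief is a hypothetical helper, not an OpenClaw function, and it only enforces the two limits the prompt states explicitly (title length and outline size).

```python
import json

# Keys requested by the research prompt above.
REQUIRED_KEYS = {
    "working_title", "angle", "hook", "outline",
    "references", "search_terms", "thumbnail_concept",
}


def validate_brief(brief: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    missing = sorted(REQUIRED_KEYS - brief.keys())
    if missing:
        problems.append(f"missing keys: {', '.join(missing)}")
    title = brief.get("working_title", "")
    if len(title) > 60:
        problems.append(f"working_title is {len(title)} chars (max 60)")
    outline = brief.get("outline", [])
    if not 3 <= len(outline) <= 5:
        problems.append(f"outline has {len(outline)} bullets (expected 3-5)")
    return problems


if __name__ == "__main__":
    with open("state.brief.json", encoding="utf-8") as fh:
        for problem in validate_brief(json.load(fh)):
            print("brief problem:", problem)
```

Failing fast here is cheaper than discovering a missing hook after the voiceover has been rendered.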

Cost: roughly 1 cent per brief (Serper + Claude Sonnet). At one video per week, that is 50 cents a year on research.

Stage 2: Script — turn the brief into a video

The script agent's job is to read the brief and produce a tight script. The shape is fixed: hook, intro, body, conclusion, CTA. The length is set by target_length_seconds. The tone is whatever the brief says (casual, technical, sales-y, founder-of-a-startup).

Agent prompt:

You are a YouTube scriptwriter. Read the brief in state.brief.json and write a script of approximately target_length_seconds spoken at 155 words per minute. The script has 5 sections, in this order:

1. Hook (15-20s, ~40-50 words). Repeat or paraphrase the hook from the brief. First sentence must earn the next sentence. No intro fluff. No "hey guys welcome back".
2. Intro (30-45s, ~80-110 words). State the problem in concrete terms. One specific example. Why now.
3. Body (60-70% of total length). Three to five sections, each 60-120 seconds, with a clear headline and a one-line setup that tells the viewer what they will get out of this section.
4. Conclusion (20-30s, ~50-70 words). Restate the takeaway. Plant the seed for the next video.
5. CTA (10-15s, ~25-35 words). Subscribe / next video / link in description. Do not beg.

Output as a single markdown file script.md with the section headings preserved. After the script, append a voiceover_notes block with pronunciation guides for any non-English words or unusual names. Do not write any production notes — those are for the human editor.

Tools the agent uses:

The brief file from stage 1
The local file system (to write script.md)
Optionally, the web_fetch tool, if the script needs to pull in a specific source

Output: script.md in the workspace, plus a word_count and an estimated_duration_seconds so the orchestrator can verify the script is on length.
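The length check the orchestrator performs is simple arithmetic: at 155 words per minute, a 480-second target is 480 × 155 / 60 = 1,240 words. A minimal sketch of that check follows; the function names and the 10% tolerance are assumptions for illustration, and headings are skipped on the assumption that they are not spoken.

```python
import re

WORDS_PER_MINUTE = 155  # pace stated in the script prompt


def target_word_count(target_length_seconds: int) -> int:
    """Words needed to fill the target duration at the stated pace."""
    return round(target_length_seconds * WORDS_PER_MINUTE / 60)


def estimated_duration_seconds(script_text: str) -> int:
    """Estimate spoken duration, ignoring markdown heading lines."""
    words = 0
    for line in script_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # section headings are assumed not to be read aloud
        words += len(re.findall(r"\S+", line))
    return round(words * 60 / WORDS_PER_MINUTE)


def is_on_length(script_text: str, target_length_seconds: int,
                 tolerance: float = 0.10) -> bool:
    """True if the estimate is within the (assumed) 10% tolerance."""
    estimate = estimated_duration_seconds(script_text)
    return abs(estimate - target_length_seconds) <= tolerance * target_length_seconds


if __name__ == "__main__":
    print(target_word_count(480))  # 1240
```

If is_on_length returns False, the cheapest fix is to re-run stage 2 with the measured word count in the prompt, before any audio is rendered.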

The handoff: stage 3 reads script.md directly. Apart from its five section headings, the script is pure text: no inline formatting, no timestamps. The voice agent adds the timestamps and the breath marks.

What the script agent is good at:

Holding the structure. Five sections, fixed shape, every time.
Hitting the length. 155 words per minute is a tested pace for a clear English voiceover.
Producing clean dialogue. No stage directions, no parentheticals, no "(smiles)" — those are the voice agent's job to interpret, or to ignore.

What the script agent is not good at:

Knowing what is on trend this week. That is stage 1's job. The script trusts the brief.
Knowing the channel's voice. The voice comes from examples in the brief; if you do not give it examples, it defaults to neutral explainer.

Cost: roughly 3 cents per script (one Claude Sonnet call, ~3-4k tokens out). At one video per week, that is $1.50 a year on scripts.

Stage 3: Voice — turn the script into an audio file

The voice agent's job is to read script.md, chunk it into voice-friendly segments, render each segment through a TTS provider, concatenate, and produce a single .mp3 (or .wav) ready to drop into a video editor — or, if you are going faceless, ready to drop straight into the upload.

Agent prompt:

You are a voiceover producer. Read script.md and render it as a single audio file.

1. Split the script at section boundaries (Hook, Intro, Body, Conclusion, CTA). Each section is one audio segment.
2. For each section, choose the most natural voice for the tone. Default voice: a neutral male or female, mid-30s, conversational, not news-anchor. If the brief specifies a voice, use that.
3. Add a 250ms silence at section boundaries and a 600ms silence between major sections.
4. Use SSML or provider-specific tags to add micro-pauses (150ms) after every comma and a longer pause (300ms) at every period or question mark.
5. Apply light post-processing: a touch of compression and a -1 dB peak normalise. Do not over-process.
6. Concatenate the segments and export as voiceover.mp3 at 44.1 kHz, 192 kbps.

Save each segment as voiceover_segment_{n}.mp3 in case a human wants to re-render one section. Save the full file as voiceover.mp3.

Append a voiceover_log.md with the voice used per segment, the duration per segment, and the total duration.
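The pause rule in the prompt (150ms after commas, 300ms after periods and question marks) can be sketched with the standard SSML break element. This is an illustration only: add_ssml_pauses is a hypothetical helper, and whether a given TTS provider or model honors SSML breaks is an assumption you should verify against its documentation.

```python
import re
from xml.sax.saxutils import escape


def add_ssml_pauses(text: str, comma_ms: int = 150, stop_ms: int = 300) -> str:
    """Wrap text in <speak> and insert <break> tags after punctuation."""
    # Escape &, < and > first so the script text cannot break the markup.
    escaped = escape(text)
    # Short pause after a comma that is followed by whitespace.
    escaped = re.sub(r",(?=\s)", f',<break time="{comma_ms}ms"/>', escaped)
    # Longer pause after a period or question mark that ends a sentence.
    # Decimals such as 3.5 are left alone because no whitespace follows the dot.
    escaped = re.sub(r"([.?])(?=\s|$)", rf'\1<break time="{stop_ms}ms"/>', escaped)
    return f"<speak>{escaped}</speak>"


if __name__ == "__main__":
    print(add_ssml_pauses("Every agent has amnesia. Why, exactly?"))
```

Doing this in code rather than in the prompt keeps the pauses deterministic from run to run.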

Tools the agent uses:

The script file from stage 2
ElevenLabs TTS API (or any TTS provider with an OpenAI-compatible endpoint — including fal-ai/elevenlabs/tts/eleven-v3 via the FAL relay)
A small shell command (ffmpeg) to concatenate and post-process

Output: voiceover.mp3, per-segment files, and a voiceover_log.md with timing data.
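The concatenation step can be sketched with ffmpeg's concat demuxer. The code below only builds the list file and the command line; it does not add the silences or the compression described in the prompt, and the helper names are hypothetical. Segment paths containing single quotes would need escaping, which this sketch does not handle.

```python
import subprocess
from pathlib import Path


def write_concat_list(segments: list[Path], list_path: Path) -> None:
    """Write the file list that ffmpeg's concat demuxer reads."""
    # Relative paths are resolved against the list file's directory.
    lines = [f"file '{seg.as_posix()}'" for seg in segments]
    list_path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def concat_command(list_path: Path, output: Path) -> list[str]:
    """Build the ffmpeg call: join segments, re-encode at 44.1 kHz / 192 kbps."""
    return [
        "ffmpeg", "-y",                        # overwrite an existing output
        "-f", "concat", "-safe", "0",          # concat demuxer, allow any path
        "-i", str(list_path),
        "-ar", "44100",                        # sample rate from the prompt
        "-b:a", "192k",                        # audio bitrate from the prompt
        str(output),
    ]


if __name__ == "__main__":
    segments = sorted(Path(".").glob("voiceover_segment_*.mp3"))
    write_concat_list(segments, Path("segments.txt"))
    subprocess.run(concat_command(Path("segments.txt"), Path("voiceover.mp3")), check=True)
```

Re-encoding rather than stream-copying costs a few seconds but guarantees one consistent sample rate and bitrate even if individual segments came back from the provider in slightly different formats.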

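The handoff: stage 5 (upload) reads the script's word count and the voiceover's duration to fill in the YouTube description and the title-card timing. Stage 4 (thumbnail) needs neither; it works from the brief alone, which is why it can run in parallel with the script and voice stages.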

Provider choice, briefly:

Provider	Voice quality	Cost per 1k characters	Notes
ElevenLabs (multilingual v3)	Best in class	~$0.30 / 1k chars	28+ languages, voice cloning, natural prosody
OpenAI TTS (`gpt-4o-mini-tts`)	Very good	~$0.015 / 1k chars	Cheapest, fewer voices, less expressive
Google Cloud TTS	Good	~$0.016 / 1k chars	Wide language coverage, studio voices
Local XTTS / CosyVoice	Varies	Free (you pay the GPU)	Best for privacy-sensitive work

For most faceless channels, ElevenLabs v3 hits the best quality-to-cost ratio. For cost-sensitive daily-Shorts channels, OpenAI TTS is the right default.

Cost: roughly 30-50 cents per 8-minute voiceover on ElevenLabs. At one video per week, that is $15-25 a year on voice.

Stage 4: Thumbnail — generate the click-worthy still

The thumbnail agent's job is to take the brief's thumbnail_concept and produce three on-brand, high-contrast thumbnail images for the human to choose from.

Agent prompt:

You are a thumbnail designer. Read state.brief.json and produce 3 thumbnail images.

For each image:
- Aspect ratio 16:9, 1280x720 px
- High contrast; subject should fill 60% of the frame
- Bold sans-serif text overlay, max 4 words, max 30% of the frame area
- Text positioned to be readable on both desktop and mobile previews
- No clip-art, no stock-photo faces, no logos
- Avoid the YouTube "gradient + giant arrow" cliche

Render at 2x for sharpness (2560x1440 internal, downscale to 1280x720 for upload). Save as thumbnail_v1.png, thumbnail_v2.png, thumbnail_v3.png.

Append a thumbnail_notes.md describing each variant in one sentence (so the human editor can pick without re-opening every image).

Tools the agent uses:

The brief from stage 1
A text-to-image model: fal-ai/nano-banana, fal-ai/flux/schnell, OpenAI gpt-image-1, or Google's Imagen
Optional: an image-editor step (e.g. fal-ai/ideogram/character-remix if the concept needs a specific character)

Output: 3 PNG files at 1280x720, plus thumbnail_notes.md.
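A cheap guard before the human review step is to confirm each file really is a 1280x720 PNG. The sketch below reads the dimensions straight from the PNG header using only the standard library; png_size and is_upload_ready are hypothetical helper names, and the check covers size only, not contrast or text area.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) from the IHDR chunk of a PNG byte string."""
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR",
    # then width and height as big-endian unsigned 32-bit integers.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def is_upload_ready(data: bytes) -> bool:
    """True if the image has the 1280x720 size the prompt asks for."""
    return png_size(data) == (1280, 720)


if __name__ == "__main__":
    for name in ("thumbnail_v1.png", "thumbnail_v2.png", "thumbnail_v3.png"):
        with open(name, "rb") as fh:
            print(name, png_size(fh.read(24)))
```

A variant that fails the check is most likely the 2560x1440 internal render that was never downscaled.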

The handoff: the upload agent in stage 5 uses the human-selected thumbnail. OpenClaw does not pick the thumbnail — that is the one decision the human owns. The pipeline pauses here, sends a Telegram/Slack message to the channel owner with the 3 images and the brief, and waits for the human to type "use #2" or "regenerate with a darker background". This is the right place to keep the human in the loop. Thumbnails are the highest-leverage 1% of the video.
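The human's reply has to be turned into something the workflow can act on. Here is a minimal sketch of that parsing, assuming replies look like the two examples above ("use #2", "regenerate with a darker background"); ThumbnailDecision and parse_reply are hypothetical names, and anything unrecognised returns None so the pipeline can ask again.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThumbnailDecision:
    action: str                    # "use" or "regenerate"
    variant: Optional[int] = None  # 1-based variant number for "use"
    note: str = ""                 # free-text guidance for "regenerate"


def parse_reply(reply: str, variants: int = 3) -> Optional[ThumbnailDecision]:
    """Parse a chat reply into a decision, or None if it is not understood."""
    text = reply.strip()
    # Accept "2", "#2", "use 2" and "use #2".
    match = re.fullmatch(r"(?:use\s*)?#?\s*(\d+)", text, flags=re.IGNORECASE)
    if match:
        number = int(match.group(1))
        if 1 <= number <= variants:
            return ThumbnailDecision("use", variant=number)
        return None  # out-of-range pick: ask the human again
    # Accept "regenerate" with optional guidance after it.
    match = re.fullmatch(r"regenerate\b\s*(.*)", text, flags=re.IGNORECASE | re.DOTALL)
    if match:
        return ThumbnailDecision("regenerate", note=match.group(1).strip())
    return None


if __name__ == "__main__":
    print(parse_reply("use #2"))
    print(parse_reply("regenerate with a darker background"))
```

The note from a "regenerate" reply can be appended to the thumbnail prompt verbatim, which keeps the human's wording in control of the next attempt.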

Cost: roughly 5-10 cents per thumbnail (3 generations on nano-banana or Flux Schnell). At one video per week, that is $2.50-5 a year on thumbnails.

Stage 5: Upload — publish the video with metadata

The upload agent's job is to take the script, the voiceover, the thumbnail, and the brief, and produce a fully-published YouTube video with title, description, tags, thumbnail, and category. It writes the description, picks the tags, sets the visibility, and uploads.

Agent prompt:

You are a YouTube publisher. You have these files in the workspace:
- script.md (the script)
- voiceover.mp3 (the voiceover; you need its duration)
- state.brief.json (the brief; you need working_title, search_terms, references)
- thumbnail_selected.png (the human-picked thumbnail)

Produce and upload a YouTube video with the following fields:
- title: the working_title from the brief, with a small hook phrase appended if it fits under 60 chars
- description: a 2-paragraph summary, the references from the brief (with timestamps once you know the video length), and a 5-line "about the channel" block. Total length 150-300 words.
- tags: the search_terms from the brief, plus 5-7 high-volume variants you can derive from the topic
- category: people & blogs (default) or science & technology if the brief says so
- thumbnail: thumbnail_selected.png
- visibility: unlisted (do not publish publicly until the human confirms in the dashboard)

Run ffprobe on voiceover.mp3 to get the exact duration; the description's references should use chapter timestamps at roughly 0:00, 0:15, and at each section boundary in the script.

Save the response from the YouTube Data API to upload_log.json. The response must include the videoId, the upload status, and the watch URL.
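The chapter timestamps in the description are a running sum of section durations. A minimal sketch, assuming the per-segment durations have already been read from voiceover_log.md into (title, seconds) pairs; the helper names are hypothetical, and the silences inserted between sections are ignored here, so real boundaries may drift by a second or so.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as M:SS, or H:MM:SS past the hour."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def chapter_lines(sections: list[tuple[str, float]]) -> list[str]:
    """Turn (title, duration_seconds) pairs into description chapter lines."""
    lines = []
    start = 0.0
    for title, duration in sections:
        lines.append(f"{format_timestamp(start)} {title}")
        start += duration  # next chapter begins where this one ends
    return lines


if __name__ == "__main__":
    demo = [("Hook", 15), ("Intro", 40), ("Body", 330), ("Conclusion", 25), ("CTA", 12)]
    print("\n".join(chapter_lines(demo)))
```

With the demo durations, the first two lines come out as "0:00 Hook" and "0:15 Intro", matching the 0:00 / 0:15 pattern the prompt asks for.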

Tools the agent uses:

The brief, the script, the voiceover, the thumbnail
ffprobe to get the voiceover duration
The YouTube Data API v3 (videos.insert with uploadType=resumable)

Output: the published (unlisted) video on the channel, plus upload_log.json with the video ID and watch URL.
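For orientation, the metadata sent with videos.insert is a video resource with a snippet part and a status part. The sketch below builds that body as a plain dictionary and stops short of the HTTP call; build_video_resource is a hypothetical helper, and the category IDs (22 for People & Blogs, 28 for Science & Technology) are the commonly listed values, which you should confirm for your region with videoCategories.list.

```python
# Commonly listed category IDs; treat as an assumption and verify
# with the videoCategories.list endpoint for your region.
CATEGORY_IDS = {"people & blogs": "22", "science & technology": "28"}


def build_video_resource(brief: dict, description: str, extra_tags: list[str],
                         category: str = "people & blogs") -> dict:
    """Build the request body for videos.insert (parts: snippet, status)."""
    # Merge tags while preserving order and dropping duplicates.
    tags = list(dict.fromkeys(brief["search_terms"] + extra_tags))
    return {
        "snippet": {
            # The brief already caps the title at 60 chars; slice as a guard.
            "title": brief["working_title"][:60],
            "description": description,
            "tags": tags,
            "categoryId": CATEGORY_IDS[category],
        },
        # Unlisted until the human confirms, as the prompt requires.
        "status": {"privacyStatus": "unlisted"},
    }


if __name__ == "__main__":
    brief = {"working_title": "Why Your AI Agent Forgets Everything",
             "search_terms": ["ai agent memory", "llm memory patterns"]}
    print(build_video_resource(brief, "Summary goes here.", ["ai agents"]))
```

Keeping this step as pure data makes it easy to log the exact body next to upload_log.json when an upload is rejected.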

The handoff: the workflow ends with a Telegram message to the human: "Your video is uploaded as unlisted. Watch: [URL]. Type publish to make it public, or regenerate to start over."

Cost: zero (YouTube Data API is free for the upload quotas most channels will ever use).

The whole pipeline, end-to-end

Let us put all five agents in one OpenClaw workflow so it runs on a schedule.


# /root/.openclaw/workflows/youtube-pipeline.yaml
name: youtube-weekly-pipeline
trigger:
  kind: cron
  expr: "0 9 * * 1"   # every Monday 9am
steps:
  - id: research
    agent: youtube-researcher
    input:
      seed_topic: "{{inputs.seed_topic}}"
      target_audience: "{{inputs.target_audience}}"
    output: state.brief.json

  - id: script
    agent: youtube-scriptwriter
    needs: [research]
    input:
      brief: state.brief.json
    output: script.md

  - id: voice
    agent: voiceover-producer
    needs: [script]
    input:
      script: script.md
      voice: "{{inputs.voice_id}}"
    output: voiceover.mp3

  - id: thumbnail
    agent: thumbnail-designer
    needs: [research]
    input:
      brief: state.brief.json
    output:
      - thumbnail_v1.png
      - thumbnail_v2.png
      - thumbnail_v3.png

  - id: human-pick
    kind: human_in_loop
    needs: [thumbnail]
    message: "Pick a thumbnail (1, 2, 3) or type 'regenerate'."
    timeout_minutes: 1440   # 24h, then auto-pick v1

  - id: upload
    agent: youtube-publisher
    needs: [voice, human-pick]
    input:
      brief: state.brief.json
      script: script.md
      voiceover: voiceover.mp3
      thumbnail: thumbnail_selected.png
    output: upload_log.json

  - id: notify
    kind: notify
    needs: [upload]
    channel: telegram
    target: "{{inputs.owner_chat_id}}"
    message: "Video is up (unlisted): {{steps.upload.watch_url}}"

  - id: publish
    kind: human_in_loop
    needs: [notify]
    message: "Type 'publish' to make public, 'hold' to keep unlisted."
    timeout_minutes: 4320  # 3 days

This is the whole pipeline. One YAML, five agents, two human-in-the-loop pauses (thumbnail pick, final publish). The cron trigger fires it every Monday at 9am. By 9:15am, the brief is in your inbox. By 9:30am, the script is written. By 10:00am, the voiceover is rendered and the thumbnails are waiting for you. By lunch, the video is uploaded unlisted. By Friday, you have either published it or you have not, and either way you are not spending your week on production.
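The needs fields in the YAML define a dependency graph, and it is worth seeing what order that graph actually allows. The sketch below copies the needs entries into a dictionary and groups the steps into waves using Python's standard graphlib; it illustrates the ordering only and makes no claim about how OpenClaw's scheduler is implemented.

```python
from graphlib import TopologicalSorter

# Step id -> the steps it needs, copied from the workflow YAML above.
NEEDS = {
    "research": [],
    "script": ["research"],
    "voice": ["script"],
    "thumbnail": ["research"],
    "human-pick": ["thumbnail"],
    "upload": ["voice", "human-pick"],
    "notify": ["upload"],
    "publish": ["notify"],
}


def execution_waves(needs: dict[str, list[str]]) -> list[list[str]]:
    """Group steps into waves; steps in one wave can run in parallel."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises graphlib.CycleError on circular needs
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(NEEDS), start=1):
        print(number, wave)
```

The second wave contains both script and thumbnail, so the thumbnails can be generated and sent for review while the script and voiceover are still in progress.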

Cost: what this actually costs

The per-video bill, assuming an 8-minute long-form video, English, ElevenLabs v3, Flux Schnell thumbnails, and Claude Sonnet for scripts:

Stage	Cost per video
Research (Serper + Claude)	$0.01
Script (Claude Sonnet)	$0.03
Voice (ElevenLabs v3)	$0.40
Thumbnail (Flux Schnell × 3)	$0.08
Upload (YouTube API)	$0.00
Total	~$0.52 per video

At one video per week, that is $27 a year to run the whole pipeline, end to end, with five agents, two human checkpoints, and a published video. The hosting of OpenClaw itself is a separate line item (the hosted version is $19/month, the self-hosted version is whatever your server costs).
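The annual figure is just the per-video total multiplied out. A small sketch of that arithmetic, working in whole cents to avoid floating-point rounding; the per-stage numbers are the ones from the table above, and the function names are illustrative.

```python
# Per-video cost in cents, from the table above.
COST_CENTS = {"research": 1, "script": 3, "voice": 40, "thumbnail": 8, "upload": 0}
WEEKS_PER_YEAR = 52


def per_video_cents() -> int:
    """Sum of all stage costs for one video."""
    return sum(COST_CENTS.values())


def annual_cost_cents(videos_per_week: int = 1, hosting_cents_per_month: int = 0) -> int:
    """Yearly AI cost plus optional orchestrator hosting, in cents."""
    ai = per_video_cents() * videos_per_week * WEEKS_PER_YEAR
    hosting = hosting_cents_per_month * 12
    return ai + hosting


if __name__ == "__main__":
    print(per_video_cents() / 100)                 # 0.52
    print(annual_cost_cents() / 100)               # 27.04
    print(annual_cost_cents(1, 1900) / 100)        # 255.04
```

The last line adds the $19/month hosted plan: about $228 a year of hosting against about $27 of AI calls, which is why the orchestrator is the larger line item.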

For comparison, a US-based freelance YouTube editor doing this work by hand would charge $150-400 per video for a comparable result. The pipeline pays for itself in week one.

What can go wrong

A few things to watch for in production.

Voice cloning is regulated. If you clone a specific person's voice, you are in the territory of right-of-publicity laws in some US states and the EU AI Act in Europe. The default voices (the ones ElevenLabs ships) are licensed for commercial use. Custom cloned voices are your responsibility. Document which voice you used and why.

YouTube may reject or restrict your upload. Videos uploaded through videos.insert from an API project that has not passed YouTube's compliance audit are locked to private, and a channel without a verified phone number cannot upload videos longer than 15 minutes. If the channel is brand new, the first upload is often held for human review (24-48 hours). Build that delay into the workflow.

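Thumbnails are a human decision. Do not auto-pick as a matter of course. The cost of a bad thumbnail is a 3x lower CTR, which dwarfs the cost of asking the human to click one of three buttons. The 24-hour timeout in the workflow above falls back to v1 only so that a missed message does not stall the run; treat it as a last resort, and keep the human in the loop on this one.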

The script agent is not the brand voice. The script agent writes a voice, not your voice. If you have a strong channel identity, paste 2-3 of your best-performing scripts into the brief as voice_examples. The agent will mirror them.

The pipeline can fail mid-run. If stage 3 (voice) fails on a transient ElevenLabs outage, you do not want to re-do stages 1 and 2. OpenClaw's workflow engine has idempotent steps and a per-step retry budget; use them. Set max_retries: 3 and retry_backoff: exponential on every step.
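What max_retries: 3 with exponential backoff means in practice can be sketched in a few lines. This is a generic illustration of the technique, not OpenClaw's implementation: run_with_retries, the 1-second base delay, and the doubling factor are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(step: Callable[[], T], max_retries: int = 3,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Run step; on failure wait 1s, 2s, 4s, ... and retry up to max_retries times."""
    attempt = 0
    while True:
        try:
            return step()
        except Exception:
            if attempt >= max_retries:
                raise  # retry budget spent: surface the last error
            sleep(base_delay * 2 ** attempt)  # delay doubles on each retry
            attempt += 1


if __name__ == "__main__":
    outcomes = iter([RuntimeError("transient outage"), "voiceover.mp3"])

    def render_voice() -> str:
        result = next(outcomes)
        if isinstance(result, Exception):
            raise result
        return result

    print(run_with_retries(render_voice, base_delay=0.1))
```

Retries only help if the step is safe to repeat, which is why the idempotency matters: a retried voice step should overwrite its own segment files, not append new ones.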

The bigger picture: this is what agent orchestration looks like

The reason this guide exists is not to teach you how to automate a YouTube channel. The reason is to show you what an OpenClaw multi-agent pipeline looks like in practice.

Every step in this guide is a bounded, replaceable agent. You can swap ElevenLabs for OpenAI TTS by changing the prompt and the secret key. You can swap Claude for GPT by changing the model alias. You can swap the YouTube uploader for a TikTok uploader by writing one more agent. The pipeline is a graph, not a monolith, and the graph is the product.

The same shape applies to:

A weekly newsletter pipeline. Research → write → format → send.
A weekly podcast pipeline. Research → script → voice → edit → publish.
A weekly SEO content pipeline. Keyword research → brief → article → image → publish.
A weekly sales outreach pipeline. Lead list → enrichment → email draft → send → follow-up.

If the workflow has more than two steps and more than one tool, OpenClaw is the right substrate. YouTube is the cleanest example, which is why we used it. The pattern is the thing.

FAQ

What is the best AI agent for YouTube automation in 2026? The right answer depends on which steps you want to automate. For the full five-stage pipeline above, OpenClaw on a workspace with the five tools wired in is the cleanest off-the-shelf option. For one-off scripts, ChatGPT or Claude is fine. For voice only, ElevenLabs' built-in editor works. For upload only, TubeBuddy. The pipeline is the value; no single tool does the whole thing alone.

Can AI fully automate a YouTube channel? Technically yes. Practically no — you want a human in the loop on the thumbnail (highest-leverage 1% of the video) and the final publish (brand-safety). The pipeline above gets you 95% of the way there; the last 5% is the human's call. That is the right shape.

How much does it cost to automate a YouTube channel with AI? The pipeline above costs roughly $0.50 per video for the AI calls. The hosting of OpenClaw (or whatever orchestrator you use) is a separate line item — the hosted GolemWorkers version is $19/month, self-hosted is your server cost. The AI itself is the cheap part. The orchestrator is the long-term cost.

Can I use this for YouTube Shorts? Yes. A 60-second Short is roughly an eighth of the work of an 8-minute video: skip the long-form script, render the voice from a 60-second script, use a 9:16 thumbnail (or a vertical video frame), upload as a Short. The same five agents, just with a different target_length_seconds and a different aspect ratio on the thumbnail step.

Is this against YouTube's terms of service? No, with two caveats. The first is that auto-generated content must still meet YouTube's quality guidelines (no spam, no misleading metadata, no mass-produced low-effort content). The second is that AI-generated content disclosure rules are tightening — in 2026, several jurisdictions require you to label AI-generated content. Add a small made_with_ai: true line to your metadata and disclose in the description.

What about copyright on AI voices? The default voices from ElevenLabs, OpenAI, and Google are licensed for commercial use. Custom cloned voices are your responsibility — see the right-of-publicity note above. For music, the background-track providers in this space (Epidemic Sound, Artlist) have AI-friendly licenses; check the fine print.

Can I run this without coding? The workflow YAML above is about 60 lines and you can copy-paste it. The prompts are plain English. The setup is one-time and well-documented. If you can set up a Notion template, you can set up this pipeline. If you cannot, GolemWorkers has a one-click install for the full workflow.

What if I want to add TikTok or Instagram Reels? Write a sixth agent. The pipeline structure is the same: take the same script and voiceover, render at 9:16, upload via the TikTok/Instagram API. The OpenClaw workflow engine treats it as one more step in the DAG.

How do I avoid the "AI slop" look on my channel? Three things. First, the human-pick on the thumbnail is non-negotiable. Second, paste 2-3 of your real scripts into the brief as voice_examples; the script agent will mirror them. Third, watch the first 5 videos the pipeline produces and edit the script prompt based on what does not sound like you. The pipeline is a starting point, not a finished product. Iterate the prompts the way you would iterate a junior writer's draft.

Try it on GolemWorkers

The five agents in this guide, plus the workflow YAML, are available as a one-click template on GolemWorkers. The hosted version sets up the agents, wires the secrets (yours, not ours), and gives you a dashboard tab for the pipeline. The self-hosted version is the same code, run on your machine. The link is in the description.

If you have a specific niche or a different cadence (daily Shorts, twice-weekly long-form, monthly podcast), the same agents re-shape to your run. The pipeline is a graph; the graph is the product; the product is the thing that does the work.