2026-06-18

GLM-5.2 Goes Fully Open: 753B MoE with 1M Context (2026)

GLM-5.2 from Z.ai is now fully open-source under MIT: 753B MoE, 1M context, beats GPT-5.5 on long-horizon coding at 1/6 the cost. Use it on OpenClaw.

2026-06-17

GLM-5.2 Goes Fully Open: 753B MoE with 1M Context (2026)

On June 13, 2026, Z.ai open-sourced GLM-5.2 — a 753-billion-parameter Mixture-of-Experts model with a usable 1-million-token context window and an MIT license. This article covers what shipped, where it lands on the benchmarks, and how to wire it into OpenClaw on GolemWorkers as a drop-in replacement for Claude and GPT on long-horizon coding jobs at roughly one-sixth the API cost.

Four days ago, on June 13, 2026, Z.ai — the company formerly known as Zhipu AI — pushed the weights of GLM-5.2 to Hugging Face and ModelScope under an MIT license, with no research-only restrictions and no borders clause. Two weeks after MiniMax M3, this is the second open-weights bombshell of the month, and it hits harder: a 753-billion-parameter Mixture-of-Experts design with 40 billion parameters active per token, a 1-million-token context window that the team explicitly trained for long-horizon agentic use, and an API price of $1.40 per million input tokens and $4.40 per million output tokens — roughly one-sixth of GPT-5.5 at the same scale. On FrontierSWE, a benchmark that grades agents on multi-hour to multi-day open-source engineering tasks, GLM-5.2 hits 74.4% versus GPT-5.5's 72.6% — the first open-weight model to beat a frontier closed model on this test.

For GolemWorkers users, the practical question is the same one we got the day GPT-5 launched: can I use this from OpenClaw? The answer is yes, in two ways. The hosted Z.ai API works as an OpenAI-compatible drop-in — set the base URL, paste the key, and any OpenClaw agent that already speaks openai.ChatCompletion.create will route through GLM-5.2 instead of OpenAI. And because the weights are MIT, you can also self-host on a GolemWorkers dedicated worker if you have the GPU budget (eight H200s or sixteen H100s is the published minimum). This article walks through the news, the benchmark numbers, the technical architecture, and the two integration paths so you can decide which one fits.

What shipped on June 13

The release is a coordinated drop across four surfaces: the weights, an OpenAI-compatible API, a coding-plan subscription, and a v5.2 line of the Z.ai chat product. The headline numbers, all from the Z.ai blog post and the model card:

Spec GLM-5.2 What it means in practice
Parameters 753B total, 40B active per token (MoE) Comparable inference cost to a 40B dense model despite the 753B total parameter count
Context window 1,000,000 tokens input Lossless, not lossy compression — usable, not theoretical
Max output 128,000 tokens per turn Enough for a full code file or a long-form report in one response
License MIT Commercial use, modification, redistribution all permitted with attribution
API pricing $1.40 / 1M input · $4.40 / 1M output ~1/6 the cost of GPT-5.5 at parity, ~1/4 of Claude Opus 4.8
Self-host min 8× H200 or 16× H100 GPUs (FP8 weights) One full GolemWorkers dedicated worker, 4-GPU bare metal SKU is too small
Reasoning modes High / Max (dual thinking effort) High for normal tasks, Max for multi-step coding and planning

The MIT license is the headline. It is not "open weights for research only." It is the same license the Qwen team uses for the 3-series and the same license Llama 3 used to ship under. You can fine-tune GLM-5.2, distill it into a smaller model, sell fine-tuned versions, and run it as a managed API — all without negotiating with Z.ai. The one restriction that does remain is the export-control compliance note on the Hugging Face repo: if you are in a jurisdiction covered by U.S. export restrictions targeting PRC-origin frontier models, check with counsel before deploying.

Benchmark scores that matter

The model card and Z.ai's blog post publish the following numbers. Three are worth caring about; six are noise. Here are the three that actually change buying decisions:

FrontierSWE — 74.4%. This is the benchmark that measures an agent's ability to complete a real open-source engineering task end-to-end over hours or days — not "write a function" but "refactor the storage layer of a 200K-line codebase, run the test suite, fix the failures, open a PR." GLM-5.2 hits 74.4%, ahead of GPT-5.5 at 72.6% and within 0.7 points of Claude Opus 4.8 at 75.1%. For teams that ship agents for long-horizon work — code refactoring, multi-system coordination, large PRs — this is the number that makes GLM-5.2 a credible replacement for the closed frontier models on cost alone.

Terminal-Bench 2.1 — 81.0%. The first open-weight model to break 80% on this benchmark, which grades terminal-based agentic coding across dozens of tools. Up from GLM-5.1's 63.5 — a 17.5-point jump in one release, which is unusually large for a single version bump and suggests the team did serious RL work between 5.1 and 5.2.

MCP-Atlas — 76.8%. This benchmark specifically grades tool invocation through the Model Context Protocol — the same protocol OpenClaw exposes to agents. GLM-5.2 beats GPT-5.5's 75.3 here, which is relevant if you build agents on OpenClaw that call many tools in a single turn (which is most production agents).

Two more worth a mention. SWE-bench Pro — 62.1%, up from GLM-5.1's 58.4%. And AIME 2026 — 99.2%, near saturation, mostly a marketing number now. The model also went #2 globally on the LMArena Coding Blind Test (ahead of Claude Opus 4.7 and 4.8) and #1 globally on the Design Arena Design Programming test — the first time an open-source model has defeated top closed models in blind testing. The Arena wins are softer signal than the reproducible benchmarks because Arena votes are noisy, but they corroborate that the model feels competitive to users who don't know what they're talking to.

The one open-source comparison worth making explicit: GLM-5.2 is meaningfully ahead of Gemini 3.1 Pro and the open-weight Llama 4 variants on FrontierSWE, SWE-bench Pro, and Terminal-Bench 2.1. The previous generation of open-weight models (Llama 4 Behemoth, DeepSeek V4) trailed Claude and GPT by 10-15 points on these benchmarks. GLM-5.2 closes most of that gap.

The technical story: IndexShare and dual reasoning

Two architectural choices are worth understanding because they show up in how the model behaves at inference time.

IndexShare sparse attention. The pain point of 1-million-token context is compute cost — full attention over a 1M-token sequence scales quadratically. Z.ai's answer is an IndexShare mechanism that reuses a lightweight indexer across every four sparse attention layers, so most of the layers can attend selectively instead of computing full attention over the whole context. The published result is that GLM-5.2 holds coherent reasoning across a full 1M-token input without the usual "lost in the middle" degradation you see with RAG-retrieved context. In practice this means you can point it at a 600K-token code repository and ask a question about a function defined on line 4,217 of a file referenced from line 312,889, and it will answer correctly. Most other long-context models fail that test at 200K tokens.

Dual High and Max reasoning modes. The API exposes two reasoning effort levels. High is the default — fast, comparable to GLM-5.1 in latency and cost. Max enables the deeper thinking path: longer chain-of-thought, more candidate evaluations per step, slower and more expensive. The official guidance is to use High for normal coding and chat, and to use Max for multi-step planning, hard debugging, and security-sensitive code review. On GolemWorkers, you can configure per-agent reasoning mode in the worker config — useful if you want your refactor agent to use Max while your Slack-bot agent stays on High.

How to use GLM-5.2 with OpenClaw on GolemWorkers

Two integration paths. The hosted API path is 10 minutes of setup. The self-host path is a procurement decision.

Path A — Hosted Z.ai API (10 minutes)

The Z.ai API is OpenAI-compatible. OpenClaw already speaks the OpenAI chat completions protocol, so wiring GLM-5.2 in means pointing the existing client at a new base URL.

In your worker environment on GolemWorkers, set two env vars:

export ZAI_API_KEY=*** ~/.secrets/zai-api-key)
          export ZAI_BASE_URL=https://api.z.ai/v1   # or the EU endpoint, https://api.eu.z.ai/v1
          

Then in the worker config (~/.openclaw/agents/<agent-name>/config.yaml), point the model field at GLM-5.2 and pick the reasoning mode:

model: glm-5-2-high   # or glm-5-2-max for the deeper reasoning mode
          model_provider: zai
          model_fallback:
          - claude-opus-4-8   # automatic fallback if GLM-5.2 errors or rate-limits
          

Restart the agent:

openclaw restart
          

Verify with a smoke test — a coding question where you already know the expected answer:

openclaw run --agent smoke "what's the output of `python -c 'print(sum(range(10**7)))'`?"
          

If the model is wired correctly, you get the right answer in under five seconds and the response header shows X-Model: glm-5-2-high. Cost on this single query is roughly $0.0001 — about one ten-thousandth of what the same query would cost on Claude Opus 4.8.

For teams running GLM-5.2 in production on GolemWorkers, the realistic workflow is to use GLM-5.2 for the bulk of long-horizon work (refactor agents, multi-step planning agents, code review at scale) and keep Claude Opus as the fallback for the 5-10% of queries that need the absolute best quality. The cost split ends up around 80/20 — most queries on GLM, the hard ones on Claude — which lands the per-task cost at roughly $0.30 on Opus-equivalent workloads where pure Claude would cost $1.80.

Path B — Self-hosted weights (procurement-level)

If you have data-residency requirements, hit GLM-5.2 hard enough that the API bill exceeds the dedicated-server cost, or just want full control of the weights, you can self-host on a GolemWorkers dedicated worker.

The published minimum is 8× H200 GPUs or 16× H100 GPUs in FP8. The closest GolemWorkers SKU is the dedicated 8-GPU H200 worker (around $4,800/month at the time of writing). For inference at scale, you also want the vLLM 0.7+ server, which has first-class GLM-5.2 support and the IndexShare path integrated as of the v0.7.2 release.

The deploy pattern:

# On a GolemWorkers dedicated 8xH200 worker
          pip install vllm>=0.7.2

          # Download weights (Hugging Face or ModelScope)
          huggingface-cli download zai/GLM-5.2 --include "*.safetensors" --local-dir /models/glm-5-2

          # Start the vLLM server
          vllm serve /models/glm-5-2 \
          --served-model-name glm-5-2-high \
          --tensor-parallel-size 8 \
          --max-model-len 1048576 \
          --gpu-memory-utilization 0.92
          

The OpenClaw worker config then points at the local vLLM endpoint instead of the Z.ai API:

model: glm-5-2-high
          model_provider: openai
          model_base_url: http://localhost:8000/v1
          model_api_key: not-needed    # vLLM ignores the auth header
          

The catch with self-hosting is throughput. A single 8×H200 node serves roughly 8-12 concurrent requests at 1M-context input at acceptable latency. For a team running 50+ agents concurrently, you need a multi-node setup with tensor parallelism across workers, which is operationally non-trivial. Most GolemWorkers users start on Path A and migrate to Path B only when the API bill crosses the dedicated-server cost — usually around 200M tokens/month sustained.

What to use GLM-5.2 for

The honest answer is: anything you'd use Claude Opus for, except the 5-10% of tasks where Opus is meaningfully better. Based on the benchmark spread and early user reports, here is where GLM-5.2 is a clear win, a coin flip, and a clear loss:

Clear win — long-horizon agentic coding. Refactor agents, multi-system coordination, large-PR generation, multi-day planning tasks. This is what the model was trained for, and the IndexShare path means the 1M context window actually helps on real workloads, not just on marketing demos.

Clear win — high-volume, cost-sensitive workloads. Bulk code review, large-scale refactor, CI/CD log triage, anything where the answer needs to be "good enough" and the cost-per-query matters. At 1/6 the cost of GPT-5.5, you can afford to run ten times as many agents on the same budget.

Coin flip — general chat and short-form coding. For under-200K-context single-turn chat, GLM-5.2 is competitive with Claude Sonnet and GPT-5.5 but not obviously better. Pick on cost and latency.

Clear loss — the hardest 5%. Tasks that need absolute top-of-leaderboard reasoning (frontier math, novel-architecture security review, multi-step formal verification) still go to Claude Opus 4.8 or GPT-5.5 high-effort. GLM-5.2 Max closes some of the gap but doesn't erase it.

Cost: GLM Coding Plan vs API vs self-host

Three ways to pay for the model, with rough monthly costs at different usage levels:

Workload GLM Coding Plan Z.ai API Self-host (8xH200)
5M tokens/month (hobbyist) $5/month flat ~$25 not cost-effective
50M tokens/month (small team) $20/month flat ~$250 not cost-effective
500M tokens/month (production agent) upgrade to usage-based, ~$2,000 ~$2,500 $4,800 (break-even vs API around 200M tok/mo)
5B tokens/month (heavy fleet) usage-based ~$25,000 $4,800 + ops

The GLM Coding Plan at $5/month for 5M tokens is the cheapest path for individual developers and hobby projects — the same plan tier Z.ai launched for GLM-4.6, now extended to GLM-5.2. For anything above 50M tokens/month, you compare the API bill to a dedicated worker and migrate when the API cost crosses the worker cost.

What to watch next

Three things will move the needle for OpenClaw users in the next 30 days.

Fine-tunes. The MIT license allows commercial fine-tuning, and the first community fine-tunes of GLM-5.2 will land on Hugging Face within 2-4 weeks. Watch for coding-specialized variants (likely from the DeepSeek and Qwen fine-tune communities) and tool-use-specialized variants.

Open-source MCP support. Z.ai has hinted at first-class MCP server support in the next API release. If they ship a hosted MCP tool catalog the way Anthropic has for Claude, that changes the calculus for hosted-vs-self-host.

GolemWorkers managed GLM-5.2 worker. Vsevolod has signaled that the next GolemWorkers dedicated-worker SKU after the 8-GPU H200 will be a pre-configured GLM-5.2 inference server — eight H200s, vLLM pre-installed, weights pre-downloaded, OpenAI-compatible endpoint exposed. If that ships in Q3 2026 as rumored, self-hosting GLM-5.2 on GolemWorkers becomes a one-click deployment instead of a 4-hour setup.

FAQ

What is GLM-5.2?

GLM-5.2 is the latest flagship model from Z.ai (formerly Zhipu AI), released on June 13, 2026 under the MIT license. It is a 753-billion-parameter Mixture-of-Experts language model with 40 billion parameters active per token, a 1-million-token context window, and 128K-token max output. It is positioned for long-horizon agentic coding and multi-system engineering tasks.

Is GLM-5.2 really open source?

Yes, under the MIT license — the same license used by Qwen 3 and earlier Llama 3 versions. You can download the weights from Hugging Face or ModelScope, fine-tune them, distill them, sell fine-tuned versions, and deploy them as a managed service. The only practical restriction is export-control compliance: if you are in a jurisdiction covered by U.S. export restrictions on PRC-origin frontier models, check with counsel before deploying. There is no research-only restriction and no use-based restriction.

How does GLM-5.2 compare to GPT-5.5 and Claude Opus 4.8?

On FrontierSWE — a long-horizon coding benchmark — GLM-5.2 hits 74.4%, ahead of GPT-5.5's 72.6% and within 0.7 points of Claude Opus 4.8's 75.1%. On Terminal-Bench 2.1, GLM-5.2 is the first open-weight model to break 80% (81.0%). On SWE-bench Pro, GLM-5.2 hits 62.1%, up from GLM-5.1's 58.4%. On MCP-Atlas (tool invocation), GLM-5.2 hits 76.8%, ahead of GPT-5.5's 75.3. The honest summary: GLM-5.2 is the first open-weight model competitive with the closed frontier on long-horizon agentic coding, and it does so at roughly 1/6 the API cost.

How much does GLM-5.2 cost to run?

The Z.ai API is $1.40 per million input tokens and $4.40 per million output tokens — roughly 1/6 the cost of GPT-5.5 and 1/4 the cost of Claude Opus 4.8. The GLM Coding Plan subscription is $5/month flat for 5 million tokens, $20/month for 50 million tokens. Self-hosting on an 8× H200 GolemWorkers dedicated worker runs about $4,800/month and breaks even with the API around 200 million tokens/month sustained throughput.

Can I use GLM-5.2 with OpenClaw on GolemWorkers?

Yes, in two ways. The hosted API path takes about 10 minutes: set ZAI_API_KEY and ZAI_BASE_URL in the worker env, point the model field at glm-5-2-high (or glm-5-2-max for the deeper reasoning mode), and restart the agent. The self-host path is procurement-level: provision an 8× H200 GolemWorkers dedicated worker, install vLLM 0.7.2+, download the weights, start the server, and point OpenClaw at http://localhost:8000/v1 with model_provider: openai.

What is IndexShare?

IndexShare is Z.ai's sparse attention mechanism for long-context inference. It reuses a lightweight indexer across every four sparse attention layers, so most layers attend selectively instead of computing full attention over the whole 1M-token context. The practical effect is that GLM-5.2 holds coherent reasoning across a full 1-million-token input — including "lost in the middle" cases where the relevant information is buried at 40-60% depth — at a fraction of the compute that full attention would require.

What is the GLM Coding Plan?

The GLM Coding Plan is Z.ai's flat-rate subscription for individual developers and small teams. It is $5/month for 5 million tokens or $20/month for 50 million tokens, with overage billed at API rates. It is positioned for hobby projects, individual developers, and small teams that want predictable billing. Production workloads above 50 million tokens/month should compare to the usage-based API or a dedicated worker.

Should I migrate from Claude or GPT to GLM-5.2?

For long-horizon coding, multi-step agentic workflows, and high-volume bulk workloads — yes, on cost alone. The benchmark spread on FrontierSWE and Terminal-Bench 2.1 means you are giving up at most 0.7 points of quality for a 6× cost reduction. For the hardest 5% — frontier math, novel-architecture security review, multi-step formal verification — keep Claude Opus 4.8 or GPT-5.5 high-effort as the fallback. A practical split is 80/20: 80% of queries on GLM-5.2, 20% on the closed frontier for the hardest cases. This drops the per-task cost on Opus-equivalent workloads from about $1.80 to about $0.30.

Related articles