Ideogram 4.0: The Open-Weight Model That Actually Renders Text
A 9.3B model that beats 80B competitors on typography — and what that means for every builder working with design.
TL;DR
- Ideogram 4.0 is a 9.3B open-weight text-to-image model released June 3, 2026, and the first open model from Ideogram.
- It beats every other open-weight model on text rendering, scoring 0.97 OCR accuracy while being 3–9× smaller than competitors.
- The secret weapon is JSON prompting: structured captions with bounding boxes, hex color palettes, and per-element text control.
- The community is already running it in ComfyUI with day-0 native support and three prompting methods.
- The catch: commercial use of the weights requires a paid license. The API and web app are covered under standard terms.
- Where it genuinely falls short: photorealistic human portraiture still belongs to GPT Image 2 and closed models.
On June 3, 2026, Ideogram did something the open-source image generation community has been waiting for: they dropped their weights. Ideogram 4.0 is a 9.3-billion-parameter text-to-image model you can download, run locally, fine-tune, and build on. More importantly, it's the first model in its parameter class that can render readable, correctly spelled, properly styled text inside images — reliably. That one capability changes more workflows than the benchmark numbers suggest.
- Parameters: 9.3B (3–9× smaller than rivals)
- Designer preference ELO: 1062 (#1 open, #2 overall)
- Client-work usability score: 3.55/5 (vs 2.84 for Nano Banana 2)
Why text rendering in images is harder than it looks
Every image generation model struggles with text. Ask Midjourney, FLUX, or Stable Diffusion to render the word "Breakfast" on a café sign and you will get something that looks like text from three feet away and reads like gibberish up close. Letters get transposed. Fonts blend into backgrounds. Apostrophes become lowercase Ls. This isn't a bug — it's a fundamental limitation of how diffusion models learn.
Standard text-to-image models learn from image–caption pairs. The caption describes the image, but the model isn't specifically supervised to produce readable glyphs. It learns that "text-looking pixels" correlate with certain prompts, but it doesn't learn the alphabet as a structured system. The result is plausible-looking text that doesn't actually say the right thing.
Ideogram's core research bet — one they've been building toward since the original Ideogram 1.0 — is that you can fix this by changing what you train on. Instead of natural-language captions, you train on structured JSON descriptions that exhaustively describe every element in the image, including text elements with their exact strings, styling, and position. The model learns text as a first-class output, not as a side effect of image style.
Good to know
The paper behind the approach: Ideogram's structured caption training is based on research published as "Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions" (Gutflaish et al., 2025). The core insight: the more relationships a caption pins down per training pair, the more grounded the model's supervision becomes.
What Ideogram 4.0 actually is under the hood
The architecture is a single-stream Diffusion Transformer (DiT) — 34 transformer layers where text and image tokens share one sequence and one set of projections at every layer. This is the same pattern as HunyuanImage 3.0, Z-Image, and HiDream-O1, so the broad strokes aren't novel. What distinguishes Ideogram 4.0 are two specific design choices.
First: the text encoder. Most single-stream DiT models use either a single hidden state from a text encoder or no external encoder at all. Ideogram 4.0 uses Qwen3-VL-8B-Instruct — a vision-language model — as its text encoder, and the DiT consumes hidden states from 13 of its intermediate layers, concatenated along the feature dimension. That's a much richer text representation than a single final-layer embedding. It's a significant reason the model understands complex, compositional prompts better than its parameter count would suggest.
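To make "concatenated along the feature dimension" concrete, here is a toy sketch in plain Python. Only the count of 13 tapped layers comes from the description above; the hidden size, the token count, and the helper name `concat_layer_states` are illustrative assumptions, and which 13 encoder layers the real model taps is not specified here.

```python
# Toy sketch of multi-layer feature concatenation (sizes are illustrative).
NUM_TAPPED_LAYERS = 13  # from the article
HIDDEN_SIZE = 4         # stand-in value; the real encoder's hidden size is far larger


def concat_layer_states(layer_states):
    """Concatenate per-token hidden states from several layers along the feature axis.

    layer_states: list of [num_tokens][hidden_size] matrices, one per tapped layer.
    Returns a [num_tokens][num_layers * hidden_size] matrix.
    """
    num_tokens = len(layer_states[0])
    fused = []
    for t in range(num_tokens):
        row = []
        for states in layer_states:
            row.extend(states[t])  # append this layer's features for token t
        fused.append(row)
    return fused


# Fake hidden states: layer i fills every feature with the value i.
num_tokens = 3
layer_states = [
    [[float(i)] * HIDDEN_SIZE for _ in range(num_tokens)]
    for i in range(NUM_TAPPED_LAYERS)
]
fused = concat_layer_states(layer_states)
print(len(fused), len(fused[0]))  # 3 52
```

Each token keeps its position in the sequence but carries 13 times as many features, which is why the conditioning signal is so much richer than a single final-layer embedding.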
Second: asymmetric classifier-free guidance. Standard CFG runs a conditional pass (text + image) and an unconditional pass (text replaced with a null embedding). Ideogram 4.0's unconditional pass drops the text tokens entirely — it runs only over image tokens. This means the two passes can be tuned independently, giving separate control over prompt adherence and image quality across the sampling trajectory. In practice, this is why the "quality" presets run 45 steps with a guidance weight of 7 followed by 3 polish steps at weight 3 near the end of sampling — tightening fine detail without over-saturating the global composition.
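A minimal sketch of the idea, assuming the textbook CFG combination rule and reading the "quality" preset as 45 steps at weight 7 followed by 3 steps at weight 3 (48 in total). `toy_denoiser` is a hypothetical stand-in for the DiT, not Ideogram's code; the point is only that the unconditional pass receives no text tokens at all.

```python
def cfg_combine(cond, uncond, weight):
    """Classifier-free guidance: move the prediction away from the unconditional one."""
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


def guidance_schedule(main_steps=45, main_weight=7.0, polish_steps=3, polish_weight=3.0):
    """Per-step guidance weights: a strong main phase, then a gentler polish phase."""
    return [main_weight] * main_steps + [polish_weight] * polish_steps


def toy_denoiser(image_tokens, text_tokens=()):
    """Hypothetical stand-in for the DiT: one predicted value per image token."""
    text_bias = 0.1 * len(text_tokens)  # text only matters when it is in the sequence
    return [0.5 * x + text_bias for x in image_tokens]


def guided_prediction(image_tokens, text_tokens, weight):
    cond = toy_denoiser(image_tokens, text_tokens)  # text + image tokens
    uncond = toy_denoiser(image_tokens)             # image tokens only, no null text embedding
    return cfg_combine(cond, uncond, weight)


schedule = guidance_schedule()
print(len(schedule), schedule[0], schedule[-1])  # 48 7.0 3.0
for weight in (schedule[0], schedule[-1]):
    print(weight, guided_prediction([1.0, 2.0], ("LAUNCH", "poster"), weight))
```

Because the unconditional pass runs over a shorter sequence, it is also cheaper than the conditional pass, which a null-embedding setup does not give you.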
Comparison
| Field | Value |
|---|---|
| Parameters | 9.3B |
| Transformer layers | 34 |
| Embedding dimension | 4,608 |
| Text encoder | Qwen3-VL-8B-Instruct (frozen) |
| Encoder layers consumed | 13 intermediate layers |
| Sampler | Euler flow-matching + asymmetric CFG |
| VAE | KL autoencoder, 8× spatial compression |
| Resolution range | 256–2048 px per side, any aspect ratio |
| Max text tokens | 2,048 |
| Quantization options | fp8 and nf4 |
| Minimum GPU for nf4 | 24 GB VRAM |
The benchmark numbers — and what they actually mean
Ideogram published four benchmark scores at launch. Let's look at each one honestly, because two of them are extraordinary and two require some context.
Text rendering at 0.97 is the genuinely remarkable number. X-Omni measures English OCR accuracy on text generated inside images. A score of 0.97 means the model correctly renders text 97% of the time as measured by optical character recognition. No other open-weight model comes close: FLUX.2 dev (32B parameters, 3.4× larger) scores significantly lower, and so does HunyuanImage 3.0 (80B MoE, ~9× the total parameter count). Ideogram 4.0 at 9.3B is doing more per parameter on this specific capability than any other released open model.
Prompt alignment at 0.89 on Prism-bench is strong but not unusual for a well-trained model at this scale. Prism measures how well a model follows long, compositional inputs. The structured JSON training pays dividends here — the model was literally trained on exhaustive scene descriptions, so it's particularly good at following complex multi-element prompts.
Spatial reasoning at 0.76 (SpatialGenEval) and layout control at 0.69 (7Bench) are good, but the margins over competitors are narrower here. 7Bench measures how tightly generated objects land inside requested bounding boxes. A 0.69 mIoU means there's meaningful room to improve: bounding box placement is not pixel-perfect, especially for complex scenes. This matters if you're building design workflows where exact element positioning is critical.
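To get a feel for what an mIoU around 0.69 looks like, here is a small sketch of the standard intersection-over-union calculation. It assumes boxes in the same `[y_min, x_min, y_max, x_max]` 0–1000 format the model's prompts use; 7Bench's exact evaluation protocol may differ in its details, and the example boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two [y_min, x_min, y_max, x_max] boxes."""
    ya1, xa1, ya2, xa2 = box_a
    yb1, xb1, yb2, xb2 = box_b
    inter_h = max(0, min(ya2, yb2) - max(ya1, yb1))
    inter_w = max(0, min(xa2, xb2) - max(xa1, xb1))
    inter = inter_h * inter_w
    area_a = (ya2 - ya1) * (xa2 - xa1)
    area_b = (yb2 - yb1) * (xb2 - xb1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def mean_iou(pairs):
    """Average IoU over (requested, generated) box pairs."""
    return sum(iou(requested, generated) for requested, generated in pairs) / len(pairs)


requested = [100, 100, 500, 500]
generated = [140, 140, 540, 540]  # same size, shifted by 10% of the box side on each axis
print(round(iou(requested, generated), 3))  # 0.681
```

In other words, an element that lands one tenth of its own width off target already scores about 0.68, so a 0.69 average is consistent with "close, but expect to nudge it".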
One benchmark caveat to know
Ideogram used gemini-2.5-flash instead of Qwen2.5-VL-72B (the leaderboard standard) to judge SpatialGenEval. They applied this change uniformly across all compared models, so cross-model comparisons remain valid — but the absolute scores are not directly comparable to other SpatialGenEval results you'll see cited elsewhere.
Designer preference: what the community actually voted
Beyond the automated benchmarks, Ideogram ran an internal arena where graphic designers picked the better of two generations without knowing which model produced each image. The results are more interesting than the radar chart suggests.
In head-to-head pairwise preference across 4,366 designer votes, Ideogram 4.0 came second overall with an ELO of 1,062 — behind only GPT Image 2 (1,141). Every other model scored lower: Nano Banana 2 (1,004), Grok Imagine 1.0 (990), FLUX.2 Pro (982). Among open-weight models, Ideogram 4.0's lead is substantial — roughly 160 ELO points ahead of the next open competitor (FLUX.2 dev at 900).
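Rating gaps are easier to read as win rates. The sketch below assumes the classic Elo expected-score formula with a 400-point scale; the arena's actual rating method is not stated here and could be a different variant, so treat the percentages as a rough translation rather than reported results.

```python
def expected_win_rate(rating_a, rating_b, scale=400.0):
    """Classic Elo expected score of A against B (assumed formula, not the arena's documented one)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the article.
ratings = {
    "Ideogram 4.0": 1062,
    "GPT Image 2": 1141,
    "FLUX.2 dev": 900,
}

for rival in ("GPT Image 2", "FLUX.2 dev"):
    p = expected_win_rate(ratings["Ideogram 4.0"], ratings[rival])
    print(f"Ideogram 4.0 vs {rival}: {p:.0%}")
```

Under that assumption, the 79-point gap to GPT Image 2 means Ideogram 4.0 would be picked in roughly 39% of pairings, while the 162-point lead over FLUX.2 dev corresponds to roughly 72%.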
The more practically useful number: when asked "Would you use this in real client work?", designers scored Ideogram 4.0 at 3.55 out of 5, compared to 2.84 for Nano Banana 2, 2.61 for Grok Imagine 1.0, and 2.49 for FLUX.2 max. That gap between preference and usability is worth noting — designers found Ideogram 4.0 both prettier and more practically useful, which doesn't always go together.
- Designer votes in the ELO arena: 4,366
- Client-work usability score: 3.55/5
- #1 ranked open-weight model on LMArena
JSON prompting — the feature that changes everything (and the learning curve that comes with it)
This is the part of Ideogram 4.0 that most coverage treats as a footnote. It shouldn't be. The model was trained exclusively on structured JSON captions, and the official inference pipeline validates every prompt against a JSON schema before generation. This isn't just a prompting technique — it's how the model fundamentally understands input.
A JSON prompt has three parts: a high_level_description (the overall scene), a style_description block (aesthetics, lighting, medium, color palette), and a compositional_deconstruction block (background + an array of typed elements). Each element can be an object ("type": "obj") or a text element ("type": "text"). Text elements carry the exact string to render and a separate visual description for styling.
Three things become possible with JSON that aren't possible with flat prompts:
- Color palette conditioning. You can specify up to 16 hex colors per image, and up to 5 per element. The model steers the dominant colors directly from the hex values rather than from descriptive language ("a warm amber sunset" vs #E6B422). The results are meaningfully more controllable.
- Bounding-box layout. Any element can be placed using [y_min, x_min, y_max, x_max] in 0–1000 normalized coordinates. The model respects these through its shared 3D Multimodal RoPE positional space, which is the same positional encoding used for both text and image tokens. Not pixel-perfect, but substantially better than any natural-language spatial instruction.
- Typed text elements. This is the core of the text rendering capability. A text element carries the literal string to render as one field and a separate visual description for how it should look. The separation means the model knows what to write and what to draw independently.
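The coordinate convention trips people up because y comes first. Here is a small sketch that converts a normalized box into pixel coordinates for a given canvas; `bbox_to_pixels` is a hypothetical helper written for illustration and is not part of any Ideogram or ComfyUI API.

```python
def bbox_to_pixels(bbox, width, height):
    """Convert [y_min, x_min, y_max, x_max] in 0-1000 units to (left, top, right, bottom) pixels."""
    y_min, x_min, y_max, x_max = bbox
    if not all(0 <= v <= 1000 for v in bbox):
        raise ValueError("bbox values must lie in the 0-1000 range")
    if y_min >= y_max or x_min >= x_max:
        raise ValueError("bbox must satisfy y_min < y_max and x_min < x_max")
    left = round(x_min * width / 1000)
    top = round(y_min * height / 1000)
    right = round(x_max * width / 1000)
    bottom = round(y_max * height / 1000)
    return left, top, right, bottom


# A headline box spanning most of the width, near the top of a 1024x1024 canvas.
print(bbox_to_pixels([80, 80, 300, 920], 1024, 1024))  # (82, 82, 942, 307)
```

Because the units are normalized, the same box works unchanged at any resolution or aspect ratio; only the pixel result differs.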
Example JSON prompt
    {
      "high_level_description": "A square event poster for a product launch, clean and modern.",
      "style_description": {
        "aesthetics": "Minimal, editorial, high contrast.",
        "lighting": "Flat. No shadows.",
        "medium": "Digital graphic design.",
        "color_palette": ["#0A0A0A", "#FFFFFF", "#3B82F6"]
      },
      "compositional_deconstruction": {
        "background": "Clean white background filling the full square frame.",
        "elements": [
          {
            "type": "text",
            "bbox": [80, 80, 300, 920],
            "text": "LAUNCH",
            "desc": "Large bold condensed sans-serif headline in deep black, filling the upper third."
          },
          {
            "type": "text",
            "bbox": [310, 80, 420, 920],
            "text": "June 12, 2026 · San Francisco",
            "desc": "Small caps in blue (#3B82F6), centered, tracking wide."
          }
        ]
      }
    }
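If you generate prompts from code, it helps to catch malformed ones before they reach the pipeline. The sketch below is a hand-rolled sanity check, not the official schema: `validate_prompt` is a hypothetical helper that only tests the field names shown in the example and the limits mentioned in this article, so the real validator may accept or reject things this one does not.

```python
import json
import re

HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")
TOP_LEVEL_KEYS = ("high_level_description", "style_description", "compositional_deconstruction")


def validate_prompt(prompt):
    """Return a list of problems; an empty list means these basic checks passed."""
    problems = []
    for key in TOP_LEVEL_KEYS:
        if key not in prompt:
            problems.append(f"missing top-level key: {key}")

    palette = prompt.get("style_description", {}).get("color_palette", [])
    if len(palette) > 16:
        problems.append("more than 16 palette colors")
    for color in palette:
        if not HEX_COLOR.fullmatch(color):
            problems.append(f"not a 6-digit hex color: {color}")

    elements = prompt.get("compositional_deconstruction", {}).get("elements", [])
    for i, element in enumerate(elements):
        if element.get("type") not in ("obj", "text"):
            problems.append(f"element {i}: type must be 'obj' or 'text'")
        if element.get("type") == "text" and not element.get("text"):
            problems.append(f"element {i}: text element has no string to render")
        bbox = element.get("bbox")
        if bbox is not None:
            if len(bbox) != 4 or not all(0 <= v <= 1000 for v in bbox):
                problems.append(f"element {i}: bbox needs four values in 0-1000")
            elif bbox[0] >= bbox[2] or bbox[1] >= bbox[3]:
                problems.append(f"element {i}: bbox min must be below max")
    return problems


prompt = {
    "high_level_description": "A square event poster for a product launch, clean and modern.",
    "style_description": {
        "aesthetics": "Minimal, editorial, high contrast.",
        "color_palette": ["#0A0A0A", "#FFFFFF", "#3B82F6"],
    },
    "compositional_deconstruction": {
        "background": "Clean white background filling the full square frame.",
        "elements": [
            {
                "type": "text",
                "bbox": [80, 80, 300, 920],
                "text": "LAUNCH",
                "desc": "Large bold condensed sans-serif headline in deep black.",
            }
        ],
    },
}

print(validate_prompt(prompt))  # []
print(json.dumps(prompt, indent=2, ensure_ascii=False))
```

Building the prompt as a dictionary and serializing it with `json.dumps` also avoids the quoting and trailing-comma mistakes that creep in when JSON is typed by hand.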
Community shortcut: use an LLM to write the JSON
Writing JSON prompts by hand is verbose. The community has already solved this. In ComfyUI, the KJ nodes include a JSON Prompt Builder that lets you describe your scene in plain language and converts it to a schema-valid JSON prompt automatically. You can also pipe plain text into any LLM with the Ideogram4 Caption Prompt Template as a system prompt.
Running it locally: what you actually need
Ideogram 4.0 ships in two quantized variants — fp8 and nf4 — both available on Hugging Face through the Comfy-Org repackaged collection. The nf4 checkpoint is the practical one for local use: it fits on a single 24 GB GPU. If you're on a 16 GB card, community reports suggest it's tight but possible with careful VRAM management.
The full model needs five files across three directories. ComfyUI is the most accessible entry point: it added native day-0 support and the workflow is in the template library. The ComfyUI tutorial includes three prompting methods: plain natural language (fast, less precise), structured JSON (slower to write, much more controlled), and LLM-assisted (the best of both: write natural language, and an LLM converts it to JSON automatically).
How it works
1. Update ComfyUI to v0.24.0 or later
   The Ideogram 4.0 workflow requires the latest ComfyUI nightly build. Update via the Manager or re-download if you're on the Desktop stable release (Desktop auto-updates will follow).
2. Download the five model files from Hugging Face
   You need ideogram4_fp8_scaled.safetensors and ideogram4_unconditional_fp8_scaled.safetensors (~13.8 GB each) in models/diffusion_models/, qwen3vl_8b_fp8_scaled.safetensors (~8 GB) and optionally gemma4_e4b_it_fp8_scaled.safetensors (~2 GB) in models/text_encoders/, and flux2-vae.safetensors (~335 MB) in models/vae/.
3. Load the Ideogram 4 workflow from the Template Library
   Open ComfyUI → Workflow Templates → find Ideogram 4 Text-to-Image. This gives you the pre-wired node graph with the JSON prompt input already connected.
4. Write your prompt, in plain text or JSON
   For plain text, just type normally. For JSON, paste your structured prompt directly. For LLM-assisted prompting, connect the Ideogram4 Caption Prompt Template node to any LLM tool node and describe your scene in plain language.
5. Run and iterate on bounding boxes
   First generations rarely nail bounding-box placement exactly. Adjust y_min/x_min/y_max/x_max values in 0–1000 normalized coordinates. The model is not pixel-perfect, but each iteration should get visibly closer. If you see "Image blocked by safety filter", this is the model's own built-in filter, not ComfyUI. Rephrase the offending element.
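A misplaced file is the most common reason a freshly loaded workflow refuses to run, so a quick check before launching can save a restart. This is a minimal sketch using the file names and folders listed in the download step; the `ComfyUI` root path and the helper `missing_model_files` are assumptions for illustration, so point it at wherever your install actually lives.

```python
from pathlib import Path

# File names and folders as listed in the download step above.
EXPECTED_FILES = {
    "models/diffusion_models": [
        "ideogram4_fp8_scaled.safetensors",
        "ideogram4_unconditional_fp8_scaled.safetensors",
    ],
    "models/text_encoders": [
        "qwen3vl_8b_fp8_scaled.safetensors",
        "gemma4_e4b_it_fp8_scaled.safetensors",
    ],
    "models/vae": ["flux2-vae.safetensors"],
}
OPTIONAL_FILES = {"gemma4_e4b_it_fp8_scaled.safetensors"}


def missing_model_files(comfy_root):
    """Return the required model files that are not present under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for subdir, names in EXPECTED_FILES.items():
        for name in names:
            if name in OPTIONAL_FILES:
                continue  # optional encoder: do not report it as missing
            if not (root / subdir / name).is_file():
                missing.append(f"{subdir}/{name}")
    return missing


# "ComfyUI" is a placeholder for your own install directory.
for path in missing_model_files("ComfyUI"):
    print("missing:", path)
```

An empty result means the four required files are in place; the Gemma encoder is skipped because the step above marks it as optional.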
Where it wins, where it loses — an honest breakdown
The launch coverage was largely positive, but most of it came from people who tested the model for a few hours. Community testing over the following days has been more nuanced. Here's the honest picture from both the official benchmarks and what builders have actually reported.
| Use case | Verdict | Notes |
|---|---|---|
| Posters with readable text | Best open-weight option | 0.97 OCR accuracy. Typography moat is real. |
| Logo design with company name | Wins decisively | Nails text on second attempt vs 15+ for Midjourney. |
| Social media graphics with copy | Strong | JSON color palette control + text elements = precise brand match. |
| Product packaging mockups | Strong | Bounding-box layout helps with label placement. |
| Multilingual text rendering | Promising (community-reported) | Spanish reported as better than other open models. Not formally benchmarked. |
| Photorealistic human portraiture | GPT Image 2 wins | Ideogram has a "slightly designed" quality. Portraits feel composed, not candid. |
| General illustration / no text | Roughly on par with FLUX.2 | Strong but not the differentiator. FLUX still has a larger fine-tune ecosystem. |
| Complex multi-element layout precision | Good, not perfect | 7Bench mIoU of 0.69 means bounding boxes need iteration. Not pixel-perfect. |

Based on Ideogram benchmarks + community testing across Reddit, LocalLLaMA, and the ComfyUI forums (June 2026)
"Ideogram 4.0 wins decisively on typography and design intent. It is the parameter-efficient specialist, not the everything-model." (BuildFastWithAI analysis, June 2026)
The licensing reality — read this before you build on it
This is the part most enthusiastic coverage glossed over, and it's important if you're a builder planning to use this model in a product.
The inference code is Apache 2.0. The model weights themselves are governed by Ideogram's Non-Commercial Model Agreement. The distinction matters enormously:
- Non-commercial use (research, personal projects, hobbyist testing): Full access. Download, inspect, fine-tune, run locally. No restrictions.
- Commercial deployment (any product or service where the model outputs have commercial value): You need a separate paid commercial license from Ideogram. The weights themselves cannot be used in commercial products without this.
- Using the API or web platform: Standard platform terms apply. Images generated through ideogram.ai are covered. API pricing runs from $0.03/image (Turbo) to $0.09/image (Quality tier).
This is not unusual — it's the same structure as many "open-weight" models including some FLUX variants. But it is a meaningful constraint if you were planning to build a commercial image generation feature using the local weights without a commercial license.
Before you ship anything commercial
If you're building a product on Ideogram 4.0's weights — not the API, the weights — contact Ideogram about commercial licensing before you launch. The non-commercial license is clear that production deployments require a separate agreement.
The strategic move Ideogram just made
Open-sourcing your best model is a counterintuitive decision for a company that built its business on a closed API. The ELO leaderboard advantage that justified Ideogram's pricing model just became a commodity anyone can download. Why?
The most convincing explanation: Ideogram believes the defensible value has already moved. The model weights aren't the moat anymore. The moat is the platform — the editable text layers, the print-on-demand integration, the style controls, the fine-tuning infrastructure, and the workflow ecosystem built around the model. Giving away the weights accelerates community adoption, brings Ideogram into the ComfyUI ecosystem, generates research contributions, and positions the company as the canonical provider of the best open design model — which drives API usage and platform subscriptions more effectively than keeping the weights closed.
There's a second read: "best open-weight model" is a fragile title. It changes hands every few weeks. Z-Image, Qwen Image, the next FLUX point release — the leaderboard shuffles constantly. What won't change as fast is Ideogram's specific training advantage in text and layout, which comes from proprietary data and a research culture focused on design. The company is betting that the typography moat outlasts the ELO ranking. Based on how hard text rendering is to improve through standard training approaches, that bet looks reasonable.
The weights are public, the moat isn't
Ideogram can open-source 4.0 because the real advantage is in training data, fine-tuning infrastructure, and the platform ecosystem — not the frozen weights.
Typography is the durable edge
Text rendering is hard to replicate quickly. A 0.97 OCR score from a 9.3B model that outperforms 80B competitors suggests a data and training advantage that won't close overnight.
The community multiplier
Day-0 ComfyUI support and a JSON prompt LLM builder mean the community is already extending the model faster than Ideogram's own team could.
The commercial license is the business
Non-commercial weights get developers hooked. Commercial licenses and API usage convert them into revenue. It's the open-core playbook.
What this means for you as a builder
If you build anything that involves images with text in them — and more products do than you'd think — Ideogram 4.0 is worth testing this week. Not next month. This week.
The JSON prompting system has a learning curve, but the community has already built tools that smooth it out. The LLM-assisted prompting workflow in ComfyUI means you don't need to write JSON manually for most use cases. And the payoff for design-heavy workflows — posters, ads, UI mockups, social graphics, packaging — is measurable. The "text that actually says the right thing" problem has been a silent tax on every image generation pipeline. Ideogram 4.0 is the first open-weight model that mostly fixes it.
For builders who don't want to take on a commercial license for the weights: the API is the practical path. At $0.03–$0.09 per image, the economics work for most product use cases. You get the same model, the same JSON prompting, and you don't need to manage 40 GB of local model files.
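The arithmetic behind that trade-off is quick to run for your own volume. This sketch uses only the per-image prices and file sizes quoted in this article; the 10,000 images per month figure is a made-up example, and prices are held in whole cents to avoid floating-point noise.

```python
# Per-image API prices quoted in the article, in cents.
TURBO_CENTS = 3
QUALITY_CENTS = 9

# Approximate local file sizes from the download step, in GB.
LOCAL_FILES_GB = [13.8, 13.8, 8.0, 2.0, 0.335]


def monthly_cost_usd(images_per_month, cents_per_image):
    """API spend in dollars for a given monthly image volume."""
    return images_per_month * cents_per_image / 100


images = 10_000  # hypothetical volume
low = monthly_cost_usd(images, TURBO_CENTS)
high = monthly_cost_usd(images, QUALITY_CENTS)
print(f"{images} images/month: ${low:,.0f} to ${high:,.0f}")  # 10000 images/month: $300 to $900

total_gb = sum(LOCAL_FILES_GB)
print(f"local model files: about {total_gb:.1f} GB")
```

The five files add up to just under 38 GB, which is where the "40 GB" round figure comes from.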
For builders evaluating whether to switch from FLUX or Midjourney: the answer depends entirely on your use case. General illustration with no text? FLUX has a larger fine-tune ecosystem and similar quality. Design work with readable copy? Ideogram 4.0 is the strongest open option available and it's not particularly close.
- 2022: Ideogram founded by former Google Brain researchers focused on text-in-image generation.
- Aug 2023: Ideogram 1.0 launches publicly, the first model to reliably render readable text in images.
- Mar 2025: Ideogram 3.0 released with expanded style controls and improved realism.
- Jun 3, 2026: Ideogram 4.0 drops as the company's first open-weight model. 9.3B params, JSON prompting, day-0 ComfyUI support.
- Jun 3–7, 2026: Community confirms multilingual text improvements, builds LLM-assisted prompt tools, reports successful local runs on 24 GB GPUs.
Points
- Ideogram 4.0 is the best open-weight model for text rendering — 0.97 OCR accuracy while being 3–9× smaller than competitors that score lower.
- The JSON prompting system is the core differentiator: bounding boxes, hex color palettes, and typed text elements give you design control that flat prompts cannot.
- ComfyUI has day-0 native support with three prompting modes, including LLM-assisted conversion from plain language to JSON.
- Commercial use of the weights requires a paid license. API usage at $0.03–$0.09/image is the practical path for most builders.
- The typography moat is real and durable. Text rendering is architecturally hard to improve — Ideogram's lead here won't close in a single model release from competitors.
- Where it doesn't win: photorealistic portraiture (GPT Image 2 still leads), exact pixel-perfect bounding box placement (0.69 mIoU means iteration is needed), and any use case where FLUX's fine-tune ecosystem matters more than text quality.
Further reading
- [1] Ideogram AI. Ideogram 4.0 Technical Details. June 2026.
- [2] Gutflaish et al. Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions. 2025.
- [3] Hugging Face. ideogram-ai/ideogram-4-nf4 model card (ELO and designer preference data).
- [4] ComfyUI blog. Ideogram 4 Day-0 Support in ComfyUI. June 2026.
- [5] PromptSlove. Ideogram 4.0 Complete Guide (22 use cases, community prompting results). June 2026.
- [6] BuildFastWithAI. Ideogram 4.0: Best Open-Weight Image Model of 2026? June 2026.
- [7] GoEnhance AI. I Tested Ideogram 4.0: A Strong Design Model with a Messy Open-Weight Story. June 2026.
- [8] The Decoder. Ideogram 4.0 drops as an open-weight model. June 2026.