Ideogram 4.0: The Open-Weight Model That Actually Renders Text

A 9.3B model that beats 80B competitors on typography — and what that means for every builder working with design.

AI·June 7, 2026·10 min read

TL;DR

  • Ideogram 4.0 is a 9.3B open-weight text-to-image model released June 3, 2026, and the first open model from Ideogram.
  • It beats every other open-weight model on text rendering, scoring 0.97 OCR accuracy while being 3–9× smaller than competitors.
  • The secret weapon is JSON prompting: structured captions with bounding boxes, hex color palettes, and per-element text control.
  • The community is already running it in ComfyUI with day-0 native support and three prompting methods.
  • The catch: commercial use of the weights requires a paid license. The API and web app are covered under standard terms.
  • Where it genuinely falls short: photorealistic human portraiture still belongs to GPT Image 2 and closed models.

On June 3, 2026, Ideogram did something the open-source image generation community has been waiting for: they dropped their weights. Ideogram 4.0 is a 9.3-billion-parameter text-to-image model you can download, run locally, fine-tune, and build on. More importantly, it's the first model in its parameter class that can render readable, correctly spelled, properly styled text inside images — reliably. That one capability changes more workflows than the benchmark numbers suggest.

  • Parameters: 9.3B (3–9× smaller than rivals)
  • Designer preference ELO: 1,062 (#1 open, #2 overall)
  • Client-work usability score: 3.55/5 (vs 2.84 for Nano Banana 2)

Why text rendering in images is harder than it looks

Every image generation model struggles with text. Ask Midjourney, FLUX, or Stable Diffusion to render the word "Breakfast" on a café sign and you will get something that looks like text from three feet away and reads like gibberish up close. Letters get transposed. Fonts blend into backgrounds. Apostrophes become lowercase Ls. This isn't a bug — it's a fundamental limitation of how diffusion models learn.

Standard text-to-image models learn from image–caption pairs. The caption describes the image, but the model isn't specifically supervised to produce readable glyphs. It learns that "text-looking pixels" correlate with certain prompts, but it doesn't learn the alphabet as a structured system. The result is plausible-looking text that doesn't actually say the right thing.

Ideogram's core research bet — one they've been building toward since the original Ideogram 1.0 — is that you can fix this by changing what you train on. Instead of natural-language captions, you train on structured JSON descriptions that exhaustively describe every element in the image, including text elements with their exact strings, styling, and position. The model learns text as a first-class output, not as a side effect of image style.

Good to know: the paper behind the approach

Ideogram's structured caption training is based on research published as "Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions" (Gutflaish et al., 2025). The core insight: the more relationships a caption pins down per training pair, the more grounded the model's supervision becomes.

What Ideogram 4.0 actually is under the hood

The architecture is a single-stream Diffusion Transformer (DiT) — 34 transformer layers where text and image tokens share one sequence and one set of projections at every layer. This is the same pattern as HunyuanImage 3.0, Z-Image, and HiDream-O1, so the broad strokes aren't novel. What distinguishes Ideogram 4.0 are two specific design choices.

First: the text encoder. Most single-stream DiT models use either a single hidden state from a text encoder or no external encoder at all. Ideogram 4.0 uses Qwen3-VL-8B-Instruct — a vision-language model — as its text encoder, and the DiT consumes hidden states from 13 of its intermediate layers, concatenated along the feature dimension. That's a much richer text representation than a single final-layer embedding. It's a significant reason the model understands complex, compositional prompts better than its parameter count would suggest.

Second: asymmetric classifier-free guidance. Standard CFG runs a conditional pass (text + image) and an unconditional pass (text replaced with a null embedding). Ideogram 4.0's unconditional pass drops the text tokens entirely — it runs only over image tokens. This means the two passes can be tuned independently, giving separate control over prompt adherence and image quality across the sampling trajectory. In practice, this is why the "quality" presets run 45 steps with a guidance weight of 7 followed by 3 polish steps at weight 3 near the end of sampling — tightening fine detail without over-saturating the global composition.
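The guidance behaviour described above can be sketched in a few lines of Python. This is a minimal sketch, not Ideogram's sampler: `apply_cfg` uses the textbook CFG combination, and `guidance_schedule` assumes the 3 polish steps are appended after the 45 main steps, which the description does not spell out. Both function names are hypothetical. The asymmetric part sits upstream of this code: the `uncond` prediction would come from a pass over image tokens only.

```python
from typing import List


def guidance_schedule(main_steps: int = 45, main_weight: float = 7.0,
                      polish_steps: int = 3, polish_weight: float = 3.0) -> List[float]:
    """Per-step guidance weights: a main phase, then a short polish phase.

    Assumption: the polish steps come after (not inside) the main steps.
    """
    return [main_weight] * main_steps + [polish_weight] * polish_steps


def apply_cfg(cond: List[float], uncond: List[float], weight: float) -> List[float]:
    """Textbook CFG combination: uncond + weight * (cond - uncond)."""
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond predictions must have the same length")
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    weights = guidance_schedule()
    print(len(weights), weights[0], weights[-1])
    print(apply_cfg([1.0, 2.0], [0.0, 1.0], weights[0]))
```

A weight of 1 returns the conditional prediction unchanged; larger weights push the result further away from the unconditional one, which is why a lower weight near the end of sampling is gentler on fine detail.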

Comparison

| Field | Value |
| --- | --- |
| Parameters | 9.3B |
| Transformer layers | 34 |
| Embedding dimension | 4,608 |
| Text encoder | Qwen3-VL-8B-Instruct (frozen) |
| Encoder layers consumed | 13 intermediate layers |
| Sampler | Euler flow-matching + asymmetric CFG |
| VAE | KL autoencoder, 8× spatial compression |
| Resolution range | 256–2048 px per side, any aspect ratio |
| Max text tokens | 2,048 |
| Quantization options | fp8 and nf4 |
| Minimum GPU for nf4 | 24 GB VRAM |

Source: Ideogram 4.0 technical blog, June 2026

The benchmark numbers — and what they actually mean

Ideogram published four benchmark scores at launch. Let's look at each one honestly, because two of them are extraordinary and two require some context.

Figure: Ideogram 4.0 benchmark scores (0–1 scale). Bars: text rendering (X-Omni OCR), spatial reasoning (SpatialGenEval), layout control (7Bench mIoU).

Text rendering at 0.97 is the genuinely remarkable number. X-Omni measures English OCR accuracy on text generated inside images, so a score of 0.97 means OCR reads the rendered text back correctly about 97% of the time. No other open-weight model comes close: FLUX.2 dev (32B parameters, 3.4× larger) scores significantly lower, and so does HunyuanImage 3.0 (80B MoE, roughly 9× the total parameter count). At 9.3B, Ideogram 4.0 is doing more per parameter on this specific capability than any other released open model.
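To make "OCR accuracy" concrete, here is a minimal sketch of a character-level score between the text you asked for and the text an OCR engine reads back. It is an illustration only: how X-Omni actually scores is not described here, and `text_accuracy` is a hypothetical helper based on plain edit distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def text_accuracy(target: str, ocr_output: str) -> float:
    """1.0 for a perfect read-back, falling toward 0.0 as edits accumulate."""
    if not target:
        return 1.0 if not ocr_output else 0.0
    return max(0.0, 1.0 - edit_distance(target, ocr_output) / len(target))


if __name__ == "__main__":
    print(text_accuracy("Breakfast", "Breakfast"))
    print(text_accuracy("Breakfast", "Braekfast"))  # two transposed letters
```

A score like this is handy as a regression check in your own pipeline: render a fixed set of strings, run OCR, and track the average over time.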

Prompt alignment at 0.89 on Prism-bench is strong but not unusual for a well-trained model at this scale. Prism measures how well a model follows long, compositional inputs. The structured JSON training pays dividends here — the model was literally trained on exhaustive scene descriptions, so it's particularly good at following complex multi-element prompts.

Spatial reasoning at 0.76 and layout control at 0.69 are good but more competitive. 7Bench measures how tightly generated objects land inside requested bounding boxes. A 0.69 mIoU means there's meaningful room to improve — bounding box placement is not pixel-perfect, especially for complex scenes. This matters if you're building design workflows where exact element positioning is critical.
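For intuition about what a 0.69 mIoU looks like, the sketch below computes intersection-over-union for boxes in the [y_min, x_min, y_max, x_max] convention and averages it over pairs. It assumes the requested and generated boxes are already matched one-to-one; how 7Bench detects and pairs boxes is not covered here, and the function names are hypothetical.

```python
from typing import List, Sequence

Box = Sequence[float]  # [y_min, x_min, y_max, x_max]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(requested: List[Box], generated: List[Box]) -> float:
    """Average IoU over boxes that are already paired by index."""
    if len(requested) != len(generated) or not requested:
        raise ValueError("need two non-empty lists of equal length")
    return sum(iou(r, g) for r, g in zip(requested, generated)) / len(requested)


if __name__ == "__main__":
    # A generated box shifted right by half its width overlaps only a third.
    print(iou([0, 0, 100, 100], [0, 50, 100, 150]))
```

IoU drops quickly with small offsets, so a mean of 0.69 is consistent with boxes that land in roughly the right place but need a round or two of adjustment.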

One benchmark caveat to know

Ideogram used gemini-2.5-flash instead of Qwen2.5-VL-72B (the leaderboard standard) to judge SpatialGenEval. They applied this change uniformly across all compared models, so cross-model comparisons remain valid — but the absolute scores are not directly comparable to other SpatialGenEval results you'll see cited elsewhere.

Designer preference: what the community actually voted

Beyond the automated benchmarks, Ideogram ran an internal arena where graphic designers picked the better of two generations without knowing which model produced each image. The results are more interesting than the radar chart suggests.

In head-to-head pairwise preference across 4,366 designer votes, Ideogram 4.0 came second overall with an ELO of 1,062 — behind only GPT Image 2 (1,141). Every other model scored lower: Nano Banana 2 (1,004), Grok Imagine 1.0 (990), FLUX.2 Pro (982). Among open-weight models, Ideogram 4.0's lead is substantial — roughly 160 ELO points ahead of the next open competitor (FLUX.2 dev at 900).
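To translate those rating gaps into something more tangible, the sketch below applies the conventional Elo expected-score formula to the ratings quoted above. The arena's exact rating method is not stated, so the 400-point logistic scale is an assumption, and `expected_score` is a hypothetical helper.

```python
RATINGS = {
    "GPT Image 2": 1141,
    "Ideogram 4.0": 1062,
    "Nano Banana 2": 1004,
    "Grok Imagine 1.0": 990,
    "FLUX.2 Pro": 982,
    "FLUX.2 dev": 900,
}


def expected_score(rating_a: float, rating_b: float) -> float:
    """Conventional Elo estimate of how often A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    ideogram = RATINGS["Ideogram 4.0"]
    for rival in ("GPT Image 2", "FLUX.2 dev"):
        share = expected_score(ideogram, RATINGS[rival])
        print(f"Ideogram 4.0 vs {rival}: {share:.3f}")
```

Under this assumption, the 162-point lead over FLUX.2 dev corresponds to being preferred in roughly 72% of pairings, while the 79-point gap to GPT Image 2 corresponds to about 39%.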

The more practically useful number: when asked "Would you use this in real client work?", designers scored Ideogram 4.0 at 3.55 out of 5, compared to 2.84 for Nano Banana 2, 2.61 for Grok Imagine 1.0, and 2.49 for FLUX.2 max. That gap between preference and usability is worth noting — designers found Ideogram 4.0 both prettier and more practically useful, which doesn't always go together.

  • 4,366: designer votes in the ELO arena
  • 3.55/5: client-work usability score
  • #1: ranked open-weight model on LMArena

JSON prompting — the feature that changes everything (and the learning curve that comes with it)

This is the part of Ideogram 4.0 that most coverage treats as a footnote. It shouldn't be. The model was trained exclusively on structured JSON captions, and the official inference pipeline validates every prompt against a JSON schema before generation. This isn't just a prompting technique — it's how the model fundamentally understands input.

A JSON prompt has three parts: a high_level_description (the overall scene), a style_description block (aesthetics, lighting, medium, color palette), and a compositional_deconstruction block (background + an array of typed elements). Each element can be an object ("type": "obj") or a text element ("type": "text"). Text elements carry the exact string to render and a separate visual description for styling.

Three things become possible with JSON that aren't possible with flat prompts:

  • Color palette conditioning. You can specify up to 16 hex colors per image, and up to 5 per element. The model steers the dominant colors directly from the hex values rather than from descriptive language ("a warm amber sunset" vs #E6B422). The results are meaningfully more controllable.
  • Bounding-box layout. Any element can be placed using [y_min, x_min, y_max, x_max] in 0–1000 normalized coordinates. The model respects these through its shared 3D Multimodal RoPE positional space — which is the same positional encoding used for both text and image tokens. Not pixel-perfect, but substantially better than any natural-language spatial instruction.
  • Typed text elements. This is the core of the text rendering capability. A text element carries the literal string to render as one field and a separate visual description for how it should look. The separation means the model knows what to write and what to draw independently.
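The limits in the list above are easy to check before you send a prompt. The following is a minimal sketch, not the official schema validator: the key names follow the example prompt in this article, the per-element palette limit is left out because its field name is not shown, and `validate_prompt` is a hypothetical helper.

```python
import re
from typing import Any, Dict, List

HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")
MAX_IMAGE_COLORS = 16  # per-image palette limit quoted in the article


def validate_prompt(prompt: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    errors: List[str] = []
    for key in ("high_level_description", "style_description",
                "compositional_deconstruction"):
        if key not in prompt:
            errors.append(f"missing top-level key: {key}")

    palette = prompt.get("style_description", {}).get("color_palette", [])
    if len(palette) > MAX_IMAGE_COLORS:
        errors.append(f"palette has {len(palette)} colors, limit is {MAX_IMAGE_COLORS}")
    for color in palette:
        if not isinstance(color, str) or not HEX_COLOR.match(color):
            errors.append(f"not a 6-digit hex color: {color!r}")

    elements = prompt.get("compositional_deconstruction", {}).get("elements", [])
    for i, element in enumerate(elements):
        kind = element.get("type")
        if kind not in ("obj", "text"):
            errors.append(f"element {i}: unknown type {kind!r}")
        if kind == "text" and not element.get("text"):
            errors.append(f"element {i}: text element has no string to render")
        bbox = element.get("bbox")
        if bbox is not None:
            if len(bbox) != 4 or not all(0 <= v <= 1000 for v in bbox):
                errors.append(f"element {i}: bbox must be 4 values in 0-1000")
            elif bbox[0] >= bbox[2] or bbox[1] >= bbox[3]:
                errors.append(f"element {i}: bbox min must be below max")
    return errors
```

Running a check like this first catches swapped coordinates, the most common mistake with a y-first box convention, before you spend a generation on it.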

Example JSON prompt

    {
      "high_level_description": "A square event poster for a product launch, clean and modern.",
      "style_description": {
        "aesthetics": "Minimal, editorial, high contrast.",
        "lighting": "Flat. No shadows.",
        "medium": "Digital graphic design.",
        "color_palette": ["#0A0A0A", "#FFFFFF", "#3B82F6"]
      },
      "compositional_deconstruction": {
        "background": "Clean white background filling the full square frame.",
        "elements": [
          {
            "type": "text",
            "bbox": [80, 80, 300, 920],
            "text": "LAUNCH",
            "desc": "Large bold condensed sans-serif headline in deep black, filling the upper third."
          },
          {
            "type": "text",
            "bbox": [310, 80, 420, 920],
            "text": "June 12, 2026 · San Francisco",
            "desc": "Small caps in blue (#3B82F6), centered, tracking wide."
          }
        ]
      }
    }
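Because the boxes are normalized to 0–1000 and listed y-first, it helps to convert them to pixels when checking a result against the layout you asked for. The sketch below does that conversion; the 1024 × 1024 canvas is an assumption chosen for illustration, and `bbox_to_pixels` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(bbox: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert [y_min, x_min, y_max, x_max] in 0-1000 units to (left, top, right, bottom) pixels."""
    y_min, x_min, y_max, x_max = bbox
    left = round(x_min * width / 1000)
    top = round(y_min * height / 1000)
    right = round(x_max * width / 1000)
    bottom = round(y_max * height / 1000)
    return left, top, right, bottom


if __name__ == "__main__":
    # The two text elements from the example prompt on an assumed 1024 x 1024 canvas.
    print(bbox_to_pixels([80, 80, 300, 920], 1024, 1024))
    print(bbox_to_pixels([310, 80, 420, 920], 1024, 1024))
```

The returned order (left, top, right, bottom) matches what most image libraries expect for crop and draw-rectangle calls, so the result can be used to overlay the requested boxes on a generated image.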

Community shortcut: use an LLM to write the JSON

Writing JSON prompts by hand is verbose. The community has already solved this. In ComfyUI, the KJ nodes include a JSON Prompt Builder that lets you describe your scene in plain language and converts it to a schema-valid JSON prompt automatically. You can also pipe plain text into any LLM with the Ideogram4 Caption Prompt Template as a system prompt.

Running it locally: what you actually need

Ideogram 4.0 ships in two quantized variants — fp8 and nf4 — both available on Hugging Face through the Comfy-Org repackaged collection. The nf4 checkpoint is the practical one for local use: it fits on a single 24 GB GPU. If you're on a 16 GB card, community reports suggest it's tight but possible with careful VRAM management.

The full setup needs five files across three directories (one of the five, a Gemma text encoder, is optional). ComfyUI is the most accessible entry point: it added native day-0 support, and the workflow is in the template library. The ComfyUI tutorial includes three prompting methods: plain natural language (fast, less precise), structured JSON (slower to write, much more controlled), and LLM-assisted (the best of both: you write natural language and an LLM converts it to JSON automatically).

How it works

  1. Update ComfyUI to v0.24.0 or later

    The Ideogram 4.0 workflow requires the latest ComfyUI nightly build. Update via the Manager or re-download if you're on the Desktop stable release (Desktop auto-updates will follow).

  2. Download the five model files from Hugging Face

    You need ideogram4_fp8_scaled.safetensors and ideogram4_unconditional_fp8_scaled.safetensors (both ~13.8 GB each) in models/diffusion_models/, qwen3vl_8b_fp8_scaled.safetensors (~8 GB) and optionally gemma4_e4b_it_fp8_scaled.safetensors (~2 GB) in models/text_encoders/, and flux2-vae.safetensors (~335 MB) in models/vae/.

  3. Load the Ideogram 4 workflow from the Template Library

    Open ComfyUI → Workflow Templates → find Ideogram 4 Text-to-Image. This gives you the pre-wired node graph with the JSON prompt input already connected.

  4. Write your prompt: plain text or JSON

    For plain text, just type normally. For JSON, paste your structured prompt directly. For LLM-assisted prompting, connect the Ideogram4 Caption Prompt Template node to any LLM tool node and describe your scene in plain language.

  5. Run and iterate on bounding boxes

    First generations rarely nail bounding-box placement exactly. Adjust the y_min/x_min/y_max/x_max values in 0–1000 normalized coordinates. The model is not pixel-perfect, but each iteration should get visibly closer. If you see "Image blocked by safety filter", that message comes from the model's own built-in filter, not ComfyUI. Rephrase the offending element.
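Before loading the workflow, it can save time to confirm the files from the download step are where ComfyUI expects them. This is a minimal sketch using only the file names and folders listed above; the ComfyUI root path is whatever your install uses, and `missing_model_files` is a hypothetical helper, not part of ComfyUI.

```python
from pathlib import Path
from typing import Dict, List

# Required files, keyed by folder relative to the ComfyUI root.
EXPECTED_FILES: Dict[str, List[str]] = {
    "models/diffusion_models": [
        "ideogram4_fp8_scaled.safetensors",
        "ideogram4_unconditional_fp8_scaled.safetensors",
    ],
    "models/text_encoders": ["qwen3vl_8b_fp8_scaled.safetensors"],
    "models/vae": ["flux2-vae.safetensors"],
}

# Listed as optional in the download step, so it is not reported as missing.
OPTIONAL_FILES: Dict[str, List[str]] = {
    "models/text_encoders": ["gemma4_e4b_it_fp8_scaled.safetensors"],
}


def missing_model_files(comfy_root: str) -> List[str]:
    """Return the relative paths of required files that are not present."""
    root = Path(comfy_root)
    return [
        f"{folder}/{name}"
        for folder, names in EXPECTED_FILES.items()
        for name in names
        if not (root / folder / name).is_file()
    ]


if __name__ == "__main__":
    for path in missing_model_files("."):
        print("missing:", path)
```

Run it from the ComfyUI root (or pass the root path) and it prints only what still needs downloading.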

Where it wins, where it loses — an honest breakdown

The launch coverage was largely positive, but most of it came from people who tested the model for a few hours. Community testing over the following days has been more nuanced. Here's the honest picture from both the official benchmarks and what builders have actually reported.

Ideogram 4.0: win/lose by use case

| Use case | Verdict | Notes |
| --- | --- | --- |
| Posters with readable text | Best open-weight option | 0.97 OCR accuracy. Typography moat is real. |
| Logo design with company name | Wins decisively | Nails text on second attempt vs 15+ for Midjourney. |
| Social media graphics with copy | Strong | JSON color palette control + text elements = precise brand match. |
| Product packaging mockups | Strong | Bounding-box layout helps with label placement. |
| Multilingual text rendering | Promising (community-reported) | Spanish reported as better than other open models. Not formally benchmarked. |
| Photorealistic human portraiture | GPT Image 2 wins | Ideogram has a "slightly designed" quality. Portraits feel composed, not candid. |
| General illustration / no text | Roughly on par with FLUX.2 | Strong but not the differentiator. FLUX still has a larger fine-tune ecosystem. |
| Complex multi-element layout precision | Good, not perfect | 7Bench mIoU of 0.69 means bounding boxes need iteration. Not pixel-perfect. |

Based on Ideogram benchmarks + community testing across Reddit, LocalLLaMA, and the ComfyUI forums (June 2026)

"Ideogram 4.0 wins decisively on typography and design intent. It is the parameter-efficient specialist, not the everything-model." (BuildFastWithAI analysis, June 2026)

The licensing reality — read this before you build on it

This is the part most enthusiastic coverage glossed over, and it's important if you're a builder planning to use this model in a product.

The inference code is Apache 2.0. The model weights themselves are governed by Ideogram's Non-Commercial Model Agreement. The distinction matters enormously:

  • Non-commercial use (research, personal projects, hobbyist testing): Full access. Download, inspect, fine-tune, run locally. No restrictions.
  • Commercial deployment (any product or service where the model outputs have commercial value): You need a separate paid commercial license from Ideogram. The weights themselves cannot be used in commercial products without this.
  • Using the API or web platform: Standard platform terms apply. Images generated through ideogram.ai are covered. API pricing runs from $0.03/image (Turbo) to $0.09/image (Quality tier).

This is not unusual — it's the same structure as many "open-weight" models including some FLUX variants. But it is a meaningful constraint if you were planning to build a commercial image generation feature using the local weights without a commercial license.

Before you ship anything commercial

If you're building a product on Ideogram 4.0's weights — not the API, the weights — contact Ideogram about commercial licensing before you launch. The non-commercial license is clear that production deployments require a separate agreement.

The strategic move Ideogram just made

Open-sourcing your best model is a counterintuitive decision for a company that built its business on a closed API. The ELO leaderboard advantage that justified Ideogram's pricing model just became a commodity anyone can download. Why?

The most convincing explanation: Ideogram believes the defensible value has already moved. The model weights aren't the moat anymore. The moat is the platform — the editable text layers, the print-on-demand integration, the style controls, the fine-tuning infrastructure, and the workflow ecosystem built around the model. Giving away the weights accelerates community adoption, brings Ideogram into the ComfyUI ecosystem, generates research contributions, and positions the company as the canonical provider of the best open design model — which drives API usage and platform subscriptions more effectively than keeping the weights closed.

There's a second read: "best open-weight model" is a fragile title. It changes hands every few weeks. Z-Image, Qwen Image, the next FLUX point release — the leaderboard shuffles constantly. What won't change as fast is Ideogram's specific training advantage in text and layout, which comes from proprietary data and a research culture focused on design. The company is betting that the typography moat outlasts the ELO ranking. Based on how hard text rendering is to improve through standard training approaches, that bet looks reasonable.

The weights are public, the moat isn't

Ideogram can open-source 4.0 because the real advantage is in training data, fine-tuning infrastructure, and the platform ecosystem — not the frozen weights.

Typography is the durable edge

Text rendering is hard to replicate quickly. A 0.97 OCR score from a 9.3B model that outperforms 80B competitors suggests a data and training advantage that won't close overnight.

The community multiplier

Day-0 ComfyUI support and a JSON prompt LLM builder mean the community is already extending the model faster than Ideogram's own team could.

The commercial license is the business

Non-commercial weights get developers hooked. Commercial licenses and API usage convert them into revenue. It's the open-core playbook.

What this means for you as a builder

If you build anything that involves images with text in them — and more products do than you'd think — Ideogram 4.0 is worth testing this week. Not next month. This week.

The JSON prompting system has a learning curve, but the community has already built tools that smooth it out. The LLM-assisted prompting workflow in ComfyUI means you don't need to write JSON manually for most use cases. And the payoff for design-heavy workflows — posters, ads, UI mockups, social graphics, packaging — is measurable. The "text that actually says the right thing" problem has been a silent tax on every image generation pipeline. Ideogram 4.0 is the first open-weight model that mostly fixes it.

For builders not working with commercial weights: the API is the practical path. At $0.03–$0.09 per image, the economics work for most product use cases. You get the same model, the same JSON prompting, and you don't need to manage 40 GB of local model files.
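To sanity-check those economics for your own volume, a quick estimate is enough. The sketch below uses only the two per-image prices quoted above and ignores anything else a real bill might include (volume discounts, retries, other tiers), which are not covered here; `monthly_cost_range` is a hypothetical helper.

```python
from decimal import Decimal
from typing import Tuple

TURBO_PRICE = Decimal("0.03")    # USD per image, Turbo tier
QUALITY_PRICE = Decimal("0.09")  # USD per image, Quality tier


def monthly_cost_range(images_per_month: int) -> Tuple[Decimal, Decimal]:
    """Return (all-Turbo cost, all-Quality cost) in USD for a monthly image count."""
    if images_per_month < 0:
        raise ValueError("images_per_month must be non-negative")
    return TURBO_PRICE * images_per_month, QUALITY_PRICE * images_per_month


if __name__ == "__main__":
    low, high = monthly_cost_range(10_000)
    print(f"10,000 images/month: ${low} to ${high}")
```

Remember that bounding-box iteration multiplies the image count: if a layout typically takes three attempts, budget for three images per finished asset.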

For builders evaluating whether to switch from FLUX or Midjourney: the answer depends entirely on your use case. General illustration with no text? FLUX has a larger fine-tune ecosystem and similar quality. Design work with readable copy? Ideogram 4.0 is the strongest open option available and it's not particularly close.

  • 2022

    Ideogram founded by former Google Brain researchers focused on text-in-image generation.

  • Aug 2023

    Ideogram 1.0 launches publicly — first model to reliably render readable text in images.

  • Mar 2025

    Ideogram 3.0 released with expanded style controls and improved realism.

  • Jun 3, 2026

    Ideogram 4.0 drops as the company's first open-weight model. 9.3B params, JSON prompting, day-0 ComfyUI support.

  • Jun 3–7, 2026

    Community confirms multilingual text improvements, builds LLM-assisted prompt tools, reports successful local runs on 24 GB GPUs.

Key points

  • Ideogram 4.0 is the best open-weight model for text rendering — 0.97 OCR accuracy while being 3–9× smaller than competitors that score lower.
  • The JSON prompting system is the core differentiator: bounding boxes, hex color palettes, and typed text elements give you design control that flat prompts cannot.
  • ComfyUI has day-0 native support with three prompting modes, including LLM-assisted conversion from plain language to JSON.
  • Commercial use of the weights requires a paid license. API usage at $0.03–$0.09/image is the practical path for most builders.
  • The typography moat is real and durable. Text rendering is architecturally hard to improve — Ideogram's lead here won't close in a single model release from competitors.
  • Where it doesn't win: photorealistic portraiture (GPT Image 2 still leads), exact pixel-perfect bounding box placement (0.69 mIoU means iteration is needed), and any use case where FLUX's fine-tune ecosystem matters more than text quality.

Further reading

  • Ideogram 4.0 Technical Details (official)
  • Ideogram 4.0 model weights on Hugging Face (nf4)
  • Ideogram GitHub: inference code and prompting guide
  • BuildFastWithAI: independent analysis of the open-weight strategy

  1. Ideogram AI. Ideogram 4.0 Technical Details. June 2026.
  2. Gutflaish et al. Generating an Image From 1,000 Words. 2025.
  3. Hugging Face. ideogram-ai/ideogram-4-nf4 model card (ELO and designer preference data).
  4. ComfyUI blog. Ideogram 4 Day-0 Support in ComfyUI. June 2026.
  5. PromptSlove. Ideogram 4.0 Complete Guide (22 use cases, community prompting results). June 2026.
  6. BuildFastWithAI. Ideogram 4.0: Best Open-Weight Image Model of 2026? June 2026.
  7. GoEnhance AI. I Tested Ideogram 4.0: A Strong Design Model with a Messy Open-Weight Story. June 2026.
  8. The Decoder. Ideogram 4.0 drops as an open-weight model. June 2026.
