
3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

¹Google DeepMind · ²Google Research · ³University of Southern California

Overview of 3DCodeBench: 212 procedural object classes, 12 frontier VLMs evaluated under multi-turn agentic refinement, plus a public 3DCodeArena for human-preference Elo rankings.

Agentic data curation pipeline

A VLM-driven agent transforms deeply nested procedural factories from Infinigen into standalone Blender Python scripts via API migration, geometric validation, and iterative visual refinement. Every generated (prompt, code, mesh) triplet passes a human-in-the-loop verification gate to guarantee benchmark quality.

Agentic data curation pipeline: skills library, experience library, visual refinement loop, human-in-the-loop verification.
Skills Library

Execution feedback — every successful API migration, fix, or refactor becomes a reusable skill the agent retrieves on later factories.

Experience Library

Established solutions to recurring patterns (custom nodegroups, displacement, particle systems) are indexed and surfaced when the agent encounters a similar factory shape.

Human-in-the-loop

Each (prompt, code, mesh) triplet is reviewed by a human annotator. Failed or visually drifted outputs are rejected and re-queued for the agent loop.

At a Glance

12 models evaluated
212 object categories
13K 3D objects with code
52K multi-view renders
Elo human-preference rankings

Physical plausibility, not executability, is the harder problem. Models frequently produce disconnected parts and misaligned structures, revealing a critical lack of physical-world understanding.

Test-Time Scaling helps. Multi-turn refinement with deterministic Blender feedback improves performance — the agentic harness matters as much as the model.

Open data for training and evaluation. 3DCodeData ships 12,720 factory instances (212 categories × 60 seeds) with per-instance Blender code, multi-view renders, baked GLBs, and LLM-generated captions — ready for SFT, instruction tuning, and shape scoring.

Examples

3DCodeData reference and model outputs side by side, shown with original materials. Example 1 of 23: Chameleon.

Results

Quantitative results from the 3DCodeBench paper.

Cost vs. Human-Preference Elo

Live BT-Elo from 3DCodeArena against per-query list price across paid frontier VLMs. Dashed line traces the Pareto frontier.
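
The page does not spell out how 3DCodeArena converts pairwise human votes into BT-Elo. One common recipe, sketched here purely as an assumption, fits Bradley-Terry strengths with the standard minorization-maximization update and then maps them onto an Elo-like scale. The function names, the base of 1000, the 400-point scale, and the toy vote counts are all illustrative and not taken from the paper.

```python
import math

def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths; wins[i][j] counts how often i beat j.

    Assumes every model has at least one win and that the comparison
    graph is connected (otherwise the estimate is not well defined).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        # Strengths are only defined up to a common factor, so pin the
        # geometric mean to 1 after every sweep.
        log_mean = sum(math.log(x) for x in new_p) / n
        p = [x / math.exp(log_mean) for x in new_p]
    return p

def to_elo(strengths, base=1000.0, scale=400.0):
    """Map strengths to an Elo-like scale (base and scale are assumptions)."""
    return [base + scale * math.log10(s) for s in strengths]

# Toy votes (made up): model 0 wins 3 of its 4 comparisons against model 1.
votes = [[0, 3],
         [1, 0]]
elo = to_elo(bradley_terry(votes))
```

With this parameterisation, a model that wins three times as often as its opponent sits 400 · log10(3) points above it, roughly 191 Elo.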

Figure: per-query list-price cost (USD, log scale) vs. 3DCodeArena BT-Elo (higher is better); the dashed line marks the Pareto frontier.

Model                  Cost (USD)  BT-Elo
Gemini 3.1 Flash Lite  $0.01       850
Gemini 3 Flash         $0.02       1012
Claude Haiku 4.5       $0.03       772
Gemini 3.5 Flash       $0.04       1114
GPT-5.4 Mini           $0.11       920
Gemini 3.1 Pro         $0.12       1116
Claude Opus 4.7        $0.14       977
GPT-5.4                $0.20       1042
Claude Sonnet 4.6      $0.29       984
GPT-5.5                $0.32       1135
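
The Pareto frontier in this figure can be recomputed from the ten (cost, Elo) pairs it plots. The sketch below assumes the usual definition: a model is on the frontier if no model that costs the same or less has a higher Elo. The helper name `pareto_frontier` is illustrative.

```python
# (model, per-query cost in USD, BT-Elo) as plotted in the cost-vs-Elo figure.
points = [
    ("Gemini 3.1 Flash Lite", 0.01, 850),
    ("Gemini 3 Flash", 0.02, 1012),
    ("Claude Haiku 4.5", 0.03, 772),
    ("Gemini 3.5 Flash", 0.04, 1114),
    ("GPT-5.4 Mini", 0.11, 920),
    ("Gemini 3.1 Pro", 0.12, 1116),
    ("Claude Opus 4.7", 0.14, 977),
    ("GPT-5.4", 0.20, 1042),
    ("Claude Sonnet 4.6", 0.29, 984),
    ("GPT-5.5", 0.32, 1135),
]

def pareto_frontier(points):
    """Return the names of models that no cheaper-or-equal model beats on Elo."""
    frontier, best_elo = [], float("-inf")
    # Sweep from cheapest to most expensive; on cost ties, visit higher Elo first.
    for name, cost, elo in sorted(points, key=lambda p: (p[1], -p[2])):
        if elo > best_elo:
            frontier.append(name)
            best_elo = elo
    return frontier

frontier = pareto_frontier(points)
```

Under this definition the sweep keeps Gemini 3.1 Flash Lite, Gemini 3 Flash, Gemini 3.5 Flash, Gemini 3.1 Pro, and GPT-5.5.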

Human-preference Elo vs. automated metrics

Each panel plots 3DCodeArena BT-Elo against one automated metric, with text-to-3D, image-to-3D, and combined tracks available. Chamfer uses a reversed x-axis so every panel reads left-to-right as “better metric → higher Elo”.

Correlation (r) between each automated metric and BT-Elo. Each panel plots the same 12 models, grouped into four families (Gemini, Gemma, Claude, GPT): Gemini 3 Flash, Gemini 3.1 Flash Lite, Gemini 3.1 Pro, Gemini 3.5 Flash, Gemma 4 26B, Gemma 4 31B, Claude Haiku 4.5, Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.4 Mini, GPT-5.4, GPT-5.5.

Metric                             r
SigLIP-2 view-paired ↑             +0.956
Uni3D 3D↔3D ↑                      +0.939
DINOv3 view-paired ↑               +0.928
Uni3D Text↔3D ↑                    +0.875
Chamfer Distance ↓ (axis reversed) +0.770
Executability ↑                    +0.481
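
The page reports r without naming the estimator. The sketch below assumes a plain Pearson correlation. It also assumes that lower-is-better metrics such as Chamfer are negated first, which would explain why Chamfer's reported r is positive. The function names and toy values are illustrative, not benchmark data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def oriented_r(metric, elo, lower_is_better=False):
    """Negate lower-is-better metrics so a positive r always means the
    metric agrees with human preference."""
    xs = [-m for m in metric] if lower_is_better else list(metric)
    return pearson_r(xs, elo)

# Toy values (made up): Chamfer falls linearly as Elo rises.
toy_elo = [800, 900, 1000, 1100]
toy_chamfer = [0.09, 0.08, 0.07, 0.06]
r = oriented_r(toy_chamfer, toy_elo, lower_is_better=True)
```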

Main results — single-shot

212 categories, one model call per instance (no agent, retry, or tool use). The table has Text → 3D and Image → 3D variants. Exec. is the Blender 5.0 pass rate. Image-grounded metrics (SigLIP-2, DINOv3) compare rendered vs. reference views. 3D-shape metrics (Chamfer, Uni3D) compare the exported GLBs. Cost is mean per-query list price.

Model                  Exec.  SigLIP-2  DINOv3  Chamfer  Uni3D  Uni3D t/i–3D  Elo   Tokens  Time (s)  Tok/s  Cost $
GPT-5.5                0.873  0.827     0.551   0.066    0.527  0.263         1167  3,281   161.1     20     $0.25
Gemini 3.1 Pro         0.693  0.823     0.540   0.079    0.518  0.251         1149  2,245   123.0     18     $0.16
Gemini 3.5 Flash       0.410  0.828     0.548   0.071    0.571  0.276         1112  3,811   80.6      47     $0.05
GPT-5.4                0.863  0.815     0.541   0.071    0.524  0.264         1074  2,652   167.6     16     $0.17
Gemini 3 Flash         0.608  0.818     0.511   0.071    0.497  0.252         1034  2,065   30.7      67     $0.02
Claude Sonnet 4.6      0.792  0.821     0.545   0.071    0.505  0.256         1022  19,494  239.4     81     $0.30
Claude Opus 4.7        0.881  0.817     0.555   0.067    0.514  0.273         1008  4,778   57.1      84     $0.14
GPT-5.4 mini           0.590  0.798     0.511   0.121    0.418  0.228         950   24,530  151.4     162    $0.11
Gemma 4 31B            0.582  0.798     0.518   0.076    0.494  0.262         946   4,288   121.6     35     -
Gemini 3.1 Flash Lite  0.599  0.781     0.481   0.075    0.445  0.248         880   8,927   40.9      218    $0.01
Gemma 4 26B            0.517  0.780     0.483   0.078    0.436  0.249         860   6,126   132.7     46     -
Claude Haiku 4.5       0.511  0.763     0.415   0.083    0.365  0.223         798   5,248   30.7      171    $0.03
Claude Opus 4.8        0.868  0.776     0.516   0.065    0.534  0.267         -     6,124   75.9      81     $0.20
Fable 5                0.920  0.771     0.538   0.061    0.558  0.279         -     8,035   100.4     80     $0.56

A dash (-) marks a value that is not reported.
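
For readers unfamiliar with the Chamfer column, here is a minimal sketch of a symmetric Chamfer distance between two point sets. The page does not state which variant the paper uses. Squared distances, per-set averaging, and the absence of any normalisation or surface sampling step are assumptions here, so absolute values will not match the table.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two lists of (x, y, z) points.

    Uses squared Euclidean distance and averages over each set
    (one common convention; the benchmark's exact variant may differ).
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest target.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)

# Toy point sets (made up): a unit square and a copy shifted 0.1 along x.
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
shifted = [(x + 0.1, y, z) for x, y, z in square]
d_same = chamfer_distance(square, square)
d_shift = chamfer_distance(square, shifted)
```

Identical sets score 0. The shifted copy scores about 0.02, because each direction contributes a mean squared offset of 0.1² = 0.01.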

Main results — coding agent

The same 212 categories, but each model runs inside an autonomous coding-agent harness (Claude Code / Codex / Gemini CLI / Antigravity) that executes Blender, reads the errors, and retries. Metrics and conventions match the single-shot table, minus the Elo column. Agents reach near-perfect executability (0.986 or higher), so the contest shifts to image- and 3D-shape fidelity. Fable 5 · Claude Code leads on shape (Chamfer, Uni3D) and on DINOv3; GPT-5.5 · Codex leads on SigLIP-2 image similarity.

Model                               Exec.  SigLIP-2  DINOv3  Chamfer  Uni3D  Uni3D t/i–3D  Tokens  Time (s)  Tok/s  Cost $
GPT-5.5 · Codex                     0.995  0.816     0.544   0.062    0.519  0.267         6,200   97.0      64     $0.30
Fable 5 · Claude Code               1.000  0.799     0.557   0.061    0.564  0.278         5,241   75.1      70     $0.68
Claude Opus 4.8 · Claude Code       1.000  0.799     0.538   0.062    0.529  0.270         5,643   71.2      79     $0.33
Claude Opus 4.7 · Claude Code       1.000  0.798     0.530   0.068    0.524  0.264         4,051   55.4      73     $0.17
Gemini 3.1 Pro · Gemini CLI         0.991  0.795     0.515   0.073    0.507  0.260         3,392   227.8     15     $0.37
Gemini 3 Flash · Gemini CLI         0.995  0.788     0.505   0.073    0.501  0.249         5,944   99.3      60     $0.12
Gemini 3.5 Flash · Antigravity      0.986  0.787     0.518   0.071    0.519  0.261         -       79.1      -      -
GPT-5.4 · Codex                     1.000  0.786     0.506   0.067    0.496  0.264         13,668  230.1     59     $0.31
Claude Sonnet 4.6 · Claude Code     0.986  0.783     0.506   0.062    0.514  0.262         5,279   75.7      70     $0.12
GPT-5.4 mini · Codex                1.000  0.768     0.463   0.072    0.444  0.244         9,401   75.7      124    $0.06
Claude Haiku 4.5 · Claude Code      0.986  0.738     0.401   0.079    0.385  0.213         13,135  84.9      155    $0.12
Gemini 3.1 Flash Lite · Gemini CLI  1.000  0.725     0.386   0.089    0.371  0.205         2,096   37.1      57     $0.04

A dash (-) marks a value that is not reported.

Ablations

Test-time scaling lifts executability sharply (Gemini 3.5 Flash goes from 0.479 to 0.946 with multi-turn feedback); coding-agent harnesses push every backbone to 0.986 or higher; multi-view input budgets show diminishing returns past N=2 on cheaper models.

Multi-turn error feedback

T=3 retry loop on the failed instances.

Pass rate before and after the multi-turn loop. Each failed instance gets up to two stateless retries, each of which consumes the previous code and the Blender traceback.

Model                  Single-turn  Multi-turn
Gemini 3 Flash         0.547        0.937
Gemini 3.1 Flash Lite  0.580        0.930
Gemini 3.5 Flash       0.479        0.946
Gemma 4 26B            0.542        0.938
Gemma 4 31B            0.557        0.981
Claude Sonnet 4.6      0.842        0.993
Claude Opus 4.7        0.937        1.000
GPT-5.4 mini           0.731        0.996
GPT-5.4                0.866        1.000
GPT-5.5                0.944        1.000
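
The T=3 loop described above (one initial attempt plus up to two stateless retries) can be sketched as follows. The `generate` and `execute` callables are hypothetical stand-ins for the model call and the Blender run. Their signatures are assumptions made for illustration, not the benchmark's actual harness.

```python
from typing import Callable, Optional, Tuple

def refine(
    prompt: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    execute: Callable[[str], Optional[str]],
    max_turns: int = 3,
) -> Tuple[str, bool, int]:
    """Return (final_code, passed, turns_used).

    generate(prompt, previous_code, traceback) returns new code.
    execute(code) returns None on success, or an error traceback string.
    """
    code = generate(prompt, None, None)  # turn 1 sees only the prompt
    for turn in range(1, max_turns + 1):
        error = execute(code)
        if error is None:
            return code, True, turn
        if turn < max_turns:
            # Stateless retry: only the last code and its traceback are fed
            # back, with no accumulated conversation history.
            code = generate(prompt, code, error)
    return code, False, max_turns

# Toy stand-ins (made up): the first attempt fails, the first retry succeeds.
def fake_generate(prompt, previous_code, traceback):
    return "fixed script" if traceback else "broken script"

def fake_execute(code):
    return None if code == "fixed script" else "NameError: name 'bpy' is not defined"

result = refine("a wooden stool", fake_generate, fake_execute)
```

Here `result` is `("fixed script", True, 2)`: the instance fails on turn 1 and passes on turn 2, so it counts as a multi-turn success but not a single-turn one.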

Coding-agent harness

Each backbone is wrapped in its native coding-agent harness (Gemini CLI / Claude Code / Codex CLI / Antigravity CLI) on the text-to-3D track.

Executability: fraction of prompts that produce a non-empty mesh in Blender 5.0. Higher is better.

Model                  Single-turn  With agent harness
Gemini 3 Flash         0.608        0.995
Gemini 3.1 Flash Lite  0.608        1.000
Gemini 3.1 Pro         0.693        0.991
Gemini 3.5 Flash       0.410        0.986
Claude Sonnet 4.6      0.769        0.986
Claude Opus 4.7        0.887        1.000
GPT-5.4 mini           0.670        1.000
GPT-5.4                0.863        1.000
GPT-5.5                0.873        0.995

Multi-view input budget

How does giving the model more reference views (N) affect output quality? Each line is one backbone, mean across 3 seeds at thinking=high.

Figure: executability (y-axis, 0.45 to 0.80) against number of reference views (x-axis, N = 1 to 4), one line each for Gemini 3 Flash, Gemini 3.5 Flash, Gemini 3.1 Flash Lite, Gemini 3.1 Pro, Gemma 4 26B, and Gemma 4 31B.

Appendix

All numbers come from the appendix tables in the paper.

Contribute

3DCodeBench is open. The benchmark grows by adding new categories — chairs, plants, hardware, vehicles, anything we don’t cover yet. Each category is just three text files; the workflow is friendly to a single PR.

1. Write a factory

A self-contained Blender 5.0 Python script that builds your object. No external imports, deterministic at seed=0, under 5 minutes on one CPU core.
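
As a rough illustration of what such a factory might look like, the sketch below builds a four-legged stool from boxes. The object, its dimensions, and the function names are invented for this example and are not part of the benchmark. Geometry is generated in pure Python from a seeded local RNG so the result depends only on the seed, and `bpy` is touched only in `build`. The `bpy` calls used (`bpy.data.meshes.new`, `Mesh.from_pydata`, `bpy.data.objects.new`, `collection.objects.link`) are long-standing Blender Python API calls.

```python
import random

try:
    import bpy  # present only inside Blender's bundled Python
except ImportError:
    bpy = None


def box(cx, cy, cz, sx, sy, sz):
    """Return (verts, faces) for an axis-aligned box centred at (cx, cy, cz)."""
    hx, hy, hz = sx / 2, sy / 2, sz / 2
    verts = [(cx + dx * hx, cy + dy * hy, cz + dz * hz)
             for dz in (-1, 1) for dy in (-1, 1) for dx in (-1, 1)]
    # Quads wound so that face normals point outward.
    faces = [(0, 2, 3, 1), (4, 5, 7, 6), (0, 1, 5, 4),
             (2, 6, 7, 3), (0, 4, 6, 2), (1, 3, 7, 5)]
    return verts, faces


def stool_geometry(seed=0):
    """Seat plus four legs; legs stand on z = 0 and touch the seat underside."""
    rng = random.Random(seed)  # local RNG: output depends only on the seed
    seat_w = rng.uniform(0.35, 0.45)
    leg_h = rng.uniform(0.40, 0.50)
    seat_t, leg_w = 0.04, 0.04
    parts = [box(0.0, 0.0, leg_h + seat_t / 2, seat_w, seat_w, seat_t)]
    offset = seat_w / 2 - leg_w / 2  # keep legs flush with the seat corners
    for sx in (-1, 1):
        for sy in (-1, 1):
            parts.append(box(sx * offset, sy * offset, leg_h / 2,
                             leg_w, leg_w, leg_h))
    verts, faces = [], []
    for part_verts, part_faces in parts:
        base = len(verts)
        verts.extend(part_verts)
        faces.extend(tuple(base + i for i in f) for f in part_faces)
    return verts, faces


def build(seed=0):
    """Create the mesh object in the current Blender scene."""
    verts, faces = stool_geometry(seed)
    mesh = bpy.data.meshes.new("Stool")
    mesh.from_pydata(verts, [], faces)
    mesh.update()
    obj = bpy.data.objects.new("Stool", mesh)
    bpy.context.collection.objects.link(obj)
    return obj


if bpy is not None:
    build(seed=0)
```

Inside Blender a script like this would typically be run headlessly with `blender --background --python <script>.py`. Keeping the legs flush with the seat is a small nod to the physical-plausibility failures noted earlier (disconnected or misaligned parts).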

2. Add two prompts

prompt_description.txt: a one-paragraph caption a human could visualise. prompt_instruction.txt: a structured spec covering parts, proportions, and finish.

3. Open a PR

Drop the three files into benchmark/categories/<Name>_seed0/, include a 200×200 render and one line of motivation, and submit.
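
Before opening a PR, a contributor might sanity-check the folder with a small script like the one below. It is a hypothetical helper, not a tool shipped with the benchmark. It assumes the factory is the only `.py` file in the folder (the page does not name the script file) and checks only what the steps above state: the `_seed0` folder suffix and the two prompt files.

```python
from pathlib import Path

REQUIRED_PROMPTS = ("prompt_description.txt", "prompt_instruction.txt")


def check_category(folder):
    """Return a list of problems found in one <Name>_seed0 category folder."""
    folder = Path(folder)
    problems = []
    if not folder.name.endswith("_seed0"):
        problems.append("folder name should end with _seed0")
    for name in REQUIRED_PROMPTS:
        path = folder / name
        if not path.is_file():
            problems.append(f"missing {name}")
        elif not path.read_text(encoding="utf-8").strip():
            problems.append(f"{name} is empty")
    # Assumption: the factory is the single Python file in the folder.
    scripts = sorted(folder.glob("*.py"))
    if len(scripts) != 1:
        problems.append(
            f"expected exactly one factory script (*.py), found {len(scripts)}"
        )
    return problems
```

An empty list means the folder has the expected three files. The 200×200 render and the one-line motivation are not checked by this sketch.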

New eval tasks (e.g. sketch-to-3D) and metrics (e.g. a material-fidelity scorer) are also welcome. Please open an issue first to align on scope.

Cite

@inproceedings{3dcodebench2026,
  title  = {3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code},
  author = {Gao, Yipeng and Shu, Lei and Ye, Genzhi and Xiong, Xi and
            Makadia, Ameesh and Guo, Meiqi and Itti, Laurent and Chen, Jindong},
  booktitle = {arXiv preprint},
  year   = {2026},
  note   = {Coming soon}
}

BibTeX coming soon — arXiv link pending.