


3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Overview of 3DCodeBench: 212 procedural object classes, 12 frontier VLMs evaluated under multi-turn agentic refinement, plus a public 3DCodeArena for human-preference Elo rankings.
Agentic data curation pipeline
A VLM-driven agent transforms deeply nested procedural factories from Infinigen into standalone Blender Python scripts via API migration, geometric validation, and iterative visual refinement. Every generated (prompt, code, mesh) triplet passes a human-in-the-loop verification gate to guarantee benchmark quality.

Execution feedback — every successful API migration, fix, or refactor becomes a reusable skill the agent retrieves on later factories.
Established solutions to recurring patterns (custom nodegroups, displacement, particle systems) are indexed and surfaced when the agent encounters a similar factory shape.
Each (prompt, code, mesh) triplet is reviewed by a human annotator. Failed or visually drifted outputs are rejected and re-queued for the agent loop.
At a Glance
Physical Plausibility supersedes Executability. Models frequently produce disconnected parts and misaligned structures, revealing a critical lack of physical-world understanding.
Test-Time Scaling helps. Multi-turn refinement with deterministic Blender feedback improves performance — the agentic harness matters as much as the model.
Open data for training and evaluation. 3DCodeData ships 12,720 factory instances (212 categories × 60 seeds) with per-instance Blender code, multi-view renders, baked GLBs, and LLM-generated captions — ready for SFT, instruction tuning, and shape scoring.
Examples
3DCodeData reference and model outputs side by side, shown with original materials.
Chameleon
Results
Quantitative results from the 3DCodeBench paper.
Cost vs. Human-Preference Elo
Live BT-Elo from 3DCodeArena against per-query list price across paid frontier VLMs. Dashed line traces the Pareto frontier.
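As a rough illustration of how such a frontier can be traced, the sketch below takes the Cost and Elo columns of the single-shot results table on this page and keeps every model that no cheaper model beats on Elo. The function name `pareto_frontier` is hypothetical, and the live arena values behind the plot may differ from the table snapshot used here.

```python
# (model, mean per-query list price in USD, Elo), copied from the single-shot
# results table. The two Gemma rows are omitted because no price is listed.
POINTS = [
    ("GPT-5.5", 0.32, 1167),
    ("Gemini 3.1 Pro", 0.12, 1149),
    ("Gemini 3.5 Flash", 0.04, 1112),
    ("GPT-5.4", 0.20, 1074),
    ("Gemini 3 Flash", 0.02, 1034),
    ("Claude Sonnet 4.6", 0.29, 1022),
    ("Claude Opus 4.7", 0.14, 1008),
    ("GPT-5.4 mini", 0.11, 950),
    ("Gemini 3.1 Flash Lite", 0.01, 880),
    ("Claude Haiku 4.5", 0.03, 798),
]


def pareto_frontier(points):
    """Keep the points that no cheaper-or-equal point matches or beats on Elo."""
    frontier = []
    best_elo = float("-inf")
    # Walk from cheapest to most expensive; ties on cost go to the higher Elo.
    for name, cost, elo in sorted(points, key=lambda p: (p[1], -p[2])):
        if elo > best_elo:
            frontier.append((name, cost, elo))
            best_elo = elo
    return frontier


if __name__ == "__main__":
    for name, cost, elo in pareto_frontier(POINTS):
        print(f"{name:<22} ${cost:.2f}  Elo {elo}")
```

Sorting by cost first means a single pass is enough: a model joins the frontier only if it improves on the best Elo seen at any lower price.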
Human-preference Elo vs. automated metrics
Each panel plots 3DCodeArena BT-Elo against one automated metric, with separate views for the text-to-3D, image-to-3D, and combined tracks. Chamfer uses a reversed x-axis so every panel reads left-to-right as “better metric → higher Elo”.
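One way to summarize the agreement a panel shows is a rank correlation between Elo and the metric. The sketch below computes Spearman's coefficient with the standard library only, using the Elo, DINOv3, and Chamfer columns of the single-shot table on this page. The choice of Spearman is an assumption made for illustration, not necessarily the statistic the paper reports, and the table values are averaged across tracks rather than split per track.

```python
def average_ranks(values):
    """Rank values ascending (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Columns copied from the single-shot table, in the table's row order.
ELO = [1167, 1149, 1112, 1074, 1034, 1022, 1008, 950, 946, 880, 860, 798]
DINOV3 = [0.574, 0.558, 0.558, 0.549, 0.532, 0.536,
          0.555, 0.511, 0.518, 0.481, 0.483, 0.415]
CHAMFER = [0.059, 0.072, 0.136, 0.064, 0.074, 0.070,
           0.067, 0.121, 0.076, 0.075, 0.078, 0.083]

if __name__ == "__main__":
    print("Elo vs DINOv3 :", round(spearman(ELO, DINOV3), 3))
    print("Elo vs Chamfer:", round(spearman(ELO, CHAMFER), 3))
```

Because lower Chamfer is better, a negative coefficient there points the same way as a positive one for DINOv3, which is what the reversed x-axis makes visible in the plot.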
Main results — single-shot
212 categories, one model call per instance (no agent, retry, or tool use). Per-model values averaged across thinking-effort levels and both tracks. Exec. is the Blender 5.0 pass rate. Image-grounded compares rendered vs. reference views. 3D-shape compares the exported GLBs. Cost is mean per-query list price.
| Model | Exec.↑ | SigLIP-2↑ | DINOv3↑ | Chamfer↓ | Uni3D↑ | Uni3D t/i–3D↑ | Elo↑ | Tokens | Time (s) | Tok/s | Cost $ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | 0.881 | 0.833 | 0.574 | 0.059 | 0.549 | 0.285 | 1167 | 9,904 | 178.0 | 56 | $0.32 |
| Gemini 3.1 Pro | 0.710 | 0.819 | 0.558 | 0.072 | 0.555 | 0.277 | 1149 | 10,289 | 134.7 | 76 | $0.12 |
| Gemini 3.5 Flash | 0.454 | 0.816 | 0.558 | 0.136 | 0.470 | 0.243 | 1112 | 11,471 | 54.5 | 210 | $0.04 |
| GPT-5.4 | 0.795 | 0.812 | 0.549 | 0.064 | 0.543 | 0.283 | 1074 | 12,727 | 196.4 | 65 | $0.20 |
| Gemini 3 Flash | 0.561 | 0.813 | 0.532 | 0.074 | 0.528 | 0.271 | 1034 | 6,056 | 32.1 | 189 | $0.02 |
| Claude Sonnet 4.6 | 0.738 | 0.806 | 0.536 | 0.070 | 0.517 | 0.273 | 1022 | 18,609 | 233.9 | 80 | $0.29 |
| Claude Opus 4.7 | 0.881 | 0.817 | 0.555 | 0.067 | 0.514 | 0.273 | 1008 | 4,778 | 57.1 | 84 | $0.14 |
| GPT-5.4 mini | 0.590 | 0.798 | 0.511 | 0.121 | 0.418 | 0.228 | 950 | 24,530 | 151.4 | 162 | $0.11 |
| Gemma 4 31B | 0.582 | 0.798 | 0.518 | 0.076 | 0.494 | 0.262 | 946 | 4,288 | 121.6 | 35 | — |
| Gemini 3.1 Flash Lite | 0.599 | 0.781 | 0.481 | 0.075 | 0.445 | 0.248 | 880 | 8,927 | 40.9 | 218 | $0.01 |
| Gemma 4 26B | 0.517 | 0.780 | 0.483 | 0.078 | 0.436 | 0.249 | 860 | 6,126 | 132.7 | 46 | — |
| Claude Haiku 4.5 | 0.511 | 0.763 | 0.415 | 0.083 | 0.365 | 0.223 | 798 | 5,248 | 30.7 | 171 | $0.03 |
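For readers unfamiliar with the Chamfer column, the sketch below shows one common form of the metric: the mean nearest-neighbor distance from each point set to the other, summed over both directions. The exact variant used by the benchmark (squared or unsquared distances, point sampling, normalization) is not stated on this page, so this is an assumed definition for illustration only.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Assumed variant: unsquared Euclidean distances, averaged per direction,
    then summed. Brute force, so O(len(a) * len(b)).
    """
    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


if __name__ == "__main__":
    square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
    shifted = [(x, y, z + 0.5) for x, y, z in square]
    print(chamfer_distance(square, square))   # identical sets
    print(chamfer_distance(square, shifted))  # same shape, offset along z
```

In practice the points would be sampled from the surfaces of the reference and generated meshes, and a spatial index would replace the brute-force search for large samples.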
Ablations
Test-time scaling lifts executability dramatically; coding-agent harnesses push every backbone to near-ceiling; multi-view input budgets show diminishing returns past N=2 on cheaper models.
Multi-turn error feedback
T=3 retry loop on the failed instances.
Pass rate before and after the multi-turn loop. Each failed instance gets up to two stateless retries, and each retry receives the previous code and the Blender traceback.
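A minimal sketch of such a loop follows, reading T=3 as one initial attempt plus up to two retries (an interpretation, not something the page states outright). The names `refine`, `generate`, and `execute` are hypothetical, and `run_python` is a stand-in that uses Python's `exec` in place of a real Blender run so the example stays self-contained.

```python
import traceback


def run_python(code):
    """Stand-in executor: run code with exec and return (ok, traceback_text).

    A real harness would run the script inside Blender instead and would
    also check that a non-empty mesh was produced.
    """
    try:
        exec(code, {})
        return True, None
    except Exception:
        return False, traceback.format_exc()


def refine(prompt, generate, execute, max_turns=3):
    """Try up to max_turns times; return (code, succeeded, turns_used).

    Stateless: each retry sees only the prompt, the previous code, and the
    previous traceback, not the full history of earlier attempts.
    """
    code, error = None, None
    for turn in range(1, max_turns + 1):
        code = generate(prompt, code, error)
        ok, error = execute(code)
        if ok:
            return code, True, turn
    return code, False, max_turns


if __name__ == "__main__":
    def toy_generate(prompt, prev_code, prev_error):
        # Hypothetical model call: fails once, then "fixes" the script.
        return "radius = 1.0" if prev_error else "radius = 1 / 0"

    code, ok, turns = refine("a chameleon", toy_generate, run_python)
    print(ok, turns, code)
```

Passing only the last code and traceback keeps the prompt short on every turn, at the cost of the model being unable to see which fixes it already tried.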
Coding-agent harness
Each backbone wrapped in its native coding-agent harness (Gemini CLI / Claude Code / Codex CLI / Antigravity CLI) on the text-to-3D track.
Executability: fraction of prompts that produce a non-empty mesh in Blender 5.0.
Multi-view input budget
How does giving the model more reference views (N) affect output quality? Each line is one backbone, mean across 3 seeds at thinking=high.
Appendix
All numbers come from the appendix tables in the paper.
Contribute
3DCodeBench is open. The benchmark grows by adding new categories: chairs, plants, hardware, vehicles, anything we don’t cover yet. Each category is just three text files, so a contribution fits in a single PR.
A self-contained Blender 5.0 Python script that builds your object. No external imports, deterministic at seed=0, under 5 minutes on one CPU core.
prompt_description.txt: a one-paragraph caption a human could visualise. prompt_instruction.txt: a structured spec covering parts, proportions, and finish.
Drop the three files into benchmark/categories/<Name>_seed0/, include a 200×200 render and one line of motivation, and submit.
New eval tasks (e.g. sketch-to-3D) and metrics (e.g. material-fidelity scorer) are also welcome; please open an issue first to align on scope.
Cite
@inproceedings{3dcodebench2026,
title = {3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code},
author = {Gao, Yipeng and Shu, Lei and Ye, Genzhi and Xiong, Xi and
Makadia, Ameesh and Guo, Meiqi and Itti, Laurent and Chen, Jindong},
booktitle = {arXiv preprint},
year = {2026},
note = {Coming soon}
}

BibTeX coming soon; arXiv link pending.