A new benchmark for testing LLMs for deterministic outputs
When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases: converting an invoice into rows, meeting transcripts into tickets, or complex PDFs into database entries. The model may return the schema you want but with hallucinated values, like an `invoice_date` that is off by two months or a transcript array in the wrong order. The JSON is valid, but the values are not.

Structured output is a big part of using LLMs today, especially when building deterministic workflows, yet current structured output benchmarks (e.g., JSONSchemaBench) only validate the pass rate for JSON schema and types, not the actual values within the produced JSON. So we designed the Structured Output Benchmark (SOB) to fix this by measuring JSON schema pass rate, type correctness, and value accuracy across three modalities: text, image, and audio. In our test set, every record is paired with a JSON Schema and a ground-truth answer verified against the source context both manually by a human and with an LLM cross-check, so a missing or hallucinated value counts as wrong.

Open-source models are doing well: GLM-4.7 comes in second overall, right behind GPT-5.4. The rankings shift across modalities: GLM-4.7 leads text, Gemma-4-31B leads images, and Gemini-2.5-Flash leads audio. GPT-5.4, for example, ranks 3rd on text but 9th on images. Model size is not a predictor either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on value accuracy, and Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.

Structured hallucinations are the hardest bug. These values are type-correct, schema-valid, and plausible, so they slip through most guardrails. In one audio record, for example, the ground truth is "target_market_age": "15 to 35 years", and a model returns "25 to 35". Mistakes like this are invisible without field-level checks.

Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable, consistent output structure. The first step toward making structured output better is to measure it and hold ourselves against the best.
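To make the distinction concrete, here is a minimal sketch of field-level value checking layered on top of plain schema validation. The schema, record values, and scoring below are illustrative assumptions (using Python's `jsonschema` package), not SOB's actual harness.

```python
# Minimal sketch: schema validation vs. field-level value checking.
# The schema, field names, and values are illustrative assumptions,
# not the benchmark's real records or scoring code.
import json
from jsonschema import Draft202012Validator  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "invoice_date": {"type": "string"},
        "target_market_age": {"type": "string"},
    },
    "required": ["invoice_date", "target_market_age"],
}

# Hypothetical ground truth, verified against the source context.
ground_truth = {
    "invoice_date": "2024-03-14",
    "target_market_age": "15 to 35 years",
}

# Hypothetical model output: schema-valid, but both values are wrong.
model_output = json.loads(
    '{"invoice_date": "2024-05-14", "target_market_age": "25 to 35"}'
)

# Step 1: schema/type check -- this is all most benchmarks measure.
schema_ok = Draft202012Validator(schema).is_valid(model_output)

# Step 2: field-level value check -- catches structured hallucinations
# that pass the schema but contradict the ground truth.
correct = sum(
    model_output.get(field) == expected
    for field, expected in ground_truth.items()
)
value_accuracy = correct / len(ground_truth)

print(f"schema valid: {schema_ok}")          # True: the JSON "looks" right
print(f"value accuracy: {value_accuracy}")   # 0.0: every value is hallucinated
```

A real harness would normalize dates and tolerate formatting differences before comparing, but even naive string equality exposes the failure mode that schema validation alone misses.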
Similar Products
My retired dad and I made a daily, somewhat difficult, quiz
My dad makes the questions, I made the site. I think the genre and the level of difficulty are suited for HN. Hope you enjoy. (I promise no AI-generated questions; they are all hand-made!)
SyncVibe
Show HN: SyncVibe – Code with friends in the terminal, each with your own AI
Figma alternative where AI works with vector primitives, not code
Show HN: Figma alternative where AI works with vector primitives, not code
Time Pin
Hi! Any history nerds here? I made Time Pin, a little game inspired by GeoGuessr but history-themed. You can play it here (it works on both desktop and mobile); any feedback is appreciated: https://www.crazygames.com/game/time-pin

Now some details: the goal is to guess the time and place that a character is from. You base your guess on some environmental photos and on questions that you can ask the character (you have 12 questions but can only ask 5, so you have to choose carefully). The closer you are, the more points you get. At the end, a portrait of the character is revealed, along with educational resources to learn more about their culture and era (articles, videos, podcasts, etc.).

The game only has 5 levels currently, but I hope to have over 100 someday. It’s tough to create levels because each one requires research, plus generating photos with AI (AI is necessary; otherwise we’d only have photos starting from the 19th century, when the camera was invented). My goal for the game was to create a challenge, and also maybe spark some curiosity about history.
LLMs consume 5.4x less mobile energy than ad-supported web search
The standard AI energy debate compares server-side LLM inference to a server-side Google query. I think this misses most of what actually happens on a mobile device during a real search session.

I built a parametric model of the full end-to-end mobile search session: 4G/5G radio energy, SoC rendering cost for a 2.5 MB page, programmatic advertising RTB auctions running in the background, and network transmission costs for both sides. Then I compared it to an equivalent LLM session.

Main finding across 10,000 Monte Carlo draws: on mobile, a standard LLM session uses on average 5.4x less energy than a classic ad-supported web search session. Programmatic advertising alone accounts for up to 41% of device battery drain per session. (A toy sketch of the comparison follows below.)

Caveats I tried to be explicit about:

- The advantage disappears on fixed Wi-Fi/fiber
- It reverses for reasoning models
- This is a parametric model, not an empirical device measurement; Greenspector has offered to run terminal measurements for v2
- Jevons paradox applies
- SSRN working paper, not peer-reviewed

Methodology and Monte Carlo distributions are fully documented in the paper. Happy to defend the assumptions. DOI: 10.2139/ssrn.6287918
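For readers who want to see the shape of such a comparison, here is a toy Monte Carlo sketch. Every distribution and parameter value is a placeholder chosen for illustration only; the calibrated distributions and the actual component estimates are the ones documented in the paper.

```python
# Toy Monte Carlo comparison of per-session energy: mobile web search vs. LLM.
# All distributions and parameter values are illustrative placeholders,
# NOT the calibrated figures from the paper.
import numpy as np

rng = np.random.default_rng(0)
N = 10_000  # number of Monte Carlo draws

# Hypothetical per-session components for an ad-supported web search (joules).
radio = rng.normal(8.0, 2.0, N)        # 4G/5G radio energy for a ~2.5 MB page
rendering = rng.normal(5.0, 1.5, N)    # SoC cost to render the page and ads
rtb_auction = rng.normal(6.0, 2.5, N)  # programmatic ad auctions in background
server_search = rng.normal(3.0, 1.0, N)
search_session = radio + rendering + rtb_auction + server_search

# Hypothetical per-session components for an equivalent LLM answer.
llm_radio = rng.normal(1.5, 0.5, N)    # small text response, little rendering
llm_inference = rng.normal(2.5, 1.0, N)
llm_session = llm_radio + llm_inference

ratio = search_session.mean() / llm_session.mean()
ad_share = rtb_auction.mean() / search_session.mean()

print(f"search/LLM energy ratio: {ratio:.1f}x")
print(f"ad auction share of search session: {ad_share:.0%}")
```

The point of the sketch is the structure: the search side sums several device- and network-level components while the LLM side has far fewer, and the headline ratio falls directly out of the sampled component distributions.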