A new benchmark for testing LLMs for deterministic outputs

Name: A new benchmark for testing LLMs for deterministic outputs
Author: khurdula

When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases: converting an invoice into rows, meeting transcripts into tickets, or complex PDFs into database entries. The model may return the schema you want but with hallucinated values, such as an `invoice_date` that is off by two months or a transcript array in the wrong order. The JSON is valid, but the values are not.

Structured output is a big part of using LLMs today, especially when building deterministic workflows. Current structured output benchmarks (e.g., JSONSchemaBench) only measure the pass rate for JSON Schema and types, not the actual values within the produced JSON. So we designed the Structured Output Benchmark (SOB), which fixes this by measuring the JSON Schema pass rate, type correctness, and value accuracy across all three modalities: text, image, and audio. In our test set, every record is paired with a JSON Schema and a ground-truth answer that was verified against the source context manually by a human and cross-checked by an LLM, so a missing or hallucinated value counts as wrong.

Open source is doing pretty well, with GLM-4.7 coming in at number 2, right after GPT-5.4. We noticed that the rankings shift across modalities: GLM-4.7 leads text, Gemma-4-31B leads images, and Gemini-2.5-Flash leads audio. For example, GPT-5.4 ranks 3rd on text but 9th on images. Model size is not a predictor either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy, and Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.

Structured hallucinations are the hardest bug. Such values are type-correct, schema-valid, and plausible, so they slip through most guardrails. For example, in one audio record, the ground truth is `"target_market_age": "15 to 35 years"`, and a model returns `"25 to 35"`. This is invisible without field-level checks.
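The gap between schema validity and value accuracy can be illustrated with a minimal Python sketch. This is not SOB's scoring code: the helper names (`flatten`, `types_match`, `value_accuracy`), the exact-match rule, and the `launch_year` field are assumptions made for illustration, and `types_match` is only a crude stand-in for real JSON Schema validation. Only the `target_market_age` values come from the audio example above.

```python
from typing import Any


def flatten(value: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into {path: leaf_value}."""
    if isinstance(value, dict) and value:
        children = [(f"{prefix}.{key}" if prefix else str(key), child)
                    for key, child in value.items()]
    elif isinstance(value, list) and value:
        children = [(f"{prefix}[{index}]", child)
                    for index, child in enumerate(value)]
    else:
        # Scalars and empty containers are treated as leaves.
        return {prefix: value}
    flat: dict[str, Any] = {}
    for path, child in children:
        flat.update(flatten(child, path))
    return flat


def types_match(truth: Any, predicted: Any) -> bool:
    """Crude stand-in for schema validation: same paths, same leaf types."""
    t, p = flatten(truth), flatten(predicted)
    return t.keys() == p.keys() and all(type(t[k]) is type(p[k]) for k in t)


def value_accuracy(truth: Any, predicted: Any) -> float:
    """Fraction of ground-truth leaves reproduced exactly (assumed metric)."""
    t, p = flatten(truth), flatten(predicted)
    missing = object()  # sentinel so an absent field never counts as correct
    correct = 0
    for path, expected in t.items():
        got = p.get(path, missing)
        if type(got) is type(expected) and got == expected:
            correct += 1
    return correct / len(t)


# "launch_year" is a hypothetical field added so the score is not all-or-nothing.
truth = {"target_market_age": "15 to 35 years", "launch_year": 2024}
predicted = {"target_market_age": "25 to 35", "launch_year": 2024}

print(types_match(truth, predicted))     # True: structure and types look fine
print(value_accuracy(truth, predicted))  # 0.5: one of two values is wrong
```

Because list elements are compared by position, a correctly populated but wrongly ordered array also lowers the score under this sketch, which matches the transcript-ordering failure described earlier. A real evaluation would likely need per-field rules (for example, normalized dates or order-insensitive sets) rather than strict equality.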
Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step to making structured output better is to measure it and hold ourselves against the best.

AI Tools BOTH · khurdula

Similar products

AI Tools

Daily competitor-change briefs for local businesses

Show HN: Daily competitor-change briefs for local businesses

Revenue: N/A

AI Tools

Loop

Loop helps you close the loop on email introductions. You add it to the CC line (loop@getlooped.cc) on any intro email, and after 2-3 weeks it follows up with each side privately to determine the status or outcome of the connection, including any follow-on intros that one side made to the other. Other than actively adding Loop to CC, it is permissionless. It requires no account or external app and runs entirely through the regular email workflow. Over 90% of the time, the same etiquette that moves the connector to BCC after the intro does the same for Loop, so it only has visibility into the first two emails to determine who the connector is, who the sides are, what their roles are, and when to emerge with the ask. Aside from the challenge of extracting the maximum number of data points from minimal data, it also had to be built with more than a dozen timing, attribution, and meaning guardrails so it won't act irreversibly on inferred state. Some examples include out-of-office messages, personal assistants, weekends/holidays, intros of 3 or more people, replies from the same person with a different email address, and emails that are not intros at all where it was CC'd by accident. I built Loop because I do about 20 intros per month and realized that I rarely hear back on how they are going (e.g., whether a deal is on the table) or what the outcome eventually was. This visibility makes me a better connector and dealmaker, and it puts me top of mind for attribution without needing awkward reach-outs. Loop can also be used for intros you receive, so I use it to see who keeps opening doors for me and to gather statistics. Stack: Next.js 16 (App Router) + React 19 + TypeScript on Vercel, Postgres (Supabase) via Prisma, Auth.js for login, Resend for inbound + outbound email, and Claude Haiku classifying the incoming mail. Sentry + Vercel Cron as well.

Revenue: N/A

AI Tools

How many people live within an hour walk of you?

I've been building a bunch of US geography tools/games/visualizations and thought folks on HN might like this one. There are also drive-time and bike-time tools available on the same page. For folks interested, all of the data currently comes from OpenStreetMap and the US Census: OSM for routing and points of interest, and the US Census for population data. Anonymous users can use it 5x/day. I put up a waitlist page to gauge interest in a paid version, but there's no way for the public to make an account, so feel free to bypass the limits in an incognito tab. The site is US-only for the time being. Sorry, international folks!

Revenue: N/A

AI Tools

Tines 3B

Hey HN! This is Yannick from Tines, really excited to share what we're launching today. Tines 3B is a new product we've built for teams building agents, apps, and automations with AI, on the assumption that people are already doing this, whether or not IT or security knows about it. Finance, marketing, and more are building dashboards in Claude Code or Codex. None of it is malicious; people are doing exactly what they were asked to do. The issue is that there's no easy way for that work to run securely: it ends up on a laptop or personal account, with a credential pasted directly into the code and no one else able to see that it exists. Tines 3B is where that work runs instead: everyone can still build the same way, but it executes in an environment that IT and security can actually see and control. It's code-first: you describe what you want and your LLM writes the implementation, with isolated executions and credentials handled through a proxy rather than embedded in the code. Our other product has been running inside security and IT teams' most sensitive systems for the last 8 years. This is the same underlying problem (who gets access to what, and who can see it), just for a new category of builder. Our explorer edition is free forever and entirely self-serve: https://login.tines.com/saml_idp/signup

Revenue: N/A

AI Tools

tale.fyi, we deserve a home for fiction

for decades, i have been concerned that the internet was being built around non-fiction, so i built something to show how we could celebrate great fiction on the web. i started with an amazing library from the public domain, and i also added tools to add your own stories. super interested to hear any feedback, and if you read anything good!

Revenue: N/A

Key facts

Category: AI Tools
Audience: BOTH
Founder: khurdula
Revenue data: Unknown
