A new benchmark for testing LLMs for deterministic outputs

When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases such as converting an invoice into rows, meeting transcripts into tickets, or even complex PDFs into database entries. The model may return the schema you want but with hallucinated values, such as an `invoice_date` that is off by 2 months or a transcript array in the wrong order. The JSON is valid, but the values are not.
Structured output is a big part of using LLMs today, especially when building deterministic workflows. Current structured output benchmarks (e.g., JSONSchemaBench) only validate the pass rate for JSON schema and types, not the actual values within the produced JSON. So we designed the Structured Output Benchmark (SOB), which fixes this by measuring the JSON schema pass rate, types, and value accuracy across all three modalities: text, image, and audio. In our test set, every record is paired with a JSON Schema and a ground-truth answer that was verified against the source context manually by a human and cross-checked by an LLM, so a missing or hallucinated value is counted as wrong.
Open source is doing pretty well, with GLM-4.7 coming in at number 2, right after GPT-5.4. We noticed that the rankings shift across modalities: GLM-4.7 leads text, Gemma-4-31B leads images, and Gemini-2.5-Flash leads audio. For example, GPT-5.4 ranks 3rd on text but 9th on images. Model size is not a predictor either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy, and Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.
Structured hallucinations are the hardest bug. Such values are type-correct, schema-valid, and plausible, so they slip through most guardrails. For example, in one audio record, the ground truth is "target_market_age": "15 to 35 years", and a model returns "25 to 35". This is invisible without field-level checks.
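
To make the idea of a field-level check concrete, here is a minimal Python sketch. It is not SOB's actual scoring code: the helper names (`flatten`, `field_report`, `MISSING`), the exact-match comparison, and the `channels` field in the sample record are all assumptions made for illustration. Only the `target_market_age` values come from the example above.

```python
from typing import Any

# Hypothetical sentinel marking a field present on one side only.
MISSING = "<missing>"


def flatten(value: Any, prefix: str = "") -> dict[str, Any]:
    """Map every leaf of a nested JSON value to a path such as 'items[0].sku'."""
    leaves: dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            leaves.update(flatten(child, path))
        return leaves
    if isinstance(value, list):
        # List indices are part of the path, so a reordered array is a mismatch.
        for index, child in enumerate(value):
            leaves.update(flatten(child, f"{prefix}[{index}]"))
        return leaves
    return {prefix: value}


def field_report(predicted: Any, truth: Any) -> tuple[float, dict[str, tuple[Any, Any]]]:
    """Return (share of matching leaf fields, {path: (expected, got)} for mismatches)."""
    truth_leaves = flatten(truth)
    pred_leaves = flatten(predicted)
    mismatches: dict[str, tuple[Any, Any]] = {}
    # Wrong or missing values, judged by exact equality (an assumption).
    for path, expected in truth_leaves.items():
        got = pred_leaves.get(path, MISSING)
        if got != expected:
            mismatches[path] = (expected, got)
    # Extra fields that the ground truth does not contain.
    for path in pred_leaves.keys() - truth_leaves.keys():
        mismatches[path] = (MISSING, pred_leaves[path])
    total = len(truth_leaves.keys() | pred_leaves.keys())
    accuracy = 1.0 if total == 0 else 1 - len(mismatches) / total
    return accuracy, mismatches


# 'channels' is an invented field; only 'target_market_age' is from the post.
truth = {"target_market_age": "15 to 35 years", "channels": ["retail", "online"]}
predicted = {"target_market_age": "25 to 35", "channels": ["retail", "online"]}

accuracy, mismatches = field_report(predicted, truth)
print(f"value accuracy: {accuracy:.2f}")  # value accuracy: 0.67
for path, (expected, got) in mismatches.items():
    print(f"{path}: expected {expected!r}, got {got!r}")
```

Both records would pass the same schema and type checks, so only the per-field comparison surfaces the wrong age range. A real scorer would likely need looser matching for some fields (for example normalized dates or numeric tolerance), which this sketch leaves out.
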
Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step to making structured output better is to measure it and hold ourselves against the best.

AI Tools BOTH · khurdula
Revenue: N/A (revenue data unavailable)

Similar products

AI Tools
Brightdeck

Show HN: Brightdeck – an OOXML-compatible AI presentation maker

Revenue N/A
AI Tools
Stillwind

We’ve spent the last couple of months building Stillwind Search, a search engine for electronic components that helps users find parts that fit even the most complex set of specifications. After talking to the people who actually build PCBs, we found out that finding the exact part you are looking for consumes enormous amounts of time, is very tedious, and often doesn’t yield the best results. So we tried to cut down this search time by requiring only a (broad) description of the specifications, and we find the correct part in minutes, not hours.
This is possible through our own database of parts and their properties. We used LLMs to extract every parameter about a part into >1k schemas, collectively covering more than 130k properties. This depth of properties could no longer be visualized, so the database is queried interactively by an AI agent (Sonnet 4.6) to find the needle in the haystack of parts. Before results are shown, we use another model to verify the data (that’s why it might take a moment before the first results appear).
We currently have almost all microcontrollers, sensors, and other advanced ICs on the market, as well as a wide selection of passives and generic parts. We are working on adding more parts and are more than happy to take suggestions.
I know that folks on HN like technical details on how this works, so let me give a short overview:
Frontend: SvelteKit + Cloudflare Workers + Hyperdrive
Backend: PostgreSQL 18 (with io_uring) database, with extensions on NVMe drives hosted on a beefy server.
We appreciate all feedback and are happy to answer any questions :)
Btw: We are already working on a way for you to search for combinations of parts and find the optimal combination.

Revenue N/A
AI Tools
Social network where inviting someone makes you accountable for them

Chirpper is invite-only. When you vouch someone in, they join your TrustChain. Their behavior affects your TrustRank, and that propagates up the lineage. No moderators. The accountability is architectural, not policy-based. You can be pseudonymous, but you can't be unaccountable. Happy to get into the mechanics in comments.

Revenue N/A
AI Tools
Command Center, the AI coding env for people who care about quality

Hi HN! We’re Jimmy and Ray. Jimmy is a Thiel Fellow with a Ph.D. from MIT who has worked on programming tools for 15 years; Ray became VP of Sales at a $2B company when he was 19 and has built side businesses vibe-coding.
Last year, we set out to answer the question “If AI can write code 100x faster, then why aren’t you shipping 100x faster?” What we learned shocked us: even fairly nontechnical people and solo founders told us they were spending more than half of their development time reading the AI-written code. And much of the rest of the time was spent either de-slopping it or wishing they had done so.
As luck would have it, our last two products were a tool that quickly onboards people to large codebases ( https://x.com/0xjimmyk/status/1873357324229984677 ) and trainings that taught deep concepts of code quality to CEOs, YC founders, and engineers at top companies ( mirdin.com ), so we were extremely well-positioned to solve these problems.
Command Center is an agentic coding environment focused on quality. With a few keypresses, you can start building 3 features at once and soon have 3 diffs ready, each consisting of 2000 changed lines across 50 files… This is normally the point where you think “Crap, what now?”
With Command Center, at this point you simply click “Refactor” and watch the vibed slop turn into readable robustness. Then you click “Generate Walkthrough,” and suddenly, to read a 2000-line diff, instead of scrolling up and down trying to make sense of it, you just press the right arrow key 200 times. See something you don’t like? Click on line 37, type “Do this and all other network fetches in the background,” press Cmd+Enter, and you have a few more agents getting your code into final shape. Click or type “Commit,” “Push,” “Create PR,” and you have just shipped a high-quality, non-slop feature.
We’re striving to be the best at every step of the pipeline, but you can just try Command Center in pieces wherever you feel your current workflow is weakest.
We have users who do all their coding in Zed or the Codex app, and then jump over to Command Center for a walkthrough when it finishes running. There’s even a skill that will pop open a Command Center walkthrough from the environment of your choice. Or you can just keep Command Center running while you do your work elsewhere, and if your AI deletes anything, you have Command Center’s snapshots to the rescue.
We launched quietly last year and have been refining since. The quality and usability have kept going up, and Command Center is now ready for a lot more attention. Since our quiet launch, we’ve seen at least a dozen other agentic coding environments appear, approximately all of which have the same feature set, focused on the part that is already easy (generating the first version of the code), with at best a shoddy answer to the hard part (everything that comes after). Command Center’s focus is making the hard parts easy.
Here’s what our users have to say:
“[The refactorings] give your LLM taste. I’ve never seen an LLM write code this good before.” (Doug Slater, Staff Engineer, Climavision)
“With Command Center walkthroughs, I can get through a 400-line diff in less than half the time.” (Prateek Kumar, Platform Engineer, Sumo Logic)
This product is not for everyone. If you’re someone who preaches “the prompt is the source, the code is the compiler output,” then you probably won’t enjoy Command Center. But if you want to uphold traditional engineering discipline while also shipping 20 PRs a day, then this is the environment for you.

Revenue N/A
AI Tools
A Highly Available Distributed Router for Global Realtime AI

Show HN: A Highly Available Distributed Router for Global Realtime AI

Revenue N/A

Key facts

Category: AI Tools
Audience: BOTH
Founder: khurdula
Revenue data: Unknown
