classification

Structured Logprobs for Better Classification

Combining chain-of-thought reasoning with logprob extraction improves LLM classification accuracy while giving you real confidence scores.

Cole Hoffer
Hero image for Structured Logprobs for Better Classification

Most LLM classification setups look something like this: send a prompt with label definitions, request a single-token true/false response, extract the logprob (log probability, a numerical measure of the model's confidence in its output), and convert it to a confidence score. It works. It's fast. But there's a substantial accuracy gap between what this approach achieves and what the model is capable of.

I've been running experiments with a structured variant that forces chain-of-thought reasoning before the prediction, then extracts logprobs specifically from the prediction field. The accuracy improvements were significant enough that I've moved most of my classifiers to this pattern.

This post walks through the technique, the tradeoffs, and the implementation details.

Common Baseline Approaches

Before getting into why structured logprobs work better, it's worth cataloging what most LLM classification implementations look like today.

String Matching

The simplest approach: prompt the model to output a label, then string-match against expected values.

You extract "Yes" (or try to), normalize casing, and map to your label. If no match, fall back to a default.

The problems compound quickly:

  • Models don't reliably output just the label-they hedge, explain, or use synonyms
  • Casing varies ("Yes", "YES", "yes")
  • Synonyms slip in ("correct", "affirmative", "true")
  • Partial matches are ambiguous ("not yes" contains "yes")
  • No confidence score-every match looks equally certain

You end up writing increasingly complex parsing logic, and the default fallback gets hit more than you'd like.

Single-Token Logprobs

A more sophisticated approach: restrict output to a single token and extract logprobs.

This gets you confidence scores, but introduces new constraints:

  • Must restrict to single-token outputs (multi-token labels won't work)
  • Still have casing issues ("a" vs "A" are different tokens with different logprobs)
  • Letter labels (A/B) work better than words because they're reliably single tokens
  • The model has no space to reason-it's a snap judgment
  • You're measuring confidence in pattern-matching, not in a reasoned conclusion

Both approaches share a fundamental limitation: they force the model to compress its entire decision process into a single output token, then try to extract meaning from that compressed signal.

Why Single-Token Classification Underperforms

The specific failure modes of these baseline patterns are worth understanding in detail.

When you prompt a model to output only true or false, you're asking it to compress its entire reasoning process into a single token decision. The model doesn't have space to:

  1. Decompose ambiguous cases - Many classification tasks have edge cases where multiple factors interact. Without space to reason, the model pattern-matches on surface features.

  2. Resolve conflicting signals - Real inputs often contain evidence for multiple labels. The model needs to weigh these signals, but single-token output forces an immediate commitment.

  3. Self-correct - Chain-of-thought allows models to catch their own errors mid-reasoning. Single-token output bypasses this entirely.

The logprobs you extract from a single-token prediction reflect confidence in that immediate pattern-match, not confidence in a reasoned conclusion. For straightforward classification, this is fine. For nuanced tasks, you're measuring the wrong thing.

How Structured Logprobs Work

The power of this approach comes from two independent mechanisms working together: structured output constrains the model's behavior, while logprobs give you a probability distribution over that constrained space. Neither is sufficient alone.

Structured Output: Constraining the Decision Space

When you ask an LLM to "return true or false," you're hoping it interprets that instruction correctly. Sometimes it does. Sometimes you get "True", "TRUE", "yes", "The answer is true", or a multi-paragraph explanation followed by the answer buried somewhere in the middle.

Structured output eliminates this variance entirely. By enforcing a JSON schema, you're telling the model: "Your output must be valid JSON with exactly these fields." The model can't deviate, hedge, or produce malformed responses.

But the real value isn't just format consistency-it's forcing an execution path. When you define the schema as:

You're not just asking for reasoning-you're requiring it as a prerequisite to the prediction. The model cannot emit the prediction field until it has completed the reasoning field. This is chain-of-thought by construction, not by hope.

This matters because:

  1. Ambiguous cases get decomposed - The model has space to work through conflicting signals before committing
  2. The reasoning is auditable - You can inspect why the model made a decision, not just what it decided
  3. Edge cases surface explicitly - When the model is uncertain, the reasoning often reveals the specific factor causing hesitation

Without structured output, you're relying on the model to voluntarily reason before answering. With structured output, reasoning is architecturally guaranteed.

How Type Constraints Work

Structured output APIs (like OpenAI's response_format: { type: "json_schema" }) don't just parse the output after generation-they constrain generation itself. At each token, the model's logits are masked to only allow tokens that could produce valid JSON conforming to your schema.

For a boolean field like prediction, this means:

  • When the model reaches the prediction field, only tokens that start true or false are permitted
  • The model cannot output "yes", 1, "True", or any other representation
  • The probability mass is redistributed across only the valid options

This has important implications for logprobs:

Pros:

  • Cleaner probability space - For a boolean field, you get a true binary distribution. The logprob for true directly reflects P(true) vs P(false), not P(true) vs P(false) vs P(yes) vs P(1) vs P("True") vs ...
  • No parsing ambiguity - You know exactly what token to extract logprobs from
  • Forced valid output - The model cannot produce malformed responses that break downstream processing

Cons:

  • Logprobs reflect constrained distribution - The logprob you see is P(true | must be true or false), not P(true | unconstrained). If the model "wanted" to output "yes" with high probability, that probability mass gets redistributed to true or false
  • May mask model uncertainty - A model that would naturally hedge ("probably true") is forced into a binary choice, potentially inflating the apparent confidence
  • Schema overhead - Complex schemas add tokens to the prompt and slightly increase latency

In practice, the cleaner probability space outweighs the cons for classification tasks. You're explicitly asking for a boolean decision-constraining the output to boolean values aligns the logprobs with what you're actually measuring.

Logprobs: Building a Probability Space

The second mechanism is extracting logprobs from the prediction field. This transforms classification from a binary output into a continuous probability distribution.

Why does this matter? Because binary outputs hide information. When a model outputs true, you don't know if it was 51% confident or 99% confident. Both look identical in the response. Logprobs expose the underlying probability mass the model assigned to each option.

This unlocks several capabilities that are impossible with binary outputs:

Threshold Tuning - Instead of accepting every prediction at face value, you can set confidence thresholds. "Only auto-approve predictions where logprob > -0.1" (roughly 90% confidence). This lets you trade off precision vs. recall based on your use case requirements.

Uncertainty Quantification - You can identify which inputs the model is genuinely uncertain about. A logprob of -0.7 (roughly 50/50) indicates the model is guessing. These cases can be routed to human review or a more capable model.

Calibration Analysis - By collecting predictions with their logprobs and comparing to ground truth, you can measure how well-calibrated the model is. If predictions with logprob -0.1 are only correct 70% of the time (not 90%), you know the model is overconfident and can adjust thresholds accordingly.

Prioritization - Logprobs let you prioritize review. Low-confidence predictions are most likely to be wrong, so review those first. High-confidence predictions can wait or be auto-approved.

A/B Test Power - Confidence scores give you a richer signal for evaluating model changes. Two models might have the same accuracy, but if one is more confident on correct predictions and less confident on incorrect ones, it's meaningfully better, and logprobs reveal this.

The key insight: logprobs turn your classifier into a probability estimator. You're no longer asking "is this true or false?" You're asking "what's the probability this is true?" That's a fundamentally more useful question.

The Combination: Confident Reasoning

Neither mechanism alone is sufficient:

  • Structured output without logprobs - You get reliable formatting and forced reasoning, but no way to quantify confidence. Every prediction looks equally certain.
  • Logprobs without structured output - You get confidence scores, but they measure confidence in a single-token snap judgment, not confidence in a reasoned conclusion. The model hasn't had space to think.

Together, you get the best of both: the model must reason through the problem (structured output), and you can measure how confident it is in its conclusion after that reasoning (logprobs on the prediction field).

This is why the accuracy gains are so dramatic on hard classification tasks. You're not just measuring "how confident is the model in its first instinct?" You're measuring "how confident is the model after working through the problem?"

Interactive Comparison: Baseline vs Structured Logprobs

The interactive demo below shows side-by-side comparisons of baseline single-token classification versus structured logprobs on the same inputs. Try different examples to see how structured logprobs improves accuracy on ambiguous and conflicting cases, and provides better confidence calibration even when both methods are correct.

Baseline vs Structured Logprobs Comparison

Input:
I can't log into my account, the password reset isn't working
Topic: authentication issues
Ground Truth: true
Difficulty: Easy
Baseline (Single-Token)
Prediction:true
Logprob (predicted):-0.120
Logprob (alternative):-2.800
Confidence:94%
No reasoning provided — single-token snap judgment
Structured Logprobs
Prediction:true
Reasoning:
The message explicitly mentions 'log into my account' and 'password reset isn't working', which are clear indicators of authentication problems. This directly matches the topic of authentication issues.
Logprob (predicted):-0.080
Logprob (alternative):-3.100
Confidence:97%
Better confidence calibration
Both methods are correct, but structured logprobs provides higher confidence (97% vs 94%) after reasoning through the problem.
How it works: The baseline method outputs a single token (true/false) with no reasoning, while structured logprobs forces the model to reason through the problem before making a prediction. Notice how structured logprobs improves accuracy on ambiguous and conflicting cases, and provides better confidence calibration even when both methods are correct. The reasoning field makes the decision process transparent and auditable.

The Structured Logprobs Pattern

The implementation is straightforward. Force the model to output structured JSON with a reasoning field before the prediction field, then extract logprobs specifically from the prediction token(s).

The key insight: the model generates reasoning tokens before it generates the prediction. By the time it commits to true or false, it has already worked through the problem. The logprob on that prediction reflects confidence after reasoning, not confidence in an immediate pattern-match.

Extracting Logprobs from Structured Outputs

When you request structured JSON output with logprobs enabled, the API returns:

  1. The complete JSON string
  2. An array of tokens with their individual logprobs

The challenge: mapping tokens back to specific field values. JSON tokenization doesn't align cleanly with semantic field boundaries.

The Token Alignment Problem

Consider this response:

The API might tokenize this as:

Notice the problems:

  • Field names split across tokens ("reason + ing\":)
  • Punctuation merged with content (" pricing\" includes the closing quote)
  • Structural tokens merged with field names (",\"pred")

You need Token[10] for the prediction logprob, but finding it requires careful parsing.

The Extraction Algorithm

The approach is conceptually simple:

The key insight: you can't use standard JSON parsers because they discard positional information. You need to track where each value starts and ends in the original string, then map those character positions back to tokens.

For boolean fields like prediction, this usually yields a single token (true or false). For string fields like reasoning, you'll get multiple tokens, and the summed logprob represents the joint probability of that exact sequence.

Converting Logprobs to Confidence

Once you've extracted the logprob for the prediction field, convert it to a usable confidence score. Here's a complete implementation with the OpenAI API:

The key formula for converting logprobs to a confidence score is:

This gives you a normalized probability between 0 and 1. For example:

  • logprob_true = -0.1, logprob_false = -3.2confidence ≈ 0.96 (96% confident)
  • logprob_true = -0.7, logprob_false = -0.7confidence = 0.50 (50/50, uncertain)
  • logprob_true = -3.5, logprob_false = -0.05confidence ≈ 0.03 (3% confident in true, i.e., 97% confident in false)

With this confidence score, you can:

  • Set thresholds: Only accept predictions above 0.90 confidence
  • Route to review: Send predictions below 0.70 confidence to human reviewers
  • Prioritize: Process low-confidence items first for verification

Cost and Latency Analysis

Structured logprobs add overhead. Here's what I measured (note: I used GPT-5.1 Instant, which doesn't perform reasoning, so the structured approach may behave differently than with reasoning-capable models):

MetricGPT-4.1GPT-5.1 Instant
Added input tokens~400~400
Added output tokens~75~75
Added input cost$0.0008/request$0.0005/request
Added output cost$0.0006/request$0.00075/request
Total added cost$0.0014/request$0.00125/request
Added latency+2.35s+2.1s

Latency Considerations

The +2 second latency is the harder constraint for many systems. This is driven by:

  1. Additional prompt tokens - The schema instructions add ~400 tokens of input
  2. Reasoning generation - The model produces ~50-100 reasoning tokens
  3. JSON formatting overhead - Structured output has slightly higher per-token latency

For synchronous user-facing classification, this may be prohibitive. For async batch processing, it's negligible. For mixed workloads, consider a tiered approach where you route to fast baseline for obvious cases and structured logprobs for ambiguous ones.

Conclusion

Structured logprobs represent a meaningful improvement over baseline LLM classification for complex tasks. The technique leverages two complementary mechanisms:

Structured output constrains the model to reason before predicting. This isn't optional reasoning-it's architecturally enforced. The model cannot emit a prediction until it has completed the reasoning field.

Logprobs transform binary classification into probability estimation. Instead of "true or false," you get "85% confident true." This enables threshold tuning, uncertainty quantification, calibration analysis, and intelligent routing.

Together, these yield meaningful accuracy improvements on hard classification problems while giving you the operational levers to optimize the precision-recall tradeoff for your specific use case.

The tradeoff is latency. The additional reasoning tokens add ~2 seconds per request, which rules out real-time user-facing classification or anything requiring sub-100ms response times. For simple, clear-cut decisions where baseline accuracy is already high, the added complexity may not be worth it.

But for high-stakes classification-compliance, safety, customer impact-or tasks with complex decision boundaries that require multi-factor reasoning, this pattern pays for itself quickly. The confidence scores let you route uncertain predictions to human review, catch adversarial inputs that might game simpler classifiers, and continuously calibrate your thresholds based on observed accuracy.

If accuracy matters more than latency, this is the better way to do LLM classification.

Cite this post

/

Cole Hoffer. (Dec 2025). Structured Logprobs for Better Classification. Cole Hoffer. https://www.colehoffer.ai/articles/structured-logprobs-for-classification

Stay in the loop

Get notified when I publish new articles on RAG, search, and AI systems.

 

Cole Hoffer·TwitterLinkedIn