
[classification]
Structured Logprobs for Better Classification
Combining chain-of-thought reasoning with logprob extraction improves LLM classification accuracy while giving you real confidence scores.

Most LLM classification setups look something like this: send a prompt with label definitions, request a single-token true/false response, extract the logprob (log probability—a numerical measure of the model's confidence in its output), and convert it to a confidence score. It works. It's fast. But there's a substantial accuracy gap between what this approach achieves and what the model is capable of.
We've been running experiments with a structured variant that forces chain-of-thought reasoning before the prediction, then extracts logprobs specifically from the prediction field. The accuracy improvements were significant enough that we've moved most of our classifiers to this pattern.
This post walks through the technique, the tradeoffs, and the implementation details.
Common Baseline Approaches
Before getting into why structured logprobs work better, it's worth cataloging what most LLM classification implementations look like today.
String Matching
The simplest approach: prompt the model to output a label, then string-match against expected values.
1Prompt: "Is this message about billing? Reply with 'yes' or 'no'."2Output: "Yes, this appears to be about billing."
You extract "Yes" (or try to), normalize casing, and map to your label. If no match, fall back to a default.
The problems compound quickly:
- Models don't reliably output just the label—they hedge, explain, or use synonyms
- Casing varies ("Yes", "YES", "yes")
- Synonyms slip in ("correct", "affirmative", "true")
- Partial matches are ambiguous ("not yes" contains "yes")
- No confidence score—every match looks equally certain
You end up writing increasingly complex parsing logic, and the default fallback gets hit more than you'd like.
Single-Token Logprobs
A more sophisticated approach: restrict output to a single token and extract logprobs.
1Prompt: "Is this about billing? Reply with only 'A' for yes or 'B' for no."2Output: "A"3Logprob for "A": -0.15
This gets you confidence scores, but introduces new constraints:
- Must restrict to single-token outputs (multi-token labels won't work)
- Still have casing issues ("a" vs "A" are different tokens with different logprobs)
- Letter labels (A/B) work better than words because they're reliably single tokens
- The model has no space to reason—it's a snap judgment
- You're measuring confidence in pattern-matching, not in a reasoned conclusion
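In code, this baseline is just a one-token completion with logprobs enabled. Here is a minimal sketch, assuming the OpenAI chat completions API; the prompt, labels, and example message are illustrative:

```python
from openai import OpenAI
import math

client = OpenAI()

# Single-token baseline: restrict the answer to one letter and read its logprob.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Reply with only 'A' for yes or 'B' for no."},
        {"role": "user", "content": "Is this message about billing? 'Why was I charged twice?'"},
    ],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,  # see how much mass went to other tokens ("a", " A", "Yes", ...)
)

first_token = response.choices[0].logprobs.content[0]
print(first_token.token, math.exp(first_token.logprob))  # e.g. "A" 0.87
print([(alt.token, alt.logprob) for alt in first_token.top_logprobs])
```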
Both approaches share a fundamental limitation: they force the model to compress its entire decision process into a single output token, then try to extract meaning from that compressed signal.
Why Single-Token Classification Underperforms
The specific failure modes of these baseline patterns are worth understanding in detail.
When you prompt a model to output only true or false, you're asking it to compress its entire reasoning process into a single-token decision. The model doesn't have space to:
- Decompose ambiguous cases — Many classification tasks have edge cases where multiple factors interact. Without space to reason, the model pattern-matches on surface features.
- Resolve conflicting signals — Real inputs often contain evidence for multiple labels. The model needs to weigh these signals, but single-token output forces an immediate commitment.
- Self-correct — Chain-of-thought allows models to catch their own errors mid-reasoning. Single-token output bypasses this entirely.
The logprobs you extract from a single-token prediction reflect confidence in that immediate pattern-match, not confidence in a reasoned conclusion. For straightforward classification, this is fine. For nuanced tasks, you're measuring the wrong thing.
How Structured Logprobs Work
The power of this approach comes from two independent mechanisms working together: structured output constrains the model's behavior, while logprobs give you a probability distribution over that constrained space. Neither is sufficient alone.
Structured Output: Constraining the Decision Space
When you ask an LLM to "return true or false," you're hoping it interprets that instruction correctly. Sometimes it does. Sometimes you get "True", "TRUE", "yes", "The answer is true", or a multi-paragraph explanation followed by the answer buried somewhere in the middle.
Structured output eliminates this variance entirely. By enforcing a JSON schema, you're telling the model: "Your output must be valid JSON with exactly these fields." The model can't deviate, hedge, or produce malformed responses.
But the real value isn't just format consistency—it's forcing an execution path. When you define the schema as:
1{2 "reasoning": string, // Model MUST generate this first3 "prediction": boolean // Model generates this AFTER reasoning4}
You're not just asking for reasoning—you're requiring it as a prerequisite to the prediction. The model cannot emit the prediction field until it has completed the reasoning field. This is chain-of-thought by construction, not by hope.
This matters because:
- Ambiguous cases get decomposed — The model has space to work through conflicting signals before committing
- The reasoning is auditable — You can inspect why the model made a decision, not just what it decided
- Edge cases surface explicitly — When the model is uncertain, the reasoning often reveals the specific factor causing hesitation
Without structured output, you're relying on the model to voluntarily reason before answering. With structured output, reasoning is architecturally guaranteed.
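As a concrete sketch of how this schema might be declared in code, here is one option using a Pydantic model with the OpenAI Python SDK's beta parse helper (the raw response_format dict shown later in this post works just as well). Declaring reasoning before prediction mirrors the reasoning-first contract; whether field order is strictly preserved is worth verifying against your SDK and provider.

```python
from openai import OpenAI
from pydantic import BaseModel, Field

# "reasoning" is declared before "prediction" so the schema encodes the
# reason-first, decide-second ordering described above.
class Classification(BaseModel):
    reasoning: str = Field(description="Step-by-step analysis of the message")
    prediction: bool = Field(description="Final label, committed after reasoning")

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a classifier..."},
        {"role": "user", "content": "User message here"},
    ],
    response_format=Classification,  # the SDK converts this to a JSON schema
)

print(completion.choices[0].message.parsed)  # Classification(reasoning=..., prediction=...)
```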
How Type Constraints Work
Structured output APIs (like OpenAI's response_format: { type: "json_schema" }) don't just parse the output after generation—they constrain generation itself. At each token, the model's logits are masked to only allow tokens that could produce valid JSON conforming to your schema.
For a boolean field like prediction, this means:
- When the model reaches the prediction field, only tokens that can start "true" or "false" are permitted
- The model cannot output "yes", 1, "True", or any other representation
- The probability mass is redistributed across only the valid options
This has important implications for logprobs:
Pros:
- Cleaner probability space — For a boolean field, you get a true binary distribution. The logprob for true directly reflects P(true) vs P(false), not P(true) vs P(false) vs P(yes) vs P(1) vs P("True") vs ...
- No parsing ambiguity — You know exactly what token to extract logprobs from
- Forced valid output — The model cannot produce malformed responses that break downstream processing
Cons:
- Logprobs reflect constrained distribution — The logprob you see is P(true | must be true or false), not P(true | unconstrained). If the model "wanted" to output "yes" with high probability, that probability mass gets redistributed to true or false
- May mask model uncertainty — A model that would naturally hedge ("probably true") is forced into a binary choice, potentially inflating the apparent confidence
- Schema overhead — Complex schemas add tokens to the prompt and slightly increase latency
In practice, the cleaner probability space outweighs the cons for classification tasks. You're explicitly asking for a boolean decision—constraining the output to boolean values aligns the logprobs with what you're actually measuring.
Logprobs: Building a Probability Space
The second mechanism is extracting logprobs from the prediction field. This transforms classification from a binary output into a continuous probability distribution.
Why does this matter? Because binary outputs hide information. When a model outputs true, you don't know if it was 51% confident or 99% confident. Both look identical in the response. Logprobs expose the underlying probability mass the model assigned to each option.
This unlocks several capabilities that are impossible with binary outputs:
Threshold Tuning — Instead of accepting every prediction at face value, you can set confidence thresholds. "Only auto-approve predictions where logprob > -0.1" (roughly 90% confidence). This lets you trade off precision vs. recall based on your use case requirements.
Uncertainty Quantification — You can identify which inputs the model is genuinely uncertain about. A logprob of -0.7 (roughly 50/50) indicates the model is guessing. These cases can be routed to human review or a more capable model.
Calibration Analysis — By collecting predictions with their logprobs and comparing to ground truth, you can measure how well-calibrated the model is. If predictions with logprob -0.1 are only correct 70% of the time (not 90%), you know the model is overconfident and can adjust thresholds accordingly.
Prioritization — Logprobs let you prioritize review. Low-confidence predictions are most likely to be wrong—review those first. High-confidence predictions can wait or be auto-approved.
A/B Test Power — Confidence scores give you a richer signal for evaluating model changes. Two models might have the same accuracy, but if one is more confident on correct predictions and less confident on incorrect ones, it's meaningfully better—and logprobs reveal this.
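The calibration analysis above is straightforward to run offline. Here is a minimal sketch, assuming you have logged (confidence, was_correct) pairs for a labeled batch; the bucket count and sample data are illustrative:

```python
from collections import defaultdict

def calibration_report(records, n_buckets=10):
    """Group predictions by confidence and compare average confidence to accuracy."""
    buckets = defaultdict(list)
    for confidence, was_correct in records:
        idx = min(int(confidence * n_buckets), n_buckets - 1)
        buckets[idx].append((confidence, was_correct))

    for idx in sorted(buckets):
        rows = buckets[idx]
        avg_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        # Overconfident if avg_conf consistently exceeds accuracy
        print(f"bucket {idx}: avg_conf={avg_conf:.2f} accuracy={accuracy:.2f} n={len(rows)}")

# Illustrative data: (confidence, was_correct)
calibration_report([(0.96, True), (0.93, False), (0.91, True), (0.55, True), (0.52, False)])
```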
The key insight: logprobs turn your classifier into a probability estimator. You're no longer asking "is this true or false?" You're asking "what's the probability this is true?" That's a fundamentally more useful question.
The Combination: Confident Reasoning
Neither mechanism alone is sufficient:
- Structured output without logprobs — You get reliable formatting and forced reasoning, but no way to quantify confidence. Every prediction looks equally certain.
- Logprobs without structured output — You get confidence scores, but they measure confidence in a single-token snap judgment, not confidence in a reasoned conclusion. The model hasn't had space to think.
Together, you get the best of both: the model must reason through the problem (structured output), and you can measure how confident it is in its conclusion after that reasoning (logprobs on the prediction field).
This is why the accuracy gains are so dramatic on hard classification tasks. You're not just measuring "how confident is the model in its first instinct?" You're measuring "how confident is the model after working through the problem?"
Interactive Comparison: Baseline vs Structured Logprobs
The interactive demo below shows side-by-side comparisons of baseline single-token classification versus structured logprobs on the same inputs. Try different examples to see how structured logprobs improve accuracy on ambiguous and conflicting cases, and provide better confidence calibration even when both methods are correct.
The Structured Logprobs Pattern
The implementation is straightforward. Force the model to output structured JSON with a reasoning field before the prediction field, then extract logprobs specifically from the prediction token(s).
```text
[System]
You are a classifier. Determine if the user message is about {topic}.

Definitions:
- true: The message directly discusses {topic}
- false: The message is unrelated to {topic}

Return a JSON object with this exact schema:
{
  "reasoning": "Your step-by-step analysis of why this does or doesn't match",
  "prediction": true | false
}

[User]
{message}

[Assistant]
{
  "reasoning": "The user mentions 'valid words not accepted' which is a
                clear reference to the Spelling Bee game mechanics...",
  "prediction": true
}
                ↑
                extract logprob from this specific field
```
The key insight: the model generates reasoning tokens before it generates the prediction. By the time it commits to true or false, it has already worked through the problem. The logprob on that prediction reflects confidence after reasoning, not confidence in an immediate pattern-match.
Extracting and Converting Logprobs to Confidence
Once you have the structured response with logprobs, you need to extract the probability for the prediction field and convert it to a usable confidence score. Here's how to do it with the OpenAI API:
```python
from openai import OpenAI
import json
import math

client = OpenAI()

# Request structured output with logprobs
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a classifier..."},
        {"role": "user", "content": "User message here"},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "classification",
            "schema": {
                "type": "object",
                "properties": {
                    "reasoning": {"type": "string"},
                    "prediction": {"type": "boolean"},
                },
                "required": ["reasoning", "prediction"],
            },
        },
    },
    logprobs=True,
    top_logprobs=2,  # get both true/false probabilities when available
)

# Parse the JSON response
result = json.loads(response.choices[0].message.content)
prediction = result["prediction"]

# The logprobs are in the token-level content array
tokens_with_logprobs = response.choices[0].logprobs.content

# Find the token for the boolean value: look for a "true"/"false" token that
# appears right after the "prediction" key. Because field names can be split
# across tokens, check a joined window of the preceding tokens rather than any
# single token. (This assumes the boolean is emitted as its own token; see the
# token-alignment section below for the general case.)
prediction_token = None
for i, token_data in enumerate(tokens_with_logprobs):
    if token_data.token.strip() in ("true", "false"):
        window = "".join(t.token for t in tokens_with_logprobs[max(0, i - 4):i])
        if "prediction" in window:
            prediction_token = token_data
            break

if prediction_token is None:
    raise ValueError("Could not locate the prediction token in the logprobs")

# Logprob of the predicted value, plus the alternative from top_logprobs
logprob_predicted = prediction_token.logprob
logprob_alternative = None
for alt in prediction_token.top_logprobs:
    if alt.token != prediction_token.token and alt.token.strip() in ("true", "false"):
        logprob_alternative = alt.logprob
        break

# Calculate confidence using the formula:
# P(prediction) = exp(logprob_predicted) / (exp(logprob_true) + exp(logprob_false))
if logprob_alternative is not None:
    # Convert from log probabilities to probabilities and normalize
    prob_predicted = math.exp(logprob_predicted)
    prob_alternative = math.exp(logprob_alternative)
    confidence = prob_predicted / (prob_predicted + prob_alternative)
else:
    # Only one option appeared in top_logprobs; convert directly
    confidence = math.exp(logprob_predicted)

print(f"Prediction: {prediction}")
print(f"Confidence: {confidence:.2%}")
print(f"Logprob (predicted): {logprob_predicted:.4f}")
if logprob_alternative is not None:
    print(f"Logprob (alternative): {logprob_alternative:.4f}")
```
The key formula for converting logprobs to a confidence score is:
```text
P(true) = exp(logprob_true) / (exp(logprob_true) + exp(logprob_false))
```
This gives you a normalized probability between 0 and 1. For example:
- logprob_true = -0.1, logprob_false = -3.2 → confidence ≈ 0.96 (96% confident)
- logprob_true = -0.7, logprob_false = -0.7 → confidence = 0.50 (50/50, uncertain)
- logprob_true = -3.5, logprob_false = -0.05 → confidence ≈ 0.03 (3% confident in true, i.e., 97% confident in false)
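A tiny helper makes the arithmetic concrete; the calls below reproduce the examples above:

```python
import math

def confidence_from_logprobs(logprob_true: float, logprob_false: float) -> float:
    """Normalized P(true) given the logprobs of the two candidate tokens."""
    p_true = math.exp(logprob_true)
    p_false = math.exp(logprob_false)
    return p_true / (p_true + p_false)

print(round(confidence_from_logprobs(-0.1, -3.2), 2))   # 0.96
print(round(confidence_from_logprobs(-0.7, -0.7), 2))   # 0.5
print(round(confidence_from_logprobs(-3.5, -0.05), 2))  # 0.03
```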
With this confidence score, you can:
- Set thresholds: Only accept predictions above 0.90 confidence
- Route to review: Send predictions below 0.70 confidence to human reviewers
- Prioritize: Process low-confidence items first for verification
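A minimal sketch of that routing logic (the 0.90 and 0.70 cutoffs are the illustrative values from the bullets above, not recommendations):

```python
def route(prediction: bool, confidence: float) -> str:
    """Decide what to do with a classification based on its confidence score."""
    if confidence >= 0.90:
        return "auto_accept"
    if confidence >= 0.70:
        return "accept_and_log"
    return "human_review"

print(route(True, 0.96))  # auto_accept
print(route(True, 0.62))  # human_review
```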
Cost and Latency Analysis
Structured logprobs add overhead. Here's what we measured (note: we used GPT-5.1 Instant, which doesn't perform reasoning, so the structured approach may behave differently than with reasoning-capable models):
| Metric | GPT-4.1 | GPT-5.1 Instant |
|---|---|---|
| Added input tokens | ~400 | ~400 |
| Added output tokens | ~75 | ~75 |
| Added input cost | $0.0008/request | $0.0005/request |
| Added output cost | $0.0006/request | $0.00075/request |
| Total added cost | $0.0014/request | $0.00125/request |
| Added latency | +2.35s | +2.1s |
Latency Considerations
The +2 second latency is the harder constraint for many systems. This is driven by:
- Additional prompt tokens — The schema instructions add ~400 tokens of input
- Reasoning generation — The model produces ~50-100 reasoning tokens
- JSON formatting overhead — Structured output has slightly higher per-token latency
For synchronous user-facing classification, this may be prohibitive. For async batch processing, it's negligible. For mixed workloads, consider a tiered approach where you route to fast baseline for obvious cases and structured logprobs for ambiguous ones.
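Here is a minimal sketch of that tiered routing. The two classifier functions are hypothetical stand-ins for the baseline and structured pipelines shown earlier, stubbed so the routing logic runs on its own; the escalation threshold is illustrative:

```python
ESCALATION_THRESHOLD = 0.85  # illustrative; tune per workload

def classify_single_token(message: str) -> tuple[bool, float]:
    # Stub for the fast single-token baseline: returns (label, confidence)
    return True, 0.62

def classify_structured(message: str) -> tuple[bool, float]:
    # Stub for the structured-logprobs classifier: returns (label, confidence)
    return True, 0.94

def classify_tiered(message: str) -> tuple[bool, float, str]:
    """Use the cheap classifier when it's confident; escalate otherwise."""
    label, confidence = classify_single_token(message)
    if confidence >= ESCALATION_THRESHOLD:
        return label, confidence, "baseline"
    label, confidence = classify_structured(message)
    return label, confidence, "structured"

print(classify_tiered("Why was I charged twice this month?"))  # (True, 0.94, 'structured')
```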
Extracting Logprobs from Structured Output
When you request structured JSON output with logprobs enabled, the API returns:
- The complete JSON string
- An array of tokens with their individual logprobs
The challenge: mapping tokens back to specific field values. JSON tokenization doesn't align cleanly with semantic field boundaries.
The Token Alignment Problem
Consider this response:
1{ "reasoning": "The user asks about pricing", "prediction": true }
The API might tokenize this as:
1Token[0]: "{" logprob: -0.012Token[1]: "\"reason" logprob: -0.033Token[2]: "ing\":" logprob: -0.024Token[3]: "\"The" logprob: -0.285Token[4]: " user" logprob: -0.356Token[5]: " asks" logprob: -0.427Token[6]: " about" logprob: -0.388Token[7]: " pricing\"" logprob: -0.559Token[8]: ",\"pred" logprob: -0.0210Token[9]: "iction\":" logprob: -0.0111Token[10]: "true" logprob: -0.08 ← target12Token[11]: "}" logprob: -0.01
Notice the problems:
- Field names split across tokens ("\"reason" + "ing\":")
- Punctuation merged with content (" pricing\"" includes the closing quote)
- Structural tokens merged with field names (",\"pred")
We need Token[10] for the prediction logprob, but finding it requires careful parsing.
The Extraction Algorithm
The approach is conceptually simple:
```text
EXTRACT_FIELD_LOGPROBS(json_string, tokens, target_field):

    # Step 1: Build character-to-token mapping
    char_to_token = {}
    current_char = 0
    for each token in tokens:
        for i in range(len(token.text)):
            char_to_token[current_char + i] = token
        current_char += len(token.text)

    # Step 2: Find field value positions in JSON
    # (Parse JSON manually to track character positions)
    field_start, field_end = find_field_value_position(json_string, target_field)

    # Step 3: Collect tokens that overlap the field value
    field_tokens = []
    for char in range(field_start, field_end + 1):
        token = char_to_token[char]
        if token not in field_tokens:
            field_tokens.append(token)

    # Step 4: Sum logprobs for joint probability
    total_logprob = sum(t.logprob for t in field_tokens)

    return total_logprob
```
The key insight: you can't use standard JSON parsers because they discard positional information. You need to track where each value starts and ends in the original string, then map those character positions back to tokens.
For boolean fields like prediction, this usually yields a single token (true or false). For string fields like reasoning, you'll get multiple tokens, and the summed logprob represents the joint probability of that exact sequence.
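For a concrete version of that idea, here is a runnable sketch with hypothetical tokens. It builds the character-to-token map, locates the boolean value with a simplified scan (a stand-in for full position-tracking JSON parsing), and sums the logprobs of the overlapping tokens:

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    logprob: float

def field_logprob(json_string: str, tokens: list[Tok], field: str) -> float:
    """Sum the logprobs of the tokens that produced a boolean field's value."""
    # Step 1: map every character offset to the token that produced it
    char_to_token, pos = {}, 0
    for tok in tokens:
        for i in range(len(tok.text)):
            char_to_token[pos + i] = tok
        pos += len(tok.text)

    # Step 2: locate the boolean value after the field name (simplified scan;
    # a robust version would track positions while parsing the JSON)
    key_end = json_string.index(f'"{field}"') + len(f'"{field}"')
    value_start = json_string.index(":", key_end) + 1
    while json_string[value_start] == " ":
        value_start += 1
    value_len = 4 if json_string.startswith("true", value_start) else 5
    value_end = value_start + value_len - 1

    # Step 3: sum logprobs of the distinct tokens overlapping the value
    seen, total = set(), 0.0
    for c in range(value_start, value_end + 1):
        tok = char_to_token[c]
        if id(tok) not in seen:  # id() because these dataclass instances aren't hashable
            seen.add(id(tok))
            total += tok.logprob
    return total

# Hypothetical tokenization of the response shown above
json_str = '{"reasoning":"The user asks about pricing","prediction":true}'
toks = [
    Tok('{"reason', -0.01), Tok('ing":"The user asks about pricing"', -0.90),
    Tok(',"pred', -0.02), Tok('iction":', -0.01), Tok('true', -0.08), Tok('}', -0.01),
]
print(field_logprob(json_str, toks, "prediction"))  # -0.08
```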
Conclusion
Structured logprobs represent a meaningful improvement over baseline LLM classification for complex tasks. The technique leverages two complementary mechanisms:
Structured output constrains the model to reason before predicting. This isn't optional reasoning—it's architecturally enforced. The model cannot emit a prediction until it has completed the reasoning field.
Logprobs transform binary classification into probability estimation. Instead of "true or false," you get "85% confident true." This enables threshold tuning, uncertainty quantification, calibration analysis, and intelligent routing.
Together, these yield meaningful accuracy improvements on hard classification problems while giving you the operational levers to optimize the precision-recall tradeoff for your specific use case.
The tradeoff is latency. The additional reasoning tokens add ~2 seconds per request, which rules out real-time user-facing classification or anything requiring sub-100ms response times. For simple, clear-cut decisions where baseline accuracy is already high, the added complexity may not be worth it.
But for high-stakes classification—compliance, safety, customer impact—or tasks with complex decision boundaries that require multi-factor reasoning, this pattern pays for itself quickly. The confidence scores let you route uncertain predictions to human review, catch adversarial inputs that might game simpler classifiers, and continuously calibrate your thresholds based on observed accuracy.
If accuracy matters more than latency, this is the better way to do LLM classification.
Cite this post
Cole Hoffer. (Dec 2025). Structured Logprobs for Better Classification. Cole Hoffer. https://www.colehoffer.ai/articles/structured-logprobs-for-classification