Most LLM classification setups look something like this: send a prompt with label definitions, request a single-token true/false response, extract the logprob (the log of the probability the model assigned to that token), and convert it to a confidence score. It works. It's fast. But there's a substantial accuracy gap between what this approach achieves and what the model is capable of.

We've been running experiments with a structured variant that forces chain-of-thought reasoning before the prediction, then extracts logprobs specifically from the prediction field. The accuracy improvements were significant enough that we've moved most of our classifiers to this pattern.

This post walks through the technique, the tradeoffs, and the implementation details.

Common Baseline Approaches

Before getting into why structured logprobs work better, it's worth cataloging what most LLM classification implementations look like today.

String Matching

The simplest approach: prompt the model to output a label, then string-match against expected values.

Prompt: "Is this message about billing? Reply with 'yes' or 'no'."
Output: "Yes, this appears to be about billing."

You extract "Yes" (or try to), normalize casing, and map to your label. If no match, fall back to a default.

The problems compound quickly:

Models don't reliably output just the label—they hedge, explain, or use synonyms
Casing varies ("Yes", "YES", "yes")
Synonyms slip in ("correct", "affirmative", "true")
Partial matches are ambiguous ("not yes" contains "yes")
No confidence score—every match looks equally certain

You end up writing increasingly complex parsing logic, and the default fallback gets hit more than you'd like.
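To make those failure modes concrete, here is a minimal sketch of the kind of parser this approach tends to grow into. The function name `parse_label` and the synonym sets are hypothetical, made up for illustration rather than taken from any real codebase.

```python
import re

# Hypothetical synonym sets; in practice these lists keep growing.
POSITIVE = {"yes", "true", "correct", "affirmative"}
NEGATIVE = {"no", "false", "incorrect", "negative"}


def parse_label(output: str, default: bool = False) -> bool:
    """Map free-form model output to a boolean label by string matching."""
    words = re.findall(r"[a-z]+", output.lower())
    for word in words:
        if word in POSITIVE:
            return True
        if word in NEGATIVE:
            return False
    return default  # Silent fallback: the caller cannot tell this was a guess.


print(parse_label("Yes, this appears to be about billing."))  # True
print(parse_label("Not yes."))                                # True, although the model meant no
print(parse_label("It could be, hard to say."))               # False, via the default
```

Every return value is a bare boolean, so the clean match, the misread negation, and the fallback are indistinguishable downstream.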

Single-Token Logprobs

A more sophisticated approach: restrict output to a single token and extract logprobs.

Prompt: "Is this about billing? Reply with only 'A' for yes or 'B' for no."
Output: "A"
Logprob for "A": -0.15

This gets you confidence scores, but introduces new constraints:

Must restrict to single-token outputs (multi-token labels won't work)
Still have casing issues ("a" vs "A" are different tokens with different logprobs)
Letter labels (A/B) work better than words because they're reliably single tokens
The model has no space to reason—it's a snap judgment
You're measuring confidence in pattern-matching, not in a reasoned conclusion
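For reference, turning a single logprob into a probability is one exponentiation. A quick sketch using the -0.15 from the example above:

```python
import math

logprob_a = -0.15  # logprob reported for the "A" token in the example above
prob_a = math.exp(logprob_a)

print(f"P('A') = {prob_a:.2f}")  # P('A') = 0.86
```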

Both approaches share a fundamental limitation: they force the model to compress its entire decision process into a single output token, then try to extract meaning from that compressed signal.

Why Single-Token Classification Underperforms

The specific failure modes of these baseline patterns are worth understanding in detail.

When you prompt a model to output only true or false, you're asking it to compress its entire reasoning process into a single token decision. The model doesn't have space to:

Decompose ambiguous cases — Many classification tasks have edge cases where multiple factors interact. Without space to reason, the model pattern-matches on surface features.
Resolve conflicting signals — Real inputs often contain evidence for multiple labels. The model needs to weigh these signals, but single-token output forces an immediate commitment.
Self-correct — Chain-of-thought allows models to catch their own errors mid-reasoning. Single-token output bypasses this entirely.

The logprobs you extract from a single-token prediction reflect confidence in that immediate pattern-match, not confidence in a reasoned conclusion. For straightforward classification, this is fine. For nuanced tasks, you're measuring the wrong thing.

How Structured Logprobs Work

The power of this approach comes from two independent mechanisms working together: structured output constrains the model's behavior, while logprobs give you a probability distribution over that constrained space. Neither is sufficient alone.

Structured Output: Constraining the Decision Space

When you ask an LLM to "return true or false," you're hoping it interprets that instruction correctly. Sometimes it does. Sometimes you get "True", "TRUE", "yes", "The answer is true", or a multi-paragraph explanation followed by the answer buried somewhere in the middle.

Structured output eliminates this variance entirely. By enforcing a JSON schema, you're telling the model: "Your output must be valid JSON with exactly these fields." The model can't deviate, hedge, or produce malformed responses.

But the real value isn't just format consistency—it's forcing an execution path. When you define the schema as:

{
  "reasoning": string,   // Model MUST generate this first
  "prediction": boolean  // Model generates this AFTER reasoning
}

You're not just asking for reasoning; you're requiring it as a prerequisite to the prediction. With APIs that emit fields in the order the schema declares them (OpenAI's Structured Outputs does), the model cannot emit the prediction field until it has completed the reasoning field. This is chain-of-thought by construction, not by hope.

This matters because:

Ambiguous cases get decomposed — The model has space to work through conflicting signals before committing
The reasoning is auditable — You can inspect why the model made a decision, not just what it decided
Edge cases surface explicitly — When the model is uncertain, the reasoning often reveals the specific factor causing hesitation

Without structured output, you're relying on the model to voluntarily reason before answering. With structured output, reasoning is architecturally guaranteed.

How Type Constraints Work

Structured output APIs (like OpenAI's response_format: { type: "json_schema" }) don't just parse the output after generation—they constrain generation itself. At each token, the model's logits are masked to only allow tokens that could produce valid JSON conforming to your schema.

For a boolean field like prediction, this means:

When the model reaches the prediction field, only tokens that can begin the literals true or false are permitted
The model cannot output "yes", 1, "True", or any other representation
The probability mass is redistributed across only the valid options

This has important implications for logprobs:

Pros:

Cleaner probability space — For a boolean field, you get a true binary distribution. The logprob for true directly reflects P(true) vs P(false), not P(true) vs P(false) vs P(yes) vs P(1) vs P("True") vs ...
No parsing ambiguity — You know exactly what token to extract logprobs from
Forced valid output — The model cannot produce malformed responses that break downstream processing

Cons:

Logprobs reflect constrained distribution — The logprob you see is P(true | must be true or false), not P(true | unconstrained). If the model "wanted" to output "yes" with high probability, that probability mass gets redistributed to true or false
May mask model uncertainty — A model that would naturally hedge ("probably true") is forced into a binary choice, potentially inflating the apparent confidence
Schema overhead — Complex schemas add tokens to the prompt and slightly increase latency

In practice, the cleaner probability space outweighs the cons for classification tasks. You're explicitly asking for a boolean decision—constraining the output to boolean values aligns the logprobs with what you're actually measuring.
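The masking step can be illustrated with a toy example. This is a sketch of constrained decoding in general, not of any provider's actual implementation, and the vocabulary and logit values are invented for illustration.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert raw logits to probabilities."""
    peak = max(logits.values())
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def constrained_softmax(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Drop disallowed tokens, then renormalize over what remains."""
    return softmax({tok: value for tok, value in logits.items() if tok in allowed})


# Invented logits for the position right after the "prediction" key.
logits = {"true": 2.0, "false": 0.5, '"yes"': 2.5, "1": -1.0}

unconstrained = softmax(logits)
constrained = constrained_softmax(logits, allowed={"true", "false"})

print(f"P(true), unconstrained:  {unconstrained['true']:.2f}")  # 0.34
print(f"P(true | true or false): {constrained['true']:.2f}")    # 0.82
```

In this toy case the model's preferred token was "yes", and masking it moves that probability mass onto true and false. The ratio between true and false is unchanged, which is why the constrained logprob is still a usable signal for a boolean decision.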

Logprobs: Building a Probability Space

The second mechanism is extracting logprobs from the prediction field. This transforms classification from a binary output into a continuous probability distribution.

Why does this matter? Because binary outputs hide information. When a model outputs true, you don't know if it was 51% confident or 99% confident. Both look identical in the response. Logprobs expose the underlying probability mass the model assigned to each option.

This unlocks several capabilities that are impossible with binary outputs:

Threshold Tuning — Instead of accepting every prediction at face value, you can set confidence thresholds. "Only auto-approve predictions where logprob > -0.1" (roughly 90% confidence). This lets you trade off precision vs. recall based on your use case requirements.

Uncertainty Quantification — You can identify which inputs the model is genuinely uncertain about. A logprob of -0.7 (roughly 50/50) indicates the model is guessing. These cases can be routed to human review or a more capable model.

Calibration Analysis — By collecting predictions with their logprobs and comparing to ground truth, you can measure how well-calibrated the model is. If predictions with logprob -0.1 are only correct 70% of the time (not 90%), you know the model is overconfident and can adjust thresholds accordingly.

Prioritization — Logprobs let you prioritize review. Low-confidence predictions are most likely to be wrong—review those first. High-confidence predictions can wait or be auto-approved.

A/B Test Power — Confidence scores give you a richer signal for evaluating model changes. Two models might have the same accuracy, but if one is more confident on correct predictions and less confident on incorrect ones, it's meaningfully better—and logprobs reveal this.

The key insight: logprobs turn your classifier into a probability estimator. You're no longer asking "is this true or false?" You're asking "what's the probability this is true?" That's a fundamentally more useful question.
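Calibration analysis in particular is easy to sketch. The snippet below bins predictions by confidence and compares mean confidence with observed accuracy in each bin. The `calibration_table` helper and the evaluation records are invented for illustration; a real analysis would need far more data per bin.

```python
from collections import defaultdict


def calibration_table(records, n_bins=5):
    """Group (confidence, was_correct) pairs into equal-width confidence bins.

    Returns rows of (bin_low, bin_high, mean_confidence, accuracy, count).
    """
    bins = defaultdict(list)
    for confidence, was_correct in records:
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, was_correct))

    rows = []
    for index in sorted(bins):
        items = bins[index]
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((index / n_bins, (index + 1) / n_bins, mean_conf, accuracy, len(items)))
    return rows


# Invented data: (confidence in the predicted label, whether the prediction was correct).
records = [
    (0.95, True), (0.93, True), (0.91, False), (0.97, True),
    (0.72, True), (0.68, False), (0.65, True), (0.55, False),
]

for low, high, mean_conf, accuracy, count in calibration_table(records):
    print(f"[{low:.1f}, {high:.1f}): mean confidence {mean_conf:.2f}, accuracy {accuracy:.2f}, n={count}")
```

In this made-up sample the top bin has mean confidence 0.94 but accuracy 0.75, which is the overconfidence pattern described above and a signal to raise the auto-approve threshold.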

The Combination: Confident Reasoning

Neither mechanism alone is sufficient:

Structured output without logprobs — You get reliable formatting and forced reasoning, but no way to quantify confidence. Every prediction looks equally certain.
Logprobs without structured output — You get confidence scores, but they measure confidence in a single-token snap judgment, not confidence in a reasoned conclusion. The model hasn't had space to think.

Together, you get the best of both: the model must reason through the problem (structured output), and you can measure how confident it is in its conclusion after that reasoning (logprobs on the prediction field).

This is why the accuracy gains are so dramatic on hard classification tasks. You're not just measuring "how confident is the model in its first instinct?" You're measuring "how confident is the model after working through the problem?"

Interactive Comparison: Baseline vs Structured Logprobs

The example below compares baseline single-token classification with structured logprobs on the same input. Both methods get this easy case right; the differences are the reasoning trace and the reported confidence.

Input:

I can't log into my account, the password reset isn't working

Topic: authentication issues

Ground Truth: true

Difficulty: Easy

Baseline (Single-Token)

Prediction: true

Logprob (predicted): -0.120

Logprob (alternative): -2.800

Confidence: 94%

No reasoning provided — single-token snap judgment

Structured Logprobs

Prediction: true

Reasoning:

The message explicitly mentions 'log into my account' and 'password reset isn't working', which are clear indicators of authentication problems. This directly matches the topic of authentication issues.

Logprob (predicted): -0.080

Logprob (alternative): -3.100

Confidence: 97%

Higher confidence on a correct prediction

Both methods are correct, but structured logprobs reports higher confidence (97% vs 94%) after reasoning through the problem. A single example can't establish calibration; that requires comparing confidence against accuracy over a labeled set.

How it works: The baseline method outputs a single token (true/false) with no reasoning, while structured logprobs forces the model to reason through the problem before making a prediction. The reasoning field makes the decision process transparent and auditable.

The Structured Logprobs Pattern

The implementation is straightforward. Force the model to output structured JSON with a reasoning field before the prediction field, then extract logprobs specifically from the prediction token(s).

[System]
You are a classifier. Determine if the user message is about {topic}.

Definitions:
- true: The message directly discusses {topic}
- false: The message is unrelated to {topic}

Return a JSON object with this exact schema:
{
  "reasoning": "Your step-by-step analysis of why this does or doesn't match",
  "prediction": true | false
}

[User]
{message}

[Assistant]
{
  "reasoning": "The user mentions 'valid words not accepted' which is a
               clear reference to the Spelling Bee game mechanics...",
  "prediction": true
}
               ↑
          extract logprob from this specific field

The key insight: the model generates reasoning tokens before it generates the prediction. By the time it commits to true or false, it has already worked through the problem. The logprob on that prediction reflects confidence after reasoning, not confidence in an immediate pattern-match.
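One way to assemble the prompt in code is sketched below. `SYSTEM_TEMPLATE` and `build_messages` are hypothetical names, and the template paraphrases the system prompt above rather than reproducing it exactly.

```python
SYSTEM_TEMPLATE = """You are a classifier. Determine if the user message is about {topic}.

Definitions:
- true: The message directly discusses {topic}
- false: The message is unrelated to {topic}

Return a JSON object with "reasoning" (your step-by-step analysis) followed by "prediction" (true or false)."""


def build_messages(topic: str, message: str) -> list[dict[str, str]]:
    """Fill the system template and pair it with the user message."""
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE.format(topic=topic)},
        # The user message is passed through untouched, so braces in it are safe.
        {"role": "user", "content": message},
    ]


messages = build_messages("billing", "Why was I charged {twice}?")
print(messages[0]["content"].splitlines()[0])
```

Only the system template goes through `str.format`, so user text containing `{` or `}` can't break the formatting step.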

Extracting and Converting Logprobs to Confidence

Once you have the structured response with logprobs, you need to extract the probability for the prediction field and convert it to a usable confidence score. Here's how to do it with the OpenAI API:

from openai import OpenAI
import json
import math

client = OpenAI()

# Request structured output with logprobs
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a classifier..."},
        {"role": "user", "content": "User message here"}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "classification",
            "strict": True,  # Enforce the schema during generation
            "schema": {
                "type": "object",
                "properties": {
                    "reasoning": {"type": "string"},
                    "prediction": {"type": "boolean"}
                },
                "required": ["reasoning", "prediction"],
                "additionalProperties": False  # Required by strict mode
            }
        }
    },
    logprobs=True,
    top_logprobs=2  # Get top 2 to see both true/false probabilities
)

# Parse the JSON response
result = json.loads(response.choices[0].message.content)
prediction = result["prediction"]

# The logprobs are in the token-level content array
tokens_with_logprobs = response.choices[0].logprobs.content

# Find the token holding the boolean value. "prediction" is the last field
# in the schema, so scan backwards: the last true/false token is the field
# value, and any "true"/"false" inside the reasoning text is never reached.
# Matching on the key name is unreliable because the key itself may be
# split across tokens (e.g. "pred" + "iction").
prediction_token = None
for token_data in reversed(tokens_with_logprobs):
    if token_data.token.strip() in ("true", "false"):
        prediction_token = token_data
        break

if prediction_token is None:
    raise ValueError("Could not locate the prediction token in the logprobs")

logprob_predicted = prediction_token.logprob
logprob_alternative = None

# Get the alternative probability from top_logprobs
predicted_text = prediction_token.token.strip()
for alt in prediction_token.top_logprobs:
    alt_text = alt.token.strip()
    if alt_text != predicted_text and alt_text in ("true", "false"):
        logprob_alternative = alt.logprob
        break

# Calculate confidence in the predicted label:
# P(predicted) = exp(lp_predicted) / (exp(lp_predicted) + exp(lp_alternative))
if logprob_alternative is not None:
    prob_predicted = math.exp(logprob_predicted)
    prob_alternative = math.exp(logprob_alternative)
    confidence = prob_predicted / (prob_predicted + prob_alternative)
else:
    # The alternative was not in top_logprobs; fall back to the raw probability
    confidence = math.exp(logprob_predicted)

print(f"Prediction: {prediction}")
print(f"Confidence: {confidence:.2%}")
print(f"Logprob (predicted): {logprob_predicted:.4f}")
if logprob_alternative is not None:
    print(f"Logprob (alternative): {logprob_alternative:.4f}")
else:
    print("Logprob (alternative): N/A")

The key formula for converting logprobs to a confidence score is:

P(true) = exp(logprob_true) / (exp(logprob_true) + exp(logprob_false))

This gives you a normalized probability between 0 and 1. For example:

logprob_true = -0.1, logprob_false = -3.2 → confidence ≈ 0.96 (96% confident)
logprob_true = -0.7, logprob_false = -0.7 → confidence = 0.50 (50/50, uncertain)
logprob_true = -3.5, logprob_false = -0.05 → confidence ≈ 0.03 (3% confident in true, i.e., 97% confident in false)
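The three examples above can be reproduced with a few lines of Python. `confidence_true` is a hypothetical helper name; subtracting the larger logprob before exponentiating is a standard numerical-stability trick and doesn't change the result.

```python
import math


def confidence_true(logprob_true: float, logprob_false: float) -> float:
    """Normalized P(true) from the logprobs of the two boolean tokens."""
    # Shift by the larger logprob so exp() cannot underflow to 0/0.
    peak = max(logprob_true, logprob_false)
    p_true = math.exp(logprob_true - peak)
    p_false = math.exp(logprob_false - peak)
    return p_true / (p_true + p_false)


for lp_true, lp_false in [(-0.1, -3.2), (-0.7, -0.7), (-3.5, -0.05)]:
    print(f"{lp_true:>5} {lp_false:>5} -> {confidence_true(lp_true, lp_false):.2f}")
# Prints 0.96, 0.50, and 0.03 in the last column.
```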

With this confidence score, you can:

Set thresholds: Only accept predictions above 0.90 confidence
Route to review: Send predictions below 0.70 confidence to human reviewers
Prioritize: Process low-confidence items first for verification
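A minimal sketch of how those three uses might fit together, assuming `confidence` means confidence in the predicted label. The 0.90 and 0.70 cutoffs are the ones listed above; the `route` function, the "hold" middle band, and the sample scores are invented for illustration.

```python
def route(confidence: float, accept_at: float = 0.90, review_below: float = 0.70) -> str:
    """Pick a handling path from the confidence in the predicted label."""
    if confidence >= accept_at:
        return "accept"
    if confidence < review_below:
        return "human_review"
    return "hold"  # Hypothetical middle band; what happens here is up to you.


# Invented (item_id, confidence) pairs.
scored = [("a", 0.97), ("b", 0.62), ("c", 0.81), ("d", 0.55)]

# Lowest confidence first, so the likeliest errors get verified first.
review_queue = sorted(scored, key=lambda item: item[1])

for item_id, confidence in review_queue:
    print(item_id, confidence, route(confidence))
```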

Cost and Latency Analysis

Structured logprobs add overhead. Here's what we measured (note: we used GPT-5.1 Instant, which doesn't perform reasoning, so the structured approach may behave differently than with reasoning-capable models):

Metric	GPT-4.1	GPT-5.1 Instant
Added input tokens	~400	~400
Added output tokens	~75	~75
Added input cost	$0.0008/request	$0.0005/request
Added output cost	$0.0006/request	$0.00075/request
Total added cost	$0.0014/request	$0.00125/request
Added latency	+2.35s	+2.1s
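As a sanity check on the table, the per-request figures can be turned back into per-million-token rates and scaled to a monthly volume. The one million requests per month below is a hypothetical volume chosen only to make the arithmetic concrete.

```python
ADDED_INPUT_TOKENS = 400
ADDED_OUTPUT_TOKENS = 75
REQUESTS_PER_MONTH = 1_000_000  # Hypothetical volume, for illustration only.

# Added cost per request, copied from the table above.
reported = {
    "GPT-4.1": {"input": 0.0008, "output": 0.0006},
    "GPT-5.1 Instant": {"input": 0.0005, "output": 0.00075},
}

summary = {}
for model, cost in reported.items():
    per_request = cost["input"] + cost["output"]
    summary[model] = {
        "per_request": per_request,
        # Rates implied by the table, in dollars per million tokens.
        "input_rate": cost["input"] / ADDED_INPUT_TOKENS * 1_000_000,
        "output_rate": cost["output"] / ADDED_OUTPUT_TOKENS * 1_000_000,
        "monthly": per_request * REQUESTS_PER_MONTH,
    }

for model, row in summary.items():
    print(f"{model}: ${row['per_request']:.5f}/request, "
          f"implied ${row['input_rate']:.2f}/M input and ${row['output_rate']:.2f}/M output, "
          f"${row['monthly']:,.0f}/month")
```

At that volume the added cost works out to roughly $1,400 per month for GPT-4.1 and $1,250 for GPT-5.1 Instant.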

Latency Considerations

The +2 second latency is the harder constraint for many systems. This is driven by:

Additional prompt tokens — The schema instructions add ~400 tokens of input
Reasoning generation — The model produces ~50-100 reasoning tokens
JSON formatting overhead — Structured output has slightly higher per-token latency

For synchronous user-facing classification, this may be prohibitive. For async batch processing, it's negligible. For mixed workloads, consider a tiered approach where you route to fast baseline for obvious cases and structured logprobs for ambiguous ones.
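A tiered setup could look roughly like the sketch below. The `classify_tiered` function, the 0.90 escalation threshold, and the two stand-in classifiers are all hypothetical; in a real system they would wrap the baseline and structured API calls.

```python
from typing import Callable, Tuple

# A classifier takes a message and returns (prediction, confidence in that prediction).
Classifier = Callable[[str], Tuple[bool, float]]


def classify_tiered(message: str, fast: Classifier, slow: Classifier,
                    escalate_below: float = 0.90) -> Tuple[bool, float, str]:
    """Try the fast baseline first; escalate only when it is not confident."""
    prediction, confidence = fast(message)
    if confidence >= escalate_below:
        return prediction, confidence, "baseline"
    prediction, confidence = slow(message)
    return prediction, confidence, "structured"


# Stand-in classifiers with hard-coded outputs, for illustration only.
def fake_baseline(message: str) -> Tuple[bool, float]:
    return (True, 0.97) if "password" in message else (True, 0.60)


def fake_structured(message: str) -> Tuple[bool, float]:
    return (False, 0.88)


print(classify_tiered("password reset isn't working", fake_baseline, fake_structured))
print(classify_tiered("my account looks odd", fake_baseline, fake_structured))
```

Only escalated requests pay the extra latency. The catch is that the routing decision relies on the baseline's confidence, so it is worth checking that the baseline is reasonably calibrated before trusting it as a gate.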

Extracting Logprobs from Structured Output

When you request structured JSON output with logprobs enabled, the API returns:

The complete JSON string
An array of tokens with their individual logprobs

The challenge: mapping tokens back to specific field values. JSON tokenization doesn't align cleanly with semantic field boundaries.

The Token Alignment Problem

Consider this response:

{ "reasoning": "The user asks about pricing", "prediction": true }

The API might tokenize this as:

Token[0]:  "{"              logprob: -0.01
Token[1]:  "\"reason"       logprob: -0.03
Token[2]:  "ing\":"         logprob: -0.02
Token[3]:  "\"The"          logprob: -0.28
Token[4]:  " user"          logprob: -0.35
Token[5]:  " asks"          logprob: -0.42
Token[6]:  " about"         logprob: -0.38
Token[7]:  " pricing\""     logprob: -0.55
Token[8]:  ",\"pred"        logprob: -0.02
Token[9]:  "iction\":"      logprob: -0.01
Token[10]: "true"           logprob: -0.08  ← target
Token[11]: "}"              logprob: -0.01

Notice the problems:

Field names split across tokens ("\"reason" + "ing\":")
Punctuation merged with content (" pricing\"" includes the closing quote)
Structural tokens merged with field names (",\"pred")

We need Token[10] for the prediction logprob, but finding it requires careful parsing.

The Extraction Algorithm

The approach is conceptually simple:

EXTRACT_FIELD_LOGPROBS(json_string, tokens, target_field):

    # Step 1: Build character-to-token mapping (store token indices)
    char_to_token = {}
    current_char = 0
    for each (index, token) in enumerate(tokens):
        for i in range(len(token.text)):
            char_to_token[current_char + i] = index
        current_char += len(token.text)

    # Step 2: Find field value positions in JSON
    # (Parse JSON manually to track character positions)
    field_start, field_end = find_field_value_position(json_string, target_field)

    # Step 3: Collect tokens that overlap the field value
    # (indices, so two identical tokens are never mistaken for one)
    field_tokens = []
    for char in range(field_start, field_end + 1):
        index = char_to_token[char]
        if index not in field_tokens:
            field_tokens.append(index)

    # Step 4: Sum logprobs for joint probability
    total_logprob = sum(tokens[index].logprob for index in field_tokens)

    return total_logprob

The key insight: you can't use standard JSON parsers because they discard positional information. You need to track where each value starts and ends in the original string, then map those character positions back to tokens.

For boolean fields like prediction, this usually yields a single token (true or false). For string fields like reasoning, you'll get multiple tokens, and the summed logprob represents the joint probability of that exact sequence.
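Here is a runnable sketch of the same idea in Python, limited to boolean fields and run against the token listing from the example above. The `Token` dataclass, `find_field_value_span`, and `field_logprob` are hypothetical names, and the regex lookup is a simplification that stands in for a proper position-tracking JSON parser.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    logprob: float


def find_field_value_span(json_string: str, field: str) -> tuple[int, int]:
    """Return (start, end) character offsets, end exclusive, of a boolean field's value."""
    pattern = re.compile(r'"' + re.escape(field) + r'"\s*:\s*(true|false)')
    matches = list(pattern.finditer(json_string))
    if not matches:
        raise ValueError(f"boolean field {field!r} not found")
    # Take the last match as a precaution, assuming the target field comes last.
    return matches[-1].span(1)


def field_logprob(tokens: list[Token], field: str) -> float:
    """Sum the logprobs of every token that overlaps the field's value."""
    json_string = "".join(token.text for token in tokens)
    start, end = find_field_value_span(json_string, field)

    total, offset = 0.0, 0
    for token in tokens:
        token_start, token_end = offset, offset + len(token.text)
        if token_start < end and token_end > start:  # Token overlaps the value.
            total += token.logprob
        offset = token_end
    return total


# The tokenization from the example above.
tokens = [
    Token("{", -0.01), Token('"reason', -0.03), Token('ing":', -0.02),
    Token('"The', -0.28), Token(" user", -0.35), Token(" asks", -0.42),
    Token(" about", -0.38), Token(' pricing"', -0.55), Token(',"pred', -0.02),
    Token('iction":', -0.01), Token("true", -0.08), Token("}", -0.01),
]

print(field_logprob(tokens, "prediction"))  # -0.08
```

Walking the tokens with a running offset avoids building a per-character dictionary. One caveat: if the tokenizer merges the value with neighboring punctuation (say, `true}` as one token), the summed logprob covers that whole token, not just the value.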

Conclusion

Structured logprobs represent a meaningful improvement over baseline LLM classification for complex tasks. The technique leverages two complementary mechanisms:

Structured output constrains the model to reason before predicting. This isn't optional reasoning—it's architecturally enforced. The model cannot emit a prediction until it has completed the reasoning field.

Logprobs transform binary classification into probability estimation. Instead of "true or false," you get "85% confident true." This enables threshold tuning, uncertainty quantification, calibration analysis, and intelligent routing.

Together, these yield meaningful accuracy improvements on hard classification problems while giving you the operational levers to optimize the precision-recall tradeoff for your specific use case.

The tradeoff is latency. The additional reasoning tokens add ~2 seconds per request, which rules out real-time user-facing classification or anything requiring sub-100ms response times. For simple, clear-cut decisions where baseline accuracy is already high, the added complexity may not be worth it.

But for high-stakes classification—compliance, safety, customer impact—or tasks with complex decision boundaries that require multi-factor reasoning, this pattern pays for itself quickly. The confidence scores let you route uncertain predictions to human review, catch adversarial inputs that might game simpler classifiers, and continuously calibrate your thresholds based on observed accuracy.

If accuracy matters more than latency, this is the better way to do LLM classification.