
Managing Tokens and Context Windows for Large Language Models

An exploration of token management techniques in Generative AI applications

When building Generative AI applications, you’re not just juggling CPU/GPU and memory; you’re also budgeting and managing tokens, whether you recognize it or not: input, output, and (on reasoning models) thinking tokens. Ignoring this, or getting it wrong, means truncated model outputs, latency spikes, unpredictable costs, and very frustrated users. Conversely, getting it right means reliability, speed, and predictable costs, which in turn increase user satisfaction, stakeholders' trust, and your team's quality of life.

Below, I'll discuss:

  • Some definitions, e.g., what even are "tokens" and what is a "context window"
  • Why token management matters (even with million-token context windows)
  • A pragmatic playbook for token and context management

TL;DR

Treat tokens like CPU cycles: budget them, cap them, watch them. Dynamic budgeting + prompt caching + structured outputs = reliable UX and predictable bills and performance.

1. What are “tokens”?

LLMs don’t read “words”; they read groups of characters, including spaces and punctuation, called tokens. A rough rule for how characters map to tokens in English is ~1 token ≈ 4 characters ≈ ¾ of a word. That’s an approximation; real counts vary by model family and language (e.g., see Hugging Face Tokenizer docs).
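
That rule of thumb can be sketched as a quick estimator (a heuristic only; use the model's actual tokenizer for anything billing-grade):

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text token estimate using the ~4 chars/token rule."""
    # Heuristic only: real counts vary by tokenizer, language, and content.
    return max(1, round(len(text) / 4))

sentence = "Managing tokens is budgeting, not guesswork."
print(estimate_tokens(sentence))  # 11 for this 44-character sentence
```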

Different model providers tokenize (turn text into tokens) differently (OpenAI’s tiktoken vs. SentencePiece variants, etc.). If you’re building multi-model pipelines, don’t assume the same string or word has the same token count everywhere.

Input vs. output vs. reasoning tokens

Input/Prompt tokens: what you send to the model; they’re called “input” because, from the model’s perspective, you’re feeding it those tokens.

Output/Completion tokens: what the model returns (the visible answer).

Reasoning tokens: hidden “thinking” steps used by reasoning models. They show up in usage metadata and count toward billing. Azure/OpenAI surface these under completion_tokens_details.reasoning_tokens. (OpenAI usage)

Context windows vs. output caps

The context window for a model represents the total token capacity for the model, including input and output tokens.

context window = input tokens + output tokens

Reasoning tokens are tracked separately in usage and, on most providers, count toward the completion/output budget; if your max_output_tokens is too low, reasoning can consume it and leave little (or no) visible text. Plan headroom. (Microsoft Learn)
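
A quick worked example of the arithmetic (the numbers are illustrative, not any particular model's limits): even a generous window leaves little room for visible output once a large prompt and reasoning headroom are subtracted.

```python
# Illustrative numbers only; check your model card for real limits.
context_window = 128_000      # total capacity: input + output (+ reasoning)
input_tokens = 120_000        # a very large prompt
reasoning_headroom = 4_000    # reserved for hidden reasoning on thinking models

room_for_visible_output = context_window - input_tokens - reasoning_headroom
print(room_for_visible_output)  # 4000 tokens left for the visible answer
```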

The context window and output caps are fundamentally separate limits. A model may accept very large inputs but cap generated output much lower. Model vendors often headline huge context windows but smaller output caps.

  • Gemini 1.5 Pro/Flash: about 1,000,000 input tokens, about 8,000 output tokens (Google documents these as approximate limits). (Google Gemini tokens)

  • OpenAI: Models vary widely by family/version; for example GPT‑4.1 supports up to ~1M tokens of context, while newer GPT‑5 variants are model‑specific—always check your deployment’s model card. (GPT‑4.1, OpenAI models)

  • Anthropic Claude: Sonnet 4/4.5 now support up to 1M tokens of context (beta/tier‑gated), with standard tiers commonly at lower limits—check the model card. (Context windows)

2. Why you should manage tokens (even with a 1M context window)

A Generative AI application that consistently exceeds context limits is characterized by high latency, poor output quality, and truncated outputs; the result is angry users and stakeholders slowly losing trust in you, your team, and "AI". In my experience, here are a few things you gain by treating context management as a first-class citizen in your applications:

Predictable outputs / fewer truncations

LLMs, by default, do not know how many tokens they are about to output before they output them. Consequently, if a model is generating a 3,094-token answer (say, to a question from Jenny in Legal) but only 1,500 tokens of budget remain, it stops at 1,500 tokens; the remaining 1,594 tokens are simply never returned. Because tokens are fragments of words, this means answers that stop mid-sentence. Can you feel your users' frustration yet?

To counter this, it is important to set explicit caps that reflect your application's typical expected output lengths. Do not rely solely on defaults.

If you’re using a reasoning model, remember that reasoning tokens consume completion budget, so a too‑tight max_output_tokens can silently yield empty/partial answers—reserve headroom. (OpenAI Developer Community)

Performance really does dip at longer input token lengths (AKA context rot)

Models miss information buried in massive prompts. Newer models improve long-context retrieval, but they don’t subvert physics. The keen-eyed among you may already be thinking of the “needle in a haystack” test, which suggests models are increasingly able to reason and retrieve information correctly even at longer context lengths.

The test requires LLMs to identify a specific fact (the needle) placed in a large corpus of text (the haystack), often with similar facts surrounding it. However, the question is usually lexically similar to the needle, so the model has to do very little reasoning and face very little ambiguity. The test is therefore not very representative of most real-world conversations and workflows.

The fact remains that at longer context lengths, models do struggle to reason, retrieve relevant information, and maintain coherence. Capping input tokens and context to only relevant information reduces the likelihood of hitting these performance dips. (Lost in the Middle)

Costs track tokens

The more input and output tokens you send and consume, the higher your cost. Model providers bill per input/output token (output is often pricier). Advanced prompt engineering techniques such as prompt caching can reduce costs by caching common prefixes in your prompts: the cache applies to the longest shared prefix ≥ 1,024 tokens and grows in 128-token increments, so keep prefixes byte‑identical to maximize hits. (Prompt Caching) The easiest way, by far, to keep costs low and predictable is to manage your input and output tokens. If your application doesn't need large outputs, cap the output tokens lower than the default.
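
To make the caching rule concrete, here's a small sketch of the prefix arithmetic described above (the function name is mine, and exact behavior varies by provider):

```python
def cacheable_prefix_tokens(shared_prefix_tokens: int,
                            minimum: int = 1024,
                            increment: int = 128) -> int:
    """Tokens eligible for a cache hit given a shared prefix length.

    Below the minimum nothing is cached; above it, the cached span grows
    in fixed increments, so a 1,500-token prefix caches only 1,408 tokens.
    """
    if shared_prefix_tokens < minimum:
        return 0
    extra = shared_prefix_tokens - minimum
    return minimum + (extra // increment) * increment

print(cacheable_prefix_tokens(900))   # 0: below the 1,024-token minimum
print(cacheable_prefix_tokens(1500))  # 1408: 1024 + 3 full 128-token increments
```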

Latency & quotas

More tokens often mean slower responses and a higher chance of hitting tokens-per-minute (TPM) / requests-per-minute (RPM) limits on Azure or other model API providers. Hitting these limits will cause your requests to be throttled or dropped due to rate limit violations. (Azure OpenAI quotas/limits)

Once again, ensuring you set the right caps for your application will prevent unconstrained token consumption and reduce the likelihood that you experience the consequences of increased latency: unhappy users and a loss of trust from stakeholders.
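
When you do hit TPM/RPM limits, retrying with exponential backoff plus jitter is the standard mitigation. A minimal sketch (the function name, delays, and exception type are illustrative; swap in your SDK's rate-limit error):

```python
import random
import time

def with_backoff(call, max_retries: int = 5, base_delay: float = 0.5):
    """Retry `call` on rate-limit errors with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:  # stand-in for your SDK's rate-limit exception
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            # exponential backoff (0.5s, 1s, 2s, ...) plus a little jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```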

3. Techniques for managing tokens & context windows

Don’t manage tokens & context windows (fine for throwaway prototypes)

Skip any caps on token consumption, especially output tokens, relying solely on defaults. Fast to ship, terrible for reliability. Why?

  • Random truncation of outputs
  • Unbounded costs in worst-case inputs
  • Context length blowups as interactions with the model grow (for example, a long chat)
  • Reduced model coherence and performance at longer context lengths

Good for throwaway prototypes; avoid in production unless tightly cost-capped.

Static caps (simple & predictable)

Pick fixed per-route limits (e.g., “FAQ answers ≤ 512 tokens; draft sections ≤ 2,048”). You’ll cap cost/latency but risk under-allocation and truncation on outlier prompts. Set explicit caps via the provider parameter (for example, max_output_tokens in the Responses API for OpenAI). (OpenAI Responses API)
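
A static-cap table can be as simple as a dict keyed by route (the route names and numbers below are illustrative):

```python
# Illustrative per-route output caps; tune to your own traffic.
ROUTE_OUTPUT_CAPS = {
    "faq": 512,             # short factual answers
    "draft_section": 2048,  # longer generated prose
}
DEFAULT_OUTPUT_CAP = 1024

def output_cap_for(route: str) -> int:
    """Fixed per-route output cap: predictable, but may truncate outliers."""
    return ROUTE_OUTPUT_CAPS.get(route, DEFAULT_OUTPUT_CAP)

print(output_cap_for("faq"))  # 512
```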

Limits at a glance (examples — checked on: 2025-09-09; always check your model card)

  • Google Gemini 1.5 Pro/Flash: context window ≈1,000,000 tokens; output cap ≈8,000 tokens (Google Gemini tokens)
  • OpenAI (e.g., GPT‑4.1): context window up to ≈1,000,000 tokens; output cap model‑specific (OpenAI models)
  • Anthropic Claude (Sonnet 4/4.5): context window up to 1,000,000 tokens*; output cap model‑specific, set via max_tokens (Context windows)

  *1M context is currently beta/tier‑gated; standard tiers may be lower.

These are anchors, not guarantees. Providers change limits frequently.

Dynamic budgeting (recommended)

Compute a per-request budget from your model's context window, the provider’s output cap, your product SLO (cost/latency), and optional headroom for reasoning tokens on thinking models.

This is by far the best approach; it ensures your application only consumes the tokens required for its workload. There is a slight latency penalty associated with counting tokens before making LLM calls, but in most cases it is negligible or justified given the payoff.

The gist:

"""Compute a safe output-token budget from context and caps.
 
Args:
    context_window: Total available tokens for the model (input + output + reasoning).
    input_tokens: Tokens already used by the current request input.
    reasoning_headroom: Reserved tokens for hidden reasoning on thinking models.
    app_cap: Application-specific maximum allowed output tokens.
    provider_output_cap: Model/provider maximum output tokens.
 
Returns:
    max_output_tokens: Safe maximum output tokens for this request.
"""
max_out_for_context = context_window - input_tokens - reasoning_headroom
max_output_tokens   = min(app_cap, provider_output_cap, max_out_for_context)

Full implementation example:

"""Utilities for dynamic token management with OpenAI Responses API and tiktoken.
 
This module counts tokens, trims/summarizes history to fit a context window,
and computes a safe max_output_tokens honoring application and provider caps.
It also reserves optional reasoning headroom for thinking models.
"""
# pip install openai tiktoken
import os
from typing import List, Dict, Any, Optional
import tiktoken
from openai import OpenAI
 
MODEL_ID = os.getenv("OPENAI_MODEL", "gpt-5")   # or your deployed model name
SUMMARIZER_MODEL = os.getenv("OPENAI_SUMMARIZER_MODEL", MODEL_ID)
CONTEXT_WINDOW = int(os.getenv("MODEL_CONTEXT_WINDOW", 128_000))  # set per-deployment
PROVIDER_OUTPUT_CAP = int(os.getenv("PROVIDER_OUTPUT_CAP", 8_192))  # see docs for your model
APP_OUTPUT_CAP = int(os.getenv("APP_OUTPUT_CAP", 2_048)) # your application's specific output cap
REASONING_HEADROOM = int(os.getenv("REASONING_HEADROOM", 0))  # e.g., 1024+ for reasoning models
MIN_OUTPUT_FLOOR = int(os.getenv("MIN_OUTPUT_FLOOR", 256))
MAX_SUMMARY_PASSES = int(os.getenv("MAX_SUMMARY_PASSES", 3))
 
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
 
def _enc_for_model(model: str):
    """Return a tokenizer encoding for the given model, with sensible fallback.
 
    Args:
        model: Model name hint used to resolve the most appropriate tokenizer.
 
    Returns:
        A tiktoken ``Encoding`` instance suitable for the provided model.
 
    Notes:
        Falls back to the generic ``cl100k_base`` encoding when the model is
        unknown to ``tiktoken.encoding_for_model``.
    """
    try:
        return tiktoken.encoding_for_model(model)
    except Exception:
        return tiktoken.get_encoding("cl100k_base")
 
def count_text_tokens(text: str, model: str) -> int:
    """Count tokens for a text string using the model's tokenizer.
 
    Args:
        text: The input string to tokenize.
        model: Model name used to choose the correct encoding.
 
    Returns:
        The number of tokens produced by encoding ``text``.
    """
    enc = _enc_for_model(model)
    return len(enc.encode(text))
 
def count_message_tokens(messages: List[Dict[str, Any]], model: str) -> int:
    """Estimate token count for chat messages, including envelope overhead.
 
    Args:
        messages: A list of message dicts with a ``content`` field. ``content``
            may be a string or a list of parts (strings or dicts with ``text``
            / ``input`` fields).
        model: Model name used to choose the correct encoding.
 
    Returns:
        Estimated total token count for all messages.
 
    Notes:
        Adds a 4‑token per-message overhead and a 2‑token envelope overhead to
        better approximate OpenAI-style chat tokenization.
    """
    total = 0
    for m in messages:
        c = m.get("content", "")
        if isinstance(c, list):
            flat = "".join(
                str(p.get("text") or p.get("input") or p)
                if isinstance(p, dict) else str(p)
                for p in c
            )
            total += count_text_tokens(flat, model)
        else:
            total += count_text_tokens(str(c), model)
        total += 4  # heuristic per-message overhead
    total += 2      # envelope overhead
    return total
 
def summarize_block(block: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize a block of messages into a compact, reusable system note.
 
    Args:
        block: Older messages to condense while preserving salient facts.
 
    Returns:
        A single system message dict containing the concise summary.
 
    Notes:
        Uses the currently configured model at low temperature and caps the
        summary to 256 output tokens to control growth.
    """
    summary_input = [
        {"role": "system", "content": "Summarize concisely; keep only reusable facts."},
        {"role": "user", "content": "\n\n".join(f"{m['role'].upper()}: {m.get('content','')}" for m in block)},
    ]
    resp = client.responses.create(
        model=SUMMARIZER_MODEL,
        input=summary_input,
        max_output_tokens=256,
        temperature=0
    )
    return {"role": "system", "content": "[EARLIER SUMMARY]\n" + resp.output_text}
 
def fit_messages_to_window(msgs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Trim/summarize messages so there is safe room for model output.
 
    Ensures the remaining context allows at least ``MIN_OUTPUT_FLOOR`` tokens
    of output after reserving ``REASONING_HEADROOM``.
 
    Args:
        msgs: Original list of chat messages.
 
    Returns:
        A possibly condensed list of messages that fits the context window.
 
    Notes:
        Prefers summarizing older turns first; as a last resort, truncates the
        longest recent user input if still over budget.
    """
    messages = list(msgs)
    def fits(ms): 
        used = count_message_tokens(ms, MODEL_ID)
        return (CONTEXT_WINDOW - used - REASONING_HEADROOM) >= MIN_OUTPUT_FLOOR
 
    # quick exit
    if fits(messages): 
        return messages
 
    # summarize older turns first
    older, recent = (messages[:-2], messages[-2:]) if len(messages) > 3 else ([], messages)
    passes = 0
    while older and passes < MAX_SUMMARY_PASSES:
        mid = max(1, len(older)//2)
        older = [summarize_block(older[:mid])] + older[mid:]
        if fits(older + recent): 
            return older + recent
        passes += 1
 
    # last resort: trim long recent input
    while not fits(older + recent):
        i = max(range(len(recent)), key=lambda k: len(str(recent[k].get("content", ""))))
        c = str(recent[i].get("content", ""))
        if len(c) < 400:
            break
        # replace with a copy so the caller's message dicts aren't mutated
        recent[i] = {**recent[i], "content": c[:300] + "\n[...truncated to fit context...]"}
    return older + recent
 
def compute_max_output_tokens(messages: List[Dict[str, Any]]) -> int:
    """Compute a safe ``max_output_tokens`` within context and cap limits.
 
    Args:
        messages: Messages to be sent to the model (already fitted if needed).
 
    Returns:
        The maximum number of output tokens permitted for this request.
 
    Raises:
        ValueError: If the computed budget falls below ``MIN_OUTPUT_FLOOR``.
    """
    used = count_message_tokens(messages, MODEL_ID)
    context_room = max(0, CONTEXT_WINDOW - used - REASONING_HEADROOM)
    provisional = min(APP_OUTPUT_CAP, PROVIDER_OUTPUT_CAP, context_room)
    if provisional < MIN_OUTPUT_FLOOR:
        raise ValueError("Insufficient room for a safe output; reduce input further.")
    return provisional
 
def generate(messages: List[Dict[str, Any]], stop: Optional[List[str]] = None, temperature: float = 0.2):
    """Generate a response with a dynamically budgeted output token cap.
 
    Args:
        messages: Chat messages to send to the model.
        stop: Optional list of stop strings to end generation early.
        temperature: Sampling temperature for the model.
 
    Returns:
        A dict with ``text`` (model output) and ``usage`` (token accounting).
 
    Raises:
        ValueError: If there is insufficient room for minimum safe output.
    """
    fitted = fit_messages_to_window(messages)
    max_out = compute_max_output_tokens(fitted)
    resp = client.responses.create(
        model=MODEL_ID,
        input=fitted,
        max_output_tokens=max_out,
        temperature=temperature,
        stop=stop or []
        # optionally: add structured output schema here
    )
    usage = getattr(resp, "usage", None)
    reasoning = None
    # The Responses API reports reasoning under output_tokens_details;
    # Chat Completions uses completion_tokens_details. Check both.
    if usage:
        details = (getattr(usage, "output_tokens_details", None)
                   or getattr(usage, "completion_tokens_details", None))
        if details is not None:
            reasoning = getattr(details, "reasoning_tokens", None)
    return {
        "text": resp.output_text,
        "usage": {
            "input_tokens": getattr(usage, "input_tokens", None),
            "output_tokens": getattr(usage, "output_tokens", None),
            "reasoning_tokens": reasoning
        }
    }

Note: Defaults above are conservative. Always confirm your model’s context window and output caps from the provider docs. (OpenAI models, Google Gemini tokens)

You’ve put the math in place; now make sure the telemetry tells you whether it’s working.

Observe: what to measure

  • p50/p95 input tokens, output tokens
  • truncation rate (percent of responses that hit cap)
  • cache hit rate (if using prompt caching)
  • TPM/RPM headroom and backoff outcomes
  • cost/request and cost/route

Tiny log example:

{"route":"faq","p50_in":480,"p95_in":1900,"p50_out":220,"p95_out":800,"truncated":0.07,"cache_hit":0.61,"tpm_headroom":0.32,"rpm_headroom":0.45,"cost_cents":1.8}

Create a single dashboard with these dials and alert when truncation spikes or headroom collapses.
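
Those percentiles and rates are cheap to compute from raw per-request records; a minimal sketch (the field names mirror the log example above):

```python
from statistics import quantiles

def route_metrics(records):
    """Aggregate per-request records into dashboard-style route metrics.

    Each record: {"in": input_tokens, "out": output_tokens, "truncated": bool}.
    """
    ins = [r["in"] for r in records]
    outs = [r["out"] for r in records]
    def pct(values, q):
        return quantiles(values, n=100)[q - 1]  # q-th percentile cut point
    return {
        "p50_in": pct(ins, 50), "p95_in": pct(ins, 95),
        "p50_out": pct(outs, 50), "p95_out": pct(outs, 95),
        "truncated": sum(r["truncated"] for r in records) / len(records),
    }
```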

Numbers will drift. When they do, bend the UX instead of letting it fail loud.

Degrade gracefully when budgets are tight

  • Return a short‑form structured schema (summary + 3 bullets) if max_out < 400.
  • Ask back when input blows the budget: “Too long—summarize source or narrow scope?”
  • Support intentional output chunking: emit a clear “Continue” signal and resume.

Structured output starter:

{"summary":"","bullets":["","",""],"sources":[]}

Prefer structured outputs. They bound verbosity and make parsing and evaluation trivial. If you use tool calls or JSON mode, reserve extra output headroom for serialization overhead and add stop sequences.
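
If you adopt a short-form structured schema like the one above, a lightweight validator keeps schema-invalid half-answers out of your pipeline (a sketch; real projects may prefer jsonschema or pydantic):

```python
import json

def parse_short_form(raw: str) -> dict:
    """Parse and validate the short-form schema: summary, 3 bullets, sources."""
    data = json.loads(raw)  # raises on truncated or invalid JSON
    if not isinstance(data.get("summary"), str):
        raise ValueError("missing 'summary' string")
    bullets = data.get("bullets")
    if not (isinstance(bullets, list) and len(bullets) == 3):
        raise ValueError("'bullets' must be a list of exactly 3 items")
    if not isinstance(data.get("sources"), list):
        raise ValueError("missing 'sources' list")
    return data

ok = parse_short_form('{"summary":"s","bullets":["a","b","c"],"sources":[]}')
```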

A few implementation tips I always consider

  • Set a floor (min_output_floor) so you don’t return schema-invalid half-answers.
  • Prefer structured outputs to bound verbosity and simplify parsing.
  • Summarize & prune chat history aggressively. Long prompts are latency and cost multipliers.
  • Exploit prompt caching by putting stable instructions and few-shots at the top of the prompt (the cacheable prefix). Keep them byte‑identical to maximize cache hits (TTL and minimum lengths vary by provider).
  • Add stop sequences to prevent run-ons.

Quick cheat sheet (rules of thumb)

  • Set a floor (min_output_floor, e.g., 256) to avoid schema‑invalid half‑answers.
  • Reserve 15–25% of output budget as reasoning headroom for thinking models.
  • Keep cacheable prefixes byte‑identical (≥ 1,024 tokens; grows in 128‑token increments).
  • Set explicit per‑route caps; defaults are not a product decision.
  • Monitor p95 input/output tokens and truncation; degrade with structured short‑form when tight.

4. Wrapping up

Tokens are a resource. Treat them like CPU cycles: budget them, cap them, measure them. Use dynamic budgeting, prune and summarize aggressively, prefer structured outputs, and exploit prompt caching. Set output caps that reflect real user needs, not defaults. Monitor usage and rate limits, and keep your guardrails in step with evolving model specs; your reliability, latency, and bills depend on it. Context windows keep growing, but physics and costs still apply. The teams that win don't throw more tokens at the problem; they budget, constrain, and observe them. Bake these practices into your code paths now, and your users will feel the difference in reliability, speed, and trust.

© 2025 Fauzi Hussein. All rights reserved.