Case study: a production language-model pipeline

The problem.

NYC renters commit to leases nearly blind. The data that would warn them exists (housing violations, building complaints, marshal evictions, bedbug filings, the Public Advocate's landlord watchlist), but it sits scattered across half a dozen city portals that nobody checks during a 15-minute apartment viewing. I rent in NYC, and I built the tool I wished I'd had before my own last lease.

The engineering problem is what happens after the demo. The naive version of this app is a fetch and a model call, and it works right up until real traffic hits it. The city's Socrata endpoints rate-limit and stall without warning. Listing sites block scrapers. Language models occasionally return malformed JSON, and they will happily invent a violation count if you let them. And every lookup costs real money, on a free product, from anonymous traffic.

So the pipeline was designed around three rules:

A slow data source must never hold the whole report hostage.
The model must never be the source of truth for anything scored or counted.
An anonymous stranger must never be able to spend more than 20 cents of my money in a day.

How a lookup moves through the pipeline.

A request arrives as an address or a listing URL and passes through a fixed sequence of phases: validate (Zod schemas), scrape the listing if one was given, geocode to the city's canonical building ID, check a 24-hour cache, fan out to the nine open-data sources in parallel, compute the risk score, call the model once, and persist everything. Results stream back as newline-delimited JSON the whole way.

Two decisions in that flow do most of the work.

First, the fan-out treats every dataset as optional. Each parallel fetch runs under its own 5-second deadline. A source that stalls resolves with a fallback value and a flag instead of an exception:

backend/src/routes/lookup.ts

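/** Race a promise against a deadline. On timeout, resolves with the fallback
 *  value and `timedOut: true`; the underlying promise keeps running but its
 *  result is discarded. A rejection also resolves with the fallback, but with
 *  `timedOut: false`, so this helper never throws and a failed source is not
 *  flagged as timed out. Used to cap dataset fan-out tail latency. */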
function withDeadline<T>(
  p: Promise<T>,
  ms: number,
  fallback: T,
): Promise<{ value: T; timedOut: boolean }> {
  let to: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<{ value: T; timedOut: boolean }>((resolve) => {
    to = setTimeout(() => resolve({ value: fallback, timedOut: true }), ms);
  });
  const settled = p
    .then((value) => ({ value, timedOut: false }))
    .catch(() => ({ value: fallback, timedOut: false }));
  return Promise.race([settled, timeout]).then((r) => {
    if (to) clearTimeout(to);
    return r;
  });
}

A dataset that misses its deadline is listed as partial in the response, the frontend labels it, and the report ships anyway. One sick endpoint degrades one section, not the product.
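A minimal sketch of how the fan-out might collect results and build the partial list. The `fanOut` helper, the `Source` shape, and the source names are hypothetical, not taken from the project; `withDeadline` is repeated from the excerpt above only so the example stands alone.

```typescript
// Repeated from the excerpt above so this sketch is self-contained.
function withDeadline<T>(
  p: Promise<T>,
  ms: number,
  fallback: T,
): Promise<{ value: T; timedOut: boolean }> {
  let to: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<{ value: T; timedOut: boolean }>((resolve) => {
    to = setTimeout(() => resolve({ value: fallback, timedOut: true }), ms);
  });
  const settled = p
    .then((value) => ({ value, timedOut: false }))
    .catch(() => ({ value: fallback, timedOut: false }));
  return Promise.race([settled, timeout]).then((r) => {
    if (to) clearTimeout(to);
    return r;
  });
}

// Hypothetical shape: a dataset name plus a function that starts its fetch.
type Source<T> = { name: string; fetch: () => Promise<T[]> };

// Start every fetch at once, give each its own deadline, and record which
// sources fell back to an empty list because they ran out of time.
async function fanOut<T>(
  sources: Source<T>[],
  deadlineMs: number,
): Promise<{ data: Record<string, T[]>; partial: string[] }> {
  const results = await Promise.all(
    sources.map(async (s) => {
      const r = await withDeadline<T[]>(s.fetch(), deadlineMs, []);
      return { name: s.name, value: r.value, timedOut: r.timedOut };
    }),
  );
  const data: Record<string, T[]> = {};
  const partial: string[] = [];
  for (const r of results) {
    data[r.name] = r.value;
    if (r.timedOut) partial.push(r.name);
  }
  return { data, partial };
}
```

Because every wrapped promise resolves rather than rejects, `Promise.all` here cannot short-circuit on one bad source. Note that a source that rejects outright comes back empty without appearing in `partial`; flagging that case too would need an extra field.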

Second, the AI never decides the score. The 0-to-100 risk score comes from a deterministic penalty matrix in plain TypeScript, computed from the records themselves. The model receives the score and its contributing factors as input and only narrates them. It also cannot misquote the listing: anything it claims came from the listing text is checked verbatim against the source and dropped if it does not match. Grounding here is not a prompt instruction; it is a code path.
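The verbatim check can be sketched in a few lines. The `ListingClaim` shape and the function names are hypothetical, and the whitespace normalization is an assumption of this sketch (scraped text often carries stray line breaks); the project's actual matching rule may be stricter.

```typescript
// Hypothetical shape for a claim the model attributes to the listing text.
type ListingClaim = { quote: string; note: string };

// Collapse whitespace runs so line breaks in scraped text do not cause
// false mismatches. Assumption of this sketch, not a confirmed rule.
function normalizeWhitespace(s: string): string {
  return s.replace(/\s+/g, ' ').trim();
}

// Keep only claims whose quote appears character for character
// (after whitespace normalization) in the listing text.
function keepVerbatimClaims(
  claims: ListingClaim[],
  listingText: string,
): ListingClaim[] {
  const haystack = normalizeWhitespace(listingText);
  return claims.filter((c) => {
    const needle = normalizeWhitespace(c.quote);
    // An empty quote would match any text, so reject it explicitly.
    return needle.length > 0 && haystack.includes(needle);
  });
}
```

The filter runs on the model's output after the call returns, so a paraphrased or invented quote is dropped no matter what the prompt said.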

The streaming exists for a business reason, not a technical one. The deterministic score and record counts are ready seconds before the model finishes writing, so the stream pushes an early data_ready event and the user watches the report assemble instead of staring at a spinner. Perceived speed is free; a faster model tier is not.
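A rough sketch of that ordering, assuming hypothetical event shapes and a hypothetical `streamReport` function; only the `data_ready` event name comes from the text above. Each event is one JSON document followed by a newline, which is all that newline-delimited JSON requires.

```typescript
// Hypothetical event shapes; only `data_ready` is named in the write-up.
type StreamEvent =
  | { type: 'data_ready'; score: number; counts: Record<string, number> }
  | { type: 'summary'; text: string }
  | { type: 'done' };

// NDJSON framing: one JSON document per line.
function encodeEvent(e: StreamEvent): string {
  return JSON.stringify(e) + '\n';
}

// Push the deterministic results first, then wait on the model.
// `write` stands in for whatever writes to the HTTP response.
async function streamReport(
  write: (chunk: string) => void,
  deterministic: { score: number; counts: Record<string, number> },
  summarize: () => Promise<string>,
): Promise<void> {
  write(encodeEvent({ type: 'data_ready', ...deterministic }));
  const text = await summarize(); // the slow part: the model call
  write(encodeEvent({ type: 'summary', text }));
  write(encodeEvent({ type: 'done' }));
}
```

The client can render the score and counts as soon as the first line arrives, which is where the perceived speed comes from.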

Retries, timeouts, and what they defend against.

Every external dependency gets the same treatment: a hard timeout, one retry on the failures worth retrying (429s and 5xx), a 2-second backoff, and a typed error if that fails. Socrata calls time out at 10 seconds. The model call times out at 30, enforced with an AbortController rather than hope:

backend/src/ai/openai-client.ts

export async function callChat(payload: ChatRequest): Promise<ChatResponse> {
  const apiKey = process.env.OPENAI_API_KEY;
  if (!apiKey) throw new OpenAIError('OPENAI_API_KEY not set');

  for (let attempt = 1; attempt <= 2; attempt++) {
    const ctrl = new AbortController();
    const timer = setTimeout(() => ctrl.abort(), 30_000);
    try {
      const res = await fetch('https://api.openai.com/v1/chat/completions', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${apiKey}`,
        },
        body: JSON.stringify(payload),
        signal: ctrl.signal,
      });
      clearTimeout(timer);
      if ((res.status === 429 || res.status >= 500) && attempt === 1) {
        await new Promise((r) => setTimeout(r, 2000));
        continue;
      }
      if (!res.ok) throw new OpenAIError(`OpenAI ${res.status}`, res.status);
      return (await res.json()) as ChatResponse;
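    } catch (e) {
      clearTimeout(timer);
      // Non-retryable HTTP errors (for example 400 or 401) surface immediately
      // instead of being swallowed and retried on the first attempt.
      if (e instanceof OpenAIError) throw e;
      // Network failures and timeouts get one retry, then a typed error.
      if (attempt === 2) throw new OpenAIError(String(e));
    }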
  }
  throw new OpenAIError('unreachable');
}

One retry is a deliberate number. A renter is watching the page; a report that arrives 40 seconds late is roughly as useless as one that never arrives. If the second attempt fails, the request fails fast, and the 24-hour cache means the next attempt usually succeeds on stored data. Listing scrapes get a true fallback chain instead: when Zillow or StreetEasy blocks the fetch, the pipeline parses the address out of the URL itself and downgrades to an address-only report with a visible flag, rather than returning an error page.
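The URL fallback might look something like the sketch below. The `addressFromListingUrl` helper is hypothetical, and the path layouts it looks for (`/homedetails/<slug>/...` and `/building/<slug>`) are assumptions about how such listing URLs commonly look, not a documented contract of either site.

```typescript
// Hypothetical helper: recover a street address from a listing URL slug.
// Returns null when the URL does not carry anything that looks like one.
function addressFromListingUrl(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return null; // not a parseable URL
  }
  const parts = url.pathname.split('/').filter(Boolean);
  // Assumed layouts: /homedetails/123-Main-St-Brooklyn-NY-11201/<id>/
  //                  /building/250-west-50-street-manhattan
  const marker = parts.findIndex((p) => p === 'homedetails' || p === 'building');
  const slug = marker >= 0 ? parts[marker + 1] : undefined;
  // Require a leading house number so category pages are not mistaken for addresses.
  if (!slug || !/^\d/.test(slug)) return null;
  return slug.replace(/-/g, ' ').trim();
}
```

The recovered string still has to go through the geocoder, which is what decides whether it names a real building. Queens-style hyphenated house numbers would lose their hyphen in this naive version and would need extra care.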

Every model call writes its own receipt.

Cost control is two layers: a ledger and a gate.

The ledger: every model response includes exact token counts, and the pipeline prices them on the spot (at $0.15 per million input tokens and $0.60 per million output tokens for gpt-4o-mini), rounds up to a whole cent with a one-cent minimum, and writes the cost to an ai_usage row tied to the caller:

backend/src/ai/summary.ts

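// Pricing: convert token counts to cents. The PRICE_*_PER_M constants are read
// here as dollars per million tokens, so dividing by 10_000 (1_000_000 / 100)
// yields cents.
const inputCents = (res.usage.prompt_tokens * PRICE_INPUT_PER_M) / 10_000;
const outputCents = (res.usage.completion_tokens * PRICE_OUTPUT_PER_M) / 10_000;
// Round up to a whole cent and never record less than one cent per call.
const cost_cents = Math.max(1, Math.ceil(inputCents + outputCents));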

const usageRows = await getDb()
  .insert(aiUsage)
  .values({
    userId: subject.type === 'user_id' ? subject.value : null,
    email: subject.type === 'email' ? subject.value : null,
    route: 'lookup',
    costCents: cost_cents,
    modelUsed: 'gpt-4o-mini',
  })
  .returning({ id: aiUsage.id });

The gate: before any model call, the pipeline sums the caller's last 24 hours of ledger rows and refuses to proceed past a cap. Anonymous visitors get $0.20 a day, email-verified users $0.50, signed-in users $5.00. Combined with a 24,000-character input cap and an 1,800-token output cap, the worst-case cost of a report is arithmetic, not hope: under $0.002 in tokens. Repeat lookups of the same building within 24 hours skip the model entirely, and the static system prompt is eligible for OpenAI's prompt-caching discount.
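The gate and the worst-case arithmetic can be sketched as follows. The tier names, the in-memory `UsageRow` stand-in for ai_usage rows, and the strict "below the cap" comparison are assumptions of this sketch; the caps and prices are the figures quoted above. The worst-case figure also assumes roughly four characters per token, a common rule of thumb for English text, and covers only the capped input.

```typescript
type Tier = 'anonymous' | 'email_verified' | 'signed_in';

// Daily caps in cents, from the figures in the text.
const DAILY_CAP_CENTS: Record<Tier, number> = {
  anonymous: 20,
  email_verified: 50,
  signed_in: 500,
};

// Hypothetical in-memory stand-in for ai_usage rows.
type UsageRow = { costCents: number; createdAt: Date };

// Sum the caller's ledger rows from the trailing 24 hours.
function spentLast24h(rows: UsageRow[], now: Date): number {
  const cutoff = now.getTime() - 24 * 60 * 60 * 1000;
  return rows
    .filter((r) => r.createdAt.getTime() > cutoff)
    .reduce((sum, r) => sum + r.costCents, 0);
}

// Allow a model call only while spend is strictly below the tier's cap.
function mayCallModel(tier: Tier, rows: UsageRow[], now: Date = new Date()): boolean {
  return spentLast24h(rows, now) < DAILY_CAP_CENTS[tier];
}

// Worst-case token cost of one report in dollars, at $0.15 per million
// input tokens and $0.60 per million output tokens.
function worstCaseDollars(
  inputChars = 24_000,
  outputTokens = 1_800,
  charsPerToken = 4, // assumption: rough average for English text
): number {
  const inputTokens = Math.ceil(inputChars / charsPerToken);
  return (inputTokens * 0.15 + outputTokens * 0.6) / 1_000_000;
}
```

With the defaults, that is 6,000 input tokens ($0.0009) plus 1,800 output tokens ($0.00108), about $0.00198 in total. Because the ledger records at least one cent per call, a $0.20 cap admits at most 20 model calls per day from one anonymous caller.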

This is the unglamorous half of AI engineering, and it is exactly the half that lets a free product face anonymous internet traffic without a surprise bill.

The numbers.

9 NYC Open Data sources fetched in parallel per lookup, each under a 5-second deadline; slow sources are marked partial and never block the report
Under $0.002 in tokens per report, worst case, enforced by input and output caps rather than monitoring
24-hour spend caps per caller: $0.20 anonymous, $0.50 email-verified, $5.00 signed in
Rate limits by trust tier, from 10 requests an hour for anonymous visitors to 60 for signed-in users
24-hour caches on both raw building data and finished AI summaries; a popular building costs the model nothing
80 test files across backend and frontend (Vitest and Playwright) gate every deploy, over a pipeline of roughly 6,800 lines of TypeScript

What I would tell you over coffee: the single-retry policy means a truly bad open-data day produces visibly partial reports, the free hosting tier cold-starts, and there is no human review queue. The mitigations are the deterministic score, the source link on every claim, and the verbatim quote checks. Those are the right trade-offs for a free consumer tool, and they would be different trade-offs for your product. Knowing which ones to change is most of the job.

I build pipelines like this for products that already have users.

Email me about yours