LangChain and LangGraph in Production: What Works, What Breaks, and What We'd Change

By turn 6, the agent forgot which property the guest had picked. By turn 8, it was re-inferring dates from scratch on every message. LangChain's AgentExecutor worked in the sandbox and failed almost immediately in production. Here's what replaced it and what we'd skip next time.

This post is a deep dive from our WhatsApp hotel booking case study: an honest look at the framework decisions, which ones earned their place and which ones didn't.

Why We Chose LangChain

We needed pluggable LLM support (evaluating Gemini, Claude, and GPT-4o), pre-built tool abstractions, and a path to multi-step conversation management. LangChain checked all three. The ecosystem -- integrations, community, documentation -- was ahead of alternatives at the time.

We evaluated Semantic Kernel (too Microsoft-centric), raw OpenAI SDK (meant building tool orchestration and state management from scratch), and Haystack (strong for RAG, weaker for agent-with-tools patterns). LangChain won on breadth and the promise of LangGraph for state management. That promise turned out to be the most important factor.

The AgentExecutor Problem

We started with LangChain's AgentExecutor -- one system prompt, a handful of tools, the executor deciding when to call what. Worked in the sandbox. Failed almost immediately in production.

By turn 6 or 7, the agent was over-calling tools -- a guest saying "sounds good, let's book" triggered another availability check instead of confirming. It oscillated between tool calls and freeform answers with no consistency. And it lost structured state -- by turn 8, it regularly forgot which property the guest had picked, or conflated guest counts from different messages.

The root problem: AgentExecutor treats every turn as a fresh decision with the full message history as context. No persistent structured state, no notion of "we're in the ranking phase now," no way to constrain which tools are valid when. The message list grew, attention to early constraint-setting messages degraded, and the model re-inferred dates and guest counts from free text on every turn.

LangGraph: Conversations as State Machines

The core insight was simple: a booking conversation isn't open-ended agent interaction. It's a state machine with known phases, and the model's job is to make decisions within each phase, not to decide which phase we're in.

We built 8 states. Three illustrate the principle:

CollectStayConstraints extracted dates, guests, budget into typed fields on a ConversationState object. Not free-text summaries -- structured fields. Once a date was confirmed, it lived as check_in: date, not as a sentence the next node would re-parse. Structured fields were authoritative. The message list was context.
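A minimal sketch of what such a typed state object can look like, using a standard-library dataclass. Only ConversationState and check_in come from the description above; every other field name and the confirm_dates helper are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ConversationState:
    """Structured booking state; field names beyond check_in are illustrative."""
    destination: Optional[str] = None
    check_in: Optional[date] = None
    check_out: Optional[date] = None
    guests: Optional[int] = None
    budget_max: Optional[int] = None
    # Raw message history is kept as context only, never re-parsed for slots.
    messages: list[str] = field(default_factory=list)

    def confirm_dates(self, check_in: date, check_out: date) -> None:
        """Store confirmed dates as typed values, rejecting impossible stays."""
        if check_out <= check_in:
            raise ValueError("check_out must be after check_in")
        self.check_in, self.check_out = check_in, check_out


state = ConversationState(messages=["2 log, Friday se Sunday"])
state.confirm_dates(date(2025, 3, 7), date(2025, 3, 9))
state.messages.append("sounds good, let's book")
# Later nodes read state.check_in directly; new messages do not alter it.
```

The point of the design is that a later message can add context without silently changing a confirmed slot.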

SearchInventory built a PMS query entirely from those typed fields. No LLM involvement in query construction. Deterministic mapping from {destination, dates, occupancy, budget_max} to API call.
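As a rough illustration of that deterministic mapping, the sketch here turns typed slots into request parameters with no model call involved. The function name, the parameter names, and the normalization of the destination are assumptions; the real PMS API is not shown in this post.

```python
from datetime import date


def build_pms_query(destination: str, check_in: date, check_out: date,
                    occupancy: int, budget_max: int) -> dict:
    """Map typed slots to request parameters (hypothetical parameter names)."""
    return {
        "destination": destination.strip().lower(),
        "check_in": check_in.isoformat(),
        "check_out": check_out.isoformat(),
        "occupancy": occupancy,
        "budget_max": budget_max,
    }


query = build_pms_query("Bangalore", date(2025, 3, 7), date(2025, 3, 9), 2, 8000)
```

The same slots always produce the same query, which makes this step easy to unit test and easy to rule out when debugging.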

RankAndExplainOptions was where the model earned its keep. Given valid options from the API, it picked and explained 3-5 to the guest, grounded in retrieved data. This was the most token-intensive node, using Gemini 3 Flash for reasoning quality.

Conditional edges governed transitions. CollectStayConstraints only moved to SearchInventory when all required slots were filled. The model never decided whether to call check_availability or create_booking -- the graph already knew based on current state.
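The slot-filling gate can be sketched as a plain routing function. In LangGraph, a function of this shape is what gets passed to add_conditional_edges on the graph builder; the code here leaves the framework out so it runs on its own. The list of required slots is an assumption.

```python
# Which slots count as required is an assumption for illustration.
REQUIRED_SLOTS = ("destination", "check_in", "check_out", "guests")


def route_after_collect(state: dict) -> str:
    """Return the name of the next node based only on structured state."""
    missing = [slot for slot in REQUIRED_SLOTS if state.get(slot) is None]
    # Loop back and keep asking until every required slot is filled.
    return "CollectStayConstraints" if missing else "SearchInventory"


partial = {"destination": "bangalore", "guests": 2}
complete = {**partial, "check_in": "2025-03-07", "check_out": "2025-03-09"}
next_when_partial = route_after_collect(partial)
next_when_complete = route_after_collect(complete)
```

Because the router reads typed state rather than message text, the model has no opportunity to skip ahead to a search or a booking.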

The broader pattern -- when to use state machines vs. single agents vs. workflow engines -- is in Agentic Architecture Patterns. The full 8-state walkthrough is in the WhatsApp case study.

What This Actually Unlocked

Per-node model routing changed our economics. We tiered models by node: cheap, fast models for classification and capable models for reasoning. This went beyond "pluggable LLMs" -- it meant we could optimize cost per state. When Gemini 3 Flash pricing shifted, we evaluated alternatives for the ranking node without touching anything else.
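One simple way to express per-node routing is a lookup table keyed by node name. This is a sketch under assumptions: the model strings are display labels rather than real API identifiers, and which node uses the cheaper model is a guess for illustration.

```python
# Model strings are labels, not provider API identifiers.
MODEL_BY_NODE = {
    "CollectStayConstraints": "fast-model-placeholder",  # assumed cheap tier
    "RankAndExplainOptions": "Gemini 3 Flash",
}


def model_for(routing: dict, node: str) -> str:
    """Look up the model for a node; an unknown node raises KeyError."""
    return routing[node]


def with_override(routing: dict, node: str, model: str) -> dict:
    """Return a copy of the routing table with one node's model swapped."""
    return {**routing, node: model}


trial = with_override(MODEL_BY_NODE, "RankAndExplainOptions", "candidate-model")
```

Evaluating an alternative for one node then means changing a single entry, with every other node left exactly as it was.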

Typed tool schemas prevented silent production failures. LangChain's @tool decorator with Pydantic inputs/outputs caught integration mismatches at development time that would have been silent bugs in production. When the PMS changed a response field, the schema validation failed loudly in CI, not quietly on a guest's booking.
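To show the failure mode without pulling in dependencies, here is a standard-library stand-in for that kind of schema check. It is not Pydantic and not the @tool decorator; the AvailabilityResult type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityResult:
    """Hypothetical shape of one PMS availability record."""
    property_id: str
    nightly_rate: int
    rooms_left: int

    @classmethod
    def from_payload(cls, payload: dict) -> "AvailabilityResult":
        """Build a typed result, raising if a field is missing or mistyped."""
        values = {}
        for name, expected in cls.__annotations__.items():
            if name not in payload:
                raise ValueError(f"PMS response missing field: {name}")
            if not isinstance(payload[name], expected):
                raise TypeError(f"{name} should be {expected.__name__}")
            values[name] = payload[name]
        return cls(**values)


ok = AvailabilityResult.from_payload(
    {"property_id": "p-1", "nightly_rate": 4200, "rooms_left": 3}
)
```

If the upstream API renames nightly_rate, from_payload raises in the test suite instead of passing a half-empty record to the model.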

Gateway-level Langfuse traces made cost attribution possible. Every generation produced a trace we could slice by property and model. Without this, the per-property P&L dashboard that became our most important tool wouldn't have existed.

Checkpointing enabled real commerce flows. When we sent a payment link, the graph paused. When the webhook arrived hours later, it resumed exactly where it left off. Without interrupt/resume, async flows like payments and human escalation would have required a completely separate state management system.
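The shape of pause and resume can be sketched in a few lines. LangGraph supplies its own checkpointers and interrupt mechanism for this; the code here is only a framework-free illustration, with an in-memory dict standing in for a durable store and a hypothetical ConfirmBooking node name.

```python
import json

# Stand-in for a durable checkpoint store keyed by conversation thread.
CHECKPOINTS: dict[str, str] = {}


def pause_for_payment(thread_id: str, state: dict) -> None:
    """Persist state and stop; nothing runs until the webhook arrives."""
    snapshot = {**state, "next_node": "ConfirmBooking"}  # hypothetical node name
    CHECKPOINTS[thread_id] = json.dumps(snapshot)


def resume_from_webhook(thread_id: str, payment_status: str) -> dict:
    """Reload the saved state and attach the payment result to it."""
    state = json.loads(CHECKPOINTS.pop(thread_id))
    state["payment_status"] = payment_status
    return state


pause_for_payment("thread-1", {"property_id": "p-1", "guests": 2})
resumed = resume_from_webhook("thread-1", "paid")
```

Hours can pass between the two calls, since the snapshot is the only thing that has to survive.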

Where It Broke in Production

Version breakage. LangChain moved fast -- too fast for production. Twice in three months, a patch version changed tool call serialization and silently broke our PMS integration. We caught both in our DeepEval suite, but only because we had coverage for those specific patterns. We pinned exact versions and treated every upgrade as its own PR.
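Exact pinning is easy to enforce mechanically. As one possible CI check (a sketch, not our actual tooling), this function lists every requirement line that is not pinned with "==". The package names in the example are made up.

```python
import re

# Matches "name==version" or "name[extra]==version"; anything looser fails.
EXACT_PIN = re.compile(
    r"^[A-Za-z0-9_.\-]+(\[[A-Za-z0-9_,.\-]+\])?==[A-Za-z0-9_.+!\-]+$"
)


def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned to one exact version."""
    offenders = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not EXACT_PIN.match(line):
            offenders.append(line)
    return offenders


example = "example-framework==1.2.3\nexample-core>=1.0  # too loose\nexample-tools\n"
loose = unpinned_requirements(example)
```

This simple version also flags option lines such as "-r other.txt", so it suits a single flat requirements file best.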

Debugging through framework internals. When a tool call failed, the stack trace was 15 levels deep in LangChain classes before reaching our code. We wrote a custom exception handler to strip framework frames -- an adapter around the framework's error handling, which is the kind of meta-work you adopt a framework to avoid.
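A handler of that kind can be approximated with the standard traceback module. This is a sketch rather than the handler we shipped: it drops any frame whose file path contains a framework marker, and the marker list is an assumption.

```python
import traceback

# Path fragments treated as framework code; adjust to the installed packages.
FRAMEWORK_MARKERS = ("langchain", "langgraph")


def filter_frames(frames):
    """Keep only frames whose file path contains no framework marker."""
    return [
        frame for frame in frames
        if not any(marker in frame.filename for marker in FRAMEWORK_MARKERS)
    ]


def app_traceback(exc: BaseException) -> str:
    """Format an exception with framework frames removed from the stack."""
    frames = filter_frames(traceback.extract_tb(exc.__traceback__))
    header = f"{type(exc).__name__}: {exc}\n"
    return "".join(traceback.format_list(frames)) + header
```

One caveat: matching on path fragments will also hide application code that lives in a directory with "langchain" in its name, so a stricter version would match against site-packages paths.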

LCEL at scale. LangChain Expression Language was clean for simple chains. For complex nodes with conditional logic, retries, and multiple model calls, it added an abstraction layer to debug through without proportional benefit. We rewrote two of our eight nodes as direct Gemini API calls, bypassing LangChain entirely. They were easier to debug and faster to execute.

Memory was our problem. LangChain's built-in memory classes didn't fit. We needed per-conversation persistent state across WhatsApp sessions spanning days, with typed fields that survived serialization. Built our own on Redis. LangChain's memory abstractions were dead weight.
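The requirement that typed fields survive serialization is worth making concrete. In this sketch a plain dict stands in for Redis, and the StoredState fields and key format are hypothetical; the point is that a date goes in as a date and comes back as a date.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass
class StoredState:
    """Hypothetical subset of the per-conversation state."""
    property_id: Optional[str]
    check_in: Optional[date]
    guests: Optional[int]


def dump_state(state: StoredState) -> str:
    """Serialize to JSON, encoding the date as an ISO 8601 string."""
    raw = asdict(state)
    raw["check_in"] = state.check_in.isoformat() if state.check_in else None
    return json.dumps(raw)


def load_state(payload: str) -> StoredState:
    """Deserialize from JSON, restoring the date as a real date object."""
    raw = json.loads(payload)
    if raw["check_in"] is not None:
        raw["check_in"] = date.fromisoformat(raw["check_in"])
    return StoredState(**raw)


store: dict[str, str] = {}  # stand-in for a Redis client
original = StoredState(property_id="p-1", check_in=date(2025, 3, 7), guests=2)
store["conversation:example"] = dump_state(original)
restored = load_state(store["conversation:example"])
```

With a real Redis client the dict assignment and lookup become set and get calls, and the round trip stays the same.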

The Migration Numbers

Four weeks of parallel development, then a two-week A/B test.

P50 latency dropped by more than half after the LangGraph migration. P95 improved by a similar margin. Guest re-send rate (a proxy for "the bot feels broken") dropped over 40%.

We measured hallucination as any response containing a property claim (amenity, pricing, policy, availability) not grounded in the structured PMS tool output that informed it. Automated grounding checks in RankAndExplainOptions and PostBookingOps matched claims against tool output. Weekly sampled human review (~5% of conversations, stratified by property) caught what automation missed.

Under the earlier n8n workflow, where the model assembled property details from free-text message history, a notable fraction of responses in property-facing states contained ungrounded claims -- fabricated amenities most commonly, followed by wrong cancellation policies and conflated room-type pricing. After LangGraph, with typed state and deterministic tool calls feeding generation, the hallucination rate dropped by over 75%. Automated post-generation grounding checks, plus DeepEval's HallucinationMetric as a CI gate, locked in those gains.
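The core of a grounding check is a set comparison between what the response claims and what the tool returned. This sketch covers amenities only and assumes the claims have already been extracted from the response text, which is the harder step and is not shown. The function names and the "amenities" key are hypothetical.

```python
def ungrounded_amenities(claimed: set[str], tool_output: dict) -> set[str]:
    """Return claimed amenities that do not appear in the PMS tool output."""
    allowed = {amenity.lower() for amenity in tool_output.get("amenities", [])}
    return {claim for claim in claimed if claim.lower() not in allowed}


def hallucination_rate(flags: list[bool]) -> float:
    """Share of responses flagged as containing any ungrounded claim."""
    return sum(flags) / len(flags) if flags else 0.0


tool_output = {"amenities": ["Pool", "WiFi"]}
flagged = ungrounded_amenities({"pool", "spa"}, tool_output)
```

A response counts as hallucinated under the definition above as soon as this set is non-empty for any claim type.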

The hardest part wasn't code -- it was mapping the implicit business logic that had accumulated in the n8n workflow. Condition nodes, branch routing, per-property overrides that nobody had documented. Week one was just mapping the canvas into a spec before writing any LangGraph code.

Code-Switching: The Failure Nobody Tests For

"Bangalore ke paas koi silent hill type ka resort hai? 2 log, Friday se Sunday. Budget tight hai, par peaceful chahiye."

Three things broke at once. Language detection tagged it as English (enough English tokens to cross the threshold). The English-trained embedding model turned the Hindi portions into noise. "Silent hill" was taken literally -- no property by that name, so retrieval degenerated to random results near Bangalore.

Code-switched queries kept retrieving wrong-language results until we added language-aware retrieval. The fix lived in the pipeline before messages ever reached LangChain. A code-switching detector normalized mixed-language input, and geographic filters made "near Bangalore" a hard constraint instead of a soft embedding signal. Retrieval recall on code-switched queries roughly doubled.
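To give a feel for the detection step, here is a deliberately naive heuristic: count how many tokens are common romanized Hindi function words and flag the message when the share is neither near zero nor near one. The lexicon and both thresholds are assumptions for illustration, not the detector we built.

```python
import re

# Tiny illustrative lexicon of romanized Hindi function words. A real detector
# needs a far larger list or a trained classifier; some entries ("par", "se")
# collide with words in other languages.
HINDI_MARKERS = {"ke", "paas", "koi", "ka", "hai", "se", "log", "par", "chahiye"}


def hindi_ratio(text: str) -> float:
    """Fraction of alphabetic tokens found in the Hindi marker lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in HINDI_MARKERS for token in tokens) / len(tokens)


def is_code_switched(text: str, low: float = 0.15, high: float = 0.85) -> bool:
    """Flag messages that mix marker words with a substantial share of others."""
    return low <= hindi_ratio(text) <= high


message = ("Bangalore ke paas koi silent hill type ka resort hai? 2 log, "
           "Friday se Sunday. Budget tight hai, par peaceful chahiye.")
mixed = is_code_switched(message)
```

A flag like this is what lets the pipeline route the message to normalization instead of trusting a single-language label.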

This is the framework abstraction gap in practice. LangChain's retrieval chain assumed clean, single-language input. Real guests in India code-switch without thinking about it. The fix lived entirely outside the framework -- and that tells you something about where framework boundaries actually are.

PII: Infrastructure the Framework Can't Handle

"Just tell the model not to leak PII" was our first instinct. The real risk was multi-party: guests sharing UPI IDs with managers, managers sharing payment details with guests, support agents accessing conversations.

The pipeline: a layered PII architecture with ingress filtering, encrypted storage, and egress validation. The model only saw placeholders. Tools resolved real PII server-side. Langfuse traces were already scrubbed at ingestion.
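The placeholder idea can be sketched with two regular expressions and a vault. The patterns here are simplified assumptions (a loose name@handle shape for UPI IDs, ten-digit Indian mobile numbers with an optional +91) and will over-match things like email prefixes; a dict stands in for encrypted storage.

```python
import re

# Simplified patterns for illustration; production rules need to be stricter.
UPI_PATTERN = re.compile(r"\b[\w.\-]{2,}@[a-zA-Z]{2,}\b")
PHONE_PATTERN = re.compile(r"(?<!\d)(?:\+91[\s-]?)?[6-9]\d{9}(?!\d)")


def redact(text: str, vault: dict[str, str]) -> str:
    """Replace PII with placeholders and record the originals in the vault."""
    def swap(kind: str):
        def replace(match: re.Match) -> str:
            token = f"<{kind}_{len(vault) + 1}>"
            vault[token] = match.group(0)
            return token
        return replace

    text = UPI_PATTERN.sub(swap("UPI"), text)
    return PHONE_PATTERN.sub(swap("PHONE"), text)


def resolve(token: str, vault: dict[str, str]) -> str:
    """Server-side lookup used by tools; the model never calls this."""
    return vault[token]


vault: dict[str, str] = {}
safe_text = redact("Pay me at ravi@okhdfcbank or call 9876543210", vault)
```

The model and the traces only ever see safe_text, while a tool that needs the real value calls resolve on the server.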

Government ID images were never forwarded -- stored encrypted, every access logged. Audit logs surfaced property managers repeatedly asking guests for direct phone numbers. Small numbers, but the kind of behavior that erodes platform trust if unchecked.

PII handling lives entirely outside LangChain. It has to. Swapping frameworks, changing prompts, or refactoring chains should never affect PII governance.

If You're Building Today

Use LangGraph when your conversations have distinct phases, you need checkpointing for async flows (payments, human handoff), or you want state-based tool eligibility. This is the layer that earned its place.

Avoid AgentExecutor when conversations go past 5 turns, you need persistent structured state, or you're fighting tool selection accuracy. It's a prototype tool, not a production pattern.

Use direct API calls when a node is just "call the model, parse the response." Two of our eight nodes were simpler and faster without LangChain in the middle. If debuggability matters more than abstraction, skip the abstraction.

Use LiteLLM or a thin gateway for model switching across providers. You get the pluggability without the chain overhead.

Keep frameworks out of PII handling, side-effect orchestration, and anything where "the framework changed" should never be a reason for a production incident.

The Takeaway

LangGraph earned its place. LangChain's chain abstractions probably didn't. The state machine is the real architecture. The framework is scaffolding.

Most of the value in our production system came from what we built around the framework: typed state, gateway-level observability, PII infrastructure, side-effect separation. The framework made the first week faster. Everything after that was us.

If you're about to ship and want the full production checklist: AI Demo to Production: What Changes.