How SaaS Companies Should Be Billing for AI Features: Metering, Entitlements, and the Tools That Exist

Most SaaS billing systems can tell you what a customer paid. None of them could tell us that our most engaged customers were our least profitable — or that the billing layer we needed didn't exist in any off-the-shelf tool.

This post is a deep dive from our WhatsApp hotel booking case study. For the full per-tenant margin analysis and attribution pipeline that informed these decisions, see Per-Customer AI Cost Attribution.

Why AI Features Break Traditional SaaS Pricing

We had properties on a low monthly subscription where inference alone ate more than half the fee, and we didn't know which ones until we built the attribution pipeline.

Every LLM call has a real cost, and it varies wildly by how people use the feature. Some tenants send short, focused requests that resolve in two model calls. Others treat the same feature as a conversational partner, burning 20x the tokens on the same pricing tier. Without metering, your most engaged customers are your least profitable ones — and you won't know it until the margins don't add up.

We learned this firsthand. Our WhatsApp hotel booking system served hundreds of hotel properties on two pricing models, and the cost distribution was heavily skewed: the full breakdown showed subscription properties losing 30–50% of their fee to inference alone, and pay-as-you-go (PAYG) properties approaching online travel agency (OTA) commission rates on AI costs. Billing couldn't see any of it. We'll use "property" as our example throughout, but the pattern generalizes to any per-tenant AI system: copilots, support automation, enterprise tools.

The billing gap was not theoretical. It showed up in our margins within the first month. If your P90 usage is 2–3x your median, your top 10% of users drive more than 30% of cost, or AI spend is crossing 10–15% of per-tenant revenue, you're already past the point where you need metering. We were past it before we realized.
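One way to check those three thresholds against your own numbers is sketched below. It is a minimal sketch, not our pipeline: the function name `metering_signals`, the input shape (per-tenant AI cost and revenue for one period), the use of cost as a proxy for usage, and the specific cutoffs picked from the stated ranges (2x, 30%, 10%) are all assumptions for illustration.

```python
import math
from statistics import median


def metering_signals(cost_by_tenant: dict[str, float],
                     revenue_by_tenant: dict[str, float]) -> dict[str, bool]:
    """Evaluate three rule-of-thumb thresholds on per-tenant data for one period."""
    costs = sorted(cost_by_tenant.values())
    n = len(costs)
    if n == 0:
        return {
            "p90_over_2x_median": False,
            "top_10pct_over_30pct_of_cost": False,
            "ai_spend_over_10pct_of_revenue": False,
        }

    # Nearest-rank P90: smallest cost with at least 90% of tenants at or below it.
    p90 = costs[max(0, math.ceil(n * 9 / 10) - 1)]
    med = median(costs)

    # Share of total cost driven by the top 10% of tenants (at least one tenant).
    top_n = max(1, math.ceil(n / 10))
    total = sum(costs)
    top_share = sum(costs[-top_n:]) / total if total else 0.0

    # True if any tenant's AI spend is at or above 10% of its revenue.
    spend_heavy = any(
        revenue_by_tenant.get(tenant, 0) > 0
        and cost / revenue_by_tenant[tenant] >= 0.10
        for tenant, cost in cost_by_tenant.items()
    )

    return {
        "p90_over_2x_median": med > 0 and p90 >= 2 * med,
        "top_10pct_over_30pct_of_cost": top_share > 0.30,
        "ai_spend_over_10pct_of_revenue": spend_heavy,
    }
```

If any of the three flags comes back true, that is the cue to start metering before tuning prices.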

For the full numbers, the per-property P&L analysis, and the attribution pipeline that made this visible: Per-Customer AI Cost Attribution — Building the Margin Map.

Billing vs Entitlements: Two Layers, Not One

If you take one thing from this piece: billing is accounting, entitlements are control. AI systems need both, and most teams conflate them until it's expensive to untangle.

Billing answers: what happened, and how much do we charge? Razorpay, Stripe, Chargebee — these handle payments, invoices, and taxes. They process transactions after usage occurs.

Entitlements answer: what can this tenant do right now? Can this property's guests start another AI conversation? Has this property exhausted its token budget for the month? Should the agent switch to a cheaper model or shorter responses?

This gap hits hard in AI products because the decision about whether to serve a request — and at what quality — has to happen right now, at the point of the LLM call. A billing system that reconciles usage hours later can't enforce a budget that's already been blown.

In our system, the entitlement check ran at the gateway before each LLM call. The result decided what happened: proceed with the configured model, downgrade to a cheaper one if approaching the limit, or escalate to a human if over budget. Not a kill switch — graceful degradation that kept the guest experience intact while protecting margins.
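The three outcomes can be sketched as a small decision function. This is an illustration under stated assumptions, not our production code and not any vendor's API: the names `Budget`, `Action`, `check_entitlement`, and `pick_model` are hypothetical, and the 80% soft-cap ratio is a placeholder for wherever "approaching the limit" sits in your system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    PROCEED = "proceed"      # use the configured model
    DOWNGRADE = "downgrade"  # switch to a cheaper model
    ESCALATE = "escalate"    # hand the conversation to a human


@dataclass(frozen=True)
class Budget:
    limit_tokens: int  # token budget for the current billing period
    used_tokens: int   # tokens consumed so far in that period


def check_entitlement(budget: Budget, soft_cap_ratio: float = 0.8) -> Action:
    """Decide what the gateway should do before making an LLM call."""
    if budget.used_tokens >= budget.limit_tokens:
        return Action.ESCALATE
    if budget.used_tokens >= soft_cap_ratio * budget.limit_tokens:
        return Action.DOWNGRADE
    return Action.PROCEED


def pick_model(action: Action, configured: str, cheaper: str) -> Optional[str]:
    """Map the decision to a model name; None means route to a human."""
    if action is Action.PROCEED:
        return configured
    if action is Action.DOWNGRADE:
        return cheaper
    return None
```

Keeping the decision separate from the model choice means tiers can change which models they map to without touching the budget logic.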

If your AI features need different behavior per tier — model selection, turn limits, feature access — runtime entitlement gating isn't optional. Bolting it on later is harder than building it in from the start — we saw teams spend 3x the effort retrofitting what could have been a day-one design decision.

Three Pricing Patterns That Actually Work

Before evaluating tools, decide which pricing pattern fits your product. Everything downstream — tool selection, metering schema, entitlement logic — flows from this choice.

Subscription + soft caps. Fixed monthly fee, AI usage metered against a budget. When the tenant approaches the cap, degrade gracefully (cheaper model, shorter responses) rather than cutting off. This is what we ended up with for subscription properties. It gives revenue predictability while protecting margins.

Pure usage-based. Charge per token, per call, or per conversation. Simple, fair, scales with usage. The risk: unpredictable bills scare tenants, and your revenue is directly coupled to how much they use the feature. Works best when usage correlates with value delivered (each AI call saves the tenant measurable time or money).

Credit wallets. Tenants buy prepaid AI credits and draw them down. You get revenue upfront. They get cost predictability. When credits run low, they buy more or downgrade. Enterprise buyers care less about fairness and more about predictability — which is why credit wallets and soft caps show up disproportionately in enterprise deals.
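The draw-down mechanics of a credit wallet are simple enough to sketch in a few lines. The class below is a hypothetical in-memory illustration (the names `CreditWallet` and `InsufficientCredits` and the low-balance threshold are assumptions); a real wallet would also need persistence and atomic updates.

```python
class InsufficientCredits(Exception):
    """Raised when a draw-down would take the wallet below zero."""


class CreditWallet:
    def __init__(self, balance: int, low_balance_threshold: int) -> None:
        self.balance = balance
        self.low_balance_threshold = low_balance_threshold

    def top_up(self, credits: int) -> None:
        """Add prepaid credits purchased by the tenant."""
        if credits <= 0:
            raise ValueError("top-up must be positive")
        self.balance += credits

    def draw_down(self, credits: int) -> bool:
        """Deduct credits; return True when the wallet is now running low."""
        if credits <= 0:
            raise ValueError("draw-down must be positive")
        if credits > self.balance:
            raise InsufficientCredits(
                f"need {credits} credits, only {self.balance} available"
            )
        self.balance -= credits
        return self.balance <= self.low_balance_threshold
```

The boolean returned by `draw_down` is the hook for the "buy more or downgrade" prompt.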

Most production systems end up with a hybrid — we used subscription + soft caps for fixed-fee properties and effectively pure usage for pay-as-you-go. The tools below support different combinations.

The Tools: Evaluating Metering Platforms

Tool choice is reversible. Instrumentation is not. Get your gateway tagging and metering schema right first — everything below can be swapped later.

We evaluated four metering platforms plus native payment gateway billing and the in-house option. The key differentiators: runtime entitlement gating (can the tool enforce budgets at the moment of the LLM call, not just at billing time?), credit wallet support, multi-dimensional metering, and payment gateway compatibility.

We shipped with Stigg. It was the only platform we evaluated that combined runtime entitlement checks (sub-10ms via sidecar cache), native credit wallets, and Razorpay compatibility. The cost was low four figures annually, which is real money for early-stage teams. The runner-up had strong runtime gating but was Stripe-only, which was a blocker for us.

Building in-house. This would have meant Redis counters, a rules engine, and Razorpay for billing. We estimated 4–6 weeks for v1 plus ongoing maintenance. Integrating Stigg took about a week, which made it the pragmatic choice under deadline pressure.
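For a sense of what the counter piece of an in-house build involves, here is a minimal sketch. It uses an in-memory dict as a stand-in so the example runs on its own; in Redis the increment would typically be an `INCRBY` on the same key. The key format and the choice of a calendar month as the billing period are assumptions, not something we shipped.

```python
from collections import defaultdict
from datetime import datetime, timezone


class UsageCounters:
    """In-memory stand-in for per-tenant, per-period usage counters."""

    def __init__(self) -> None:
        self._counts: dict[str, int] = defaultdict(int)

    @staticmethod
    def key(tenant_id: str, at: datetime) -> str:
        # One counter per tenant per calendar month, e.g. "usage:prop_1:2025-03".
        return f"usage:{tenant_id}:{at:%Y-%m}"

    def add(self, tenant_id: str, tokens: int, at: datetime) -> int:
        """Increment the tenant's counter and return the new period total."""
        k = self.key(tenant_id, at)
        self._counts[k] += tokens
        return self._counts[k]


counters = UsageCounters()
march = datetime(2025, 3, 15, tzinfo=timezone.utc)
counters.add("prop_1", 100, march)
```

Putting the period in the key means a new month starts from zero without a reset job, which is one less thing to maintain.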

The Decision Tree: Which Tool for Which Situation

Already on Stripe or Razorpay with simple usage-based billing? Start with native metered billing. One metered dimension, straightforward per-unit price — your payment gateway handles this without adding another vendor.

Need runtime entitlements (gating AI features by tier in real time)? Look for a platform that supports entitlement checks at the point of the LLM call, not just billing reconciliation. Payment gateway compatibility matters here: not every platform supports every gateway.

Need credit wallets — prepaid AI usage that tenants purchase and draw down? Credit wallets give customers cost predictability and give you revenue upfront. Some platforms offer this natively; open-source options let you build it with full control.

Need maximum flexibility, and have the engineering capacity to match? Open-source billing platforms make sense if your billing model is genuinely novel. Budget the engineering time honestly: 4–6 weeks for v1 is typical.

Need to ship in a week? Lightweight experimentation tools or native billing. Get usage data flowing, learn from the numbers, and migrate to a more sophisticated tool when the data tells you what you actually need.
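The questions above are not mutually exclusive, so one way to encode them is as a list of independent rules that returns every recommendation that applies. The field names and the wording of each recommendation are illustrative assumptions; the sketch deliberately does not rank the options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    simple_usage_on_existing_gateway: bool = False
    runtime_entitlements: bool = False
    credit_wallets: bool = False
    novel_model_with_eng_capacity: bool = False
    ship_in_a_week: bool = False


def recommend(needs: Needs) -> list[str]:
    """Return every recommendation whose condition holds, in the order above."""
    rules = [
        (needs.simple_usage_on_existing_gateway,
         "native metered billing on your payment gateway"),
        (needs.runtime_entitlements,
         "platform with entitlement checks at the point of the LLM call"),
        (needs.credit_wallets,
         "platform with native credit wallets, or open source for full control"),
        (needs.novel_model_with_eng_capacity,
         "open-source billing platform (budget 4-6 weeks for v1)"),
        (needs.ship_in_a_week,
         "lightweight experimentation tool or native billing, migrate later"),
    ]
    return [advice for applies, advice in rules if applies]
```

For example, `recommend(Needs(runtime_entitlements=True, credit_wallets=True))` returns two entries, which is roughly the combination that narrowed our own shortlist.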

What to Meter Is Harder Than How to Meter

The hardest part isn't emitting events. It's choosing what you're measuring. Tokens are precise but unintelligible to customers — nobody wants an invoice denominated in tokens. Conversations are intuitive but hide massive variance (a 3-turn booking and a 20-turn concierge session look the same). We ended up tracking both: tokens internally for cost attribution and model routing decisions, conversations externally for what property managers saw in their usage summaries. Most systems need a dual unit — one for internal economics, one for customer communication.
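A dual unit falls out naturally if every LLM call is recorded once and aggregated two ways. The sketch below assumes a hypothetical `LlmCall` record carrying a tenant ID, a conversation ID, and a token count; the record shape and function names are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LlmCall:
    tenant_id: str
    conversation_id: str
    tokens: int


def internal_view(calls: Iterable[LlmCall]) -> dict[str, int]:
    """Tokens per tenant: the unit for cost attribution and routing."""
    totals: dict[str, int] = defaultdict(int)
    for call in calls:
        totals[call.tenant_id] += call.tokens
    return dict(totals)


def customer_view(calls: Iterable[LlmCall]) -> dict[str, int]:
    """Distinct conversations per tenant: the unit customers see."""
    conversations: dict[str, set[str]] = defaultdict(set)
    for call in calls:
        conversations[call.tenant_id].add(call.conversation_id)
    return {tenant: len(ids) for tenant, ids in conversations.items()}


calls = [
    LlmCall("prop_1", "conv_a", 300),   # short booking, two calls
    LlmCall("prop_1", "conv_a", 200),
    LlmCall("prop_1", "conv_b", 4000),  # long concierge session
    LlmCall("prop_2", "conv_c", 100),
]
```

Here `customer_view(calls)` reports two conversations for `prop_1`, while `internal_view(calls)` shows that one of them cost eight times the other in tokens, which is exactly the variance the customer-facing unit hides.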

Implementation: Wiring Metering Events from AI Features

The event flow, grounded in how our WhatsApp booking system worked in practice:

Step 1: User interaction arrives. The WhatsApp webhook receives a guest message, and the LangGraph pipeline picks it up in the appropriate graph state.

Step 2: Each LLM call passes through the gateway. The gateway attaches per-request metadata fields before forwarding to the model provider. This is the instrumentation point. If the gateway isn't the single choke point for every LLM call — including retries, fallbacks, and error recovery — your cost data will be wrong.

Step 3: Observability captures cost metadata. Langfuse logs every generation with full metadata context — token counts, latency, model used, cost. This is the data source for both debugging and cost attribution.

Step 4: Metering event emitted to the entitlement engine. From the same gateway metadata: tenant ID, feature ID (e.g., ai_conversation_tokens), token count consumed, model used. The entitlement engine aggregates against the tenant's budget for the current billing period. If retries and fallback calls don't emit metering events, your reported cost will silently drift from your actual spend.

Step 5: Billing reflects actual usage. PAYG tenants get invoice line items. Subscription tenants get usage dashboards and budget enforcement. The billing provider (Razorpay, Stripe) handles the money; the entitlement engine handles the access control.
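Steps 2 and 4 can be sketched together: a single gateway method that tags each call and emits one metering event per attempt, retries included. This is a minimal illustration, not our gateway. The `provider` callable is a hypothetical stand-in that returns a reply and a token count, the `attempt` field is an added assumption, and a failed attempt is recorded with zero tokens only because the stand-in reports no usage on failure (a real provider may still bill for it).

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class MeteringEvent:
    tenant_id: str
    feature_id: str  # e.g. "ai_conversation_tokens"
    tokens: int
    model: str
    attempt: int     # 1 for the first try, 2+ for retries (illustrative field)


@dataclass
class Gateway:
    # provider(model, prompt) -> (reply_text, tokens_used); hypothetical stand-in
    provider: Callable[[str, str], tuple[str, int]]
    emitted: list[MeteringEvent] = field(default_factory=list)

    def call(self, tenant_id: str, feature_id: str, model: str,
             prompt: str, max_attempts: int = 2) -> str:
        """Forward a prompt, emitting one event per attempt, retries included."""
        last_error: Optional[Exception] = None
        for attempt in range(1, max_attempts + 1):
            try:
                reply, tokens = self.provider(model, prompt)
            except Exception as exc:
                # Record the failed attempt too, so retries stay visible.
                self.emitted.append(
                    MeteringEvent(tenant_id, feature_id, 0, model, attempt)
                )
                last_error = exc
                continue
            self.emitted.append(
                MeteringEvent(tenant_id, feature_id, tokens, model, attempt)
            )
            return reply
        raise RuntimeError("all attempts failed") from last_error
```

Because the retry loop lives inside the gateway, there is no code path that reaches the provider without producing an event.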

Design your metering schema once. Everything else should evolve without touching it. When we shifted subscription tiers to account for AI consumption, we changed the aggregation and pricing rules in Stigg. The gateway, the event emission, the Langfuse traces — none of that moved. The instrumentation was stable. The business logic evolved around it.
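The separation described above can be illustrated by treating pricing rules as swappable functions over the same metered total. The rule names, rates, and included allowance below are made-up values for illustration; they are not our prices and not how any particular vendor models pricing.

```python
from decimal import Decimal
from typing import Callable

# A pricing rule turns a period's metered tokens into a charge.
PricingRule = Callable[[int], Decimal]


def flat_per_1k(rate: Decimal) -> PricingRule:
    """Charge every token at a flat rate per 1,000 tokens."""
    return lambda tokens: (Decimal(tokens) / 1000) * rate


def included_then_overage(included: int, rate: Decimal) -> PricingRule:
    """Include an allowance in the plan; charge only tokens beyond it."""
    return lambda tokens: (Decimal(max(0, tokens - included)) / 1000) * rate


# The metered total comes from the same events either way.
metered_tokens = 3000

old_rule = flat_per_1k(Decimal("0.50"))
new_rule = included_then_overage(2000, Decimal("0.50"))

old_charge = old_rule(metered_tokens)  # all 3,000 tokens are charged
new_charge = new_rule(metered_tokens)  # only the 1,000 over the allowance
```

Swapping `old_rule` for `new_rule` changes the charge without changing a single metering event, which is the property worth protecting.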

Customer-Facing Usage Dashboards

When your product has AI features with variable cost, tenants want to see what they're using and how it maps to what they're paying.

Our property managers lived in WhatsApp, so that is where usage visibility landed — not in a web dashboard. A weekly summary message covering conversation count, booking conversions, allocation percentage, and top guest topics. No token counts, no model names, no engineering jargon. Just conversations, bookings, and allocation status. For premium properties with credit wallets, the summary included credit balance and burn rate.
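A summary like that is mostly a formatting exercise over data you already have. The sketch below is hypothetical: the `WeeklyUsage` fields, the message wording, and the weeks-remaining estimate are illustrative assumptions, and sending the text over WhatsApp is out of scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WeeklyUsage:
    property_name: str
    conversations: int
    bookings: int
    allocation_used: float  # fraction of the monthly allocation, 0.0 to 1.0
    top_topics: list[str]
    credit_balance: Optional[int] = None    # premium properties only
    credits_per_week: Optional[int] = None  # recent burn rate


def format_summary(u: WeeklyUsage) -> str:
    """Render a plain-language weekly summary with no engineering units."""
    lines = [
        f"Weekly AI summary for {u.property_name}",
        f"Conversations: {u.conversations}",
        f"Bookings from conversations: {u.bookings}",
        f"Allocation used: {u.allocation_used:.0%}",
        f"Top guest topics: {', '.join(u.top_topics)}",
    ]
    if u.credit_balance is not None:
        lines.append(f"Credit balance: {u.credit_balance}")
        if u.credits_per_week:
            weeks_left = u.credit_balance / u.credits_per_week
            lines.append(
                f"Burn rate: {u.credits_per_week} credits/week "
                f"(about {weeks_left:.1f} weeks left)"
            )
    return "\n".join(lines)
```

Note what the dataclass leaves out: there is no field for token counts or model names, so they cannot leak into the message by accident.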

What worked for us: build the internal ops dashboard first (you need it for your own decisions), then build the customer-facing version as a simplified view of the same data. Same source, different lens. For enterprise deals, being able to show a prospect their projected AI consumption based on real usage patterns goes a long way.

What Metering Doesn't Tell You

Metering told us what we were spending. It didn't tell us which customers were worth keeping.

That required connecting cost to revenue at the tenant level — joining LLM spend from Langfuse with booking revenue and subscription fees, property by property, stage by stage. The per-tenant P&L that came out of it changed more product decisions than any AI feature we built.
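At its simplest, that join is a per-tenant subtraction. The sketch below assumes two hypothetical inputs already aggregated per tenant for one period (total revenue and total LLM cost); it ignores the stage-by-stage breakdown and any non-inference costs, so treat it as the shape of the calculation rather than the analysis itself.

```python
from typing import Optional, TypedDict


class MarginRow(TypedDict):
    tenant_id: str
    revenue: float
    llm_cost: float
    margin: float
    cost_share: Optional[float]  # llm_cost / revenue, None when revenue is 0


def tenant_margins(revenue: dict[str, float],
                   llm_cost: dict[str, float]) -> list[MarginRow]:
    """Join revenue and LLM cost per tenant, worst margin first."""
    rows: list[MarginRow] = []
    # Outer join: a tenant with cost but no revenue is the one to find.
    for tenant_id in sorted(set(revenue) | set(llm_cost)):
        rev = revenue.get(tenant_id, 0.0)
        cost = llm_cost.get(tenant_id, 0.0)
        rows.append({
            "tenant_id": tenant_id,
            "revenue": rev,
            "llm_cost": cost,
            "margin": rev - cost,
            "cost_share": cost / rev if rev else None,
        })
    rows.sort(key=lambda row: row["margin"])
    return rows
```

Sorting worst-first puts the tenants that need a pricing conversation at the top of the list.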

That's the next piece: Per-Customer AI Cost Attribution — Building the Margin Map. For the AWS infrastructure patterns underneath all of this: The AWS Infrastructure Checklist.