Here's something we tried recently that taught us a lot about both AI and billing APIs. We took Stripe's API reference documentation, fed it to an AI agent, and asked it to systematically explore every relevant combination of parameters to see what would actually happen with a specific issue we'd run into with the billing mode parameter.
The goal was simple. Stripe has two billing modes — and the way discounts, prorations, and pricing behave can differ significantly between them. The documentation covers the basics, but there are parameter combinations and interaction effects that aren't explicitly documented. We wanted to map the full terrain.
What we found was illuminating — both about billing system complexity and about how AI can (and can't) help you navigate it.
The Setup
The experiment was straightforward:
- Feed the docs. We gave the AI agent the complete Stripe Billing API reference — endpoints, parameters, enums, and the descriptions of how features like prorations, discounts, and billing cycles work.
- Define the exploration space. We asked it to focus specifically on discount calculations: how percentage discounts, fixed-amount coupons, backdating and prorations interact across the two billing modes.
- Generate test cases. The agent produced a matrix of every meaningful parameter combination — billing mode, discount type, timing of application, proration setting, and subscription state.
- Execute and observe. We ran the test cases against Stripe's API in a sandbox environment and collected the results.
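The matrix-generation step above can be sketched mechanically. The dimension names and values below are illustrative placeholders, not Stripe's exact enums:

```python
from itertools import product

# Illustrative parameter dimensions; the exact values Stripe accepts
# vary by API version, so treat these as placeholders.
dimensions = {
    "billing_mode": ["classic", "flexible"],
    "discount_type": ["percent", "fixed_amount"],
    "applied_at": ["subscription_start", "mid_cycle", "backdated"],
    "proration_behavior": ["create_prorations", "none"],
    "subscription_state": ["active", "trialing"],
}

# Cartesian product over every dimension: one test case per combination.
test_cases = [
    dict(zip(dimensions, combo)) for combo in product(*dimensions.values())
]

print(len(test_cases))  # 2 * 2 * 3 * 2 * 2 = 48 combinations
```

Even this toy space produces 48 cases, which is why the execution step has to be automated rather than done by hand.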
Results
The results fell into three categories: behaviors we could reverse-engineer beyond what the docs state, behaviors that genuinely surprised us, and failures silent enough to require real investigation.
Result 1: We could reverse-engineer details that weren't spelled out in docs
Stripe's documentation notes that discount calculations interact differently with prorations under the two billing modes, but it doesn't spell out exactly how. By running a number of permutations, we determined that in billing_mode = flexible, a single fixed-amount discount is split evenly across billing periods, and the post-discount rate is then applied pro-rata to the number of days in each period.
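A toy model of the behavior we inferred (our interpretation from observed outputs, not Stripe's published formula):

```python
def prorated_charge(period_price_cents, discount_cents, num_periods,
                    days_used, days_in_period):
    """Model of the flexible-mode behavior we observed (our inference,
    not a documented Stripe formula): the fixed discount is split evenly
    across periods, and the post-discount rate is prorated by days."""
    per_period_discount = discount_cents / num_periods
    post_discount_rate = period_price_cents - per_period_discount
    return round(post_discount_rate * days_used / days_in_period)

# A $100/month plan with a $30 coupon spread over 3 periods,
# 15 of 30 days used: (10000 - 1000) * 15/30 = 4500 cents.
print(prorated_charge(10000, 3000, 3, 15, 30))  # 4500
```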
Result 2: We found surprising differences in behavior
One undocumented facet of Stripe's behavior: proration discount calculations use 30.5 days per month in billing_mode = flexible, but 30 days in billing_mode = classic. For subscriptions worth thousands of dollars a month, that difference adds up.
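To make the gap concrete, here is a sketch of how the two divisors diverge on a hypothetical $3,000/month subscription (the price is illustrative; the 30.5 vs. 30 divisors are what we observed):

```python
def daily_rate_cents(monthly_price_cents, days_per_month):
    # Simple daily rate: monthly price divided by the mode's month length.
    return monthly_price_cents / days_per_month

monthly_price = 300_000  # a $3,000/month subscription, in cents

# The divisors we observed: 30.5 in flexible mode, 30 in classic mode.
flexible = daily_rate_cents(monthly_price, 30.5)
classic = daily_rate_cents(monthly_price, 30)

# A 10-day proration differs by roughly $16.39 between the two modes.
print(round(10 * classic - 10 * flexible))  # 1639 cents
```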
Result 3: We caught a subtle version incompatibility
When testing combinations with one-off line items added to subscriptions, we found that one of our test accounts was returning noticeably different numbers. Digging deeper, we realized the account had been pinned to an API version where price, rather than pricing, was the parameter that had to be passed. Because the mismatch failed silently rather than raising an error, it took careful investigation to diagnose.
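One way to guard against this class of mismatch is to branch on the pinned API version when building request payloads. The cutoff version string below is a placeholder, not a real Stripe version:

```python
# Hypothetical guard for the mismatch we hit: older pinned API versions
# expected `price` where newer ones expect `pricing`. The cutoff below
# is a placeholder, not an actual Stripe version string.
PARAM_RENAME_CUTOFF = "2025-01-01"

def line_item_payload(price_id, api_version):
    # Stripe API versions are ISO dates, so string comparison orders them.
    key = "pricing" if api_version >= PARAM_RENAME_CUTOFF else "price"
    return {key: price_id}

print(line_item_payload("price_123", "2024-06-01"))  # {'price': 'price_123'}
print(line_item_payload("price_123", "2025-06-01"))  # {'pricing': 'price_123'}
```

The broader point: pin and record the API version per account, and make version-dependent payload differences explicit in code rather than discovering them through silently wrong totals.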
What This Tells Us About AI and Billing
The experiment gave us a clearer picture of what AI is genuinely good at in the billing domain — and where it falls short.
AI is excellent at systematic exploration
No human is going to sit down and manually test every combination of discount type, billing mode, proration setting, and subscription state. The parameter space is too large, and the testing is too tedious. An AI agent can generate and organize these test matrices quickly and thoroughly. It's a genuine force multiplier for QA and edge-case discovery.
AI is not a substitute for domain judgment
The agent could tell us what happened for each parameter combination, but it couldn't reliably tell us why, or what the right behavior ought to be. We could easily tabulate how amounts differed across scenarios, but reverse-engineering the underlying math took good old-fashioned human reasoning. "Is this proration amount right?" is a question that requires understanding your contract terms, your customer's billing history, and sometimes the intent behind a particular API design choice. That's domain knowledge, not pattern recognition.
One-shot prompting isn't enough for complex APIs
You can't feed an API reference to an AI and ask it to "figure out all the edge cases" in a single prompt. The exploration needs to be structured: start with a specific area (discounts), define the parameter space, generate test cases, run them, analyze results, and iterate. Each step builds on the previous one. It's a methodical process that happens to be accelerated by AI, not a magic trick.
Practical Takeaways for Finance and Engineering Teams
If you're building on top of billing APIs (or evaluating tools that do), here's what we'd take away from this experiment:
1. Never assume the docs tell the whole story
Even the best-documented APIs — and Stripe is genuinely one of the best — have behaviors that only reveal themselves through systematic testing. If you're building billing logic that depends on specific behaviors around prorations, discounts, or plan changes, test the actual API output. Don't just read the docs and assume.
2. Use AI to explore, not to decide
AI is a fantastic tool for generating test cases, mapping parameter spaces, and finding anomalies. It's not a reliable tool for determining whether a particular billing behavior is correct for your business context. Use it to accelerate your investigation, but keep humans in the loop for interpretation and decision-making.
3. Edge cases compound
The most dangerous edge cases aren't the ones involving a single parameter. They're the ones where multiple features interact: discount + proration + billing cycle + plan change. If you're only testing individual features in isolation, you're missing the cases most likely to cause production issues.
4. Invest in billing expertise, not just billing tools
This is perhaps the most counterintuitive takeaway. In a world where AI can write integration code, the bottleneck isn't writing the code — it's knowing what the code should do. Deep knowledge of billing platforms, their quirks, and their interaction patterns is more valuable than ever, not less.
The meta-lesson
You can use AI for faster development, but you cannot substitute it for deep knowledge of edge cases, or for the judgment to know when you need to hunt those edge cases down. The combination of AI-powered exploration and human billing expertise is significantly more powerful than either alone.
What We're Doing With This
This experiment reinforces how we think of connectors at Finrite. Rather than hand-coding integration logic based on documentation alone, we now use AI-assisted testing to systematically explore the behavior of every billing platform we integrate with. This lets us catch edge cases before our customers encounter them, rather than after.
But the AI is the tool, not the decision-maker. Every edge case it uncovers gets reviewed by engineers and product folk who understand the billing domain, validated against real customer scenarios, and encoded into deterministic logic that we can guarantee will behave correctly in production.
It's the blend of systematic AI exploration and deep human expertise that makes the difference. And if you're building anything that touches billing, we'd encourage you to adopt a similar approach — or work with someone who already has.
Billing connectors tested against every edge case
We've already done the systematic exploration of Stripe, HubSpot, and Salesforce APIs so you don't have to.
Chat with us