Senior Product Manager, Payments · B2B SaaS platform · Entertainment & leisure · 5 countries · 2025

How a payment system went from crisis to 99% uptime

99.19%

global success rate

-25%

post-release incidents

Role

Senior Product Manager, Payments

Period

2025

Industry

B2B SaaS · Payments · Enterprise

Customers were being double-charged. Refunds weren't going through. The system broke during the busiest hours. There were 44 open problems and everyone wanted them all fixed now. I found the 4 that were actually costing money, fixed those in 6 weeks, and the system went from broken to 99%+ reliable.

KEY INSIGHT

The instinct in a crisis is to fix and grow at the same time. It doesn't work. The first real product decision is to stop making the hole deeper.

The context

A product that handled real money for hundreds of businesses across five countries. It was the company's fastest-growing revenue line — 42% of total transaction volume, 25% year-over-year growth, the only provider trending upward.

Then the growth outran the product. The original version had been built for a handful of early customers. Now it was running at a much larger scale, and the gap between "technically works" and "customers can rely on it" was impossible to ignore.

The challenge

Over 60% of recently onboarded customers were either unhappy or had stopped using the product. Complaints were piling up. Businesses already using it were seeing double charges, failed refunds, and the system going down during their busiest hours. Every one of these problems cost them real money.

The deeper problem: new customers were still being added at full speed while existing ones were struggling. Every new unhappy customer meant more complaints, more reputation damage, more pressure on the support team. The product had outgrown what it could handle, and growth was making everything worse.

Nobody had a clear picture of what was actually broken, how bad each problem was, or who was in charge of fixing it.

The approach

The first decision was to stop scaling the problem. Growth features went on hold. Customer acquisition slowed down. The entire team shifted to a single priority: make the entertainment centers already in production work reliably. This required aligning sales leadership, engineering, and executive stakeholders around the same data. The customer satisfaction numbers made the case.

The first decision is to stop making the hole deeper. Everything else follows from that.

Then came triage. 44 open items sat in the backlog, compiled from engineering, sales, and support with no ranking. The sorting criterion was simple: is this costing a center money right now? That split the backlog into three tiers. 12 items were critical to payments operations. Of those, 4 had immediate financial impact on live centers: double charges, payment timeouts, broken split payments, missing refunds. The remaining 32 were real issues, but they could wait.

The 4 highest-impact items got one owner each, a deadline, and a published commitment. All four shipped within six weeks. Zero complaints on any of them post-deployment.

A single clear sorting criterion gave the team more clarity than any amount of discussion.

While the team stabilized, I dug into a problem others had classified as unsolvable. A major center was about to cancel the integration because they couldn't reconcile their books. Every transaction appeared as "uncategorized income" in the Square dashboard. No breakdown by terminal, no source attribution, no way to close the books.

The root cause sat in the Square API integration: the field where we stored transaction metadata didn't preserve enough information. I traced the data model until I found `external_payment_type`, an existing parameter that could carry source attribution without architectural changes on either side. I wrote the implementation spec and handed it to the team. The center stayed and became a reference account.
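
As a sketch of what the spec pointed at, not the implementation itself: Square's standard create-payment call (`POST /v2/payments` with `source_id`, `idempotency_key`, `amount_money`) can carry an attribution value per terminal. The exact placement of `external_payment_type` below follows this write-up and is an assumption, not the production spec.

```python
import uuid
import requests

# Illustrative sketch only: shows how a per-terminal attribution value could
# ride along on payment creation so revenue no longer lands as
# "uncategorized income". Placement of `external_payment_type` is an
# assumption based on the case study, not the actual implementation spec.
SQUARE_PAYMENTS_URL = "https://connect.squareup.com/v2/payments"

def create_payment_with_attribution(access_token: str, source_id: str,
                                    amount_cents: int, currency: str,
                                    terminal_label: str) -> dict:
    payload = {
        "source_id": source_id,                   # tokenized payment source
        "idempotency_key": str(uuid.uuid4()),     # guards against double charges on retry
        "amount_money": {"amount": amount_cents, "currency": currency},
        # Attribution travels with the payment itself, so the dashboard can
        # break revenue down by terminal instead of one uncategorized lump.
        "external_payment_type": terminal_label,  # hypothetical placement
    }
    resp = requests.post(
        SQUARE_PAYMENTS_URL,
        json=payload,
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```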

Scope discipline mattered as much as speed. Over the full year, 23 of the 44 items shipped. The other 21 were deliberate "not now" decisions, each with a documented rationale. The 12 critical payments items were brought down to 5 remaining, with every resolved issue validated in production. When the refund feature wasn't solid enough before the holiday peak, I pulled it back and ran a full testing day rather than ship broken logic to hundreds of centers during their busiest week.

Parallel to stabilization, two compliance threads ran. I proposed a scope reduction for PCI DSS certification by presenting the certifiers with a three-channel payment architecture, removing significant recurring penalty risk. I also completed the FreedomPay technical certification end-to-end, which unlocked a new segment of enterprise centers.

Communication was deliberate. Structured weekly recaps, operational memos for urgent decisions, meeting minutes published within hours. When the support team needed to help customers with issues that weren't fixed yet, I built video tutorials covering workarounds. Bridge solutions while the real fixes were in the pipeline.

The results

-12%

processing times across all payment flows

99.19%

global success rate. 100% uptime against a 99% target

-25%

post-release incidents

25%

YoY growth for the Square integration, the only payment provider still growing

116

centers integrated. Onboarding pace of 5 new centers per week

23/44

backlog items shipped. Critical payments issues reduced from 12 to 5 remaining

What I learned

The instinct in a crisis is to fix and grow at the same time. It doesn't work. The first real product decision is to stop making the hole deeper. Everything else follows from that.

The second thing: 44 items, 12 critical, 4 urgent. Most of what felt urgent could actually wait. Knowing which ones can't is the actual skill. A single clear sorting criterion gave the team more clarity than any amount of discussion.

The third: when a problem gets labeled "unsolvable" or "waiting on the partner," it's worth checking whether someone has actually read the API documentation. The source attribution fix was already there. It just needed someone willing to look.

Transferable patterns

Crisis → Freeze → Triage → Fix → Resume

When a system is on fire, the first instinct is to fix everything at once. The pattern that works: freeze scope, triage ruthlessly (we went from 44 issues to 4 priorities), fix what matters, then resume growth.

Metrics as triage tool, not dashboard decoration

The 99.19% success rate only mattered because it came from instrumenting every failure mode. The dashboard wasn't for executives; it was for the on-call engineer at 2am.
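
As a rough illustration of that kind of instrumentation (the event shape and field names here are hypothetical, not the actual pipeline), the headline figure is just every payment attempt counted by its failure mode:

```python
from collections import Counter

def payment_health(events: list[dict]) -> dict:
    """Aggregate raw payment events into what an on-call engineer needs:
    overall success rate plus a per-failure-mode breakdown for triage.
    The event shape ({"status": ..., "failure_mode": ...}) is illustrative."""
    total = len(events)
    failures = Counter(
        e["failure_mode"] for e in events if e["status"] != "SUCCESS"
    )
    succeeded = total - sum(failures.values())
    return {
        "success_rate": round(100 * succeeded / total, 2) if total else None,
        # Sorted so the biggest failure mode is the first thing on screen at 2am.
        "failure_modes": failures.most_common(),
    }

# Example: 3 of 4 attempts succeed -> 75.0% success rate, one gateway timeout.
print(payment_health([
    {"status": "SUCCESS", "failure_mode": None},
    {"status": "SUCCESS", "failure_mode": None},
    {"status": "FAILED", "failure_mode": "gateway_timeout"},
    {"status": "SUCCESS", "failure_mode": None},
]))
```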

Scope discipline under pressure

Every stakeholder had a pet fix. The pattern: if it doesn't directly reduce failure rate, it waits. Saying no to 40 items made fixing the 4 that mattered possible.

Post-crisis architecture > pre-crisis architecture

The payment system after the rescue was better than the one before the crisis. Crises expose architectural debt that normal operations hide.

These cases tell the how. For the who, there's the about page. If you want to talk, write me.