Engineering blog · Automation COE

When your vendor's API goes dark: building bots that adapt.

Last month, our HR onboarding pipeline turned red at 03:17 Singapore time. Our vendor had deprecated an endpoint we had been using for three years. The deprecation notice arrived the following Tuesday. By that point, our bots had already switched to a UI-based fallback, processed the overnight backlog, and logged a ticket asking a human to pick a longer-term path. Nobody lost a work day. Nobody got an escalation. This is the pattern that made that possible.

The old way: API-first, API-only

For our first few years of RPA we treated API integration as the finish line. If a system had an API, we used the API. If it did not have an API, we used UI automation. We never mixed the two for the same integration, because mixing felt like admitting defeat.

This worked beautifully until it did not. The first time a vendor silently changed response pagination, twelve of our bots failed simultaneously at 2am. The second time, a vendor turned off an endpoint and replaced it with a "modern equivalent" that required a full reauth flow. The third time, a vendor issued a 30-day deprecation notice on something we had assumed was stable, and we spent two weeks rewriting integrations instead of shipping anything new.

At some point around our 200th bot, we stopped treating API drift as an edge case. It was becoming the norm. We had to build a pattern for it, not react to it.

The hybrid pattern, in three paragraphs

Every integration ships with two paths. The primary path calls the vendor API. The secondary path does the same work through UI automation, using the vendor's web application as a human would. Both paths produce the same typed output. Both paths emit the same telemetry.
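
A minimal Python sketch of that contract, under stated assumptions: the names OnboardingResult, ApiPath, UiPath, and run_with_telemetry are hypothetical, and the bodies are stubs standing in for the real vendor call and the real browser automation. It only shows the shape: two interchangeable paths, one result type, one telemetry event format.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class OnboardingResult:
    # Hypothetical typed output; the real payload fields are not given here.
    employee_id: str
    account_created: bool


class IntegrationPath(Protocol):
    name: str

    def run(self, employee_id: str) -> OnboardingResult: ...


class ApiPath:
    name = "api"

    def run(self, employee_id: str) -> OnboardingResult:
        # Stub: a real implementation would call the vendor API here.
        return OnboardingResult(employee_id=employee_id, account_created=True)


class UiPath:
    name = "ui"

    def run(self, employee_id: str) -> OnboardingResult:
        # Stub: a real implementation would drive the vendor's web app here.
        return OnboardingResult(employee_id=employee_id, account_created=True)


def run_with_telemetry(
    path: IntegrationPath,
    employee_id: str,
    emit: Callable[[dict], None],
) -> OnboardingResult:
    """Run one path and emit an event with the same keys whichever path ran."""
    try:
        result = path.run(employee_id)
    except Exception:
        emit({"path": path.name, "employee_id": employee_id, "ok": False})
        raise
    emit({"path": path.name, "employee_id": employee_id, "ok": True})
    return result
```

Because both classes return the same frozen dataclass, results from either path compare equal, and dashboards built on the emitted events need no per-path special cases.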

The bot picks which path to use at the start of each run. The selection logic is dumb on purpose. If the last three API calls failed or returned unexpected schemas, fall back to UI for the next hour, then try the API again. If the UI path fails too, escalate to a human-in-the-loop task. When both paths are healthy, prefer the API.
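
The rule above can be sketched in a few lines of Python. This is an illustration, not the production selector: PathSelector, run_once, and the escalate callback are hypothetical names, and two details are assumptions the rule does not specify (the failure counter resets after a trip, and a failed API run is re-raised for the orchestrator to retry).

```python
import time
from dataclasses import dataclass
from typing import Callable

FAILURE_THRESHOLD = 3        # "the last three API calls failed"
FALLBACK_SECONDS = 60 * 60   # "fall back to UI for the next hour"


@dataclass
class PathSelector:
    clock: Callable[[], float] = time.monotonic
    consecutive_api_failures: int = 0
    ui_until: float = 0.0

    def choose(self) -> str:
        """Pick the path at the start of a run; prefer the API when healthy."""
        return "ui" if self.clock() < self.ui_until else "api"

    def record_api_result(self, ok: bool) -> None:
        """Count consecutive failures (errors or unexpected schemas)."""
        if ok:
            self.consecutive_api_failures = 0
            return
        self.consecutive_api_failures += 1
        if self.consecutive_api_failures >= FAILURE_THRESHOLD:
            self.ui_until = self.clock() + FALLBACK_SECONDS
            # Assumption: start counting from zero once the hour is over.
            self.consecutive_api_failures = 0


def run_once(selector, api_call, ui_call, escalate):
    """Run one unit of work on the selected path."""
    if selector.choose() == "api":
        try:
            result = api_call()
        except Exception:
            selector.record_api_result(ok=False)
            raise  # Assumption: the orchestrator retries the run.
        selector.record_api_result(ok=True)
        return result
    try:
        return ui_call()
    except Exception as exc:
        escalate(exc)  # Hand off to a human-in-the-loop task.
        return None
```

Injecting the clock keeps the one-hour window testable without sleeping, which matters for a rule whose whole value is that it stays boring.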

That is it. There is no clever machine learning on top. We tried machine learning on top. It was clever and it was wrong in ways we could not debug. The dumb rule has been running for eighteen months without needing a change.

Why UI automation is not a downgrade

There is a common belief that UI automation is a worse form of integration. Slower. More brittle. Less elegant. Easier to break. I held this belief for a long time myself. It is half right and entirely missing the point.

UI automation is slower. That is true. Our UI path runs three to five times slower than the equivalent API path. For a bot that handles fifty records a minute through the API and only ten through the UI, that matters. For a bot that processes eight records a night, it does not matter at all.
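
To put numbers on that, here is a small worked calculation in Python using the fifty and ten records a minute quoted above. The 3,000-record batch is a made-up figure for contrast, not a measurement.

```python
API_RATE = 50.0  # records per minute on the API path (figure from the text)
UI_RATE = 10.0   # records per minute on the UI path (figure from the text)


def minutes_to_process(records: int, records_per_minute: float) -> float:
    """Wall-clock minutes needed at a steady processing rate."""
    return records / records_per_minute


def ui_penalty_minutes(records: int) -> float:
    """Extra minutes a run costs when it falls back to the UI path."""
    return minutes_to_process(records, UI_RATE) - minutes_to_process(records, API_RATE)


nightly = ui_penalty_minutes(8)      # the eight-records-a-night bot
bulk = ui_penalty_minutes(3000)      # hypothetical high-volume batch
```

At these rates the nightly bot loses well under a minute on the UI path, while the hypothetical 3,000-record batch loses four hours. The same slowdown is irrelevant for one and decisive for the other.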

UI automation is not more brittle. It is brittle in different ways. APIs break when the vendor changes a schema, a rate limit, or an auth flow. UIs break when the vendor changes a label, a button position, or a modal flow. The failure modes are different, but the failure rates, in our experience over four years, are not meaningfully different. What is different is the warning.

API changes often arrive silently. UI changes almost always arrive visibly. That asymmetry is the whole reason the hybrid pattern works.

When a vendor changes their UI, someone on your team notices within days because it looks different. When a vendor changes their API, you find out when your 3am pipeline goes red. The hybrid pattern lets the API path do the heavy lifting while the UI path acts as an always-on canary and an always-ready fallback.

The HR onboarding story

Our HR onboarding flow is the poster child for this pattern. It connects to six different vendor systems, including two that have changed hands through acquisitions in the last eighteen months. One of those was the vendor that deprecated on us last month.

Here is what happened in production. Our API path started returning 410 Gone at 03:17. The circuit breaker tripped after three consecutive failures. The bot switched to UI mode. The UI path logged in, navigated to the screen for the same operation the API call had performed, submitted the same payload as a form, and moved on. The workflow continued. The new hires scheduled to start that morning got their laptop credentials on time. None of them ever knew.
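
What counts as a failure for the breaker is worth pinning down. A hedged sketch in Python: is_api_failure and EXPECTED_FIELDS are hypothetical names, and the field list is invented for illustration. The idea is that a non-2xx status such as 410 Gone and a 200 response with the wrong shape both count the same way.

```python
from typing import Any, Mapping

# Hypothetical required fields and types for one endpoint's response body.
EXPECTED_FIELDS: Mapping[str, type] = {"employee_id": str, "status": str}


def is_api_failure(status_code: int, body: Any) -> bool:
    """Return True if a response should count toward tripping the breaker."""
    # Any non-2xx status, including 410 Gone, is a failure.
    if not 200 <= status_code < 300:
        return True
    # A 2xx response that is not a JSON object is an unexpected schema.
    if not isinstance(body, dict):
        return True
    # A missing or wrongly typed field is also an unexpected schema.
    return any(
        not isinstance(body.get(name), expected)
        for name, expected in EXPECTED_FIELDS.items()
    )
```

Treating schema drift like an outage is what catches the silent changes, such as a vendor altering pagination while still returning 200.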

0: workflows lost to the deprecation
4h: from first failure to root-cause ticket
14d: runway we bought to plan the real fix

Meanwhile, the incident created a ticket that landed on a human engineer's desk with all the context attached. That engineer spent a week evaluating whether to rewrite the integration against the vendor's new endpoint, switch vendors, or keep running on the UI path indefinitely. We chose to rewrite. The rewrite went live with both paths, naturally.

What our coding agent does here

We use a code-generating agent to author and maintain bots. When an integration fails, the agent does not rewrite it autonomously. That is not the point. The point is that when a human engineer picks up the ticket, the agent has already drafted the proposed change, generated test fixtures from the last successful API response, and opened a pull request with a diff.

The human reviews the diff, tweaks what needs tweaking, and merges. We have measured this. Our COE ships integration fixes three to four times faster than before, and the fix quality, measured by regressions per hundred changes, has actually improved. Not because the agent is a better engineer than our humans. Because the agent does the tedious part of writing boilerplate fast enough that the human can spend their attention on the interesting part.

The coding agent is a pair programmer, not a magic button. Our engineers review every change. Together they ship faster than either alone.

Three things we would do differently if we started now

Start with the hybrid pattern from bot #1. We retrofitted it at around bot #200. Retrofitting meant re-authoring a lot of code we could have gotten right the first time.

Treat UI automation as a first-class citizen in your authoring experience. If writing the UI path feels like a chore, your engineers will skip it. Tooling nudges matter.

Invest in typed outputs early. The fact that both paths produce the same typed payload is the only reason the rest of the workflow does not care which one ran. This is a boring infrastructure decision that pays back forever.
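
A small Python sketch of what that buys you. Everything named here is hypothetical (NewHire, from_api, from_ui, provision_laptop, the field names, and the date formats); the assumption is only that the API returns JSON with ISO dates while the UI shows day-first dates in a table.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NewHire:
    employee_id: str
    start_date: date


def from_api(payload: dict) -> NewHire:
    # Assumes the API returns camelCase keys and ISO 8601 dates.
    return NewHire(
        employee_id=str(payload["employeeId"]),
        start_date=date.fromisoformat(payload["startDate"]),
    )


def from_ui(cells: dict) -> NewHire:
    # Assumes the screen shows labelled cells and DD/MM/YYYY dates.
    day, month, year = (int(part) for part in cells["Start date"].split("/"))
    return NewHire(
        employee_id=cells["Employee ID"].strip(),
        start_date=date(year, month, day),
    )


def provision_laptop(hire: NewHire) -> str:
    # Downstream step: sees only NewHire, never the raw API or UI data.
    return f"laptop for {hire.employee_id} by {hire.start_date.isoformat()}"
```

Each path owns its own parsing quirks, and everything after the boundary is written once against NewHire.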

What is next

We are experimenting with letting the agent choose which fallback to draft when it sees a new kind of vendor drift. Not choose which path to take at runtime. Just draft the proposed code change when the signal is clear. Early days, but promising. We will write a follow-up when we have shipped a few.

If you are curious about the orchestration layer that sits above these bots, my colleague Rohan has written a post, well worth a read, about how we chose two authoring experiences on the same runtime.
