When our HR vendor deprecated an API endpoint without a meaningful notice window, our onboarding pipeline did not fall over. A pattern we had invested in two years earlier paid off in a single production incident. This is the full story.
Accrual Services, our internal shared services unit, runs HR onboarding for every new Accrual hire globally. That is about 2,400 new hires a year across 28 countries. The onboarding flow touches six vendor systems and requires actions to complete in a specific sequence, often across time zones. When it works, new hires have laptops, credentials, and a manager meeting on their calendar before their first coffee on day one. When it does not work, new hires show up to their first day and cannot log in.
By early 2025, our onboarding flow had grown to the point where vendor-side changes were causing more production incidents than any other process in Services. We had 14 incidents in a single quarter, all of them triggered by upstream vendor changes we did not control. Several of them resulted in new hires starting without credentials. In one memorable case, 23 new hires across two offices spent their first morning unable to log in because a partner system had silently changed its authentication flow overnight.
The business impact was bad. New hire time-to-productivity was slipping. Manager satisfaction scores on onboarding had dropped from an average of 8.4 to 6.7 year over year. HR was spending the equivalent of two full-time people every month manually unblocking stuck onboardings.
The engineering impact was worse. Every incident consumed platform engineers during our highest-intensity hours. We were spending roughly 30 percent of platform engineering time on onboarding fire drills. That is an enormous tax on a team of eight.
We rebuilt the onboarding flow around a hybrid automation pattern: every integration ships with two paths against the same vendor system, a typed API client and a UI automation client. A runtime circuit breaker picks which path to use based on recent reliability. Both paths emit the same telemetry and produce the same typed output.
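To make the shape of the pattern concrete, here is a minimal sketch in TypeScript. The names (VendorClient, ProvisionResult, provisionAccount) are illustrative, not our internal interfaces; the point is that both paths satisfy the same typed contract, so the caller never knows which one served the request.

```typescript
// Both paths return the same typed result, so callers never know
// (or care) which path served the request.
interface ProvisionResult {
  employeeId: string;
  accountId: string;
  status: "provisioned" | "pending";
}

interface VendorClient {
  provisionAccount(employeeId: string): Promise<ProvisionResult>;
}

interface CircuitBreaker {
  isOpen(): boolean;
  recordSuccess(): void;
  recordFailure(): void;
}

class ApiClient implements VendorClient {
  async provisionAccount(employeeId: string): Promise<ProvisionResult> {
    // Fast path: typed call against the vendor's REST API.
    const res = await fetch("https://vendor.example/api/v2/accounts", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ employeeId }),
    });
    if (!res.ok) throw new Error(`vendor API returned ${res.status}`);
    return (await res.json()) as ProvisionResult;
  }
}

class UiClient implements VendorClient {
  async provisionAccount(employeeId: string): Promise<ProvisionResult> {
    // Slow path: drives the vendor's web UI with a browser driver.
    // The automation steps are elided here; what matters is that it
    // produces the exact same typed result as the API path.
    const accountId = await this.completeProvisioningForm(employeeId);
    return { employeeId, accountId, status: "provisioned" };
  }

  private async completeProvisioningForm(employeeId: string): Promise<string> {
    throw new Error("browser automation elided in this sketch");
  }
}

class HybridIntegration implements VendorClient {
  constructor(
    private api: VendorClient,
    private ui: VendorClient,
    private breaker: CircuitBreaker,
  ) {}

  async provisionAccount(employeeId: string): Promise<ProvisionResult> {
    // The breaker picks the path based on recent reliability.
    const client = this.breaker.isOpen() ? this.ui : this.api;
    try {
      const result = await client.provisionAccount(employeeId);
      if (client === this.api) this.breaker.recordSuccess();
      return result;
    } catch (err) {
      if (client === this.api) {
        // Record the failure and serve this request from the UI path.
        this.breaker.recordFailure();
        return this.ui.provisionAccount(employeeId);
      }
      throw err;
    }
  }
}
```

The design choice that matters is that HybridIntegration itself implements VendorClient, so a path switch is invisible to the orchestration layer above it.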
This pattern was already running in parts of Retail and Capital. Services adopted it aggressively during the rebuild because the vendor landscape in HR technology is the messiest we deal with. Acquisitions, rebrands, and authentication model shifts are almost monthly events. We needed every integration to be able to survive a bad day.
Three weeks after the rebuild went live, one of our HR vendors deprecated an endpoint we had been relying on for years. We learned about the deprecation when the endpoint started returning 410 Gone to our API path at 03:17 Singapore time. The circuit breaker tripped after three consecutive failures. The bot switched to the UI path. Onboarding continued.
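The trip logic itself can be tiny. Below is a sketch of a consecutive-failure breaker matching the behavior just described: three consecutive API failures open it, and traffic stays on the fallback until a cool-down elapses. The cool-down policy is an assumption for illustration, not a description of our production breaker.

```typescript
// Consecutive-failure circuit breaker: opens after a fixed run of
// failures, then lets the primary path be probed again after a cool-down.
// Implements the CircuitBreaker interface from the sketch above.
class ConsecutiveFailureBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold = 3,            // failures before tripping
    private readonly cooldownMs = 15 * 60_000, // time on the fallback path
  ) {}

  isOpen(): boolean {
    if (this.openedAt === null) return false;
    // After the cool-down, report closed so the primary path gets probed
    // again (a simple half-open policy).
    return Date.now() - this.openedAt < this.cooldownMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) {
      this.openedAt = Date.now();
    }
  }
}
```

With a threshold of 3, the 03:17 sequence above is exactly three recordFailure calls, after which every request routes to the UI path.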
By 07:30 the same morning, our on-call platform engineer had a ticket with a full diagnosis, a proposed remediation, and a draft pull request from our coding agent suggesting how to update the API integration to the vendor's replacement endpoint. The engineer reviewed, adjusted, and merged. By the end of the day, the API path was back online against the new endpoint, with the UI path retained as the long-term fallback.
No new hires were impacted. No incident was reported to the business. The entire event registered in our system as a routine ticket.
New hires are productive faster. Time-to-productivity fell 41 percent, driven as much by the elimination of blocked onboardings as by any individual step getting faster. When an onboarding does not get stuck, it closes on schedule.
HR is no longer unblocking individual cases. The two-FTE equivalent spent on manual unblocking dropped to roughly 10 percent of one person's time, reserved for genuinely unusual scenarios that the automation could not handle.
Platform engineering is no longer firefighting. Our platform time spent on Services onboarding incidents dropped from 30 percent to under 5 percent. That time has been reallocated to case management and orchestration work.
Our code-generating agent played a specific role in the rebuild and continues to play one in maintenance. During the rebuild, the agent drafted the initial UI automation path for each integration, using the API contract as a specification. Our engineers reviewed, tested, and shipped. This sped up the build by roughly 40 percent compared to authoring both paths manually.
In ongoing operations, when an API integration breaks, the agent drafts the proposed code change against the new endpoint and attaches test fixtures generated from the last successful response. The engineer reviews and merges. We have measured this: integration fixes now ship three to four times faster than they did before the agent was part of the loop, with regression rates slightly lower than for fully human-authored changes.
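To illustrate the fixture idea, here is a hedged sketch of what such a regression test might look like. mapVendorResponse and the fixture path are invented names for this example; the real fixtures are generated by the agent from the last successful vendor response.

```typescript
import { strict as assert } from "node:assert";
import { test } from "node:test";
import { readFileSync } from "node:fs";

// Hypothetical adapter under test: converts the vendor's
// replacement-endpoint payload into our internal ProvisionResult type.
import { mapVendorResponse } from "../src/vendorAdapter";

test("replacement endpoint maps to the same typed output as the old one", () => {
  // Fixture captured from the last successful response of the old
  // endpoint, pinning the contract the rest of the pipeline depends on.
  const lastGood = JSON.parse(
    readFileSync("fixtures/vendor-accounts-last-success.json", "utf8"),
  );
  const result = mapVendorResponse(lastGood);
  assert.equal(result.status, "provisioned");
  assert.ok(result.accountId.length > 0);
});
```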
The coding agent is not autonomous. It drafts. A human reviews and ships. That is the part people miss about why it works at a bank.
We would invest in the hybrid pattern at integration number one, not integration number two hundred. Retrofitting the pattern onto existing integrations was the slowest part of the project.
We would have instrumented the telemetry identically from day one. Our API and UI paths had different telemetry shapes initially, which made the circuit breaker harder to reason about than it needed to be. Unified telemetry is a small decision that pays back forever.
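"Identical telemetry" in practice means one event shape emitted by both paths, with the serving path recorded as a field rather than implied by the event's structure. The field names below are illustrative.

```typescript
// One telemetry event type for both paths. Which path served the request
// is just a field, so dashboards, alerts, and the breaker read one stream.
type Path = "api" | "ui";

interface IntegrationEvent {
  integration: string;   // e.g. "vendor-accounts"
  path: Path;            // which path served this attempt
  outcome: "success" | "failure";
  latencyMs: number;
  statusCode?: number;   // present on API attempts
  attemptedAt: string;   // ISO-8601 timestamp
}

function emit(event: IntegrationEvent): void {
  // Stand-in for the real telemetry sink.
  console.log(JSON.stringify(event));
}

// Both paths call the same emitter with the same shape:
emit({
  integration: "vendor-accounts",
  path: "api",
  outcome: "failure",
  latencyMs: 412,
  statusCode: 410,
  attemptedAt: new Date().toISOString(),
});
```

Because everything consumes the same stream, a path switch shows up as a change in one field value instead of a different event type.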
We would have trained our HR business partners on the new tool during the build, not after. We shipped, then trained. The first month of live running had a higher-than-necessary rate of HR asking "is this broken?" when the system had actually done exactly what it was supposed to do.
Our head of Automation COE wrote a more technical post about the pattern itself, "When your vendor's API goes dark." If you are curious about how the orchestration layer above these integrations works, this post from Rohan covers that.
We are hiring across Platform Engineering and the Automation COE. Come compound with us.