In the early morning of Feb 18, 2025 (Pacific Time), internal monitoring tools alerted Lever product engineers that a database instance was unavailable. Around 50% of customers may have initially experienced higher latency on Lever Hire and the Lever API.
A few hours later, compounding issues in the database replicas made Lever Hire inaccessible for those ~50% of customers. The impacted customer accounts could not access Lever Hire or candidate-related Data API endpoints at all on Feb 18 from 5:55-10:05 PST (the longest outage, ~4 hours), 14:08-14:18, 14:48-14:52, and 21:52-22:09, and on Feb 19 from 00:06-00:14. Lever-hosted job sites continued to work for all customers.
Lever product engineers were engaged to investigate, and troubleshooting pointed to database issues caused by unusual external Lever API load. The issue was resolved by:
Rebuilding the affected database replicas
Spreading Lever API load across additional database replicas
As part of the database recovery and mitigation measures, the Lever API also became inaccessible for all customers for a few hours. The database rebuild also caused some Lever Hire pipeline numbers and search results to be temporarily out of sync.
To reduce the risk of a similar incident in the future, the following measures have been put in place:
As noted above, spreading Lever API load across additional database replicas
Limiting database time for individual Lever API requests, to prevent a few individual requests from having a wider impact
Optimizing database query performance
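As an illustration of the second measure above, the following is a minimal sketch of how a per-request database time limit could work in principle. It is not Lever's implementation: the names `DbTimeBudget`, `DbTimeBudgetExceeded`, and `limit_seconds` are hypothetical, and the sketch only refuses to start further queries once a request has used its allowance. Interrupting a query that is already running would typically rely on a timeout enforced by the database itself.

```python
import time


class DbTimeBudgetExceeded(Exception):
    """Raised when one request has used up its database time allowance."""


class DbTimeBudget:
    """Tracks cumulative database time spent on behalf of a single API request.

    Hypothetical helper for illustration only; not part of any Lever API.
    """

    def __init__(self, limit_seconds, clock=time.monotonic):
        self.limit_seconds = limit_seconds
        self.spent_seconds = 0.0
        self._clock = clock  # injectable so the budget can be tested without sleeping

    def run(self, query_fn):
        # Refuse to start another query once the allowance is used up.
        if self.spent_seconds >= self.limit_seconds:
            raise DbTimeBudgetExceeded(
                f"request used {self.spent_seconds:.2f}s of database time "
                f"(limit {self.limit_seconds:.2f}s)"
            )
        started = self._clock()
        try:
            return query_fn()
        finally:
            # Charge the elapsed time even if the query raised an error.
            self.spent_seconds += self._clock() - started
```

An API handler would create one budget per incoming request and route each of its queries through `budget.run(...)`, so a single expensive request is cut off before it can occupy a replica for long.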