Re-architect JKJAVMY AstraZeneca vaccine appointment system
An alternative approach to the appointment system
On 26 May 2021, the third round of AstraZeneca vaccine registration left many people frustrated and angry. The system failed to handle the traffic spike.
Here is my approach to the appointment system architecture.
I am making assumptions about the project requirements.
Because the system only serves a temporary purpose, I would prefer to keep it simple and use managed services whenever possible. I am demonstrating with Google Cloud because I am familiar with it.
Overview of the architecture
Deploy as multiple services
There are 2 main actions in the appointment system: getting the list of available slots and submitting the booking request.
Each service serves only one API while sharing the same code base. This allows the services to scale independently.
- The slot service is responsible for getting available slots from the cache DB.
- The sync service is responsible for syncing available slots from the main DB to the cache DB every 5 seconds.
- The submit service is only responsible for processing appointment requests.
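One way to run three services from a single code base is to let each deployment pick its role from configuration. The sketch below is only an illustration of that idea: the `SERVICE_ROLE` environment variable, the handler names, and the returned payloads are all hypothetical and not taken from the real system.

```python
import os

# Hypothetical handlers: one per service described above.
def get_slots(request):
    return {"service": "slot", "action": "list available slots"}

def sync_slots(request):
    return {"service": "sync", "action": "copy slot counts to the cache DB"}

def submit_booking(request):
    return {"service": "submit", "action": "process appointment request"}

# The shared code base knows every handler...
HANDLERS = {"slot": get_slots, "sync": sync_slots, "submit": submit_booking}

def build_app(role):
    """Return the single handler that this deployment should expose."""
    try:
        return HANDLERS[role]
    except KeyError:
        raise ValueError(f"unknown service role: {role!r}")

def handler_from_env(environ=os.environ):
    """...but each deployment exposes only the one named in SERVICE_ROLE (assumed name)."""
    return build_app(environ.get("SERVICE_ROLE", "slot"))
```

Deploying the same container image three times with a different `SERVICE_ROLE` value would then give three services that can be scaled separately.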
Use cache database
Since we expect more than 500K requests in the first minute, we cannot use the main database to serve the list of available slots. We need an alternative (cache) database for fast reads. I would choose Cloud Firestore for this because of its scalability and real-time capability. It can push real-time updates to the appointment site, so users do not have to keep refreshing the page.
To keep the cache DB up to date, we need to sync updates from the main DB every 5 seconds.
For the API server, I think any server-side language would do. For deployment, I would choose a container solution, either Cloud Run or Kubernetes Engine, because we can control the resources requested for the containers.
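The periodic sync can be sketched as a small loop. This is a simplified illustration that uses plain dictionaries in place of the main DB and the cache DB; the field names (`capacity`, `booked`, `available`) and the centre ID are assumptions made for the example.

```python
import time

def sync_once(main_db, cache_db):
    """Copy the remaining-slot count of every centre from the main DB to the cache."""
    for centre_id, row in main_db.items():
        remaining = max(row["capacity"] - row["booked"], 0)
        cache_db[centre_id] = {"available": remaining}
    # Remove cache entries for centres that no longer exist in the main DB.
    for centre_id in list(cache_db):
        if centre_id not in main_db:
            del cache_db[centre_id]

def run_sync(main_db, cache_db, interval_seconds=5.0, rounds=None, sleep=time.sleep):
    """Run sync_once every interval_seconds; rounds=None means run forever."""
    done = 0
    while rounds is None or done < rounds:
        sync_once(main_db, cache_db)
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_seconds)
```

Because readers only ever hit the cache, the main DB sees one read pass every 5 seconds no matter how many users are looking at the slot list.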
With Cloud Run, for example, we can allocate more resources to the service that needs more processing power (the submit service). The same can be done on Kubernetes by setting resource requests and limits per container.
We also need to scale up beforehand to avoid cold starts. On Cloud Run we can set a minimum number of instances.
On Kubernetes we just have to increase the number of replicas.
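How many instances or replicas to pre-warm is a simple calculation once the expected peak is known. The sketch below is only a rough estimate: the per-instance throughput and the headroom factor are assumptions that would have to come from load testing the real services.

```python
import math

def instances_needed(peak_rps, rps_per_instance, headroom=1.5):
    """Estimate how many instances to start before registration opens.

    peak_rps:         expected peak requests per second
    rps_per_instance: requests per second one instance can handle (assumed, measure it)
    headroom:         safety multiplier on top of the expected peak (assumed)
    """
    if peak_rps < 0 or rps_per_instance <= 0 or headroom <= 0:
        raise ValueError("inputs must be positive")
    return math.ceil(peak_rps * headroom / rps_per_instance)
```

For example, `instances_needed(1000, 100)` gives 15, which would be the value to use for the minimum instance count or the replica count.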
On the other hand, when the main DB updates the slot counter, use the database's atomic increment field method instead of reading the value and writing it back.
Or use the $inc increment operator if using MongoDB.
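To show why an atomic increment matters, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for the main DB. It is not Firestore or MongoDB code; the table layout, the `centre-a` ID, and the capacity check in the `WHERE` clause are my own assumptions for the example.

```python
import sqlite3

def make_db():
    """Create an in-memory stand-in for the main DB with one centre of capacity 2."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE slots ("
        "id TEXT PRIMARY KEY, "
        "capacity INTEGER NOT NULL, "
        "booked INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT INTO slots (id, capacity) VALUES ('centre-a', 2)")
    conn.commit()
    return conn

def book_slot(conn, slot_id):
    """Claim one place with a single atomic UPDATE; return True if it succeeded."""
    # The database increments the counter itself, so two concurrent requests
    # cannot both read the same old value and overwrite each other.
    cur = conn.execute(
        "UPDATE slots SET booked = booked + 1 WHERE id = ? AND booked < capacity",
        (slot_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

With a capacity of 2, the third call to `book_slot` returns `False` because the update matches no row, so the centre cannot be overbooked.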
Overview architecture diagram
What else can be improved
Please remove the console.log calls, and give more visual feedback in the UI.
I am not an expert in CDNs, but as far as I know, caching can go very badly wrong. I would avoid using a CDN if I were not sure how to flush the cache in an emergency.
Here is some insight regarding CDNs.
Run more tests
This was not an unusual traffic spike. Heavy traffic was expected at that time. Replicate the whole infrastructure setup and run load tests against it. Make sure the average requests per second (RPS) reaches at least 1 million (assuming more than 500K requests arrive as soon as slots become available), with an average latency of less than 300ms.
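A load test result can be checked against these targets with a few lines of code. The sketch below assumes the load testing tool can export one latency value in milliseconds per completed request; the function names and the way the thresholds are passed in are illustrative only.

```python
def summarize(latencies_ms, duration_seconds):
    """Return (average RPS, average latency in ms) for one load test run."""
    if not latencies_ms or duration_seconds <= 0:
        raise ValueError("need at least one sample and a positive duration")
    rps = len(latencies_ms) / duration_seconds
    avg_latency_ms = sum(latencies_ms) / len(latencies_ms)
    return rps, avg_latency_ms

def meets_target(latencies_ms, duration_seconds, min_rps, max_avg_latency_ms=300.0):
    """True if the run reached min_rps with an average latency below the limit."""
    rps, avg_latency_ms = summarize(latencies_ms, duration_seconds)
    return rps >= min_rps and avg_latency_ms < max_avg_latency_ms

# For reference: 500K requests spread evenly over the first minute average out to
baseline_rps = 500_000 / 60  # roughly 8,333 requests per second
```

The `min_rps` threshold is left as a parameter so the same check can be run against the assumed baseline and against a much higher target that leaves room for bursts.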
This may not be the most suitable architecture design, but I do hope the team behind the system takes it seriously and rethinks the current approach, as vaccination is really important for all of us. I take this as a learning lesson, and would love to hear feedback on this architecture, whether good or bad.