Cloud Zone

Re-architect JKJAVMY AstraZeneca vaccine appointment system

An alternative approach to the appointment system

On 26 May 2021, the third round of AstraZeneca vaccine registration left many people frustrated and angry. The system failed to handle the traffic spike.

WhatsApp Image 2021-05-26 at 12.29.53 PM.jpeg

Here is my approach to the appointment system architecture.

I am making assumptions about the project requirements.

Since the system only serves a temporary purpose, I would prefer to keep it simple and use managed services whenever possible. I am demonstrating with Google Cloud because I am familiar with it.

Overview of the architecture

Deploy as multiple services

The appointment system has 2 main actions: getting the list of available slots, and submitting the booking request.

Each service serves only one API, while sharing the same code base. This allows the services to scale independently.

  1. The slot service is responsible for getting available slots from the cache DB.
  2. The sync service is responsible for syncing available slots from the main DB to the cache DB, every 5 seconds.
  3. The submit service is only responsible for processing appointment requests.
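
As a rough illustration of one code base serving three roles, the sketch below picks which handler a container exposes from an environment variable. The variable name SERVICE_ROLE, the handler names, and the returned payloads are all hypothetical; real routing would live in whichever web framework is used.

```python
import os

# Hypothetical handlers; each deployed service exposes exactly one of them.
def get_slots(payload):
    """Slot service: would read available slots from the cache DB."""
    return {"service": "slot", "slots": []}

def sync_slots(payload):
    """Sync service: would copy available slots from the main DB to the cache DB."""
    return {"service": "sync", "synced": True}

def submit_booking(payload):
    """Submit service: would process one appointment request."""
    return {"service": "submit", "accepted": True}

HANDLERS = {"slot": get_slots, "sync": sync_slots, "submit": submit_booking}

def select_handler(role=None):
    """Return the single handler this instance should serve.

    The role comes from the argument, or from the (assumed) SERVICE_ROLE
    environment variable set per deployment.
    """
    role = role or os.environ.get("SERVICE_ROLE", "slot")
    if role not in HANDLERS:
        raise ValueError(f"unknown SERVICE_ROLE: {role!r}")
    return HANDLERS[role]
```

With this shape, the same container image can be deployed three times with a different SERVICE_ROLE value, and each deployment can be given its own resources and scaling settings.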


Use cache database

Since we expect more than 500K requests in the first minute, we cannot use the main database to serve the list of available slots. We need an alternative (cache) database for fast reads. I would choose Cloud Firestore for this because of its scalability and real-time capability. It can push real-time updates to the appointment site, so users do not have to keep refreshing the page.

To keep the cache DB up to date, we would need to sync it from the main DB every 5 seconds.
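
A minimal sketch of that sync loop follows. It is not tied to any real deployment: an in-memory SQLite database stands in for the main DB, a plain dict stands in for the cache (Cloud Firestore in the design above), and the table name, column names, and slot ids are hypothetical.

```python
import sqlite3
import time

def make_demo_db():
    """Build a stand-in main DB with a hypothetical 'slots' table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE slots (id TEXT PRIMARY KEY, capacity INTEGER, booked INTEGER)"
    )
    db.executemany(
        "INSERT INTO slots VALUES (?, ?, ?)",
        [("kl-0900", 100, 40), ("kl-1000", 100, 100)],
    )
    db.commit()
    return db

def sync_once(main_db, cache):
    """Copy the remaining capacity of every slot from the main DB into the cache."""
    rows = main_db.execute("SELECT id, capacity - booked FROM slots").fetchall()
    for slot_id, remaining in rows:
        cache[slot_id] = max(remaining, 0)
    return len(rows)

def run_sync(main_db, cache, interval_seconds=5, rounds=None):
    """Call sync_once every interval_seconds; rounds=None loops forever."""
    done = 0
    while rounds is None or done < rounds:
        sync_once(main_db, cache)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_seconds)
```

The slot service then only ever reads from the cache, so the main DB sees one read query per interval regardless of how many users are loading the page.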

API server

For the API server, I think any server-side language would do. For deployment, I would choose a container solution, either Cloud Run or Kubernetes Engine, because we can control the resources requested for each container.

For example, with Cloud Run we can allocate more resources to the service that needs more processing power (the submit service).

And similarly for Kubernetes:

# example values
resources:
  requests:
    memory: "64Mi"
    cpu: "250m"

We also need to scale up beforehand to avoid cold starts. For Cloud Run, we can specify a minimum number of instances.

For Kubernetes, we just have to increase the replicas.

Database counter

For the main DB, on the other hand, update the counter with an in-place increment of the field:

UPDATE yourTableName
SET yourColumnName = yourColumnName + 1
WHERE condition;

Or use the $inc operator if using MongoDB.
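
To make the increment concrete, here is a small sketch using an in-memory SQLite database as a stand-in for the main DB. The table and column names are hypothetical, and the `booked < capacity` guard is my assumption about what the "condition" might be: it lets the database reject a booking once the slot is full.

```python
import sqlite3

def create_demo_db(capacity):
    """Stand-in main DB with one hypothetical slot of the given capacity."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE slots (id TEXT PRIMARY KEY, capacity INTEGER, booked INTEGER)"
    )
    db.execute("INSERT INTO slots VALUES (?, ?, 0)", ("kl-0900", capacity))
    db.commit()
    return db

def book_slot(db, slot_id):
    """Increment the booked counter in a single UPDATE statement.

    Returns True if a seat was taken, False if the slot is full or unknown.
    The counter is never read into application code and written back.
    """
    cur = db.execute(
        "UPDATE slots SET booked = booked + 1 WHERE id = ? AND booked < capacity",
        (slot_id,),
    )
    db.commit()
    return cur.rowcount == 1
```

Because the check and the increment happen in one statement, two requests arriving at the same time cannot both read the same old value and overwrite each other's update.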

Overview architecture diagram

vaksincovid architecture.png

What else can be improved

Please remove the console.log calls, and provide more visual feedback in the UI.

WhatsApp Image 2021-05-26 at 12.19.13 PM.jpeg

I am not an expert in CDNs, but as far as I know, caching can go very badly wrong. I would avoid using a CDN if I were not sure how to flush the cache in an emergency.

Here is some insight regarding CDNs.

Run more tests

This was not an unusual traffic spike. A large amount of traffic was expected at that time. Replicate the whole infrastructure setup and run load tests against it. Make sure the average requests per second (RPS) reaches at least 1 million (on the assumption that more than 500K requests will arrive as soon as slots become available), with an average latency of less than 300ms.

I recommend benchmark tools like hey, bombardier, k6 or ApacheBench. Run them on a separate virtual machine to get the most accurate results.
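
Whichever tool is used, the pass or fail check against the targets is simple arithmetic. The sketch below assumes the per-request latencies have already been exported from the tool as a list of milliseconds; the function names and the result format are hypothetical, and only the 300ms default comes from the target stated above.

```python
def summarise(latencies_ms, duration_seconds):
    """Compute average RPS and average latency for one load-test run."""
    if not latencies_ms or duration_seconds <= 0:
        raise ValueError("need at least one sample and a positive duration")
    return {
        "rps": len(latencies_ms) / duration_seconds,
        "avg_latency_ms": sum(latencies_ms) / len(latencies_ms),
    }

def meets_targets(summary, min_rps, max_avg_latency_ms=300):
    """True if the run reached min_rps with average latency under the limit."""
    return (
        summary["rps"] >= min_rps
        and summary["avg_latency_ms"] < max_avg_latency_ms
    )
```

Averages can hide slow outliers, so it may also be worth looking at the latency percentiles that most of these tools report.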

Final words

This may not be the best or most suitable architecture design, but I do hope the team behind the system takes it seriously and rethinks the current approach, as vaccination is really important for all of us. I take this as a learning lesson, and would love to hear feedback on this architecture, good or bad.

 