The following article was adapted from a STACK-X Webinar conducted by the National Digital Identity Team at GovTech.
This is the second part of a two-part series that focuses on the NDI tech stack. You can read our first part here if you missed it.
Welcome back to the second part of our series on the NDI tech stack. In this article, we’ll be looking at the last three areas of focus:
Availability — How do we minimise scheduled downtimes and ensure that services will continue to work during deployment?
Repeatability — How do we ensure the consistency of all our environments across all our deployments?
Visibility — How can we gain useful metrics and insights to foresee and prevent potential problems, as well as better manage our system?
Let’s get straight into it, shall we?
Traditionally, availability is measured by subtracting unplanned downtime from the total time that the service is expected to run. However, this ignores scheduled downtime, which is still downtime as far as end users are concerned. How do we avoid this?
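To make the difference concrete, here is a minimal Python sketch that computes availability with and without scheduled downtime. The figures (a 30-day month, one hour of unplanned downtime, four hours of scheduled maintenance) are hypothetical and are not NDI's actual numbers.

```python
def availability(period_hours: float, unplanned_hours: float,
                 scheduled_hours: float = 0.0) -> float:
    """Return availability as a percentage of the measurement period."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    downtime = unplanned_hours + scheduled_hours
    if downtime < 0 or downtime > period_hours:
        raise ValueError("downtime must be between 0 and period_hours")
    return 100.0 * (period_hours - downtime) / period_hours


# Hypothetical month: 30 days, 1 hour unplanned, 4 hours scheduled maintenance.
period = 30 * 24
traditional = availability(period, unplanned_hours=1.0)
user_perceived = availability(period, unplanned_hours=1.0, scheduled_hours=4.0)

print(f"Ignoring scheduled downtime:  {traditional:.3f}%")
print(f"Counting scheduled downtime: {user_perceived:.3f}%")
```

The second figure is always the lower of the two, and it is the one that end users actually experience.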
NDI uses a combination of approaches to achieve zero-downtime deployment. These approaches include rolling updates, blue-green deployment, and canary deployment. Let’s look at how we carry out blue-green deployment.
As a prerequisite, all NDI infrastructure setups are driven by Infrastructure-as-Code (IAC). IAC automation is built into the CI/CD pipeline so that deployments are tested across all environments before they reach production. This is reflected in the flow on the right of the diagram.
On the left, we have the blue and green zones. The blue zone represents the existing workload running the old version. Up top, you can see that we rely on AWS Route 53, a cloud-native and extremely reliable DNS service. This is where we map the external Fully Qualified Domain Name (FQDN), id.singpass.gov.sg, to the internal-facing FQDN, which is provided to the blue and green zones.
We use IAC Deployment (IACD) to roll out new releases into the green zone through the CI/CD pipeline. Sanity checks are conducted too.
Once all of this is done and we have our Version 2, we initiate another IACD script to swap the internal FQDN mapping, redirecting traffic from the blue zone to the green zone. If any abnormalities are detected, we can simply swap the URL back to the blue zone and investigate the errors. For end users, the service continues to run smoothly.
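The swap-and-rollback logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DnsTable` is an in-memory stand-in rather than a real Route 53 client, the hostnames are hypothetical, and the two check functions are placeholders for whatever sanity checks the pipeline actually runs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DnsTable:
    """In-memory stand-in for a DNS service (not a real Route 53 client)."""
    records: Dict[str, str] = field(default_factory=dict)

    def point(self, name: str, target: str) -> None:
        self.records[name] = target

    def resolve(self, name: str) -> str:
        return self.records[name]


def cut_over(dns: DnsTable, internal_name: str, blue: str, green: str,
             sanity_check: Callable[[str], bool],
             post_swap_check: Callable[[str], bool]) -> str:
    """Move traffic to green, swap back to blue on failure.

    Returns the zone that is receiving traffic when the function ends.
    """
    if not sanity_check(green):
        # Green is not ready, so no traffic is moved.
        return dns.resolve(internal_name)
    dns.point(internal_name, green)
    if not post_swap_check(green):
        # Abnormality detected after the swap: point back to blue.
        dns.point(internal_name, blue)
    return dns.resolve(internal_name)


# Hypothetical internal hostnames, for illustration only.
INTERNAL = "app.internal.example"
BLUE = "blue.internal.example"
GREEN = "green.internal.example"

dns = DnsTable()
dns.point(INTERNAL, BLUE)
live = cut_over(dns, INTERNAL, BLUE, GREEN,
                sanity_check=lambda zone: True,
                post_swap_check=lambda zone: True)
print(f"Traffic now goes to: {live}")
```

Because only the DNS mapping changes, the blue zone stays untouched and remains available as an immediate fallback.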
Many outages are caused by human errors, misconfigurations or a lack of comprehensive testing.
We therefore aimed to create a repeatable and reliable infra build by following a continuous set of practices that minimise the risk of issues and failures, allowing us to deliver a robust infrastructure through an automated, secure, and efficient process.
There are several challenges that we face when we manage infrastructure.
Configuration drift: You start with identical servers, but untracked changes over time lead to different configurations.
Fear of changes: Developers become afraid to make any changes for fear of breaking the environment.
Human errors: More manual processes make the infrastructure more vulnerable to human errors.
Time to market: Time that’s supposed to be spent getting products to market ends up being wasted on troubleshooting.
Possible downtime: Due to the accumulation of existing and potential future issues, the service is brought offline entirely for troubleshooting.
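Configuration drift, the first challenge in the list above, is easy to picture as a comparison between the configuration you intended and the one a server actually has. The Python sketch below is a simplified illustration; the setting names and values are hypothetical.

```python
from typing import Any, Dict, Tuple


def find_drift(desired: Dict[str, Any],
               actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (desired, actual)} for every setting that differs.

    A setting missing on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        want = desired.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical settings: the server was changed by hand after deployment.
desired_config = {"tls_version": "1.2", "port": 443}
server_config = {"tls_version": "1.0", "port": 443, "debug": True}

for setting, (want, have) in find_drift(desired_config, server_config).items():
    print(f"{setting}: expected {want!r}, found {have!r}")
```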
NDI uses IAC to eliminate these issues. With IAC, any action that can be automated is automated, as we can write code to deploy infrastructure components quickly and consistently. This makes the deployment process automated, repeatable, predictable and, most importantly, far less prone to error.
Since everything is captured in code, we can repeatedly build multiple identical environments from the same code base. We can also deploy automated tests to check for errors, ensuring that changes will be implemented reliably and safely.
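As a rough illustration of building identical environments from one code base, the Python sketch below renders two environments from a single definition and then runs an automated check that they have the same shape. It is a toy model and not Terraform; the resource types and naming scheme are hypothetical.

```python
from typing import Dict, List


def render_environment(name: str, instance_count: int) -> List[Dict[str, str]]:
    """Build the resource list for one environment from a single definition."""
    resources = [{"type": "network", "id": f"{name}-vpc"}]
    for index in range(instance_count):
        resources.append({"type": "server", "id": f"{name}-app-{index}"})
    return resources


def shape(resources: List[Dict[str, str]]) -> List[str]:
    """Reduce an environment to its resource types, ignoring names."""
    return sorted(resource["type"] for resource in resources)


staging = render_environment("staging", instance_count=2)
production = render_environment("production", instance_count=2)

# Automated check: both environments must contain the same kinds of resources.
assert shape(staging) == shape(production)
print("staging and production have the same shape")
```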
With the right process built around IAC, issues or failures can be minimised, giving developers the confidence to make changes without fear of errors.
The diagram above shows NDI’s approach to IAC. Here’s what happens at each level:
Level 1: Base Infrastructure
For the first level, we chose to use Terraform to build up a service catalogue of secure and trusted modules. We did not want NDI teams to build their own modules with different code styles and standards, which is why we created this first level as a central repository for teams to pull, contribute and reuse modules. This way, we can ensure that quality, consistency and repeatability are all built into a single source, accelerating our teams’ assembly of projects.
We first identified the resources for our infrastructure, then used Terraform to build them into generic and reusable building blocks called modules. Each module is thoroughly tested and documented before it is versioned and released as an immutable artifact to be promoted through environments.
Each module is designed to be as small as possible so that testing is efficient. With modules, we can also enforce organisation or project requirements.
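The idea of versioned, immutable module releases can be sketched as a small catalogue that refuses to overwrite a version once it has been published. This is a conceptual Python model only, not how Terraform or any module registry is implemented; the module name and checksum values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ModuleRelease:
    """One released module version; frozen so it cannot be modified."""
    name: str
    version: str
    checksum: str


class Catalogue:
    """Central store of releases that never overwrites a published version."""

    def __init__(self) -> None:
        self._releases: Dict[Tuple[str, str], ModuleRelease] = {}

    def publish(self, release: ModuleRelease) -> None:
        key = (release.name, release.version)
        if key in self._releases:
            raise ValueError(f"{release.name} {release.version} already published")
        self._releases[key] = release

    def get(self, name: str, version: str) -> ModuleRelease:
        return self._releases[(name, version)]


catalogue = Catalogue()
catalogue.publish(ModuleRelease("vpc", "1.0.0", checksum="abc123"))
catalogue.publish(ModuleRelease("vpc", "1.1.0", checksum="def456"))

# Every environment that pins vpc 1.0.0 gets exactly the same artifact.
print(catalogue.get("vpc", "1.0.0"))
```

A change to a module therefore always means a new version, which is what allows the same artifact to be promoted from one environment to the next.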
Next, we use a tool called Packer to efficiently build machine images with security and compliance in mind. This ensures that operating systems are patched and hardened according to CIS guidelines, and the resulting images become the gold standard for our projects.
We also have built-in tests for the images before we tag them as safe-to-use and share them across NDI projects. Furthermore, we pre-bake support for CloudWatch, EPP and EPDR into our images.
Once these images are rolled out, we continuously monitor them for vulnerabilities and rebuild them when required.
Level 2: Stack
In this level, different teams come into the picture and use Terraform to assemble their own stacks using modules from the first level.
Teams sometimes skip past the first level and jump straight into building their own infrastructure, leading to code that isn’t reusable for other teams. It is therefore important to enforce the reuse of Level 1 modules when teams build up their stacks.
In the diagram, you can see how we assemble our stacks into different groups of Virtual Private Clouds (VPCs). Infrastructure is treated as exclusive to the underlying stack, helping us to better control and audit the changes going in and out of the stacks.
Do note that it is important to plan ahead for your directory structure and file layout early in the development life cycle.
Level 3: App Infrastructure
In this level, we deploy applications — for security, observability and resiliency purposes — into Kubernetes to support NDI apps.
The more objects we deploy into Kubernetes, the more complicated it becomes to manage the life cycles of these objects, as we have to continually track them with each upgrade. That’s why we chose Helm to help us manage Kubernetes more consistently and efficiently.
Helm helps to package everything into one single application and clearly displays what’s configurable, making application deployment easy, standardised, and reusable. This provides a consistent workflow through which we can quickly define, manage, and deploy applications, improving developer productivity and reducing deployment complexity.
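A Helm chart ships with default values, and anything configurable can be overridden at install time. The Python sketch below mimics that idea with a recursive merge of overrides onto defaults. It is a simplified illustration and not Helm's own implementation; the value names are hypothetical.

```python
import copy
from typing import Any, Dict


def merge_values(defaults: Dict[str, Any],
                 overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults with overrides applied, merging nested maps key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical chart defaults and a per-environment override.
chart_defaults = {
    "replicaCount": 2,
    "image": {"repository": "registry.example/app", "tag": "1.0.0"},
}
production_overrides = {"replicaCount": 4, "image": {"tag": "1.2.0"}}

print(merge_values(chart_defaults, production_overrides))
```

Only the overridden keys change; everything else keeps the packaged default, which is what keeps deployments standardised across teams.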
Level 4: App Layer
In this final level, we install NDI applications into Kubernetes. The apps are designed to be stateless and are responsible for serving application traffic.
Once again, we use Helm here to efficiently manage the deployment of these applications and ensure that they’re configured with security and resiliency in mind.
System failures are a dime a dozen in our industry. They happen for many reasons, but usually multiple small failures add up to a larger issue until the system collapses. Detecting anomalies early can prevent these collapses.
That’s why full visibility of the different parts of our system is so important.
How do we know that each component is healthy?
We’ve put in place a monitoring stack, shown above, which serves as a shared monitoring service for multiple NDI products. The stack collects hundreds of different metrics from different components, be it infra, apps or AWS, and all metrics are presented on a dashboard that allows our ops teams to perform analyses, spot usage patterns, as well as detect and prevent anomalies.
We also collect system and application logs, which are piped into a central logging service. This allows us to access and analyse the logs easily during troubleshooting and correlate them when there are issues involving multiple products.
Finally, we put in place a set of actionable alerts that allow us to act on issues fast and provide early warnings if necessary. Upon receiving alerts, the ops teams can click on them and be redirected to a specific monitoring dashboard. For some critical events, an incident ticket is even automatically created.
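The alert flow described above can be sketched as follows: a metric is compared against thresholds, the resulting alert carries a link to its dashboard, and critical alerts automatically get an incident ticket. This is a minimal Python illustration; the thresholds, the dashboard URL and the ticket ID format are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    metric: str
    value: float
    severity: str            # "warning" or "critical"
    dashboard_url: str       # where the ops team lands after clicking
    ticket_id: Optional[str] = None


def evaluate(metric: str, value: float, warning_at: float,
             critical_at: float, dashboard_url: str) -> Optional[Alert]:
    """Return an alert if the value crosses a threshold, else None."""
    if value >= critical_at:
        severity = "critical"
    elif value >= warning_at:
        severity = "warning"
    else:
        return None
    return Alert(metric, value, severity, dashboard_url)


def route(alert: Alert, tickets: List[str]) -> Alert:
    """Open an incident ticket automatically for critical alerts only."""
    if alert.severity == "critical":
        alert.ticket_id = f"INC-{len(tickets) + 1:04d}"
        tickets.append(alert.ticket_id)
    return alert


open_tickets: List[str] = []
alert = evaluate("error_rate", 0.20, warning_at=0.05, critical_at=0.10,
                 dashboard_url="https://grafana.example/d/errors")
if alert is not None:
    route(alert, open_tickets)
    print(alert)
```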
Above, you can see the set of practices we defined to achieve the reliability we desire.
The primary focus is the health and uptime of each NDI service as users experience it, in other words the user-perceived uptime.
This diagram starts off on the yellow quadrant, where we first identify how the user interacts with the system. For example, during authentication, the user will scan a QR code and be prompted to provide consent. Once the verification request is signed, it’ll be verified by an authentication service provider and certificate authority. All of these touch points and steps are mapped into a service journey (and possible sub-journeys). This means identifying the relevant endpoints and all critical service dependencies involved in the user interaction. These dependencies could be databases, Message Queues (MQ) or other AWS services.
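One way to picture a service journey is as a small tree: each journey lists its endpoints and dependencies, and may contain sub-journeys. The Python sketch below models that structure and collects every dependency a journey relies on. The journey names, endpoint paths and dependency names are hypothetical and only loosely follow the QR authentication example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ServiceJourney:
    name: str
    endpoints: List[str] = field(default_factory=list)
    dependencies: List[str] = field(default_factory=list)
    sub_journeys: List["ServiceJourney"] = field(default_factory=list)


def all_dependencies(journey: ServiceJourney) -> Set[str]:
    """Collect the dependencies of a journey and all of its sub-journeys."""
    found = set(journey.dependencies)
    for sub in journey.sub_journeys:
        found |= all_dependencies(sub)
    return found


# Hypothetical mapping of a QR authentication journey.
qr_auth = ServiceJourney(
    name="QR authentication",
    sub_journeys=[
        ServiceJourney("Scan QR code", endpoints=["/qr/scan"],
                       dependencies=["database"]),
        ServiceJourney("Verify signed request", endpoints=["/auth/verify"],
                       dependencies=["certificate authority", "message queue"]),
    ],
)

print(sorted(all_dependencies(qr_auth)))
```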
Once the service journey is identified, we move on to the next step. Here we identify the Service Level Indicator (SLI) and Service Level Objective (SLO) for each of the service journeys and translate them into monitoring dashboards and alerts. This, of course, isn’t a one-time activity — we have to continuously monitor and fine-tune our SLIs and SLOs.
Thirdly, we review our processes regularly and identify any opportunities for automation. For example, we automated our Grafana dashboard pre-deployment verification by creating a Python script, and automated incident ticket creation for critical events. This improves the reliability of our services by reducing the possibility of human error, and cuts the unnecessary effort needed for manual processes.
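The source does not describe what the verification script checks, so the following Python sketch only illustrates what a pre-deployment check on a dashboard's JSON definition might look like. The specific rules (a title is present, at least one panel exists, every non-row panel has a query target) are assumptions made for this example.

```python
import json
from typing import List


def verify_dashboard(raw_json: str) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    try:
        dashboard = json.loads(raw_json)
    except json.JSONDecodeError as error:
        return [f"invalid JSON: {error}"]
    if not isinstance(dashboard, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if not dashboard.get("title"):
        problems.append("dashboard has no title")
    panels = dashboard.get("panels") or []
    if not panels:
        problems.append("dashboard has no panels")
    for panel in panels:
        label = panel.get("title") or f"panel id {panel.get('id')}"
        if panel.get("type") == "row":
            continue  # rows only group other panels and carry no queries
        if not panel.get("targets"):
            problems.append(f"{label}: no query targets")
    return problems


# Hypothetical dashboard definition with one incomplete panel.
sample = json.dumps({
    "title": "Service Journey View",
    "panels": [
        {"id": 1, "type": "row", "title": "Common Stack"},
        {"id": 2, "type": "graph", "title": "Error rate",
         "targets": [{"expr": "error_rate"}]},
        {"id": 3, "type": "graph", "title": "Latency", "targets": []},
    ],
})

for problem in verify_dashboard(sample):
    print(problem)
```

Running a check like this in the pipeline catches an incomplete dashboard before it is deployed.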
Finally, it’s also our practice to delegate and test our service reliability periodically. In fact, one fun way in which we do this is through game day, where one team triggers failures in the system and another has to resolve incidents. This helps us to identify gaps in our monitoring, alerts, and our operations processes.
Next, we’ll go into further detail about the first two steps.
Identifying Critical Service Journeys
The above diagram reflects how we broke down the service journey into two layers.
The first layer, or Level 1, is the Service Journey View, a bird’s-eye view of all our services related to the specific NDI project. At this level, we can quickly spot which service is having issues. If it’s a critical component we depend on, it gets reflected immediately on the Level 1 dashboard view.
Once we’ve identified the issue, the ops team can drill deeper into Level 2, the Service Component View (reflected by the three service journey branches in the diagram). This view provides a detailed view of the relevant components, including:
- Service Endpoint
- External API Performance
- External Service Connectivity
- HTTP Response Errors
- AWS Services
- NDI Core Infrastructure
The ops teams will be able to see which component is causing the issue and continue to do in-depth troubleshooting by referring to logs from specific components.
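The relationship between the two layers can be sketched as a roll-up: the Level 1 status of a journey is the worst status among its Level 2 components, and drilling down lists the components that are not healthy. The Python below is a simplified model; the three status names are assumptions, while the component names come from the list above.

```python
from typing import Dict, List

# Assumed status levels, ordered from best to worst.
SEVERITY = {"ok": 0, "degraded": 1, "down": 2}


def journey_status(components: Dict[str, str]) -> str:
    """Level 1 view: the worst status among the journey's components."""
    return max(components.values(), key=SEVERITY.__getitem__, default="ok")


def failing_components(components: Dict[str, str]) -> List[str]:
    """Level 2 view: the components that need investigation."""
    return sorted(name for name, status in components.items() if status != "ok")


# Hypothetical snapshot of one service journey.
snapshot = {
    "Service Endpoint": "ok",
    "External API Performance": "degraded",
    "External Service Connectivity": "down",
    "HTTP Response Errors": "ok",
    "AWS Services": "ok",
    "NDI Core Infrastructure": "ok",
}

print(journey_status(snapshot))
print(failing_components(snapshot))
```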
Defining, Monitoring and Fine-tuning SLIs and SLOs
In this step, we have various teams — infra, application and ops — working together to define the SLIs and SLOs. The diagram shows each step of this process (from top to bottom).
Let’s use a QR authentication service accessible through SingPass Mobile as an example. We define a specific SLI that we want to track, which in this case could be the latency of the QR authentication (how many seconds it takes to complete). From there, we decide how we’re going to measure it and which endpoints are involved in this service journey. We then need to track the amount of time required to complete each of these activities.
Then, we define our SLO. As an example, we decided that 99% of our authentications need to be completed within 5 seconds. We then develop monitoring dashboards to track these SLIs and SLOs. From there, we constantly monitor the alerts and thresholds that we’ve defined, and refine our baselines and SLOs regularly.
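The example SLO above (99% of authentications completed within 5 seconds) translates directly into a small calculation. The Python sketch below computes the SLI from a list of latency samples and checks it against the SLO; the sample data is made up for illustration.

```python
from typing import Sequence


def sli_within_threshold(latencies_s: Sequence[float],
                         threshold_s: float) -> float:
    """SLI: percentage of authentications completed within the threshold."""
    if not latencies_s:
        raise ValueError("at least one latency sample is required")
    good = sum(1 for latency in latencies_s if latency <= threshold_s)
    return 100.0 * good / len(latencies_s)


def meets_slo(latencies_s: Sequence[float], threshold_s: float = 5.0,
              target_pct: float = 99.0) -> bool:
    """SLO check: is the SLI at or above the target percentage?"""
    return sli_within_threshold(latencies_s, threshold_s) >= target_pct


# Made-up samples: 99 fast authentications and one slow one.
samples = [1.2] * 99 + [7.5]

print(f"SLI: {sli_within_threshold(samples, 5.0):.1f}% within 5 seconds")
print(f"SLO met: {meets_slo(samples)}")
```

Tracking the SLI continuously, rather than only checking pass or fail, shows how close the service is to breaching its objective.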
Below, we’ll show you how we’ve translated our earlier concepts into Grafana dashboards.
In this first level view (the Service Journey View, as you’ll recall), the ops teams will be able to immediately look at everything related to a specific NDI product. The critical infrastructure components managed by the NDI team are all shown at the top under the Common Stack category. We also have AWS Services, followed by the service journeys we’ve identified at the bottom. Each of these metrics is aggregated and calculated from the other metrics it depends on, such as database metrics or the error rate within that particular service journey.
When the ops teams click on links to the sub-journeys (under Services), they’re directed to our level 2 dashboard shown below.
In this level, the Service Component View, the ops team can look at the details and dependencies related to each sub-journey. Displayed above are port health and external communication. If one of our products is making a call to an external system such as a certificate authority, we can track the health of the communication. We can also track other relevant metrics important to the service journey, such as request rate, error rate, latency and uptime.
The ops team can then view each individual component in even more detail, as shown here. For example, if any connection issues occur, they can immediately take action and investigate the problem.
What we’ve shown you is by no means the end of the monitoring stack implementation. We view this as a journey, and we’re continuing to enhance our monitoring stack as business requirements change and technology evolves.
With that, we’ve come to the end of this two-part look at the NDI tech stack. We hope you’ve found this useful. Thanks for reading!
If you’re a potential partner who’s keen to integrate NDI into your services, products or platform, you can visit the GovTech Developer Portal for more information. The portal contains simple onboarding instructions, APIs and technical documentation, as well as quick access to development sandboxes.
For more information on other government developed products, please visit developer.tech.gov.sg.
Authors: GovTech’s NDI Team (Donald Ong, Dickson Chu, Wongso Wijaya) and Technology Management Office (Michael Tan)
Originally published here on NDI.sg.
For more resources, visit NDI.sg.