The following article was adapted from a STACK-X Webinar conducted by the National Digital Identity Team at GovTech.
This is the first part of a two-part series that focuses on the NDI tech stack. You can read our second article here.
Developing the Stack
In this part, we’re diving straight into the process of how we built the NDI infrastructure.
We’ve divided this development process into six areas of engineering focus that each serve to improve the overall robustness of the platform while increasing agility and keeping costs down. In this article, we’ll be looking at the first three:
Multi-tenancy — How do we avoid duplicated efforts spent on common patterns across all NDI projects?
Resiliency — How do we ensure that our NDI services are always available, knowing that all systems eventually fail at some point in time?
Security — As one of the first critical infrastructures built on the cloud, how has the security paradigm changed? How do we then ensure our system is secure, knowing that it will have to operate on a shared-responsibility model?
Let’s explore them one-by-one, shall we?
The NDI programme comprises multiple projects, such as SingPass, SingPass Mobile, MyInfo, etc. In order to create a multi-tenancy architecture, we first attempted to identify common patterns in the infrastructure for every project. For instance, every web app project requires some form of De-Militarised Zone (DMZ) to control traffic coming in and out of the network, as well as a management zone that consists of a suite of tools for monitoring, log management, access management, etc.
Next, we had to figure out how to implement these zones in the cloud efficiently, without duplicating the effort of building such common patterns, while keeping every project consistently aligned with policy standards.
With that in mind, we created a multi-tenancy architecture in the form of common stacks, as depicted above in the diagram. These stacks are shared between the various projects, highlighted in light blue, allowing developers to focus on the important work of application development as opposed to infrastructure development.
You’ll notice, wedged between the DMZ and Management zone stacks, that there’s also an Application Infrastructure Architecture Standard (AIAS) zone stack. The AIAS zone serves as a bridge that enables protected traffic flow between the internet and intranet zones through encryption, authentication and payload inspection.
While we build the common stacks, we are always aware that we do not need to build everything from scratch. NDI is built on the Government Commercial Cloud (GCC) platform, which enforces uniformity in how we onboard users, define roles and access rights, etc., across our various cloud providers. We also utilise tools from the SG Tech Stack, such as SHIP and HATS, for our Deploy (CI/CD) stack.
While building common stacks has obvious benefits, it also brings difficulties that must be addressed. For example, concentration risk is a real concern. Our design needs to ensure isolation of configurations and workload instances at the project level, so that a misconfiguration or software bug in one project won’t cause a cascading failure that affects all our other projects.
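One way to picture project-level isolation in a shared stack is a loader that validates each project's configuration on its own, so that a bad entry for one tenant is rejected without stopping the others. The sketch below is illustrative only: the project names, the `upstream_port` setting and the `load_tenant_configs` helper are assumptions made for the example, not NDI's actual configuration model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantConfig:
    project: str
    upstream_port: int  # hypothetical setting, for illustration only


def validate(project, raw):
    """Validate one project's raw settings; raises ValueError on bad input."""
    port = raw.get("upstream_port")
    if not isinstance(port, int) or not (1 <= port <= 65535):
        raise ValueError(f"{project}: invalid upstream_port {port!r}")
    return TenantConfig(project=project, upstream_port=port)


def load_tenant_configs(raw_configs):
    """Load every project's config in isolation.

    Returns (accepted, rejected). A failure in one project is recorded
    and skipped, so it cannot stop the other projects from loading.
    """
    accepted, rejected = {}, {}
    for project, raw in raw_configs.items():
        try:
            accepted[project] = validate(project, raw)
        except ValueError as err:
            rejected[project] = str(err)
    return accepted, rejected


raw = {
    "project-a": {"upstream_port": 8443},
    "project-b": {"upstream_port": "not-a-port"},  # misconfigured tenant
    "project-c": {"upstream_port": 9443},
}
accepted, rejected = load_tenant_configs(raw)
print(sorted(accepted))  # project-a and project-c still load
print(sorted(rejected))  # only project-b is rejected
```

The design choice worth noting is that validation failures are collected per project rather than raised globally, which keeps the blast radius of a misconfiguration to the project that introduced it.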
Any deployment to the common stacks may potentially cause downtime to projects sharing the stacks. It is therefore paramount to have a comprehensive automated regression test suite encompassing key use cases of each project, and to carry out deployment with zero-downtime approaches (more on that in the second part of the article).
We also need to build a high degree of self-service into the common stacks, so that project teams can roll out config changes without waiting for a central team to attend to them, a wait that could hurt team velocity.
Why is it important for us to build common stacks?
In every project, there is a tendency for developers to dedicate their efforts towards building their own infra layers. This could happen for any number of reasons, including not having access to multi-tenancy features. Not only does this lead to unnecessary extra effort spent on largely similar tasks, it also inflates expenses.
By building these infra layers just once and sharing them across different projects, the effort and cost savings are potentially huge! (see the bottom left section of the diagram)
Furthermore, without a common stack, every project team takes its own approach in trying to adhere to policies, possibly leading to inconsistencies across teams. Also, the same sets of various tests and assessments are conducted on infrastructure repeatedly among teams. (see the bottom right of the diagram)
The NDI common stack spares teams from these potential headaches by (1) defining a set of frameworks and standards that reference back to industry best practices, and (2) offering implementation guides for all NDI projects to use. This ensures that every project is aligned with our security policies and that app development efficiency is maximised.
Even the best cloud service providers encounter outages.
NDI is a national platform that has to be able to serve over 4 million users and thousands of public and private relying parties. Assuming that all systems will eventually fail, how can we build our platform to not only minimise these outages but be resilient to such inevitable failures?
A diagram of NDI’s multi-AZ network.
The first thing we did was to fully leverage the Multi-Availability Zone (Multi-AZ) network capability of our cloud platform to build an active/active system across the three AZs in the Singapore region, where traffic is load-balanced across all three AZs.
We also implemented an auto-scaling group for every node in the transactional path. Besides ensuring that we have the on-demand capacity to respond to any sudden spike in the load, the auto-scaling group created for each node also eliminates any single point of failure within each AZ. Imagine that a hundred requests are sent to one of the AZs and the reverse proxy node in that AZ suddenly fails.
In a traditional setup, all the requests sent to that AZ would simply time out and fail. In our setup, however, requests are rerouted to the next AZ while the auto-scaling group replaces the failed node, ensuring the remaining requests are successfully processed.
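The rerouting behaviour can be sketched as a round-robin balancer that skips any zone whose entry node is unhealthy. This is a toy model under stated assumptions, not the cloud provider's load balancer or auto-scaling API: the zone names and the `ZoneBalancer` class are hypothetical.

```python
from itertools import cycle


class ZoneBalancer:
    """Toy round-robin balancer across availability zones (hypothetical)."""

    def __init__(self, zones):
        self.healthy = {zone: True for zone in zones}
        self._order = cycle(zones)
        self._zone_count = len(zones)

    def mark(self, zone, healthy):
        """Record the latest health status of a zone's entry node."""
        self.healthy[zone] = healthy

    def route(self):
        """Return the next healthy zone, skipping failed ones."""
        for _ in range(self._zone_count):
            zone = next(self._order)
            if self.healthy[zone]:
                return zone
        raise RuntimeError("no healthy zone available")


balancer = ZoneBalancer(["az-1", "az-2", "az-3"])
balancer.mark("az-2", healthy=False)  # reverse proxy in az-2 fails

served = [balancer.route() for _ in range(6)]
print(served)  # az-2 never appears; az-1 and az-3 share the load
```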
Something else we implemented was auto-healing. One of the key selection criteria for any product we use in our platform is that it must be stateless and support health checks. This enables the product to work with the cloud auto-scaling mechanism, such that the auto-scaling group can detect a malfunctioning instance and replace it automatically.
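A minimal sketch of why statelessness and health checks matter for auto-healing: if an instance holds no state, a failed one can simply be swapped for a fresh one. The `Instance` class and `heal` function below are hypothetical stand-ins for what a cloud auto-scaling group does internally.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # simple source of unique instance IDs


@dataclass
class Instance:
    healthy: bool = True
    instance_id: int = field(default_factory=lambda: next(_ids))


def heal(group):
    """Replace every instance that fails its health check.

    Because instances are assumed stateless, a replacement can take over
    without any data being migrated from the failed one.
    """
    return [inst if inst.healthy else Instance() for inst in group]


group = [Instance(), Instance(), Instance()]
group[1].healthy = False  # simulate a malfunctioning instance

healed = heal(group)
print([inst.instance_id for inst in healed])  # the failed instance has a new ID
```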
NDI is a critical information infrastructure (CII), and it was important that we did not simply carry over the traditional security paradigm that exists in our on-premises system today. Instead, we looked at how we could implement a security paradigm in a cloud environment that’s equivalent to, if not better than, the existing one. We did so by taking reference from this zero-trust framework published by NIST.
But what is zero-trust, and how does it relate to NDI exactly? A zero-trust framework comprises several components:
Micro-segmentation — a mechanism to reduce the blast radius when an infra component is compromised. We deployed each set of infra components in its own subnet, so that if one is compromised, the subnet NACL rules contain lateral movement and unauthorised access to that subnet.
Fine-grained access controls — with our infra divided into micro-segmentations (subnets) and NACL rules controlling access between the micro-segmentations, we further put in place fine-grained access controls, in the form of security groups attached to each workload, to ensure access is only given between intended workloads.
Never trust, always verify — we treat each system as untrusted and continuously assess every connectivity point.
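The first two components can be sketched as a two-layer check: a coarse subnet-level rule (playing the role of a NACL) followed by a fine-grained workload-level rule (playing the role of a security group). The subnet names, CIDR ranges, IP addresses and ports below are all assumptions made for the example, and the rule tables are a simplification of how real NACLs and security groups are expressed.

```python
from ipaddress import ip_address, ip_network

# Hypothetical subnets, one per set of infra components.
SUBNETS = {
    "proxy": ip_network("10.0.1.0/24"),
    "app": ip_network("10.0.2.0/24"),
    "db": ip_network("10.0.3.0/24"),
}

# Subnet-level rules (NACL-like): which subnet may talk to which.
NACL_ALLOW = {("proxy", "app"), ("app", "db")}

# Workload-level rules (security-group-like): (source IP, dest IP, port).
SECURITY_GROUP_ALLOW = {
    ("10.0.1.10", "10.0.2.20", 8443),
    ("10.0.2.20", "10.0.3.30", 5432),
}


def subnet_of(ip):
    """Return the name of the subnet containing ip, or None."""
    addr = ip_address(ip)
    for name, network in SUBNETS.items():
        if addr in network:
            return name
    return None


def is_allowed(src, dst, port):
    """Allow traffic only if both the subnet and workload rules permit it."""
    if (subnet_of(src), subnet_of(dst)) not in NACL_ALLOW:
        return False  # blocked at the subnet boundary
    return (src, dst, port) in SECURITY_GROUP_ALLOW


print(is_allowed("10.0.1.10", "10.0.2.20", 8443))  # proxy -> app, listed workload
print(is_allowed("10.0.1.10", "10.0.3.30", 5432))  # proxy -> db, no subnet rule
print(is_allowed("10.0.2.21", "10.0.3.30", 5432))  # app -> db, unlisted workload
```

Only the first call passes both layers. The second is stopped at the subnet boundary, and the third passes the subnet rule but fails the workload rule, which is the case fine-grained access control is meant to catch.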
The diagram above depicts a system initiating a request for enterprise resources. These resources could be another system’s database or cloud resources.
When a system initiates a request, it’s considered untrusted until verification is done. The verification is conducted in the Policy Enforcement Point (PEP) process, where policies at decision points are validated and checked. In the diagram, we depict a scenario of an instance-to-instance/docker-to-docker communication, where the verification process happens at three different levels:
Network Policy — ensures the request originates from the workload that the target expects to communicate with. The subnet firewall, host firewall, routing ruleset and mTLS all help to ensure that this is the case.
Service Policy — the request has to go through authentication, authorisation and payload inspection to mitigate the risk of attacks that rely on malformed payloads.
IAM Policy — validates the roles and access rights of the requester before the intended action can be executed.
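The three verification levels can be sketched as a chain of checks that a request must pass in order, with the request denied at the first level that fails. This is a simplified illustration, not NDI's enforcement logic: the `Request` fields, the workload names and the policy tables are hypothetical, and authentication is reduced to a boolean assumed to be computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Request:
    source: str        # calling workload
    target: str        # workload being called
    token_valid: bool  # outcome of authentication (assumed done upstream)
    payload: dict
    role: str
    action: str


ALLOWED_PATHS = {("app", "db-service")}  # network policy
REQUIRED_FIELDS = {"user_id"}            # service policy (payload inspection)
ROLE_ACTIONS = {"reader": {"read"}, "writer": {"read", "write"}}  # IAM policy


def network_policy(req):
    return (req.source, req.target) in ALLOWED_PATHS


def service_policy(req):
    return req.token_valid and REQUIRED_FIELDS.issubset(req.payload)


def iam_policy(req):
    return req.action in ROLE_ACTIONS.get(req.role, set())


def enforce(req):
    """Return 'allow', or 'deny:<level>' for the first failing level."""
    checks = [
        ("network", network_policy),
        ("service", service_policy),
        ("iam", iam_policy),
    ]
    for name, check in checks:
        if not check(req):
            return f"deny:{name}"
    return "allow"


good = Request("app", "db-service", True, {"user_id": "u1"}, "reader", "read")
wrong_source = Request("proxy", "db-service", True, {"user_id": "u1"}, "reader", "read")
bad_payload = Request("app", "db-service", True, {}, "reader", "read")
no_rights = Request("app", "db-service", True, {"user_id": "u1"}, "reader", "write")

for req in (good, wrong_source, bad_payload, no_rights):
    print(enforce(req))
```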
For this zero-trust mechanism to work out holistically in NDI, the backbone system (depicted as the bottom layer in the diagram) is extremely important.
For example, consider access management. On the user side, there is a need for privileged ID management to manage user accounts with elevated permissions to critical resources, and to enforce proper segregation of roles. On the system side, there is a need for a certificate management system (PKI) to issue certificates for the various endpoints, so that access control on these endpoints can be enforced through certificate-based authentication and authorisation. There must also be a data access policy that determines which data each system can access.
For threat detection, alerts and response, activity logs from applications, infra and control points are piped into the SIEM tool, which applies rules and threat intelligence to identify potential threats, send out alerts and trigger actions to isolate or neutralise the threats.
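As a rough illustration of the detect, alert and respond flow, the sketch below applies a single rule to a handful of log events and derives a response action from the matches. The event format, the threshold value and the `isolate` action string are assumptions for the example and do not correspond to any particular SIEM product.

```python
from collections import Counter

# Hypothetical, simplified activity log events.
events = [
    {"source_ip": "10.0.2.20", "event": "login_failed"},
    {"source_ip": "10.0.2.20", "event": "login_failed"},
    {"source_ip": "10.0.2.21", "event": "login_failed"},
    {"source_ip": "10.0.2.20", "event": "login_failed"},
    {"source_ip": "10.0.2.21", "event": "login_ok"},
]

FAILED_LOGIN_THRESHOLD = 3  # assumed value, for illustration only


def detect_repeated_failures(events, threshold=FAILED_LOGIN_THRESHOLD):
    """Rule: flag any source with at least `threshold` failed logins."""
    failures = Counter(
        e["source_ip"] for e in events if e["event"] == "login_failed"
    )
    return sorted(ip for ip, n in failures.items() if n >= threshold)


def respond(suspect_ips):
    """Turn each detection into a response action (here, just a string)."""
    return [f"isolate {ip}" for ip in suspect_ips]


suspects = detect_repeated_failures(events)
actions = respond(suspects)
print(suspects)
print(actions)
```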
None of this means, by any stretch, that we should throw out decades of perimeter defence experience. A combination of both zero trust and perimeter defence mechanisms leads to an overall better security posture for NDI in the cloud.
Here is a snapshot of some of the key security controls that we use.
With that, we’ve come to the end of the first look into the NDI tech stack. We hope this has been informative and useful for you, and that you’ll join us in the second part of this article, where we talk about minimising scheduled downtimes, ensuring consistency in our environments and utilising metrics and insights to foresee and prevent potential problems.
Please click here to read the second part.
If you’re a potential partner who’s keen to integrate NDI into your services, products or platform, you can visit the GovTech Developer Portal for more information. The portal contains simple onboarding instructions, APIs and technical documentation, as well as quick access to development sandboxes.
For more information on other government developed products, please visit developer.tech.gov.sg.
Authors: GovTech’s NDI Team (Donald Ong, Dickson Chu, Wongso Wijaya) and Technology Management Office (Michael Tan)
Originally published here on NDI.sg.
For more resources, visit NDI.sg.