Learn the principles and practices essential for your organisation to scale critical services reliably and economically.
Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The key objective is to create ultra-scalable and highly reliable distributed software systems.
Introducing a site-reliability dimension requires organisational re-alignment, a new focus on engineering and automation, as well as the adoption of a range of new working paradigms.
Key Takeaways
At the end of this programme, you will be able to:
Discover the history of SRE and its emergence at Google
Understand the inter-relationship of SRE with DevOps and other popular frameworks
Understand the underlying principles behind SRE
Understand Service Level Objectives (SLOs) and their user focus
Understand Service Level Indicators (SLIs) and the modern monitoring landscape
Identify error budgets and the associated error budget policies
Understand toil and its effect on an organisation’s productivity
Identify some practical steps that can help to eliminate toil
Understand observability as an indicator of the health of a service
Understand SRE tools, automation techniques and the importance of security
Apply anti-fragility as an approach to failure and failure testing
Understand the impact that SRE can have on an organisation
Who Should Attend
Please refer to the job roles section.
Prerequisites
Prior knowledge of DevOps, which can be acquired by attending IT14A05 - DevOps Foundation.
It is recommended that you have prior working experience or knowledge in IT software development or IT industry operations.
What To Bring
Hardware and Software
This programme will be conducted as a Virtual Live Class (VLC) via the Zoom platform. You must have a Zoom account and a laptop or desktop with “Zoom Client for Meetings” installed. This can be downloaded from https://zoom.us/download.
Please ensure that your computer or laptop meets the following requirements:
Operating system: Windows 10 or macOS (64-bit or above)
Processor/CPU: 1.8 GHz, 2-core Intel Core i3 or higher
Minimum 20 GB hard disk space
Minimum 8 GB RAM
Webcam (the camera must be turned on for the entire duration of the class)
Microphone
Internet connection: wired or wireless broadband
The latest version of Zoom must be installed on your computer or laptop before the class
A wired internet connection is recommended for a stable and reliable connection.
Dual monitors are recommended to improve your training experience, as they allow you to participate in hands-on exercises while staying engaged with your instructor.
Programme Structure
This programme will cover the following topics:
Module 1: SRE Principles and Practices
What is Site Reliability Engineering?
SRE and DevOps: What are the Differences?
SRE Principles and Practices
Module 2: Service Level Objectives and Error Budgets
Service Level Objectives (SLOs)
Error Budgets and Error Budget Policies
Module 3: Reducing Toil
What is Toil?
Why is Toil Bad?
Doing Something About Toil
Module 4: Monitoring & Service Level Indicators
Service Level Indicators (SLIs)
Monitoring and Observability
Module 5: SRE Tools and Automation
Automation Focus
Hierarchy of Automation Types
Secure Automation
Automation Tools
Module 6: Anti-Fragility and Learning from Failure
Why Learn from Failure?
Benefits of Anti-Fragility
Shifting the Organisational Balance
Module 7: Organisational Impact of SRE
Why Do Organisations Embrace SRE?
Patterns for SRE Adoption
Sustainable Incident Response
Blameless Post-Mortems
SRE and Scale
Module 8: SRE, Other Frameworks and Trends
SRE and Other Frameworks
SRE Evolution
Additional Sources of Information
Certificate Obtained and Conferred By:
Certificate of Completion from NTUC LearningHub
Upon meeting at least 75% attendance and passing the assessment(s), you will receive a Certificate of Completion from NTUC LearningHub.
Statement of Attainment from SkillsFuture Singapore
Upon meeting at least 75% attendance and passing the assessment(s), you will receive a Statement of Attainment from SkillsFuture Singapore to certify that you have achieved the following Competency Standard(s): Quality Engineering (ICT-DIT-3011-1.1)
External Certification Exam:
After registration, you will receive a DevOps exam voucher from NTUC LearningHub three days before the programme commences. After completing the programme with at least 75% attendance, you can register and sit for the official “DevOps Site Reliability Engineering Foundation” exam on the DevOps Institute online portal. You must complete the exam within the validity period of the exam voucher.
DevOps Site Reliability Engineering Foundation Exam Details
Number of Questions: 40
Question Format: Multiple-choice
Exam Duration: 60 minutes
Passing Score: 26 out of 40 (65%)
After completing this programme with at least 75% attendance and upon passing the official “DevOps Site Reliability Engineering Foundation” certification exam, you will receive a Certified Site Reliability Engineering Foundation certification from DevOps Institute. The certification is governed and maintained by DevOps Institute.