Building a Cloud-Native Container Platform from Scratch - Part 7
By this stage, you’ve built and scaled an impressive self-service container platform. It’s secure, multi-tenant, GitOps-driven, and developer-friendly. But now comes the hard part: keeping it all alive, healthy, secure, and cost-effective over time.
This is where Day-2 operations come into focus. In this post, we’ll cover the critical areas that separate a working platform from a sustainable one:
- Platform observability
- Backup and disaster recovery
- Upgrades and lifecycle management
- Cost visibility and chargeback
- Handling incidents and outages
There are three core phases of operations in software development: Day-0, Day-1, and Day-2.
Day-0 operations are all about planning: what needs to be set up?
Day-1 operations put those plans into action by running the required deployments.
Day-2 operations are the ‘business as usual’ tasks: monitoring the application to prevent and respond to problems, and performing routine maintenance such as backups and restores.
Full series
- Part 1: Why Build a Self-Service Container Platform
- Part 2: Choosing Your Platform’s Building Blocks
- Part 3: Bootstrapping Your Infrastructure with Terraform
- Part 4: Installing Core Platform Services
- Part 5: Crafting the Developer Experience Layer
- Part 6: Scaling the Platform — Multi-Tenancy, Environments, and Governance
- Part 7: Day-2 Operations and Platform Maturity (you are here)
- Part 8: The Future of Your Internal Platform
Platform Observability: Beyond App Metrics
You’ve already deployed Prometheus, Grafana, and Loki, but now you need to observe the platform itself:
- Kubernetes health metrics: controller manager, API server, node pressure, etc.
- Cluster-level dashboards: show usage per namespace, node saturation, pod restarts
- Platform alerts: “Argo CD sync errors,” “cert-manager failures,” “ingress routing issues”
Tools like kube-state-metrics and Node Exporter enhance visibility, and Thanos or Cortex can help aggregate metrics from multiple clusters.
💡 Pro tip: Provide pre-built dashboards to teams so they don’t need to reinvent them.
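To make the platform alerts concrete, here’s a minimal PrometheusRule sketch that fires on Argo CD sync drift and on certificates nearing expiry. It assumes your monitoring stack runs the Prometheus Operator and already scrapes the Argo CD and cert-manager metrics endpoints; the rule name, namespace, and thresholds are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-alerts           # hypothetical name
  namespace: monitoring           # assumes the Prometheus Operator watches this namespace
spec:
  groups:
    - name: platform.rules
      rules:
        # Argo CD exports argocd_app_info with a sync_status label
        - alert: ArgoCDAppOutOfSync
          expr: argocd_app_info{sync_status="OutOfSync"} == 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Argo CD application {{ $labels.name }} has been OutOfSync for 15 minutes"
        # cert-manager exports certificate expiry timestamps
        - alert: CertificateExpiringSoon
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 24 * 3600
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in less than 7 days"
```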
Backup and Disaster Recovery
At some point, something will go wrong, and you need to recover fast!
What to back up:
- etcd and cluster state: control-plane metadata (on EKS, etcd snapshots are AWS-managed; use Velero to back up the Kubernetes API objects themselves)
- Persistent Volumes: EBS or EFS backups for stateful workloads
- GitOps Repos: your platform state lives in Git — keep these safe!
- Secrets: encrypted backups of cloud KMS or External Secrets config
Tools like Velero can snapshot and restore workloads per namespace, or even across clusters.
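As a sketch of what that looks like in practice, a nightly Velero Schedule can back up tenant namespaces and snapshot their volumes. This assumes Velero is installed in the velero namespace with a backup storage location and volume snapshotter configured; the schedule and retention values are illustrative:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-platform-backup   # hypothetical name
  namespace: velero               # assumes Velero is installed here
spec:
  schedule: "0 2 * * *"           # every night at 02:00
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
    snapshotVolumes: true         # also snapshot persistent volumes
    ttl: 720h                     # retain backups for 30 days
```

From a backup like this you can restore a single tenant with `velero restore create --from-backup <backup-name> --include-namespaces <namespace>`.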
Upgrade Strategy and Lifecycle Management
Your platform is not static — you must stay on top of updates to:
- Kubernetes (EKS versions go EOL quickly)
- Platform tools (Argo CD, cert-manager, monitoring stack)
- Custom workloads and Helm charts
Automate version testing and rollouts:
- Stage upgrades in lower environments first
- Test Argo CD syncs after each change
- Use CI to lint and test Helm values and Kubernetes manifests (see the workflow sketch below)
Use a central changelog or Backstage plugin to inform teams about platform changes.
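As one way to wire up that CI step, here’s a minimal GitHub Actions sketch. The workflow path, the charts/ repo layout, and the target Kubernetes version are assumptions, and it presumes kubeconform is available on the runner:

```yaml
# .github/workflows/platform-lint.yml (hypothetical path and repo layout)
name: platform-lint
on:
  pull_request:
    paths:
      - "charts/**"              # assumed location of the platform's Helm charts
jobs:
  lint:
    runs-on: ubuntu-latest       # helm is pre-installed on GitHub-hosted runners
    steps:
      - uses: actions/checkout@v4
      - name: Lint Helm charts
        run: |
          for chart in charts/*/; do
            helm lint "$chart"
          done
      - name: Validate rendered manifests against the target Kubernetes version
        # assumes kubeconform is installed on the runner; 1.29.0 is an example version
        run: |
          for chart in charts/*/; do
            helm template "$chart" > rendered.yaml
            kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.29.0 rendered.yaml
          done
```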
Cost Visibility and Chargeback
Kubernetes abstracts infrastructure — but someone’s paying for those EC2 nodes and EBS volumes.
For cost control, introduce:
- Cost dashboards per namespace/team
- Quotas: CPU, memory, storage, and pod limits (see the ResourceQuota sketch after this list)
- Cluster Autoscaler tuning: right-size your workloads and nodes
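For the quota piece, a per-namespace ResourceQuota is the standard Kubernetes building block. A minimal sketch, with a hypothetical team-a tenant namespace and purely illustrative limits:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota            # hypothetical tenant namespace and name
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"          # total CPU requests across the namespace
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    requests.storage: 500Gi     # total PVC storage requests
    pods: "200"
```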
Tools to help:
- Kubecost (popular for namespace-level cost tracking)
- AWS Cost Explorer + tagging
- Custom Prometheus exporters for billing metrics
If needed, create a chargeback model, where teams see and are accountable for their resource usage.
Incident Handling and Platform SRE
You are now an SRE for the platform. That means:
- Set up alerting and an on-call rotation for platform issues (see the Alertmanager routing sketch at the end of this section)
- Maintain runbooks: how to fix Argo CD outages, expired certs, and failing upgrades
- Document incident reviews: what happened, how to prevent recurrence
- Share insights with your internal customers (the product teams)
If the platform goes down, devs can’t ship, and that’s a business risk.
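For the alert routing piece, here’s a minimal Alertmanager configuration sketch that pages an on-call receiver for critical platform alerts and sends everything else to chat. The receiver names, PagerDuty integration key, and Slack webhook URL are placeholders:

```yaml
# Alertmanager routing sketch; receiver names and credentials are placeholders
route:
  receiver: platform-chat              # default: non-urgent alerts go to chat
  group_by: ["alertname", "namespace"]
  routes:
    - matchers:
        - severity = critical
      receiver: platform-oncall        # critical platform alerts page the on-call engineer
receivers:
  - name: platform-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: platform-chat
    slack_configs:
      - api_url: <slack-incoming-webhook-url>
        channel: "#platform-alerts"
```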
Platform as a Long-Lived Product
Day-2 operations are where many internal platforms fail, not for technical reasons, but because the platform team is overloaded or the system becomes too brittle to evolve.
Platform maturity requires discipline:
- Regular hygiene tasks (dependency updates, security patches)
- Clear ownership and roadmap
- Listening to developer feedback
- Minimising toil through automation
Keep iterating. Treat your platform as a living product with its own lifecycle.
You’ve Reached Operational Maturity
Your platform now delivers real business value: it is resilient, observable, and cost-aware, and it has the tools to grow and evolve with your teams.
This is the line between a DIY Kubernetes cluster and a real internal developer platform.
Coming Up in Part 8
In the final part, we’ll look at what’s next: advanced capabilities and the future of the internal platform, including progressive delivery, AI-enhanced observability, and internal platform marketplaces.