Building a Cloud-Native Container Platform from Scratch - Part 7
By this stage, you’ve built and scaled an impressive self-service container platform. It’s secure, multi-tenant, GitOps-driven, and developer-friendly. But now comes the hard part: keeping it all alive, healthy, secure, and cost-effective over time.
This is where Day-2 operations come into focus. In this post, we’ll cover the critical areas that separate a working platform from a sustainable one:
- Platform observability
- Backup and disaster recovery
- Upgrades and lifecycle management
- Cost visibility and chargeback
- Handling incidents and outages
There are three core phases of operations in software development: Day-0, Day-1, and Day-2.
Day-0 operations are all about planning: what needs to be set up?
Day-1 operations put those plans into action by running the required deployments.
Day-2 operations are the ‘business as usual’ tasks: monitoring the application to prevent and respond to problems, and performing routine maintenance such as backups and restores.
Full series
- Part 1: Why Build a Self-Service Container Platform
- Part 2: Choosing Your Platform’s Building Blocks
- Part 3: Bootstrapping Your Infrastructure with Terraform
- Part 4: Installing Core Platform Services
- Part 5: Crafting the Developer Experience Layer
- Part 6: Scaling the Platform — Multi-Tenancy, Environments, and Governance
- Part 7: Day-2 Operations and Platform Maturity (you are here)
- Part 8: The Future of Your Internal Platform
Platform Observability: Beyond App Metrics
You’ve already deployed Prometheus, Grafana, and Loki, but now you need to observe the platform itself:
- Kubernetes health metrics: controller manager, API server, node pressure, etc.
- Cluster-level dashboards: show usage per namespace, node saturation, pod restarts
- Platform alerts: “Argo CD sync errors,” “cert-manager failures,” “ingress routing issues”
Tools like kube-state-metrics and Node Exporter enhance visibility, and Thanos or Cortex can help aggregate metrics from multiple clusters.
💡 Pro tip: Provide pre-built dashboards to teams so they don’t need to reinvent them.
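To make the platform alerts concrete, here’s a minimal PrometheusRule sketch that fires on Argo CD sync drift and on certificates nearing expiry. It assumes your monitoring stack runs the Prometheus Operator and already scrapes the Argo CD and cert-manager metrics endpoints; the rule name, namespace, and thresholds are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-alerts           # hypothetical name
  namespace: monitoring           # assumes the Prometheus Operator watches this namespace
spec:
  groups:
    - name: platform.rules
      rules:
        # Argo CD exports argocd_app_info with a sync_status label
        - alert: ArgoCDAppOutOfSync
          expr: argocd_app_info{sync_status="OutOfSync"} == 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Argo CD application {{ $labels.name }} has been OutOfSync for 15 minutes"
        # cert-manager exports certificate expiry timestamps
        - alert: CertificateExpiringSoon
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 24 * 3600
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in less than 7 days"
```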
Backup and Disaster Recovery
At some point, something will go wrong, and you need to recover fast!
What to back up:
- etcd and cluster state: control-plane metadata (on EKS, etcd snapshots are AWS-managed; use Velero to back up the Kubernetes API objects themselves)
- Persistent Volumes: EBS or EFS backups for stateful workloads
- GitOps Repos: your platform state lives in Git — keep these safe!
- Secrets: encrypted backups of cloud KMS or External Secrets config
Tools like Velero can snapshot and restore workloads per namespace, or even across clusters.
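As a sketch of what that looks like in practice, a nightly Velero Schedule can back up tenant namespaces and snapshot their volumes. This assumes Velero is installed in the velero namespace with a backup storage location and volume snapshotter configured; the schedule and retention values are illustrative:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-platform-backup   # hypothetical name
  namespace: velero               # assumes Velero is installed here
spec:
  schedule: "0 2 * * *"           # every night at 02:00
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
    snapshotVolumes: true         # also snapshot persistent volumes
    ttl: 720h                     # retain backups for 30 days
```

From a backup like this you can restore a single tenant with `velero restore create --from-backup <backup-name> --include-namespaces <namespace>`.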
Upgrade Strategy and Lifecycle Management
Your platform is not static — you must stay on top of updates to:
- Kubernetes (EKS versions go EOL quickly)
- Platform tools (Argo CD, cert-manager, monitoring stack)
- Custom workloads and Helm charts
Automate version testing and rollouts:
- Stage upgrades in lower environments first
- Test Argo CD syncs after each change
- Use CI to lint and test Helm values and Kubernetes manifests (see the workflow sketch below)
Use a central changelog or Backstage plugin to inform teams about platform changes.
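As one way to wire up that CI step, here’s a minimal GitHub Actions sketch. The workflow path, the charts/ repo layout, and the target Kubernetes version are assumptions, and it presumes kubeconform is available on the runner:

```yaml
# .github/workflows/platform-lint.yml (hypothetical path and repo layout)
name: platform-lint
on:
  pull_request:
    paths:
      - "charts/**"              # assumed location of the platform's Helm charts
jobs:
  lint:
    runs-on: ubuntu-latest       # helm is pre-installed on GitHub-hosted runners
    steps:
      - uses: actions/checkout@v4
      - name: Lint Helm charts
        run: |
          for chart in charts/*/; do
            helm lint "$chart"
          done
      - name: Validate rendered manifests against the target Kubernetes version
        # assumes kubeconform is installed on the runner; 1.29.0 is an example version
        run: |
          for chart in charts/*/; do
            helm template "$chart" > rendered.yaml
            kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.29.0 rendered.yaml
          done
```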
Cost Visibility and Chargeback
Kubernetes abstracts infrastructure — but someone’s paying for those EC2 nodes and EBS volumes.
For cost control, introduce:
- Cost dashboards per namespace/team
- Quotas: CPU, memory, storage, and pod limits (see the ResourceQuota sketch after this list)
- Cluster Autoscaler tuning: right-size your workloads and nodes
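For the quota piece, a per-namespace ResourceQuota is the standard Kubernetes building block. A minimal sketch, with a hypothetical team-a tenant namespace and purely illustrative limits:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota            # hypothetical tenant namespace and name
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"          # total CPU requests across the namespace
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    requests.storage: 500Gi     # total PVC storage requests
    pods: "200"
```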
Tools to help:
- Kubecost (popular for namespace-level cost tracking)
- AWS Cost Explorer + tagging
- Custom Prometheus exporters for billing metrics
If needed, create a chargeback model, where teams see and are accountable for their resource usage.
Incident Handling and Platform SRE
You are now an SRE for the platform. That means:
- Set up alerting and an on-call rotation for platform issues (see the Alertmanager routing sketch at the end of this section)
- Maintain runbooks: how to fix Argo CD outages, expired certs, and failing upgrades
- Document incident reviews: what happened, how to prevent recurrence
- Share insights with your internal customers (the product teams)
If the platform goes down, devs can’t ship, and that’s a business risk.
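For the alert routing piece, here’s a minimal Alertmanager configuration sketch that pages an on-call receiver for critical platform alerts and sends everything else to chat. The receiver names, PagerDuty integration key, and Slack webhook URL are placeholders:

```yaml
# Alertmanager routing sketch; receiver names and credentials are placeholders
route:
  receiver: platform-chat              # default: non-urgent alerts go to chat
  group_by: ["alertname", "namespace"]
  routes:
    - matchers:
        - severity = critical
      receiver: platform-oncall        # critical platform alerts page the on-call engineer
receivers:
  - name: platform-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: platform-chat
    slack_configs:
      - api_url: <slack-incoming-webhook-url>
        channel: "#platform-alerts"
```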
Platform as a Long-Lived Product
Day-2 operations are where many internal platforms fail, not for technical reasons, but because the platform team is overloaded or the system becomes too brittle to evolve.
Platform maturity requires discipline:
- Regular hygiene tasks (dependency updates, security patches)
- Clear ownership and roadmap
- Listening to developer feedback
- Minimising toil through automation
Keep iterating. Treat your platform as a living product with its own lifecycle.
You’ve Reached Operational Maturity
Your platform now delivers real business value: it is resilient, observable, and cost-aware, and it has the tools to grow and evolve with your teams.
This is the line between a DIY Kubernetes cluster and a real internal developer platform.
Coming Up in Part 8
In the final part, we’ll look at what’s next: advanced capabilities and the future of the internal platform, including progressive delivery, AI-enhanced observability, and internal platform marketplaces.