
By this stage, you’ve built and scaled an impressive self-service container platform. It’s secure, multi-tenant, GitOps-driven, and developer-friendly. But now comes the hard part: keeping it all alive, healthy, secure, and cost-effective over time.

This is where Day-2 operations come into focus. In this post, we’ll cover the critical areas that separate a working platform from a sustainable one:

  • Platform observability
  • Backup and disaster recovery
  • Upgrades and lifecycle management
  • Cost visibility and chargeback
  • Handling incidents and outages

Operations work is commonly split into three phases: day-0, day-1, and day-2.

Day-0 operations are all about planning. What needs to be set up?

Day-1 operations put those plans into action by running the required deployments.

Day-2 operations are ‘business as usual’ operations tasks, like monitoring the application to prevent and respond to problems, and performing routine maintenance like backups and restores.

Platform Observability: Beyond App Metrics

You’ve already deployed Prometheus, Grafana, and Loki, but now you need to observe the platform itself:

  • Kubernetes health metrics: controller manager, API server, node pressure, etc.
  • Cluster-level dashboards: show usage per namespace, node saturation, pod restarts
  • Platform alerts: “Argo CD sync errors,” “cert-manager failures,” “ingress routing issues”

Tools like kube-state-metrics and Node Exporter expose cluster-object and node-level metrics, and Thanos or Cortex can help aggregate metrics from multiple clusters.
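As a rough illustration of the “pod restarts” signal, the sketch below compares two snapshots of restart counters and reports the namespaces whose restarts grew past a threshold. The function names, sample data, and threshold are hypothetical. In practice the counters would come from the kube-state-metrics metric `kube_pod_container_status_restarts_total`, and a Prometheus alert rule would normally do this job.

```python
from collections import defaultdict


def restart_deltas(before, after):
    """Sum restart-count increases per namespace between two snapshots.

    Each snapshot maps (namespace, pod) -> cumulative restart count
    (a hypothetical shape chosen for this example).
    """
    deltas = defaultdict(int)
    for (namespace, pod), count in after.items():
        previous = before.get((namespace, pod), 0)
        # A counter lower than before means the pod was recreated,
        # so count its restarts from zero.
        increase = count - previous if count >= previous else count
        deltas[namespace] += increase
    return dict(deltas)


def noisy_namespaces(before, after, threshold=3):
    """Return namespaces whose restarts grew by more than `threshold`, sorted."""
    deltas = restart_deltas(before, after)
    return sorted(ns for ns, delta in deltas.items() if delta > threshold)


if __name__ == "__main__":
    before = {("payments", "api-1"): 2, ("search", "indexer-0"): 0}
    after = {("payments", "api-1"): 9, ("search", "indexer-0"): 1}
    print(noisy_namespaces(before, after))
```

The same per-namespace grouping is what makes a shared dashboard useful: each team can filter to its own namespace without building the query itself.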

💡 Pro tip: Provide pre-built dashboards to teams so they don’t need to reinvent them.

Backup and Disaster Recovery

At some point, something will go wrong, and you need to recover fast!

What to back up:

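  • Cluster state (etcd): on EKS, AWS manages etcd and you cannot snapshot it yourself, so back up Kubernetes objects through the API with Velero; on self-managed clusters, also take etcd snapshots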
  • Persistent Volumes: EBS or EFS backups for stateful workloads
  • GitOps Repos: your platform state lives in Git — keep these safe!
  • Secrets: encrypted backups of your secret store and External Secrets config, plus protection for the cloud KMS keys needed to decrypt them

Tools like Velero can snapshot and restore workloads per namespace, or even across clusters.
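To make per-namespace backups concrete, here is a minimal sketch that generates a Velero `Schedule` manifest for one tenant namespace. The helper name, the `team-a` namespace, the 02:00 cron expression, and the 30-day retention are all assumptions for illustration. It also assumes Velero runs in a namespace called `velero`; adjust this to match your installation.

```python
import json


def velero_schedule(namespace, cron="0 2 * * *", ttl="720h0m0s"):
    """Build a Velero Schedule manifest that backs up one tenant namespace."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {
            "name": f"{namespace}-daily",
            # Assumption: Velero is installed in the "velero" namespace.
            "namespace": "velero",
        },
        "spec": {
            "schedule": cron,  # cron expression; here, every day at 02:00
            "template": {
                "includedNamespaces": [namespace],
                "snapshotVolumes": True,  # also snapshot persistent volumes
                "ttl": ttl,  # how long each backup is retained
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this output can be committed
    # to the GitOps repo or piped to `kubectl apply -f -`.
    print(json.dumps(velero_schedule("team-a"), indent=2))
```

Generating one schedule per tenant keeps restores scoped to a single team. Whatever you generate, rehearse a restore regularly, because a backup that has never been restored is unproven.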

Upgrade Strategy and Lifecycle Management

Your platform is not static — you must stay on top of updates to:

  • Kubernetes (each EKS version gets roughly 14 months of standard support, so upgrades come around quickly)
  • Platform tools (Argo CD, cert-manager, monitoring stack)
  • Custom workloads and Helm charts

Automate version testing and rollouts:

  • Stage upgrades in lower environments first
  • Test Argo syncs after each change
  • Use CI to lint and test Helm values and Kubernetes manifests

Use a central changelog or Backstage plugin to inform teams about platform changes.
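One check worth automating in CI before a Kubernetes upgrade is a scan for API versions that the target release no longer serves. The sketch below is a simplified, hand-rolled version of that idea. The function names and the shape of the input (manifests already parsed into dictionaries) are assumptions, and the table covers only a handful of well-known removals, so treat it as a starting point rather than a complete list.

```python
# (apiVersion, kind) -> first Kubernetes version that no longer serves it.
# Deliberately partial: extend it from the upstream deprecation guide.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def parse_version(text):
    """Turn '1.25' or 'v1.25.3' into the tuple (1, 25)."""
    major, minor = text.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def find_removed_apis(manifests, target_version):
    """List manifests that use an API removed at or before target_version."""
    target = parse_version(target_version)
    problems = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed_in = REMOVED_APIS.get(key)
        if removed_in is not None and target >= removed_in:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            problems.append(
                f"{key[1]}/{name}: {key[0]} was removed in "
                f"{removed_in[0]}.{removed_in[1]}"
            )
    return problems


if __name__ == "__main__":
    manifests = [
        {"apiVersion": "batch/v1beta1", "kind": "CronJob",
         "metadata": {"name": "nightly"}},
        {"apiVersion": "apps/v1", "kind": "Deployment",
         "metadata": {"name": "api"}},
    ]
    for problem in find_removed_apis(manifests, "1.25"):
        print(problem)
```

Failing the pipeline on a non-empty result in the lower environment means the problem surfaces before the production control plane is upgraded.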

Cost Visibility and Chargeback

Kubernetes abstracts infrastructure — but someone’s paying for those EC2 nodes and EBS volumes.

For cost control, introduce:

  • Cost dashboards per namespace/team
  • Quotas: CPU, memory, storage, and pod limits
  • Cluster Autoscaler tuning: right-size your workloads and nodes

Tools to help:

  • Kubecost (popular for namespace-level cost tracking)
  • AWS Cost Explorer + tagging
  • Custom Prometheus exporters for billing metrics

If needed, create a chargeback model in which teams see, and are accountable for, their resource usage.
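The simplest chargeback model splits the cluster bill in proportion to what each namespace requests. The sketch below does that for CPU only. The function name, the monthly cost, and the per-namespace figures are made up for illustration, and a real model (or a tool like Kubecost) would also account for memory, storage, and idle capacity.

```python
def chargeback(cluster_cost, cpu_requests):
    """Split a cluster's cost across namespaces in proportion to CPU requests.

    cluster_cost: total cost for the period, in any currency.
    cpu_requests: namespace -> requested CPU cores, averaged over the period.
    """
    total = sum(cpu_requests.values())
    if total <= 0:
        raise ValueError("no CPU requests to allocate against")
    return {
        namespace: round(cluster_cost * cores / total, 2)
        for namespace, cores in cpu_requests.items()
    }


if __name__ == "__main__":
    # Hypothetical month: a 1200 bill and 12 requested cores in total.
    shares = chargeback(1200, {"payments": 6, "search": 3, "platform": 3})
    for namespace, cost in shares.items():
        print(f"{namespace}: {cost}")
```

With these numbers, `payments` requests half of the cores and is allocated half of the bill (600), while `search` and `platform` are allocated 300 each. Charging on requests rather than actual usage also gives teams an incentive to right-size what they ask for.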

Incident Handling and Platform SRE

You are now an SRE for the platform. That means:

  • Set up alerting and on-call rotation for platform issues
  • Maintain runbooks: how to fix Argo outages, expired certs, failing upgrades
  • Document incident reviews: what happened, how to prevent recurrence
  • Share insights with your internal customers (the product teams)

If the platform goes down, devs can’t ship, and that’s a business risk.

Platform as a Long-Lived Product

Day-2 operations are where many internal platforms fail. The cause is rarely technical: the platform team is overloaded, or the system becomes too brittle to evolve.

Platform maturity requires discipline:

  • Regular hygiene tasks (dependency updates, security patches)
  • Clear ownership and roadmap
  • Listening to developer feedback
  • Minimising toil through automation

Keep iterating. Treat your platform as a living product with its own lifecycle.

You’ve Reached Operational Maturity

Your platform now delivers real business value. It is resilient, observable, and cost-aware, and it has the tools to grow and evolve with your teams.

This is the line between a DIY Kubernetes cluster and a real internal developer platform.

Coming Up in Part 8

In the final part, we’ll look at what’s next: advanced capabilities and the future of the internal platform, including progressive delivery, AI-enhanced observability, and internal platform marketplaces.

Part 8: The Future of Your Internal Platform
