
You’ve built a self-service container platform from the ground up, complete with GitOps workflows, security, developer experience tooling, and operational maturity. But platforms are never “done”.

The most effective platform teams don’t just build infrastructure; they build leverage.

In this final part, we’ll explore how to evolve your platform with advanced capabilities that go beyond stability: empowering innovation, accelerating delivery, and future-proofing your ecosystem.

Progressive Delivery: Safer Deployments at Scale

Not all deploys should be “all at once”. Progressive delivery introduces smarter release strategies like:

  • Canary releases: shift traffic to the new version gradually, watching metrics at each step
  • Blue/green deployments: run old and new versions side by side, then cut traffic over in one move
  • Feature flags: decouple deployment from release, toggling features per user or segment

Tools like Flagger, Argo Rollouts, or LaunchDarkly can integrate with your existing GitOps workflows to enable these strategies.

You can build them into your platform templates, giving devs safe, repeatable release workflows.
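As a sketch of what a template-baked canary workflow can look like with Flagger (this assumes Flagger is installed alongside a supported service mesh or ingress controller, and the resource names are illustrative):

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
spec:
  # The Deployment that Flagger manages and progressively promotes
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5        # abort the canary after 5 failed metric checks
    maxWeight: 50       # never send more than 50% of traffic to the canary
    stepWeight: 10      # increase canary traffic in 10% increments
    metrics:
      - name: request-success-rate  # Flagger's built-in success-rate check
        thresholdRange:
          min: 99
        interval: 1m
```

Because Flagger watches the target Deployment, devs trigger a canary simply by updating the image tag; the analysis and promotion happen without any pipeline changes.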

Example: Implementing Canary Deployments with Argo Rollouts

This YAML manifest defines an Argo Rollouts resource for implementing a canary deployment strategy. The Rollout specifies four replicas of the application and uses a canary strategy to gradually shift traffic to the new version. The steps section outlines the rollout process: initially, 20% of traffic is directed to the new version, followed by a 5-minute pause for monitoring. Then, 50% of traffic is shifted with a 10-minute pause, and finally, 100% of traffic is routed to the new version. This staged approach helps reduce risk by allowing issues to be detected early before the rollout is completed. The selector and template fields define how pods are matched and deployed, similar to a standard Kubernetes Deployment.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2

You can automate monitoring and rollback as part of progressive delivery. Tools like Argo Rollouts and Flagger integrate with observability platforms (Prometheus, Datadog, etc.) to watch key metrics (error rate, latency, SLOs) during each rollout step.

If a metric breaches a threshold, the rollout can be paused or automatically rolled back to the previous stable version—without manual intervention.

Example: Automated Analysis and Rollback with Argo Rollouts

strategy:
  canary:
    analysis:
      startingStep: 2 # delay starting analysis run until setWeight: 50%
      templates:
        - templateName: success-rate
      args:
        - name: service-name
          value: my-app
    steps:
      - setWeight: 20
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100

Here, the analysis step references a metric template (e.g. success rate from Prometheus). If the analysis fails, Argo Rollouts will halt or revert the deployment automatically. This approach ensures safer, hands-off releases and faster recovery from issues.

Below is a sample of the success-rate AnalysisTemplate that checks the success rate of your application using Prometheus metrics.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      count: 3
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{job="{{args.service-name}}",status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{job="{{args.service-name}}"}[5m]))

This template queries Prometheus for the ratio of successful HTTP requests (status code 2xx) to total requests over a 5-minute window. If the success rate drops below 99%, the rollout will be paused or rolled back automatically.

Example: Blue/Green Deployment with Kubernetes Services

This YAML manifest defines a Kubernetes Deployment for a “blue/green” deployment strategy. In this approach, a new version of the application (labeled as “green”) is deployed alongside the existing version (“blue”). The deployment creates four replicas of the new version, each identified by the labels app: my-app and version: green. By running both versions simultaneously, you can safely test the new release in production and switch traffic over when ready, minimising downtime and risk during updates.

# Deploy the new version alongside the old
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
      version: green
  template:
    metadata:
      labels:
        app: my-app
        version: green
    spec:
      containers:
        - name: my-app
          image: my-app:v2

---
# Service points to either blue or green deployment
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: green  # Switch to 'blue' or 'green' to cut over
  ports:
    - port: 80
      targetPort: 8080

Switch the version label in the Service selector to cut over traffic after validation.
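The cutover itself can be scripted rather than hand-edited. One sketch, assuming the Service above, is a strategic-merge patch file (the filename is illustrative):

```yaml
# service-cutover-patch.yaml -- strategic-merge patch for the Service selector
spec:
  selector:
    app: my-app
    version: blue  # flip between 'blue' and 'green' to move traffic
```

Apply it with `kubectl patch service my-app --patch-file service-cutover-patch.yaml`. Keeping the patch in Git alongside the manifests means the cutover is reviewable and revertible like any other change.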

Observability Meets Intelligence

Modern observability isn’t just dashboards and alerts; it’s about insight.

Use AI/ML-powered tools (like Dynatrace Davis AI or New Relic AIOps) to:

  • Automatically detect anomalies
  • Correlate logs, metrics, and traces
  • Surface the root cause of outages faster
  • Predict capacity and performance trends

Pair this with SLO-based monitoring to focus on customer impact, not just system health.
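As one illustrative sketch of SLO-based monitoring (assuming the Prometheus Operator’s PrometheusRule CRD and the same http_requests_total metric used earlier; names and thresholds are assumptions), a recording rule plus a burn alert might look like:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-slo
  namespace: monitoring
spec:
  groups:
    - name: my-app-slo
      rules:
        # Record the 30-minute availability ratio for my-app
        - record: my_app:availability:ratio_rate30m
          expr: |
            sum(rate(http_requests_total{job="my-app",status=~"2.."}[30m]))
            /
            sum(rate(http_requests_total{job="my-app"}[30m]))
        # Page when availability dips below the 99.5% objective
        - alert: MyAppSLOBurn
          expr: my_app:availability:ratio_rate30m < 0.995
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "my-app is burning its 99.5% availability SLO"
```

Alerting on the SLO ratio, rather than on raw CPU or pod restarts, keeps pages tied to what customers actually experience.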

Internal Developer Marketplace

As your platform matures, create a catalog of reusable components:

  • App templates (API service, batch job, Kafka consumer)
  • Infrastructure (Postgres, Redis, DNS entries, event triggers)
  • Pipelines (CI/CD workflows, code scanning)

Tools like Backstage let you expose these as click-to-deploy services via a central portal. This creates consistency, speed, and governance, while still giving teams autonomy.
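A minimal sketch of what one such catalog entry can look like as a Backstage software template (the name, owner, and skeleton path are illustrative):

```yaml
# Registers a golden-path "API service" template in the Backstage catalog
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: api-service
  title: API Service
  description: Scaffold a new API service with CI/CD and GitOps wiring
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
          description: Unique name for the new service
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
```

Each template bakes in your pipelines, policies, and observability defaults, so “click to deploy” also means “compliant by default”.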

Federated Platform Model

Larger orgs might want to federate platform responsibilities. Instead of a single central team owning everything:

  • Maintain a core platform team for tooling, standards, and infra
  • Embed platform “squads” within product domains
  • Use a shared governance model (security, policy, GitOps workflows)

This lets the platform evolve organically across business units, balancing flexibility with cohesion.

Embracing Platform Engineering Principles

What sets apart truly great platforms?

  • Empathy for developers
  • Abstraction of complexity, not power
  • Product thinking: roadmaps, feedback loops, onboarding, documentation
  • Metrics-driven impact: measure the DORA metrics (lead time, deployment frequency, change failure rate, MTTR)

And, above all: treating the platform as an enabler, not a gatekeeper.

Your Platform Journey — Recap

Let’s look back on what you’ve built over this series:

  1. Foundation: Secure EKS base, GitOps-first
  2. Core Services: Ingress, observability, secrets, policy
  3. Dev Experience: Templates, namespaces, CI/CD
  4. Multi-Tenancy: Isolated teams and environments
  5. Day-2 Ops: Monitoring, upgrades, cost control
  6. Advanced: Safe delivery, AI, developer marketplaces

This is more than Kubernetes; it’s an internal platform that delivers software faster, safer, and with less friction.

What’s Next?

From here, you might explore:

  • Cross-cloud or hybrid deployments (AKS, GKE, on-prem)
  • Developer portals with real-time feedback loops
  • Integrating security scanning (SAST, DAST, supply chain)
  • Policy-driven self-service infrastructure (Crossplane, Terraform Cloud)
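With Crossplane, for example, self-service can take the shape of a developer-facing claim against a platform-defined composite resource; the API group, kind, and parameters below are illustrative assumptions, not a standard Crossplane API:

```yaml
# A developer requests a database without touching cloud credentials;
# the platform team's Composition decides how it is actually provisioned.
apiVersion: platform.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: team-orders
spec:
  parameters:
    storageGB: 20
  compositionSelector:
    matchLabels:
      provider: aws   # the platform maps this label to an AWS Composition
```

The claim is just another Kubernetes object, so it flows through the same GitOps review and policy checks as application manifests.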

Or, open source your tooling. Share your journey. Inspire others.

Because platform engineering isn’t just about technology — it’s about making engineering better for everyone.
