Building a Cloud-Native Container Platform from Scratch - Part 8
You’ve built a self-service container platform from the ground up, complete with GitOps workflows, security, developer experience tooling, and operational maturity. But platforms are never “done”.
The most effective platform teams don’t just build infrastructure; they build leverage.
In this final part, we’ll explore how to evolve your platform with advanced capabilities that go beyond stability to empower innovation, accelerate delivery, and future-proof your ecosystem.
Full series
- Part 1: Why Build a Self-Service Container Platform
- Part 2: Choosing Your Platform’s Building Blocks
- Part 3: Bootstrapping Your Infrastructure with Terraform
- Part 4: Installing Core Platform Services
- Part 5: Crafting the Developer Experience Layer
- Part 6: Scaling the Platform — Multi-Tenancy, Environments, and Governance
- Part 7: Day-2 Operations and Platform Maturity
- Part 8: The Future of Your Internal Platform (you are here)
Progressive Delivery: Safer Deployments at Scale
Not all deploys should be “all at once”. Progressive delivery introduces smarter release strategies like:
- Canary releases: route 5%, then 25%, then 100% of traffic
- Blue/green deployments: switch over to a new version after validation
- Feature flags: decouple deploy from release entirely
Tools like Flagger, Argo Rollouts, or LaunchDarkly can integrate with your existing GitOps workflows to enable these strategies.
You can build them into your platform templates, giving devs safe, repeatable release workflows.
Example: Implementing Canary Deployments with Argo Rollouts
This YAML manifest defines an Argo Rollouts resource for implementing a canary deployment strategy. The Rollout specifies four replicas of the application and uses a canary strategy to gradually shift traffic to the new version. The steps section outlines the rollout process: initially, 20% of traffic is directed to the new version, followed by a 5-minute pause for monitoring. Then, 50% of traffic is shifted with a 10-minute pause, and finally, 100% of traffic is routed to the new version. This staged approach helps reduce risk by allowing issues to be detected early before the rollout is completed. The selector and template fields define how pods are matched and deployed, similar to a standard Kubernetes Deployment.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2
You can automate monitoring and rollback as part of progressive delivery. Tools like Argo Rollouts and Flagger integrate with observability platforms (Prometheus, Datadog, etc.) to watch key metrics (error rate, latency, SLOs) during each rollout step.
If a metric breaches a threshold, the rollout can be paused or automatically rolled back to the previous stable version—without manual intervention.
Example: Automated Analysis and Rollback with Argo Rollouts
strategy:
  canary:
    analysis:
      startingStep: 2 # delay analysis until step 2 (setWeight: 50)
      templates:
        - templateName: success-rate
      args:
        - name: service-name
          value: my-app
    steps:
      - setWeight: 20
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
Here, the analysis step references a metric template (e.g. success rate from Prometheus). If the analysis fails, Argo Rollouts will halt or revert the deployment automatically. This approach ensures safer, hands-off releases and faster recovery from issues.
Below is a sample of the success-rate AnalysisTemplate that checks the success rate of your application using Prometheus metrics.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      count: 3
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{job="{{args.service-name}}",status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{job="{{args.service-name}}"}[5m]))
This template queries Prometheus for the ratio of successful HTTP requests (status code 2xx) to total requests over a 5-minute window. If the success rate drops below 99%, the rollout will be paused or rolled back automatically.
Example: Blue/Green Deployment with Kubernetes Services
This YAML manifest defines a Kubernetes Deployment for a “blue/green” deployment strategy. In this approach, a new version of the application (labeled as “green”) is deployed alongside the existing version (“blue”). The deployment creates four replicas of the new version, each identified by the labels app: my-app and version: green. By running both versions simultaneously, you can safely test the new release in production and switch traffic over when ready, minimising downtime and risk during updates.
# Deploy the new version alongside the old
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
      version: green
  template:
    metadata:
      labels:
        app: my-app
        version: green
    spec:
      containers:
        - name: my-app
          image: my-app:v2
---
# Service points to either blue or green deployment
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: green # Switch to 'blue' or 'green' to cut over
  ports:
    - port: 80
      targetPort: 8080
Switch the version label in the Service selector to cut over traffic after validation.
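Before flipping the selector, you can smoke-test the green version through a dedicated preview Service. This is a sketch; the `my-app-preview` name is illustrative and not part of the manifests above:

```yaml
# Hypothetical preview Service: always targets the candidate (green)
# version so it can be validated before the main Service cuts over.
apiVersion: v1
kind: Service
metadata:
  name: my-app-preview # illustrative name
spec:
  selector:
    app: my-app
    version: green # always the candidate version
  ports:
    - port: 80
      targetPort: 8080
```

Run your validation traffic against `my-app-preview`, then change the `version` label on the main `my-app` Service (via a Git commit, if GitOps-managed) to complete the cutover.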
Observability Meets Intelligence
Modern observability isn’t just dashboards and alerts; it’s about insight.
Use AI/ML-powered tools (like Dynatrace Davis AI or New Relic AIOps) to:
- Automatically detect anomalies
- Correlate logs, metrics, and traces
- Surface the root cause of outages faster
- Predict capacity and performance trends
Pair this with SLO-based monitoring to focus on customer impact, not just system health.
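As one possible sketch of SLO-based monitoring, a burn-rate alert can page only when the error budget is being consumed fast enough to threaten the SLO. This example assumes the Prometheus Operator's `PrometheusRule` CRD and the same `http_requests_total` metric used earlier; the thresholds are illustrative:

```yaml
# Illustrative SLO burn-rate alert (Prometheus Operator CRD).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-slo
spec:
  groups:
    - name: my-app-slo
      rules:
        - alert: HighErrorBudgetBurn
          expr: |
            (
              sum(rate(http_requests_total{job="my-app",status=~"5.."}[1h]))
              /
              sum(rate(http_requests_total{job="my-app"}[1h]))
            ) > (14.4 * 0.001) # 14.4x burn rate against a 99.9% SLO
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "my-app is burning its error budget too fast"
```

The multiplier (14.4x over one hour) follows the common multi-window burn-rate pattern; tune windows and factors to your own SLO targets.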
Internal Developer Marketplace
As your platform matures, create a catalog of reusable components:
- App templates (API service, batch job, Kafka consumer)
- Infrastructure (Postgres, Redis, DNS entries, event triggers)
- Pipelines (CI/CD workflows, code scanning)
Tools like Backstage let you expose these as click-to-deploy services via a central portal. This creates consistency, speed, and governance, while still giving teams autonomy.
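A catalog entry in Backstage is just a declarative template. Here is a minimal sketch of a Software Template exposing an "API service" starter; the names and the `./skeleton` path are illustrative:

```yaml
# Minimal Backstage Software Template sketch (names are illustrative).
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: api-service
  title: API Service
  description: Scaffold a new API service with CI/CD pre-wired
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
          description: Unique service name
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton # path to the template skeleton, illustrative
        values:
          name: ${{ parameters.name }}
```

A real template would add steps to publish the repository and register the component in the catalog; the point is that the golden path lives in version-controlled YAML, not tribal knowledge.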
Federated Platform Model
Larger orgs might want to federate platform responsibilities. Instead of one central team:
- Maintain a core platform team for tooling, standards, and infra
- Embed platform “squads” within product domains
- Use a shared governance model (security, policy, GitOps workflows)
This lets the platform evolve organically across business units, balancing flexibility with cohesion.
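Shared governance works best when it is codified as policy rather than documentation. As a sketch, using Kyverno (one of several policy engines) a central team could require every Deployment to declare its owning team; the policy name and label are illustrative:

```yaml
# Illustrative shared-governance rule with Kyverno: every Deployment
# must carry a 'team' label identifying its owning squad.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds: [Deployment]
      validate:
        message: "Deployments must carry a 'team' label."
        pattern:
          metadata:
            labels:
              team: "?*" # any non-empty value
```

Embedded platform squads then inherit the guardrail automatically, while remaining free to choose how their workloads are built and deployed.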
Embracing Platform Engineering Principles
What sets apart truly great platforms?
- Empathy for developers
- Abstraction of complexity, not power
- Product thinking: roadmaps, feedback loops, onboarding, documentation
- Metrics-driven impact: measure DORA metrics, lead time, deployment frequency, MTTR
And, above all: treating the platform as an enabler, not a gatekeeper.
Your Platform Journey — Recap
Let’s look back on what you’ve built over these 8 parts:
- Foundation: Secure EKS base, GitOps-first
- Core Services: Ingress, observability, secrets, policy
- Dev Experience: Templates, namespaces, CI/CD
- Multi-Tenancy: Isolated teams and environments
- Day-2 Ops: Monitoring, upgrades, cost control
- Advanced: Safe delivery, AI, developer marketplaces
This is more than Kubernetes; it’s an internal platform that delivers software faster, safer, and with less friction.
What’s Next?
From here, you might explore:
- Cross-cloud or hybrid deployments (AKS, GKE, on-prem)
- Developer portals with real-time feedback loops
- Integrating security scanning (SAST, DAST, supply chain)
- Policy-driven self-service infrastructure (Crossplane, Terraform Cloud)
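To make the last idea concrete, with Crossplane a developer requests infrastructure through a claim while the platform team controls what it provisions. This sketch assumes a `PostgreSQLInstance` claim API defined by your platform team; the group, parameters, and names are illustrative:

```yaml
# Illustrative Crossplane claim: a developer asks for a database
# without touching cloud credentials. The API group and fields depend
# on the XRD/Composition your platform team publishes.
apiVersion: database.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: my-app-db
  namespace: team-a
spec:
  parameters:
    storageGB: 20
  compositionSelector:
    matchLabels:
      provider: aws
  writeConnectionSecretToRef:
    name: my-app-db-conn # Secret delivered back to the team's namespace
```

The claim is just another GitOps-managed manifest, so the same review, policy, and audit workflows apply to infrastructure as to application code.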
Or, open source your tooling. Share your journey. Inspire others.
Because platform engineering isn’t just about technology — it’s about making engineering better for everyone.