Building a Cloud-Native Container Platform from Scratch - Part 8
You’ve built a self-service container platform from the ground up, complete with GitOps workflows, security, developer experience tooling, and operational maturity. But platforms are never “done”.
The most effective platform teams don’t just build infrastructure; they build leverage.
In this final part, we’ll explore how to evolve your platform with advanced capabilities that go beyond stability to empower innovation, accelerate delivery, and future-proof your ecosystem.
Full series
- Part 1: Why Build a Self-Service Container Platform
- Part 2: Choosing Your Platform’s Building Blocks
- Part 3: Bootstrapping Your Infrastructure with Terraform
- Part 4: Installing Core Platform Services
- Part 5: Crafting the Developer Experience Layer
- Part 6: Scaling the Platform — Multi-Tenancy, Environments, and Governance
- Part 7: Day-2 Operations and Platform Maturity
- Part 8: The Future of Your Internal Platform (you are here)
Progressive Delivery: Safer Deployments at Scale
Not all deploys should be “all at once”. Progressive delivery introduces smarter release strategies like:
- Canary releases: route 5%, then 25%, then 100% of traffic
- Blue/green deployments: switch over to a new version after validation
- Feature flags: decouple deploy from release entirely
Tools like Flagger, Argo Rollouts, or LaunchDarkly can integrate with your existing GitOps workflows to enable these strategies.
You can build them into your platform templates, giving devs safe, repeatable release workflows.
Example: Implementing Canary Deployments with Argo Rollouts
This YAML manifest defines an Argo Rollouts resource for implementing a canary deployment strategy. The Rollout specifies four replicas of the application and uses a canary strategy to gradually shift traffic to the new version. The steps section outlines the rollout process: initially, 20% of traffic is directed to the new version, followed by a 5-minute pause for monitoring. Then, 50% of traffic is shifted with a 10-minute pause, and finally, 100% of traffic is routed to the new version. This staged approach helps reduce risk by allowing issues to be detected early before the rollout is completed. The selector and template fields define how pods are matched and deployed, similar to a standard Kubernetes Deployment.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2
You can automate monitoring and rollback as part of progressive delivery. Tools like Argo Rollouts and Flagger integrate with observability platforms (Prometheus, Datadog, etc.) to watch key metrics (error rate, latency, SLOs) during each rollout step.
If a metric breaches a threshold, the rollout can be paused or automatically rolled back to the previous stable version—without manual intervention.
Example: Automated Analysis and Rollback with Argo Rollouts
strategy:
  canary:
    analysis:
      startingStep: 2 # delay analysis until step 2 (setWeight: 50)
      templates:
        - templateName: success-rate
      args:
        - name: service-name
          value: my-app
    steps:
      - setWeight: 20
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
Here, the analysis step references a metric template (e.g. success rate from Prometheus). If the analysis fails, Argo Rollouts will halt or revert the deployment automatically. This approach ensures safer, hands-off releases and faster recovery from issues.
Below is a sample of the success-rate AnalysisTemplate that checks the success rate of your application using Prometheus metrics.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      count: 3
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{job="{{args.service-name}}",status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{job="{{args.service-name}}"}[5m]))
This template queries Prometheus for the ratio of successful HTTP requests (status code 2xx) to total requests over a 5-minute window. If the success rate drops below 99%, the rollout will be paused or rolled back automatically.
Example: Blue/Green Deployment with Kubernetes Services
This YAML manifest defines a Kubernetes Deployment for a “blue/green” deployment strategy. In this approach, a new version of the application (labeled as “green”) is deployed alongside the existing version (“blue”). The deployment creates four replicas of the new version, each identified by the labels app: my-app and version: green. By running both versions simultaneously, you can safely test the new release in production and switch traffic over when ready, minimising downtime and risk during updates.
# Deploy the new version alongside the old
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-green
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
      version: green
  template:
    metadata:
      labels:
        app: my-app
        version: green
    spec:
      containers:
        - name: my-app
          image: my-app:v2
---
# Service points to either blue or green deployment
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: green # Switch to 'blue' or 'green' to cut over
  ports:
    - port: 80
      targetPort: 8080
Switch the version label in the Service selector to cut over traffic after validation.
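Before flipping the selector, you can smoke-test the green version through a dedicated preview Service. This is a sketch; the `my-app-preview` name is illustrative and not part of the manifests above:

```yaml
# Hypothetical preview Service: always targets the candidate (green)
# version so it can be validated before the main Service cuts over.
apiVersion: v1
kind: Service
metadata:
  name: my-app-preview # illustrative name
spec:
  selector:
    app: my-app
    version: green # always the candidate version
  ports:
    - port: 80
      targetPort: 8080
```

Run your validation traffic against `my-app-preview`, then change the `version` label on the main `my-app` Service (via a Git commit, if GitOps-managed) to complete the cutover.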
Observability Meets Intelligence
Modern observability isn’t just dashboards and alerts; it’s about insight.
Use AI/ML-powered tools (like Dynatrace Davis AI or New Relic AIOps) to:
- Automatically detect anomalies
- Correlate logs, metrics, and traces
- Surface the root cause of outages faster
- Predict capacity and performance trends
Pair this with SLO-based monitoring to focus on customer impact, not just system health.
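As one possible sketch of SLO-based monitoring, a burn-rate alert can page only when the error budget is being consumed fast enough to threaten the SLO. This example assumes the Prometheus Operator's `PrometheusRule` CRD and the same `http_requests_total` metric used earlier; the thresholds are illustrative:

```yaml
# Illustrative SLO burn-rate alert (Prometheus Operator CRD).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-slo
spec:
  groups:
    - name: my-app-slo
      rules:
        - alert: HighErrorBudgetBurn
          expr: |
            (
              sum(rate(http_requests_total{job="my-app",status=~"5.."}[1h]))
              /
              sum(rate(http_requests_total{job="my-app"}[1h]))
            ) > (14.4 * 0.001) # 14.4x burn rate against a 99.9% SLO
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "my-app is burning its error budget too fast"
```

The multiplier (14.4x over one hour) follows the common multi-window burn-rate pattern; tune windows and factors to your own SLO targets.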
Internal Developer Marketplace
As your platform matures, create a catalog of reusable components:
- App templates (API service, batch job, Kafka consumer)
- Infrastructure (Postgres, Redis, DNS entries, event triggers)
- Pipelines (CI/CD workflows, code scanning)
Tools like Backstage let you expose these as click-to-deploy services via a central portal. This creates consistency, speed, and governance, while still giving teams autonomy.
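A catalog entry in Backstage is just a declarative template. Here is a minimal sketch of a Software Template exposing an "API service" starter; the names and the `./skeleton` path are illustrative:

```yaml
# Minimal Backstage Software Template sketch (names are illustrative).
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: api-service
  title: API Service
  description: Scaffold a new API service with CI/CD pre-wired
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
          description: Unique service name
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton # path to the template skeleton, illustrative
        values:
          name: ${{ parameters.name }}
```

A real template would add steps to publish the repository and register the component in the catalog; the point is that the golden path lives in version-controlled YAML, not tribal knowledge.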
Federated Platform Model
Larger orgs might want to federate platform responsibilities. Instead of one central team:
- Maintain a core platform team for tooling, standards, and infra
- Embed platform “squads” within product domains
- Use a shared governance model (security, policy, GitOps workflows)
This lets the platform evolve organically across business units, balancing flexibility with cohesion.
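Shared governance works best when it is codified as policy rather than documentation. As a sketch, using Kyverno (one of several policy engines) a central team could require every Deployment to declare its owning team; the policy name and label are illustrative:

```yaml
# Illustrative shared-governance rule with Kyverno: every Deployment
# must carry a 'team' label identifying its owning squad.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds: [Deployment]
      validate:
        message: "Deployments must carry a 'team' label."
        pattern:
          metadata:
            labels:
              team: "?*" # any non-empty value
```

Embedded platform squads then inherit the guardrail automatically, while remaining free to choose how their workloads are built and deployed.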
Embracing Platform Engineering Principles
What sets apart truly great platforms?
- Empathy for developers
- Abstraction of complexity, not power
- Product thinking: roadmaps, feedback loops, onboarding, documentation
- Metrics-driven impact: measure DORA metrics, lead time, deployment frequency, MTTR
And, above all: treating the platform as an enabler, not a gatekeeper.
Your Platform Journey — Recap
Let’s look back on what you’ve built over these 8 parts:
- Foundation: Secure EKS base, GitOps-first
- Core Services: Ingress, observability, secrets, policy
- Dev Experience: Templates, namespaces, CI/CD
- Multi-Tenancy: Isolated teams and environments
- Day-2 Ops: Monitoring, upgrades, cost control
- Advanced: Safe delivery, AI, developer marketplaces
This is more than Kubernetes; it’s an internal platform that delivers software faster, safer, and with less friction.
What’s Next?
From here, you might explore:
- Cross-cloud or hybrid deployments (AKS, GKE, on-prem)
- Developer portals with real-time feedback loops
- Integrating security scanning (SAST, DAST, supply chain)
- Policy-driven self-service infrastructure (Crossplane, Terraform Cloud)
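To make the last idea concrete, with Crossplane a developer requests infrastructure through a claim while the platform team controls what it provisions. This sketch assumes a `PostgreSQLInstance` claim API defined by your platform team; the group, parameters, and names are illustrative:

```yaml
# Illustrative Crossplane claim: a developer asks for a database
# without touching cloud credentials. The API group and fields depend
# on the XRD/Composition your platform team publishes.
apiVersion: database.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: my-app-db
  namespace: team-a
spec:
  parameters:
    storageGB: 20
  compositionSelector:
    matchLabels:
      provider: aws
  writeConnectionSecretToRef:
    name: my-app-db-conn # Secret delivered back to the team's namespace
```

The claim is just another GitOps-managed manifest, so the same review, policy, and audit workflows apply to infrastructure as to application code.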
Or, open source your tooling. Share your journey. Inspire others.
Because platform engineering isn’t just about technology — it’s about making engineering better for everyone.