Technical Insights from Platform Teams

How teams work at the operational level is the foundation of their daily routine.

Our previous post explored five key principles that contribute to a successful platform team. We discussed viewing the platform as its own product, fostering a quality-oriented mindset and ownership. Additionally, we emphasized prioritizing developer experience by reducing complexity during service creation and fostering collaboration and planning through shared roadmaps, synced backlogs, and clear dependency identification. Finally, we highlighted the importance of providing “structural capital” like documentation, guides, and training to empower users and maximize the platform’s benefits.

Building on these essential principles, this follow-up post will delve into the technical components commonly used in platform development.

Foundation

Before looking into the specific components, we would like to highlight two things that always matter.

  • A deep technical understanding of the platform components, including their use within the organization and how different operational processes depend on them.
  • Ensuring the team’s responsibilities are clear and function well for both routine work and incidents. The stakeholder image below provides an example of the division of responsibility between teams. While this can be adapted based on operations, environment, and needs, clearly defined roles and interfaces give teams more freedom to maneuver within their area of responsibility.

Stakeholders

Developers; Ops / Infra; CI / CD; Networking / connectivity; Security
  • Containerization
  • Configuration
  • Deployment
  • Secret management
  • Alerting
  • Metrics
  • Log aggregation
  • Health checks
  • Self-service Deployments
  • Service ingress/egress
  • Dev environment
  • Component life-cycles
  • Capacity planning
  • Provisioning
  • Infra alarms/metrics
  • Central services
  • Log shipping
  • Availability
  • VM patching
  • Containerization
  • Deployments
  • Deployment templates
  • Rollback
  • Container image registries
  • Image scanning
  • Approval/promotion process/RBAC
  • Automated in-platform testing
  • Ingress controllers
  • Load balancers
  • Egress
  • DNS
  • Route tables
  • Firewalls
  • Service mesh
  • Certificates
  • Sandboxing
  • Pod security policies
  • Network security policies
  • RBAC
  • Secret management
  • Availability
  • VM patching
  • Image lifecycles
  • Image scanning
  • Image signing
  • Admission controllers

Example of the division of responsibility between teams

Components & Considerations

Here are insights based on our experience with common platform components. Each component could be discussed at far greater length, but we take a simpler approach here.

Kubernetes (OpenShift, AKS, GKE, EKS)

  • Consider Blue/Green clusters for critical upgrades.
  • Keep applications stateless if possible. If that is not possible, establish clear processes for backup/restore and infrastructure migration.
  • Be good cloud-native citizens: follow container best practices, implement health checks, use rolling updates, and test applications against rolling cluster upgrades.
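
As a minimal sketch of the health checks mentioned above, the following shows one way an application could separate liveness from readiness so that rolling updates can drain traffic before a pod stops. The paths `/healthz` and `/readyz`, the `AppState` class, and port 8080 are illustrative assumptions, not something Kubernetes prescribes.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer


@dataclass
class AppState:
    """Hypothetical application state consulted by the probes."""
    ready: bool = False          # set to True once startup work is done
    shutting_down: bool = False  # set to True when a shutdown signal arrives


def probe_response(path: str, state: AppState) -> tuple[int, str]:
    """Map a probe path to an HTTP status code and body."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer at all.
        return 200, "alive"
    if path == "/readyz":
        # Readiness: accept traffic only when started and not draining.
        if state.ready and not state.shutting_down:
            return 200, "ready"
        return 503, "not ready"
    return 404, "unknown path"


def make_handler(state: AppState) -> type[BaseHTTPRequestHandler]:
    """Build a request handler bound to the given application state."""
    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            status, body = probe_response(self.path, state)
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body.encode("utf-8"))

    return ProbeHandler


def serve(state: AppState, port: int = 8080) -> None:
    """Serve the probe endpoints until the process is stopped."""
    HTTPServer(("0.0.0.0", port), make_handler(state)).serve_forever()
```

Keeping the decision logic in `probe_response` makes it easy to unit test without starting a server.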

ArgoCD

  • ArgoCD GitOps works well for mature deployments and in “sunny day” scenarios, but you should know when you can and should turn off the sync functionality, e.g. in a lab phase for a new application, or when troubleshooting in production where there are complex flows around operators and CRDs.
  • Use Argo notifications to provide feedback on deployment status.
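
Turning automated sync on and off can itself be handled as a reviewed change to the Application manifest. The sketch below assumes the manifest has already been loaded into a Python dictionary; the helper name `set_auto_sync` and the default `prune`/`selfHeal` values are our own assumptions for illustration.

```python
import copy


def set_auto_sync(application: dict, enabled: bool) -> dict:
    """Return a copy of an Argo CD Application manifest with automated
    sync switched on or off. The input manifest is left untouched."""
    app = copy.deepcopy(application)
    sync_policy = app.setdefault("spec", {}).setdefault("syncPolicy", {})
    if enabled:
        # Assumed defaults for this sketch; tune prune/selfHeal per application.
        sync_policy.setdefault("automated", {"prune": False, "selfHeal": True})
    else:
        # Without the `automated` block, the application is synced manually.
        sync_policy.pop("automated", None)
    return app
```

For example, a team in a lab phase could call `set_auto_sync(manifest, False)`, commit the result, and re-enable sync once the application has stabilized.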

Tekton

  • Use Pipelines-as-Code for easier management.
  • Establish a version-controlled library of common tasks.
  • Sandwich testing: pipelines test the code or test repository, and the expected output from previously run code and tests can in turn be used to regression test the pipeline library.
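
One way to read the sandwich testing idea is as a comparison against a stored reference result: run the pipeline library against a known reference repository and compare the outcome with the summary recorded from an earlier, trusted run. The sketch below assumes run results are available as simple step-to-status dictionaries; the function name and the step names in the usage note are hypothetical.

```python
def diff_run_summary(expected: dict, actual: dict) -> list[str]:
    """List the steps whose result differs from the stored reference run.

    An empty list means the pipeline library still behaves as before
    on the reference repository.
    """
    differences = []
    for step in sorted(set(expected) | set(actual)):
        before = expected.get(step, "<missing>")
        after = actual.get(step, "<missing>")
        if before != after:
            differences.append(f"{step}: expected {before}, got {after}")
    return differences
```

If the reference repository has not changed, any reported difference such as `lint: expected passed, got failed` points at a change in the task library rather than in the code under test.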

API Management

  • Provide clear access logs that include a correlation ID.
  • Clearly divide responsibilities between users and the platform team.
  • Consider using GitOps self-service configuration in cases where the portal does not meet the requirements/organizational process.
  • Train developers in API best practices and conventions.
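
To illustrate access logs with a correlation ID, here is a small sketch that reuses an incoming ID when the caller supplies one and generates one otherwise, then writes it into a structured log line. The header name `X-Correlation-ID` and the log field names are assumptions; use whatever convention your organization has agreed on.

```python
import json
import uuid

# Assumed header name for this sketch.
CORRELATION_HEADER = "X-Correlation-ID"


def ensure_correlation_id(headers: dict[str, str]) -> str:
    """Reuse the caller's correlation ID, or create a new one."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == CORRELATION_HEADER.lower() and value.strip():
            return value.strip()
    return str(uuid.uuid4())


def access_log_line(method: str, path: str, status: int,
                    duration_ms: float, correlation_id: str) -> str:
    """Render one access log entry as a JSON line."""
    entry = {
        "method": method,
        "path": path,
        "status": status,
        "duration_ms": duration_ms,
        "correlation_id": correlation_id,
    }
    return json.dumps(entry, sort_keys=True)
```

Passing the same ID on to downstream calls lets a single request be followed across the gateway and the services behind it.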

Message Broker (Kafka / RabbitMQ / AMQ)

  • Offer GitOps self-service of users and queues for good automation, traceability, and short lead times.
  • Monitor capacity carefully, including alarms for expected publishing/consumption volumes.
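
An alarm on expected volumes should fire both when traffic is unexpectedly high and when it is unexpectedly low, since a silent producer is often as serious as a flood. The sketch below assumes that message rates per topic or queue are already collected elsewhere; the class, function, and topic names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExpectedVolume:
    """Expected message rate band for one topic or queue."""
    min_per_minute: float
    max_per_minute: float


def volume_alarm(topic: str, observed_per_minute: float,
                 expected: ExpectedVolume) -> Optional[str]:
    """Return an alarm message if the rate is outside the band, else None."""
    if observed_per_minute < expected.min_per_minute:
        return (f"{topic}: {observed_per_minute:g} msg/min is below the "
                f"expected minimum of {expected.min_per_minute:g}")
    if observed_per_minute > expected.max_per_minute:
        return (f"{topic}: {observed_per_minute:g} msg/min is above the "
                f"expected maximum of {expected.max_per_minute:g}")
    return None
```

The expected bands can live in the same Git repository as the self-service definitions of users and queues, so they are reviewed together.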

Keycloak

  • Have a clear strategy for which authentication methods to use and how Clients are set up, as the application provides a lot of configuration options.
  • Update frequently, as it is a security component, especially if it is public-facing.
  • Implement a solid process for how configuration changes propagate between test and production environments, as a lot tends to be done manually via the admin GUI.
  • Monitor metrics to find trends/deviations in e.g. authentications.
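
Manual changes in the admin GUI tend to show up as drift between environments. As a rough sketch, assuming the configuration of each environment has been exported to JSON and loaded as dictionaries, the following compares two exports and ignores keys that are expected to differ. The function names and the keys in the usage note are illustrative assumptions.

```python
def flatten(config: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into dotted paths, e.g. 'a.b.c'."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_drift(test_cfg: dict, prod_cfg: dict,
                 ignore: frozenset = frozenset()) -> dict:
    """Return {path: (test_value, prod_value)} for every differing setting.

    Paths listed in `ignore` (such as environment-specific URLs) are skipped.
    """
    test_flat, prod_flat = flatten(test_cfg), flatten(prod_cfg)
    return {
        path: (test_flat.get(path), prod_flat.get(path))
        for path in sorted(set(test_flat) | set(prod_flat))
        if path not in ignore and test_flat.get(path) != prod_flat.get(path)
    }
```

For instance, comparing a client exported from test and production while ignoring `redirectUris` would surface a `publicClient` flag that was changed by hand in only one environment.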

Prometheus/Grafana

  • Focus on basic metrics to avoid “information overload”: Rate-Errors-Duration (RED) for services and Utilization-Saturation-Errors (USE) for finite resources.
  • Offer dashboard templates for common application types within the organization (Spring Boot, etc.).
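
To make the RED idea concrete, the sketch below computes rate, error ratio, and a 95th percentile duration from a batch of request records. In practice these values would come from Prometheus queries over instrumented services; the `Request` record, the choice of HTTP 5xx as "error", and the nearest-rank percentile are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record type)."""
    duration_s: float
    status: int


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration for requests seen in one window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_duration_s": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank 95th percentile, using integer arithmetic for the rank.
    rank = (95 * total + 99) // 100
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total,
        "p95_duration_s": durations[rank - 1],
    }
```

Three numbers per service are usually enough for a first dashboard row, and the same template can be reused for every service of the same type.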

Splunk

  • Set up alarms for maximum expected log volume per service/component. It is often a bad sign if something starts logging far more than expected, and it can quickly fill disks/quotas.
  • Create ready-made reports and dashboards to quickly see how the systems and flows are performing.

This list could go on for quite some time, but let’s wrap it up for now. To put all this in context, here is a short summary of three platform establishments that the team at Irori has carried out in the field of platform engineering.

  • Establishment of Kubernetes for a new generation of event-driven middleware at ICA Sverige. The team first performed an application layer migration from Oracle SOA Suite to Strimzi Kafka on Kubernetes in an Azure environment. The same team is responsible for application migration and platform deployment of AKS on Azure with self-service support for a dozen development teams in the customer’s organization. The initiative’s initial goals of increasing automation, lowering cost of ownership and accelerating business deliveries were achieved within a year. Buzzwords: ICC, Azure, AKS, Kafka, Kafka Connect, WSO2, Camel, Quarkus
  • Establishment of Kubernetes as a new platform for online sports and casino games. A pilot application based on a game form was developed to verify the underlying infrastructure. The initiative was then scaled up gradually to accommodate increased load and more gaming services. Initially, the solution was built on bare metal in a data center, with further scaling to more sites and construction of self-service, disaster recovery, observability, capacity planning, enhanced security, etc. Irori delivers a platform team for OpenShift deployment.
  • Establishment of an event-driven integration platform on OpenShift within a larger government agency. The team is responsible for the design and implementation of middleware as a service built on OpenShift. The task includes offering self-service to many development teams and ensuring effective management over time. Buzzwords: Private Cloud, Kafka, GitOps, OpenShift, Spring Boot, Helm

Irori has experience working according to the above principles in organizations within government, retail, banking/finance and manufacturing. The work is focused on business-critical platforms where delivery processes, stability, security, reliability and scalability are important requirements.

Author:
Saulos
Platform Engineer