Principles for running a stable development platform

Wondering how to structure a team that both manages your complex, ever-evolving platform and enables your development teams?

Modern platforms are essential for organizations to deliver services quickly and drive innovation. Building and managing these platforms requires specialized teams with the right skills and approach. In this blog post, we present five proven principles that form the foundation for successful platform teams.

We view the platform as a product, even though it serves internal users within the organization. In this way, we establish a quality mindset and ownership that guides discussions about functionality in a product context. We see the platform team as part of the Developer Experience concept: the team should make it easier for developers to do their jobs effectively by reducing complexity, thereby simplifying the creation of valuable services.



Important aspects include being able to present roadmaps, synchronize backlogs, plan ahead with other teams, and identify dependencies. The team also establishes and provides efficient structural capital for the product, which includes documentation, guides and training so that the intended benefits are spread throughout the organization.
Specifically, we highlight below five proven principles that benefit a modern platform team.

  • Infrastructure-as-code: for reproducibility, version control, disaster recovery, an audit trail of who did what and when, and reusability. It makes it easy to keep environments (dev, stage, prod) homogeneous.
  • GitOps: related to the above, but also spanning configuration, and serving as a self-service mechanism for technical stakeholders. Automation is driven by configuration that is checked into a version control system. In many cases, a flow can also be built where stakeholders create pull requests to request changes, and the platform team reviews and approves them.
  • Proactive monitoring/alerting: proactively monitor the services that are delivered, i.e. establish alert rules that catch problems before users report them. Continuously iterate on alert thresholds and monitoring points to keep false positives and false negatives at acceptable levels. In particular, load/capacity monitoring can be built to detect problems before they have an impact.
  • Comprehensive monitoring data: log and collect metrics for all external dependencies and calls to and from the platform. If a problem is reported, it should be easy to determine whether it is within the platform or in one of its dependencies. It should then be possible to isolate, reproduce and fix problems within the platform quickly. If the problem is outside the platform, the right focus and resources can be allocated quickly.
  • Sandbox environment: the team should have a sandbox/lab environment in a known and well-defined state that can be used to test new versions of components or to reproduce errors. The clusters that host users' test environments are in reality a “production environment” for the platform team, so they are not a substitute for a sandbox.
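
To make the load/capacity part of the proactive monitoring principle more concrete, here is a minimal sketch in Python. It assumes usage samples (for example, disk usage as a fraction of capacity) are already being collected. It fits a straight line through them and estimates how many hours remain until the resource is full. The names `Sample`, `hours_until_full` and `should_alert`, as well as the 48-hour warning window, are hypothetical and chosen only for illustration; a real platform would typically express the same idea as a rule in its monitoring system.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Sample:
    hour: float           # hours since the first measurement
    used_fraction: float  # 0.0 = empty, 1.0 = full


def hours_until_full(samples: Sequence[Sample]) -> Optional[float]:
    """Fit a least-squares line through the samples and extrapolate to 100 %.

    Returns None when there is too little data or usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(s.hour for s in samples) / n
    mean_u = sum(s.used_fraction for s in samples) / n
    var_t = sum((s.hour - mean_t) ** 2 for s in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no trend to fit
    cov = sum((s.hour - mean_t) * (s.used_fraction - mean_u) for s in samples)
    slope = cov / var_t
    if slope <= 0:
        return None  # usage is not growing, so no exhaustion is predicted
    intercept = mean_u - slope * mean_t
    hour_full = (1.0 - intercept) / slope
    latest = max(s.hour for s in samples)
    return max(0.0, hour_full - latest)


def should_alert(samples: Sequence[Sample], warn_within_hours: float = 48.0) -> bool:
    """Alert when the resource is predicted to fill up within the warning window."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= warn_within_hours
```

Alerting on the predicted time to exhaustion rather than on a fixed usage threshold is one way to catch a problem before it has an impact. The warning window is then one of the levels to iterate on as false positives and false negatives are observed.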

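The comprehensive monitoring data principle can be sketched in a similar way. The Python example below wraps each call to an external dependency so that its duration and outcome are recorded per dependency, which makes it possible to see whether failures cluster inside the platform or in something it depends on. The in-memory `CALLS` store and the names `track_dependency` and `error_rate` are assumptions made for this illustration; in practice the observations would be sent to whatever metrics backend the platform already uses.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: dependency name -> list of (seconds, succeeded).
# A real platform would ship these observations to its metrics backend instead.
CALLS = defaultdict(list)


@contextmanager
def track_dependency(name):
    """Record the duration and outcome of one call to an external dependency."""
    start = time.monotonic()
    succeeded = False
    try:
        yield
        succeeded = True
    finally:
        # Runs on success and on failure; any exception still propagates.
        CALLS[name].append((time.monotonic() - start, succeeded))


def error_rate(name):
    """Fraction of recorded calls to `name` that failed (0.0 if none recorded)."""
    calls = CALLS.get(name, [])
    if not calls:
        return 0.0
    failures = sum(1 for _, ok in calls if not ok)
    return failures / len(calls)
```

A call site would wrap the outgoing request in `with track_dependency("image-registry"):` (the dependency name here is made up). Comparing `error_rate` across dependencies when a problem is reported then gives a first indication of where to focus.
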
In the next blog post, we will go deeper and highlight components that are often encountered when working with platforms, and how we like to use them.

Irori has experience working according to the above principles in organizations within government, retail, banking/finance and manufacturing. The work is focused on business-critical platforms where delivery processes, stability, security, reliability and scalability are important requirements.

Authors: 


Jonathan Kyrklund
Platform Engineer

 

Björn Löfroth
Platform Engineer