This post is the last in our series about our Istio Service Mesh journey. In the first two parts we talked a little about a few reasons why we wanted to try it out and how we did the basic install. This part is about how it turned out and what we think about it.
If you have read the first two parts you know that we have only scratched the surface of what Istio can do. This limited functionality is primarily what we have evaluated, but we will also try to reason a bit about some other interesting topics.
Using Istio for exposing our APIs
Previously, we have always exposed all our APIs under a single hostname (per environment) and used path-based separation between the APIs. The default way of exposing services in OpenShift is one host per service or API, though this is of course configurable. The same goes for Istio: we can do it both ways, but we found it easier to combine everything under one hostname. It would also be easy to do a combination, perhaps using one API hostname per business domain while exposing each API on its own path on that hostname.
Regardless of the preferred method, this part is pretty easy, especially when using only one hostname as we did. This means the only resource needed to expose an OpenShift Service is the VirtualService object. We kept this simple in the evaluation and were very happy with that. But we could make things a bit more interesting with traffic shifting when deploying new versions, for example. More on that later on.
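To illustrate, a VirtualService routing two APIs by path under a single hostname could look something like the sketch below. All hostnames, service names, and ports are made up for the example, and we assume a shared Gateway already exists:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-routes
spec:
  hosts:
    - api.example.com                 # illustrative shared API hostname
  gateways:
    - istio-system/main-gateway       # assumes an existing Gateway resource
  http:
    - match:
        - uri:
            prefix: /orders
      route:
        - destination:
            host: orders.team-a.svc.cluster.local
            port:
              number: 8080
    - match:
        - uri:
            prefix: /customers
      route:
        - destination:
            host: customers.team-b.svc.cluster.local
            port:
              number: 8080
```

With this pattern, exposing a new API is just another match/route pair in the list.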
Managing outbound traffic
In our old solution, it always took a little work to add new things to the whitelist to allow external traffic. A few times we also ran into problems with the proxy solution. It happened more than once that a framework used in the application relied on an HTTP client library with poor proxy support, or with no proxy support at all. Since we started using Istio, no such problems have arisen. It is more or less effortless to just add a ServiceEntry if it doesn’t already exist.
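As a sketch, allowing the mesh to reach a hypothetical external HTTPS endpoint is a single ServiceEntry (the host below is illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-payments-api
spec:
  hosts:
    - payments.example.net   # hypothetical mesh-external service
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
    - number: 443
      name: https
      protocol: TLS
```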
It is not entirely foolproof though. Unless you use an egress gateway and only allow that to communicate with the outside world, there are workarounds. A namespace not enrolled in the Mesh could potentially be wide open. An Istio-enabled namespace is by default only allowed to communicate via the Mesh, but to simplify communication with mesh-external workloads within the same cluster (if such exist), it is possible to allow egress traffic from any namespace using NetworkPolicy objects. In that case, a workload that has simply disabled the Istio sidecar could be at risk.
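One way to tighten this up is a NetworkPolicy that only permits egress towards the mesh infrastructure. This is a sketch, assuming the automatic `kubernetes.io/metadata.name` namespace label is available; namespaces and selectors are illustrative and would need adapting to the cluster at hand:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress-to-mesh
  namespace: team-a                # illustrative application namespace
spec:
  podSelector: {}                  # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
    - ports:                       # DNS must still be allowed
        - protocol: UDP
          port: 53
```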
This functionality has served us well, and we do not think it would have been as easy without Istio. As long as you know what you are doing and are restrictive with shortcuts and overly generous firewall rules, it is very useful.
Observability in the Service Mesh
The first thing we noticed regarding observability was the Grafana dashboards. Even though we were used to Grafana from before, we had mostly looked at metrics for single components when it came to networking. We used different frameworks with different metric types and names, so each application usually had its own dashboard with nice-to-have metrics. We could see external traffic on the API endpoint, but we had no overview.
In Istio we get a Mesh view with great insights on the mesh traffic in total, as well as workload/service oriented views. This has been very useful.
The second thing that hit us was tracing. Credit here goes not only to Istio but also to Jaeger. We knew from before that we had some timeout problems with the first application that was onboarded. It was a REST-based service that in turn had integrations with a database and another REST-based service, external to our environment. In the old environment, when looking at the logs, we always concluded that the timeouts were caused by the external service. It was not a big problem and we had not prioritized troubleshooting it. Within minutes of deploying the application with Istio enabled, we could clearly see an interesting pattern in Jaeger. The upstream service was indeed slow at times, but not nearly as often as the application timed out.
Tracing in Istio is rather blunt without instrumenting the applications. You can see the big picture, but not the details. This is caused by two things. First, nothing that happens within the workload itself is visible. Second, and perhaps most important, unless the applications are instrumented, or at least propagate the trace headers of each request, there is no way for Istio to connect inbound and outbound spans from the workload. So if a call to an application results in an upstream call to something else, Istio cannot know that they are related, even though all calls go via the istio-proxy.
Since we are running Java applications, we decided to give OpenTelemetry auto-instrumentation a try. We installed the OpenTelemetry Operator (https://github.com/open-telemetry/opentelemetry-operator) in our cluster, and after configuring the collector to point to our Istio Jaeger service, we only needed to annotate our application pod to enable the OpenTelemetry Java Agent. No change was needed in the application at all. This lets the application append to the traces already created by Istio, and the Jaeger UI showed us all the details we needed to pinpoint the problem.
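For reference, the setup boils down to an Instrumentation resource and a pod annotation, roughly like the sketch below. The collector endpoint and names are illustrative; check the Operator documentation for your versions:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector.observability.svc:4317   # illustrative collector address
  propagators:
    - tracecontext
    - b3        # Istio/Envoy propagates B3 headers by default
```

And in the Deployment’s pod template:

```yaml
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"
```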
The final tool is Kiali. Although it can be used to view (and even edit) the Mesh configuration, we primarily used it for the graphical Mesh representation. It is very informative: we get a live map of the components that communicate with each other within a chosen timespan. Each communication link can provide details on success rate, requests per second, and so on, and it changes color if there is a problem. It is very useful for understanding the communication pattern, but we also found it an easy way to get an overview of which versions of each application were running. Given that the pods are labeled correctly, this is clearly visible.
Conclusions on Service Mesh
As you may have noticed, most of what we have said about Istio is positive. Are there downsides? Of course. First of all, there is a lot of magic happening. Sometimes it feels safer to choose the old boring way of doing things that you know by heart. If something were to cause a major problem with the Mesh, do you know how to fix it, or even where to start? This brings us to the other downside: there is a lot to learn if you want to understand all these things. We also think you should get acquainted with the technology for some time before going for full-scale production. An evaluation like ours is highly recommended to learn how to work with it.
On the plus side: once you have grasped the basics and the functionality you are after, it is very simple and easy to work with. The documentation is usually very good, even though some examples occasionally feel tied to an older release. This is especially true for the Maistra documentation for OpenShift Service Mesh.
We will absolutely continue to work with Istio. For simpler use cases it might be a bit of overkill, but the observability in itself brings a lot to the table.
What about the rest of the feature set? We have started investigating some of these features, and here are our thoughts.
Traffic shifting and deployment strategies
We have usually deployed new versions in a rolling fashion. In the OpenShift environment this basically means a new image version being rolled out through a Deployment. If something goes wrong, the deployment pauses and we need to fix it. At some point, we would like to try blue/green deployments where we get the new version up and running and gradually transfer traffic to it. Istio gives great flexibility here, which could also be used to achieve A/B testing. Since we are running GitOps with ArgoCD, we may include Argo Rollouts in these scenarios going forward.
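As a sketch of what traffic shifting looks like in Istio, a weighted VirtualService combined with subsets in a DestinationRule could send a small share of traffic to a new version. Names, labels, and weights below are illustrative:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders
spec:
  host: orders
  subsets:
    - name: v1
      labels:
        version: v1      # assumes pods are labeled with a version label
    - name: v2
      labels:
        version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
    - orders
  http:
    - route:
        - destination:
            host: orders
            subset: v1
          weight: 90
        - destination:
            host: orders
            subset: v2
          weight: 10     # increase gradually as confidence grows
```

A tool like Argo Rollouts can automate stepping these weights over time.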
Circuit breaking and timeouts
Depending on the frameworks used, this could be simple or hard to implement in the applications. Even if it is simple, it would still need to be implemented in multiple places. We like the idea of offloading network-related concerns to the Service Mesh instead of solving them over and over in the applications. Istio provides this capability and we are thinking about trying it out.
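A sketch of what this could look like: a DestinationRule adds circuit breaking via connection pool limits and outlier detection, while a VirtualService sets a request timeout. The values below are illustrative, not recommendations:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders-circuit-breaker
spec:
  host: orders
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # limit queued requests per host
    outlierDetection:
      consecutive5xxErrors: 5          # eject a host after 5 consecutive 5xx
      interval: 30s
      baseEjectionTime: 60s
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-timeout
spec:
  hosts:
    - orders
  http:
    - timeout: 3s                      # fail fast instead of hanging
      route:
        - destination:
            host: orders
```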
Security and mTLS
We have traditionally used OpenID authentication validated in each backend application, and TLS termination on the edge. Lately, Let’s Encrypt and others have made the certificate space much easier for public endpoints, but it is still not easy internally. Istio makes this easy and implements much of it by default. We have not yet tried it, but it should be easy to enable mTLS between all workloads, which is an interesting thought. Also, instead of implementing all authorization within the applications, much of it seems possible to do in Istio. We are not sure this is the way to go, or that it makes anything easier, but we want to look into it. There are a lot of interesting features around validating and handling JWTs.
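For the mTLS part, the sketch is surprisingly small: a single PeerAuthentication in the root namespace enforces mTLS mesh-wide, and a RequestAuthentication describes how JWTs should be validated. The issuer, JWKS URI, and namespaces below are hypothetical:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system    # mesh-wide when placed in the root namespace
spec:
  mtls:
    mode: STRICT
---
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: jwt-check
  namespace: team-a          # illustrative application namespace
spec:
  jwtRules:
    - issuer: https://idp.example.com   # hypothetical identity provider
      jwksUri: https://idp.example.com/.well-known/jwks.json
```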
This list could go on for quite some time, but let’s wrap it up for now. If you haven’t already, we recommend taking a look at https://istio.io/.