At iSolutions, collecting software metrics is crucial for delivering adequate customer support and meeting our Service Level Agreements. We gather dozens of metrics from various systems and use Grafana to query and visualize them. Additionally, we maintain an in-house alerting system that uses some of these metrics as the basis for its alerts.
Historically, we relied on Windows Performance Counters to collect application-level and business metrics. While this tool initially served our system and database metrics needs, better choices were available for our expanding requirements. Over time, it became evident that we needed a more efficient and developer-friendly solution to handle our growing metrics infrastructure.
Why We Decided to Migrate
Our migration decision stemmed mainly from the fact that Windows Performance Counters are not flexible enough for business metrics. They are also not intended for high-frequency writing.
As a developer, I would add another reason: the cumbersome, error-prone, and time-consuming process of adding new metrics using Performance Counters. It involved many manual steps and was a real pain. Developers, like all humans, thrive in environments where tools and systems are friendly and efficient.
Enter OpenTelemetry (OT from now on), which seemed a natural choice for us.
The Migration
Consequently, our team was tasked with writing an OpenTelemetry collector and migrating the performance counters in our code to the metric types provided by .NET.
We faced several challenges during the migration. We needed to keep our existing Grafana pages and Alerts. This meant maintaining compatibility with the current implementation, preserving metric names, and retaining the semantics established by the Windows Performance Counters. On the storage side, metrics are written to a SQL database, a setup that, while not ideal by some standards, evolved naturally and functions quite effectively for us.
The work required considerable planning and code reading. The migration took a few weeks and was executed smoothly without any service interruptions. Over the following months, we continued refining the system, primarily adding new features and automation.
The Aftermath
Before the migration, adding a new metric involved seven steps, two pull requests, and releasing the application AND the monitoring services. This arduous process is now a thing of the past. Developers can declare metrics directly in their code, making them available after release, with everything else managed automatically.
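To give a feel for what "declaring a metric in code" can look like, here is a minimal sketch. It assumes the .NET types in question are those in System.Diagnostics.Metrics (available since .NET 6); the class, meter, instrument, and tag names are all hypothetical and not taken from our codebase.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical example: every name below is illustrative only.
public static class CheckoutMetrics
{
    // One Meter per component; "Example.Checkout" is an assumed name.
    private static readonly Meter Meter = new("Example.Checkout", "1.0.0");

    // A monotonically increasing count of completed orders.
    private static readonly Counter<long> OrdersCompleted =
        Meter.CreateCounter<long>(
            "orders_completed",
            unit: "{order}",
            description: "Number of completed orders.");

    // A distribution of checkout durations in milliseconds.
    private static readonly Histogram<double> CheckoutDuration =
        Meter.CreateHistogram<double>(
            "checkout_duration",
            unit: "ms",
            description: "Time taken to complete a checkout.");

    // Called from application code whenever an order completes.
    public static void RecordOrder(string paymentMethod, double elapsedMs)
    {
        // The tag lets dashboards break the metric down by payment method.
        var tag = new KeyValuePair<string, object?>("payment_method", paymentMethod);
        OrdersCompleted.Add(1, tag);
        CheckoutDuration.Record(elapsedMs, tag);
    }
}
```

Application code would then call something like CheckoutMetrics.RecordOrder("card", 182.4). In a setup like this, no separate registration step is needed on the application side: whatever listener or exporter is subscribed to the meter receives the measurements once the release is deployed.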
The best part for me is that other teams have adopted OpenTelemetry in their applications. Today, all our systems rely on the OT standard. We also export some metrics to external providers like Dynatrace, which supports OpenTelemetry.
Conclusion
More than a year has passed since our transition to OpenTelemetry, and it has become an integral part of our platform. This integration has proven to be an excellent choice, resulting in more metrics, better system integration, and a significantly improved developer experience.
Following the success of our metrics migration, we also transitioned our logging to the OT standard, yielding similar positive results. This migration has brought us ‘economies of scale,’ an often undervalued benefit in our industry. We implemented the change once and used it universally, which made metric and log collection a ‘non-issue’ and allowed our developers to focus on other critical areas.
This migration’s benefits underscore the importance of investing in efficient and developer-friendly tools.