At some point I had to deal with metrics for our API. As usual, there had been no time for them and they were postponed "for later"; adding them looked difficult, so they were never implemented, which meant the time had come to implement them. After some searching on the web, Prometheus seemed to me to be the most popular monitoring system.
With Prometheus we can track various machine resources, such as memory, CPU, disk, and network load. It can also be important to count calls to our API methods or to measure their execution time, because the greater the load on the system, the more expensive its downtime. This is where Prometheus helps. This article covers what seem to me the main points for understanding how Prometheus works and for adding metrics collection to an API. We start with the basics: a short description.
Prometheus is an open-source monitoring system and time series DBMS written in Go and originally developed at SoundCloud. It has official documentation and official client libraries for Go, Java or Scala, Python, and Ruby. There is unofficial support for other languages, such as C#, C++, C, Bash, Lua for Nginx, Lua for Tarantool, and others; the full list is on the official Prometheus website.
All Prometheus services are available as Docker images on the Docker Hub or Quay.io.
Prometheus is launched with the command docker run -p 9090:9090 prom/prometheus, which starts it with the default configuration and publishes it on port 9090 of the local machine. After that, the Prometheus UI is available at localhost:9090.
Prometheus is a monitoring system that includes various tools for setting up monitoring of applications (endpoints) over HTTP. At the time of writing, the Prometheus HTTP API did not support basic auth. If you need basic authentication when connecting to Prometheus, the recommended approach is to run Prometheus behind a reverse proxy and handle authentication at the proxy level. Any reverse proxy can be used with Prometheus.
The main components of Prometheus:
- a server that collects metrics, saves them to the database and cleans up;
- client libraries (packages) for collecting metrics inside the API;
- Pushgateway - a component for receiving metrics from applications for which a Pull request cannot be used;
- Exporters - tools for exporting metrics from third-party applications and services, installed on target machines;
- Alertmanager - a manager of notifications (alerts); alerts are defined in configuration files as a set of rules over metrics. When a rule's condition is met during operation, an alert fires and is sent to the specified recipients via email, Slack, or other channels.
The objects that Prometheus works with are called metrics; they are received from targets directly, through Pushgateway, or through Exporters.
When collecting metrics, several methods of their transmission are used:
- Pull requests. Prometheus requests metrics from the target itself; the settings for this are specified in the scrape_configs section of the configuration file for each job. You can control how often data is collected and create several collection configurations to use different frequencies for different targets;
- Exporters. They let you collect metrics from various systems, for example databases (MongoDB, SQL, etc.), message brokers (RabbitMQ, EMQ, NSQ, etc.), HTTP load balancers, etc.;
- Pushgateway. It can be used when an application cannot expose metrics to Prometheus directly, or for batch jobs that do not live long enough to be reached by a Prometheus pull request.
Thus, all received metrics will be saved by Prometheus in the database with time stamps.
Configuration
Prometheus is configured through command-line flags and configuration files in YAML format. Command-line flags set immutable parameters, such as paths and the amount of data kept on disk and in memory. The configuration file covers everything related to jobs and to loading rule files (also YAML). Everything is written in one global configuration file, which lets you define general settings for all jobs and override settings for individual configuration sections. The targets that Prometheus polls are configured in the scrape_configs section of the configuration file.
Prometheus can reload its configuration files at runtime; if the new configuration is invalid, it is not applied. A reload is triggered by sending the SIGHUP signal to the Prometheus process or by sending an HTTP POST request to /-/reload, provided that the --web.enable-lifecycle flag is set. This also reloads all configured rule files.
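As an illustration of the HTTP variant of the reload, here is a minimal C# sketch. The class name PrometheusReload and the base address http://localhost:9090 are assumptions made for the example; the only parts taken from Prometheus itself are the POST method and the /-/reload path, which work only when the server was started with --web.enable-lifecycle.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, not part of any Prometheus client library.
public static class PrometheusReload
{
    // Builds the POST request for the reload endpoint.
    // baseUrl is an assumption, for example "http://localhost:9090".
    public static HttpRequestMessage BuildReloadRequest(string baseUrl)
    {
        var uri = new Uri(new Uri(baseUrl), "/-/reload");
        return new HttpRequestMessage(HttpMethod.Post, uri);
    }

    // Sends the request and reports whether the server answered with a success status.
    public static async Task<bool> ReloadAsync(HttpClient client, string baseUrl)
    {
        using (var request = BuildReloadRequest(baseUrl))
        using (var response = await client.SendAsync(request))
        {
            return response.IsSuccessStatusCode;
        }
    }
}
```

A call such as PrometheusReload.ReloadAsync(new HttpClient(), "http://localhost:9090") would then be enough to ask a local instance to re-read its configuration.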
What types of data are used
Prometheus uses its own multidimensional data model and a query language for multidimensional data called PromQL. Data is stored as time series, and several storage options are supported:
- Local disk storage: every 2 hours, data that has been buffered in memory is compressed and written to disk. By default, the compressed files are saved in the ./data directory under the working directory;
- Remote storage: Prometheus supports integration with third-party storage systems (for example, Kafka, PostgreSQL, Amazon S3) through adapters based on Protocol Buffers.
A stored time series is identified by a metric name and metadata in the form of key-value pairs (labels); if necessary, the metric name can be omitted, in which case the series is identified by its labels alone. Formally, a time series is written as <metric name>{<metadata>}. This key says what we are measuring, and the value is the measurement itself, a number of type float64 (the only type Prometheus supports). The metadata in the key is a list of labels, each also a key-value pair: <label name>="<label value>", <label name>="<label value>", ...
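The key format described above can be sketched in a few lines of C#. The SeriesKey class is a hypothetical helper written for this article, not part of Prometheus.Client; sorting the labels by name is a design choice of the sketch so that the same label set always produces the same key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper that renders <metric name>{<label name>="<label value>",...}.
public static class SeriesKey
{
    public static string Format(string metricName, IDictionary<string, string> labels)
    {
        if (labels == null || labels.Count == 0)
        {
            return metricName;
        }

        // Sort by label name so that equal label sets give equal keys.
        var pairs = labels
            .OrderBy(l => l.Key, StringComparer.Ordinal)
            .Select(l => l.Key + "=\"" + Escape(l.Value) + "\"");

        return metricName + "{" + string.Join(",", pairs) + "}";
    }

    // Escapes backslash, double quote and line feed inside a label value.
    private static string Escape(string value)
    {
        return value.Replace("\\", "\\\\").Replace("\"", "\\\"").Replace("\n", "\\n");
    }
}
```

Two samples belong to the same time series exactly when their keys are equal; changing any label value creates a new series.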
When storing metrics, the following data types are used:
- Counter - accumulates a total over time. A metric of this type can only increase (negative increments are not allowed) or be reset to zero. It suits, for example, counting the number of requests per minute, the number of errors per day, the number of sent/received network packets, etc.
- Gauge - stores a value that can decrease or increase over time. A gauge holds only the current value and does not show how the metric developed over a period of time, so irregular changes between scrapes can be lost.
- Histogram - stores several time series: the total sum of all observed values; the number of events that were observed; cumulative counters (buckets), identified by the label le="<upper inclusive bound>". Values are collected in buckets with configurable upper bounds.
- Summary - stores several time series: the total sum of all observed values; the number of events that were observed; streaming φ-quantiles (0 ≤ φ ≤ 1) of the observed events, identified by the label quantile="<φ>".
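To make the Histogram description more concrete, here is a small self-contained C# sketch of cumulative buckets. SimpleHistogram is a hypothetical class written for this article (it is not the Histogram from Prometheus.Client and is not thread-safe); it only shows how one observation ends up in the le buckets, the sum, and the count.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical, simplified model of a histogram with cumulative buckets.
public sealed class SimpleHistogram
{
    private readonly double[] _upperBounds; // sorted ascending; +Inf is implicit
    private readonly long[] _bucketCounts;  // per-bucket counts; the last slot is +Inf

    public double Sum { get; private set; }
    public long Count { get; private set; }

    public SimpleHistogram(params double[] upperBounds)
    {
        _upperBounds = (double[])upperBounds.Clone();
        Array.Sort(_upperBounds);
        _bucketCounts = new long[_upperBounds.Length + 1];
    }

    public void Observe(double value)
    {
        // Find the first bucket whose inclusive upper bound is >= value.
        var i = 0;
        while (i < _upperBounds.Length && value > _upperBounds[i])
        {
            i++;
        }
        _bucketCounts[i]++;
        Sum += value;
        Count++;
    }

    // Renders the series in the text form <name>_bucket{le="..."}, <name>_sum, <name>_count.
    public IEnumerable<string> Render(string name)
    {
        long cumulative = 0;
        for (var i = 0; i < _upperBounds.Length; i++)
        {
            cumulative += _bucketCounts[i]; // each bucket includes all smaller ones
            var le = _upperBounds[i].ToString(CultureInfo.InvariantCulture);
            yield return name + "_bucket{le=\"" + le + "\"} " + cumulative;
        }
        yield return name + "_bucket{le=\"+Inf\"} " + Count;
        yield return name + "_sum " + Sum.ToString(CultureInfo.InvariantCulture);
        yield return name + "_count " + Count;
    }
}
```

For example, with bounds 100 and 500 and observations 50, 100, 300, and 900, the le="100" bucket holds 2 events, le="500" holds 3, and le="+Inf" holds all 4, because every bucket also counts the events of the smaller buckets.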
How is data saved?
Prometheus recommends "giving" 2/3 of RAM to the running application.
To store data in memory, Prometheus uses files called chunks; each metric has its own file. All chunk files are immutable except the last one, into which data is being written. New data is saved in a chunk, and every 2 hours a background thread merges the data and writes it to disk. Each two-hour block is a directory containing one or more chunk files with all time series samples for that period, plus a metadata file and an index file (which indexes metric names and labels to the time series in the chunk files). If Prometheus writes no data to a chunk for an hour, the chunk is saved to disk and a new chunk is created for writing. Data is kept for a limited retention period, 15 days by default; the period can be changed with a command-line flag.
Because the memory size is fixed, the write and read performance of the system is limited by it. The amount of memory the Prometheus TSDB needs is determined by the minimum time period, the scrape interval, and the number of time series.
Prometheus also has a WAL mechanism to prevent data loss.
The write-ahead log (WAL) serializes in-memory operations to persistent storage in the form of log files. In the event of a failure, the WAL files can be used to restore the database to a consistent state by replaying the logs.
Log files are stored in the wal directory in 128 MB segments. These files contain raw data that has not yet been compressed, so they are significantly larger than regular chunk files.
Prometheus keeps at least 3 log files, but servers with high traffic can have more than three WAL files, since at least two hours of raw data must be kept.
The result of using the WAL is a significant reduction in the number of disk writes, because only the log file has to be written to disk rather than every piece of data changed by an operation. The log file is written sequentially, so the cost of syncing the log is much lower than the cost of writing data chunks.
Prometheus periodically saves checkpoints, which by default are created every 2 hours by compacting the logs for the past period and saving them to disk.
All checkpoints are stored in the same directory under the name checkpoint.ddd, where ddd is a monotonically increasing number. When recovering from a failure, Prometheus can therefore restore the checkpoints from the checkpoint directory in order (.ddd).
Thanks to the WAL, it is possible to return to any checkpoint for which the data log is available.
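The idea behind a write-ahead log can be shown with a deliberately simplified C# sketch. TinyWal is a hypothetical class invented for this illustration: it uses a single tab-separated text file, whereas the real Prometheus WAL is a binary, segmented format with checkpoints. The sketch only demonstrates the principle: append to the log first, apply in memory second, replay the log on startup.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical, simplified write-ahead log: one text record per write.
public sealed class TinyWal
{
    private readonly string _path;

    // In-memory state: series key -> last written value.
    public Dictionary<string, double> State { get; } = new Dictionary<string, double>();

    public TinyWal(string path)
    {
        _path = path;
        Replay(); // rebuild the in-memory state after a restart or crash
    }

    // Appends the record to the log first, then applies it in memory.
    // Assumption of the sketch: series keys contain no tab characters.
    public void Write(string series, double value)
    {
        var line = series + "\t" + value.ToString("R", CultureInfo.InvariantCulture);
        File.AppendAllText(_path, line + Environment.NewLine);
        State[series] = value;
    }

    private void Replay()
    {
        if (!File.Exists(_path))
        {
            return;
        }

        foreach (var line in File.ReadLines(_path))
        {
            var parts = line.Split('\t');
            double value;
            if (parts.Length != 2 ||
                !double.TryParse(parts[1], NumberStyles.Float, CultureInfo.InvariantCulture, out value))
            {
                continue; // skip a torn or malformed record
            }
            State[parts[0]] = value;
        }
    }
}
```

Creating a second TinyWal over the same file after a simulated crash yields the same State as before, which is the property the WAL provides for data that has not yet been written to a block on disk.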
What happened in practice?
When adding metrics to the project (.NET Framework), we used the Prometheus.Client.3.0.2 package to collect them. The necessary methods and classes were added to the project to hold the metrics until Prometheus retrieves them.
First we defined the IMetricsService interface, which contains timer methods for measuring how long methods run:
public interface IMetricsService
{
    // Starts a timer for the current request.
    Stopwatch StartTimer();

    // Stops the timer and records the elapsed time for the given controller, action and HTTP method.
    void StopTimer(Stopwatch timer, string controllerName, string actionName, string methodName = "POST");
}
We add the MetricsService class, which implements the IMetricsService interface and temporarily stores the metrics.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
// plus the Prometheus.Client namespaces that provide Metrics, Histogram and CollectorRegistry

public class MetricsService : IMetricsService
{
    // Current histogram; it is swapped for a fresh one every time the metrics are read.
    private static Histogram _histogram;

    static MetricsService()
    {
        _histogram = CreateHistogram();
    }

    public Stopwatch StartTimer()
    {
        try
        {
            var timer = new Stopwatch();
            timer.Start();
            return timer;
        }
        catch (Exception exception)
        {
            Logger.Error(exception);
        }
        return null;
    }

    public void StopTimer(Stopwatch timer, string controllerName, string actionName, string methodName = "POST")
    {
        try
        {
            if (timer == null)
            {
                throw new ArgumentException($"{nameof(timer)} can't be null.");
            }
            timer.Stop();
            _histogram
                .WithLabels(controllerName, actionName, methodName)
                .Observe(timer.ElapsedMilliseconds, DateTimeOffset.UtcNow);
        }
        catch (Exception exception)
        {
            Logger.Error(exception);
        }
    }

    public static List<string> GetAllLabels()
    {
        var metricsList = new List<string>();
        try
        {
            // Atomically install a fresh histogram and read the one that was collecting until now.
            var snapshot = Interlocked.Exchange(ref _histogram, CreateHistogram());
            foreach (var keyValuePair in snapshot.Labelled)
            {
                var controllerName = keyValuePair.Key.Labels[0].Value;
                var actionName = keyValuePair.Key.Labels[1].Value;
                var methodName = keyValuePair.Key.Labels[2].Value;
                var requestDurationSum = keyValuePair.Value.Value.Sum;
                var requestCount = keyValuePair.Value.Value.Count;
                // Label values must be enclosed in double quotes in the Prometheus text format.
                metricsList.Add($"http_request_duration_widget_sum{{controller=\"{controllerName}\",action=\"{actionName}\",method=\"{methodName}\"}} {requestDurationSum}");
                metricsList.Add($"http_request_duration_widget_count{{controller=\"{controllerName}\",action=\"{actionName}\",method=\"{methodName}\"}} {requestCount}");
            }
        }
        catch (Exception exception)
        {
            Logger.Error(exception);
        }
        return metricsList;
    }

    // Creates an empty histogram in its own registry.
    private static Histogram CreateHistogram()
    {
        return Metrics
            .WithCustomRegistry(new CollectorRegistry())
            .CreateHistogram(name: "http_request_duration_web_api",
                help: "Histogram metrics of Web.Api",
                includeTimestamp: true,
                labelNames: new[] { "controller", "action", "method" });
    }
}
Now we can use our class to record the metrics we plan to collect in the Application_BeginRequest, Application_Error, and Application_EndRequest methods. In the Global.asax.cs class, we add metrics collection to these methods.
// Assumption: the service is created directly here; use your DI container instead if the project has one.
private readonly IMetricsService _metricsService = new MetricsService();

protected virtual void Application_BeginRequest(object sender, EventArgs e)
{
    var context = new HttpContextWrapper(HttpContext.Current);
    var metricServiceTimer = _metricsService.StartTimer();
    context.Items.Add("metricsService", _metricsService);
    context.Items.Add("metricServiceTimer", metricServiceTimer);
}

protected virtual void Application_EndRequest(object sender, EventArgs e)
{
    WriteMetrics(new HttpContextWrapper(HttpContext.Current));
}

protected void Application_Error(object sender, EventArgs e)
{
    WriteMetrics(new HttpContextWrapper(HttpContext.Current));
}

private void WriteMetrics(HttpContextBase context)
{
    try
    {
        var metricsService = context.Items["metricsService"] as IMetricsService;
        var timer = context.Items["metricServiceTimer"] as Stopwatch;
        if (metricsService == null || timer == null)
        {
            return;
        }

        // Application_EndRequest also runs after Application_Error,
        // so remove the timer to record each request only once.
        context.Items.Remove("metricServiceTimer");

        string controllerName = null;
        string actionName = null;
        var rd = RouteTable.Routes.GetRouteData(context);
        if (rd != null)
        {
            controllerName = rd.GetRequiredString("controller");
            actionName = rd.GetRequiredString("action");
        }
        metricsService.StopTimer(timer, controllerName, actionName, context.Request.HttpMethod);
    }
    catch (Exception exception)
    {
        Logger.Error("Can't write metrics.", exception);
    }
}
Add a new controller, which will be the endpoint from which Prometheus collects the metrics of our API:
public class MetricsController : Controller
{
    [HttpGet]
    public ActionResult GetAllMetrics()
    {
        try
        {
            var metrics = MetricsService.GetAllLabels();
            // Prometheus expects plain text with one sample per line.
            return Content(string.Join("\n", metrics) + "\n", "text/plain");
        }
        catch (Exception exception)
        {
            Logger.Error(exception);
        }
        return Content(string.Empty, "text/plain");
    }
}
The last step is to add a job for our API to the scrape_configs section of the Prometheus configuration; after that, the collected metrics can be seen in the Prometheus UI or in Grafana.
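A scrape job for such an endpoint might look like the fragment below. This is only a sketch: the job name web_api, the target localhost:5000, and both intervals are assumptions made for illustration, and the metrics_path assumes the default MVC route {controller}/{action} for the MetricsController shown above.

```yaml
global:
  scrape_interval: 15s          # default frequency for all jobs (assumed value)

scrape_configs:
  - job_name: "web_api"         # hypothetical job name
    scrape_interval: 30s        # per-job override of the global frequency
    metrics_path: /Metrics/GetAllMetrics   # assumes the default {controller}/{action} route
    static_configs:
      - targets: ["localhost:5000"]        # hypothetical host and port of the API
```

After the configuration is reloaded, the job should appear on the Targets page of the Prometheus UI.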
Key features of Prometheus that we were interested in:
- Multidimensional data model: metrics and labels.
- Flexible PromQL query language. A single query can combine operations such as multiplication, addition, concatenation, etc., and can be applied to multiple metrics.
- Collects data over HTTP using the pull method.
- Compatible with the push method via Pushgateway.
- Metrics can be collected from other applications through Exporters.
- Provides a mechanism to prevent data loss.
- Supports various graphical representations of data.