Signed-off-by: Daniel Baumann <daniel@debian.org>

2025-05-24 07:26:29 +02:00

Partial write error handling

Objective

Provide a way to pass information about partial metric write errors from an output to the output model.

Keywords

output plugins, write, error, output model, metric, buffer

Overview

The output model wrapping each output plugin buffers metrics to be able to batch those metrics for more efficient sending. In each flush cycle, the model collects a batch of metrics and hands it over to the output plugin for writing through the Write method. Currently, if writing succeeds (i.e. no error is returned), all metrics of the batch are removed from the buffer and are marked as accepted both in terms of statistics as well as in tracking-metric terms. If writing fails (i.e. any error is returned), all metrics of the batch are kept in the buffer for requeueing them in the next write cycle.

Issues arise when an output plugin can write only some of the metrics in a batch to its service endpoint, e.g. because some metrics are not serializable or because the service selectively rejects metrics on the server side. This might happen when reaching submission limits, when violating service constraints (e.g. by out-of-order sends), or due to invalid characters in the serialized metric. In those cases, an output can currently only accept or reject the complete batch of metrics, as there is no mechanism to inform the model (and in turn the buffer) that only some of the metrics in the batch failed.

As a consequence, outputs often accept the whole batch to avoid requeueing the failing metrics for the next flush interval. This distorts the statistics of accepted metrics and causes misleading log messages claiming that all metrics were written successfully. Even worse, for outputs ending up with partial writes, e.g. when only the first half of the metrics can be written to the service, there is no way of telling the model to selectively accept the metrics that were actually written. Those outputs must therefore buffer the remaining, unwritten metrics internally, which duplicates buffering logic and adds code complexity.

This specification defines the handling of partially successful writes. It introduces a special partial write error type that reflects partial writes and partial serialization, overcoming the aforementioned issues and limitations.

To do so, the partial write error type must contain a list of successfully written metrics. These metrics must be marked as accepted, both in terms of statistics and in terms of metric tracking, and must be removed from the buffer. Furthermore, the error must contain a list of metrics that cannot be sent or serialized and cannot be retried. These metrics must be marked as rejected, both in terms of statistics and in terms of metric tracking, and must be removed from the buffer.

The error may contain a list of not-yet-written metrics to be kept for the next write cycle. Those metrics must not be marked and must be kept in the buffer. If the error does not contain the list of not-yet-written metrics, this list must be inferred using the accept and reject lists mentioned above.

To allow the model and the buffer to correctly handle tracking metrics ending up in the buffer and output, the tracking information must be preserved during communication between the output plugin, the model, and the buffer through the specified error. To do so, all metric lists should be communicated as indices into the batch.

For backward compatibility and simplicity, output plugins can return a nil error to indicate that all metrics of the batch are accepted. Similarly, returning an error that is not a partial write error indicates that all metrics of the batch should be kept in the buffer for the next write cycle.

Related Issues

issue #11942 for contradicting log messages
issue #14802 for rate-limiting multiple batch sends
issue #15908 for infinite loop if single metrics cannot be written
