78 lines
3.3 KiB
Markdown
78 lines
3.3 KiB
Markdown
|
# Telegraf Output Buffer Strategy
|
||
|
|
||
|
## Objective
|
||
|
|
||
|
Introduce a new agent-level config option to choose a disk buffer strategy for
|
||
|
output plugin metric queues.
|
||
|
|
||
|
## Overview
|
||
|
|
||
|
Currently, when a Telegraf output metric queue fills, either due to incoming
|
||
|
metrics being too fast or various issues with writing to the output, oldest
|
||
|
metrics are overwritten and never written to the output. This specification
|
||
|
defines a set of options to make this output queue more durable by persisting
|
||
|
pending metrics to disk rather than only an in-memory limited size queue.
|
||
|
|
||
|
## Keywords
|
||
|
|
||
|
output plugins, agent configuration, persist to disk
|
||
|
|
||
|
## Agent Configuration
|
||
|
|
||
|
The configuration is at the agent-level, with options for:
|
||
|
|
||
|
- **Memory**, the current implementation, with no persistence to disk
|
||
|
- **Write-through**, all metrics are also written to disk using a
|
||
|
Write Ahead Log (WAL) file
|
||
|
- **Disk-overflow**, when the memory buffer fills, metrics are flushed to a
|
||
|
WAL file to avoid dropping overflow metrics
|
||
|
|
||
|
As well as an option to specify a directory to store the WAL files on disk,
|
||
|
with a default value. These configurations are global, and no change means
|
||
|
memory only mode, retaining current behavior.
|
||
|
|
||
|
## Metric Ordering and Tracking
|
||
|
|
||
|
Tracking metrics will be accepted on a successful write to the output
|
||
|
destination. Metrics will be written to their appropriate output in the order
|
||
|
they are received in the buffer regardless of which buffer strategy is chosen.
|
||
|
|
||
|
## Disk Utilization and File Handling
|
||
|
|
||
|
Each output plugin has its own in-memory output buffer, and therefore will
|
||
|
each have their own WAL file for buffer persistence. This file may not exist
|
||
|
if Telegraf is successfully able to write all of its metrics without filling
|
||
|
the in-memory buffer in disk-overflow mode, or not at all in memory mode.
|
||
|
Telegraf should use one file per output plugin, and remove entries from the
|
||
|
WAL file as they are written to the output.
|
||
|
|
||
|
Telegraf will not make any attempt to limit the size on disk taken by these
|
||
|
files beyond cleaning up WAL files for metrics that have successfully been
|
||
|
flushed to their output destination. It is the user's responsibility to ensure
|
||
|
these files do not entirely fill the disk, both during Telegraf uptime and
|
||
|
with lingering files from previous instances of the program.
|
||
|
|
||
|
If WAL files exist for an output plugin from previous instances of Telegraf,
|
||
|
they will be picked up and flushed before any new metrics that are written
|
||
|
to the output. This is to ensure that these metrics are not lost, and to
|
||
|
ensure that output write order remains consistent.
|
||
|
|
||
|
Telegraf must additionally provide a way to manually flush WAL files via
|
||
|
some separate plugin or similar. This could be used as a way to ensure that
|
||
|
WAL files are properly written in the event that the output plugin changes
|
||
|
and the WAL file is unable to be detected by a new instance of Telegraf.
|
||
|
This plugin should not be required for use to allow the buffer strategy to
|
||
|
work.
|
||
|
|
||
|
## Is/Is-not
|
||
|
|
||
|
- Is a way to prevent metrics from being dropped due to a full memory buffer
|
||
|
- Is not a way to guarantee data safety in the event of a crash or system failure
|
||
|
- Is not a way to manage file system allocation size, file space will be used
|
||
|
until the disk is full
|
||
|
|
||
|
## Prior art
|
||
|
|
||
|
[Initial issue](https://github.com/influxdata/telegraf/issues/802)
|
||
|
[Loose specification issue](https://github.com/influxdata/telegraf/issues/14805)
|