# Startup Error Behavior

## Objective

Unified, configurable behavior on retriable startup errors.

## Keywords

inputs, outputs, startup, error, retry

## Overview

Many Telegraf plugins connect to an external service either on the same machine
or via network. On automated startup of Telegraf (e.g. via service) there is no
guarantee that those services are fully started yet, especially when they reside
on a remote host. More and more plugins implement mechanisms to retry reaching
their related service if they failed to do so on startup.

This specification intends to unify the naming of configuration options, the
values of those options, and their semantic meaning. It describes the behavior
for the different options on handling startup errors.

Startup errors are all errors occurring in calls to `Start()` for inputs and
service-inputs or `Connect()` for outputs. The behaviors described below
should only be applied in cases where the plugin *explicitly* states that a
startup error is *retriable*. This includes, for example, network errors
indicating that the host or service is not yet reachable, or external
resources, like a machine or file, which are not yet available but might become
available later. To indicate a retriable startup error, the plugin should return
a predefined error type.

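As a minimal sketch of how such a predefined error type could look, the Go
snippet below wraps the underlying cause and carries a retry flag. The names
`StartupError`, `IsRetriable` and `connect` are illustrative assumptions and are
not taken from this specification.

```go
package main

import (
	"errors"
	"fmt"
)

// StartupError is a hypothetical error type a plugin could return from
// Start() or Connect() to mark a startup failure as retriable.
type StartupError struct {
	Err   error // underlying cause, expected to be non-nil
	Retry bool  // true if a later attempt might succeed
}

func (e *StartupError) Error() string { return e.Err.Error() }
func (e *StartupError) Unwrap() error { return e.Err }

// IsRetriable reports whether err, or any error it wraps, is a
// StartupError that is flagged as retriable.
func IsRetriable(err error) bool {
	var serr *StartupError
	return errors.As(err, &serr) && serr.Retry
}

// connect stands in for a plugin's Start() or Connect() method.
func connect(reachable bool) error {
	if !reachable {
		return &StartupError{Err: errors.New("connection refused"), Retry: true}
	}
	return nil
}

func main() {
	// The check still works when the caller wraps the plugin's error.
	err := fmt.Errorf("starting plugin failed: %w", connect(false))
	fmt.Println(IsRetriable(err))
	fmt.Println(IsRetriable(errors.New("invalid configuration")))
}
```

Using `errors.As` instead of a direct type assertion keeps the classification
intact when intermediate layers add context to the error.
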
In cases where the plugin cannot generally determine whether an error is
retriable, the plugin might add configuration settings to let the user
configure that property. This applies, for example, where an error code
indicates a fatal, non-recoverable error in one case but a non-fatal,
recoverable error in another case.

## Configuration Options and Behaviors

Telegraf must introduce a unified `startup_error_behavior` configuration option
for input and output plugins. The option is handled directly by the Telegraf
agent and is not passed down to the plugins. The setting must be available on a
per-plugin basis and defines how Telegraf behaves on startup errors.

For all values of this option, Telegraf might retry starting the plugin a
limited number of times during the startup phase before actually processing
data. This corresponds to the current behavior of Telegraf to retry three times
with a fifteen second interval before continuing processing of the plugins.

### `error` behavior

The `error` setting for the `startup_error_behavior` option causes Telegraf to
fail and exit on startup errors. This must be the default behavior.

### `retry` behavior

When using the `retry` setting for the `startup_error_behavior` option, Telegraf
must *not* fail on startup errors and should continue running. Telegraf must
retry starting the failed plugin in each gather or write cycle, for inputs or
outputs respectively, for an unlimited number of times. Neither the plugin's
`Gather()` nor its `Write()` method is called as long as the startup did not
succeed. Metrics sent to an output plugin will be buffered until the plugin is
actually started. If the metric-buffer limit is reached, **metrics might be
dropped**!

In case a plugin signals a partially successful startup, e.g. a subset of the
given endpoints are reachable, Telegraf must try to fully start up the remaining
endpoints by calling `Start()` or `Connect()`, respectively, until full startup
is reached **and** trigger the plugin's `Gather()` or `Write()` methods.

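The per-cycle retry for an input can be sketched as follows. This is a
simplified illustration under stated assumptions, not the agent's
implementation: `runner`, `flaky` and the sentinel `errNotReady` are
hypothetical names, and partial startup as well as output buffering are left
out.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotReady is a hypothetical retriable startup error.
var errNotReady = errors.New("service not ready")

// input is a minimal stand-in for an input plugin.
type input interface {
	Start() error
	Gather() error
}

// runner tracks whether the wrapped plugin has been started.
type runner struct {
	plugin  input
	started bool
}

// Cycle runs one gather cycle under the `retry` behavior. It calls Start()
// while the plugin is not yet started and calls Gather() only after a
// successful startup. It reports whether Gather() was called.
func (r *runner) Cycle() (bool, error) {
	if !r.started {
		if err := r.plugin.Start(); err != nil {
			if errors.Is(err, errNotReady) {
				return false, nil // skip this cycle, retry in the next one
			}
			return false, err // not retriable, hand the error to the caller
		}
		r.started = true
	}
	return true, r.plugin.Gather()
}

// flaky fails a fixed number of Start() calls before succeeding.
type flaky struct {
	failures int
	gathers  int
}

func (f *flaky) Start() error {
	if f.failures > 0 {
		f.failures--
		return errNotReady
	}
	return nil
}

func (f *flaky) Gather() error {
	f.gathers++
	return nil
}

func main() {
	r := &runner{plugin: &flaky{failures: 2}}
	for i := 1; i <= 4; i++ {
		gathered, err := r.Cycle()
		fmt.Printf("cycle %d: gathered=%v err=%v\n", i, gathered, err)
	}
}
```

With two failing startup attempts, `Gather()` is skipped in the first two
cycles and called from the third cycle onwards.
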
### `ignore` behavior

When using the `ignore` setting for the `startup_error_behavior` option, Telegraf
must *not* fail on startup errors and should continue running. On startup error,
Telegraf must ignore the plugin as if it was not configured at all, i.e. the
plugin must be completely removed from processing.

### `probe` behavior

When using the `probe` setting for the `startup_error_behavior` option, Telegraf
must *not* fail on startup errors and should continue running. On startup error,
Telegraf must ignore the plugin as if it was not configured at all, i.e. the
plugin must be completely removed from processing, similar to the `ignore`
behavior. Additionally, Telegraf must probe the plugin (as defined in
[TSD-009][tsd_009]) after startup, if it implements the `ProbePlugin` interface.
If probing is available *and* returns an error, Telegraf must *ignore* the
plugin as if it was not configured at all.

[tsd_009]: /docs/specs/tsd-009-probe-on-startup.md

## Plugin Requirements

Plugins participating in handling startup errors must implement the `Start()`
or `Connect()` method for inputs and outputs respectively. Those methods must be
safe to be called multiple times during retries without leaking resources or
causing issues in the service used.

Furthermore, the `Close()` method of the plugins must be safe to call, without
causing panics, in cases where the startup failed.

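One way to meet both requirements is sketched below. The `plugin` type, its
`dial` field and the `openConns` counter are hypothetical and only serve the
illustration; releasing a leftover connection before dialing again is one
possible strategy, not the one prescribed by this specification.

```go
package main

import (
	"errors"
	"fmt"
)

// openConns counts connections that were dialed but not yet closed.
var openConns int

// conn is a hypothetical handle to an external service.
type conn struct{}

func (c *conn) Close() error {
	openConns--
	return nil
}

func workingDial() (*conn, error) {
	openConns++
	return &conn{}, nil
}

func failingDial() (*conn, error) {
	return nil, errors.New("service unreachable")
}

// plugin sketches a plugin whose Start() may be retried and whose Close()
// may be called even if Start() never succeeded.
type plugin struct {
	dial func() (*conn, error)
	c    *conn
}

// Start releases any connection left over from a previous call before
// dialing again, so repeated calls do not leak resources.
func (p *plugin) Start() error {
	if p.c != nil {
		_ = p.c.Close()
		p.c = nil
	}
	c, err := p.dial()
	if err != nil {
		return err
	}
	p.c = c
	return nil
}

// Close tolerates a plugin that was never started successfully.
func (p *plugin) Close() error {
	if p.c == nil {
		return nil
	}
	err := p.c.Close()
	p.c = nil
	return err
}

func main() {
	p := &plugin{dial: failingDial}
	fmt.Println("start:", p.Start())
	fmt.Println("close after failed start:", p.Close())

	p.dial = workingDial
	fmt.Println("start twice:", p.Start(), p.Start())
	fmt.Println("open connections:", openConns)
	fmt.Println("close:", p.Close())
}
```
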
The plugins should return a `nil` error during startup to indicate a successful
startup or a retriable error (via the predefined error type) to enable the
defined startup error behaviors. A non-retriable error (via the predefined error
type) or a generic error will bypass the startup error behaviors and Telegraf
must fail and exit in the startup phase.

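Taken together, the rules of this specification can be summarized as a small
decision function. The sketch below is an illustration only: `Behavior`,
`Action` and `Decide` are hypothetical names, and the `retriable` flag stands
for the result of checking the returned error against the predefined error
type.

```go
package main

import (
	"errors"
	"fmt"
)

// Behavior mirrors the values of the `startup_error_behavior` option.
type Behavior string

const (
	BehaviorError  Behavior = "error"
	BehaviorRetry  Behavior = "retry"
	BehaviorIgnore Behavior = "ignore"
	BehaviorProbe  Behavior = "probe"
)

// Action is a hypothetical name for what the agent does next.
type Action string

const (
	ActionRun    Action = "run"    // startup succeeded, process normally
	ActionProbe  Action = "probe"  // startup succeeded, probe if supported
	ActionExit   Action = "exit"   // fail and exit
	ActionRetry  Action = "retry"  // keep the plugin, retry in each cycle
	ActionRemove Action = "remove" // treat the plugin as not configured
)

// errStartup is a placeholder for whatever error a plugin returned.
var errStartup = errors.New("startup failed")

// Decide maps the outcome of Start() or Connect() to an action. The
// retriable flag states whether err is of the predefined retriable type.
func Decide(b Behavior, err error, retriable bool) Action {
	if err == nil {
		if b == BehaviorProbe {
			return ActionProbe
		}
		return ActionRun
	}
	if !retriable {
		// Non-retriable and generic errors bypass the configured behavior.
		return ActionExit
	}
	switch b {
	case BehaviorRetry:
		return ActionRetry
	case BehaviorIgnore, BehaviorProbe:
		return ActionRemove
	default:
		// `error` is the default behavior.
		return ActionExit
	}
}

func main() {
	all := []Behavior{BehaviorError, BehaviorRetry, BehaviorIgnore, BehaviorProbe}
	for _, b := range all {
		fmt.Printf("%-6s retriable=%s generic=%s\n",
			b, Decide(b, errStartup, true), Decide(b, errStartup, false))
	}
}
```

Keeping the classification of the error separate from the decision makes it
explicit that the configured behavior only ever applies to retriable errors.
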
## Related Issues

- [#8586](https://github.com/influxdata/telegraf/issues/8586) `inputs.postgresql`
- [#9778](https://github.com/influxdata/telegraf/issues/9778) `outputs.kafka`
- [#13278](https://github.com/influxdata/telegraf/issues/13278) `outputs.cratedb`
- [#13746](https://github.com/influxdata/telegraf/issues/13746) `inputs.amqp_consumer`
- [#14365](https://github.com/influxdata/telegraf/issues/14365) `outputs.postgresql`
- [#14603](https://github.com/influxdata/telegraf/issues/14603) `inputs.nvidia-smi`
- [#14603](https://github.com/influxdata/telegraf/issues/14603) `inputs.rocm-smi`