
# Startup Error Behavior

## Objective

Unified, configurable behavior on retriable startup errors.

## Keywords

inputs, outputs, startup, error, retry

## Overview

Many Telegraf plugins connect to an external service either on the same machine
or via network. On automated startup of Telegraf (e.g. via service) there is no
guarantee that those services are fully started yet, especially when they reside
on a remote host. More and more plugins implement mechanisms to retry reaching
their related service if they failed to do so on startup.

This specification intends to unify the naming of configuration options, the
values of those options, and their semantic meaning. It describes the behavior
of the different options for handling startup errors.

Startup errors are all errors occurring in calls to `Start()` for inputs and
service-inputs or `Connect()` for outputs. The behaviors described below
should only be applied in cases where the plugin *explicitly* states that a
startup error is *retriable*. This includes, for example, network errors
indicating that the host or service is not yet reachable, or external
resources, like a machine or file, which are not yet available but might become
available later. To indicate a retriable startup error, the plugin should return
a predefined error type.
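
A minimal sketch of how such a predefined error type could look and how the
agent could detect it. The names `StartupError`, `IsRetriable`, and `connect`
are assumptions made for illustration and are not defined by this
specification.

```go
package main

import (
	"errors"
	"fmt"
)

// StartupError is a hypothetical predefined error type that a plugin could
// return from Start() or Connect() to describe a startup failure.
type StartupError struct {
	Err   error // underlying cause
	Retry bool  // true if a later attempt might succeed
}

func (e *StartupError) Error() string { return e.Err.Error() }
func (e *StartupError) Unwrap() error { return e.Err }

// IsRetriable reports whether err, or any error it wraps, is a StartupError
// explicitly marked as retriable. Generic errors are never retriable.
func IsRetriable(err error) bool {
	var se *StartupError
	return errors.As(err, &se) && se.Retry
}

// connect simulates an output's Connect(): an unreachable service is reported
// as retriable, while an invalid configuration is a generic, fatal error.
func connect(reachable, validConfig bool) error {
	if !validConfig {
		return errors.New("invalid configuration")
	}
	if !reachable {
		return &StartupError{Err: errors.New("connection refused"), Retry: true}
	}
	return nil
}

func main() {
	fmt.Println("unreachable service retriable:", IsRetriable(connect(false, true)))
	fmt.Println("invalid config retriable:", IsRetriable(connect(true, false)))
	fmt.Println("successful startup retriable:", IsRetriable(connect(true, true)))
}
```

Using `errors.As` keeps the check working even when a plugin wraps the
retriable error with additional context.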

In cases where the plugin cannot generally determine whether an error is
retriable, the plugin might add configuration settings to let the user
configure that property. This applies, for example, where an error code
indicates a fatal, non-recoverable error in one case but a non-fatal,
recoverable error in another case.

## Configuration Options and Behaviors

Telegraf must introduce a unified `startup_error_behavior` configuration option
for input and output plugins. The option is handled directly by the Telegraf
agent and is not passed down to the plugins. The setting must be available on a
per-plugin basis and defines how Telegraf behaves on startup errors.

For all config option values, Telegraf might retry starting the plugin a
limited number of times during the startup phase before actually processing
data. This corresponds to the current behavior of Telegraf, which retries three
times with a fifteen-second interval before continuing to process the plugins.
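
The bounded retry during the startup phase can be sketched as follows. The
helper `startWithRetries` and the error `errNotReady` are hypothetical names;
the attempt count and interval are parameters so that the three attempts and
fifteen seconds mentioned above are only one possible choice.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical error used to simulate a service that is not
// yet reachable.
var errNotReady = errors.New("service not ready")

// startWithRetries calls start up to attempts times (attempts must be at
// least 1) and waits interval between failed attempts. It returns nil as soon
// as one attempt succeeds, otherwise the last error wrapped with context.
func startWithRetries(start func() error, attempts int, interval time.Duration) error {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = start(); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("startup failed after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	start := func() error {
		calls++
		if calls < 3 {
			return errNotReady // the service comes up on the third attempt
		}
		return nil
	}
	// A short interval keeps the demo fast; the text above mentions 15 seconds.
	err := startWithRetries(start, 3, 10*time.Millisecond)
	fmt.Println("attempts:", calls, "error:", err)
}
```

Only if this bounded phase still ends with an error does the configured
`startup_error_behavior` decide what happens next.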

### `error` behavior

The `error` setting for the `startup_error_behavior` option causes Telegraf to
fail and exit on startup errors. This must be the default behavior.

### `retry` behavior

When using the `retry` setting for the `startup_error_behavior` option, Telegraf
must *not* fail on startup errors and should continue running. Telegraf must
retry starting the failed plugin in each gather or write cycle, for inputs or
outputs respectively, an unlimited number of times. Neither the plugin's
`Gather()` nor `Write()` method is called as long as the startup has not
succeeded. Metrics sent to an output plugin will be buffered until the plugin
is actually started. If the metric-buffer limit is reached, **metrics might be
dropped**!
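
The following sketch illustrates the `retry` behavior for an output. The
`Output` interface is a reduced stand-in and `retryingOutput` and `flakyOutput`
are hypothetical types, not Telegraf's actual implementation. Dropping the
oldest metric when the buffer is full is an assumption made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Output is a reduced, hypothetical stand-in for an output plugin.
type Output interface {
	Connect() error
	Write(metrics []string) error
}

// retryingOutput buffers metrics and retries Connect() in every write cycle.
type retryingOutput struct {
	plugin    Output
	connected bool
	buffer    []string
	limit     int // metric-buffer limit
	dropped   int // number of metrics dropped due to the limit
}

// add buffers a metric. Assumption: when the limit is reached, the oldest
// buffered metric is dropped to make room for the new one.
func (r *retryingOutput) add(metric string) {
	if len(r.buffer) > 0 && len(r.buffer) >= r.limit {
		r.buffer = r.buffer[1:]
		r.dropped++
	}
	r.buffer = append(r.buffer, metric)
}

// flush represents one write cycle. Write() is only called once Connect()
// has succeeded; until then the buffered metrics are kept.
func (r *retryingOutput) flush() error {
	if !r.connected {
		if err := r.plugin.Connect(); err != nil {
			return fmt.Errorf("not started, keeping %d metrics: %w", len(r.buffer), err)
		}
		r.connected = true
	}
	if err := r.plugin.Write(r.buffer); err != nil {
		return err
	}
	r.buffer = nil
	return nil
}

// flakyOutput fails Connect() a fixed number of times before succeeding.
type flakyOutput struct {
	failures int
	written  []string
}

func (f *flakyOutput) Connect() error {
	if f.failures > 0 {
		f.failures--
		return errors.New("connection refused")
	}
	return nil
}

func (f *flakyOutput) Write(metrics []string) error {
	f.written = append(f.written, metrics...)
	return nil
}

func main() {
	out := &flakyOutput{failures: 1}
	r := &retryingOutput{plugin: out, limit: 2}
	for _, m := range []string{"m1", "m2", "m3"} {
		r.add(m)
	}
	fmt.Println("first cycle:", r.flush())
	fmt.Println("second cycle:", r.flush())
	fmt.Println("written:", out.written, "dropped:", r.dropped)
}
```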

In case a plugin signals a partially successful startup, e.g. only a subset of
the given endpoints is reachable, Telegraf must keep trying to fully start up
the remaining endpoints by calling `Start()` or `Connect()`, respectively, until
full startup is reached **and** must also trigger the plugin's `Gather()` or
`Write()` method.

### `ignore` behavior

When using the `ignore` setting for the `startup_error_behavior` option, Telegraf
must *not* fail on startup errors and should continue running. On startup error,
Telegraf must ignore the plugin as if it was not configured at all, i.e. the
plugin must be completely removed from processing.

### `probe` behavior

When using the `probe` setting for the `startup_error_behavior` option, Telegraf
must *not* fail on startup errors and should continue running. On startup error,
Telegraf must ignore the plugin as if it was not configured at all, i.e. the
plugin must be completely removed from processing, similar to the `ignore`
behavior. Additionally, Telegraf must probe the plugin (as defined in
[TSD-009][tsd_009]) after startup, if it implements the `ProbePlugin` interface.
If probing is available *and* returns an error, Telegraf must *ignore* the
plugin as if it was not configured at all.
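
The four behaviors can be summarized as a decision function. This is a sketch
under assumptions: `Action`, `decide`, `ErrRetriable`, and `errBadConfig` are
hypothetical names, and the probe is modeled as an optional function that is
`nil` when the plugin does not implement probing.

```go
package main

import (
	"errors"
	"fmt"
)

// Action is what the agent does with a plugin after its startup attempt.
type Action int

const (
	Run        Action = iota // startup succeeded, process the plugin normally
	Exit                     // fail and exit Telegraf
	RetryLater               // keep the plugin, retry in each gather/write cycle
	Remove                   // treat the plugin as if it was not configured
)

// ErrRetriable is a hypothetical marker for retriable startup errors.
var ErrRetriable = errors.New("retriable startup error")

// errBadConfig is a hypothetical generic, non-retriable error.
var errBadConfig = errors.New("invalid configuration")

// decide maps the startup_error_behavior setting and the startup result to an
// action. probe may be nil if the plugin offers no probing.
func decide(behavior string, startErr error, probe func() error) (Action, error) {
	// Non-retriable and generic errors bypass the configured behavior.
	if startErr != nil && !errors.Is(startErr, ErrRetriable) {
		return Exit, nil
	}
	switch behavior {
	case "", "error": // "error" is the default
		if startErr != nil {
			return Exit, nil
		}
	case "retry":
		if startErr != nil {
			return RetryLater, nil
		}
	case "ignore":
		if startErr != nil {
			return Remove, nil
		}
	case "probe":
		if startErr != nil {
			return Remove, nil
		}
		if probe != nil && probe() != nil {
			return Remove, nil
		}
	default:
		return Exit, fmt.Errorf("unknown startup_error_behavior %q", behavior)
	}
	return Run, nil
}

func main() {
	unreachable := fmt.Errorf("connection refused: %w", ErrRetriable)
	for _, b := range []string{"error", "retry", "ignore", "probe"} {
		action, _ := decide(b, unreachable, nil)
		fmt.Printf("%-6s retriable error -> action %d\n", b, action)
	}
	action, _ := decide("ignore", errBadConfig, nil)
	fmt.Printf("ignore generic error   -> action %d\n", action)
}
```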

[tsd_009]: /docs/specs/tsd-009-probe-on-startup.md

## Plugin Requirements

Plugins participating in handling startup errors must implement the `Start()`
or `Connect()` method for inputs and outputs respectively. Those methods must be
safe to be called multiple times during retries without leaking resources or
causing issues in the service used.

Furthermore, the `Close()` method of the plugins must be safe to call, without
causing panics, in cases where the startup failed.
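
A sketch of an output that satisfies both requirements. The types `conn` and
`exampleOutput` and the `dial` field are hypothetical; a real plugin would use
the client of the service it talks to.

```go
package main

import (
	"errors"
	"fmt"
)

// conn is a hypothetical stand-in for a client connection.
type conn struct{ closed bool }

func (c *conn) Close() error {
	c.closed = true
	return nil
}

// exampleOutput is a hypothetical output plugin.
type exampleOutput struct {
	dial func() (*conn, error) // hypothetical dialer for the external service
	c    *conn
}

// Connect may be called repeatedly during retries. It releases a connection
// left over from an earlier call before dialing again, so nothing leaks.
func (o *exampleOutput) Connect() error {
	if o.c != nil {
		_ = o.c.Close()
		o.c = nil
	}
	c, err := o.dial()
	if err != nil {
		return err
	}
	o.c = c
	return nil
}

// Close tolerates a plugin that never connected successfully.
func (o *exampleOutput) Close() error {
	if o.c == nil {
		return nil
	}
	err := o.c.Close()
	o.c = nil
	return err
}

func main() {
	reachable := false
	o := &exampleOutput{dial: func() (*conn, error) {
		if !reachable {
			return nil, errors.New("connection refused")
		}
		return &conn{}, nil
	}}

	fmt.Println("connect while unreachable:", o.Connect())
	fmt.Println("close after failed startup:", o.Close())

	reachable = true
	fmt.Println("connect when reachable:", o.Connect())
	fmt.Println("close:", o.Close())
}
```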

The plugins should return a `nil` error during startup to indicate a successful
startup or a retriable error (via predefined error type) to enable the defined
startup error behaviors. A non-retriable error (via predefined error type) or
a generic error will bypass the startup error behaviors and Telegraf must fail
and exit in the startup phase.

## Related Issues

- [#8586](https://github.com/influxdata/telegraf/issues/8586) `inputs.postgresql`
- [#9778](https://github.com/influxdata/telegraf/issues/9778) `outputs.kafka`
- [#13278](https://github.com/influxdata/telegraf/issues/13278) `outputs.cratedb`
- [#13746](https://github.com/influxdata/telegraf/issues/13746) `inputs.amqp_consumer`
- [#14365](https://github.com/influxdata/telegraf/issues/14365) `outputs.postgresql`
- [#14603](https://github.com/influxdata/telegraf/issues/14603) `inputs.nvidia-smi`
- [#14603](https://github.com/influxdata/telegraf/issues/14603) `inputs.rocm-smi`