
Adding upstream version 1.34.4.

Signed-off-by: Daniel Baumann <daniel@debian.org>
Commit 4978089aab (parent e393c3af3f), authored by Daniel Baumann,
2025-05-24 07:26:29 +02:00; signed by daniel, GPG key ID FBB4F0E80A80222F.
4963 changed files with 677545 additions and 0 deletions

docs/specs/README.md
# Telegraf Specification Overview
## Objective
Define and layout the Telegraf specification process.
## Overview
The general goal of a spec is to detail the work that needs to get accomplished
for a new feature. A developer should be able to pick up a spec and have a
decent understanding of the objective, the steps required, and most of the
general design decisions.
The specs can then live in the Telegraf repository to share and involve the
community in the process of planning larger changes or new features. The specs
also serve as a public historical record for changes.
## Process
The general workflow is for a user to put up a PR with a spec outlining the
task, have any discussion in the PR, reach consensus, and ultimately commit
the finished spec to the repo.
While researching a new feature may involve an investment of time, writing the
spec should be relatively quick. It should not take hours of time.
## Spec naming
Please name the actual file prefixed with `tsd` and the next available
number, for example:
* tsd-001-agent-write-ahead-log.md
* tsd-002-inputs-apache-increase-timeout.md
* tsd-003-serializers-parquet.md
All lower-case and separated by hyphens.
## What belongs in a spec
A spec should involve the creation of a markdown file with at least an objective
and overview:
* Objective (required) - One sentence headline
* Overview (required) - Explain the reasoning for the new feature and any
  historical information. Answer why this is needed.
Please feel free to make a copy of `template.md` and start with that.
The user is free to add additional sections or parts in order to express and
convey a new feature. For example this might include:
* Keywords - Help identify what the spec is about
* Is/Is-not - Explicitly state what this change includes and does not include
* Prior Art - Point at existing or previous PRs, issues, or other works that
demonstrate the feature or need for it.
* Open Questions - Section with open questions that can get captured in
updates to the PR
## Changing existing specs
Small, non-substantive changes, such as grammar or formatting fixes, are
gladly accepted.
After a feature is complete it may make sense to come back and update a spec
based on the final result.
Whether substantive edits to an existing spec will be accepted is entirely up
to the maintainers. In general, finished specs should be considered complete
and done; however, priorities, details, or other situations may evolve over
time and as such introduce the need to make updates.

docs/specs/template.md
# Title
## Objective
One sentence explanation of the feature.
## Overview
Background and details about the feature.
## Keywords
A few items to specify what areas of Telegraf this spec affects (e.g. outputs,
inputs, processors, aggregators, agent, packaging, etc.)
## Is/Is-not
## Prior art
## Open questions

# Plugin and Plugin Option Deprecation
## Objective
Specify the process of deprecating and removing plugins, plugin options,
option values, and features.
## Keywords
procedure, removal, all plugins
## Overview
Over time the number of plugins, plugin options and plugin features grow and
some of those plugins or options are either not relevant anymore, have been
superseded or subsumed by other plugins or options. To be able to remove those,
this specification defines a process to deprecate plugins, plugin options and
plugin features including a timeline and minimal time-frames. Additionally, the
specification defines a framework to annotate deprecations in the code and
inform users about such deprecations.
## User experience
In the deprecation phase a warning will be shown at Telegraf startup with the
following content
```text
Plugin "inputs.logparser" deprecated since version 1.15.0 and will be removed in 1.40.0: use 'inputs.tail' with 'grok' data format instead
```
Similar warnings will be shown when removing plugin options or option values.
This provides users with time to replace the deprecated plugin in their
configuration file.
After the shown release (`v1.40.0` in this case) the warning will be promoted
to an error preventing Telegraf from starting. The user now has to adapt the
configuration file to start Telegraf.
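The promotion from warning to error can be sketched as a plain version
comparison; `parseVersion` and `isRemoved` below are illustrative helpers for
this spec, not part of Telegraf's actual code:

```golang
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits an "x.y.z" version string into its numeric parts.
func parseVersion(v string) [3]int {
	var out [3]int
	for i, p := range strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3) {
		out[i], _ = strconv.Atoi(p)
	}
	return out
}

// isRemoved reports whether the running version has reached the removal
// version, i.e. the deprecation warning must be promoted to an error.
func isRemoved(current, removalIn string) bool {
	c, r := parseVersion(current), parseVersion(removalIn)
	for i := 0; i < 3; i++ {
		if c[i] != r[i] {
			return c[i] > r[i]
		}
	}
	return true // the removal release itself already errors
}

func main() {
	fmt.Println(isRemoved("1.39.2", "1.40.0")) // false: warn only
	fmt.Println(isRemoved("1.40.0", "1.40.0")) // true: refuse to start
}
```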
## Time frames and considerations
When deprecating parts of Telegraf, it is important to provide users with enough
time to migrate to alternative solutions before actually removing those parts.
In general, plugins, plugin options or option values should only be deprecated
if a suitable alternative exists! In those cases, the deprecations should
predate the removal by at least one and a half years. In current release terms
this corresponds to six minor-versions. However, there might be circumstances
requiring a prolonged time between deprecation and removal to ensure a smooth
transition for users.
In versions between the deprecation and removal of plugins, plugin options or
option values, Telegraf must log a *warning* on startup including information
about the version introducing the deprecation, the version of removal and a
user-facing hint on suitable replacements. In this phase Telegraf should
operate normally even with deprecated plugins, plugin options or option values
being set in the configuration files.
Starting from the removal version, Telegraf must show an *error* message for
deprecated plugins present in the configuration including all information listed
above. Removed plugin options and option values should be handled as invalid
settings in the configuration files and must lead to an error. In this phase,
Telegraf should *stop running* until all deprecated plugins, plugin options and
option values are removed from the configuration files.
## Deprecation Process
The deprecation process comprises the steps below.
### File issue
In the filed issue you should outline which plugin, plugin option or feature
you want to deprecate and *why*! Determine in which version the plugin should
be removed.
Try to reach an agreement in the issue before continuing and get a sign off
from the maintainers!
### Submit deprecation pull-request
Send a pull request adding deprecation information to the code and update the
plugin's `README.md` file. Depending on what you want to deprecate this
comprises different locations and steps as detailed below.
Once the deprecation pull-request is merged and Telegraf is released, we have
to wait for the targeted Telegraf version before actually removing the code.
#### Deprecating a plugin
When deprecating a plugin you need to add an entry to the `deprecations.go` file
in the respective plugin category with the following format
```golang
"<plugin name>": {
Since: "<x.y.z format version of the next minor release>",
RemovalIn: "<x.y.z format version of the plugin removal>",
Notice: "<user-facing hint e.g. on replacements>",
},
```
If you, for example, want to deprecate the `inputs.logparser` plugin you should add
```golang
"logparser": {
Since: "1.15.0",
  RemovalIn: "1.40.0",
Notice: "use 'inputs.tail' with 'grok' data format instead",
},
```
to `plugins/inputs/deprecations.go`. By doing this, Telegraf will show a
deprecation warning to the user starting from version `1.15.0` including the
`Notice` you provided. The plugin can then be removed in version `1.40.0`.
Additionally, you should update the plugin's `README.md` adding a paragraph
mentioning since when the plugin is deprecated, when it will be removed and a
hint to alternatives or replacements. The paragraph should look like this
```text
**Deprecated in version v1.15.0 and scheduled for removal in v1.40.0**:
Please use the [tail][] plugin with the [`grok` data format][grok parser]
instead!
```
#### Deprecating an option
To deprecate a plugin option, remove the option from the `sample.conf` file and
add the deprecation information to the structure field in the code. If you, for
example, want to deprecate the `ssl_enabled` option in `inputs.example` you
should add
```golang
type Example struct {
...
SSLEnabled bool `toml:"ssl_enabled" deprecated:"1.3.0;1.40.0;use 'tls_*' options instead"`
}
```
to schedule the setting for removal in version `1.40.0`. The last element of
the `deprecated` tag is a user-facing notice similar to plugin deprecation.
#### Deprecating an option-value
Sometimes, certain option values become deprecated or superseded by other
options or values. To deprecate those option values, remove them from
`sample.conf` and add the deprecation info in the code if the deprecated value
is *actually used* via
```golang
func (e *Example) Init() error {
...
if e.Mode == "old" {
models.PrintOptionDeprecationNotice(telegraf.Warn, "inputs.example", "mode", telegraf.DeprecationInfo{
Since: "1.23.1",
RemovalIn: "1.40.0",
Notice: "use 'v1' instead",
})
}
...
return nil
}
```
This will show a warning with the user-facing notice if the deprecated `old`
value is used for the `mode` setting in `inputs.example`.
### Submit pull-request for removing code
Once the plugin, plugin option or option-value is deprecated, we have to wait
for the `RemovalIn` release to remove the code. In the examples above, this
would be version `1.40.0`. After all scheduled bugfix-releases are done, with
`1.40.0` being the next release, you can create a pull-request to actually
remove the deprecated code.
Please make sure you remove the plugin, plugin option or option value and the
code referencing those. This might also comprise the `all` files of your plugin
category, test-cases including those of other plugins, README files or other
documentation. For removed plugins, please keep the deprecation info in
`deprecations.go` so users can find a reference when switching from a really
old version.
Make sure you add an `Important Changes` section to the `CHANGELOG.md` file
describing the removal with a reference to your PR.

# Telegraf Custom-Builder
## Objective
Provide a tool to build a customized, smaller version of Telegraf with only
the required plugins included.
## Keywords
tool, binary size, customization
## Overview
The Telegraf binary continues to grow as new plugins and features are added
and dependencies are updated. Users running on resource-constrained systems,
such as embedded systems or containers, might suffer from this growth.
This document specifies a tool to build a smaller Telegraf binary tailored to
the plugins configured and actually used, removing unnecessary and unused
plugins. The implementation should be able to cope with configured parsers and
serializers including defaults for those plugin categories. Valid Telegraf
configuration files, including directories containing such files, are the input
to the customization process.
The customization tool might not be available for older versions of Telegraf.
Furthermore, the degree of customization and thus the effective size reduction
might vary across versions. The tool must create a single static Telegraf
binary. Distribution packages or containers are *not* targeted.
## Prior art
[PR #5809](https://github.com/influxdata/telegraf/pull/5809) and
[telegraf-lite-builder](https://github.com/influxdata/telegraf/tree/telegraf-lite-builder/cmd/telegraf-lite-builder):
- Uses docker
- Uses browser:
- Generates a webpage to pick what options you want. User chooses plugins;
does not take a config file
- Builds a binary, then minifies it by stripping and compressing it
- Does some steps that belong in makefile, not builder
- Special case for upx
- Makes gzip, zip, tar.gz
- Uses gopkg.in?
- Can also work from the command line
[PR #8519](https://github.com/influxdata/telegraf/pull/8519)
- User chooses plugins OR provides a config file
[powers/telegraf-build](https://github.com/powersj/telegraf-build)
- User chooses plugins OR provides a config file
- Currently kept in separate repo
- Undoes changes to all.go files
[rawkode/bring-your-own-telegraf](https://github.com/rawkode/bring-your-own-telegraf)
- Uses docker
## Additional information
You might be able to further reduce the binary size of Telegraf by removing
debugging information. This is done by adding `-w` and `-s` to the linker
flags before building: `LDFLAGS="-w -s"`.
However, please note that this removes information helpful for debugging issues
in Telegraf.
Additionally, you can use a binary packer such as [UPX](https://upx.github.io/)
to reduce the required *disk* space. This compresses the binary and decompresses
it again at runtime. However, this does not reduce memory footprint at runtime.

# Plugin State-Persistence
## Objective
Retain the state of stateful plugins across restarts of Telegraf.
## Keywords
framework, plugin, stateful, persistence
## Overview
Telegraf contains a number of plugins that hold an internal state while
processing. For some of these plugins this state is important for efficient
processing, like the read position in a large file, or when continuously
querying data from a stateful peer requiring, for example, an offset or the
last queried timestamp. For those plugins it is important to persist their
internal state over restarts of Telegraf.
It is intended to
- allow for opt-in of plugins to store a state per plugin _instance_
- restore the state for each plugin instance at startup
- track the plugin instances over restarts to relate the stored state with a
corresponding plugin instance
- automatically compute plugin instance IDs based on the plugin configuration
- provide a way to manually specify instance IDs by the user
- _not_ restore states if the plugin configuration changed between runs
- make implementation easy for plugin developers
- make no assumption on the state _content_
The persistence will use the following steps:
- Compute a unique ID for each of the plugin _instances_
- Start up the Telegraf plugins, calling `Init()`, etc.
- Initialize persistence framework with the user specified `statefile` location
and load the state if present
- Determine all stateful plugin instances by fulfilling the `StatefulPlugin`
interface
- Restore plugin states (if any) for each plugin ID present in the state-file
- Run data collection, etc.
- On shutdown, stop all Telegraf plugins, calling `Stop()` or `Close()`
  depending on the plugin type
- Query the state of all registered stateful plugins
- Create an overall state-map with the plugin instance ID as a key and the
serialized plugin state as value.
- Marshal the overall state-map and store to disk
Potential users of this functionality are plugins continuously querying
endpoints using information from a previous query (e.g. timestamps, offsets,
transaction tokens, etc.). The following plugins are known to have an internal
state; this is not a comprehensive list.
- `inputs.win_eventlog` ([PR #8281](https://github.com/influxdata/telegraf/pull/8281))
- `inputs.docker_log` ([PR #7749](https://github.com/influxdata/telegraf/pull/7749))
- `inputs.tail` (file offset)
- `inputs.cloudwatch` (`windowStart`/`windowEnd` parameters)
- `inputs.stackdriver` (`prevEnd` parameter)
### Plugin ID computation
The plugin ID is computed based on the configuration options specified for the
plugin instance. To generate the ID all settings are extracted as `string`
key-value pairs with the option name being the key and the value being the
configuration option setting. For nested configuration options, e.g. if the
plugin has a sub-table, the options are flattened with a canonical key. The
canonical key elements must be concatenated with a dot (`.`) separator. In case
the sub-element is a list of tables, the key must include the index of each
table prefixed by a hash sign i.e. `<parent>#<index>.<child>`.
The resulting key-value pairs of configuration options are then sorted by the
key in lexical order to make the resulting ID invariant against changes in the
order of configuration options. The key and the value of each pair are joined
by a colon (`:`) to a single `string`.
Finally, a SHA256 sum is computed across all key-value strings separated by a
`null` byte. The HEX representation of the resulting SHA256 is used as the
plugin instance ID.
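A minimal sketch of this ID scheme, assuming the options have already been
flattened to string key-value pairs as described above:

```golang
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// instanceID computes a plugin-instance ID from flattened option key-value
// pairs: join key and value with ':', sort the pairs, separate them with a
// null byte, and hex-encode the SHA256 sum of the result.
func instanceID(options map[string]string) string {
	pairs := make([]string, 0, len(options))
	for k, v := range options {
		pairs = append(pairs, k+":"+v)
	}
	sort.Strings(pairs)
	sum := sha256.Sum256([]byte(strings.Join(pairs, "\x00")))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Nested options use canonical dotted keys; lists of tables include
	// the table index as "<parent>#<index>.<child>".
	id := instanceID(map[string]string{
		"interval":         "10s",
		"server#0.address": "http://localhost:8086",
	})
	fmt.Println(id)
}
```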
### State serialization format
The overall Telegraf state maps the plugin IDs (keys) to the serialized state
of the corresponding plugin (values). The state data returned by stateful
plugins is serialized to JSON. The resulting byte-sequence is used as the value
for the overall state. On-disk, the overall state of Telegraf is stored as JSON.
To restore the state of a plugin, the overall Telegraf state is first
deserialized from the on-disk JSON data and a lookup for the plugin ID is
performed in the resulting map. The value, if found, is then deserialized to the
plugin's state data-structure and provided to the plugin after calling `Init()`.
## Is / Is-not
### Is
- A framework to persist states over restarts of Telegraf
- A simple local state store
- A way to restore plugin states between restarts without configuration changes
- A unified API for plugins to use when requiring persistence of a state
### Is-Not
- A remote storage framework
- A way to store anything beyond fundamental plugin states
- A data-store or database
- A way to reassign plugin states if their configuration changes
- A tool for interactively adding/removing/modifying states of plugins
- A persistence guarantee beyond clean shutdown (i.e. no crash resistance)
## Prior art
- [PR #8281](https://github.com/influxdata/telegraf/pull/8281): Stores Windows
event-log bookmarks in the registry
- [PR #7749](https://github.com/influxdata/telegraf/pull/7749): Stores container
ID and log offset to a file at a user-provided path
- [PR #7537](https://github.com/influxdata/telegraf/pull/7537): Provides a
global state object and periodically queries plugin states to store the state
object to a JSON file. This approach does not provide an ID per plugin
_instance_, so it seems there is only a single state for a plugin _type_
- [PR #9476](https://github.com/influxdata/telegraf/pull/9476): Registers
stateful plugins with a persister and automatically assigns an ID to plugin
_instances_ based on the configuration. The approach also allows overwriting
the automatic ID, e.g. with user-specified data. It uses the plugin instance ID
to store/restore state for the same plugin instance, querying the plugin state
on shutdown and writing it to a file (currently JSON).

# Configuration Migration
## Objective
Provides a subcommand and framework to migrate configurations containing
deprecated settings to a corresponding recent configuration.
## Keywords
configuration, deprecation, telegraf command
## Overview
With the deprecation framework of [TSD-001](tsd-001-deprecation.md) implemented
we see more and more plugins and options being scheduled for removal in the
future. Furthermore, deprecations become visible to the user due to the warnings
issued for removed plugins, plugin options and plugin option values.
To aid the user in mitigating deprecated configuration settings, this
specification proposes adding a `migrate` sub-command to the Telegraf `config`
command to automatically migrate the user's existing configuration files away
from deprecated settings to an equivalent, recent
configuration. Furthermore, the specification describes the layout and
functionality of a plugin-based migration framework to implement migrations.
### `migrate` sub-command
The `migrate` sub-command of the `config` command should take a set of
configuration files and configuration directories and apply available migrations
to deprecated plugins, plugin options or plugin option-values in order to
generate new configuration files that do not make use of deprecated options.
In the process, the migration procedure must ensure that only plugins with
applicable migrations are modified. Existing configuration must be kept and not
be overwritten without manual confirmation of the user. This should be
accomplished by storing modified configuration files with a `.migrated` suffix
and leaving it to the user to overwrite the existing configuration with the
generated counterparts. If no migration is applied in a configuration file, the
command might not generate a new file and leave the original file untouched.
During migration, the configuration, plugin behavior, resulting metrics and
comments should be kept on a best-effort basis. Telegraf must inform the user
about applied migrations and potential changes in the plugin behavior or
resulting metrics. If a plugin cannot be automatically migrated but requires
manual intervention, Telegraf should inform the user.
### Migration implementations
To implement migrations for deprecated plugins, plugin option or plugin option
values, Telegraf must provide a plugin-based infrastructure to register and
apply implemented migrations based on the plugin-type. Only one migration per
plugin-type must be registered.
Developers must implement the required interfaces and register the migration
to the mentioned framework. The developer must provide the possibility to
exclude the migration at build-time according to
[TSD-002](tsd-002-custom-builder.md). Existing migrations can be extended but
must be cumulative such that any previous configuration migration functionality
is kept.
Resulting configurations should generate metrics equivalent to the previous
setup also making use of metric selection, renaming and filtering mechanisms.
Where this is not possible, the user must be clearly informed about what to
expect and which differences might occur.
A migration may also be purely informative, i.e. notify the user that a plugin
has to be migrated manually, and should point users to additional information.
Deprecated plugins and plugin options must be removed from the migrated
configuration.

# Telegraf Output Buffer Strategy
## Objective
Introduce a new agent-level config option to choose a disk buffer strategy for
output plugin metric queues.
## Overview
Currently, when a Telegraf output metric queue fills, either because incoming
metrics arrive too fast or because of issues writing to the output, the oldest
metrics are overwritten and never written to the output. This specification
defines a set of options to make this output queue more durable by persisting
pending metrics to disk rather than keeping them only in a limited-size
in-memory queue.
## Keywords
output plugins, agent configuration, persist to disk
## Agent Configuration
The configuration is at the agent-level, with options for:
- **Memory**, the current implementation, with no persistence to disk
- **Write-through**, all metrics are also written to disk using a
Write Ahead Log (WAL) file
- **Disk-overflow**, when the memory buffer fills, metrics are flushed to a
WAL file to avoid dropping overflow metrics
There is also an option to specify the directory in which to store the WAL
files on disk, with a default value. These settings are global; no change means
memory-only mode, retaining the current behavior.
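In configuration-file terms, the options could look like the following sketch;
the option names and values are illustrative only, not the final interface:

```toml
[agent]
  ## Buffer strategy: "memory" (default, current behavior),
  ## "write_through", or "disk_overflow" -- names are placeholders.
  buffer_strategy = "disk_overflow"

  ## Directory holding the per-output WAL files; a default applies
  ## when unset.
  buffer_directory = "/var/lib/telegraf/buffer"
```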
## Metric Ordering and Tracking
Tracking metrics will be accepted on a successful write to the output
destination. Metrics will be written to their appropriate output in the order
they are received in the buffer regardless of which buffer strategy is chosen.
## Disk Utilization and File Handling
Each output plugin has its own in-memory output buffer, and therefore will
have its own WAL file for buffer persistence. This file may not exist if
Telegraf is able to write all of its metrics without filling the in-memory
buffer in disk-overflow mode, and it never exists in memory mode.
Telegraf should use one file per output plugin, and remove entries from the
WAL file as they are written to the output.
Telegraf will not make any attempt to limit the size on disk taken by these
files beyond cleaning up WAL files for metrics that have successfully been
flushed to their output destination. It is the user's responsibility to ensure
these files do not entirely fill the disk, both during Telegraf uptime and
with lingering files from previous instances of the program.
If WAL files exist for an output plugin from previous instances of Telegraf,
they will be picked up and flushed before any new metrics that are written
to the output. This is to ensure that these metrics are not lost, and to
ensure that output write order remains consistent.
Telegraf must additionally provide a way to manually flush WAL files via
some separate plugin or similar. This could be used as a way to ensure that
WAL files are properly written in the event that the output plugin changes
and the WAL file is unable to be detected by a new instance of Telegraf.
This plugin should not be required for use to allow the buffer strategy to
work.
## Is/Is-not
- Is a way to prevent metrics from being dropped due to a full memory buffer
- Is not a way to guarantee data safety in the event of a crash or system failure
- Is not a way to manage file system allocation size, file space will be used
until the disk is full
## Prior art
- [Initial issue](https://github.com/influxdata/telegraf/issues/802)
- [Loose specification issue](https://github.com/influxdata/telegraf/issues/14805)

# Startup Error Behavior
## Objective
Unified, configurable behavior on retriable startup errors.
## Keywords
inputs, outputs, startup, error, retry
## Overview
Many Telegraf plugins connect to an external service either on the same machine
or via network. On automated startup of Telegraf (e.g. via service) there is no
guarantee that those services are fully started yet, especially when they reside
on a remote host. More and more plugins implement mechanisms to retry reaching
their related service if they failed to do so on startup.
This specification intends to unify the naming of configuration-options, the
values of those options, and their semantic meaning. It describes the behavior
for the different options on handling startup-errors.
Startup errors are all errors occurring in calls to `Start()` for inputs and
service-inputs or `Connect()` for outputs. The behaviors described below
should only be applied in cases where the plugin *explicitly* states that a
startup error is *retriable*. This includes, for example, network errors
indicating that the host or service is not yet reachable, or external
resources, like a machine or file, that are not yet available but might become
available later. To indicate a retriable startup error the plugin should return
a predefined error-type.
In cases where the plugin cannot generally determine whether an error is
retriable, it might add configuration settings to let the user configure that
property, for example where an error code indicates a fatal, non-recoverable
error in one case but a non-fatal, recoverable error in another case.
## Configuration Options and Behaviors
Telegraf must introduce a unified `startup_error_behavior` configuration option
for inputs and output plugins. The option is handled directly by the Telegraf
agent and is not passed down to the plugins. The setting must be available on a
per-plugin basis and defines how Telegraf behaves on startup errors.
For all option values, Telegraf might retry starting the plugin a limited
number of times during the startup phase before actually processing data. This
corresponds to the current behavior of Telegraf retrying three times with a
fifteen-second interval before continuing to process the plugins.
### `error` behavior
The `error` setting for the `startup_error_behavior` option causes Telegraf to
fail and exit on startup errors. This must be the default behavior.
### `retry` behavior
With the `retry` setting for the `startup_error_behavior` option, Telegraf must
*not* fail on startup errors and should continue running. Telegraf must retry
starting the failed plugin in each gather or write cycle, for inputs or for
outputs respectively, for an unlimited number of times. Neither the
plugin's `Gather()` nor `Write()` method is called as long as the startup did
not succeed. Metrics sent to an output plugin will be buffered until the plugin
is actually started. If the metric-buffer limit is reached **metrics might be
dropped**!
In case a plugin signals a partially successful startup, e.g. a subset of the
given endpoints are reachable, Telegraf must try to fully startup the remaining
endpoints by calling `Start()` or `Connect()`, respectively, until full startup
is reached **and** trigger the plugin's `Gather()` or `Write()` methods.
### `ignore` behavior
When using the `ignore` setting for the `startup_error_behavior` option Telegraf
must *not* fail on startup errors and should continue running. On startup error,
Telegraf must ignore the plugin as-if it was not configured at all, i.e. the
plugin must be completely removed from processing.
### `probe` behavior
When using the `probe` setting for the `startup_error_behavior` option Telegraf
must *not* fail on startup errors and should continue running. On startup error,
Telegraf must ignore the plugin as-if it was not configured at all, i.e. the
plugin must be completely removed from processing, similar to the `ignore`
behavior. Additionally, Telegraf must probe the plugin (as defined in
[TSD-009][tsd_009]) after startup, if it implements the `ProbePlugin` interface.
If probing is available *and* returns an error Telegraf must *ignore* the
plugin as-if it was not configured at all.
[tsd_009]: /docs/specs/tsd-009-probe-on-startup.md
## Plugin Requirements
Plugins participating in handling startup errors must implement the `Start()`
or `Connect()` method for inputs and outputs respectively. Those methods must be
safe to be called multiple times during retries without leaking resources or
causing issues in the service used.
Furthermore, the `Close()` method of the plugins must be safe to be called for
cases where the startup failed without causing panics.
The plugins should return a `nil` error during startup to indicate a successful
startup or a retriable error (via predefined error type) to enable the defined
startup error behaviors. A non-retriable error (via predefined error type) or
a generic error will bypass the startup error behaviors and Telegraf must fail
and exit in the startup phase.
## Related Issues
- [#8586](https://github.com/influxdata/telegraf/issues/8586) `inputs.postgresql`
- [#9778](https://github.com/influxdata/telegraf/issues/9778) `outputs.kafka`
- [#13278](https://github.com/influxdata/telegraf/issues/13278) `outputs.cratedb`
- [#13746](https://github.com/influxdata/telegraf/issues/13746) `inputs.amqp_consumer`
- [#14365](https://github.com/influxdata/telegraf/issues/14365) `outputs.postgresql`
- [#14603](https://github.com/influxdata/telegraf/issues/14603) `inputs.nvidia-smi`
- [#14603](https://github.com/influxdata/telegraf/issues/14603) `inputs.rocm-smi`

# URL-Based Config Behavior
## Objective
Define the retry and reload behavior of remote URLs that are passed as config to
Telegraf. In terms of retry, currently Telegraf will attempt to load a remote
URL three times and then exit. In terms of reload, Telegraf does not have the
capability to reload remote URL-based configs. This spec proposes options that
extend both capabilities.
## Keywords
config, error, retry, reload
## Overview
Telegraf allows for loading configurations from local files, directories, and
files via a URL. In order to allow situations where a configuration file is not
yet available or due to a flaky network, the first proposal is to introduce a
new CLI flag: `--config-url-retry-attempts`. This flag would continue to default
to three and would specify the number of retries to attempt to get a remote URL
during the initial startup of Telegraf.
```sh
--config-url-retry-attempts=3 Number of times to attempt to obtain a remote
configuration via a URL during startup. Set to
-1 for unlimited attempts.
```
These attempts would block Telegraf from starting up completely until success,
or until the attempts are exhausted and Telegraf exits.
Once Telegraf is up and running, users can use the `--watch` flag to enable
watching local files for changes and, if/when changes are made, reload
Telegraf with the new configuration. For remote URLs, I propose a new CLI
flag: `--config-url-watch-interval`. This flag would set an internal timer
that, when it fires, checks for an update to the remote URL file.
```sh
--config-url-watch-interval=0s Time duration to check for updates to URL based
configuration files. Disabled by default.
```
At each interval, Telegraf would send an HTTP HEAD request to the configuration
URL. Here is an example curl HEAD request and its output:
```sh
$ curl --head http://localhost:8000/config.toml
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/3.12.3
Date: Mon, 29 Apr 2024 18:18:56 GMT
Content-type: application/octet-stream
Content-Length: 1336
Last-Modified: Mon, 29 Apr 2024 11:44:19 GMT
```
The proposal then is to store the Last-Modified value when Telegraf first
obtains the file and compare the value at each interval. There is no need to
parse the value; just store the raw string. If there is a difference, trigger
a reload.
If anything other than a 2xx response code is returned from the HEAD request,
Telegraf would print a warning message and retry at the next interval.
Telegraf would continue to run with the existing configuration, unchanged.
If the value of Last-Modified is empty, which is very unlikely, then Telegraf
would ignore updates to this configuration file. Telegraf would print a
warning message once about the missing field.
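The Last-Modified bookkeeping described above can be sketched as follows;
`lastModified` and `decide` are hypothetical helper names for this spec, not
part of Telegraf:

```go
package main

import (
	"fmt"
	"net/http"
)

// lastModified issues the HEAD request from the proposal and returns the raw
// Last-Modified header without parsing it.
func lastModified(client *http.Client, url string) (string, error) {
	resp, err := client.Head(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		// Non-2xx: warn and retry at the next interval.
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return resp.Header.Get("Last-Modified"), nil
}

// decide compares the stored raw header value with the latest one and reports
// whether a reload should be triggered. An empty latest value means the
// header is missing, so updates to this file are ignored.
func decide(stored, latest string) (newStored string, reload bool) {
	if latest == "" {
		return stored, false
	}
	return latest, stored != "" && latest != stored
}

func main() {
	// First observation: store the raw string, no reload.
	stored, reload := decide("", "Mon, 29 Apr 2024 11:44:19 GMT")
	fmt.Println(stored, reload)
	// Value changed at a later interval: trigger a reload.
	_, reload = decide(stored, "Mon, 29 Apr 2024 12:00:00 GMT")
	fmt.Println(reload)
}
```

The interval timer would call `lastModified` and feed the result to `decide`;
only a changed raw string triggers the reload.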
## Relevant Issues
* Configuration capabilities to retry for loading config via URL #[8854][]
* Telegraf reloads URL-based/remote config on a specified interval #[8730][]
[8854]: https://github.com/influxdata/telegraf/issues/8854
[8730]: https://github.com/influxdata/telegraf/issues/8730
# Partial write error handling
## Objective
Provide a way to pass information about partial metric write errors from an
output to the output model.
## Keywords
output plugins, write, error, output model, metric, buffer
## Overview
The output model wrapping each output plugin buffers metrics to be able to batch
those metrics for more efficient sending. In each flush cycle, the model
collects a batch of metrics and hands it over to the output plugin for writing
through the `Write` method. Currently, if writing succeeds (i.e. no error is
returned), _all metrics of the batch_ are removed from the buffer and are marked
as __accepted__ both in terms of statistics as well as in tracking-metric terms.
If writing fails (i.e. any error is returned), _all metrics of the batch_ are
__kept__ in the buffer for requeueing them in the next write cycle.
Issues arise when an output plugin can write only some metrics of a batch, but
not all, to its service endpoint, e.g. due to metrics not being serializable
or due to metrics being selectively rejected by the service on the server
side. This might happen when reaching submission limits, violating service
constraints, e.g. by out-of-order sends, or due to invalid characters in the
serialized metric.
In those cases, an output currently is only able to accept or reject the
_complete batch of metrics_ as there is no mechanism to inform the model (and
in turn the buffer) that only _some_ of the metrics in the batch were failing.
As a consequence, outputs often _accept_ the batch to avoid requeueing the
failing metrics in the next flush interval. This distorts the statistics of
accepted metrics and causes misleading log messages claiming all metrics were
written successfully, which is not true. Even worse, for outputs ending up
with partial writes, e.g. when only the first half of the metrics can be
written to the service, there is no way of telling the model to selectively
accept the actually written metrics. In turn, those outputs must internally
buffer the remaining, unwritten metrics, duplicating buffering logic and
adding code complexity.
This specification aims at defining the handling of partially successful writes
and introduces the concept of a special _partial write error_ type to reflect
partial writes and partial serialization overcoming the aforementioned issues
and limitations.
To do so, the _partial write error_ error type must contain a list of
successfully written metrics, to be marked __accepted__, both in terms of
statistics as well as in terms of metric tracking, and must be removed from the
buffer. Furthermore, the error must contain a list of metrics that cannot be
sent or serialized and cannot be retried. These metrics must be marked as
__rejected__, both in terms of statistics as well as in terms of metric
tracking, and must be removed from the buffer.
The error may contain a list of not-yet-written metrics to be __kept__ for the
next write cycle. Those metrics must not be marked and must be kept in the
buffer. If the error does not contain the list of not-yet-written metrics,
this list must be inferred from the accept and reject lists mentioned above.
To allow the model and the buffer to correctly handle tracking metrics ending
up in the buffer and in the output, the tracking information must be preserved
in the communication between the output plugin, the model, and the buffer
through the specified error. To do so, all metric lists should be communicated
as indices into the batch so that tracking metrics can be handled correctly.
For backward compatibility and simplicity, output plugins can return a `nil`
error to indicate that __all__ metrics of the batch are __accepted__.
Similarly, returning an error that is _not_ a _partial write error_ indicates
that __all__ metrics of the batch should be __kept__ in the buffer for the
next write cycle.
## Related Issues
- [issue #11942](https://github.com/influxdata/telegraf/issues/11942) for
contradicting log messages
- [issue #14802](https://github.com/influxdata/telegraf/issues/14802) for
rate-limiting multiple batch sends
- [issue #15908](https://github.com/influxdata/telegraf/issues/15908) for
infinite loop if single metrics cannot be written
# Probing plugins after startup
## Objective
Allow Telegraf to probe plugins during startup to enable enhanced plugin error
detection, such as checking the availability of hardware or services.
## Keywords
inputs, outputs, startup, probe, error, ignore, behavior
## Overview
When plugins are first instantiated, Telegraf calls the plugin's `Start()`
method (for inputs) or `Connect()` (for outputs), which initializes the plugin
based on its configuration options and the running environment. It is
sometimes the case that while the initialization step succeeds, the upstream
service on which the plugin relies is not actually running, or cannot be
communicated with due to incorrect configuration or environmental problems. In
such situations Telegraf does not detect that the plugin's upstream service is
not functioning properly, and thus it will continually call the plugin during
each `Gather()` iteration. This often has the effect of polluting journald and
system logs with voluminous error messages, which creates issues for system
administrators who rely on such logs to identify other, unrelated system
problems.
More background discussion on this option, including other possible avenues, can
be viewed [here](https://github.com/influxdata/telegraf/issues/16028).
## Probing
Probing is an action whereby the plugin ensures, on a best-effort basis, that
it will be fully functional. This may comprise communicating with its external
service, or trying to access required devices, entities, or executables, to
ensure that the plugin will not produce errors during, e.g., data collection
or data output. Probing must *not* produce, process, or output any metrics.
Plugins that support probing must implement the `ProbePlugin` interface. Such
plugins must behave in the following manner:
1. Return an error if the external dependencies (hardware, services,
executables, etc.) of the plugin are not available.
2. Return an error if information cannot be gathered (in the case of inputs) or
sent (in the case of outputs) due to unrecoverable issues. For example, invalid
authentication, missing permissions, or non-existent endpoints.
3. Otherwise, return `nil` indicating the plugin will be fully functional.
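The interface and a probing input could be sketched as follows; the
`exampleInput` type and its `serviceUp` field are hypothetical stand-ins for a
real availability check:

```go
package main

import (
	"errors"
	"fmt"
)

// ProbePlugin is the interface from this spec; plugins implementing it can be
// probed after Start()/Connect().
type ProbePlugin interface {
	Probe() error
}

// exampleInput sketches an input whose external service may be unavailable.
// The serviceUp flag stands in for a real connectivity or permission check.
type exampleInput struct {
	serviceUp bool
}

// Probe checks availability without producing metrics or mutating the
// plugin's internal state or the external service state.
func (p *exampleInput) Probe() error {
	if !p.serviceUp {
		return errors.New("service unreachable")
	}
	return nil
}

func main() {
	for _, p := range []ProbePlugin{
		&exampleInput{serviceUp: true},
		&exampleInput{serviceUp: false},
	} {
		fmt.Println(p.Probe()) // nil for the healthy plugin, an error otherwise
	}
}
```

After a successful `Start()`/`Connect()`, the agent would type-assert for
`ProbePlugin` and call `Probe()`, applying the configured error behavior when
it returns an error.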
## Plugin Requirements
Plugins that allow probing must implement the `ProbePlugin` interface. The
exact implementation depends on the plugin's functionality and requirements,
but generally it should take the same actions as it would during normal
operation, e.g. calling `Gather()` or `Write()`, and check whether errors
occur. If probing fails, it must be safe to call the plugin's `Close()`
method.
Input plugins must *not* produce metrics, and output plugins must *not* send
any metrics to the service. Plugins must *not* influence later data processing
or collection by modifying the plugin's internal state or the external state
of the service or hardware. For example, file offsets or other service states
must be reset so that no data is lost during the first gather or write cycle.
Plugins must return `nil` upon successful probing or an error otherwise.
## Related Issues
- [#16028](https://github.com/influxdata/telegraf/issues/16028)
- [#15916](https://github.com/influxdata/telegraf/pull/15916)
- [#16001](https://github.com/influxdata/telegraf/pull/16001)