
Adding upstream version 1.34.4.

Signed-off-by: Daniel Baumann <daniel@debian.org>
Commit 4978089aab (parent e393c3af3f), authored by Daniel Baumann,
2025-05-24 07:26:29 +02:00; signed by daniel, GPG key ID FBB4F0E80A80222F.
4963 changed files with 677545 additions and 0 deletions

docs/specs/README.md
# Telegraf Specification Overview
## Objective
Define and layout the Telegraf specification process.
## Overview
The general goal of a spec is to detail the work that needs to get accomplished
for a new feature. A developer should be able to pick up a spec and have a
decent understanding of the objective, the steps required, and most of the
general design decisions.
The specs can then live in the Telegraf repository to share and involve the
community in the process of planning larger changes or new features. The specs
also serve as a public historical record for changes.
## Process
The general workflow is for a user to put up a PR with a spec outlining the
task, have any discussion in the PR, reach consensus, and ultimately commit
the finished spec to the repo.
While researching a new feature may involve an investment of time, writing the
spec should be relatively quick. It should not take hours of time.
## Spec naming
Please name the actual file prefixed with `tsd` and the next available
number, for example:
* tsd-001-agent-write-ahead-log.md
* tsd-002-inputs-apache-increase-timeout.md
* tsd-003-serializers-parquet.md
All lower-case and separated by hyphens.
## What belongs in a spec
A spec should involve the creation of a markdown file with at least an objective
and overview:
* Objective (required) - One sentence headline
* Overview (required) - Explain the reasoning for the new feature and any
  historical information. Answer why this is needed.
Please feel free to make a copy of `template.md` and start with that.
The user is free to add additional sections or parts in order to express and
convey a new feature. For example this might include:
* Keywords - Help identify what the spec is about
* Is/Is-not - Explicitly state what this change includes and does not include
* Prior Art - Point at existing or previous PRs, issues, or other works that
demonstrate the feature or need for it.
* Open Questions - Section with open questions that can get captured in
updates to the PR
## Changing existing specs
Small, non-substantive changes, such as grammar or formatting fixes, are
gladly accepted.
After a feature is complete it may make sense to come back and update a spec
based on the final result.
Whether substantive edits to an existing spec will be accepted is entirely up
to the maintainers. In general, finished specs should be considered complete
and done; however, priorities, details, or other situations may evolve over
time and as such introduce the need to make updates.

docs/specs/template.md
# Title
## Objective
One sentence explanation of the feature.
## Overview
Background and details about the feature.
## Keywords
A few items to specify what areas of Telegraf this spec affects (e.g. outputs,
inputs, processors, aggregators, agent, packaging, etc.)
## Is/Is-not
## Prior art
## Open questions

# Plugin and Plugin Option Deprecation
## Objective
Specify the process of deprecating and removing plugins, plugin options,
option values, and features.
## Keywords
procedure, removal, all plugins
## Overview
Over time the number of plugins, plugin options and plugin features grow and
some of those plugins or options are either not relevant anymore, have been
superseded or subsumed by other plugins or options. To be able to remove those,
this specification defines a process to deprecate plugins, plugin options and
plugin features including a timeline and minimal time-frames. Additionally, the
specification defines a framework to annotate deprecations in the code and
inform users about such deprecations.
## User experience
In the deprecation phase a warning will be shown at Telegraf startup with the
following content
```text
Plugin "inputs.logparser" deprecated since version 1.15.0 and will be removed in 1.40.0: use 'inputs.tail' with 'grok' data format instead
```
Similar warnings will be shown when removing plugin options or option values.
This provides users with time to replace the deprecated plugin in their
configuration file.
After the shown release (`v1.40.0` in this case) the warning will be promoted
to an error preventing Telegraf from starting. The user now has to adapt the
configuration file to start Telegraf.
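The promotion from warning to error can be sketched as a plain version
comparison; `parseVersion` and `isRemoved` below are illustrative helpers for
this spec, not part of Telegraf's actual code:

```golang
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits an "x.y.z" version string into its numeric parts.
func parseVersion(v string) [3]int {
	var out [3]int
	for i, p := range strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3) {
		out[i], _ = strconv.Atoi(p)
	}
	return out
}

// isRemoved reports whether the running version has reached the removal
// version, i.e. the deprecation warning must be promoted to an error.
func isRemoved(current, removalIn string) bool {
	c, r := parseVersion(current), parseVersion(removalIn)
	for i := 0; i < 3; i++ {
		if c[i] != r[i] {
			return c[i] > r[i]
		}
	}
	return true // the removal release itself already errors
}

func main() {
	fmt.Println(isRemoved("1.39.2", "1.40.0")) // false: warn only
	fmt.Println(isRemoved("1.40.0", "1.40.0")) // true: refuse to start
}
```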
## Time frames and considerations
When deprecating parts of Telegraf, it is important to provide users with enough
time to migrate to alternative solutions before actually removing those parts.
In general, plugins, plugin options or option values should only be deprecated
if a suitable alternative exists! In those cases, the deprecations should
predate the removal by at least one and a half years. In current release terms
this corresponds to six minor-versions. However, there might be circumstances
requiring a prolonged time between deprecation and removal to ensure a smooth
transition for users.
In versions between the deprecation and removal of plugins, plugin options or
option values, Telegraf must log a *warning* on startup including information
about the version introducing the deprecation, the version of removal and a
user-facing hint on suitable replacements. In this phase Telegraf should
operate normally even with deprecated plugins, plugin options or option values
being set in the configuration files.
Starting from the removal version, Telegraf must show an *error* message for
deprecated plugins present in the configuration including all information listed
above. Removed plugin options and option values should be handled as invalid
settings in the configuration files and must lead to an error. In this phase,
Telegraf should *stop running* until all deprecated plugins, plugin options and
option values are removed from the configuration files.
## Deprecation Process
The deprecation process comprises the steps below.
### File issue
In the filed issue you should outline which plugin, plugin option or feature
you want to deprecate and *why*! Determine in which version the plugin should
be removed.
Try to reach an agreement in the issue before continuing and get a sign off
from the maintainers!
### Submit deprecation pull-request
Send a pull request adding deprecation information to the code and update the
plugin's `README.md` file. Depending on what you want to deprecate this
comprises different locations and steps as detailed below.
Once the deprecation pull-request is merged and Telegraf is released, we have
to wait for the targeted Telegraf version before actually removing the code.
#### Deprecating a plugin
When deprecating a plugin you need to add an entry to the `deprecations.go` file
in the respective plugin category with the following format
```golang
"<plugin name>": {
Since: "<x.y.z format version of the next minor release>",
RemovalIn: "<x.y.z format version of the plugin removal>",
Notice: "<user-facing hint e.g. on replacements>",
},
```
If you, for example, want to deprecate the `inputs.logparser` plugin you should add
```golang
"logparser": {
Since: "1.15.0",
  RemovalIn: "1.40.0",
Notice: "use 'inputs.tail' with 'grok' data format instead",
},
```
to `plugins/inputs/deprecations.go`. By doing this, Telegraf will show a
deprecation warning to the user starting from version `1.15.0` including the
`Notice` you provided. The plugin can then be removed in version `1.40.0`.
Additionally, you should update the plugin's `README.md` adding a paragraph
mentioning since when the plugin is deprecated, when it will be removed and a
hint to alternatives or replacements. The paragraph should look like this
```text
**Deprecated in version v1.15.0 and scheduled for removal in v1.40.0**:
Please use the [tail][] plugin with the [`grok` data format][grok parser]
instead!
```
#### Deprecating an option
To deprecate a plugin option, remove the option from the `sample.conf` file and
add the deprecation information to the structure field in the code. If you, for
example, want to deprecate the `ssl_enabled` option in `inputs.example` you
should add
```golang
type Example struct {
...
SSLEnabled bool `toml:"ssl_enabled" deprecated:"1.3.0;1.40.0;use 'tls_*' options instead"`
}
```
to schedule the setting for removal in version `1.40.0`. The last element of
the `deprecated` tag is a user-facing notice similar to plugin deprecation.
#### Deprecating an option-value
Sometimes, certain option values become deprecated or superseded by other
options or values. To deprecate those option values, remove them from
`sample.conf` and add the deprecation info in the code if the deprecated value
is *actually used* via
```golang
func (e *Example) Init() error {
...
if e.Mode == "old" {
models.PrintOptionDeprecationNotice(telegraf.Warn, "inputs.example", "mode", telegraf.DeprecationInfo{
Since: "1.23.1",
RemovalIn: "1.40.0",
Notice: "use 'v1' instead",
})
}
...
return nil
}
```
This will show a warning with the user-facing notice if the deprecated `old`
value is used for the `mode` setting in `inputs.example`.
### Submit pull-request for removing code
Once the plugin, plugin option or option-value is deprecated, we have to wait
for the `RemovalIn` release to remove the code. In the examples above, this
would be version `1.40.0`. After all scheduled bugfix-releases are done, with
`1.40.0` being the next release, you can create a pull-request to actually
remove the deprecated code.
Please make sure you remove the plugin, plugin option or option value and the
code referencing those. This might also comprise the `all` files of your plugin
category, test-cases including those of other plugins, README files or other
documentation. For removed plugins, please keep the deprecation info in
`deprecations.go` so users can find a reference when switching from a really
old version.
Make sure you add an `Important Changes` section to the `CHANGELOG.md` file
describing the removal with a reference to your PR.

# Telegraf Custom-Builder
## Objective
Provide a tool to build a customized, smaller version of Telegraf with only
the required plugins included.
## Keywords
tool, binary size, customization
## Overview
The Telegraf binary continues to grow as new plugins and features are added
and dependencies are updated. Users running on resource-constrained systems,
such as embedded systems or containers, might suffer from this growth.
This document specifies a tool to build a smaller Telegraf binary tailored to
the plugins configured and actually used, removing unnecessary and unused
plugins. The implementation should be able to cope with configured parsers and
serializers including defaults for those plugin categories. Valid Telegraf
configuration files, including directories containing such files, are the input
to the customization process.
The customization tool might not be available for older versions of Telegraf.
Furthermore, the degree of customization and thus the effective size reduction
might vary across versions. The tool must create a single static Telegraf
binary. Distribution packages or containers are *not* targeted.
## Prior art
[PR #5809](https://github.com/influxdata/telegraf/pull/5809) and
[telegraf-lite-builder](https://github.com/influxdata/telegraf/tree/telegraf-lite-builder/cmd/telegraf-lite-builder):
- Uses docker
- Uses browser:
- Generates a webpage to pick what options you want. User chooses plugins;
does not take a config file
- Builds a binary, then minifies it by stripping and compressing it
- Does some steps that belong in makefile, not builder
- Special case for upx
- Makes gzip, zip, tar.gz
- Uses gopkg.in?
- Can also work from the command line
[PR #8519](https://github.com/influxdata/telegraf/pull/8519)
- User chooses plugins OR provides a config file
[powers/telegraf-build](https://github.com/powersj/telegraf-build)
- User chooses plugins OR provides a config file
- Currently kept in separate repo
- Undoes changes to all.go files
[rawkode/bring-your-own-telegraf](https://github.com/rawkode/bring-your-own-telegraf)
- Uses docker
## Additional information
You might be able to further reduce the binary size of Telegraf by removing
debugging information. This is done by adding `-w` and `-s` to the linker
flags before building: `LDFLAGS="-w -s"`.
However, please note that this removes information helpful for debugging issues
in Telegraf.
Additionally, you can use a binary packer such as [UPX](https://upx.github.io/)
to reduce the required *disk* space. This compresses the binary and decompresses
it again at runtime. However, this does not reduce memory footprint at runtime.

# Plugin State-Persistence
## Objective
Retain the state of stateful plugins across restarts of Telegraf.
## Keywords
framework, plugin, stateful, persistence
## Overview
Telegraf contains a number of plugins that hold an internal state while
processing. For some of these plugins this state is important for efficient
processing, like the read position in a large file, or when continuously
querying data from a stateful peer requiring, for example, an offset or the
last queried timestamp. For those plugins it is important to persist their
internal state over restarts of Telegraf.
It is intended to
- allow for opt-in of plugins to store a state per plugin _instance_
- restore the state for each plugin instance at startup
- track the plugin instances over restarts to relate the stored state with a
corresponding plugin instance
- automatically compute plugin instance IDs based on the plugin configuration
- provide a way to manually specify instance IDs by the user
- _not_ restore states if the plugin configuration changed between runs
- make implementation easy for plugin developers
- make no assumption on the state _content_
The persistence will use the following steps:
- Compute a unique ID for each of the plugin _instances_
- Start up the Telegraf plugins, calling `Init()`, etc.
- Initialize persistence framework with the user specified `statefile` location
and load the state if present
- Determine all stateful plugin instances by fulfilling the `StatefulPlugin`
interface
- Restore plugin states (if any) for each plugin ID present in the state-file
- Run data collection, etc.
- On shutdown, stop all Telegraf plugins, calling `Stop()` or `Close()`
  depending on the plugin type
- Query the state of all registered stateful plugins
- Create an overall state-map with the plugin instance ID as a key and the
serialized plugin state as value.
- Marshal the overall state-map and store to disk
Potential users of this functionality are plugins continuously querying
endpoints using information from a previous query (e.g. timestamps, offsets,
transaction tokens, etc.). The following plugins are known to have an internal
state; this is not a comprehensive list.
- `inputs.win_eventlog` ([PR #8281](https://github.com/influxdata/telegraf/pull/8281))
- `inputs.docker_log` ([PR #7749](https://github.com/influxdata/telegraf/pull/7749))
- `inputs.tail` (file offset)
- `inputs.cloudwatch` (`windowStart`/`windowEnd` parameters)
- `inputs.stackdriver` (`prevEnd` parameter)
### Plugin ID computation
The plugin ID is computed based on the configuration options specified for the
plugin instance. To generate the ID all settings are extracted as `string`
key-value pairs with the option name being the key and the value being the
configuration option setting. For nested configuration options, e.g. if the
plugin has a sub-table, the options are flattened with a canonical key. The
canonical key elements must be concatenated with a dot (`.`) separator. In case
the sub-element is a list of tables, the key must include the index of each
table prefixed by a hash sign i.e. `<parent>#<index>.<child>`.
The resulting key-value pairs of configuration options are then sorted by the
key in lexical order to make the resulting ID invariant against changes in the
order of configuration options. The key and the value of each pair are joined
by a colon (`:`) to a single `string`.
Finally, a SHA256 sum is computed across all key-value strings separated by a
`null` byte. The HEX representation of the resulting SHA256 is used as the
plugin instance ID.
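A minimal sketch of this ID scheme, assuming the options have already been
flattened to string key-value pairs as described above:

```golang
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// instanceID computes a plugin-instance ID from flattened option key-value
// pairs: join key and value with ':', sort the pairs, separate them with a
// null byte, and hex-encode the SHA256 sum of the result.
func instanceID(options map[string]string) string {
	pairs := make([]string, 0, len(options))
	for k, v := range options {
		pairs = append(pairs, k+":"+v)
	}
	sort.Strings(pairs)
	sum := sha256.Sum256([]byte(strings.Join(pairs, "\x00")))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Nested options use canonical dotted keys; lists of tables include
	// the table index as "<parent>#<index>.<child>".
	id := instanceID(map[string]string{
		"interval":         "10s",
		"server#0.address": "http://localhost:8086",
	})
	fmt.Println(id)
}
```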
### State serialization format
The overall Telegraf state maps the plugin IDs (keys) to the serialized state
of the corresponding plugin (values). The state data returned by stateful
plugins is serialized to JSON. The resulting byte-sequence is used as the value
for the overall state. On-disk, the overall state of Telegraf is stored as JSON.
To restore the state of a plugin, the overall Telegraf state is first
deserialized from the on-disk JSON data and a lookup for the plugin ID is
performed in the resulting map. The value, if found, is then deserialized to the
plugin's state data-structure and provided to the plugin after calling `Init()`.
## Is / Is-not
### Is
- A framework to persist states over restarts of Telegraf
- A simple local state store
- A way to restore plugin states between restarts without configuration changes
- A unified API for plugins to use when requiring persistence of a state
### Is-Not
- A remote storage framework
- A way to store anything beyond fundamental plugin states
- A data-store or database
- A way to reassign plugin states if their configuration changes
- A tool for interactively adding/removing/modifying states of plugins
- A persistence guarantee beyond clean shutdown (i.e. no crash resistance)
## Prior art
- [PR #8281](https://github.com/influxdata/telegraf/pull/8281): Stores Windows
event-log bookmarks in the registry
- [PR #7749](https://github.com/influxdata/telegraf/pull/7749): Stores container
ID and log offset to a file at a user-provided path
- [PR #7537](https://github.com/influxdata/telegraf/pull/7537): Provides a
global state object and periodically queries plugin states to store the state
object to a JSON file. This approach does not provide an ID per plugin
_instance_, so it seems there is only a single state for a plugin _type_
- [PR #9476](https://github.com/influxdata/telegraf/pull/9476): Registers
stateful plugins with a persister and automatically assigns an ID to plugin
_instances_ based on the configuration. The approach also allows overwriting
the automatic ID, e.g. with user-specified data. It uses the plugin instance ID
to store/restore state for the same plugin instance, querying the plugin state
on shutdown and writing it to a file (currently JSON).

# Configuration Migration
## Objective
Provides a subcommand and framework to migrate configurations containing
deprecated settings to a corresponding recent configuration.
## Keywords
configuration, deprecation, telegraf command
## Overview
With the deprecation framework of [TSD-001](tsd-001-deprecation.md) implemented
we see more and more plugins and options being scheduled for removal in the
future. Furthermore, deprecations become visible to the user due to the warnings
issued for removed plugins, plugin options and plugin option values.
To aid the user in mitigating deprecated configuration settings, this
specification proposes adding a `migrate` sub-command to the Telegraf `config`
command to automatically migrate the user's existing configuration files away
from deprecated settings to an equivalent, recent
configuration. Furthermore, the specification describes the layout and
functionality of a plugin-based migration framework to implement migrations.
### `migrate` sub-command
The `migrate` sub-command of the `config` command should take a set of
configuration files and configuration directories and apply available migrations
to deprecated plugins, plugin options or plugin option-values in order to
generate new configuration files that do not make use of deprecated options.
In the process, the migration procedure must ensure that only plugins with
applicable migrations are modified. Existing configuration must be kept and not
be overwritten without manual confirmation of the user. This should be
accomplished by storing modified configuration files with a `.migrated` suffix
and leaving it to the user to overwrite the existing configuration with the
generated counterparts. If no migration is applied in a configuration file, the
command might not generate a new file and leave the original file untouched.
During migration, the configuration, plugin behavior, resulting metrics and
comments should be kept on a best-effort basis. Telegraf must inform the user
about applied migrations and potential changes in the plugin behavior or
resulting metrics. If a plugin cannot be automatically migrated but requires
manual intervention, Telegraf should inform the user.
### Migration implementations
To implement migrations for deprecated plugins, plugin option or plugin option
values, Telegraf must provide a plugin-based infrastructure to register and
apply implemented migrations based on the plugin-type. Only one migration per
plugin-type must be registered.
Developers must implement the required interfaces and register the migration
to the mentioned framework. The developer must provide the possibility to
exclude the migration at build-time according to
[TSD-002](tsd-002-custom-builder.md). Existing migrations can be extended but
must be cumulative such that any previous configuration migration functionality
is kept.
Resulting configurations should generate metrics equivalent to the previous
setup also making use of metric selection, renaming and filtering mechanisms.
Where this is not possible, the user must be clearly informed about what to
expect and which differences might occur.
A migration may also be purely informative, i.e. notify the user that a plugin
has to be migrated manually, and should point users to additional information.
Deprecated plugins and plugin options must be removed from the migrated
configuration.

# Telegraf Output Buffer Strategy
## Objective
Introduce a new agent-level config option to choose a disk buffer strategy for
output plugin metric queues.
## Overview
Currently, when a Telegraf output metric queue fills, either because incoming
metrics arrive too fast or because of issues writing to the output, the oldest
metrics are overwritten and never written to the output. This specification
defines a set of options to make this output queue more durable by persisting
pending metrics to disk rather than keeping them only in a limited-size
in-memory queue.
## Keywords
output plugins, agent configuration, persist to disk
## Agent Configuration
The configuration is at the agent-level, with options for:
- **Memory**, the current implementation, with no persistence to disk
- **Write-through**, all metrics are also written to disk using a
Write Ahead Log (WAL) file
- **Disk-overflow**, when the memory buffer fills, metrics are flushed to a
WAL file to avoid dropping overflow metrics
There is also an option to specify the directory in which to store the WAL
files on disk, with a default value. These settings are global; no change means
memory-only mode, retaining the current behavior.
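In configuration-file terms, the options could look like the following sketch;
the option names and values are illustrative only, not the final interface:

```toml
[agent]
  ## Buffer strategy: "memory" (default, current behavior),
  ## "write_through", or "disk_overflow" -- names are placeholders.
  buffer_strategy = "disk_overflow"

  ## Directory holding the per-output WAL files; a default applies
  ## when unset.
  buffer_directory = "/var/lib/telegraf/buffer"
```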
## Metric Ordering and Tracking
Tracking metrics will be accepted on a successful write to the output
destination. Metrics will be written to their appropriate output in the order
they are received in the buffer regardless of which buffer strategy is chosen.
## Disk Utilization and File Handling
Each output plugin has its own in-memory output buffer, and therefore will
have its own WAL file for buffer persistence. This file may not exist if
Telegraf is able to write all of its metrics without filling the in-memory
buffer in disk-overflow mode, and it never exists in memory mode.
Telegraf should use one file per output plugin, and remove entries from the
WAL file as they are written to the output.
Telegraf will not make any attempt to limit the size on disk taken by these
files beyond cleaning up WAL files for metrics that have successfully been
flushed to their output destination. It is the user's responsibility to ensure
these files do not entirely fill the disk, both during Telegraf uptime and
with lingering files from previous instances of the program.
If WAL files exist for an output plugin from previous instances of Telegraf,
they will be picked up and flushed before any new metrics that are written
to the output. This is to ensure that these metrics are not lost, and to
ensure that output write order remains consistent.
Telegraf must additionally provide a way to manually flush WAL files via
some separate plugin or similar. This could be used as a way to ensure that
WAL files are properly written in the event that the output plugin changes
and the WAL file is unable to be detected by a new instance of Telegraf.
This plugin should not be required for use to allow the buffer strategy to
work.
## Is/Is-not
- Is a way to prevent metrics from being dropped due to a full memory buffer
- Is not a way to guarantee data safety in the event of a crash or system failure
- Is not a way to manage file system allocation size, file space will be used
until the disk is full
## Prior art
- [Initial issue](https://github.com/influxdata/telegraf/issues/802)
- [Loose specification issue](https://github.com/influxdata/telegraf/issues/14805)

# Startup Error Behavior
## Objective
Unified, configurable behavior on retriable startup errors.
## Keywords
inputs, outputs, startup, error, retry
## Overview
Many Telegraf plugins connect to an external service either on the same machine
or via network. On automated startup of Telegraf (e.g. via service) there is no
guarantee that those services are fully started yet, especially when they reside
on a remote host. More and more plugins implement mechanisms to retry reaching
their related service if they failed to do so on startup.
This specification intends to unify the naming of configuration-options, the
values of those options, and their semantic meaning. It describes the behavior
for the different options on handling startup-errors.
Startup errors are all errors occurring in calls to `Start()` for inputs and
service-inputs or `Connect()` for outputs. The behaviors described below
should only be applied in cases where the plugin *explicitly* states that a
startup error is *retriable*. This includes, for example, network errors
indicating that the host or service is not yet reachable, or external
resources, like a machine or file, that are not yet available but might become
available later. To indicate a retriable startup error the plugin should return
a predefined error-type.
In cases where the plugin cannot generally determine whether an error is
retriable, it might add configuration settings to let the user configure that
property, for example where an error code indicates a fatal, non-recoverable
error in one case but a non-fatal, recoverable error in another case.
## Configuration Options and Behaviors
Telegraf must introduce a unified `startup_error_behavior` configuration option
for inputs and output plugins. The option is handled directly by the Telegraf
agent and is not passed down to the plugins. The setting must be available on a
per-plugin basis and defines how Telegraf behaves on startup errors.
For all option values, Telegraf might retry starting the plugin a limited
number of times during the startup phase before actually processing data. This
corresponds to the current behavior of Telegraf retrying three times with a
fifteen-second interval before continuing to process the plugins.
### `error` behavior
The `error` setting for the `startup_error_behavior` option causes Telegraf to
fail and exit on startup errors. This must be the default behavior.
### `retry` behavior
With the `retry` setting for the `startup_error_behavior` option, Telegraf must
*not* fail on startup errors and should continue running. Telegraf must retry
starting the failed plugin in each gather or write cycle, for inputs or for
outputs respectively, for an unlimited number of times. Neither the
plugin's `Gather()` nor `Write()` method is called as long as the startup did
not succeed. Metrics sent to an output plugin will be buffered until the plugin
is actually started. If the metric-buffer limit is reached **metrics might be
dropped**!
In case a plugin signals a partially successful startup, e.g. a subset of the
given endpoints are reachable, Telegraf must try to fully startup the remaining
endpoints by calling `Start()` or `Connect()`, respectively, until full startup
is reached **and** trigger the plugin's `Gather()` or `Write()` methods.
### `ignore` behavior
When using the `ignore` setting for the `startup_error_behavior` option Telegraf
must *not* fail on startup errors and should continue running. On startup error,
Telegraf must ignore the plugin as-if it was not configured at all, i.e. the
plugin must be completely removed from processing.
### `probe` behavior
When using the `probe` setting for the `startup_error_behavior` option Telegraf
must *not* fail on startup errors and should continue running. On startup error,
Telegraf must ignore the plugin as-if it was not configured at all, i.e. the
plugin must be completely removed from processing, similar to the `ignore`
behavior. Additionally, Telegraf must probe the plugin (as defined in
[TSD-009][tsd_009]) after startup, if it implements the `ProbePlugin` interface.
If probing is available *and* returns an error Telegraf must *ignore* the
plugin as-if it was not configured at all.
[tsd_009]: /docs/specs/tsd-009-probe-on-startup.md
## Plugin Requirements
Plugins participating in handling startup errors must implement the `Start()`
or `Connect()` method for inputs and outputs respectively. Those methods must be
safe to be called multiple times during retries without leaking resources or
causing issues in the service used.
Furthermore, the `Close()` method of the plugins must be safe to be called for
cases where the startup failed without causing panics.
The plugins should return a `nil` error during startup to indicate a successful
startup or a retriable error (via predefined error type) to enable the defined
startup error behaviors. A non-retriable error (via predefined error type) or
a generic error will bypass the startup error behaviors and Telegraf must fail
and exit in the startup phase.
## Related Issues
- [#8586](https://github.com/influxdata/telegraf/issues/8586) `inputs.postgresql`
- [#9778](https://github.com/influxdata/telegraf/issues/9778) `outputs.kafka`
- [#13278](https://github.com/influxdata/telegraf/issues/13278) `outputs.cratedb`
- [#13746](https://github.com/influxdata/telegraf/issues/13746) `inputs.amqp_consumer`
- [#14365](https://github.com/influxdata/telegraf/issues/14365) `outputs.postgresql`
- [#14603](https://github.com/influxdata/telegraf/issues/14603) `inputs.nvidia-smi`
- [#14603](https://github.com/influxdata/telegraf/issues/14603) `inputs.rocm-smi`

# URL-Based Config Behavior
## Objective
Define the retry and reload behavior of remote URLs that are passed as config to
Telegraf. In terms of retry, currently Telegraf will attempt to load a remote
URL three times and then exit. In terms of reload, Telegraf does not have the
capability to reload remote URL-based configs. This spec proposes options that
extend both capabilities.
## Keywords
config, error, retry, reload
## Overview
Telegraf allows for loading configurations from local files, directories, and
files via a URL. In order to allow situations where a configuration file is not
yet available or due to a flaky network, the first proposal is to introduce a
new CLI flag: `--config-url-retry-attempts`. This flag would continue to default
to three and would specify the number of retries to attempt to get a remote URL
during the initial startup of Telegraf.
```sh
--config-url-retry-attempts=3 Number of times to attempt to obtain a remote
configuration via a URL during startup. Set to
-1 for unlimited attempts.
```
These attempts would block Telegraf from starting up completely until success,
or until the attempts are exhausted and Telegraf exits.
Once Telegraf is up and running, users can use the `--watch` flag to enable
watching local files for changes and, if/when changes are made, reload
Telegraf with the new configuration. For remote URLs, I propose a new CLI
flag: `--config-url-watch-interval`. This flag would set an internal timer
that, when it fires, checks for an update to the remote URL file.
```sh
--config-url-watch-interval=0s Time duration to check for updates to URL based
configuration files. Disabled by default.
```
At each interval, Telegraf would send an HTTP HEAD request to the configuration
URL. Here is an example curl HEAD request and its output:
```sh
$ curl --head http://localhost:8000/config.toml
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/3.12.3
Date: Mon, 29 Apr 2024 18:18:56 GMT
Content-type: application/octet-stream
Content-Length: 1336
Last-Modified: Mon, 29 Apr 2024 11:44:19 GMT
```
The proposal then is to store the Last-Modified value when Telegraf first
obtains the file and compare the value at each interval. There is no need to
parse the value; just store the raw string. If there is a difference, trigger
a reload.
If anything other than a 2xx response code is returned from the HEAD request,
Telegraf would print a warning message and retry at the next interval.
Telegraf would continue to run with the existing configuration, unchanged.
If the value of Last-Modified is empty, which is very unlikely, then Telegraf
would ignore updates to this configuration file. Telegraf would print a
warning message once about the missing field.
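The Last-Modified bookkeeping described above can be sketched as follows;
`lastModified` and `decide` are hypothetical helper names for this spec, not
part of Telegraf:

```go
package main

import (
	"fmt"
	"net/http"
)

// lastModified issues the HEAD request from the proposal and returns the raw
// Last-Modified header without parsing it.
func lastModified(client *http.Client, url string) (string, error) {
	resp, err := client.Head(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		// Non-2xx: warn and retry at the next interval.
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return resp.Header.Get("Last-Modified"), nil
}

// decide compares the stored raw header value with the latest one and reports
// whether a reload should be triggered. An empty latest value means the
// header is missing, so updates to this file are ignored.
func decide(stored, latest string) (newStored string, reload bool) {
	if latest == "" {
		return stored, false
	}
	return latest, stored != "" && latest != stored
}

func main() {
	// First observation: store the raw string, no reload.
	stored, reload := decide("", "Mon, 29 Apr 2024 11:44:19 GMT")
	fmt.Println(stored, reload)
	// Value changed at a later interval: trigger a reload.
	_, reload = decide(stored, "Mon, 29 Apr 2024 12:00:00 GMT")
	fmt.Println(reload)
}
```

The interval timer would call `lastModified` and feed the result to `decide`;
only a changed raw string triggers the reload.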
## Relevant Issues
* Configuration capabilities to retry for loading config via URL #[8854][]
* Telegraf reloads URL-based/remote config on a specified interval #[8730][]
[8854]: https://github.com/influxdata/telegraf/issues/8854
[8730]: https://github.com/influxdata/telegraf/issues/8730
# Partial write error handling
## Objective
Provide a way to pass information about partial metric write errors from an
output to the output model.
## Keywords
output plugins, write, error, output model, metric, buffer
## Overview
The output model wrapping each output plugin buffers metrics to be able to batch
those metrics for more efficient sending. In each flush cycle, the model
collects a batch of metrics and hands it over to the output plugin for writing
through the `Write` method. Currently, if writing succeeds (i.e. no error is
returned), _all metrics of the batch_ are removed from the buffer and are marked
as __accepted__ both in terms of statistics as well as in tracking-metric terms.
If writing fails (i.e. any error is returned), _all metrics of the batch_ are
__kept__ in the buffer for requeueing them in the next write cycle.
Issues arise when an output plugin can write only some metrics of a batch, but
not all, to its service endpoint, e.g. due to metrics not being serializable
or due to metrics being selectively rejected by the service on the server
side. This might happen when reaching submission limits, violating service
constraints, e.g. by out-of-order sends, or due to invalid characters in the
serialized metric.
In those cases, an output currently is only able to accept or reject the
_complete batch of metrics_ as there is no mechanism to inform the model (and
in turn the buffer) that only _some_ of the metrics in the batch were failing.
As a consequence, outputs often _accept_ the batch to avoid requeueing the
failing metrics in the next flush interval. This distorts the statistics of
accepted metrics and causes misleading log messages claiming all metrics were
written successfully, which is not true. Even worse, for outputs ending up
with partial writes, e.g. when only the first half of the metrics can be
written to the service, there is no way of telling the model to selectively
accept the actually written metrics. In turn, those outputs must internally
buffer the remaining, unwritten metrics, duplicating buffering logic and
adding code complexity.
This specification aims at defining the handling of partially successful writes
and introduces the concept of a special _partial write error_ type to reflect
partial writes and partial serialization overcoming the aforementioned issues
and limitations.
To do so, the _partial write error_ error type must contain a list of
successfully written metrics, to be marked __accepted__, both in terms of
statistics as well as in terms of metric tracking, and must be removed from the
buffer. Furthermore, the error must contain a list of metrics that cannot be
sent or serialized and cannot be retried. These metrics must be marked as
__rejected__, both in terms of statistics as well as in terms of metric
tracking, and must be removed from the buffer.
The error may contain a list of not-yet-written metrics to be __kept__ for the
next write cycle. Those metrics must not be marked and must be kept in the
buffer. If the error does not contain the list of not-yet-written metrics,
this list must be inferred from the accept and reject lists mentioned above.
To allow the model and the buffer to correctly handle tracking metrics ending
up in the buffer and in the output, the tracking information must be preserved
in the communication between the output plugin, the model, and the buffer
through the specified error. To do so, all metric lists should be communicated
as indices into the batch so that tracking metrics can be handled correctly.
For backward compatibility and simplicity, output plugins can return a `nil`
error to indicate that __all__ metrics of the batch are __accepted__.
Similarly, returning an error that is _not_ a _partial write error_ indicates
that __all__ metrics of the batch should be __kept__ in the buffer for the
next write cycle.
## Related Issues
- [issue #11942](https://github.com/influxdata/telegraf/issues/11942) for
contradicting log messages
- [issue #14802](https://github.com/influxdata/telegraf/issues/14802) for
rate-limiting multiple batch sends
- [issue #15908](https://github.com/influxdata/telegraf/issues/15908) for
infinite loop if single metrics cannot be written
# Probing plugins after startup
## Objective
Allow Telegraf to probe plugins during startup to enable enhanced plugin error
detection, such as checking the availability of hardware or services.
## Keywords
inputs, outputs, startup, probe, error, ignore, behavior
## Overview
When plugins are first instantiated, Telegraf calls the plugin's `Start()`
method (for inputs) or `Connect()` (for outputs), which initializes the plugin
based on its configuration options and the running environment. It is
sometimes the case that while the initialization step succeeds, the upstream
service on which the plugin relies is not actually running, or cannot be
communicated with due to incorrect configuration or environmental problems. In
such situations Telegraf does not detect that the plugin's upstream service is
not functioning properly, and thus it will continually call the plugin during
each `Gather()` iteration. This often has the effect of polluting journald and
system logs with voluminous error messages, which creates issues for system
administrators who rely on such logs to identify other, unrelated system
problems.
More background discussion on this option, including other possible avenues, can
be viewed [here](https://github.com/influxdata/telegraf/issues/16028).
## Probing
Probing is an action whereby the plugin ensures, on a best-effort basis, that
it will be fully functional. This may comprise communicating with its external
service, or trying to access required devices, entities, or executables, to
ensure that the plugin will not produce errors during, e.g., data collection
or data output. Probing must *not* produce, process, or output any metrics.
Plugins that support probing must implement the `ProbePlugin` interface. Such
plugins must behave in the following manner:
1. Return an error if the external dependencies (hardware, services,
executables, etc.) of the plugin are not available.
2. Return an error if information cannot be gathered (in the case of inputs) or
sent (in the case of outputs) due to unrecoverable issues. For example, invalid
authentication, missing permissions, or non-existent endpoints.
3. Otherwise, return `nil` indicating the plugin will be fully functional.
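The interface and a probing input could be sketched as follows; the
`exampleInput` type and its `serviceUp` field are hypothetical stand-ins for a
real availability check:

```go
package main

import (
	"errors"
	"fmt"
)

// ProbePlugin is the interface from this spec; plugins implementing it can be
// probed after Start()/Connect().
type ProbePlugin interface {
	Probe() error
}

// exampleInput sketches an input whose external service may be unavailable.
// The serviceUp flag stands in for a real connectivity or permission check.
type exampleInput struct {
	serviceUp bool
}

// Probe checks availability without producing metrics or mutating the
// plugin's internal state or the external service state.
func (p *exampleInput) Probe() error {
	if !p.serviceUp {
		return errors.New("service unreachable")
	}
	return nil
}

func main() {
	for _, p := range []ProbePlugin{
		&exampleInput{serviceUp: true},
		&exampleInput{serviceUp: false},
	} {
		fmt.Println(p.Probe()) // nil for the healthy plugin, an error otherwise
	}
}
```

After a successful `Start()`/`Connect()`, the agent would type-assert for
`ProbePlugin` and call `Probe()`, applying the configured error behavior when
it returns an error.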
## Plugin Requirements
Plugins that allow probing must implement the `ProbePlugin` interface. The
exact implementation depends on the plugin's functionality and requirements,
but generally it should take the same actions as it would during normal
operation, e.g. calling `Gather()` or `Write()`, and check whether errors
occur. If probing fails, it must be safe to call the plugin's `Close()`
method.
Input plugins must *not* produce metrics, and output plugins must *not* send
any metrics to the service. Plugins must *not* influence later data processing
or collection by modifying the plugin's internal state or the external state
of the service or hardware. For example, file offsets or other service states
must be reset so that no data is lost during the first gather or write cycle.
Plugins must return `nil` upon successful probing or an error otherwise.
## Related Issues
- [#16028](https://github.com/influxdata/telegraf/issues/16028)
- [#15916](https://github.com/influxdata/telegraf/pull/15916)
- [#16001](https://github.com/influxdata/telegraf/pull/16001)