Adding upstream version 1.34.4.
Signed-off-by: Daniel Baumann <daniel@debian.org>
parent e393c3af3f
commit 4978089aab
4963 changed files with 677545 additions and 0 deletions
71
docs/specs/README.md
Normal file
@@ -0,0 +1,71 @@
# Telegraf Specification Overview

## Objective

Define and lay out the Telegraf specification process.

## Overview

The general goal of a spec is to detail the work that needs to get accomplished
for a new feature. A developer should be able to pick up a spec and have a
decent understanding of the objective, the steps required, and most of the
general design decisions.

The specs can then live in the Telegraf repository to share and involve the
community in the process of planning larger changes or new features. The specs
also serve as a public historical record for changes.

## Process

The general workflow is for a user to put up a PR with a spec outlining the
task, have any discussion in the PR, reach consensus, and ultimately commit
the finished spec to the repo.

While researching a new feature may involve an investment of time, writing the
spec should be relatively quick. It should not take hours of time.

## Spec naming

Please prefix the file name with `tsd` and the next available number, for
example:

* tsd-001-agent-write-ahead-log.md
* tsd-002-inputs-apache-increase-timeout.md
* tsd-003-serializers-parquet.md

Use all lower-case letters and separate words with hyphens.
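The naming convention can be checked mechanically. The following is a minimal sketch, assuming a three-digit number as in the examples above; `validSpecName` and the regular expression are illustrative names and not part of the Telegraf tooling.

```go
package main

import "regexp"

// specName matches names such as "tsd-003-serializers-parquet.md": the "tsd"
// prefix, a number, and a lower-case title with hyphen-separated words.
// The three-digit width is an assumption taken from the examples.
var specName = regexp.MustCompile(`^tsd-[0-9]{3}-[a-z0-9]+(-[a-z0-9]+)*\.md$`)

// validSpecName reports whether name follows the convention sketched above.
func validSpecName(name string) bool {
	return specName.MatchString(name)
}
```

Such a check could run in CI for new files under `docs/specs`; upper-case letters, underscores or a missing number would all be rejected.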

## What belongs in a spec

A spec should involve the creation of a markdown file with at least an objective
and overview:

* Objective (required) - One sentence headline
* Overview (required) - Explain the reasoning for the new feature and any
  historical information. Answer why this is needed.

Please feel free to make a copy of `template.md` and start with that.

The user is free to add additional sections or parts in order to express and
convey a new feature. For example, this might include:

* Keywords - Help identify what the spec is about
* Is/Is-not - Explicitly state what this change includes and does not include
* Prior Art - Point at existing or previous PRs, issues, or other works that
  demonstrate the feature or need for it.
* Open Questions - Section with open questions that can get captured in
  updates to the PR

## Changing existing specs

Small, non-substantive changes, such as grammar or formatting fixes, are gladly
accepted.

After a feature is complete it may make sense to come back and update a spec
based on the final result.

For substantive changes, it is entirely up to the maintainers whether the edits
to an existing spec will be accepted. In general, finished specs should be
considered complete and done. However, priorities, details, or other situations
may evolve over time and as such introduce the need to make updates.
20
docs/specs/template.md
Normal file
@@ -0,0 +1,20 @@
# Title

## Objective

One sentence explanation of the feature.

## Overview

Background and details about the feature.

## Keywords

A few items to specify what areas of Telegraf this spec affects (e.g. outputs,
inputs, processors, aggregators, agent, packaging, etc.)

## Is/Is-not

## Prior art

## Open questions
182
docs/specs/tsd-001-deprecation.md
Normal file
@@ -0,0 +1,182 @@
# Plugin and Plugin Option Deprecation

## Objective

Specifies the process of deprecating and removing plugins, plugin settings
(including values of those settings) or features.

## Keywords

procedure, removal, all plugins

## Overview

Over time the number of plugins, plugin options and plugin features grows and
some of those plugins or options are either no longer relevant or have been
superseded or subsumed by other plugins or options. To be able to remove those,
this specification defines a process to deprecate plugins, plugin options and
plugin features including a timeline and minimal time-frames. Additionally, the
specification defines a framework to annotate deprecations in the code and
inform users about such deprecations.

## User experience

In the deprecation phase a warning will be shown at Telegraf startup with the
following content:

```text
Plugin "inputs.logparser" deprecated since version 1.15.0 and will be removed in 1.40.0: use 'inputs.tail' with 'grok' data format instead
```

Similar warnings will be shown when removing plugin options or option values.
This provides users with time to replace the deprecated plugin in their
configuration file.

Starting with the shown release (`v1.40.0` in this case) the warning will be
promoted to an error preventing Telegraf from starting. The user now has to
adapt the configuration file to start Telegraf.

## Time frames and considerations

When deprecating parts of Telegraf, it is important to provide users with enough
time to migrate to alternative solutions before actually removing those parts.

In general, plugins, plugin options or option values should only be deprecated
if a suitable alternative exists! In those cases, the deprecations should
predate the removal by at least one and a half years. In current release terms
this corresponds to six minor-versions. However, there might be circumstances
requiring a prolonged time between deprecation and removal to ensure a smooth
transition for users.
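To make the six-minor-version rule concrete, here is a small sketch that derives the earliest allowed removal version from the deprecation version. The helper name `earliestRemoval` is hypothetical, and the code assumes plain `x.y.z` version strings within the same major version.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minMinorGap is the minimal distance between deprecation and removal,
// expressed in minor versions as stated in the text above.
const minMinorGap = 6

// earliestRemoval returns the earliest removal version for a deprecation
// introduced in the given "x.y.z" version.
func earliestRemoval(since string) (string, error) {
	parts := strings.Split(since, ".")
	if len(parts) != 3 {
		return "", fmt.Errorf("expected x.y.z version, got %q", since)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return "", fmt.Errorf("invalid minor version in %q: %w", since, err)
	}
	return fmt.Sprintf("%s.%d.0", parts[0], minor+minMinorGap), nil
}
```

For a deprecation introduced in `1.15.0` this yields `1.21.0` as the lower bound; a later removal version, as in the `1.40.0` example, is always permitted.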

In versions between deprecation and removal of plugins, plugin options or option
values, Telegraf must log a *warning* on startup including information about
the version introducing the deprecation, the version of removal and a
user-facing hint on suitable replacements. In this phase Telegraf should
operate normally even with deprecated plugins, plugin options or option values
being set in the configuration files.

Starting from the removal version, Telegraf must show an *error* message for
deprecated plugins present in the configuration including all information listed
above. Removed plugin options and option values should be handled as invalid
settings in the configuration files and must lead to an error. In this phase,
Telegraf should *stop running* until all deprecated plugins, plugin options and
option values are removed from the configuration files.
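The two phases can be summarized as a comparison of the running version against the `Since` and `RemovalIn` versions. The following sketch is illustrative only: the `level` type and the helper names are assumptions, and versions are compared as plain `x.y.z` triples rather than with a full semantic-versioning library.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// level describes how a deprecated setting is reported (hypothetical type).
type level int

const (
	levelNone  level = iota // not yet deprecated
	levelWarn               // deprecation phase: warn, keep running
	levelError              // removal version reached: refuse to start
)

// parseVersion splits "x.y.z" (with an optional "v" prefix) into integers.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("expected x.y.z version, got %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("invalid version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// less reports whether version a is strictly older than version b.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// deprecationLevel decides between no message, a warning and an error.
func deprecationLevel(current, since, removalIn string) (level, error) {
	cur, err := parseVersion(current)
	if err != nil {
		return levelNone, err
	}
	s, err := parseVersion(since)
	if err != nil {
		return levelNone, err
	}
	r, err := parseVersion(removalIn)
	if err != nil {
		return levelNone, err
	}
	switch {
	case less(cur, s):
		return levelNone, nil
	case less(cur, r):
		return levelWarn, nil
	default:
		return levelError, nil
	}
}
```

With `Since` set to `1.15.0` and `RemovalIn` set to `1.40.0`, a Telegraf `1.34.4` would warn, while `1.40.0` and later would report an error.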

## Deprecation Process

The deprecation process comprises the steps below.

### File issue

In the filed issue you should outline which plugin, plugin option or feature
you want to deprecate and *why*! Determine in which version the plugin should
be removed.

Try to reach an agreement in the issue before continuing and get a sign off
from the maintainers!

### Submit deprecation pull-request

Send a pull request adding deprecation information to the code and updating the
plugin's `README.md` file. Depending on what you want to deprecate this
comprises different locations and steps as detailed below.

Once the deprecation pull-request is merged and Telegraf is released, we have
to wait for the targeted Telegraf version for actually removing the code.

#### Deprecating a plugin

When deprecating a plugin you need to add an entry to the `deprecations.go` file
in the respective plugin category with the following format:

```golang
    "<plugin name>": {
        Since:     "<x.y.z format version of the next minor release>",
        RemovalIn: "<x.y.z format version of the plugin removal>",
        Notice:    "<user-facing hint e.g. on replacements>",
    },
```

If, for example, you want to deprecate the `inputs.logparser` plugin you should
add

```golang
    "logparser": {
        Since:     "1.15.0",
        RemovalIn: "1.40.0",
        Notice:    "use 'inputs.tail' with 'grok' data format instead",
    },
```

to `plugins/inputs/deprecations.go`. By doing this, Telegraf will show a
deprecation warning to the user starting from version `1.15.0` including the
`Notice` you provided. The plugin can then be removed in version `1.40.0`.

Additionally, you should update the plugin's `README.md` adding a paragraph
mentioning since when the plugin is deprecated, when it will be removed and a
hint to alternatives or replacements. The paragraph should look like this:

```text
**Deprecated in version v1.15.0 and scheduled for removal in v1.40.0**:
Please use the [tail][] plugin with the [`grok` data format][grok parser]
instead!
```

#### Deprecating an option

To deprecate a plugin option, remove the option from the `sample.conf` file and
add the deprecation information to the structure field in the code. If, for
example, you want to deprecate the `ssl_enabled` option in `inputs.example` you
should add

```golang
type Example struct {
    // ...
    SSLEnabled bool `toml:"ssl_enabled" deprecated:"1.3.0;1.40.0;use 'tls_*' options instead"`
}
```

to schedule the setting for removal in version `1.40.0`. The last element of
the `deprecated` tag is a user-facing notice similar to plugin deprecation.
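To illustrate how such a tag can be consumed, the sketch below reads the `deprecated` tag via reflection and splits it into its three elements. It is not Telegraf's actual implementation; `deprecationInfo`, `deprecatedOptions` and the strict three-element format are assumptions based on the example above.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// deprecationInfo mirrors the three elements of the tag (hypothetical type).
type deprecationInfo struct {
	Since     string
	RemovalIn string
	Notice    string
}

// example is a stand-in plugin configuration used for demonstration.
type example struct {
	Servers    []string `toml:"servers"`
	SSLEnabled bool     `toml:"ssl_enabled" deprecated:"1.3.0;1.40.0;use 'tls_*' options instead"`
}

// deprecatedOptions returns the deprecation info of all tagged fields of a
// struct (or pointer to a struct), keyed by the TOML option name.
func deprecatedOptions(cfg interface{}) (map[string]deprecationInfo, error) {
	t := reflect.TypeOf(cfg)
	if t != nil && t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t == nil || t.Kind() != reflect.Struct {
		return nil, fmt.Errorf("expected a struct, got %T", cfg)
	}
	out := make(map[string]deprecationInfo)
	for i := 0; i < t.NumField(); i++ {
		field := t.Field(i)
		tag, ok := field.Tag.Lookup("deprecated")
		if !ok {
			continue
		}
		// Split into at most three parts so the notice may contain ';'.
		parts := strings.SplitN(tag, ";", 3)
		if len(parts) != 3 {
			return nil, fmt.Errorf("field %s: malformed tag %q", field.Name, tag)
		}
		out[field.Tag.Get("toml")] = deprecationInfo{
			Since:     parts[0],
			RemovalIn: parts[1],
			Notice:    parts[2],
		}
	}
	return out, nil
}
```

Calling `deprecatedOptions(example{})` returns a single entry for `ssl_enabled`; untagged fields such as `servers` are skipped.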

#### Deprecating an option-value

Sometimes, certain option values become deprecated or superseded by other
options or values. To deprecate those option values, remove them from
`sample.conf` and add the deprecation info in the code if the deprecated value
is *actually used* via

```golang
func (e *Example) Init() error {
    // ...
    if e.Mode == "old" {
        models.PrintOptionDeprecationNotice(telegraf.Warn, "inputs.example", "mode", telegraf.DeprecationInfo{
            Since:     "1.23.1",
            RemovalIn: "1.40.0",
            Notice:    "use 'v1' instead",
        })
    }
    // ...
    return nil
}
```

This will show a warning with a user-facing notice if the deprecated `old`
value is used for the `mode` setting in `inputs.example`.

### Submit pull-request for removing code

Once the plugin, plugin option or option-value is deprecated, we have to wait
for the `RemovalIn` release to remove the code. In the examples above, this
would be version `1.40.0`. After all scheduled bugfix-releases are done, with
`1.40.0` being the next release, you can create a pull-request to actually
remove the deprecated code.

Please make sure you remove the plugin, plugin option or option value and the
code referencing those. This might also comprise the `all` files of your plugin
category, test-cases including those of other plugins, README files or other
documentation. For removed plugins, please keep the deprecation info in
`deprecations.go` so users can find a reference when switching from a really
old version.

Make sure you add an `Important Changes` section to the `CHANGELOG.md` file
describing the removal with a reference to your PR.
71
docs/specs/tsd-002-custom-builder.md
Normal file
@@ -0,0 +1,71 @@
# Telegraf Custom-Builder

## Objective

Provide a tool to build a customized, smaller version of Telegraf with only
the required plugins included.

## Keywords

tool, binary size, customization

## Overview

The Telegraf binary continues to grow as new plugins and features are added
and dependencies are updated. Users running on resource-constrained systems
such as embedded systems or inside containers might suffer from the growth.

This document specifies a tool to build a smaller Telegraf binary tailored to
the plugins configured and actually used, removing unnecessary and unused
plugins. The implementation should be able to cope with configured parsers and
serializers including defaults for those plugin categories. Valid Telegraf
configuration files, including directories containing such files, are the input
to the customization process.
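As a rough illustration of the first step, the sketch below collects the plugins referenced in a configuration by scanning for `[[category.name]]` table headers. This is a deliberate simplification: a real tool would use a TOML parser and would also resolve parsers and serializers (for example via `data_format` settings), which this sketch ignores. The function name `usedPlugins` and the list of categories are assumptions.

```go
package main

import (
	"bufio"
	"regexp"
	"sort"
	"strings"
)

// tableHeader matches plugin table headers such as "[[inputs.cpu]]".
// Commented-out headers and nested sub-tables do not match.
var tableHeader = regexp.MustCompile(
	`^\s*\[\[\s*(inputs|outputs|processors|aggregators)\.([A-Za-z0-9_]+)\s*\]\]`)

// usedPlugins returns the sorted, de-duplicated "category.name" identifiers
// of all plugins configured in the given configuration text.
func usedPlugins(config string) ([]string, error) {
	seen := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(config))
	for scanner.Scan() {
		m := tableHeader.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		seen[m[1]+"."+m[2]] = true
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names, nil
}
```

The resulting list could then drive which plugin packages are compiled into the binary, for instance through build tags.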

The customization tool might not be available for older versions of Telegraf.
Furthermore, the degree of customization and thus the effective size reduction
might vary across versions. The tool must create a single static Telegraf
binary. Distribution packages or containers are *not* targeted.

## Prior art

[PR #5809](https://github.com/influxdata/telegraf/pull/5809) and
[telegraf-lite-builder](https://github.com/influxdata/telegraf/tree/telegraf-lite-builder/cmd/telegraf-lite-builder):

- Uses docker
- Uses browser:
  - Generates a webpage to pick what options you want. User chooses plugins;
    does not take a config file
  - Builds a binary, then minifies it by stripping and compressing that binary
- Does some steps that belong in makefile, not builder
  - Special case for upx
  - Makes gzip, zip, tar.gz
- Uses gopkg.in?
- Can also work from the command line

[PR #8519](https://github.com/influxdata/telegraf/pull/8519)

- User chooses plugins OR provides a config file

[powersj/telegraf-build](https://github.com/powersj/telegraf-build)

- User chooses plugins OR provides a config file
- Currently kept in separate repo
- Undoes changes to all.go files

[rawkode/bring-your-own-telegraf](https://github.com/rawkode/bring-your-own-telegraf)

- Uses docker

## Additional information

You might be able to further reduce the binary size of Telegraf by removing
debugging information. This is done by adding `-w` and `-s` to the linker flags
before building, i.e. `LDFLAGS="-w -s"`.

However, please note that this removes information helpful for debugging issues
in Telegraf.

Additionally, you can use a binary packer such as [UPX](https://upx.github.io/)
to reduce the required *disk* space. This compresses the binary and decompresses
it again at runtime. However, this does not reduce the memory footprint at
runtime.
125
docs/specs/tsd-003-state-persistence.md
Normal file
@@ -0,0 +1,125 @@
# Plugin State-Persistence

## Objective

Retain the state of stateful plugins across restarts of Telegraf.

## Keywords

framework, plugin, stateful, persistence

## Overview

Telegraf contains a number of plugins that hold an internal state while
processing. For some of the plugins this state is important for efficient
processing like the location when reading a large file or when continuously
querying data from a stateful peer requiring for example an offset or the last
queried timestamp. For those plugins it is important to persist their
internal state over restarts of Telegraf.

It is intended to

- allow for opt-in of plugins to store a state per plugin _instance_
- restore the state for each plugin instance at startup
- track the plugin instances over restarts to relate the stored state with a
  corresponding plugin instance
- automatically compute plugin instance IDs based on the plugin configuration
- provide a way for the user to manually specify instance IDs
- _not_ restore states if the plugin configuration changed between runs
- make implementation easy for plugin developers
- make no assumption on the state _content_

The persistence will use the following steps:

- Compute a unique ID for each of the plugin _instances_
- Start up Telegraf plugins calling `Init()`, etc.
- Initialize persistence framework with the user specified `statefile` location
  and load the state if present
- Determine all stateful plugin instances by checking which of them fulfill the
  `StatefulPlugin` interface
- Restore plugin states (if any) for each plugin ID present in the state-file
- Run data-collection etc...
- On shutdown, stop all Telegraf plugins calling `Stop()` or `Close()`
  depending on the plugin type
- Query the state of all registered stateful plugins
- Create an overall state-map with the plugin instance ID as a key and the
  serialized plugin state as value
- Marshal the overall state-map and store to disk

Potential users of this functionality are plugins continuously querying
endpoints with information of a previous query (e.g. timestamps, offsets,
transaction tokens, etc.). The following plugins are known to have an internal
state. This is not a comprehensive list.

- `inputs.win_eventlog` ([PR #8281](https://github.com/influxdata/telegraf/pull/8281))
- `inputs.docker_log` ([PR #7749](https://github.com/influxdata/telegraf/pull/7749))
- `inputs.tail` (file offset)
- `inputs.cloudwatch` (`windowStart`/`windowEnd` parameters)
- `inputs.stackdriver` (`prevEnd` parameter)

### Plugin ID computation

The plugin ID is computed based on the configuration options specified for the
plugin instance. To generate the ID all settings are extracted as `string`
key-value pairs with the option name being the key and the value being the
configuration option setting. For nested configuration options, e.g. if the
plugin has a sub-table, the options are flattened with a canonical key. The
canonical key elements must be concatenated with a dot (`.`) separator. In case
the sub-element is a list of tables, the key must include the index of each
table prefixed by a hash sign i.e. `<parent>#<index>.<child>`.

The resulting key-value pairs of configuration options are then sorted by the
key in lexical order to make the resulting ID invariant against changes in the
order of configuration options. The key and the value of each pair are joined
by a colon (`:`) to a single `string`.

Finally, a SHA256 sum is computed across all key-value strings separated by a
`null` byte. The HEX representation of the resulting SHA256 is used as the
plugin instance ID.
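The computation described above can be sketched as follows. The sketch assumes the options have already been flattened into canonical keys such as `servers#0.host`, and it places the `null` byte strictly between the strings; whether the real implementation separates or terminates each string with that byte is not specified here. The names `option` and `pluginID` are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

// option is one flattened configuration setting with its canonical key.
type option struct {
	Key   string
	Value string
}

// pluginID sorts the options by key, joins each pair with a colon, separates
// the resulting strings with a null byte and returns the hex-encoded SHA256.
func pluginID(options []option) string {
	// Work on a copy so the caller's slice keeps its order.
	sorted := make([]option, len(options))
	copy(sorted, options)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Key < sorted[j].Key })

	pairs := make([]string, 0, len(sorted))
	for _, o := range sorted {
		pairs = append(pairs, o.Key+":"+o.Value)
	}
	sum := sha256.Sum256([]byte(strings.Join(pairs, "\x00")))
	return hex.EncodeToString(sum[:])
}
```

Because of the sorting step, reordering options in the configuration file yields the same ID, while changing any value yields a different one, so a stored state is no longer matched after a configuration change.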

### State serialization format

The overall Telegraf state maps the plugin IDs (keys) to the serialized state
of the corresponding plugin (values). The state data returned by stateful
plugins is serialized to JSON. The resulting byte-sequence is used as the value
for the overall state. On-disk, the overall state of Telegraf is stored as JSON.

To restore the state of a plugin, the overall Telegraf state is first
deserialized from the on-disk JSON data and a lookup for the plugin ID is
performed in the resulting map. The value, if found, is then deserialized to the
plugin's state data-structure and provided to the plugin after calling `Init()`.
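A possible shape of this two-level serialization is sketched below. It keeps each plugin state as raw JSON inside the overall map, which is an assumption about the encoding of the byte-sequence; `tailState`, `marshalStates` and `restoreState` are hypothetical names for illustration.

```go
package main

import "encoding/json"

// tailState is an example plugin state (hypothetical).
type tailState struct {
	Offset int64 `json:"offset"`
}

// marshalStates serializes each plugin state to JSON and stores the result
// under the plugin ID in the overall state, which is itself JSON.
func marshalStates(states map[string]interface{}) ([]byte, error) {
	overall := make(map[string]json.RawMessage, len(states))
	for id, state := range states {
		raw, err := json.Marshal(state)
		if err != nil {
			return nil, err
		}
		overall[id] = raw
	}
	return json.Marshal(overall)
}

// restoreState looks up the plugin ID in the overall state and, if present,
// deserializes the stored value into target. It reports whether a state
// was found.
func restoreState(data []byte, id string, target interface{}) (bool, error) {
	var overall map[string]json.RawMessage
	if err := json.Unmarshal(data, &overall); err != nil {
		return false, err
	}
	raw, found := overall[id]
	if !found {
		return false, nil
	}
	return true, json.Unmarshal(raw, target)
}
```

A plugin whose ID is not present in the file simply starts without a restored state, which matches the requirement not to restore states after configuration changes.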

## Is / Is-not

### Is

- A framework to persist states over restarts of Telegraf
- A simple local state store
- A way to restore plugin states between restarts without configuration changes
- A unified API for plugins to use when requiring persistence of a state

### Is-Not

- A remote storage framework
- A way to store anything beyond fundamental plugin states
- A data-store or database
- A way to reassign plugin states if their configuration changes
- A tool to interactively add, remove or modify states of plugins
- A persistence guarantee beyond clean shutdown (i.e. no crash resistance)

## Prior art

- [PR #8281](https://github.com/influxdata/telegraf/pull/8281): Stores Windows
  event-log bookmarks in the registry
- [PR #7749](https://github.com/influxdata/telegraf/pull/7749): Stores container
  ID and log offset to a file at a user-provided path
- [PR #7537](https://github.com/influxdata/telegraf/pull/7537): Provides a
  global state object and periodically queries plugin states to store the state
  object to a JSON file. This approach does not provide an ID per plugin
  _instance_ so it seems like there is only a single state for a plugin _type_
- [PR #9476](https://github.com/influxdata/telegraf/pull/9476): Registers
  stateful plugins with a persister and automatically assigns an ID to plugin
  _instances_ based on the configuration. The approach also allows overwriting
  the automatic ID e.g. with user specified data. It uses the plugin instance ID
  to store/restore state to the same plugin instance, queries the plugin
  state on shutdown and writes it to a file (currently JSON).
69
docs/specs/tsd-004-configuration-migration.md
Normal file
@@ -0,0 +1,69 @@
# Configuration Migration

## Objective

Provides a subcommand and framework to migrate configurations containing
deprecated settings to a corresponding recent configuration.

## Keywords

configuration, deprecation, telegraf command

## Overview

With the deprecation framework of [TSD-001](tsd-001-deprecation.md) implemented
we see more and more plugins and options being scheduled for removal in the
future. Furthermore, deprecations become visible to the user due to the warnings
issued for removed plugins, plugin options and plugin option values.

To aid the user in mitigating deprecated configuration settings this
specification proposes the implementation of a `migrate` sub-command to the
Telegraf `config` command to automatically migrate the user's existing
configuration files away from the deprecated settings to an equivalent, recent
configuration. Furthermore, the specification describes the layout and
functionality of a plugin-based migration framework to implement migrations.

### `migrate` sub-command

The `migrate` sub-command of the `config` command should take a set of
configuration files and configuration directories and apply available migrations
to deprecated plugins, plugin options or plugin option-values in order to
generate new configuration files that do not make use of deprecated options.

In the process, the migration procedure must ensure that only plugins with
applicable migrations are modified. Existing configuration must be kept and not
be overwritten without manual confirmation of the user. This should be
accomplished by storing modified configuration files with a `.migrated` suffix
and leaving it to the user to overwrite the existing configuration with the
generated counterparts. If no migration is applied in a configuration file, the
command might not generate a new file and leave the original file untouched.
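The file handling described above might look like the following sketch. It assumes the migrated content is already available as bytes; the helper `writeMigrated` and the file mode are illustrative choices, not part of the specification.

```go
package main

import (
	"bytes"
	"os"
)

// migratedSuffix is appended to the original file name for migrated output.
const migratedSuffix = ".migrated"

// writeMigrated stores the migrated configuration next to the original file
// and returns the path written. The original file is never modified. If the
// migration changed nothing, no file is written and an empty path is returned.
func writeMigrated(path string, original, migrated []byte) (string, error) {
	if bytes.Equal(original, migrated) {
		return "", nil
	}
	target := path + migratedSuffix
	// Note: an existing ".migrated" file from an earlier run is replaced.
	if err := os.WriteFile(target, migrated, 0o600); err != nil {
		return "", err
	}
	return target, nil
}
```

The user can then review a file such as `telegraf.conf.migrated` and decide whether to replace `telegraf.conf` with it.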

During migration, the configuration, plugin behavior, resulting metrics and
comments should be kept on a best-effort basis. Telegraf must inform the user
about applied migrations and potential changes in the plugin behavior or
resulting metrics. If a plugin cannot be automatically migrated but requires
manual intervention, Telegraf should inform the user.

### Migration implementations

To implement migrations for deprecated plugins, plugin options or plugin option
values, Telegraf must provide a plugin-based infrastructure to register and
apply implemented migrations based on the plugin-type. At most one migration
may be registered per plugin-type.
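One way to enforce a single migration per plugin-type is a registry that rejects duplicates. The sketch below is an assumption about how such a framework could look; the `migrationFunc` signature and the function `registerMigration` are hypothetical.

```go
package main

import "fmt"

// migrationFunc takes the configuration of one plugin instance and returns
// the migrated configuration together with a message for the user
// (hypothetical signature).
type migrationFunc func(config []byte) (migrated []byte, message string, err error)

// migrations holds the registered migration per plugin-type,
// e.g. "inputs.logparser".
var migrations = make(map[string]migrationFunc)

// registerMigration adds a migration for the given plugin-type and fails if
// one is already registered, so each plugin-type has at most one migration.
func registerMigration(pluginType string, fn migrationFunc) error {
	if _, exists := migrations[pluginType]; exists {
		return fmt.Errorf("migration for %q already registered", pluginType)
	}
	migrations[pluginType] = fn
	return nil
}
```

Extending an existing migration then means editing the one registered function, which keeps migrations cumulative instead of spreading them over several competing registrations.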

Developers must implement the required interfaces and register the migration
with the mentioned framework. The developer must provide the possibility to
exclude the migration at build-time according to
[TSD-002](tsd-002-custom-builder.md). Existing migrations can be extended but
must be cumulative such that any previous configuration migration functionality
is kept.

Resulting configurations should generate metrics equivalent to the previous
setup also making use of metric selection, renaming and filtering mechanisms.
In cases where this is not possible, the user must be clearly informed about
what to expect and which differences might occur.
A migration can also be purely informative, i.e. notify the user that a plugin
has to be migrated manually, and should then point users to additional
information.

Deprecated plugins and plugin options must be removed from the migrated
configuration.
77
docs/specs/tsd-005-output-buffer-strategy.md
Normal file
@@ -0,0 +1,77 @@
# Telegraf Output Buffer Strategy

## Objective

Introduce a new agent-level config option to choose a disk buffer strategy for
output plugin metric queues.

## Overview

Currently, when a Telegraf output metric queue fills, either due to incoming
metrics being too fast or various issues with writing to the output, the oldest
metrics are overwritten and never written to the output. This specification
defines a set of options to make this output queue more durable by persisting
pending metrics to disk rather than keeping them only in a size-limited
in-memory queue.

## Keywords

output plugins, agent configuration, persist to disk

## Agent Configuration

The configuration is at the agent-level, with options for:

- **Memory**, the current implementation, with no persistence to disk
- **Write-through**, all metrics are also written to disk using a
  Write Ahead Log (WAL) file
- **Disk-overflow**, when the memory buffer fills, metrics are flushed to a
  WAL file to avoid dropping overflow metrics

There is also an option to specify the directory in which the WAL files are
stored on disk, with a default value. These configurations are global, and no
change means memory only mode, retaining current behavior.
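The three strategies and the memory default can be modeled as a small validated type. This is only a sketch: the spec does not define the option values, so the strings `memory`, `write-through` and `disk-overflow` as well as the names `bufferStrategy` and `parseBufferStrategy` are assumptions.

```go
package main

import "fmt"

// bufferStrategy selects how pending output metrics are buffered.
type bufferStrategy string

// The value spellings below are assumed for illustration.
const (
	strategyMemory       bufferStrategy = "memory"
	strategyWriteThrough bufferStrategy = "write-through"
	strategyDiskOverflow bufferStrategy = "disk-overflow"
)

// parseBufferStrategy validates a configured value. An empty value selects
// the memory strategy so that unchanged configurations keep their behavior.
func parseBufferStrategy(value string) (bufferStrategy, error) {
	switch bufferStrategy(value) {
	case "":
		return strategyMemory, nil
	case strategyMemory, strategyWriteThrough, strategyDiskOverflow:
		return bufferStrategy(value), nil
	}
	return "", fmt.Errorf("unknown buffer strategy %q", value)
}
```

Rejecting unknown values at configuration time avoids silently falling back to memory-only buffering when the user intended persistence.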

## Metric Ordering and Tracking

Tracking metrics will be accepted on a successful write to the output
destination. Metrics will be written to their appropriate output in the order
they are received in the buffer regardless of which buffer strategy is chosen.

## Disk Utilization and File Handling

Each output plugin has its own in-memory output buffer, and therefore each
will have its own WAL file for buffer persistence. In disk-overflow mode this
file may not exist if Telegraf is able to write all of its metrics without
filling the in-memory buffer, and in memory mode it never exists.
Telegraf should use one file per output plugin, and remove entries from the
WAL file as they are written to the output.

Telegraf will not make any attempt to limit the size on disk taken by these
files beyond cleaning up WAL files for metrics that have successfully been
flushed to their output destination. It is the user's responsibility to ensure
these files do not entirely fill the disk, both during Telegraf uptime and
with lingering files from previous instances of the program.

If WAL files exist for an output plugin from previous instances of Telegraf,
they will be picked up and flushed before any new metrics that are written
to the output. This is to ensure that these metrics are not lost, and to
ensure that output write order remains consistent.

Telegraf must additionally provide a way to manually flush WAL files via
some separate plugin or similar. This could be used as a way to ensure that
WAL files are properly written in the event that the output plugin changes
and the WAL file is unable to be detected by a new instance of Telegraf.
This plugin should not be required for the buffer strategy to work.

## Is/Is-not

- Is a way to prevent metrics from being dropped due to a full memory buffer
- Is not a way to guarantee data safety in the event of a crash or system failure
- Is not a way to manage file system allocation size, file space will be used
  until the disk is full

## Prior art

[Initial issue](https://github.com/influxdata/telegraf/issues/802)
[Loose specification issue](https://github.com/influxdata/telegraf/issues/14805)
115
docs/specs/tsd-006-startup-error-behavior.md
Normal file
@@ -0,0 +1,115 @@
# Startup Error Behavior

## Objective

Unified, configurable behavior on retriable startup errors.

## Keywords

inputs, outputs, startup, error, retry

## Overview

Many Telegraf plugins connect to an external service either on the same machine
or via network. On automated startup of Telegraf (e.g. via service) there is no
guarantee that those services are fully started yet, especially when they reside
on a remote host. More and more plugins implement mechanisms to retry reaching
their related service if they failed to do so on startup.

This specification intends to unify the naming of configuration-options, the
values of those options, and their semantic meaning. It describes the behavior
for the different options on handling startup-errors.

Startup errors are all errors occurring in calls to `Start()` for inputs and
service-inputs or `Connect()` for outputs. The behaviors described below
should only be applied in cases where the plugin *explicitly* states that a
startup error is *retriable*. This includes for example network errors
indicating that the host or service is not yet reachable or external
resources, like a machine or file, which are not yet available, but might become
available later. To indicate a retriable startup error the plugin should return
a predefined error-type.
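A predefined error-type of this kind could look like the sketch below, with the agent using `errors.As` to detect it even when the error is wrapped. The type `startupError`, its fields and the helper functions are hypothetical and only illustrate the idea.

```go
package main

import "errors"

// startupError marks an error returned from Start() or Connect() and states
// whether retrying might succeed later (hypothetical type).
type startupError struct {
	Err   error
	Retry bool
}

func (e *startupError) Error() string { return e.Err.Error() }

// Unwrap exposes the underlying cause to errors.Is and errors.As.
func (e *startupError) Unwrap() error { return e.Err }

// connect simulates a plugin's Connect() method that explicitly marks an
// unreachable service as retriable.
func connect(reachable bool) error {
	if !reachable {
		return &startupError{Err: errors.New("connection refused"), Retry: true}
	}
	return nil
}

// isRetriable reports whether err is, or wraps, a startup error that the
// plugin explicitly declared as retriable.
func isRetriable(err error) bool {
	var serr *startupError
	return errors.As(err, &serr) && serr.Retry
}
```

Any error that is not of this type, or that has `Retry` unset, is treated as a regular failure, so plugins have to opt in explicitly.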
|
||||
|
||||
In cases where the error cannot be generally determined to be retriable by
|
||||
the plugin, the plugin might add configuration settings to let the user
|
||||
configure that property. For example, where an error code indicates a fatal,
|
||||
non-recoverable error in one case but a non-fatal, recoverable error in another
|
||||
case.
|
||||
|
||||
## Configuration Options and Behaviors
|
||||
|
||||
Telegraf must introduce a unified `startup_error_behavior` configuration option
|
||||
for input and output plugins. The option is handled directly by the Telegraf
|
||||
agent and is not passed down to the plugins. The setting must be available on a
|
||||
per-plugin basis and defines how Telegraf behaves on startup errors.
|
||||
|
||||
For all config option values Telegraf might retry to start the plugin for a
|
||||
limited number of times during the startup phase before actually processing
|
||||
data. This corresponds to the current behavior of Telegraf to retry three times
|
||||
with a fifteen second interval before continuing processing of the plugins.
|
||||
|
||||
### `error` behavior
|
||||
|
||||
The `error` setting for the `startup_error_behavior` option causes Telegraf to
|
||||
fail and exit on startup errors. This must be the default behavior.
|
||||
|
||||
### `retry` behavior
|
||||
|
||||
With the `retry` setting for the `startup_error_behavior` option, Telegraf must *not*
|
||||
fail on startup errors and should continue running. Telegraf must retry to
|
||||
start up the failed plugin in each gather or write cycle, for inputs or for
|
||||
outputs respectively, for an unlimited number of times. Neither the
|
||||
plugin's `Gather()` nor `Write()` method is called as long as the startup did
|
||||
not succeed. Metrics sent to an output plugin will be buffered until the plugin
|
||||
is actually started. If the metric-buffer limit is reached **metrics might be
|
||||
dropped**!
|
||||
|
||||
In case a plugin signals a partially successful startup, e.g. a subset of the
|
||||
given endpoints are reachable, Telegraf must try to fully start up the remaining
|
||||
endpoints by calling `Start()` or `Connect()`, respectively, until full startup
|
||||
is reached **and** trigger the plugin's `Gather()` or `Write()` methods.
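
The `retry` behavior for an input can be sketched as a thin wrapper that is
driven once per gather cycle. All names below (`input`, `retryingInput`,
`Cycle`, `flaky`) are hypothetical. For brevity the sketch treats every
`Start()` error as retriable and does not model partially successful startups.

```go
package main

import (
	"errors"
	"fmt"
)

// input is a hypothetical, reduced view of an input plugin.
type input interface {
	Start() error
	Gather() error
}

// retryingInput wraps an input and defers Gather() until Start() succeeded.
type retryingInput struct {
	plugin  input
	started bool
}

// Cycle is called once per gather cycle by the (hypothetical) agent loop.
// It retries Start() until it succeeds and only then calls Gather().
func (r *retryingInput) Cycle() error {
	if !r.started {
		if err := r.plugin.Start(); err != nil {
			// Not started yet: skip Gather() and retry in the next cycle.
			return fmt.Errorf("startup not complete, retrying next cycle: %w", err)
		}
		r.started = true
	}
	return r.plugin.Gather()
}

// flaky fails its first `failures` Start() calls and counts Gather() calls.
type flaky struct {
	failures int
	gathers  int
}

func (f *flaky) Start() error {
	if f.failures > 0 {
		f.failures--
		return errors.New("service not reachable")
	}
	return nil
}

func (f *flaky) Gather() error {
	f.gathers++
	return nil
}

func main() {
	p := &flaky{failures: 2}
	r := &retryingInput{plugin: p}
	for i := 1; i <= 4; i++ {
		err := r.Cycle()
		fmt.Printf("cycle %d: err=%v gathers=%d\n", i, err, p.gathers)
	}
}
```

In this sketch `Gather()` runs in the same cycle in which `Start()` first
succeeds. The text does not prescribe this, and waiting for the next cycle
would be equally valid.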
|
||||
|
||||
### `ignore` behavior
|
||||
|
||||
When using the `ignore` setting for the `startup_error_behavior` option Telegraf
|
||||
must *not* fail on startup errors and should continue running. On startup error,
|
||||
Telegraf must ignore the plugin as-if it was not configured at all, i.e. the
|
||||
plugin must be completely removed from processing.
|
||||
|
||||
### `probe` behavior
|
||||
|
||||
When using the `probe` setting for the `startup_error_behavior` option Telegraf
|
||||
must *not* fail on startup errors and should continue running. On startup error,
|
||||
Telegraf must ignore the plugin as-if it was not configured at all, i.e. the
|
||||
plugin must be completely removed from processing, similar to the `ignore`
|
||||
behavior. Additionally, Telegraf must probe the plugin (as defined in
|
||||
[TSD-009][tsd_009]) after startup, if it implements the `ProbePlugin` interface.
|
||||
If probing is available *and* returns an error Telegraf must *ignore* the
|
||||
plugin as-if it was not configured at all.
|
||||
|
||||
[tsd_009]: /docs/specs/tsd-009-probe-on-startup.md
|
||||
|
||||
## Plugin Requirements
|
||||
|
||||
Plugins participating in handling startup errors must implement the `Start()`
|
||||
or `Connect()` method for inputs and outputs respectively. Those methods must be
|
||||
safe to be called multiple times during retries without leaking resources or
|
||||
causing issues in the service used.
|
||||
|
||||
Furthermore, the `Close()` method of the plugins must be safe to be called for
|
||||
cases where the startup failed without causing panics.
|
||||
|
||||
The plugins should return a `nil` error during startup to indicate a successful
|
||||
startup or a retriable error (via predefined error type) to enable the defined
|
||||
startup error behaviors. A non-retriable error (via predefined error type) or
|
||||
a generic error will bypass the startup error behaviors and Telegraf must fail
|
||||
and exit in the startup phase.
|
||||
|
||||
## Related Issues
|
||||
|
||||
- [#8586](https://github.com/influxdata/telegraf/issues/8586) `inputs.postgresql`
|
||||
- [#9778](https://github.com/influxdata/telegraf/issues/9778) `outputs.kafka`
|
||||
- [#13278](https://github.com/influxdata/telegraf/issues/13278) `outputs.cratedb`
|
||||
- [#13746](https://github.com/influxdata/telegraf/issues/13746) `inputs.amqp_consumer`
|
||||
- [#14365](https://github.com/influxdata/telegraf/issues/14365) `outputs.postgresql`
|
||||
- [#14603](https://github.com/influxdata/telegraf/issues/14603) `inputs.nvidia-smi`
|
||||
- [#14603](https://github.com/influxdata/telegraf/issues/14603) `inputs.rocm-smi`
|
75
docs/specs/tsd-007-url-config-behavior.md
Normal file
|
@ -0,0 +1,75 @@
|
|||
# URL-Based Config Behavior
|
||||
|
||||
## Objective
|
||||
|
||||
Define the retry and reload behavior of remote URLs that are passed as config to
|
||||
Telegraf. In terms of retry, currently Telegraf will attempt to load a remote
|
||||
URL three times and then exit. In terms of reload, Telegraf does not have the
|
||||
capability to reload remote URL based configs. This spec seeks to allow for
|
||||
options for the user to further these capabilities.
|
||||
|
||||
## Keywords
|
||||
|
||||
config, error, retry, reload
|
||||
|
||||
## Overview
|
||||
|
||||
Telegraf allows for loading configurations from local files, directories, and
|
||||
files via a URL. To handle situations where a configuration file is not
|
||||
yet available or the network is flaky, the first proposal is to introduce a
|
||||
new CLI flag: `--config-url-retry-attempts`. This flag would continue to default
|
||||
to three and would specify the number of retries to attempt to get a remote URL
|
||||
during the initial startup of Telegraf.
|
||||
|
||||
```sh
|
||||
--config-url-retry-attempts=3 Number of times to attempt to obtain a remote
|
||||
configuration via a URL during startup. Set to
|
||||
-1 for unlimited attempts.
|
||||
```
|
||||
|
||||
These attempts would block Telegraf from starting up completely until success or
|
||||
until we have run out of attempts and exit.
|
||||
|
||||
Once Telegraf is up and running, users can use the `--watch` flag to enable
|
||||
watching local files for changes and if/when changes are made, then reload
|
||||
Telegraf with the new configuration. For remote URLs, I propose a new CLI flag:
|
||||
`--config-url-watch-interval`. This flag would set an internal timer that when
|
||||
it goes off, would check for an update to a remote URL file.
|
||||
|
||||
```sh
|
||||
--config-url-watch-interval=0s Time duration to check for updates to URL based
|
||||
configuration files. Disabled by default.
|
||||
```
|
||||
|
||||
At each interval, Telegraf would send an HTTP HEAD request to the configuration
|
||||
URL. The following is an example curl HEAD request and its output:
|
||||
|
||||
```sh
|
||||
$ curl --head http://localhost:8000/config.toml
|
||||
HTTP/1.0 200 OK
|
||||
Server: SimpleHTTP/0.6 Python/3.12.3
|
||||
Date: Mon, 29 Apr 2024 18:18:56 GMT
|
||||
Content-type: application/octet-stream
|
||||
Content-Length: 1336
|
||||
Last-Modified: Mon, 29 Apr 2024 11:44:19 GMT
|
||||
```
|
||||
|
||||
The proposal then is to store the last-modified value when we first obtain the
|
||||
file and compare the value at each interval. No need to parse the value, just
|
||||
store the raw string. If there is a difference, trigger a reload.
|
||||
|
||||
If anything other than a 2xx response code is returned from the HEAD request,
|
||||
Telegraf would print a warning message and retry at the next interval. Telegraf
|
||||
will continue to run the existing configuration with no change.
|
||||
|
||||
If the value of last-modified is empty, while very unlikely, then Telegraf would
|
||||
ignore this configuration file. Telegraf will print a warning message once about
|
||||
the missing field.
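
The check performed at each interval could look roughly like the following
sketch. The `urlWatcher` type and its methods are hypothetical. The decision
logic is kept separate from the HTTP request so it can be exercised without a
network, and printing the missing-header warning only once is left out.

```go
package main

import (
	"fmt"
	"net/http"
)

// urlWatcher remembers the raw Last-Modified header of a config URL,
// initialized with the value seen when the file was first obtained.
type urlWatcher struct {
	url          string
	lastModified string
}

// evaluate applies the rules from the text to one HEAD response and
// reports whether a reload should be triggered.
func (w *urlWatcher) evaluate(status int, lastModified string) (bool, error) {
	if status < 200 || status > 299 {
		return false, fmt.Errorf("unexpected status %d, keeping current config", status)
	}
	if lastModified == "" {
		return false, fmt.Errorf("no Last-Modified header, ignoring %s", w.url)
	}
	if lastModified == w.lastModified {
		return false, nil
	}
	// Compare and store the raw string, no date parsing needed.
	w.lastModified = lastModified
	return true, nil
}

// check sends the HEAD request; it would be called once per watch interval.
func (w *urlWatcher) check(client *http.Client) (bool, error) {
	resp, err := client.Head(w.url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return w.evaluate(resp.StatusCode, resp.Header.Get("Last-Modified"))
}

func main() {
	w := &urlWatcher{
		url:          "http://localhost:8000/config.toml",
		lastModified: "Mon, 29 Apr 2024 11:44:19 GMT",
	}
	fmt.Println(w.evaluate(200, "Mon, 29 Apr 2024 11:44:19 GMT")) // unchanged
	fmt.Println(w.evaluate(200, "Mon, 29 Apr 2024 12:00:00 GMT")) // changed
	fmt.Println(w.evaluate(503, ""))                              // warning only
}
```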
|
||||
|
||||
## Relevant Issues
|
||||
|
||||
* Configuration capabilities to retry for loading config via URL #[8854][]
|
||||
* Telegraf reloads URL-based/remote config on a specified interval #[8730][]
|
||||
|
||||
[8854]: https://github.com/influxdata/telegraf/issues/8854
|
||||
[8730]: https://github.com/influxdata/telegraf/issues/8730
|
78
docs/specs/tsd-008-partial-write-error-handling.md
Normal file
|
@ -0,0 +1,78 @@
|
|||
# Partial write error handling
|
||||
|
||||
## Objective
|
||||
|
||||
Provide a way to pass information about partial metric write errors from an
|
||||
output to the output model.
|
||||
|
||||
## Keywords
|
||||
|
||||
output plugins, write, error, output model, metric, buffer
|
||||
|
||||
## Overview
|
||||
|
||||
The output model wrapping each output plugin buffers metrics to be able to batch
|
||||
those metrics for more efficient sending. In each flush cycle, the model
|
||||
collects a batch of metrics and hands it over to the output plugin for writing
|
||||
through the `Write` method. Currently, if writing succeeds (i.e. no error is
|
||||
returned), _all metrics of the batch_ are removed from the buffer and are marked
|
||||
as __accepted__ both in terms of statistics as well as in tracking-metric terms.
|
||||
If writing fails (i.e. any error is returned), _all metrics of the batch_ are
|
||||
__kept__ in the buffer for requeueing them in the next write cycle.
|
||||
|
||||
Issues arise when an output plugin cannot write all metrics of a batch but only
|
||||
some to its service endpoint, e.g. due to the metrics not being serializable or if
|
||||
metrics are selectively rejected by the service on the server side. This might
|
||||
happen when reaching submission limits, violating service constraints e.g.
|
||||
by out-of-order sends, or due to invalid characters in the serialized metric.
|
||||
In those cases, an output currently is only able to accept or reject the
|
||||
_complete batch of metrics_ as there is no mechanism to inform the model (and
|
||||
in turn the buffer) that only _some_ of the metrics in the batch were failing.
|
||||
|
||||
As a consequence, outputs often _accept_ the batch to avoid a requeueing of the
|
||||
failing metrics for the next flush interval. This distorts statistics of
|
||||
accepted metrics and causes misleading log messages saying all metrics were
|
||||
written successfully, which is not true. Even worse, for outputs ending up with
|
||||
partial writes, e.g. only the first half of the metrics can be written to the
|
||||
service, there is no way of telling the model to selectively accept the actually
|
||||
written metrics and in turn those outputs must internally buffer the remaining,
|
||||
unwritten metrics leading to a duplication of buffering logic and adding to code
|
||||
complexity.
|
||||
|
||||
This specification aims at defining the handling of partially successful writes
|
||||
and introduces the concept of a special _partial write error_ type to reflect
|
||||
partial writes and partial serialization overcoming the aforementioned issues
|
||||
and limitations.
|
||||
|
||||
To do so, the _partial write error_ error type must contain a list of
|
||||
successfully written metrics, to be marked __accepted__, both in terms of
|
||||
statistics as well as in terms of metric tracking, and must be removed from the
|
||||
buffer. Furthermore, the error must contain a list of metrics that cannot be
|
||||
sent or serialized and cannot be retried. These metrics must be marked as
|
||||
__rejected__, both in terms of statistics as well as in terms of metric
|
||||
tracking, and must be removed from the buffer.
|
||||
|
||||
The error may contain a list of metrics not-yet written to be __kept__ for the
|
||||
next write cycle. Those metrics must not be marked and must be kept in the
|
||||
buffer. If the error does not contain the list of not-yet written metrics, this
|
||||
list must be inferred using the accept and reject lists mentioned above.
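
A minimal sketch of such an error type and of the inference step. The
`PartialWriteError` type with its `Accept` and `Reject` index lists and the
`keptIndices` helper are assumptions made for illustration, not an API defined
by this spec.

```go
package main

import "fmt"

// PartialWriteError is a hypothetical error type carrying batch indices.
type PartialWriteError struct {
	Err    error
	Accept []int // indices of metrics written successfully
	Reject []int // indices of metrics that can never be written
}

func (e *PartialWriteError) Error() string {
	return fmt.Sprintf("partial write: %d accepted, %d rejected: %v",
		len(e.Accept), len(e.Reject), e.Err)
}

// keptIndices infers the not-yet-written metrics: every index of the batch
// that is neither accepted nor rejected.
func keptIndices(batchSize int, e *PartialWriteError) []int {
	done := make(map[int]bool, len(e.Accept)+len(e.Reject))
	for _, i := range e.Accept {
		done[i] = true
	}
	for _, i := range e.Reject {
		done[i] = true
	}
	kept := make([]int, 0, batchSize)
	for i := 0; i < batchSize; i++ {
		if !done[i] {
			kept = append(kept, i)
		}
	}
	return kept
}

func main() {
	// Batch of five metrics: 0 and 1 were written, 3 is unserializable.
	e := &PartialWriteError{Accept: []int{0, 1}, Reject: []int{3}}
	fmt.Println(keptIndices(5, e)) // prints [2 4]
}
```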
|
||||
|
||||
To allow the model and the buffer to correctly handle tracking metrics ending up
|
||||
in the buffer and output, the tracking information must be preserved during
|
||||
communication between the output plugin, the model and the buffer through the
|
||||
specified error. To do so, all metric lists should be communicated as indices
|
||||
into the batch to be able to handle tracking metrics correctly.
|
||||
|
||||
For backward compatibility and simplicity output plugins can return a `nil`
|
||||
error to indicate that __all__ metrics of the batch are __accepted__. Similarly,
|
||||
returning an error _not_ being a _partial write error_ indicates that __all__
|
||||
metrics of the batch should be __kept__ in the buffer for the next write cycle.
|
||||
|
||||
## Related Issues
|
||||
|
||||
- [issue #11942](https://github.com/influxdata/telegraf/issues/11942) for
|
||||
contradicting log messages
|
||||
- [issue #14802](https://github.com/influxdata/telegraf/issues/14802) for
|
||||
rate-limiting multiple batch sends
|
||||
- [issue #15908](https://github.com/influxdata/telegraf/issues/15908) for
|
||||
infinite loop if single metrics cannot be written
|
68
docs/specs/tsd-009-probe-on-startup.md
Normal file
|
@ -0,0 +1,68 @@
|
|||
# Probing plugins after startup
|
||||
|
||||
## Objective
|
||||
|
||||
Allow Telegraf to probe plugins during startup to enable enhanced plugin error
|
||||
detection like availability of hardware or services.
|
||||
|
||||
## Keywords
|
||||
|
||||
inputs, outputs, startup, probe, error, ignore, behavior
|
||||
|
||||
## Overview
|
||||
|
||||
When plugins are first instantiated, Telegraf will call the plugin's `Start()`
|
||||
method (for inputs) or `Connect()` (for outputs) which will initialize its
|
||||
configuration based off of config options and the running environment. It is
|
||||
sometimes the case that while the initialization step succeeds, the upstream
|
||||
service on which the plugin relies is not actually running, or is not capable
|
||||
of being communicated with due to incorrect configuration or environmental
|
||||
problems. In situations like this, Telegraf does not detect that the plugin's
|
||||
upstream service is not functioning properly, and thus it will continually call
|
||||
the plugin during each `Gather()` iteration. This often has the effect of
|
||||
polluting journald and system logs with voluminous error messages, which creates
|
||||
issues for system administrators who rely on such logs to identify other
|
||||
unrelated system problems.
|
||||
|
||||
More background discussion on this option, including other possible avenues, can
|
||||
be viewed [here](https://github.com/influxdata/telegraf/issues/16028).
|
||||
|
||||
## Probing
|
||||
|
||||
Probing is an action whereby the plugin should ensure that the plugin will be
|
||||
fully functional on a best effort basis. This may comprise communicating with
|
||||
its external service, trying to access required devices, entities or executables
|
||||
etc. to ensure that the plugin will not produce errors during e.g. data collection
|
||||
or data output. Probing must *not* produce, process or output any metrics.
|
||||
|
||||
Plugins that support probing must implement the `ProbePlugin` interface. Such
|
||||
plugins must behave in the following manner:
|
||||
|
||||
1. Return an error if the external dependencies (hardware, services,
|
||||
executables, etc.) of the plugin are not available.
|
||||
2. Return an error if information cannot be gathered (in the case of inputs) or
|
||||
sent (in the case of outputs) due to unrecoverable issues. For example, invalid
|
||||
authentication, missing permissions, or non-existent endpoints.
|
||||
3. Otherwise, return `nil` indicating the plugin will be fully functional.
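
As an illustration of the list above, the following sketch shows a plugin that
depends on an external executable. The single-method shape of `ProbePlugin`
and the `smiInput` type are assumptions made for this example. The check uses
`exec.LookPath` from the standard library and produces no metrics.

```go
package main

import (
	"fmt"
	"os/exec"
)

// ProbePlugin is an assumed shape of the interface named in the text.
type ProbePlugin interface {
	Probe() error
}

// smiInput is a hypothetical input that depends on an external executable.
type smiInput struct {
	binary string
}

// Probe checks that the executable exists without collecting any metrics.
func (s *smiInput) Probe() error {
	if _, err := exec.LookPath(s.binary); err != nil {
		return fmt.Errorf("required executable %q not available: %w", s.binary, err)
	}
	return nil
}

func main() {
	var p ProbePlugin = &smiInput{binary: "no-such-binary-for-probe-example"}
	if err := p.Probe(); err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("probe succeeded")
}
```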
|
||||
|
||||
## Plugin Requirements
|
||||
|
||||
Plugins that allow probing must implement the `ProbePlugin` interface. The
|
||||
exact implementation depends on the plugin's functionality and requirements,
|
||||
but generally it should take the same actions as it would during normal operation
|
||||
e.g. calling `Gather()` or `Write()` and check if errors occur. If probing fails,
|
||||
it must be safe to call the plugin's `Close()` method.
|
||||
|
||||
Input plugins must *not* produce metrics, output plugins must *not* send any
|
||||
metrics to the service. Plugins must *not* influence the later data processing or
|
||||
collection by modifying the internal state of the plugin or the external state of the
|
||||
service or hardware. For example, file-offsets or other service states must be
|
||||
reset to not lose data during the first gather or write cycle.
|
||||
|
||||
Plugins must return `nil` upon successful probing or an error otherwise.
|
||||
|
||||
## Related Issues
|
||||
|
||||
- [#16028](https://github.com/influxdata/telegraf/issues/16028)
|
||||
- [#15916](https://github.com/influxdata/telegraf/pull/15916)
|
||||
- [#16001](https://github.com/influxdata/telegraf/pull/16001)
|