Adding upstream version 1.34.4.
Signed-off-by: Daniel Baumann <daniel@debian.org>
This commit is contained in:
parent
e393c3af3f
commit
4978089aab
4963 changed files with 677545 additions and 0 deletions
110
plugins/inputs/amd_rocm_smi/README.md
Normal file
110
plugins/inputs/amd_rocm_smi/README.md
Normal file
|
@ -0,0 +1,110 @@
|
|||
# AMD ROCm System Management Interface (SMI) Input Plugin
|
||||
|
||||
This plugin gathers statistics including memory and GPU usage, temperatures
|
||||
etc from [AMD ROCm platform][amd_rocm] GPUs.
|
||||
|
||||
> [!IMPORTANT]
|
||||
> The [`rocm-smi` binary][binary] is required and needs to be installed on the
|
||||
> system.
|
||||
|
||||
⭐ Telegraf v1.20.0
|
||||
🏷️ hardware, system
|
||||
💻 all
|
||||
|
||||
[amd_rocm]: https://rocm.docs.amd.com/
|
||||
[binary]: https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/master/python_smi_tools
|
||||
|
||||
## Global configuration options <!-- @/docs/includes/plugin_config.md -->
|
||||
|
||||
In addition to the plugin-specific configuration settings, plugins support
|
||||
additional global and plugin configuration settings. These settings are used to
|
||||
modify metrics, tags, and field or create aliases and configure ordering, etc.
|
||||
See the [CONFIGURATION.md][CONFIGURATION.md] for more details.
|
||||
|
||||
[CONFIGURATION.md]: ../../../docs/CONFIGURATION.md#plugins
|
||||
|
||||
## Startup error behavior options
|
||||
|
||||
In addition to the plugin-specific and global configuration settings the plugin
|
||||
supports options for specifying the behavior when experiencing startup errors
|
||||
using the `startup_error_behavior` setting. Available values are:
|
||||
|
||||
- `error`: Telegraf with stop and exit in case of startup errors. This is the
|
||||
default behavior.
|
||||
- `ignore`: Telegraf will ignore startup errors for this plugin and disables it
|
||||
but continues processing for all other plugins.
|
||||
- `retry`: NOT AVAILABLE
|
||||
|
||||
## Configuration
|
||||
|
||||
```toml @sample.conf
|
||||
# Query statistics from AMD Graphics cards using rocm-smi binary
|
||||
[[inputs.amd_rocm_smi]]
|
||||
## Optional: path to rocm-smi binary, defaults to $PATH via exec.LookPath
|
||||
# bin_path = "/opt/rocm/bin/rocm-smi"
|
||||
|
||||
## Optional: timeout for GPU polling
|
||||
# timeout = "5s"
|
||||
```
|
||||
|
||||
## Metrics
|
||||
|
||||
- measurement: `amd_rocm_smi`
|
||||
- tags
|
||||
- `name` (entry name assigned by rocm-smi executable)
|
||||
- `gpu_id` (id of the GPU according to rocm-smi)
|
||||
- `gpu_unique_id` (unique id of the GPU)
|
||||
|
||||
- fields
|
||||
- `driver_version` (integer)
|
||||
- `fan_speed` (integer)
|
||||
- `memory_total` (integer, B)
|
||||
- `memory_used` (integer, B)
|
||||
- `memory_free` (integer, B)
|
||||
- `temperature_sensor_edge` (float, Celsius)
|
||||
- `temperature_sensor_junction` (float, Celsius)
|
||||
- `temperature_sensor_memory` (float, Celsius)
|
||||
- `utilization_gpu` (integer, percentage)
|
||||
- `utilization_memory` (integer, percentage)
|
||||
- `clocks_current_sm` (integer, Mhz)
|
||||
- `clocks_current_memory` (integer, Mhz)
|
||||
- `clocks_current_display` (integer, Mhz)
|
||||
- `clocks_current_fabric` (integer, Mhz)
|
||||
- `clocks_current_system` (integer, Mhz)
|
||||
- `power_draw` (float, Watt)
|
||||
- `card_series` (string)
|
||||
- `card_model` (string)
|
||||
- `card_vendor` (string)
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
Check the full output by running `rocm-smi` binary manually.
|
||||
|
||||
Linux:
|
||||
|
||||
```sh
|
||||
rocm-smi rocm-smi -o -l -m -M -g -c -t -u -i -f -p -P -s -S -v --showreplaycount --showpids --showdriverversion --showmemvendor --showfwinfo --showproductname --showserial --showuniqueid --showbus --showpendingpages --showpagesinfo --showretiredpages --showunreservablepages --showmemuse --showvoltage --showtopo --showtopoweight --showtopohops --showtopotype --showtoponuma --showmeminfo all --json
|
||||
```
|
||||
|
||||
Please include the output of this command if opening a GitHub issue, together
|
||||
with ROCm version.
|
||||
|
||||
## Example Output
|
||||
|
||||
```text
|
||||
amd_rocm_smi,gpu_id=0x6861,gpu_unique_id=0x2150e7d042a1124,host=ali47xl,name=card0 clocks_current_memory=167i,clocks_current_sm=852i,driver_version=51114i,fan_speed=14i,memory_free=17145282560i,memory_total=17163091968i,memory_used=17809408i,power_draw=7,temperature_sensor_edge=28,temperature_sensor_junction=29,temperature_sensor_memory=92,utilization_gpu=0i 1630572551000000000
|
||||
amd_rocm_smi,gpu_id=0x6861,gpu_unique_id=0x2150e7d042a1124,host=ali47xl,name=card0 clocks_current_memory=167i,clocks_current_sm=852i,driver_version=51114i,fan_speed=14i,memory_free=17145282560i,memory_total=17163091968i,memory_used=17809408i,power_draw=7,temperature_sensor_edge=29,temperature_sensor_junction=30,temperature_sensor_memory=91,utilization_gpu=0i 1630572701000000000
|
||||
amd_rocm_smi,gpu_id=0x6861,gpu_unique_id=0x2150e7d042a1124,host=ali47xl,name=card0 clocks_current_memory=167i,clocks_current_sm=852i,driver_version=51114i,fan_speed=14i,memory_free=17145282560i,memory_total=17163091968i,memory_used=17809408i,power_draw=7,temperature_sensor_edge=29,temperature_sensor_junction=29,temperature_sensor_memory=92,utilization_gpu=0i 1630572749000000000
|
||||
```
|
||||
|
||||
## Limitations and notices
|
||||
|
||||
Please notice that this plugin has been developed and tested on a limited number
|
||||
of versions and small set of GPUs. Currently the latest ROCm version tested is
|
||||
4.3.0. Notice that depending on the device and driver versions the amount of
|
||||
information provided by `rocm-smi` can vary so that some fields would start/stop
|
||||
appearing in the metrics upon updates. The `rocm-smi` JSON output is not
|
||||
perfectly homogeneous and is possibly changing in the future, hence parsing and
|
||||
unmarshalling can start failing upon updating ROCm.
|
||||
|
||||
Inspired by the current state of the art of the `nvidia-smi` plugin.
|
Loading…
Add table
Add a link
Reference in a new issue