path: root/docs
Commit message    Author    Age
* Merge pull request #1657 from povilasv/NodeTextFileCollectorScrapeError    Frederic Branczyk    2020-04-30
|\
| |     Add NodeTextFileCollectorScrapeError alert to mixin
| * Add NodeTextFileCollectorScrapeError alert to mixin    Povilas Versockas    2020-03-31
| |     Signed-off-by: Povilas Versockas <p.versockas@gmail.com>
* | fix typo in TIME.md (#1670)    jangdm    2020-04-09
| |     fix typo in TIME.md
| |
| |     Signed-off-by: jangdm <jamin4@naver.com>
* | Add more compatible rules    WOO CHANG HO    2020-04-08
| |     Signed-off-by: zodiac12k <zodiac12k@gmail.com>
* | Fix sign error in `NodeClockSkewDetected`    beorn7    2020-03-25
| |     Signed-off-by: beorn7 <beorn@grafana.com>
* | docs/node-mixin: alert on desynchronised clock    paulfantom    2020-03-23
|/
|       Signed-off-by: paulfantom <pawel@krupa.net.pl>
* Add missing coma    Neraud    2020-03-21
|       Signed-off-by: Neraud <neraud.login@gmail.com>
* Add NodeHighNumberConntrackEntriesUsed    Povilas Versockas    2020-03-20
|       Signed-off-by: Povilas Versockas <p.versockas@gmail.com>
* Make FS space alerts thresholds configurable (#1624)    iuri aranda    2020-03-02
|       * Make FS space alerts thresholds configurable (#1)
|
|       This makes it possible to tweak the thresholds for the
|       NodeFilesystemSpaceFillingUp alerts. Which might be necessary in
|       systems like Kubernetes, where the image garbage collector runs at
|       85%, so it's not a problem that the disk reaches that usage %.
|
|       Signed-off-by: iuri aranda <iuri@skyscrapers.eu>
* docs/node-mixin/dashboards: do not mix tabs and spaces    paulfantom    2019-11-01
|       Signed-off-by: paulfantom <pawel@krupa.net.pl>
* Fix the normalization for the cluster-wide dashboards    beorn7    2019-10-30
|       We actually have to count or sum, respectively, _all_ the selected
|       metrics for the cluster-wide view. Which means it's easiest to use
|       the `scalar` approach after all (but only in the cluster
|       dashboard). This still propagates all the labels.
|
|       I have extended the comment for the `nodeExporterSelector` to note
|       that the cluster dashboard only makes sense if all the selected
|       node exporter actually belong to the same cluster.
|
|       Since this is jsonnet, users can easily disable the cluster
|       dashboard. Or even create multiple instances of the dashboards
|       with different `nodeExporterSelector`s for different clusters.
|
|       Signed-off-by: beorn7 <beorn@grafana.com>
* docs/node-mixin: Improve memory pressure rule    Benoît Knecht    2019-10-28
|       The `instance:node_memory_swap_io_pages:rate1m` rule was intended
|       to measure the amount of memory pressure a system is under, but
|       its name is a bit misleading (it specifically refers to swap), and
|       the rate of `node_vmstat_pgmajfault` is a better metric for memory
|       pressure (see #1524).
|
|       This commit renames `instance:node_memory_swap_io_pages:rate1m` to
|       `instance:node_vmstat_pgmajfault:rate1m`, and defines it as
|       `rate(node_vmstat_pgmajfault{%(nodeExporterSelector)s}[1m])`. The
|       dashboards are updated accordingly.
|
|       Signed-off-by: Benoît Knecht <benoit.knecht@fsfe.org>
* Two quick typo fixes    Scott Brenner    2019-10-09
|       Signed-off-by: Scott Brenner <scott@scottbrenner.me>
* Merge pull request #1482 from leojonathanoh/fix-node-mixin-prometheus-alert-rules-to-use-percentage    Björn Rabenstein    2019-09-26
|\
| |     Fix node-mixin prometheus alert rules to use percentage
| * Fix node-mixin prometheus alert rules to use percentage    Leo    2019-09-11
| |     Signed-off-by: Leo <leonardjonathanoh@live.com>
* | node-mixin: fix configuration for unset fsSelector/diskDeviceSelector    Sergiusz Urbaniak    2019-09-12
| |     As per
| |     https://github.com/prometheus/node_exporter/pull/1429#discussion_r304210103
| |     we want to fetch all devices and all fs types.
| |
| |     Currently, this is done by setting empty string which breaks most
| |     queries which rely on it.
| |
| |     This fixes it by setting the appropriate selector instead of empty
| |     string.
| |
| |     Signed-off-by: Sergiusz Urbaniak <sergiusz.urbaniak@gmail.com>
* | node-mixin: fix query in Disk Space Utilisation dashboard    Sergiusz Urbaniak    2019-09-12
|/
|       Signed-off-by: Sergiusz Urbaniak <sergiusz.urbaniak@gmail.com>
* Node mixin: Clarify dashboard dependency on rules (#1475)    Björn Rabenstein    2019-09-08
|       Following @discordianfish's suggestion
|       [here](https://github.com/prometheus/node_exporter/issues/1454#issuecomment-524225222).
|
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Update legendLink    beorn7    2019-08-20
|       This still had the 'k8s' in as it was copied and pasted from the
|       kubernetes-mixin.
|
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Merge pull request #1449 from prometheus/beorn7/mixin3    Björn Rabenstein    2019-08-19
|\
| |     node-mixin: Make the severity of "critical" alerts configurable
| * Make the severity of "critical" alerts configurable    beorn7    2019-08-14
| |     This addresses the blissful scenario where single-node failures are
| |     unproblematic. No reason to wake somebody up if a node is about to
| |     screw itself up by filling the disk.
| |
| |     Signed-off-by: beorn7 <beorn@grafana.com>
* | Add line for number of cores to load graph    beorn7    2019-08-15
| |     Backported from the node dashboard in the kubernetes-mixin.
| |
| |     Signed-off-by: beorn7 <beorn@grafana.com>
* | Fix title of CPU panel to usage    beorn7    2019-08-15
| |     We use the `mode="idle"` metric, but we are inverting it, so this
| |     is usage, and that's intended.
| |
| |     Signed-off-by: beorn7 <beorn@grafana.com>
* | node-mixin: Improve disk usage panel    beorn7    2019-08-15
| |     - Use a stacked graph instead of a gauge as development over time
| |       is especially useful for disk space usage.
| |     - By only taking one metric per device into account, we avoid
| |       double-counting for devices that are mounted multiple times.
| |
| |     Signed-off-by: beorn7 <beorn@grafana.com>
* | node-mxin: Improve nodes dashboard (#1448)    Björn Rabenstein    2019-08-15
| |     * node-mixin: Improve nodes dashboard
| |
| |     - Use stacking where it makes sense.
| |     - Normalize idle CPU so that stacking is more meaningful.
| |     - Consistently fill where stacking is used but don't fill where not.
| |     - Fix y axis max value for Idle CPU panel.
| |     - Fix y axis min value for memory usage panel.
| |     - Use `$__interval` for range where applicable (and set min step
| |       to 1m).
| |     - Make the right Y axis for disk I/O actually work.
| |
| |     This is just an incremental improvements. It doesn't touch the more
| |     involved TODOs.
| |
| |     Signed-off-by: beorn7 <beorn@grafana.com>
* | node-mixin: Fix various straight-forward issues in the USE dashboards    beorn7    2019-08-13
|/
|       - Normalize cluster memory utilisation.
|       - Fix missing `1m` in memory saturation.
|       - Have both disk-related row next to each other instead with the
|         network row in between.
|       - Correctly render transmit network traffic as negative, using
|         `seriesOverrides` and `min: null` for the y-axis.
|       - Make panel and row naming consistent.
|       - Remove legend where it would just display a single entry with
|         exactly the title of the panel.
|       - Fix metric name in individual node CPU Saturation panel.
|       - Break up disk space utilisation by device in the panel for an
|         individual node.
|
|       NB: All of that doesn't touch any more subtle issues captured in
|       the various TODOs.
|
|       Signed-off-by: beorn7 <beorn@grafana.com>
* docs/node-mixin: move fsSelector and diskDeviceSelector to the end of query    paulfantom    2019-07-24
|       This will cause a query to be valid even if values of selector are
|       empty. Additionally fixing query responsible for disk space usage.
|
|       Signed-off-by: paulfantom <pawel@krupa.net.pl>
* Added `_excluding_lo` to name of network rules that exclude lo    beorn7    2019-07-22
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Improvement of comments and panel titles    beorn7    2019-07-22
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Break out device in disk IO rules/dashboard    beorn7    2019-07-18
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Removed unneeded `sum_` and `avg_` from rule names    beorn7    2019-07-18
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Responses to review comments, round 3    beorn7    2019-07-17
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Convert annotations from message to summary/description    beorn7    2019-07-16
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Address review comments, batch 2    beorn7    2019-07-16
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Make more use of config.libsonnet    beorn7    2019-07-16
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Address first batch of old review comments    beorn7    2019-07-16
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Make selector naming consistent    beorn7    2019-07-10
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Fix indentation    beorn7    2019-07-10
|       Signed-off-by: beorn7 <beorn@grafana.com>
* (Re-)adjust to Grafana gauge expecting percentage 0-100 (rather than 1-0)    beorn7    2019-07-10
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Use promgrafonnet as a vendored library from its source    beorn7    2019-07-06
|       The only deviation that happened so far is to use
|       format="percentunit" in a Grafana gauge. This change wasn't even
|       properly used in this repo so far, so I opted to stick with
|       "upstream" for now. If changes are really needed, we can try to
|       change upstream first.
|
|       Another change was done in parallal here and upstream, but it was
|       "more correct" in upstream. (Change datasource to $datasource
|       variable, only partially applied here.) Which is another point for
|       using the upstream and not copy it here.
|
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Add README.md    beorn7    2019-07-06
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Add Makefile to easily make output files and lint sources    beorn7    2019-07-06
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Create jsonnet files to create output files    beorn7    2019-07-06
|       This allows to create YAML files with rules and JSON files with
|       dashboard descriptions.
|
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Update vendoring to current location of jsonnet-libs    beorn7    2019-07-06
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Move node-mixin into docs directory    beorn7    2019-07-05
|       Signed-off-by: beorn7 <beorn@grafana.com>
* Add compat rules for node_time, node_memory_ShmemHugePages and node_memory_ShmemPmdMapped (#1138)    Cougar    2018-11-05
|       Signed-off-by: Cougar <cougar@random.ee>
* Fix supervisord collector (#978)    Ben Kochie    2018-08-06
|       * Replace supervisord xmlrpc library
|       * Use `github.com/mattn/go-xmlrpc` that doesn't leak goroutines.
|       * Fix uptime metric
|       * Use Prometheus best practices for uptime metric.
|       * Use "start time" rather than "uptime".
|       * Don't emit a start time if the process is down.
|       * Add changelog entry.
|       * Add example compatibility rules.
|
|       Signed-off-by: Ben Kochie <superq@gmail.com>
* Fix sample rules for migration (#1022)    Rene Treffer    2018-07-27
|       - add conversion from _ms to _seconds on disk metrics
|       - add missing node_textfile_mtime section
|       - add groups: header to pass promtool check rules
|
|       Signed-off-by: Rene Treffer <rene.treffer@soundcloud.com>
* Add example of translating new metrics to old format in case of migration to 1.16 version (#982)    Ivan Kiselev    2018-07-02
|       Add additional example of how to save old metrics
|
|       Signed-off-by: Ivan Kiselev <ivan@messagebird.com>
* Add compat rules for filesystem collector. (#973)    Roman Vynar    2018-06-13
|       Signed-off-by: Roman Vynar <roman.vynar@goquiq.com>