Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for device types and predictable device paths #235

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Informatic
Copy link

@Informatic Informatic commented Jul 14, 2024

This is a couple of tightly related changes:

  • Adds a new --smartctl.scan-device-type command line option allowing for customization of autodetected device types and enables use of special by-id device type that forces use of predictable device paths (/dev/disk/by-id/...)
  • Reworks device label generation - this is now based off device.name and device.type instead of relying on device.info_name - this changes label since Added determining device type and use it at scrape data #205, but stays compatible with latest stable release, and is probably the best way forward (though I am open for suggestions - I was thinking of adding an extra option to still output device path and type as separate labels)
    • Added some label generation tests as an example.
  • Adds an option of specifying device type in --smartctl.device call via semicolon separator
    • Example: --smartctl.device=/dev/bus/0;megaraid,0 will result in: smartctl -d megaraid,0 /dev/bus/0 and output device="bus_0_megaraid_0" metric label)
  • Brings back previous autoscan behaviour - specifying --smartctl.device will disable autodiscovery.
    • Autodiscovery can still be forcibly reenabled using explicit --smartctl.scan, for whatever reason.
  • Fixes SATA device discovery broken by aforementioned PR by forcing auto type on autodetected scsi drives.

Fixes #134, #230, #89, probably #229 too...

We will now log a warning if smartctl path passed via command line is invalid.

Signed-off-by: Piotr Dobrowolski <[email protected]>
This adds a new command line option allowing for customization of
autodetected device types and enables use of special "by-id" device type
that forces use of predictable device paths (/dev/disk/by-id/...)

Relevant change to device name parsing regular expression is included
now, so predictable device paths are now also usable when directly
specified.

Signed-off-by: Piotr Dobrowolski <[email protected]>
@Informatic Informatic force-pushed the scan-device-type branch 3 times, most recently from 8331cff to 76cae35 Compare July 14, 2024 21:31
@Informatic Informatic changed the title Add support for autoscan device types and predictable device paths Add support for device types and predictable device paths Jul 14, 2024
@huv95
Copy link

huv95 commented Aug 8, 2024

Anyone here? I think this feature very useful and I need it, I wonder what other issues need to be reviewed before we can merge and release?

@jakubgs
Copy link

jakubgs commented Aug 11, 2024

I have attempted to use the --smartctl.scan-device-type with a server that has a Smart Array P440ar RAID card:

 > sudo ./smartctl_exporter --smartctl.scan-device-type=cciss  
caller=main.go:188 level=info msg="Starting smartctl_exporter" version="(version=, branch=, revision=319184ce66d88b706d4ce4939b68e0c68a712b0e)"
caller=main.go:189 level=info msg="Build context" build_context="(go=go1.22.2, platform=linux/amd64, user=, date=, tags=unknown)"
caller=main.go:199 level=info msg="Number of devices found" count=0

But no devices have been found. But if I specify a device explicitly it does work:

 > sudo ./smartctl_exporter --smartctl.device='/dev/sda;cciss,0'
caller=main.go:188 level=info msg="Starting smartctl_exporter" version="(version=, branch=, revision=319184ce66d88b706d4ce4939b68e0c68a712b0e)"
caller=main.go:189 level=info msg="Build context" build_context="(go=go1.22.2, platform=linux/amd64, user=, date=, tags=unknown)"
caller=main.go:203 level=info msg="Devices specified" devices=/dev/sda;cciss,0
caller=main.go:205 level=info msg="Devices filtered" count=1

Which does work:

[email protected]:~ % c 0:9633/metrics | grep smartctl_device_temperature
# HELP smartctl_device_temperature Device temperature celsius
# TYPE smartctl_device_temperature gauge
smartctl_device_temperature{device="sda",temperature_type="current"} 25
smartctl_device_temperature{device="sda",temperature_type="drive_trip"} 70

But it means each device has to be specified manually. Which is still better than what we had before, so good improvement.

I know I can identify all the devices using lsblk:

 > lsblk --json -O | jq -c '.blockdevices[] | { path, hctl, subsystems }'
{"path":"/dev/sda","hctl":"0:1:0:0","subsystems":"block:scsi:pci"}
{"path":"/dev/sdb","hctl":"0:1:0:1","subsystems":"block:scsi:pci"}
{"path":"/dev/sdc","hctl":"0:1:0:2","subsystems":"block:scsi:pci"}

So it can be automated with Ansible.

@jakubgs
Copy link

jakubgs commented Aug 11, 2024

If I use --smartctl.scan-device-type=scsi I get this:

level=warn msg="S.M.A.R.T. output reading" err="exit status 1" device="/dev/sda;auto (sda)"
level=error msg="Command line did not parse" device="/dev/sda;auto (sda)"
level=error msg="/dev/sda: requires option '-d cciss,N'"
level=warn msg="S.M.A.R.T. output reading" err="exit status 1" device="/dev/sdb;auto (sdb)"
level=error msg="Command line did not parse" device="/dev/sdb;auto (sdb)"
level=error msg="/dev/sdb: requires option '-d cciss,N'"
level=warn msg="S.M.A.R.T. output reading" err="exit status 1" device="/dev/sdc;auto (sdc)"
level=error msg="Command line did not parse" device="/dev/sdc;auto (sdc)"
level=error msg="/dev/sdc: requires option '-d cciss,N'"

Which is progress, but not sure if it can work with just a scan. Though a separate PR could possibly extend the functionality with parsing of outpuit from lsblk --json -O which does provide hctl IDs for the devices.

@jakubgs
Copy link

jakubgs commented Aug 11, 2024

Forgot to add that I really appreciate your work on this. Issues with lack of metrics for cciss devices have been an annoyance for me for a long while.

@k0ste
Copy link
Contributor

k0ste commented Aug 11, 2024

@jakubgs you should provide the output from smartctl --scan --json, to possibility to debug this 🦫

@jakubgs
Copy link

jakubgs commented Aug 11, 2024

Sure thing, here you go:

smartctl --scan --json
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      2
    ],
    "svn_revision": "5155",
    "platform_info": "x86_64-linux-5.15.0-106-generic",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--scan",
      "--json"
    ],
    "exit_status": 0
  }
}
smartctl --json --scan --device scsi
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      2
    ],
    "svn_revision": "5155",
    "platform_info": "x86_64-linux-5.15.0-106-generic",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--json",
      "--scan",
      "--device",
      "scsi"
    ],
    "exit_status": 0
  },
  "devices": [
    {
      "name": "/dev/sda",
      "info_name": "/dev/sda",
      "type": "scsi",
      "protocol": "SCSI"
    },
    {
      "name": "/dev/sdb",
      "info_name": "/dev/sdb",
      "type": "scsi",
      "protocol": "SCSI"
    },
    {
      "name": "/dev/sdc",
      "info_name": "/dev/sdc",
      "type": "scsi",
      "protocol": "SCSI"
    }
  ]
}

@k0ste
Copy link
Contributor

k0ste commented Aug 11, 2024

Sure thing, here you go:

Don't know exactly, but looks very strange. By default smartctl scan should report about all devices, and --device is a filter. Look what I mean:

[root@host:/]# smartctl --version | grep release
smartmontools release 7.4 dated 2023-08-01 at 10:59:45 UTC
[root@host:/]# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
/dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
/dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
/dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
/dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device
/dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
/dev/nvme1 -d nvme # /dev/nvme1, NVMe device

But your output return None. May be you should try with smartctl 7.4? Don't forget sudo!

@jakubgs
Copy link

jakubgs commented Aug 11, 2024

I don't think I can easily install 7.4 on Ubuntu 22.04 since it depends on some pretty core stuff:

 > sudo dpkg -i smartmontools_7.4-2build1_amd64.deb
(Reading database ... 200369 files and directories currently installed.)
Preparing to unpack smartmontools_7.4-2build1_amd64.deb ...
Unpacking smartmontools (7.4-2build1) over (7.2-1ubuntu0.1) ...
dpkg: dependency problems prevent configuration of smartmontools:
 smartmontools depends on sysvinit-utils (>= 3.05-4~); however:
  Version of sysvinit-utils on system is 3.01-1ubuntu1.
 smartmontools depends on libc6 (>= 2.38); however:
  Version of libc6:amd64 on system is 2.35-0ubuntu3.7.

@Informatic
Copy link
Author

Regardless of smartctl --scan --json returns in this case, if you're looking for autodetection of cciss devices it's simply not supported in smartmontools: https://github.com/smartmontools/smartmontools/blob/master/smartmontools/os_linux.cpp#L3082-L3148 - if I understand it correctly, smartmontools does pick the devices up during SCSI scan if you explicitly ask it to, but then intentionally fails when trying to open it without specifying the cciss id: https://github.com/smartmontools/smartmontools/blob/master/smartmontools/os_linux.cpp#L3483

@k0ste
Copy link
Contributor

k0ste commented Aug 11, 2024

I don't think I can easily install 7.4 on Ubuntu 22.04 since it depends on some pretty core stuff

You always can build packages for yourself, or run another distro for debugging purposes

Regardless of smartctl --scan --json returns in this case, if you're looking for autodetection of cciss devices it's simply not supported in smartmontools: https://github.com/smartmontools/smartmontools/blob/master/smartmontools/os_linux.cpp#L3082-L3148

And seems there are the place https://github.com/smartmontools/smartmontools/blob/master/smartmontools/os_linux.cpp#L3137 - this is why we seen megaraid devices in list 'by default'

@k0ste
Copy link
Contributor

k0ste commented Aug 23, 2024

@NiceGuyIT @robbat2 @SuperQ

I was tested this patch over the master. For me - works as expected 🤝

  • works with HBA (SATA HDD)
  • works with megaraid in HBA mode (SATA HDD)

Only one breaking change that I found, is, for who's don't need megaraid at all:

smartctl.device-exclude="^/dev/bus/[0-9]+$"
must be changed to:
smartctl.device-exclude="(.*)bus(.*)" (or "(.*)megaraid(.*)")

I think should be merged, and then we can take a look to #204 and release 0.13 📦

Copy link

@jakubgs jakubgs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I think this PR solves a lot of long standing issues. My special case is not covered by smartctl so it can be ignored.

@robbat2
Copy link
Contributor

robbat2 commented Aug 23, 2024

@Informatic can you include some before & after examples of the device labels? I'm worried users may depend on specific values there, and I wouldn't want it to change and surprise them.

@k0ste
Copy link
Contributor

k0ste commented Aug 23, 2024

@Informatic can you include some before & after examples of the device labels? I'm worried users may depend on specific values there, and I wouldn't want it to change and surprise them.

The label device, changed only if admin added such special option:

--smartctl.scan-device-type=by-id

By default the device label == linux block device name (as expected)

@NiceGuyIT
Copy link
Member

@k0ste Sorry I've been away for a while. Sometimes life happens.

I was going to move the last merge request that broke things behind an "experimental" flag for people to test. I don't think it's realistic to test a pull request against all configurations. This way the experimental changes can be tested in the field without affecting the stable code.

I'll look at this in the next few days. More than likely, I'll put this behind an --experimental-device-scanning (or similar wording) flag.

@Informatic
Copy link
Author

Informatic commented Aug 27, 2024 via email

@k0ste
Copy link
Contributor

k0ste commented Oct 1, 2024

@k0ste Sorry I've been away for a while. Sometimes life happens.

We still have a chance to make second release in this year 🥈

@jakubgs
Copy link

jakubgs commented Oct 2, 2024

More than likely, I'll put this behind an --experimental-device-scanning (or similar wording) flag.

Or you could just make a Release Candidate and have people try it out. But yeah, it is a significant change.

@k0ste k0ste mentioned this pull request Oct 16, 2024
Copy link
Contributor

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a rebase after the change to the new logging library.

@k0ste
Copy link
Contributor

k0ste commented Nov 12, 2024

This needs a rebase after the change to the new logging library.

@Informatic 🙏🏼

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Device nodes are not guaranteed to be consistent over time
7 participants