Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cleaner] Skip obfuscation of substrings for hostnames properly #3850

Merged

Conversation

pmoravec
Copy link
Contributor

Leftover from #3403 / #3496 where match_full_words_only is ignored for hostname mapping due to own get_regex_result method.

Resolves: #3593


Please place an 'X' inside each '[]' to confirm you adhere to our Contributor Guidelines

  • Is the commit message split over multiple lines and hard-wrapped at 72 characters?
  • Is the subject and message clear and concise?
  • Does the subject start with [plugin_name] if submitting a plugin patch or a [section_name] if part of the core sosreport code?
  • Does the commit contain a Signed-off-by: First Lastname [email protected]?
  • Are any related Issues or existing PRs properly referenced via a Closes (Issue) or Resolved (PR) line?
  • Are all passwords or private data gathered by this PR obfuscated?

pmoravec added a commit to pmoravec/sos that referenced this pull request Nov 22, 2024
Leftover from sosreport#3403 / sosreport#3496 where match_full_words_only is ignored for
hostname mapping due to own get_regex_result method.

Relevant: sosreport#3593
Resolves: sosreport#3850

Signed-off-by: Pavel Moravec <[email protected]>
@pmoravec pmoravec force-pushed the sos-pmoravec-cleaner-hostname-fullstring-2 branch from 8dbae3f to 5a4f38a Compare November 22, 2024 07:56
Copy link

Congratulations! One of the builds has completed. 🍾

You can install the built RPMs by following these steps:

  • sudo yum install -y dnf-plugins-core on RHEL 8
  • sudo dnf install -y dnf-plugins-core on Fedora
  • dnf copr enable packit/sosreport-sos-3850
  • And now you can install the packages.

Please note that the RPMs should be used only in a testing environment.

@@ -87,7 +87,7 @@ def get_regex_result(self, item):
"""
if '.' in item:
item = item.replace('.', '(\\.|_)')
return re.compile(item, re.I)
return super(SoSHostnameMap, self).get_regex_result(item)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to return that way, perhaps add:

# pylint: disable=super-with-arguments

To the end of the line?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or better, and then we don't need to disable it :)

return super().get_regex_result(item)

Leftover from sosreport#3403 / sosreport#3496 where match_full_words_only is ignored for
hostname mapping due to own get_regex_result method.

Relevant: sosreport#3593
Resolves: sosreport#3850

Signed-off-by: Pavel Moravec <[email protected]>
@pmoravec pmoravec force-pushed the sos-pmoravec-cleaner-hostname-fullstring-2 branch from 5a4f38a to 1d5b4ce Compare November 22, 2024 09:32
@pmoravec
Copy link
Contributor Author

pmoravec commented Nov 25, 2024

Tests fail on obfuscating hostname cleanedtest from /var/log/anaconda/lvm.log logs like:

17:43:58.633129 vgcreate[2015] format_text/format-text.c:1391  Writing fedora_fedora metadata to /etc/lvm/backup/.lvm_cleanertest_2015_76995411

These logs do contain hostname, often FQDN, wrapped by underscores, like:

16:00:55.756263 lvremove[2869] format_text/format-text.c:1328  Writing rhel_unused metadata to /etc/lvm/backup/.lvm_pmoravec-sat616.some.subdomain.redhat.com_2869_1648141267

Question: should hostname mapper treat underscore as a word separator, or not? See https://github.com/sosreport/sos/blob/main/sos/cleaner/mappings/__init__.py#L102 and original problem I aim to fix: #3593 (comment) .

Or in other words: if a host name is, say, HOST, which of these strings should we obfuscate?

someHOST
someHOST_something
HOST_something           YES
my_HOST                  YES
my_HOST_something        YES
my_HOST-something        YES
my-HOST_something        YES
someHOSTstring
my_HOSTstring

Exactly all those with YES? (so we must use item = rf'(?=\b){re.escape(item)}(?=\b)' instead of item = rf'(?=\b|_|-){re.escape(item)}(?=\b|_|-)' for hostname mapper?)

@arif-ali
Copy link
Member

I think, that makes sense to me, is it worth updating the tests to reflect this at all?

@pmoravec
Copy link
Contributor Author

I think, that makes sense to me, is it worth updating the tests to reflect this at all?

I came to the same conclusion - that someHOST-something does not contain HOST as a hostname. It contains it as a substring, but not as hostname.

So:

So I will come up with a better patch.

@pmoravec
Copy link
Contributor Author

This is tricky :) Let see what tests will return, but item = rf'(?=\b|_|-){re.escape(item)}(?=\b|_|-)' is simply wrong. We test positive lookahead - twice! - but dont test lookbehind. That makes a lot of confusion where my-hostname_TEST finds hostname, but my_hostname_TEST doesnt find it.

OK, let try positive lookbehind (?<=\b|_|-) - but Python denies it as "look-behind requires fixed-width pattern".

Let re-formulate it to negative lookbehind :) (?<![A-Za-z0-9]) . Since alphanumeric characters (not \w, that's another trick) are exactly those forbidden before a full word match.

So additionally to the patch, we must:

diff --git a/sos/cleaner/mappings/__init__.py b/sos/cleaner/mappings/__init__.py
--- a/sos/cleaner/mappings/__init__.py
+++ b/sos/cleaner/mappings/__init__.py
@@ -99,7 +99,7 @@ class SoSMap():
         :rtype:         ``re.Pattern``
         """
         if self.match_full_words_only:
-            item = rf'(?=\b|_|-){re.escape(item)}(?=\b|_|-)'
+            item = rf'(?!<[A-Za-z0-9]){re.escape(item)}(?=\b|_|-)'
         else:
             item = re.escape(item)
         return re.compile(item, re.I)

pmoravec added a commit to pmoravec/sos that referenced this pull request Nov 27, 2024
Currently, matching full words checks two lookaheads but does not check
what is behind the word. Since Python requires fixed width strings for
lookbehind, enumerate the forbidden chars behind the word explicitly.

Relevant: sosreport#3593
Closes: sosreport#3850
pmoravec added a commit to pmoravec/sos that referenced this pull request Nov 27, 2024
Currently, matching full words checks two lookaheads but does not check
what is behind the word. Since Python requires fixed width strings for
lookbehind, enumerate the forbidden chars behind the word explicitly.

Relevant: sosreport#3593
Closes: sosreport#3850

Signed-off-by: Pavel Moravec <[email protected]>
@pmoravec pmoravec force-pushed the sos-pmoravec-cleaner-hostname-fullstring-2 branch from 520de8d to f3d7c3e Compare November 27, 2024 11:36
@pmoravec
Copy link
Contributor Author

FYI I made a test script similar to #3766 (comment) :

import re

secret = "SECRET"

REG = rf'(?<![A-Za-z0-9]){re.escape(secret)}(?=\b|_|-)'

test_strings = (
        (True, 'test SECRET:something'),
        (True, 'test:SECRET something'),
        (True, 'test my_SECRET_TEST will be cleaned?'),
        (True, 'test my-SECRET_TEST will be cleaned?'),
        (False, 'test mySECRET-should stay'),
        (True, 'test SECRET-cleaned'),
        (True, 'SECRET-cleaned also'),
        (False, 'SECRETNOT-cleaned'),
        (True, 'SECRET_cleaned underscore'),
        (True, 'SECRET.cleaned as well'),
        (True, 'cleaned.SECRET also'),
        (False, 'mySECRET.notcleaned hostname'),
)

intro=f"Matching against {REG}"
print(intro)
print("-"*len(intro))
print()

for match, line in test_strings:
    search = re.search(REG, line)
    print(f"{line}\t:\t\t{search}")
    if (None!=search) != match:
        print(f"\t\tERROR {line} should {'' if match else 'not '}match, but re.search={search}")

The script tests if given word (SECRET) is matched as a full word in several lines by proper R.E..

@arif-ali
Copy link
Member

I've put this through regex101 (always my fallback these days) too ;), and works as expected

Currently, matching full words checks two lookaheads but does not check
what is behind the word. Since Python requires fixed width strings for
lookbehind, enumerate the forbidden chars behind the word explicitly.

Relevant: sosreport#3593
Closes: sosreport#3850

Signed-off-by: Pavel Moravec <[email protected]>
@pmoravec pmoravec force-pushed the sos-pmoravec-cleaner-hostname-fullstring-2 branch from f3d7c3e to 3cf79d0 Compare November 27, 2024 12:23
@pmoravec
Copy link
Contributor Author

Once I fixed the stupid typo, it should pass tests now :)

@pmoravec
Copy link
Contributor Author

The failing test is a false alarm: sosreport-host0-2024-11-27-pyqgxyc/sos_commands/systemd/systemctl_status_--all file contains log:

Nov 27 12:49:10 cirrus-task-5243713155235840Unit display-manager.service could not be found.

and the test finds the hostname cirrus-task-5243713155235840 there. But hostname is full word matching.

OK, next round of PR will fix the test :)

@pmoravec
Copy link
Contributor Author

pmoravec commented Nov 28, 2024

Since I think I touch a fragile&core part of cleaner (how explicitly we search for matches), and the PR fixes three bugs altogether, I would prefer yet another ACK from either @jcastill or @TurboTurtle before merging.

(three bugs: each commit fixes one)

@arif-ali arif-ali added Reviewed/Needs 2nd Ack Require a 2nd ack from a maintainer Status/Needs Review This issue still needs a review from project members labels Nov 28, 2024
@TurboTurtle
Copy link
Member

I'll give this a run through this Saturday.

@jcastill
Copy link
Member

@pmoravec I ran some very basic tests and the commits seem to work pretty well. I'll see if I have some time this weekend to run something more complex

Copy link
Member

@TurboTurtle TurboTurtle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@TurboTurtle TurboTurtle removed the Reviewed/Needs 2nd Ack Require a 2nd ack from a maintainer label Nov 30, 2024
@TurboTurtle TurboTurtle added Reviewed/Ready for Merge Has been reviewed, ready for merge and removed Status/Needs Review This issue still needs a review from project members labels Nov 30, 2024
@TurboTurtle TurboTurtle merged commit ecfcf50 into sosreport:main Nov 30, 2024
33 checks passed
TurboTurtle pushed a commit that referenced this pull request Nov 30, 2024
Leftover from #3403 / #3496 where match_full_words_only is ignored for
hostname mapping due to own get_regex_result method.

Relevant: #3593
Resolves: #3850

Signed-off-by: Pavel Moravec <[email protected]>
TurboTurtle pushed a commit that referenced this pull request Nov 30, 2024
Currently, matching full words checks two lookaheads but does not check
what is behind the word. Since Python requires fixed width strings for
lookbehind, enumerate the forbidden chars behind the word explicitly.

Relevant: #3593
Closes: #3850

Signed-off-by: Pavel Moravec <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reviewed/Ready for Merge Has been reviewed, ready for merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants