Character encoding inconsistency / reporting #5681

Open
solardiz opened this issue Mar 4, 2025 · 6 comments
@solardiz
Member

solardiz commented Mar 4, 2025

Originally in #5680 (comment)

Testing against the test vectors from openwall/john-samples#31, I am only able to directly crack the simple password `12345678`. To crack the complex password, I first have to process the wordlist through `iconv -f utf8 -t iso-8859-1`. I guess it got inadvertently converted the other way somewhere on the way to the git commit? Should we replace it with the output of this `iconv` command in a subsequent commit?

Oh, alternatively I am able to get it cracked by adding `-target-enc=iso-8859-1`.

@magnumripper @davidedg please suggest how best to fix this encoding issue, so as to minimize user confusion and the time users waste running with the wrong encoding settings. Right now, by default we print `Using default input encoding: UTF-8`, but with the input wordlist actually in UTF-8 we fail to crack this password. So it feels like a bug.
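
For readers following along, the mismatch described above can be illustrated with a minimal Python sketch. This is not John's code: SHA-256 is only a stand-in for the real Oubliette key derivation, and the password is a made-up example. The point is just that the same characters yield different bytes, and therefore different hashes, under UTF-8 and ISO-8859-1.

```python
import hashlib

def digest(password: str, encoding: str) -> str:
    """Hash the password's bytes in the given character encoding.
    SHA-256 is a stand-in here, not the real Oubliette algorithm."""
    return hashlib.sha256(password.encode(encoding)).hexdigest()

password = "P\u00e4ssword"  # hypothetical password containing a-umlaut

# Assumption: the application that created the hash used ISO-8859-1 bytes.
target = digest(password, "iso-8859-1")

# A UTF-8 wordlist entry passed through unchanged does not match...
utf8_match = digest(password, "utf-8") == target
# ...but the same entry converted to ISO-8859-1 does.
latin1_match = digest(password, "iso-8859-1") == target

# Pure-ASCII passwords such as 12345678 have identical bytes in both encodings.
ascii_match = digest("12345678", "utf-8") == digest("12345678", "iso-8859-1")

print(utf8_match, latin1_match, ascii_match)
```

This matches the observation that only the ASCII-only password cracks without any conversion.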

@magnumripper
Member

I generally recommend always using UTF-8 for wordlists, and `-target-enc` where needed. For samples, however, it may be better for the password hint file to already be in the expected encoding. If we do that, I suggest we include both encodings in the password hint file: keep the UTF-8 entry and add one in ISO-8859-1. Then also explain this with `#!comment:` lines!

If we do not change oubliette-passwords.txt (and perhaps even if we do), we should add some kind of README that explains the situation and the `-target-enc` option.
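
As a rough illustration of the hint-file idea, the sketch below builds file content that holds each candidate in UTF-8 and, where it differs, in ISO-8859-1 as well, with `#!comment:` lines explaining the two groups. The function name, the comment wording, and the example password are all hypothetical; the second group is simply what `iconv -f utf8 -t iso-8859-1` would produce for those entries.

```python
def both_encodings(passwords: list[str]) -> bytes:
    """Return wordlist bytes containing each password in UTF-8 and,
    when it differs and is representable, in ISO-8859-1 as well."""
    out = [b"#!comment: Entries below are UTF-8"]
    out += [p.encode("utf-8") for p in passwords]

    latin1 = []
    for p in passwords:
        try:
            raw = p.encode("iso-8859-1")
        except UnicodeEncodeError:
            continue  # not representable in ISO-8859-1, so skip it
        if raw != p.encode("utf-8"):  # pure ASCII would only be a duplicate
            latin1.append(raw)

    if latin1:
        out.append(b"#!comment: Same passwords in ISO-8859-1")
        out += latin1
    return b"\n".join(out) + b"\n"

# Hypothetical hint-file content: one ASCII and one non-ASCII password.
data = both_encodings(["12345678", "P\u00e4ssword"])
```

Note that the result is deliberately a mixed-encoding file, which is why the explanatory comment lines matter.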

@magnumripper
Member

> Right now, by default we print `Using default input encoding: UTF-8`, but with the input wordlist actually in UTF-8 we fail to crack this password. So it feels like a bug.

We could amend the output when `--target-encoding` is not used, such as:

Using default input encoding: UTF-8 and expecting target encoding to be the same

or

Using default input encoding: UTF-8
Expected target encoding: UTF-8

@magnumripper
Member

magnumripper commented Mar 4, 2025

> Right now, by default we print `Using default input encoding: UTF-8`, but with the input wordlist actually in UTF-8 we fail to crack this password. So it feels like a bug.

We could amend the output when `--target-encoding` is not used, such as:

Using default input encoding: UTF-8 and expecting target encoding to be the same

or

Using default input encoding: UTF-8
Expected target encoding: UTF-8

Hmm no, that ends up even more confusing for the case where no encoding option is used but the wordlist is already (in this case) in ISO-8859-1. So maybe we should change `Using default input encoding: UTF-8` to `Expecting input encoding to match target encoding` (for that case, but not if FMT_UNICODE).
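
The proposal above could be sketched roughly as follows. This is a hypothetical Python rendering of the decision only, not John's actual C code: the function and parameter names are invented, it covers just the case where no encoding option was given, and the message strings are the ones quoted in this thread.

```python
def default_encoding_banner(fmt_unicode: bool) -> str:
    """Pick the startup message for a run with no encoding options.

    fmt_unicode stands for the format having the FMT_UNICODE flag,
    i.e. the format converts candidates to Unicode itself.
    """
    if fmt_unicode:
        # Input really is interpreted as UTF-8 here, so keep the current text.
        return "Using default input encoding: UTF-8"
    # Otherwise bytes are passed through unchanged, so the wordlist
    # must already be in whatever encoding the target used.
    return "Expecting input encoding to match target encoding"

print(default_encoding_banner(fmt_unicode=False))
```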

@solardiz
Member Author

solardiz commented Mar 4, 2025

> Expecting input encoding to match target encoding

I like this one. Maybe even `Expecting input character encoding to match the target encoding`, to be clearer about what kind of encoding we refer to.

I recall that there are cases where passing `-enc=raw` makes a difference, so perhaps the above isn't always the default?

@magnumripper
Member

> be clearer what kind of encoding we refer to.

What could it be other than character encoding?

> I recall that there are cases where passing -enc=raw makes a difference, so perhaps the above isn't always the default?

For Unicode formats like NT, `-enc=raw` affects the conversion to UTF-16 (it will behave like old john, which in turn behaves exactly like `-enc=iso-8859-1`; perhaps we should clearly say so).
For any format, including Unicode ones, rules processing with `-enc=raw` will also behave like old john: it can only lower/upper-case ASCII, all character classes are ASCII only, and so on.

We could add a line when rules are in use with RAW:

Rules will not fully support non-ASCII characters

or s/support/handle/
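
Both `-enc=raw` effects described above can be mimicked in a small Python sketch. The helper names are hypothetical and this is not how John implements it; it only assumes, as stated in this thread, that raw mode widens each byte to one UTF-16 code unit and that rules only change the case of ASCII letters.

```python
def widen_raw(data: bytes) -> bytes:
    """Widen each input byte to one little-endian 16-bit code unit,
    with no character set conversion at all."""
    return b"".join(bytes([b, 0]) for b in data)

def ascii_upper(data: bytes) -> bytes:
    """Upper-case a-z only, leaving every other byte untouched."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)

word = "P\u00e4ss".encode("iso-8859-1")  # hypothetical candidate, b"P\xe4ss"

# Byte widening gives the same result as decoding the bytes as ISO-8859-1
# and encoding as UTF-16LE, which is why raw behaves like -enc=iso-8859-1.
same_as_latin1 = widen_raw(word) == word.decode("iso-8859-1").encode("utf-16-le")

# ASCII-only case conversion leaves the a-umlaut (0xE4) in lower case...
upper_raw = ascii_upper(word)
# ...while an encoding-aware conversion would produce A-umlaut (0xC4).
upper_aware = word.decode("iso-8859-1").upper().encode("iso-8859-1")

print(same_as_latin1, upper_raw, upper_aware)
```

The difference between `upper_raw` and `upper_aware` is the kind of case the proposed "Rules will not fully support non-ASCII characters" line would warn about.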

magnumripper self-assigned this Mar 5, 2025
magnumripper changed the title from "Oubliette character encoding inconsistency" to "Character encoding inconsistency / reporting" Mar 5, 2025
@solardiz
Member Author

solardiz commented Mar 5, 2025

> What could it be other than character encoding?

e.g. base64 ;-)

> Rules will not fully support non-ASCII characters
>
> or s/support/handle/

Yes, we could. Maybe prefix it with "Note: " like we do for some other things that are almost but not quite warnings.
