Should non-ASCII contents in .asci blocks be rejected? #257
Comments
It was pointed out to me in linked ticket that this is a valid use-case of I guess however it's not the end of the story: So:
So I guess first
I think assemblers typically read asm byte by byte; m2c should perhaps be refactored to do the same. I think the current behavior here isn't too bad though: for ASCII and UTF-8 it correctly preserves bytes, while for anything else it will cleanly fail with an error. And if you have something non-UTF-8 in your source code in 2023 I'd argue you're doing something wrong, and you should be using

The other place where string encodings come up is in the output layer, for string literals. Here we decode bytes as UTF-8 with fallback to

So I think I could possibly be convinced to take a PR that adds a string encoding cmdline flag that works like you describe if there's a concrete use case, but I'm also pretty happy with the current behavior. A flag just for disabling the "test if it's UTF-8" behavior of the string literal formatting may be a better choice, if anything.
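To illustrate the "preserves bytes or fails cleanly" point, here is a minimal sketch. It is not m2c's actual code: `roundtrip` is a hypothetical helper that assumes the source is decoded as UTF-8 on input and re-encoded as UTF-8 on output.

```python
def roundtrip(raw: bytes) -> bytes:
    """Hypothetical helper: decode source bytes as UTF-8, then re-encode."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    return text.encode("utf-8")


# Valid UTF-8 (ASCII included) survives the round trip unchanged.
utf8_src = '.asciz "奩"'.encode("utf-8")
assert roundtrip(utf8_src) == utf8_src

# Shift-JIS bytes are not valid UTF-8, so decoding fails instead of
# silently producing different bytes.
sjis_src = '.asciz "日本語"'.encode("shift_jis")
try:
    roundtrip(sjis_src)
    failed = False
except UnicodeDecodeError:
    failed = True
assert failed
```

A byte-by-byte reader would skip the decode step entirely and pass the Shift-JIS bytes through untouched.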
Normally I'd agree but all this started with splatting out a 25+ year old game so if the original strings in that are not UTF-8 (in my case it turned out that they were so that's nice), there's not much choice, right? Unless whatever original encoding gets processed and then output as UTF-8 or something so that everything downstream just works?
There is choice. You can either
I've made an issue at Decompollaborate/spimdisasm#121 which shows that there are sometimes `.asciz` blocks emitted with non-ASCII characters in them. `m2c` then happily takes them and outputs bytes from UTF-8 encoding (as far as I can tell).

That is, given `asciz "奩"` it will happily spit out `e5 a5 a9 00`. A brief `grep` brings me to `parse_ascii_directive` which at the very top says something about being wrong w.r.t. encodings: I guess this is exactly the issue it's talking about? I see a few lines later a very explicit `c.encode("utf-8")`. Is this what MIPS assemblers would usually do or would they just interpret the whole input as ASCII to start with? I don't know.
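For reference, the reported bytes can be reproduced with a short sketch. `asciz_bytes` is a hypothetical helper, not m2c's `parse_ascii_directive`; it assumes the literal is encoded as a whole and ignores escape sequences. Passing `encoding="ascii"` shows what rejecting non-ASCII contents could look like.

```python
def asciz_bytes(literal: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: encode a literal and append the NUL terminator."""
    return literal.encode(encoding) + b"\x00"


# UTF-8 encoding of U+5969 plus the terminator added by .asciz.
print(asciz_bytes("奩").hex(" "))  # e5 a5 a9 00

# A strict ASCII mode rejects the same input instead of emitting bytes.
try:
    asciz_bytes("奩", encoding="ascii")
    rejected = False
except UnicodeEncodeError:
    rejected = True
print("rejected:", rejected)
```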