-
Notifications
You must be signed in to change notification settings - Fork 51
Pep8 and Python 3 compatibility #30
base: master
Are you sure you want to change the base?
Conversation
Also, use enumerate rather than range(len(..)).
else: | ||
return str(s) | ||
try: | ||
return s.decode('utf8', 'ignore') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kenny-evitt I included that here. It depends whether ensureUtf is supposed to "convert to utf" or "show as utf". If it's convert then this should be decode (python2 is very forgiving/lax).
@hayd Thanks for the PR! How hard would it be to test whether |
Here's some examples of using str/encode/decode. As you can see the behaviour is very different. Python 2 always returns a str in the below cases. Python 3 returns a str (utf8) only with the above decode path. python2
python 3
Note: unicode encoded becomes bytes (not utf), thus not "ensuring utf". |
Which is to say, in python 2 encode on unicode did nothing, but semantically it was doing the wrong thing (as we can see when python 3 tries to do "the right thing"). decode is what it ought to have been doing, that gives the same results as before but now works in python 2 and 3. |
@hayd I think the coders intention was to do something like the following with python 2 in mind:
Console output:
If you define the text as following, the decode part isn’t necessary because it is already Unicode, but I wanted to show the iso-8859-1 effect.
So if you want to change the encoding you have to do it with Unicode as intermediate step. In addition I want to quote from the python documentation:
The source parameter needs a normal string or a byte string, the ensureUft() function is called there, so it should return one of these types. After all I’m not sure what to do with this function. I didn’t notice any difference on output in lighttable with python 3 neither with encode nor with decode, if they are surrounded by a try-except block. |
@UnknownProgrammer Running your code above in python3 gives
(That is, if it's already unicode it'll hit the except in python 3. It's a no-op in python 2.) whilst I agree that it was doing something different before (not always returning a unicode/utf type), I think that was a bug rather than programmer intention of the "ensure uft" function.
That's great, that's precisely my example above.
This is the important thing IMO. |
@hayd I'm inclined to agree that your change is correct: Here's the commit where Chris added the @UnknownProgrammer Would you test these changes and confirm whether they work for you with Python 3 code? |
@UnknownProgrammer Upon reflection, your example above is quite interesting - thanks! Although I can't seem to get latin-1 coding to work on python 3 (I can see this on python 2): strings seem to always to be unicode (so "ÄÖÜ" is read as 'Ã\x84Ã\x96Ã\x9c' which prints as "Ã�Ã�Ã"). Note that this would have to do a encode/decode round trip to do this properly...
So the good news is that previously your example would have actually raised (in python 2):
whereas decode parses it to valid unicode u'\xc4\xd6\xdc' (i.e. it's works/is fixed with this PR). |
Which is to say, this PR also fixes #10. |
@hayd the example is Python 2 only code In Python 2:we have two text types: • a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes) That‘s a quote from the accepted answer of this post [http://stackoverflow.com/questions/368805/python-unicodedecodeerror-am-i-misunderstanding-encode#370199] The original function:
I checked the type of the input objects
If the input object is of type str and you want return Unicode objects,
The prefferedencoding on windows is cp1252
In linux it‘s utf-8 Many encodings to take care of. In Python 3:Python 3 removed support for non Unicode data text (the python 2
s_unicodeutf8 is a unicode str so it gets encoded to bytes (here with iso-8859-1 encoding)
Bytes decoded to unicode str
BUT:
If you try to decode the wrong encoding you get nothing or error. With encode, all str objects are encoded to bytes in utf8, which At the moment I can't find a case where
You used the wrong object type, without the small b it is only a unicode string. The
The "iso-8859-1" representation of „ÄÖÜ“ looks like this
Conversion path from iso-8859-1 to utf8 and back: utf8 -> decode -> unicode string ->encode -> iso-8859-1 |
@hayd @UnknownProgrammer Thanks for looking at this. I'll gladly merge a PR with which you're both happy. |
@UnknownProgrammer I don't follow what your advocating here going forward? (and whether it fixes #10?)
http://stackoverflow.com/a/436299/1240268 :s
Like I say, in python 3 I can't get your example with latin-1 and the coding line to work at all (the initial string is read as broken unicode, whereas in python 2 it's a str). |
@hayd The picture in #10 shows me that the latest python plugin isn't installed. The last time I checked it was updated separately and nobody mentioned the version of the plugin. Using The addition of:
Is necessary if you run LT with Python 2 and want to use uft8 encoded files, because the defaultencoding of Python 2 is ascii. LT calls
As I explained in the first line of my previous post it is Python 2 only, because in Python 3 doesn't exist a non Unicode data text any more. All strings are decoded implicit to unicode strings. And in Python 3 you can only encode unicode to byte and decode byte to unicode. |
@hayd @kenny-evitt
Error:
|
If you're using non-unicode in a utf-coded document isn't that a bug in the python file? This is the bit I don't quite understand: why is non-valid unicode ending up in the (python3) str? Perhaps we could be forgive non-valid unicode and simply catch the UnicodeEncodeError... ? and just return s? The bit with 4 bars was intentional (str rather than bytes), as this seems the only way a round trip can be done in python 3, and it is the way your original example is interpreted in python 3 even with the latin-1 coding header. |
@hayd You want me to try this:
Or that:
Same error, both times.
It is not meant to be used with Python 3 because it takes advantage of the Python 2 str class, which doesn't exist in Python 3 in the same way.
You never had the "iso-8859-1" representation:
The only thing you did was converting utf8 bytes to unicode str in a strange way. Maybe it is clearer this way (Python 2 only!):
Output:
|
@hayd @UnknownProgrammer What should we do with this PR? |
Hi, |
The first commit uses autopep8 to clean up the code a little. Fixes #15.
The second commit makes a couple of changes for python 3 compatibility. #24 #28
Note: a smaller diff can be seen with whitespace ignored (though obviously whitespace is meaningful in python)...