Pep8 and Python 3 compatibility #30

hayd · 2015-06-28T06:20:17Z

The first commit uses autopep8 to clean up the code a little. Fixes #15.

The second commit makes a couple of changes for python 3 compatibility. #24 #28

Note: a smaller diff can be seen with whitespace ignored (though obviously whitespace is meaningful in python)...

Also, use enumerate rather than range(len(..)).
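As a small illustration of that last change, the two loop styles below do the same work. The list name `items` is made up for the example and is not taken from ltmain.py.

```python
# Hypothetical data; not from the plugin source.
items = ["a", "b", "c"]

# Index-based loop of the kind being replaced.
by_index = []
for i in range(len(items)):
    by_index.append((i, items[i]))

# Equivalent loop with enumerate, which yields (index, value) pairs.
by_enumerate = []
for i, item in enumerate(items):
    by_enumerate.append((i, item))
```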

hayd · 2015-06-28T06:31:55Z

py-src/ltmain.py

-  else:
-    return str(s)
+    try:
+        return s.decode('utf8', 'ignore')


@kenny-evitt I included that here. It depends whether ensureUtf is supposed to "convert to utf" or "show as utf". If it's convert then this should be decode (python2 is very forgiving/lax).

kenny-evitt · 2015-06-29T13:28:58Z

@hayd Thanks for the PR!

How hard would it be to test whether ensureUtf should call encode or decode? If you can convince me that your change is correct, I'll be comfortable merging your changes.

hayd · 2015-06-29T16:31:30Z

Here are some examples of using str/encode/decode. As you can see, the behaviour is very different. Python 2 always returns a str in the cases below. Python 3 returns a str (utf8) only with the decode path above.

python2

>>> u"abc".encode("utf-8", "ignore")
'abc'
>>> b"abc".encode("utf-8", "ignore")
'abc'
>>> str(b"abc")
'abc'

python 3

>>> u"abc".encode("utf-8", "ignore")
b'abc'
>>> b"abc".encode("utf-8", "ignore")
AttributeError
>>> str(b"abc")
"b'abc'"

>>> b"abc".decode("utf-8", "ignore")
'abc'
>>> u"abc".decode("utf-8", "ignore")
AttributeError
>>> str(u"abc")
'abc'

Note: unicode encoded becomes bytes (not utf), thus not "ensuring utf".

hayd · 2015-06-29T21:42:48Z

Which is to say, in python 2 encode on unicode did nothing, but semantically it was doing the wrong thing (as we can see when python 3 tries to do "the right thing"). decode is what it ought to have been doing, that gives the same results as before but now works in python 2 and 3.

UnknownProgrammer · 2015-06-30T04:30:55Z

@hayd I think the coders intention was to do something like the following with python 2 in mind:


#!/usr/bin/env python
# -*- coding: iso-8859-1 -*-
# define text
s_iso88591 = "ÄÖÜ"
# convert text to unicode
s_unicode = s_iso88591.decode("iso-8859-1")
# encode to utf-8
s_utf8 = s_unicode.encode("utf-8", "ignore")
print(type(s_iso88591))
print(type(s_unicode))
print(type(s_utf8))
try:
    print(s_iso88591)
except:
    print("not presentable")
try:
    print(s_unicode)
except:
    print("not presentable")
try:
    print(s_utf8)
except:
    print("not presentable")

Console output:


<type 'str'>
<type 'unicode'>
<type 'str'>
��
ÄÖÜ
ÄÖÜ

If you define the text as follows, the decode part isn’t necessary because it is already Unicode, but I wanted to show the iso-8859-1 effect.


 s_iso88591 = u"ÄÖÜ"

So if you want to change the encoding you have to do it with Unicode as intermediate step.

In addition I want to quote from the python documentation:

compile(source, filename, mode, flags=0, dont_inherit=False, optimize=-1)

Compile the source into a code or AST object. Code objects can be executed by exec() or eval(). source can either be a normal string, a byte string, or an AST object.

The source parameter needs a normal string or a byte string. The ensureUtf() function is called there, so it should return one of these types.
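A quick check of that point on Python 3 (a sketch, not LightTable code; the filename "<sketch>" and the variable names are placeholders): compile() accepts both a str and a bytes object, so either return type gets past that call.

```python
source_text = "x = 1 + 1"
source_bytes = source_text.encode("utf-8")

# Compile and run the same source once as text and once as bytes.
namespace_text = {}
exec(compile(source_text, "<sketch>", "exec"), namespace_text)

namespace_bytes = {}
exec(compile(source_bytes, "<sketch>", "exec"), namespace_bytes)
```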

After all, I’m not sure what to do with this function. I didn’t notice any difference on output in lighttable with python 3, with either encode or decode, if they are surrounded by a try-except block.

hayd · 2015-06-30T06:59:53Z

@UnknownProgrammer Running your code above in python3 gives

AttributeError: 'str' object has no attribute 'decode'

(That is, if it's already unicode it'll hit the except in python 3. It's a no-op in python 2.)

Whilst I agree that it was doing something different before (not always returning a unicode/utf type), I think that was a bug rather than the programmer's intention for the "ensure utf" function.

The source parameter needs a normal string or a byte string, the ensureUtf() function is called there

That's great, that's precisely my example above.

I didn’t notice any difference on output in lighttable

This is the important thing IMO.

kenny-evitt · 2015-06-30T13:41:32Z

@hayd I'm inclined to agree that your change is correct:

python - What is the difference between encode/decode? - Stack Overflow

Here's the commit where Chris added the ensureUtf function.

@UnknownProgrammer Would you test these changes and confirm whether they work for you with Python 3 code?

hayd · 2015-07-01T00:16:01Z

@UnknownProgrammer Upon reflection, your example above is quite interesting - thanks! Although I can't seem to get latin-1 coding to work on python 3 (I can see this on python 2): strings always seem to be unicode (so "ÄÖÜ" is read as 'Ã\x84Ã\x96Ã\x9c' which prints as "Ã�Ã�Ã"). Note that this would have to do an encode/decode round trip to do this properly...

>>> "\xc3\x84\xc3\x96\xc3\x9c"
'Ã\x84Ã\x96Ã\x9c'
>>> "\xc3\x84\xc3\x96\xc3\x9c".encode("iso-8859-1")
b'\xc3\x84\xc3\x96\xc3\x9c'
>>> "\xc3\x84\xc3\x96\xc3\x9c".encode("iso-8859-1").decode("utf-8")
'ÄÖÜ'

So the good news is that previously your example would have actually raised (in python 2):

    print(s_iso88591.encode("utf-8", "ignore"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

whereas decode parses it to valid unicode u'\xc4\xd6\xdc' (i.e. it works/is fixed with this PR).

hayd · 2015-07-01T06:06:03Z

Which is to say, this PR also fixes #10.

UnknownProgrammer · 2015-07-01T12:03:05Z

@hayd the example is Python 2 only code

In Python 2:

We have two text types: str, and unicode, which is equivalent to the Python 3 str.

• a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
• a Python 2.x string gets decoded to a Unicode string

That's a quote from the accepted answer of this post [http://stackoverflow.com/questions/368805/python-unicodedecodeerror-am-i-misunderstanding-encode#370199]

The original function:


def ensureUtf(s):
    if type(s) == unicode:
        return s.encode('utf8', 'ignore')
    else:
        return str(s)

I checked the type of the input objects s in LightTable with Python 2 and all tested objects were of type Unicode. The return value for the Unicode objects is a byte string (str) encoded with utf8.
Unicode -> str(utf8)
Calling decode for Unicode objects makes no sense, because the objects are already Unicode.


def ensureUtf(s):
    try:
        return s.decode('utf8', 'ignore')
    except AttributeError:
        return str(s)

If the input object is of type str and you want to return Unicode objects, decode makes sense, but you have to know the encoding of the byte string to have success.
str(?) -> unicode
In Python 2 the default encoding is ascii; in Python 3 it is utf-8.


>>> import sys
>>> sys.getdefaultencoding()
'ascii'

The preferred encoding on Windows is cp1252


>>> import locale
>>> locale.getpreferredencoding()
'cp1252'

On Linux it's utf-8

Many encodings to take care of.

In Python 3:

Python 3 removed support for non Unicode data text (the python 2 str class) and replaced it with the byte type bytes.
Therefore all strings are sequences of unicode characters and you can’t do the same as in Python 2.

In Python 3 a unicode str is encoded to bytes,
and a bytes object is decoded to a unicode str.


>>> s_unicodeutf8 = "ÄÖÜ"
>>> type(s_unicodeutf8)
<class 'str'>
>>> s_unicodeutf8.encode("iso-8859-1")
b'\xc4\xd6\xdc'
>>> type(b'\xc4\xd6\xdc')
<class 'bytes'>

s_unicodeutf8 is a unicode str so it gets encoded to bytes (here with iso-8859-1 encoding)


>>> b'\xc4\xd6\xdc'.decode("iso-8859-1", "ignore")
'ÄÖÜ'

Bytes decoded to unicode str


>>> s_utf8bytes = "ÄÖÜ".encode("utf8")
>>> s_utf8bytes
b'\xc3\x84\xc3\x96\xc3\x9c'
>>> b'\xc3\x84\xc3\x96\xc3\x9c'.decode("utf8", "ignore")
'ÄÖÜ'

BUT:


>>> b'\xc4\xd6\xdc'.decode("utf8", "ignore")
''
>>> b'\xc4\xd6\xdc'.decode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid
continuation byte

If you try to decode with the wrong encoding you get nothing or an error.
And that is what could happen in the second ensureUtf function above, but in LightTable with Python 3 all tested input values of this function were of type str.
And you can’t call decode on str.
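To make the "nothing or an error" case concrete, here is a small Python 3 sketch of how the error handler changes the outcome when iso-8859-1 bytes are decoded as utf-8. The variable names are only for illustration.

```python
# "ÄÖÜ" encoded as iso-8859-1; these bytes are not valid utf-8.
latin1_bytes = b"\xc4\xd6\xdc"

# "ignore" silently drops every undecodable byte.
ignored = latin1_bytes.decode("utf-8", "ignore")

# "replace" substitutes U+FFFD for undecodable bytes, so the loss stays visible.
replaced = latin1_bytes.decode("utf-8", "replace")

# The default ("strict") raises instead of losing data.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Decoding with the encoding the bytes were written in recovers the text.
correct = latin1_bytes.decode("iso-8859-1")
```

With 'ignore', a wrong guess about the encoding loses the text without any warning, which is worth keeping in mind for the decode('utf8', 'ignore') call discussed here.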

With encode, all str objects are encoded to bytes in utf8, which compile() can handle.

At the moment I can't find a case where decode is called, but encode is called every time and converts from unicode to str (Python 2) and str to bytes (Python 3).

"\xc3\x84\xc3\x96\xc3\x9c"
'Ã\x84Ã\x96Ã\x9c'
"\xc3\x84\xc3\x96\xc3\x9c".encode("iso-8859-1")
b'\xc3\x84\xc3\x96\xc3\x9c'
"\xc3\x84\xc3\x96\xc3\x9c".encode("iso-8859-1").decode("utf-8")
'ÄÖÜ'


>>> type("\xc3\x84\xc3\x96\xc3\x9c")
<class 'str'>
>>> type(b"\xc3\x84\xc3\x96\xc3\x9c")
<class 'bytes'>

You used the wrong object type: without the small b it is only a unicode string. The decode function exists only on bytes.


>>> b"\xc3\x84\xc3\x96\xc3\x9c".decode("utf8")
'ÄÖÜ'

The "iso-8859-1" representation of „ÄÖÜ“ looks like this


>>> 'ÄÖÜ'.encode("iso-8859-1")
b'\xc4\xd6\xdc'
>>> b'\xc4\xd6\xdc'.decode("iso-8859-1")
'ÄÖÜ'

Conversion path from iso-8859-1 to utf8 and back:
iso-8859-1 -> decode -> unicode string -> encode -> utf8

utf8 -> decode -> unicode string -> encode -> iso-8859-1
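The two conversion paths can be sketched as one small helper. `transcode` is a hypothetical name used only here, and the sketch targets Python 3 bytes objects.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes by decoding to text first, then encoding again."""
    text = data.decode(source_encoding)
    return text.encode(target_encoding)


latin1_bytes = b"\xc4\xd6\xdc"  # "ÄÖÜ" in iso-8859-1

# iso-8859-1 -> unicode string -> utf8
utf8_bytes = transcode(latin1_bytes, "iso-8859-1", "utf-8")

# utf8 -> unicode string -> iso-8859-1
round_trip = transcode(utf8_bytes, "utf-8", "iso-8859-1")
```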

kenny-evitt · 2015-07-01T13:50:26Z

@hayd @UnknownProgrammer Thanks for looking at this. I'll gladly merge a PR with which you're both happy.

hayd · 2015-07-01T17:05:35Z

@UnknownProgrammer I don't follow what you're advocating here going forward? (and whether it fixes #10?)

but you have to know the encoding of the byte string to have success.

http://stackoverflow.com/a/436299/1240268 :s
Or do you look up the # -*- coding line if it exists, and otherwise use getdefaultencoding (or getpreferredencoding??)? Does this mean you're suggesting:

try:
    return s.decode(ENCODING_ABOVE, 'ignore')

Like I say, in python 3 I can't get your example with latin-1 and the coding line to work at all (the initial string is read as broken unicode, whereas in python 2 it's a str).
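On the question of looking up the coding line: on Python 3 the standard library can already read it, via tokenize.detect_encoding, which falls back to utf-8 when no declaration is present. The sketch below assumes the source is available as bytes; `declared_encoding` is a made-up helper name, and nothing here is taken from ltmain.py.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding named by a PEP 263 coding line, or the default."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    return encoding


with_cookie = b"# -*- coding: iso-8859-1 -*-\ns = 1\n"
without_cookie = b"s = 1\n"

print(declared_encoding(with_cookie))
print(declared_encoding(without_cookie))
```

Python 2's tokenize module has no detect_encoding, so this would only cover the Python 3 side.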

UnknownProgrammer · 2015-07-02T13:32:33Z

@hayd The picture in #10 shows me that the latest python plugin isn't installed. The last time I checked it was updated separately and nobody mentioned the version of the plugin.
With the latest plugin it is fixed in Python 2 and my PR will make it able to run with Python 3, too.

Using encode is correct.

The addition of:


# -*- coding: utf-8 -*-

Is necessary if you run LT with Python 2 and want to use utf8 encoded files, because the default encoding of Python 2 is ascii. LT calls __import__() on these files; if you don't specify another encoding, it tries to import with ascii, which will cause an error on non-ascii encoded files.
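For comparison, a Python 3 sketch of the coding line's effect when source is handed to compile() as bytes: the declaration decides how the bytes inside string literals are read. This illustrates the mechanism only and is not LT's import path; the filename "<sketch>" is a placeholder.

```python
# The byte 0xE4 is "ä" in iso-8859-1 but is not valid utf-8 on its own.
latin1_source = b"# -*- coding: iso-8859-1 -*-\ns = '\xe4'\n"

namespace = {}
exec(compile(latin1_source, "<sketch>", "exec"), namespace)

# The literal was decoded using the declared encoding.
print(repr(namespace["s"]))
```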

Like I say, in python 3 I can't get your example with latin-1 and the coding line to work at all (the initial string is read as broken unicode, whereas in python 2 it's a str).

As I explained in the first line of my previous post, it is Python 2 only, because Python 3 no longer has a non-Unicode text type. All strings are implicitly decoded to unicode strings, and in Python 3 you can only encode unicode to bytes and decode bytes to unicode.
After the quote of your post (the quote with the 4 bars) I showed you that you did something wrong.

UnknownProgrammer · 2015-07-02T16:57:20Z

@hayd @kenny-evitt
This PR creates an error in LT with Python 2 and the following code:


# -*- coding: utf-8 -*-
s="äüö"
print(s)

Error:


Traceback (most recent call last):
  File "/home/user/.config/LightTable/plugins/Python/py-src/ltmain.py", line 214, in handleEval
    ensureUtf(code), ensureUtf(data[2]["name"]), 'exec')
  File "/home/user/.config/LightTable/plugins/Python/py-src/ltmain.py", line 55, in ensureUtf
    return s.decode('utf8', 'ignore')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-7: ordinal not in range(128)

hayd · 2015-07-02T17:53:04Z

If you're using non-unicode in a utf-coded document isn't that a bug in the python file? This is the bit I don't quite understand: why is non-valid unicode ending up in the (python3) str?

Perhaps we could forgive non-valid unicode and simply catch the UnicodeEncodeError... ? and just return s?
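One way to avoid that error altogether, rather than catching it, is to decode only values that are actually byte strings. In Python 2, calling decode on a unicode object first encodes it implicitly with ascii, which is where the UnicodeEncodeError comes from. The sketch below is neither version of ensureUtf from this PR: `ensure_text` and `text_type` are hypothetical names, and whether compile() should be given text or bytes is still the open question in this thread.

```python
import sys

# On Python 2 the text type is `unicode`; on Python 3 it is `str`.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def ensure_text(value, encoding="utf-8"):
    """Return value as text: decode byte strings, leave text untouched."""
    if isinstance(value, text_type):
        # Already text, so no decode call and no implicit ascii step.
        return value
    if isinstance(value, bytes):
        # `bytes` is an alias of `str` on Python 2 and a separate type on 3.
        return value.decode(encoding, "ignore")
    # Anything else (numbers, None, ...) is converted like str() would.
    return text_type(value)
```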

The bit with 4 bars was intentional (str rather than bytes), as this seems the only way a round trip can be done in python 3, and it is the way your original example is interpreted in python 3 even with the latin-1 coding header.

UnknownProgrammer · 2015-07-03T13:23:11Z

@hayd You want me to try this:


# -*- coding: utf-8 -*-
s=unicode("äüö")
print(s)

Or that:


# -*- coding: utf-8 -*-
s=unicode("äüö","utf8")
print(s)

Same error, both times.

The bit with 4 bars was intentional (str rather than bytes), as this seems the only way a round trip can be done in python 3, and it is the way your original example is interpreted in python 3 even with the latin-1 coding header.

It is not meant to be used with Python 3 because it takes advantage of the Python 2 str class, which doesn't exist in Python 3 in the same way.
The only thing you had was the utf-8 encoded bytes representation:


b"\xc3\x84\xc3\x96\xc3\x9c"

You never had the "iso-8859-1" representation:


b'\xc4\xd6\xdc'

The only thing you did was convert utf8 bytes to a unicode str in a strange way.

Maybe it is clearer this way (Python 2 only!):


#!/usr/bin/env python
# -*- coding: iso-8859-1 -*-
# define text
s_iso88591 = "ÄÖÜ"
# convert text to unicode
s_unicode = s_iso88591.decode("iso-8859-1")
# encode to utf-8
s_utf8 = s_unicode.encode("utf-8", "ignore")
print(type(s_iso88591))
print(type(s_unicode))
print(type(s_utf8))
try:
    print(s_iso88591)
    print(repr(s_iso88591))
except:
    print("not presentable")
try:
    print(s_unicode)
    print(repr(s_unicode))
except:
    print("not presentable")
try:
    print(s_utf8)
    print(repr(s_utf8))
except:
    print("not presentable")

Output:


<type 'str'>
<type 'unicode'>
<type 'str'>
��
'\xc4\xd6\xdc'
ÄÖÜ
u'\xc4\xd6\xdc'
ÄÖÜ
'\xc3\x84\xc3\x96\xc3\x9c'

kenny-evitt · 2015-08-11T14:57:21Z

@hayd @UnknownProgrammer What should we do with this PR?

RockyRoad29 · 2016-01-17T11:14:59Z

Hi,
I just found this PR which is pretty much the same as mine I think (I have no time right now to read it thoroughly). Might be worth cross-ref' it : #47

hayd added 2 commits June 27, 2015 22:50

Apply autopep8 to all python files.

4a7ade9

Python 3 compat.

4bbc41c

Also, use enumerate rather than range(len(..)).

hayd reviewed Jun 28, 2015
UnknownProgrammer mentioned this pull request Jun 30, 2015

Fixing compatibility with python3 #28

Open

kenny-evitt mentioned this pull request Apr 27, 2016

Evaluation don't work. LightTable/LightTable#2187

Closed
