Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 Encoding #133

Open
nim-odoo opened this issue Nov 28, 2017 · 9 comments
Open

UTF-8 Encoding #133

nim-odoo opened this issue Nov 28, 2017 · 9 comments

Comments

@nim-odoo
Copy link

Hello,

I am a bit confused by these lines: https://github.com/jseutter/ofxparse/blob/master/ofxparse/ofxparse.py#L85-L95
These are used to determine the headers of the file, and more particularly the encoding.

However, a typical header is:

<?xml version="1.0" encoding="UTF-8"?>
<?OFX OFXHEADER="200" VERSION="202" SECURITY="NONE" OLDFILEUID="NONE" NEWFILEUID="NONE"?>

This method won't parse the encoding="UTF-8" header. Therefore, the encoding will (always?) be considered as ASCII in https://github.com/jseutter/ofxparse/blob/master/ofxparse/ofxparse.py#L111-L117

You can imagine that if the OFX file contains non-ASCII characters, the reading will fail. As a workaround, I added encoding:UTF-8 on top of the file. But that seems odd...

Thanks
cc @qdp-odoo

@jseutter
Copy link
Owner

I think you are correct. I don’t see any test cases that test this behavior.

Back when I started this library pretty much no producers were generating Ofx 2.2 and the code shows this. I wonder if nowadays it would be better to try parsing as xml first and if that errors out, switch to the sgml parsing. This would give us proper encoding detection for free for 2.2 and newer files.

At any rate, thanks for bringing this up. :)

-Jerry

@alexis-via
Copy link

I confirm that the bug is still present today in ofxparse 0.20 and in master.

My OFX files starts with

<?xml version="1.0" encoding="utf-8"?>
<?OFX OFXHEADER="200" VERSION="220" SECURITY="NONE" OLDFILEUID="NONE" NEWFILEUID="NONE"?>
<OFX>
  <SIGNONMSGSRSV1>

When I read the OFX 2.2.0 spec here https://www.ofx.net/downloads/OFX%202.2.pdf
it says in section 2.2 page 34-35:

"The standard XML declaration must come first. This Processing Instruction (PI) includes an option to specify the version of XML being used, and may include options to show such things as the encoding declaration, and the standalone status of the document."

"The OFX declaration must come next in the file. This PI identifies the contents as an Open Financial Exchange file and provides the version number of the Open Financial Exchange declaration itself (not the version number of the contents). The Open Financial Exchange PI contains the following attributes:
OFXHEADER
VERSION
SECURITY
OLDFILEUID
NEWFILEUID"

So it's clear that, with OFX 2.2, the encoding of the file must be read from the first line that should look like

<?xml version="1.0" encoding="UTF-8"?>

and not from the second line "<?OFX OFXHEADER=" that doesn't have an ENCODING attribute any more.

With the current version of ofxparse, my OFX file fails because it contains

<NAME>Direction Générale des Finances </NAME>

but, as the file is encoded in utf-8 but read as ascii by ofxparse, I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3636: ordinal not in range(128)

I'm ready to contribute to fix this bug. Any advice for the best way to fix it would be welcomed.

@alexis-via
Copy link

When looking at the test suite under the directory "tests/fixtures", I can find 4 files in OFX 2.x that start with

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

(anzcc.ofx, error_message.ofx, multiple_accounts2.ofx and multiple_accounts.ofx)
but they only contain ASCII caracters, there aren't any accentuated character in them. That's why the tests work.

The first problem is to fix the method read_headers() https://github.com/jseutter/ofxparse/blob/master/ofxparse/ofxparse.py#L86
On an OFX 2.x file, this second line of code of this method will set head_data as an empty byte ; so, at the end of this method, self.headers is an empty OrderedDict. Then, the method handle_encoding() uses self.headers to read the encoding and fallsback on ascii.

My ideas: the current methods read_headers() and handle_encoding() work well for OFX 1.x but not for OFX 2.x. Should we try to patch them to work on both, or should we have separate methods for OFX 2.x and use some code to detect if the file is OFX 1.x or 2.x ?

@alexis-via
Copy link

I made a first prototype here: https://people.via.ecp.fr/~alexis/ofxparse-proto-bug-133.diff

This patch only changes the method read_headers(): if the file starts with '<?xml ', I use lxml to parse the ofx file and get the encoding. @jseutter What do you think about this approach ?

@jseutter
Copy link
Owner

I'm okay with this approach. Although I try to avoid adding dependencies, I think it is warranted in this case. I haven't run your code but the way I read it, it should still work for 1.x files. Is this the case?

If you have time to add a test case that highlights the broken (and now fixed) behavior I would be grateful. If not, I'll try to add one myself.

@alexis-via
Copy link

@jseutter Thanks for your feedback. It should work with ofx v1 and v2. I'll prepare a PR with a test with a v2 file.

@dev590t
Copy link

dev590t commented Jul 6, 2023

the PR is merged to master? I have the same problem

@alexis-via
Copy link

@dev590t
Copy link

dev590t commented Jul 8, 2023

The header of my ofx file is:

OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:UTF-8
CHARSET:NONE
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX>...

and ofxparser fails on é char.

Thank you @alexis-via, After patching your patch, that give me other errors. I'm not sure if it is from my ofx file or no.

After modify the header to:

OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX>
...

The ofxparser runs without the patch.

Why don't submit your PR if the patch solve the issue for some case? The project seems unmaintained since 2021.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants