Improved string quoting #41
Happy Saturday morning @jojoelfe! This is great - thanks for getting back to it. I'll take a look this afternoon.
Awesome @jojoelfe, appreciate the effort here! Minor suggestions on naming and a quick q about what to do about nans for empty strings, should be able to get this in right away once those are addressed!
starfile/functions.py
Outdated
@@ -35,6 +36,8 @@ def write(
float_format: str = '%.6f',
sep: str = '\t',
na_rep: str = '<NA>',
quotechar: str = '"',
- quotechar: str = '"',
+ quote_character: str = '"',
would prefer something more explicit and snake case, could you update in all relevant places?
starfile/functions.py
Outdated
@@ -35,6 +36,8 @@ def write(
float_format: str = '%.6f',
sep: str = '\t',
na_rep: str = '<NA>',
quotechar: str = '"',
quote_always: bool = False,
- quote_always: bool = False,
+ quote_all_strings: bool = False,
simple name preference, could you refactor to this too?
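To illustrate what these two options control, here is a minimal sketch of the quoting rule the PR description gives for writing: strings that are empty or contain whitespace are quoted by default, and every string is quoted when `quote_all_strings` is set. The helper names `quote_if_needed` and `format_row` are hypothetical and this is not the code under review; only the parameter names `quote_character` and `quote_all_strings` come from the suggestions above.

```python
def quote_if_needed(value, quote_character='"', quote_all_strings=False):
    """Wrap a string in quotes when it is empty, contains whitespace, or
    quote_all_strings is set. Non-string values pass through unchanged."""
    if not isinstance(value, str):
        return value
    needs_quotes = (
        quote_all_strings
        or value == ""
        or any(ch.isspace() for ch in value)
    )
    if needs_quotes:
        return f"{quote_character}{value}{quote_character}"
    return value


def format_row(row, sep="\t", **quote_options):
    """Join one row of values into a delimited line (hypothetical helper)."""
    return sep.join(str(quote_if_needed(v, **quote_options)) for v in row)


row = ["noquote", "quote string", " ", "", 1.5]

# Default: only empty and whitespace-containing strings are quoted.
default_line = format_row(row)

# Quote every string with single quotes; the float is left alone.
all_quoted_line = format_row(row, quote_character="'", quote_all_strings=True)
```

Applying the rule per value, rather than relying on the CSV writer, is what lets a string be quoted even when it does not contain the tab delimiter.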
starfile/functions.py
Outdated
quotechar=quotechar,
quote_always=quote_always,
to be updated wrt previous comments
starfile/writer.py
Outdated
quotechar: str = '"',
quote_always: bool = False,
same again
starfile/writer.py
Outdated
self.quotechar = quotechar
self.quote_always = quote_always
same again
starfile/writer.py
Outdated
quotechar=self.quotechar,
quote_always=self.quote_always
same again
starfile/writer.py
Outdated
quotechar=self.quotechar,
quote_always=self.quote_always
same again
starfile/writer.py
Outdated
quotechar: str = '"',
quote_always: bool = False
same again
tests/test_parsing.py
Outdated
@pytest.mark.parametrize("quotechar, filename", [("'", basic_single_quote),
                                                 ('"', basic_double_quote),
                                                 ])
def test_quote_basic(quotechar, filename):
    import math
    parser = StarParser(filename)
    assert len(parser.data_blocks) == 1
    assert parser.data_blocks['']['no_quote_string'] == "noquote"
    assert parser.data_blocks['']['quote_string'] == "quote string"
    assert parser.data_blocks['']['whitespace_string'] == " "
    assert parser.data_blocks['']['empty_string'] == ""
this is sick 🙂
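For readers following along, the values asserted in this test can be reproduced for individual lines with the builtin `shlex` module, which the PR description says is used for simple data blocks. This is a rough sketch only: `parse_simple_line` and the sample lines are hypothetical, and stripping the leading underscore from the key is an assumption made to match the key names in the test.

```python
import shlex


def parse_simple_line(line):
    """Split a `_key value` line into (key, value), honouring quotes.

    Hypothetical helper: dropping the leading underscore is an assumption.
    """
    key, value = shlex.split(line)
    return key.lstrip("_"), value


lines = [
    "_no_quote_string noquote",
    '_quote_string "quote string"',
    "_whitespace_string ' '",
    '_empty_string ""',
    # One quote type nested inside the other, as discussed in the description.
    "_mixed_quotes \"This ain't perfect\"",
]

parsed = dict(parse_simple_line(line) for line in lines)
```

Because `shlex` tracks which quote character opened a token, the apostrophe inside the double-quoted value is kept as a literal character rather than being treated as a new quote.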
Excellent! I'll push a release too, thanks again @jojoelfe 🙂
release pending https://github.com/teamtomo/starfile/actions/runs/7094776248
This addresses #35 and follows the official starfile specs a bit more closely.

During parsing, strings can now be quoted using single or double quotes. For simple data blocks this is achieved using the builtin `shlex` module; for loops, by replacing single quotes with double quotes. The `shlex` solution is probably better, as it allows a string containing one type of quote to be quoted by the other, such as `"This ain't perfect"`, which should work according to the starfile specs but will not with the current solution for loop blocks. But I feel this might not be important enough to give up the simplicity of parsing directly with `pandas`.

During writing, the user can now choose which quote character to use and whether to quote all strings or only empty ones and those containing whitespace. The quoting for loops is done by disabling the `pandas` quoting logic and instead "manually" adding the quotes where appropriate. This is because `pandas` will only add quotes if the string contains the delimiter, which is `'\t'`.

I think all of this behavior is covered by tests, but there are some caveats:

- `NaN` and not an empty string. This is what `to_numeric` does and I could find no way of changing this. I don't know if this is important enough to give up the simplicity of it.
- `applymap`. No idea how much this affects performance. I added a test to check the write time for the million_row_starfile and it seems to be fine.
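The `NaN` caveat can be reproduced in isolation. The snippet below is a minimal sketch of how `pandas.to_numeric` treats an empty string with its default arguments; the sample values are made up, and it says nothing about where in the parser the conversion is called.

```python
import math

import pandas as pd

# A column of values as they might look after splitting a loop block,
# with one empty entry (hypothetical data).
column = ["1.5", "", "2"]

# With default arguments the empty string is converted to NaN rather than
# being kept as "" or raising an error.
converted = pd.to_numeric(column)

empty_became_nan = math.isnan(converted[1])
```

Since the result is a float array, the original empty string cannot be recovered after the conversion, which is why the round trip yields `NaN`.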