Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved string quoting #41

Merged
merged 10 commits into from
Dec 5, 2023
Merged

Improved string quoting #41

merged 10 commits into from
Dec 5, 2023

Conversation

jojoelfe
Copy link
Collaborator

@jojoelfe jojoelfe commented Dec 2, 2023

This addresses #35 and follows the official starfile specs a bit more.

During parsing strings can now be quoted using single- or double-quotes. For simple datablocks this is achieved using the builtin shlex module, for loops by replacing single quotes with double quotes. The shlex solution is probably better as it allows string containing one type of quote, quoted by another such as "This ain't perfect", which should work according to the starfile specs, but will not using the current solution for loop blocks. But I feel this might not be important enough to give up the simplicity of directly parsing using pandas.

During writing the user can now choose which quote character to use and whether to quote all strings or only white space-containing and empty ones. The quoting for loops is done by disabling the pandas quoting logic and instead "manually" adding the quotes if appropriate. This is because pandas will only add quotes if the string contains the delimiter, which is '\t'.

I think all of this behavior is covered by tests, but there are some caveats:

  • For loops an empty string is parsed as NaN and not an empty string. This is what to_numeric does and I could find no way of changing this. I don't know if this is important enough to give up the simplicity of it.
  • For very large loops writing might be slower, because the quoting logic will iterate over all values with applymap. No idea how much this affects performance. I added a test to check the write time for the million_row_starfile and it seems to be fine.

@alisterburt
Copy link
Collaborator

Happy Saturday morning @jojoelfe ! This is great - thanks for getting back to it. I'll take a look this afternoon ☺️

Copy link
Collaborator

@alisterburt alisterburt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome @jojoelfe, appreciate the effort here! Minor suggestions on naming and a quick q about what to do about nans for empty strings, should be able to get this in right away once those are addressed!

@@ -35,6 +36,8 @@ def write(
float_format: str = '%.6f',
sep: str = '\t',
na_rep: str = '<NA>',
quotechar: str = '"',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
quotechar: str = '"',
quote_character: str = '"',

would prefer something more explicit and snake case, could you update in all relevant places?

@@ -35,6 +36,8 @@ def write(
float_format: str = '%.6f',
sep: str = '\t',
na_rep: str = '<NA>',
quotechar: str = '"',
quote_always: bool = False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
quote_always: bool = False,
quote_all_strings: bool = False,

simple name preference, could you refactor to this too?

Comment on lines 50 to 51
quotechar=quotechar,
quote_always=quote_always,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be updated wrt previous comments

Comment on lines 28 to 29
quotechar: str = '"',
quote_always: bool = False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same again

Comment on lines 39 to 40
self.quotechar = quotechar
self.quote_always = quote_always
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same again

Comment on lines 76 to 77
quotechar=self.quotechar,
quote_always=self.quote_always
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same again

Comment on lines 87 to 88
quotechar=self.quotechar,
quote_always=self.quote_always
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same again

Comment on lines 136 to 137
quotechar: str = '"',
quote_always: bool = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same again

Comment on lines 247 to 257
@pytest.mark.parametrize("quotechar, filename", [("'",basic_single_quote),
('"',basic_double_quote),
])
def test_quote_basic(quotechar,filename):
import math
parser = StarParser(filename)
assert len(parser.data_blocks) == 1
assert parser.data_blocks['']['no_quote_string'] == "noquote"
assert parser.data_blocks['']['quote_string'] == "quote string"
assert parser.data_blocks['']['whitespace_string'] == " "
assert parser.data_blocks['']['empty_string'] == ""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is sick 🙂

tests/test_parsing.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@alisterburt alisterburt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! I'll push a release too, thanks again @jojoelfe 🙂

@alisterburt alisterburt merged commit 9f3fe21 into teamtomo:main Dec 5, 2023
5 checks passed
@alisterburt
Copy link
Collaborator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants