Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parse_index function for parsing NDIndex from string #91

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

telamonian
Copy link
Contributor

@telamonian telamonian commented Oct 12, 2020

Fixes #86, see #86 for discussion.

Adds a literal_eval_index function, built off of a modified version of the ast.literal_eval function from the cpy std lib.

I've tested the literal_eval_index function in both cpy37 and cpy38, seems to be working pretty well. Still needs some work to integrate this into the ndindex package (including adding tests).

@telamonian
Copy link
Contributor Author

I've now also tested this code in cpy39 (after some teething issues, I managed to use pyenv plus some json hackery to get isolated jupyter kernels for each of cpy 37, 38, and 39), and it seems to be working well. Basically, the ast authors removed a bunch of classes from ast in cpy38 and cpy39 (eg ast.Num), but they left in class stubs so my calls to eg isinstance(x, ast.Num) don't cause errors in any version.

@telamonian
Copy link
Contributor Author

I've now added unittests for literal_eval_index. Locally they're passing with 100% coverage.

There's still a couple of minor flaws in how I'm using hypothesis to generate string representations of ellipses (always'Ellipsis', never '...') and slices (always 'slice(...)', never 'x:y'), but all of those cases are being covered by my @examples. Shouldn't be too hard to improve that.

Hypothesis is tricky! And interesting. I've never used it before

@@ -0,0 +1,67 @@
import ast
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a module level docstring

@scopatz
Copy link
Contributor

scopatz commented Oct 13, 2020

The code looks reasonable to me. @asmeurer - will you take a look at this please, whenever you get a chance? Thanks!

@example('slice(-12, -72, 14)')
@example('3:15, -5, slice(12, -14), (1,2,3)')
@example('..., -5, slice(12, -14), (1,2,3)')
@example('3:15, -5, slice(12, -14), (1,2,3), Ellipsis')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not valid indices, at least as far as ndindex is concerned. Nested tuples are not allowed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did see

https://github.com/Quansight/ndindex/blob/9e5399fb20f5800474aa2bf081e8323cfeaeabf4/ndindex/tuple.py#L65

Why did you decide to be more restrictive about this syntax than numpy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is that I wanted to avoid confusion from both Tuple(1, 2) and Tuple((1, 2)) being valid and meaning different things. See #17.

I also had thought that using a tuple to represent an array was deprecated, but now I can't find that. Certainly there is ambiguity even at the NumPy level. If you expect a tuple to act like a fancy index, then you will be surprised that a[array([1, 2])] will give a different thing than a[(1, 2)].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Particularly this comment #17 (comment). Perhaps this should be added to the docs. I do have https://quansight.github.io/ndindex/type-confusion.html#tuple, but that is more about best practices and doesn't really discuss why things are the way they are.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so it will be a little difficult to implement disallowing nested tuples indexes, since there are a bunch of places where we do need to parse ast.Tuple regardless. I think the way to do this will be to include some kind of guard in the ast.Tuple handling clause of _convert that will raise an error if it gets called twice. So far I've got:

class _Guard:
    def __init__(self):
        self.val = False

    def __call__(self):
        if self.val:
            return True
        else:
            self.val = True
            return False

and then

    _nested_tuple_guard = _Guard()
    def _convert(node):
        if isinstance(node, ast.Tuple):
            if _nested_tuple_guard():
                raise ValueError(f'tuples inside of tuple indices are not supported: {node!r}')

            return tuple(map(_convert, node.elts))
    ...

There's probably a cleaner way to implement the guard

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the simpler way is to not worry about nonsense combinations here, and let ndindex() handle it at the end.

@asmeurer
Copy link
Member

asmeurer commented Oct 13, 2020

Thanks for doing the work on this. This is looking good. Some comments:

  • eval_literal_index should pass the result to ndindex() so that it returns an ndindex type. The main benefit of this is that you will get type checking on the index that you wouldn't get without doing so. For example, currently nonsense like literal_eval_index('slice(slice(1, 2))') is allowed. literal_eval_index('1.0') is also allowed. In this way, we don't have to do type checking at the ast level, and can only handle the most basic things.

  • eval_literal_index isn't the greatest name. I would prefer something like parse_index.

  • This code can go in ndindex.py.

  • The hypothesis tests are fine. You are basically checking that literal_eval_index(str(index)) works for any ndindex type. I think it might be simpler to do

@given(ndindices)
def test_literal_eval_index(idx):
    assert literal_eval_index(str(idx)) == idx

This doesn't handle the cases of non-conforming indices, though. For that, we would need some strategies that generate ast. I think hypothesis doesn't have first-class support for this. We could try something like https://github.com/Zac-HD/hypothesmith, but I think it might not be easy. So for this, I think we can just have a bunch of normal tests that syntax features outside of the allowed list are not valid. And also test that things that look valid from the ast but aren't actually valid fail properly.

@asmeurer
Copy link
Member

asmeurer commented Oct 13, 2020

The hypothesis tests are fine. You are basically checking that literal_eval_index(str(index)) works for any ndindex type. I think it might be simpler to do

Actually I think it needs to look like this

@given(ndindices)
def test_literal_eval_index(idx):
    index = ndindex(idx)
    str_index = str(index)
    repr_index = repr(index)
    try:
        assert literal_eval_index(str_index) == idx
    except SyntaxError:
        # str() gives invalid syntax for array indices
        assume(False)
    assert literal_eval_index(repr_index) == idx

This will test both the str and repr forms, and properly skip the test for the str form of an array (which isn't valid Python).

This still won't handle a lot of combinations, including slice literals, but we'd have to create custom strategies to do that. Although I've also considered that perhaps Slice.__str__ should return slice literal syntax.

@asmeurer
Copy link
Member

Regarding Python 3.9, it is out now, so we need to support it. See #92

@ericdatakelly ericdatakelly added this to the October 2020 milestone Oct 13, 2020
@telamonian telamonian changed the title [WIP] Add literal_eval_index function for parsing ndindex from string Add parse_index function for parsing NDIndex from string Oct 19, 2020
@telamonian
Copy link
Contributor Author

Actually I think it needs to look like this

@given(ndindices)
def test_literal_eval_index(idx):
    index = ndindex(idx)
    str_index = str(index)
    repr_index = repr(index)
    try:
        assert literal_eval_index(str_index) == idx
    except SyntaxError:
        # str() gives invalid syntax for array indices
        assume(False)
    assert literal_eval_index(repr_index) == idx

This will test both the str and repr forms, and properly skip the test for the str form of an array (which isn't valid Python).

@asmeurer I tried the approach to writing test_parse_index that you suggest above, but it fundamentally won't work as-is. If I understand your intention correctly, you want (at least roughly) the repr/str ops and the parse_index op to be reciprocal w.r.t. equality, right? This will require some reworking of how str and repr work in NDIndex and/or some of its subclasses.

@asmeurer @scopatz Can you please give this PR another review and let me know if anything still needs to change? Since the version of parse_index in this PR works and is passing all tests/coverage checks as of right now, my vote would be to pull this PR in (more-or-less) as-is. We could then ensure that repr/str and parse_index are reciprocal in a future PR which also makes the necessary rationalizations to the behavior of repr/str themselves.

@telamonian
Copy link
Contributor Author

@asmeurer I played around with repr/str and parse_index a bunch, but everything I tried ran into some kind of trouble. Here's a summary of some of the obstacles:

  • Most returns from repr(ndindex(x)) or str(ndindex(x)) start with the name of an ndindex class (eg Tuple, IntegerArray, etc) which parse_index does not (and probably should not, ideally it will stick to literals) understand

  • There's special handling for repr(ellipsis()) which turns it into an empty plain tuple (ie ())

  • For array indices, str(IntegerArray([0])) is still valid syntax for parse_index if the shape of the array index is 0 or 1

Basically, whenever I fixed one issue I ran into another, so all of the complete fixes I came up so far involve radical changes to .__repr__, .__str__, and/or .raw for some/all NDIndex subclasses

@telamonian
Copy link
Contributor Author

I'll close/reopen this PR to trigger the new py39 test

@telamonian telamonian closed this Oct 23, 2020
@telamonian telamonian reopened this Oct 23, 2020
@telamonian
Copy link
Contributor Author

The PR is now passing for py39 as well

@@ -65,6 +66,91 @@ def __init__(self, f):
def __get__(self, obj, owner):
return self.f(owner)

class _Guard:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this guard stuff is needed. The Tuple constructor should catch this after calling ndindex() at the end (and will give a better error message than what this currently gives).

node_or_string = node_or_string.slice

def _raise_malformed_node(node):
raise ValueError(f'malformed node or string: {node!r}, {ast.dump(node)!r}')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The node part here isn't particularly helpful alongside the dump.

Also I don't understand why the literal_eval error message says "node or string". How can the string be malformed? If there is a syntax error, it raises a different exception. I like this better:

Suggested change
raise ValueError(f'malformed node or string: {node!r}, {ast.dump(node)!r}')
raise ValueError('unsupported node when parsing index: {ast.dump(node)}')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your language is clearer in any case. I'll pull this suggestion in


def slices(start=one_of(none(), ints()), stop=one_of(none(), ints()),
step=one_of(none(), ints())):
step=one_of(none(), ints_nonzero())):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This includes zero step because it is used to test the slice constructor. Although it would probably be better to use this for the tests that don't actually test that (see the discussion at #46). It may be better to work on this in a separate PR, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this since I kept getting ValueErrors due to the related check in the Slice constructor.

So what I don't get is how this wasn't causing errors previously for tests using the @given(ndindices) strategy. Wouldn't that just mean that (due to one of the many weird quirks of hypothesis) @given(ndindices) just happened to never attempt to generate ndindex(slice(x, y, 0))?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ndindices uses filter(_doesnt_raise) which make it only generate valid indices. Actually I suspect this means several tests

ints(),
slices(),
Tuples,
).map(lambda x: f'{x}')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put strategies in helpers.py

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can include arrays here too, although you'll have to use the repr form in that case.

@example('..., 3:15, -5, slice(-12, -72, 14)')
@given(ndindexStrs)
def test_parse_index_hypothesis(ixStr):
assert eval(f'_dummy[{ixStr}]') == parse_index(ixStr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can check here if eval gives an error than so should parse_index.

@asmeurer
Copy link
Member

One idea to get better test coverage would be to __str__ for ndindex types to give the raw string form, like

>>> str(Slice(1, 2))
'1:2'
>>> str(Tuple(1, 2))
'(1, 2)'

and then test both str and repr of ndindex(idx) (and ndindex(idx).raw) (str for arrays will still have to be skipped because it uses non-valid syntax).

It may also be a good idea to support the ndindex classes as valid names, so parse_index('Slice(1, 2)') works.


def parse_index(node_or_string):
"""
"Safely" (needs validation) evaluate an expression node or a string containing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "needs validation" here mean?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it depends on what we mean by "safely". parse_index('('*1000 + ')'*1000) raises MemoryError (but so does ast.literal_eval('('*1000 + ')'*1000)). It shouldn't be possible to execute arbitrary code, though, as far as I can tell.

Copy link
Contributor Author

@telamonian telamonian Oct 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't really mean anything, mostly just a reflection of my paranoia about building a) any parser, and b) something meant to take raw user input. My goal has been to make something that's at least as safe as the existing literal_eval. At this point I've examined the code and the surrounding issues enough that I'm reasonably confident this has actually been archived, so I'll edit out the wishy-washy language

def parse_index(node_or_string):
"""
"Safely" (needs validation) evaluate an expression node or a string containing
a (limited subset) of valid numpy index or slice expressions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring should probably clarify what is and isn't allowed. I guess we aren't allowing any kind of expressions that create numbers.

Actually, we may at one point want to support just anything that parses, and make the current behavior under a safe=True flag, which I guess should be the default. But we can worry about that later.

Also this docstring needs examples. and should be added to Sphinx.

@ericdatakelly ericdatakelly modified the milestones: October 2020, November 2020 Nov 16, 2020
@ericdatakelly ericdatakelly modified the milestones: November 2020, December 2020 Dec 17, 2020
@ericdatakelly ericdatakelly modified the milestones: December 2020, January 2021 Jan 6, 2021
@ericdatakelly ericdatakelly modified the milestones: January 2021, February 2021 Feb 4, 2021
@ericdatakelly ericdatakelly modified the milestones: February 2021, March 2021 Mar 8, 2021
@ericdatakelly ericdatakelly modified the milestones: March 2021, April 2021 Apr 5, 2021
@ericdatakelly ericdatakelly modified the milestones: April 2021, May 2021 May 10, 2021
@ericdatakelly ericdatakelly removed this from the May 2021 milestone Oct 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

A way to serialize/deserialize ndindex
4 participants