-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Interpreting malformed chemical formulas as substances #202
Comments
Hi @bertiewooster, glad to hear you're finding ChemPy useful, and thanks for this report! Currently my time available for ChemPy is somewhat limited, but we have a few interested parties so hopefully someone will find the time to implement this soon (you're of course welcome to submit a patch yourself if you wish/can but just reporting issues is already much appreciated!). |
Hi @bjodah thanks for the response! If no one else expresses interest in taking this on, I can try. Regardless, I thought it'd be useful to establish the parameters of a chemical formula that ChemPy is designed to interpret. Is there an existing standard for chemical formulas (I couldn't find one easily from IUPAC)? If not, here's my understanding: ElaqElbq(ElcqEldq)r...Elnqsc where
whereas parentheses are currently supported, for example
Expanded formulasShould we support any of the following?
|
I think I actually have a version of the formula parser (or at least the grammar) in chempy that handles dot notation and brackets for complexes. As I recall it actually broke a fair amount of the current tests and time constraints stopped my inquiries in that direction. While writing parsers for other situations I have used pyparsing's parse exceptions to signal errors so it can be done (easily). I suppose the questions are what features are desirable in the formula parser (dot notation for hydrated crystals, square brackets for complexes, @ caged symbol, etc.), what behavior is desired for the If I can get some clarity on these issues, I will patch the relevant bits of my parser onto the current version and investigate further. |
To summarize what I believe are the questions, and offer my suggestions:
|
I have this in a local parser, but I intended to treat brackets as complexes. PyParsing lets you treat matches individually, so I think in general you could treat brackets as grouping symbols and also have the complex information available. Is there a special meaning for braces or are they simply a grouping alternative?
This is part of the grammar already.
This should be easy to add.
This is already in the parser and tests. I can't remember if it is documented.
I think this is easily doable. I believe it would reduce the number of errors as well because it does seem unexpected that anything else would run after a failed parse. Let me work on my local branch some and I'll report back with some details. |
IUPAC provisional recommendation (see IR-2.2.1 at bottom of page 3) lists braces as a grouping alternative, to be used alternately with parentheses to help avoid confusion:
By the way, how would treating square brackets as complexes (rather than simply a grouping alternative) affect the resulting substance? |
I assumed that was the case. But nomenclature is such a wide and specialized area, sometimes it's hard to keep track of it all.
Hopefully, it wouldn't. I was looking more to add information as opposed to treating them differently. Things that depend on elemental composition should never know the difference. But, if something is in square brackets, then code could (optionally) treat it as a coordination complex (as opposed to a polyatomic ion or other group) and do appropriate stuff with it like deducing oxidation states, naming, decomposing the group into ligands, etc. I think my goal at the time was to use the parser with some additional nomenclature code to be able to interconvert formulas and names. |
I haven't been able to find an approved IUPAC standard for formula nomenclature. The closest thing I found is this story, Nomenclature Notes:
|
I'm sure there is one since they have a standard for everything, but I haven't looked. I have looked at the one for organic nomenclature and it makes an excellent doorstop. Regardless, I've done a little work and hacked in everything except the I'll try to finish the remaining two (or three) in the next few days in the above referenced branch. I have a feeling that avoiding formatter hacks and adding metadata will require more extensive use of PyParsing's parse actions and some (hopefully internal only) reformatting of the parsing functions. |
Parse species like `Li@C60` indicated one species trapped by another, such as a lithium atom inside a buckminsterfullerene. Closes: closes bjodah#202 Signed-off-by: Jeremy A Gray <[email protected]>
My branch should have everything provisionally implemented now, so have a look. I think it should have more tests probably some more reorganization of the tests. I really think the choice of operators bears some thought and discussion. In this branch, I'm using These changes don't affect most of the other tests, but accomodating rational subscripts did require some additional |
@jeremyagray Thanks. Could you create a (draft) pull request to highlight the changes? |
How about using a raised dot |
Suggest a couple additional tests that the following parse to substances correctly, unless I missed you already adding similar tests:
|
I was thinking in ASCII only. I know everything is in utf-8 now, but the parser and tests are still written as if they expect ASCII. I don't know when that paradigm shift should be made. If the switch is made to parse utf-8, then it doesn't matter as the subscripts, coefficients, and charges are all different characters and become easy to parse.
|
Both of those decisions seem fine:
Just to check, there's no currently-supported symbol for hydrates, correct? |
Right now it is |
While it would be nice to keep the hydrate symbol as the period How about tests to verify that capitalization is being taken into account, for example
And test that a charged caged item is valid, for example |
I checked the As for the latter example, the parser will not handle that as it is currently constructed since the actual parser is only handling formula groups and their subscripts and helper functions are processing the rest in To actually allow flexibility like this, the parser itself needs to generate a useful data structure that holds all this information and the helper functions need to just query this data structure and not do any text processing at all. That way, parsing {
"composition": {0: 1, 3: 1, 6: 60},
"charge": 1,
"cage": True,
"cations": ("Li+",),
"terms": ("Li+", "@C60",),
...
} instead of |
Thanks for all your work on this as I throw edge cases at you :-)
That works for me--I think it's best to merge in (and perhaps release?) your much-improved parser with its current capabilities, and leave that Where should we document the symbols such as Other than that, unless you have any more little improvements in the works, recommend you create a pull request so it can be reviewed and merged. |
The documentation is in the README, the example Jupyter notebooks, and in autogenerated API documentation. I have to look at the autogen setup to see where and how to document these changes for the API docs, but some documentation will need to be added to the README as well. The only other things to do are to decide what to do about the break caused by changing from |
Would you like me to try updating the Parsing formulae section of README.rst? |
Great work guys! And thank you @jeremyagray for making this happen. I agree with your direction, and I too think we should mint a new release once we've merged the changes to the parser. I'm fine with breaking backwards compatibility if needed (unless we can do a deprecation cycle with very little extra work). |
Sure, @bertiewooster, just let me get things together and push out the pull request, probably later this week or this weekend. As for the breaking change on the hydrate operator, it would probably be nice to the users to warn them about the change. I'll look at it some more to see if there is a less hacky way to integrate rational subscripts that avoids the breaking change. |
The breaking change is that The parser did not previously support rational subscripts, for example |
Yes, this does add rational subscripts, with some tests, and changes the operators as you describe. This was requested back in issue #176. But I am still trying to avoid the new operator so as to not break anything. I've got my installation issues sorted and have found some more parsing induced test failures in some skipped tests in test_chemistry.py that I need to investigate, so my PR is pending that fix. |
I've finally got the PR finished for those interested. I found no good way to avoid using the new hydrate operator without presupposing some knowledge of the formula and structure so that will still be a breaking change. There are alternatives, such as requiring unicode input, that would vastly simplify parsing. I did need to make a few more modifications to the parser internals, but I don't think there are any external changes other than the intended ones and the added tests seem to support this conclusion as far as they go. |
@bjodah I'm happy to work on updating the README for the new . hydrate Side note: I was thinking of updating the format to demonstrate what ChemPy produces to clarify a new user doesn't need to create a dictionary, for example instead of
using
|
Hi @bertiewooster, sorry for the delay. Sure those updates look great! |
I have not tried the doctest yet. How do I run it? |
Let's see if I remember:
So |
@bjodah That doctest command ran as it should! I asked partly because one can't run I got all the doctests to pass except the known issue that @jeremyagray is coding a fix for:
So as soon as he pushes that fix, I should be able to pull it and get all the README.rst doctests passing. P.S. Minor annoyance: For some reason, when I run
( |
I can't replicate the image getting changed on running the readme doctests, so it may be environment dependent. I didn't know those tests were there and not running with the rest. When I did run them on my branch for #209, the redox tests around line 225 had to be reversed to work. I think that this reversal doesn't really matter since there is no implication that the reaction is spontaneous as produced since we're not using any equilibrium constant or reaction potential data to produce the reaction. If this is the case, we probably need to document that the reaction is balanced but not necessarily spontaneous as written. |
Hi, thanks for the ChemPy package! As I was using
Substance.from_formula
, I noticed that one could easily make a mistake in the formula string and not be notified about it. For example, if I wanted to add methanol, if I don't make any mistakes it works great:If I make a minor mistake, for example forgetting to capitalize the first H for hydrogen, ChemPy gives no warning and simply stops at the last valid element. So the formula string is interpreted as simply
C
for carbon, even though thename
is the entire supplied formula stringCh3OH
:Is there something like a
strict
flag (option) to throw a warning or even exception (error) if the entire formula string cannot be interpreted as a substance? If the entire formula string cannot be interpreted as a substance, should the substance'sname
have only the part that was interpreted as a substance?The text was updated successfully, but these errors were encountered: