-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Formalizing the Schema #37
Conversation
Continue to move to repo snapshot Finshes transition to project Finishes moving to a module finshed modulizing
.gitignore
Outdated
@@ -99,3 +99,11 @@ ENV/ | |||
|
|||
# mypy | |||
.mypy_cache/ | |||
|
|||
# Vim leftover |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not everyone uses vim, and this is not strictly related to the schema. Maybe put this in your private ignore list? (.git/info/exclude)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, can do. I normally see these left in the .gitignore
files, is there a reason to move them out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO projects should have just entries relevant to their code. But of course it's not super important.
qc_schema/dev/molecule.py
Outdated
} | ||
}, | ||
"masses": { | ||
"description": "The mass of the molecule, canonical weights assumed if not given.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is plural but the description has singular mass.
"The doubles portion of the MP2 correlation energy including same-spin and opposite-spin correlations." | ||
} | ||
|
||
mp_properties['mp2_total_correlation_energy'] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be useful to also have a total MP energy property. I know it's easy to sum up, but some people like to see the total energy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, missed this.
scf_properties = {} | ||
|
||
scf_properties["scf_one_electron_energy"] = { | ||
"description": "The one-electron energy contribution [H] to the total SCF energy.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and later on, what is [H]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The core Hamiltonian, I dont really look at these descriptions as complete (they are not intended to be), but more about the general scaffold to build on.
"$ref": "#/definitions/provenance" | ||
} | ||
}, | ||
"required": ["symbols", "geometry"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually enforced somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you run this through a json-schema validator you will get a raise like the following:
E jsonschema.exceptions.ValidationError: 'symbols' is a required property
E Failed validating 'required' in schema['properties']['molecule']:
Im not sure how to require this at other levels.
docs/source/tech_specs.rst
Outdated
How do we store large lists of objects (such as lists of atoms or bonds?) | ||
|
||
1) As a table of values (these values do very well in HDF5 as well) | ||
2) As a set of arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably emphasize that 2)
has been chosen.
On a personal note, I was using 1)
in psi4 as closer to local layout. Switched over to 2)
after LBNL mtg and found it perfectly easy to work in and much tidier to inspect.
* Repeat of input components | ||
* Driver return - Return of the requested driver (energy/gradient/etc) | ||
* Properties - Other properties/values constructed as by products of the computation | ||
* Provenance - Code, computer, user information, actual settings used by the code (lots |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Request that this be a list so that multiple programs can contribute provenance to a calc.
"description": "The singles portion of the MP2 correlation energy. Zero except in ROHF." | ||
} | ||
|
||
mp_properties["mp2_double_energy"] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this and the previous be "singles" and "doubles"?
qc_schema/dev/molecule.py
Outdated
} | ||
}, | ||
"geometry": { | ||
"description": "The 3N XYZ coordinates of the atoms involved.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explicitly mention flat (3N, ) rather than nester (N, 3), I think.
scf_properties["scf_one_electron_energy"] = { | ||
"description": "The one-electron energy contribution to the total SCF energy.", | ||
"type": "number" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is meant to refer to Hartree--Fock, not HF-or-DFT, I suggest we be explicit now and use "HF".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we define that is as simply \sum{D * H}
? I worry that splitting HF and SCF into two parts could get much more complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other thing missing IMHO was Bob Hanson's discussion about having a required 'magic header' that indicates that this JSON is, in fact, a QC_JSON version 1.0.
I think that's useful for handling files 'in the wild' for parsers, but also to ensure we can gracefully upgrade the spec over time.
@@ -23,15 +23,15 @@ The following are optional fields and default values (option, more a list of pos | |||
- `comment` (str) - Any additional comment one would attach to the molecule. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, based on #35 that it makes sense to have an "identifier" section, with comment, molecule name, formula, InChI, SMILES, etc. as optional. This would probably also include provenance and DOI.
Looking at my notes, I think these were grouped in the discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you thinking of comment
as a dictionary? I was considering moving up a few of those to top level optional fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I'm thinking that there's an obvious request for identifiers, and I think comment/title have usually been identifiers in QC codes.
So my suggestion is that there's an explicit identifier object - something like:
"identifier" : {
"name": "aspirin",
"comment": "training set",
"formula": "C9H8O4",
"smiles": "O=C(C)Oc1ccccc1C(=O)O",
"inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
}
docs/source/index.rst
Outdated
|
||
**Translators:** | ||
- `cclib <http://cclib.github.io>`_ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
openbabel <http://openbabel.org/>
_
docs/source/index.rst
Outdated
- `Avogadro <https://avogadro.cc>`_ | ||
- `Molecular Design Toolkit <https://github.com/Autodesk/molecular-design-toolkit>`_ | ||
- `VTK <http://www.vtk.org>`_ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jmol / JSmol <http://jmol.org/>
_
docs/source/tech_specs.rst
Outdated
['N', 7, 14.20, 0, [0.214,12.124,1.12], [0,0,0]], | ||
...} | ||
|
||
// 2) Storing fields as arrays: much more compact, but harder to read and edit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps its harder to read/edit as a human, but not in a program. Please don't use this language. As @loriab mentioned, this was the choice at the workshop and there are a lot of benefits on the "writing code" side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was mostly porting over current Markdown documentation to RST as an example, but there seems to be bringing up a bit of old discussion that is indeed solved. Should we covert the solved topics over to a simpler discussion of why the choice was made?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that's a good idea. As people implement, I don't think there should be ambiguity and I think a discussion of the choice is good.
(Incidentally, I'd suggest when the schema is done, that this be published in J. Cheminf.)
docs/source/tech_specs.rst
Outdated
|
||
.. code:: python | ||
|
||
// 1) Storing fields as tables: creates an mmCIF/PDB-like layout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please no. Let's not even suggest something remotely like the mess of PDB files.
"description": "The name of the molecule.", | ||
"type": "string" | ||
}, | ||
"comment": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make these part of an object for "identifiers" to reflect feedback from #35
qc_schema/dev/molecule.py
Outdated
} | ||
}, | ||
"fix_com": { | ||
"description": "Whether to adjust to the molecule to the COM or not.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COM? I assume you mean "center of mass" but let's be explicit.
""" | ||
} | ||
|
||
scf_properties["scf_dipole_moment"] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have to be 'scf' dipole moment? Maybe I'm pedantic, but what if I run CCSD(T) - most programs don't print out two different dipole moments IIRC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its good to clarify if someone does say CCSD bond orders. It can become unclear if you stored the SCF or CCSD dipoles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't that lead to a proliferation of new objects? Maybe there should be a properties
object:
property: {
type: "dipole moment"
method: "SCF"
value: ...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem like too many more fields as we will have energy definitions for these slots anyways. I might be missing your point on the object side of things however.
For the magic header, poking around there doesn't seem to be a full standard. However, finding a few examples this following seems popular:
Should this be spun off into its own issue however? |
I'd probably go with name/version/url/description in that order, but implementations will go with alphabetical order. Can we drop the description part? Then you'd have:
That seems suitably unique for file/magic sensing. |
IMHO, this is a great start and I'm in favor of merging this and starting new issues / pull requests. |
Im in favor of merging as well. Certainly a lot left to do, but we can break the current topics off into issues. |
This is looking great, and would agree that merging, and then breaking off into smaller PRs/discussions would be beneficial. |
Going to pull this in and setup the docs. Please spin off issues or PR's as needed to fix current issues. |
Description
To kick this off again, I have started creating a json-schema implementation based the discussed schema along with documentation via Sphinx and RTD. As a forward I should note that the implementation here is an attempt to build something concrete. Many things can be changed, but some sort of core needs to be formed to move forward.
This has three basic parts:
The presented schema aims to provide a very basic interface with a few notable points such as lack of unit support. Before continuing to build out this schema I wanted to get intermediate feedback about the current setup and general organization.
Questions
Status