Type system

Lukas Gerlach edited this page Oct 25, 2023 · 3 revisions

Everything concerning types in Nemo is still under heavy development and subject to changes that might not be immediately reflected in this wiki page. This page covers the type system in Nemo: the available types, type checks, type inference, and the interplay with values when reading and writing.

Status Quo

First, we describe the behavior that is currently implemented.

Types

We have four PrimitiveTypes in the logical layer of Nemo. (In theory, there is also a Tuple type, but it is not used yet.)

  • Any - read as "any RDF literal" - stored as String in the physical layer
  • String - a plain string - stored as String in the physical layer
  • Integer - a 64-bit integer - stored as i64 in the physical layer
  • Float64 - a double-precision floating point number - stored as Double in the physical layer

The physical type of a PrimitiveType is determined by impl From<PrimitiveType> for DataTypeName.
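The mapping listed above can be sketched as follows. This is a minimal illustration, not Nemo's actual code: the enum definitions and the variant names of DataTypeName (String, I64, Double) are assumptions chosen to mirror the list.

```rust
/// Logical types as listed above (hypothetical mirror of Nemo's enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimitiveType {
    Any,
    String,
    Integer,
    Float64,
}

/// Physical storage types (variant names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataTypeName {
    String,
    I64,
    Double,
}

impl From<PrimitiveType> for DataTypeName {
    fn from(ty: PrimitiveType) -> Self {
        match ty {
            // Any and String share the same physical representation.
            PrimitiveType::Any | PrimitiveType::String => DataTypeName::String,
            PrimitiveType::Integer => DataTypeName::I64,
            PrimitiveType::Float64 => DataTypeName::Double,
        }
    }
}
```

With this in place, `DataTypeName::from(PrimitiveType::Any)` and `DataTypeName::from(PrimitiveType::String)` both yield `DataTypeName::String`.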

Dataflow

Nemo assigns each predicate that occurs in the input program a list of the above types (one for each position). This can be made explicit by the user using e.g. @declare P(any, string). For more information on how these types are determined if not given explicitly, see Type Checks and Inference below.

The types of a predicate determine how a value is handled by Nemo and processed internally. At the moment this also determines how the value is written later. However, it does not determine how the value is read! Literals that occur directly in the program in rules or facts are processed by the parser, which stores them in an enum named Constant that reflects the syntax of the value that has been parsed. Note that literals in the rule file are always put into the Constant data structure.

Values that are read from sources like CSV can be handled differently and skip the Constant data structure in some cases. Just like predicates, sources have a type for each predicate position. These types can be annotated, e.g. @source P(any, integer) load-csv(...), and may differ from the types of the predicate. (If no types are annotated for a source, it falls back to a defined default; for CSV, for example, this is string.) Source types do not (immediately) determine how a value is handled by Nemo; they only specify how a value should be interpreted when reading it. For example, with the above predicate and source declarations for P, the first column of P is parsed as an RDF literal and also handled as such by Nemo. The second column is read as integers (throwing an error if the column contains a malformed integer), but the values are then stringified and treated only as strings internally.
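To illustrate the second column in the example, here is a small sketch of reading a raw field under the source type integer and then handing it on under the predicate type string. The function name read_integer_as_string is hypothetical, and how Nemo actually performs this step is not shown on this page.

```rust
use std::num::ParseIntError;

/// Interpret a raw CSV field as an integer (the source type), then
/// stringify it because the predicate position is declared as string.
/// A malformed integer is reported as an error instead of being kept.
pub fn read_integer_as_string(raw: &str) -> Result<String, ParseIntError> {
    let value: i64 = raw.trim().parse()?;
    Ok(value.to_string())
}
```

For instance, `read_integer_as_string("42")` returns `Ok("42".to_string())`, while a field such as `"4x2"` yields an error.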

When all values have been read from the program and the sources (and mapped to the desired logical representation), the logical values are converted into their physical representation. This conversion is rather straightforward at least for Integer and Float64. For the Any type, i.e. rdf-literals stored in the Constant data structure, the Constant enum is stringified using the enum variants as prefixes to be able to reverse the mapping later. For example, a numeric integer literal 3 is stored as the string INTEGER:3 and a string literal "my string" is stored as STRING:my string. Internally, we want to be able to combine Any and String in their physical representations. Therefore, we store strings in logical String columns also with a STRING: prefix.
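The prefix encoding can be sketched like this. The Constant enum below is a reduced, hypothetical stand-in with only two variants, and the function names to_physical and from_physical are made up for illustration; only the INTEGER: and STRING: prefixes are taken from the text above.

```rust
/// Reduced stand-in for the parser's Constant enum (assumption: the
/// real enum has more variants than the two shown here).
#[derive(Debug, Clone, PartialEq)]
pub enum Constant {
    Integer(i64),
    String(String),
}

/// Stringify a constant, using the variant name as a prefix.
pub fn to_physical(c: &Constant) -> String {
    match c {
        Constant::Integer(i) => format!("INTEGER:{}", i),
        Constant::String(s) => format!("STRING:{}", s),
    }
}

/// Reverse the mapping; returns None for unknown prefixes or
/// payloads that do not parse.
pub fn from_physical(s: &str) -> Option<Constant> {
    if let Some(rest) = s.strip_prefix("INTEGER:") {
        rest.parse::<i64>().ok().map(Constant::Integer)
    } else if let Some(rest) = s.strip_prefix("STRING:") {
        Some(Constant::String(rest.to_string()))
    } else {
        None
    }
}
```

Because values in logical String columns carry the same STRING: prefix, a physical value such as STRING:my string compares equal regardless of whether it originated from an Any or a String position.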

After the reasoning process in the physical layer is finished, the values are mapped back to the logical representations according to their logical type. There are output iterators that can either provide these logical representations directly (for the API) or serialize them to strings (for CSV output). The serialization of a Constant in particular is determined by its Display implementation.

Type Checks and Inference

All predicate types are checked for consistency. Before this check happens, types are inferred wherever possible. Unknown types are set to the default type Any.

The inferences and checks consider type requirements. A TypeRequirement can be Hard, Soft or None. Hard requirements have to be matched and will error on conflicting requirements. Soft requirements give a type hint but can be overridden by type inference. None requirements are simply unknown and can freely be overridden.

First, all explicit predicate declarations (@declare) are converted into hard type requirements. Type declarations from @source statements are interpreted as type hints and therefore converted into soft type requirements if no explicit declaration has been provided. Literals from rules and facts are also used as type hints by assigning each literal a suitable type, which is again turned into a soft type requirement. Predicate positions with existential variables are assigned a hard type requirement of Any since we cannot use nulls otherwise. If this hard Any requirement clashes with another requirement, an error is thrown.
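One way to picture how hard, soft and unknown requirements interact is a merge function over TypeRequirement. The sketch below is an assumption about the shape of such a function, not Nemo's implementation: the merge method, the String error type, and the rule that the first of two soft hints wins are all illustrative choices.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimitiveType {
    Any,
    String,
    Integer,
    Float64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeRequirement {
    Hard(PrimitiveType),
    Soft(PrimitiveType),
    None,
}

impl TypeRequirement {
    /// Combine an existing requirement with a newly discovered one.
    pub fn merge(self, new: TypeRequirement) -> Result<TypeRequirement, String> {
        match (self, new) {
            // Two different hard requirements cannot both be satisfied.
            (Self::Hard(a), Self::Hard(b)) if a != b => Err(format!(
                "conflicting hard requirements: {:?} vs {:?}",
                a, b
            )),
            // A hard requirement overrides soft hints and unknowns.
            (Self::Hard(a), _) | (_, Self::Hard(a)) => Ok(Self::Hard(a)),
            // Assumption: of two soft hints, the earlier one is kept.
            (Self::Soft(a), _) | (Self::None, Self::Soft(a)) => Ok(Self::Soft(a)),
            (Self::None, Self::None) => Ok(Self::None),
        }
    }
}
```

Under these assumptions, a soft `String` hint from a source merged with the hard `Any` requirement of an existential position gives `Hard(Any)`, whereas merging `Hard(Any)` with `Hard(Integer)` produces an error.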

Based on the type requirements, type information is now propagated from rule bodies to rule heads for shared variables; propagation aborts on any conflict that occurs in the process. All type requirements that are still None afterwards are set to Any. Afterwards, body positions with the same variable are checked for compatibility, and a few additional consistency checks, e.g. for arithmetic operations, are carried out.
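The following sketch shows body-to-head propagation with the Any default for a single rule. It is deliberately simplified and rests on several assumptions: variables are plain strings, each body position carries an optional known type, the function name infer_head_types is hypothetical, and the compatibility check for repeated body variables is folded into the same pass, although the text above describes it as a later step.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimitiveType {
    Any,
    String,
    Integer,
    Float64,
}

/// `body` lists (variable, known type) per body position; `head` lists
/// the variable at each head position. Returns one type per head position.
pub fn infer_head_types(
    body: &[(&str, Option<PrimitiveType>)],
    head: &[&str],
) -> Result<Vec<PrimitiveType>, String> {
    let mut var_types: HashMap<&str, PrimitiveType> = HashMap::new();
    for &(var, ty) in body {
        if let Some(ty) = ty {
            match var_types.get(var) {
                // The same variable occurs with two different types.
                Some(&prev) if prev != ty => {
                    return Err(format!(
                        "variable {} has conflicting types {:?} and {:?}",
                        var, prev, ty
                    ));
                }
                _ => {
                    var_types.insert(var, ty);
                }
            }
        }
    }
    // Head variables without any type information default to Any.
    Ok(head
        .iter()
        .map(|v| var_types.get(*v).copied().unwrap_or(PrimitiveType::Any))
        .collect())
}
```

For a rule whose body binds `x` to `Integer` and leaves `y` unknown, a head over `(x, y)` is inferred as `[Integer, Any]` in this sketch.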

Desired Behavior

Here, we keep track of decisions that have been made regarding the type system. Those shall be implemented in the long run.

TODO
