-
Notifications
You must be signed in to change notification settings - Fork 0
How do we represent data?
Data is arranged in RemoteHDT as follows. First, we create a dictionary of non-repeated data, where each entry represents a mapping between a URI and a number. As such, we can highly compress the information stored in the resulting dump. Second, a 2-dimensional weighted sparse matrix is created, where each axis represents either the Subjects or the Objects. The way data is oriented can be arbitrarily chosen. To put it into perspective, provided the SPO orientation; data is going to be stored in s matrices, one per different subject, where rows represent the subjects and columns the objects. At the end, we will have a 2-dimensional array having the shape of s x o. In case there's a mapping between a subject, predicate, and object, that is a triple, a number will be stored at its corresponding position in the array indicating which predicate corresponds to that subject - object relation; a 0 will be stored otherwise.
It is worth mentioning that arrays in this model are said to be sparse, as most of the elements are 0. As such, it is interesting to be aware of that as we could highly increase the performance of queries. As such, we could highly compress matrices and evaluating only parts of the computation; to put an example, if we multiply two matrices and all the values in a row of the first factor are 0, we know that the results in which that row is involved will be 0 in any case. Hence, we can perform some techniques related to early evaluations, or any other optimization for Sparse matrices.
Having the following RDF dataset, serialized in Turtle, let us see how it is stored in RemoteHDT.
PREFIX : <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
:alan :instanceOf :human ;
:placeOfBirth :warrington ;
:placeOfDeath :wilmslow ;
:dateOfBirth "1912-06-23"^^xsd:date ;
:employer :GCHQ .
:warrington :country :uk .
:wilmslow :country :uk ;
:instanceOf :town .
:bombe :discoverer :alan ;
:instanceOf :computer ;
:manufacturer :GCHQ .
As we have mentioned, a dictionary will be created where unique identifiers will be mapped to an unsigned integer value, corresponding to its index in the matrix. Note that for each dimension, the serialization starts at 0 again. That is, subjects will be mapped in a range from 0 to s - 1, while predicates will be in a range from 1 to p, and objects will be from 0 to o - 1.
== Subjects =====================================================
0 --> <http://example.org/bombe>
1 --> <http://example.org/warrington>
2 --> <http://example.org/alan>
3 --> <http://example.org/wilmslow>
== Predicates ===================================================
1 --> <http://example.org/employer>
2 --> <http://example.org/country>
3 --> <http://example.org/placeOfDeath>
4 --> <http://example.org/placeOfBirth>
5 --> <http://example.org/manufacturer>
6 --> <http://example.org/dateOfBirth>
7 --> <http://example.org/discoverer>
8 --> <http://example.org/instanceOf>
== Objects ======================================================
0 --> <http://example.org/warrington>
1 --> <http://example.org/wilmslow>
2 --> <http://example.org/GCHQ>
3 --> "1912-06-23"^^<http://www.w3.org/2001/XMLSchemadate>
4 --> <http://example.org/computer>
5 --> <http://example.org/alan>
6 --> <http://example.org/town>
7 --> <http://example.org/uk>
8 --> <http://example.org/Human>
Lastly, the two-dimensional matrix will be created where for each Triple in the aforementioned RDF dump, a number will be placed in its corresponding (x,y) position. Where x is the integer value representing the subject, y for the object. Here is an example of the resulting matrix
[[2, 4, 5, 0, 0, 0, 0, 7, 8],
[0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 5, 1, 0, 0],
[0, 6, 0, 3, 5, 0, 0, 0, 0]]