-
Notifications
You must be signed in to change notification settings - Fork 14
Representing time dependent graphs in Neo4j
Large-scale data collection efforts using wearable sensors to mine for proximity of individuals (for example, the SocioPatterns project) produce time-dependent social graphs, where nodes are individuals, edges represent proximity/contact relations of individuals, and the proximity graph changes over time. Both nodes and edges can have rich attributes.
Data formats for exchanging the time-dependent graphs are available, see for instance the GEXF format. Efficiently mining large time-dependent graphs, however, requires a database and a representation that can support complex topological queries, temporal queries, multi-scale temporal indexing and aggregation, and more. A growing research community working on temporal networks may benefit from sound and efficient techniques to represent, store and query dynamic graphs.
Here we want to start a discussion on best practices for modeling time-dependent graphs in Neo4j.
In the following we will assume that the social network is measured over adjacent discrete intervals of time that we will call frames. A frame is the finest unit of temporal aggregation that we use, and is associated with a time interval defined by a starting time and an ending time. A frame allows to retrieve the status of the social network during the corresponding time interval, and it is thus associated with a social graph at a given point in time. The nodes of such social graphs represent individuals. Edges represent, for instance, the proximity relations of individuals during the time interval of the frame. Edges are undirected and weighted, but the model described here generalizes to directed weighted edges and to multi-relational networks.
The Neo4j property graph representation is appropriate for time-dependent graphs, however it requires a more complex representation than the obvious choice of representing nodes as Neo4j nodes and edges as Neo4j relations. The time-dependent nature of the social graph requires to associate nodes and edges to frames in arbitrary ways. The natural way in Neo4j to represent this association of nodes and edges with a given interval of time (frame) is to draw relations to the frame node(s) they refer to. This suggests to represent both nodes and edges as Neo4j nodes. To avoid ambiguity and stay close to the language of our application, where nodes of the social graph (individals) are represented by their wearable sensor (tag), in the following we will refer to the nodes of the social graph as tags.
We use the following representation for a dynamic social graph in Neo4j (see figure below):
- A time-dependent graph as a whole is accessed as a RUN node, corresponding to a time-resolved record of social network evolution. A RUN node is connected to the reference node by means of a HAS_RUN relation.
- RUN nodes have RUN_FRAME relations to all the FRAME nodes. They also have a RUN_FRAME_FIRST relation to the first frame of the graph history, and a RUN_FRAME_LAST to the last one.
- Each FRAME node points to the successive frame by means of a FRAME_NEXT relation.
- Each FRAME node points to the status of the social graph during the corresponding time interval: it has FRAME_TAG relations to TAG nodes (representing tagged individuals) and FRAME_EDGE relations to EDGE nodes (representing a proximity/contact/social relation). Timestamps and interval metadata are represented as attributes of the FRAME node.
- An EDGE node has two EDGE_TAG relations to the TAG nodes that the edge connects to one another.
- Time-dependent attributes for an edge of the social graph (for example, time-dependent edge weights) are represented as attributes of the FRAME_EDGE relations that associate that edge with the corresponding frame(s).
- Similarly, time-dependent attributed of a nodes (tag) of the social graph are represented as attributes of the FRAME_TAG relations that associate that node with the corresponding frame(s).
- All TAG and EDGE nodes are connected to the main RUN node with RUN_TAG and RUN_EDGE relations.
- [QUESTION 1: are the (many) RUN_FRAME and RUN_EDGE relations really useful? The RUN_EDGE relations can be used to hold the weight of edges in the cumulative network, but the same information can be placed on EDGE nodes.]
The evolution of the graph can be tracked by walking from one frame to the next. Efficient random access to the frame timeline can be supported better by suitably indexing the time-stamped sequence of FRAME nodes. This can be achieved in different ways, such as using binary trees (like in the Neo4j Timeline class), or mirroring the natural temporal hierarchy of the data (year/month/day/hour/...) with hierarchical temporal relations between nodes.
Here we choose the latter technique, because it preserves the natural units of temporal aggregation of the data (days, hours, etc.). The temporal indexing structure we use is shown in the figure below, and is similar to the multilevel indexing structure of the Cypher cookbook. Of course, it can co-exist with other temporal indexes of the FRAME nodes, such as the native one.
- We build a tree that explicitly represents the temporal hierarchy of the dataset. The nodes of the tree are TIMELINE nodes. The top-level node, which is the entry point for the temporal index, is reachable from the RUN node through a HAS_TIMELINE relation.
- Nodes at each level of the tree have NEXT_LEVEL relations to nodes at the level below.
- The nodes at each level of the tree represent different scales of temporal aggregation, according to the time units that are most appropriate for the dataset at hand. For the same of simplicity, the tree in figure has a "day" and a "hour" levels only, but in other domains it may be useful to have a deeper hierarchy such as "year" / "month" / "day" / "hour" / "minute".
- At each level, time attributes are associated with the NEXT_LEVEL relations [QUESTION 2: is it more efficient to have them on nodes? Should we have both?]
- The nodes at the bottom level of the tree correspond to the finest scale of temporal aggregation (hours, in figure) and are connected to the indexed FRAME nodes via TIMELINE_INSTANCE relations. The timestamp of each FRAME node is replicated as a relation attribute of the corresponding TIMELINE_INSTANCE relation.
- [QUESTION 3: would it be convenient to have relations connecting horizontally nodes at each level of the tree, in order to walk sequentially, for example, through all the "hour" nodes? Breadth-first traversal and suitable naming of the TIMELINE nodes probably takes care of this in an efficient way.]
- [QUESTION 4: is it best to record the time attributes at different scales (day, hour, etc.) as types of the relations connecting nodes in the tree, as described in the Cypher cookbook's path tree ? ]
To simplify testing with the proposed data structures, we provide a simple Python script that uses the Neo4j Python REST bindings to load a dynamic GEXF document into a Neo4j store.
For testing, we will use an empirical dynamic social network measured at the ACM Hypertext 2009 conference by the SocioPatterns collaboration. The network of proximity relations among the conference attendees was recorded every 20 seconds for about 3 days and is available for download in GEXF format. In this dataset, nodes are individuals and edges represent face-to-face proximity relations of individuals.
Example: The following command loads the GEXF document into a local Neo4j REST server. The RUN node is named "HT2009" and data are loaded starting at POSIX time 1246191120 (2.12pm on Jun 28th 2009) with time frames of 20 seconds:
./load_gexf_to_neo4j.py ht2009_20sec.gexf HT2009 1246191120 20 http://localhost:7474/db/data/
NOTE: naturally the script takes a rather long time to load the data. Things will improve considerably when the Python REST client(s) will support batch operations.
The dataset we use for testing is a relatively small one: 113 persons (RUN nodes), 2163 proximity relations (EDGE nodes), 13956 frames of 20 seconds each (FRAME nodes), for a total duration of slightly more than 3 days. Production datasets are typically much larger than this, involving several hundreds persons, ~50-100,000 proximity relations, 70-100,000 frames.
In the following we provide a few sample Cypher queries performed on a real-world dynamic social graph from the SocioPatterns project, loaded into Neo4j as described above.
We will work with Neo4j 1.8+ (we make use of the "WITH" construct). We will assume that the top-level RUN node of the dynamic graph is node 1234. The reported timings for each query are obtained by running the query once, and then averaging the execution time over 3 additional runs of the same query.
- QUERY 1: get all time frames of run "HT2009" recorded between 9am and 1pm of July 1st 2009, ordered by timestamp [411 ms]
START root = node(0)
MATCH root-[:HAS_RUN]->run-[:HAS_TIMELINE]->()-[y:NEXT_LEVEL]->()-[m:NEXT_LEVEL]->()-[d:NEXT_LEVEL]->()-[h:NEXT_LEVEL]->()-[:TIMELINE_INSTANCE]->frame
WHERE run.name="HT2009" and y.year=2009 and m.month=7 and d.day=1 and h.hour>=9 and h.hour<13
RETURN frame ORDER BY frame.timestamp
- QUERY 2: get the names of all persons present in frame 1234 [3.3 ms for 80-100 returned rows]
START frame = node(1234)
MATCH frame-[:FRAME_TAG]-tag
RETURN tag.name
- QUERY 3: get the weighted proximity graph during frame 1234, filtering out the weak contacts [1.3 ms for 40-50 returned rows]
START frame=node(1234)
MATCH frame-[r:FRAME_EDGE]-edge
WHERE r.weight > 20
RETURN edge.tag1, edge.tag2, r.weight;
- QUERY 4 (slow): get a list of all the persons, and for each person get the number of frames it was present in [6100 ms , ~110 returned rows] [QUESTION 5: should we model the number of frames a tag is connected to as a property of TAG nodes?]
START run = node(2345)
MATCH run-[:RUN_TAG]->tag<-[r:FRAME_TAG]-()
RETURN tag.name, count(r)
- QUERY 5 (slow): get the names of all persons that were present in more than 1000 frames, ranked by time of presence [5500 ms, ~110 returned rows] [see QUESTION 5 above].
START run = node(2345)
MATCH run-[:RUN_TAG]->tag<-[r:FRAME_TAG]-()
WITH tag.name as name, COUNT(r) as freq
WHERE freq > 1000
RETURN name, freq ORDER BY freq DESC
Different way of writing it, similar performance:
START run = node(2345)
MATCH run-[:RUN_TAG]-tag
WITH tag
MATCH ()-[r:FRAME_TAG]-tag
WITH tag.name as name, COUNT(r) as freq
WHERE freq > 1000
RETURN name, freq ORDER BY freq DESC;
- QUERY 6 (slow): list all (distinct) days on which tag 1111 was present [180 ms on the sample dataset, but it scales very badly with the density of the static graphs associated to frames. For a 10-day dataset covering ~200 persons, frames have typically more than 20,000 relations and the execution time goes easily up to 300,000 ms !]
START tag = node(1111)
MATCH ()-[d:NEXT_LEVEL]->()-[:NEXT_LEVEL]->()-[:TIMELINE_INSTANCE]-()-[:FRAME_TAG]-tag
RETURN DISTINCT(d.day)
This runs surprisingly slow on larger datasets, considering that from each FRAME node there is a single path to the corresponding "day" node of the timeline index. Does this happen because of the high number of relations of the FRAME nodes (the inappropriately named "super-node problem") ?
- QUERY 7: return the names of all persons that were in proximity of user 8888, sorted alphabetically [3.3 ms]
START tag1 = node(8888)
MATCH tag1<-[:EDGE_TAG]-()-[:EDGE_TAG]->tag2
RETURN tag2.name ORDER BY tag2.name
- QUERY 8: return the names of all persons that were in proximity of user 8888 on the 7th [130 ms]
START tag1 = node(8888)
MATCH tag1<-[:EDGE_TAG]-edge-[:EDGE_TAG]->tag2
WITH edge, tag2
MATCH ()-[d:NEXT_LEVEL]->()-[:NEXT_LEVEL]->()-[:TIMELINE_INSTANCE]-()-[:FRAME_EDGE]-edge
WHERE d.day = 7
RETURN tag2.name
- QUERY 9: find all the common neighbors of user 1111 and 2222 [52 ms]
START tag1 = node(1111), tag2 = node(2222)
MATCH tag1<-[:EDGE_TAG]-()-[:EDGE_TAG]->tag
WITH COLLECT(tag) as neighs1, tag2
MATCH tag2<-[:EDGE_TAG]-()-[:EDGE_TAG]->tag
WHERE tag IN neighs1
RETURN tag
NOTE: need to use this form, as the straightforward query to find common neighbors, reported below, runs very slowly for the larger datasets mentioned above for QUERY 6.
START tag1 = node(1111), tag2 = node(2222)
MATCH tag1<-[:EDGE_TAG]-()-[:EDGE_TAG]->tag<-[:EDGE_TAG]-()-[:EDGE_TAG]->tag2
RETURN tag
- QUERY 10: compute the degree of all persons in the contact graph [84 ms]
START run = node(2345)
MATCH run-[:RUN_TAG]-tag-[r:EDGE_TAG]-()
RETURN tag.name, COUNT(r) ORDER BY COUNT(r) DESC
Most graph datasets from real-world systems such as social networks, on-line and off-line infrastructural networks, records of human-driven activity and a host of natural phenomena exhibit fat-tailed statistics for the node connectivity (see for example the review paper on Power laws, Pareto distributions and Zipf's law). This means that graph databases have to deal with the inevitable high heterogeneity of node connectivity, i.e., with the fact that a small fraction of nodes have connections to a large fraction of other nodes.
In our data there are two main sources of heterogeneity: the first is the topological heterogeneity of the social networks we measure, the second is the temporal heterogeneity in activity due to the bursty nature of human dynamics. These heterogeneities are reflected in the properties of the dynamic social graphs we want to model in Neo4j.
A word of warning: we do not claim that the simple experiments reported here reflect the general performance of Neo4j/Cypher in any specific application domain except our own.
-
Overall, Neo4j and Cypher achieve a perfectly satisfactory performance in querying the sample dataset we used here, suitable for exploratory data analysis, for interactive Web applications build on top of the Neo4j REST server, and for research-oriented data mining.
-
The bad performance of commonly used queries such as QUERY 4 and QUERY 5 is a reason for concern, especially considering that the dataset we used for testing is very small. However, in our case it is possible to solve this problem by pre-computing additional node attributes (number of frames a tag is connected to, see QUESTION 5 above).
-
The bad performance of QUERY 6 on scaling up the size/density of the dataset is definitely a problem, especially considering the simplicity of the query.
As stated at the beginning of this report, we want to start a discussion on best practices for modeling time-dependent graphs in Neo4j. We will be grateful for any comments, further testing by using our data, and in general any insights that the Neo4j team and community are willing to share.
We welcome comments and modifications to this wiki page.
- Ciro Cattuto, ISI Foundation
- André Panisson, ISI Foundation
- Marco Quaggiotto, Politecnico di Milano and ISI Foundation