-
Notifications
You must be signed in to change notification settings - Fork 61
Lucene Neo4J comparison
This document describes a comparison of a Neo4j implementation and a pure Lucene implementation of the HCI Memory Structure.
The test inserts a large amount of randomly generated LRUs into the Memory Structure index, followed by a very large amount of randomly generated links between those.
The test data is generated as follows: first, a random list of LRUs is generated. This happens by starting with the hardcoded scheme "http://" followed by the hardcoded 'host' ".fr". Then at least one random string is generated and added as host segment. From here, a number (between 0 and 3) of recursive method calls create 0 or more "extra" LRUs by adding to the number of host segments and/or adding path segments. The max number of path segments is restricted to 5.
This method of random generation ensures that in general, a LRU 'family' is created with several LRUs sharing the same host name and different subdomain segments and/or path segments; therefore the tree structure will be more realistic and wider branched.
This process is repeated a large number of times to create a large number of LRU 'families'.
The resulting set of LRUs is used to generate a set of links. For each LRU, a number (between 0 and 5) of links is generated using this LRU as source. For target, a random other LRU from the set is chosen. So the resulting set of links is consistent in that no source or target of any link is an unknown LRU.
The Lucene test doesn't use LRUs as input but URLs, but the method I've used to create test data for it is functionally equivalent.
The Lucene implementation (by Jérôme Thièvre) originally had separate runnable classes for the performance test of URL insertion and for the test of link insertion. To make it run in one go, I've joined the implementations of the two into a single runnable test. Also I've removed the maximum of 1 million link insertions as the generated test data contains a bit more than 1 million links.
The Lucene implementation reads the URLs from file. It transforms each URL into a LRU representation with max 5 path segments, and stores this in the Lucene index. If the index already contained this LRU, it's not stored again.
Then the Lucene test continues with the links. It reads them from file, parsing each line into source, target and weight. The source and target URLs are transformed into LRUs with max 5 path segments and added to the index. Because my input data is consistent (the links are generated between the generated URLs) no link source or target will not be already indexed during the URL indexing phase, so this doesn't increase the index. Then the links are added to the index as links (with source, target and weight).
The Lucene test keeps the index in RAM until the program closes, when it writes it to disk. The size on disk (as reported by my Windows OS) is ..
Throughout the test the Lucene implementation writes to screen, at every 10000 document insertions, how many documents per second are inserted.
The Neo4j implementation generates its test data when the test starts. For the LRU insertion test it expects a set of LRUs (not URLs), so no URL-to-LRU conversion is needed. It creates a graph database in tree-form, corresponding to the LRU-tree diagram. The nodes are LRU segments (scheme, host, or path) which are connected to each other through the relation IS_PARENT_OF. Every time a node is added to the tree, it is also indexed in Lucene.
Then the Neo4j test continues with the links. It retrieves the source and target nodes using a Lucene query, then creates a relation between them of type NODE_LINK. This relation is stored in the graph database as well as indexed in Lucene.
All Lucene interaction in the Neo4j test is done through Neo4j's index API.
Throughout the test the Neo4j implementation writes to screen, displaying what is the current size of the index and how many nodes are stored per second.
Because of the different nature of the two tests the results are not 1-on-1 comparable: the Lucene test stores whole LRUs in Lucene documents, whereas the Neo4j test replicates the tree structure and inserts a node for each segment of a LRU. The number of those depends of course on the (randomly generated) LRU, but is at least 3 (scheme, host, and .fr), but can be much higher (no limit on how many subdomain host segments were generated) and can have between 0 and 5 path segments. A fair estimate would be that on average, the insertion of a LRU in the Lucene tests corresponds to around 7-8 insertions of LRU segments in the Neo4j test.
The following diagrams depict the speed of inserting LRUs and links against index size in the Lucene test:
Although I did not measure it, occasional monitoring of CPU usage during the Lucene test showed a constant CPU usage around 85-90%.
The following diagrams depict the speed of inserting LRU segments and links against index size in the Lucene test:
TODO discussion of actual results (test still running now).
One thing that surprised me is that Jérôme Thièvre reported a throughput of his test with the pure Lucene implementation of 5000 insertions per second:
Les vitesses d'écriture/requêtage sont constantes, elles ne dépendent pas de la taille de la base. De mémoire cela tournait autour de 5000 écriture ou requêtes par secondes, ce qui est plus que satisfaisant.in my test, with the same code !, this started at 41 document insertions per second and after that kept dropping until I killed the test when it was at 17. This seems rather a dramatic difference considering the code is the same, so this is something to definitely follow up on. Maybe it's my laptop; can we do these tests on some buff Linux test server, or so ?
Even more important, is that in my repeat of Jérôme Thièvre's test the throughput does not at all remain steady when the index grows. It steadily drops. This calls for further investigation.
Performance aside, there are some pro and cons to either solution.
Both implementations rely on Lucene to provide the index for quick retrieval. In this respect there is little difference between the two solutions.
- you have total control over how you employ Lucene, because you're using its own API. This may be important for performance tweaks. In Neo4j you can only do so much to alter the configuration of its underlying Lucene index. None of the configuration options seem very pertinent to the HCI Memory Structure as you're just using the index to look up fixed values (e.g. types like 'url' or 'urlLink', and non-free-text values such as LRUs). In contrast, in a pure Lucene solution you have total control and can do tweaks like which may help the performance.
setOmitTermFreqAndPositions(true);
- although not really measured during my tests, it was apparent that memory use remains at a stable level throughout.
- Lucene is a very well-known library with an enormous user base and very active mailing list.
- writing to Neo4j is transactional. This includes writing to the underlying Lucene index. If something goes wrong, the graph database and Lucene index are left in a consistent state.
- the use of a Lucene index makes retrieval just as fast as with a pure Lucene implementation
- Neo4j offers a high availability version, Neo4j Enterprise that can be scaled horizontally. With pure Lucene it's not at all obvious how to go about horizontal scaling. The Neo4j scaling solution however (unlike the 'community edition') is NOT Free and Open Source Software.
- the tree structure of the LRUs and the graph structure of the links between them is actually preserved in the persistence layer. This can lead to cleaner-looking code. See for example the code to retrieve the incoming links to all LRUs in the Web Entity denoted by LRU s:http://|h:com|h:blogspot :
// Neo4j code example
// lookup node
String luceneQuery = LRUPropertyKey + ":" + QueryParser.escape("s:http://|h:com|h:blogspot");
IndexHits<Node> nodes = nodeIndex.query(luceneQuery);
for(Iterator<Node> nit = nodes.iterator(); nit.hasNext();) {
Node blogspot = nit.next();
// retrieve children of blogspot (i.e. subdomains):
Iterable<Relationship> blogspotSubdomains = blogspot.getRelationships(RelTypes.IS_PARENT_OF, Direction.OUTGOING);
for(Iterator<Relationship> subdomainIterator = blogspotSubdomains .iterator(); subdomainIterator.hasNext();) {
Relationship r = subdomainIterator.next();
Node subdomain = r.getEndNode();
// retrieve incoming links to this subdomain
Iterable<Relationship> incomingLinks = subdomain.getRelationships(RelTypes.NODE_LINK, Direction.INCOMING);
for(Iterator<Relationship> linksIterator = incomingLinks .iterator(); linksIterator.hasNext();) {
System.out.println("incoming link to blogspot subdomain " + subdomain.getProperty(LRUPropertyKey) + " from " + linksIterator.next().getStartNode());
}
}
break;
}
It is kind of cool to be able to program with relations as first-class citizens. Traversing them in either direction is easy, although certainly not lightning fast if you want to traverse a complete LRU tree with millions of nodes.
In the light of the above results I'd recommend to choose TODO