Indexing in Neo4j

In its earliest builds, Neo4j had no support for indexing and was a simple property graph. However, as datasets grew, traversing the entire graph for even the smallest query became inconvenient and error-prone, so a way to efficiently locate the starting point of a query in the graph was needed. This need led to the introduction of manual and then automatic indexing. Current versions of Neo4j have extensive support for indexing as part of their fundamental graph schema.

Manual and automatic indexing

Manual indexing was introduced in the early versions of Neo4j and was carried out through the Java API. Automatic indexing was introduced in Neo4j 1.4. Under the hood, it is a manual index with a fixed name (node_auto_index or relationship_auto_index) combined with a TransactionEventHandler that mirrors property changes into the index for the configured property names. Automatic indexing is typically set up in neo4j.properties. This technique removes much of the burden of manually mirroring changes to the indexed properties, and it permits Cypher statements to alter the index implicitly.

Every manual index is bound to a unique name specified by the user and can be associated with either nodes or relationships. The default indexing service in Neo4j is provided by Lucene, an Apache project designed for high-performance text search. The component in Neo4j that provides this service is known as neo4j-lucene-index and comes packaged with the default distribution of Neo4j. You can browse its features and properties at http://repo1.maven.org/maven2/org/neo4j/neo4j-lucene-index/. We will look at some basic indexing operations through the Java API of Neo4j.

An index is created through the IndexManager, which is obtained from the GraphDatabaseService object. For a graph with games and players as nodes and playing or waiting as the relationships, the indexes can be set up as follows:

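import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.graphdb.index.RelationshipIndex;

// Obtain the index manager from the GraphDatabaseService instance (graphDb)
IndexManager idx = graphDb.index();
// Get the node indexes named "players" and "games", creating them if they do not exist
Index<Node> players = idx.forNodes( "players" );
Index<Node> games = idx.forNodes( "games" );
// Get (or create) the relationship index named "played"
RelationshipIndex played = idx.forRelationships( "played" );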

For an existing graph, you can check whether a node index with a given name already exists:

IndexManager idx = graphDb.index();
boolean hasIndexing = idx.existsForNodes( "players" );

To add an entity to an index, we use the add(entity, key, value) method of the index, and to remove the entity completely from the index, we use the remove(entity) method. In general, an existing index cannot be modified on the fly. When you need to change an index, you have to get rid of the current index and create a new one. Once entities have been added, they can be looked up by key and value:

IndexHits<Node> result = players.get( "name", "Ronaldo" );
Node data = result.getSingle();

The preceding lines retrieve the nodes associated with a particular index entry. In this case, get() returns an iterator over all nodes in the players index whose name is Ronaldo, and getSingle() returns the only match, or null if there is none. Indexes in Neo4j are useful for optimizing queries. Several visual wrappers have been developed to view the index parameters and monitor their performance. One such tool is Luke, which you can view at https://code.google.com/p/luke/.

Once you have told Neo4j to auto-index relationships and nodes, you might expect to be able to start searching nodes straightaway, but this is not the case. Simply switching on auto-indexing doesn't cause anything to actually happen. Some people find that counterintuitive and expect Neo4j to start indexing all node and relationship properties immediately. In larger datasets, indexing everything might not be practical, since you potentially increase storage requirements by a factor of two or more when every value is stored in the Neo4j store as well as in the index. There is also a performance overhead on every mutating operation from the extra work of maintaining the index. Hence, Neo4j takes a more selective approach to indexing: even with auto-indexing turned on, Neo4j will only maintain an index of the node or relationship properties it is told to index. The strategy is simple and relies on the key of the property. When the database is configured programmatically through the config map, this means adding two properties, one for nodes and one for relationships, each containing a list of key names to index.
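
The selective configuration described above can be sketched as a plain configuration map. The four keys used here (node_auto_indexing, node_keys_indexable, relationship_auto_indexing, and relationship_keys_indexable) are the legacy auto-indexing settings as they would also appear in neo4j.properties. The class name AutoIndexConfig, the build helper, and the property keys name_of_player, country, and since are hypothetical and chosen only for illustration. How the map is handed to the database depends on the Neo4j version in use and is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the settings that switch on legacy auto-indexing.
final class AutoIndexConfig {

    // nodeKeys and relationshipKeys are comma-separated lists of property keys.
    static Map<String, String> build(String nodeKeys, String relationshipKeys) {
        Map<String, String> config = new LinkedHashMap<>();
        // Switch auto-indexing on for nodes and name the property keys to index.
        config.put("node_auto_indexing", "true");
        config.put("node_keys_indexable", nodeKeys);
        // The same pair of settings for relationships.
        config.put("relationship_auto_indexing", "true");
        config.put("relationship_keys_indexable", relationshipKeys);
        return config;
    }

    public static void main(String[] args) {
        // Assumed example keys; properties that are not listed stay unindexed.
        Map<String, String> config = build("name_of_player,country", "since");
        config.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The printed key=value lines have the same shape as the corresponding entries in neo4j.properties.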

Schema-based indexing

Since Neo4j 2.0, there is extended support for indexing graph data based on the labels assigned to nodes. Labels are a way of grouping one or more nodes under a single name (relationships are grouped by their type rather than by labels). Schema indexing refers to the process of automatically indexing labeled nodes based on one or more of their properties. Cypher integrates well with these new features to locate the starting point of a query easily.

To create a schema-based index on the name_of_player property for all nodes with the Player label, you can use the following Cypher query:

CREATE INDEX ON :Player(name_of_player)

When you run a query on a large graph, you can compare the trace of the path that Neo4j follows to reach the starting node of the query with and without the index in place. This can be done by sending the query to the Neo4j endpoint on your database machine in a curl request with the profile flag set to true so that the trace is displayed.

curl "http://localhost:7474/db/data/cypher?profile=true" -H "Accept: application/json" -H "Content-type: application/json" -X POST --data '{"query" : "MATCH (pl:Player) WHERE pl.name_of_player = \"Ronaldo\" RETURN pl.name_of_player, pl.country"}'

The result that is returned from this will be in the form of a JSON object with a record of how the execution of the query took place along with the _db_hits parameter that tells us how many entities in the graph were encountered in the process.

Query performance improves only if the properties most used in your queries are all indexed. An index on one property is of little use if matching on another property still requires traversing all nodes. You can aggregate the properties used in searches into a single property and index it separately for improved performance. Also, when multiple properties are indexed and you want the index on a particular property to be used, you can say so with a hint of the form USING INDEX p:Player(name_of_player). With schema indexes, however, we no longer have to specify the use of an index explicitly: if an index exists, it will be used to process the query; otherwise, the whole domain is scanned. Constraints can be used with a similar intent to schema indexes. For example, the following query asserts that the name_of_player property of the nodes labeled Player is unique:

CREATE CONSTRAINT ON (pl:Player) ASSERT pl.name_of_player IS UNIQUE

Currently, schema indexes do not support indexing multiple properties of a label under the same index. You can, however, create multiple indexes on different properties of nodes under the same label.
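
One way to work within this limitation is the aggregation suggested earlier: combine the properties that are searched together into a single derived property and index that property. The following is a minimal sketch of building such a value in application code. The class name CompositeKey, the derived property name player_key, the "|" separator (assumed never to occur inside the values), and the sample values are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: joins several property values into one indexable value.
final class CompositeKey {
    // Assumption: this separator never appears inside a property value.
    static final String SEPARATOR = "|";

    static String of(String... parts) {
        for (String part : parts) {
            if (part.contains(SEPARATOR)) {
                throw new IllegalArgumentException("value contains separator: " + part);
            }
        }
        return String.join(SEPARATOR, parts);
    }

    public static void main(String[] args) {
        // Property map for a Player node, with the derived key stored alongside
        // the properties it was built from.
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("name_of_player", "Ronaldo");
        properties.put("country", "Portugal");
        properties.put("player_key", of("Ronaldo", "Portugal"));
        System.out.println(properties.get("player_key")); // prints Ronaldo|Portugal
    }
}
```

The derived property could then be indexed on its own, for example with CREATE INDEX ON :Player(player_key). The application has to recompute it whenever one of the source properties changes.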

Indexing takes up space in the database, so when you feel an index is no longer needed, it is good to relieve the database of the burden of such indexes. You can remove the index from all labeled nodes using the DROP INDEX clause:

DROP INDEX ON :Player(name_of_player)

Using schema indexes is much simpler than manual or auto-indexing, and it gives an equally efficient performance boost to transactions and operations.

Indexing benefits and trade-offs

Indexing does not come for free. Since the underlying application code is responsible for the management and use of indexes, the strategy that is followed should be thought over carefully. Inappropriate decisions or flaws in indexing result in decreased performance or unnecessary use of disk storage space.

High on the list of trade-offs for indexing is the fact that an index uses storage: the more entities are indexed, the greater the disk usage. Creating indexes for the data amounts to building small lookup maps or tables that allow rapid access to the data in the graph. So, for write operations that create or update a node, the data is written twice: once to the node itself and once to the index mapping, which stores a pointer to the node.

Moreover, as the number of indexes grows, insertions and updates take considerably longer, since nearly as many operations go into maintaining the indexes as into creating or updating the entities themselves. Every update or insert now also has to modify the index entries for that entity, and if you profile your queries, you will observe that inserting a node with indexes takes roughly twice as long as inserting it without them.

On the other hand, the benefit of indexing is that query performance is considerably improved since large sections of the graph are eliminated from the search domain.
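
The trade-off described in this section can be illustrated with a toy in-memory model. This is not Neo4j code and says nothing about Neo4j's actual storage or timings. The class ToyIndexedStore, its counters, and the player names are hypothetical and serve only to show the extra write per indexed insert and the smaller search domain on lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: a list of node names plus one optional lookup map (the "index").
final class ToyIndexedStore {
    private final List<String> names = new ArrayList<>();                  // position = node id
    private final Map<String, List<Integer>> nameIndex = new HashMap<>();  // name -> node ids
    private final boolean indexed;
    private int writes;   // store writes plus index writes so far
    private int visited;  // nodes examined by the most recent lookup

    ToyIndexedStore(boolean indexed) {
        this.indexed = indexed;
    }

    int create(String name) {
        names.add(name);                       // first write: the node itself
        writes++;
        int id = names.size() - 1;
        if (indexed) {                         // second write: index entry pointing at the node
            nameIndex.computeIfAbsent(name, k -> new ArrayList<>()).add(id);
            writes++;
        }
        return id;
    }

    List<Integer> findByName(String name) {
        if (indexed) {
            List<Integer> hits = nameIndex.getOrDefault(name, new ArrayList<>());
            visited = hits.size();             // only the matching nodes are touched
            return hits;
        }
        List<Integer> hits = new ArrayList<>();
        visited = names.size();                // a full scan touches every node
        for (int id = 0; id < names.size(); id++) {
            if (names.get(id).equals(name)) {
                hits.add(id);
            }
        }
        return hits;
    }

    int writes() { return writes; }

    int visited() { return visited; }

    public static void main(String[] args) {
        for (boolean indexed : new boolean[] {false, true}) {
            ToyIndexedStore store = new ToyIndexedStore(indexed);
            for (int i = 0; i < 1000; i++) {
                store.create("player" + i);
            }
            store.findByName("player500");
            System.out.println("indexed=" + indexed
                    + " writes=" + store.writes()
                    + " visited=" + store.visited());
        }
    }
}
```

In this model, the indexed store performs 2000 writes for 1000 inserts but examines a single node on lookup, while the unindexed store performs 1000 writes and examines all 1000 nodes.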

Note that storing the Neo4j-generated IDs externally in order to enable fast lookup is not a good practice, since the IDs are subject to change; for example, Neo4j can reuse the ID of a deleted entity. The IDs of nodes and relationships are an internal representation, and using them explicitly might lead to broken processes.

Therefore, the right indexing strategy differs from one application to another. Applications that create or update data more often than they read it should index sparingly, whereas applications dealing primarily with reads should use indexes generously to optimize performance.
