We have looked at Cypher syntax, how to load data into a graph, and retrieving data and APOC in the previous chapters. In this chapter, we will discuss the best practices to get the most out of Cypher queries, including how to leverage data modeling and patterns by looking under the hood to understand how Neo4j stores data. We will also discuss the tips and tricks to identify performance bottlenecks and how to go about addressing them. We will take a look at the following topics:
We will also take a look at a few tips and tricks for good query patterns and identifying issues and what to look out for when we are building queries.
First, we will look at understanding the internals of Neo4j.
Having a good understanding of how Neo4j stores data can help us to build better queries. There are a few files that are important to understand how Neo4j stores data.
They are given in the following list:
Let’s take a look at these store sizes for a database.
Figure 12.1 – Database store files
We can see from the screenshot that the store files are highlighted. These files have the greatest impact on memory and performance requirements. Understanding how Neo4j stores and retrieves data helps us to write more efficient queries. We cover this here rather than in earlier chapters because the knowledge is most useful when combined with the query-processing tips that follow, particularly if you have already been working with Cypher queries and want to improve your skills.
Let’s now look at node store structure.
A node store is a single file that stores all the nodes in a graph. Each node is a fixed-size data structure, as shown in the following table.
ID | Labels | Props | Rels
01 | :Patient | 01 | 01
Table 12.2 – Node data structure
From the preceding table, we can see that the node data structure contains the node ID. This is an index in the node store file (nodestore.db) that we looked at in the previous section. It has labels associated with it. We can see in the table that this node has a label, Patient. The Props section contains the first property of the node. If there are other properties, they are stored as a linked list, with the first property being stored in the node. The Rels section points to the first relationship this node has. If the number of relationships starts becoming dense, such as more than 100, then this can point toward a relationship group store. The relationships also get connected as a linked list. When it is stored as a group, it will have one entry per relationship type, with IN and OUT directions separated. The labels stored are not the actual strings but, instead, ID numbers associated with the label name, and the corresponding ID is stored in the node data structure.
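The benefit of fixed-size records can be sketched in a few lines of Python. This is a minimal illustration, not Neo4j's actual on-disk format: the three 4-byte fields, the NODE_RECORD layout, and the LABEL_IDS mapping are all assumptions made for the example. It shows why the node ID can act as an index into the store file and why labels are stored as numbers rather than strings.

```python
import struct

# Hypothetical layout for illustration only: three 4-byte unsigned integers
# (label ID, first property ID, first relationship ID). Neo4j's real node
# record uses a different, more compact encoding.
NODE_RECORD = struct.Struct("<III")

# Label names are stored once elsewhere; a record holds only the numeric ID.
LABEL_IDS = {"Patient": 1}


def write_node(store: bytearray, node_id: int, label: str,
               first_prop: int, first_rel: int) -> None:
    """Place the record at offset node_id * record size, growing the store if needed."""
    offset = node_id * NODE_RECORD.size
    if len(store) < offset + NODE_RECORD.size:
        store.extend(bytes(offset + NODE_RECORD.size - len(store)))
    NODE_RECORD.pack_into(store, offset, LABEL_IDS[label], first_prop, first_rel)


def read_node(store: bytearray, node_id: int) -> tuple:
    """Because every record has the same size, lookup by ID is one offset calculation."""
    return NODE_RECORD.unpack_from(store, node_id * NODE_RECORD.size)


store = bytearray()
write_node(store, 1, "Patient", first_prop=1, first_rel=1)
print(read_node(store, 1))  # (1, 1, 1)
```

The node ID never appears inside the record in this sketch; the position in the store is the ID.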
Now, let’s look at the relationship store.
The relationship data structure is a bit bigger than a node. It can be seen as follows:
ID | Type | Source node | Target node | Props | Source previous | Source next | Target previous | Target next
01 | HAS_ENCOUNTER | 01 | 03 | 09 | 6 | 02 | 4 | 03
Table 12.3 – Relationship data structure
The ID value is the index in the relationship store file (relationshipstore.db) that we looked at in the previous section. The record has exactly one type, stored as the ID number associated with the type name. It contains both the Source node and Target node IDs. The relationships of each node are connected as a doubly linked list, which makes it efficient to traverse in either direction. The Props entry contains the ID of the first property associated with this relationship. When this relationship is the first in the chain for the source node, the Source previous entry contains the number of relationships in that chain. Similarly, Target previous contains the number of relationships in the chain for the target node if this is the first relationship in that chain. This lets us quickly find out how many relationships a node has. If the relationship is not the first in the chain, these entries contain the ID of the previous relationship in the chain.
The Source next entry contains the next relationship for the source node in the chain. The same applies to the Target next entries. This should make it clear that once we have obtained the relationship, we can traverse back and forth using the previous and next entries, depending on the source or target node context.
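The chain traversal described above can be sketched in Python. This is a simplified model under stated assumptions: the RelRecord class and its field names are hypothetical, the previous pointers are left out for brevity, and the HAS_DRUG type and the node and relationship IDs are invented sample data. The point it illustrates is that one relationship record sits in two chains at once, and the reader picks the next pointer according to whether the node is the source or the target.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass
class RelRecord:
    rel_type: str
    source: int
    target: int
    source_next: Optional[int] = None  # next relationship in the source node's chain
    target_next: Optional[int] = None  # next relationship in the target node's chain


def relationships_of(node_id: int, first_rel: Optional[int],
                     rels: Dict[int, RelRecord]) -> Iterator[int]:
    """Walk a node's relationship chain, choosing the pointer that matches the node's role."""
    rel_id = first_rel
    while rel_id is not None:
        yield rel_id
        rec = rels[rel_id]
        rel_id = rec.source_next if rec.source == node_id else rec.target_next


# Hypothetical sample data: node 1 is the source of relationships 1 and 2;
# node 3 is the target of relationship 1 and the source of relationship 3.
rels = {
    1: RelRecord("HAS_ENCOUNTER", source=1, target=3, source_next=2, target_next=3),
    2: RelRecord("HAS_ENCOUNTER", source=1, target=4),
    3: RelRecord("HAS_DRUG", source=3, target=5),
}

print(list(relationships_of(1, 1, rels)))  # [1, 2]
print(list(relationships_of(3, 1, rels)))  # [1, 3]
```

Relationship 1 appears in both results because it belongs to the chain of node 1 and to the chain of node 3.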
Next, let us take a look at the property store.
The property store contains the property values for both nodes and relationships. It is split into three different stores so that each store can keep a fixed-size data structure.
Let us look at the basic property data structure:
ID | NextProp | PrevProp | Payload
01 | 02 | -1 | name, String, “Jon”
Table 12.4 – Property data structure
Similar to the node and relationship stores, ID is the index in the property store file (propertystore.db). The record contains pointers to the next and previous properties. The Payload section contains the ID of the property name, the ID of the type, and the actual value. This record can hold all the basic property types except lists, which are stored in the property arrays store (propertystore.db.arrays).
Another exception is a string value that does not fit into this data structure; it is stored in the strings property store (propertystore.db.strings). In both cases, the last value stored in the payload points to the ID of the data structure in the strings property store or arrays property store. The strings and arrays property stores also use fixed-size data structures. If a value does not fit in one data structure, it is split into a chain of data structures. As a result, storing or reading a large string costs more.
Note
As you can see from the previous paragraph, storing and retrieving large strings or arrays can be costly, as we may need to traverse multiple chain blocks to retrieve data. Therefore, we need to be careful about how and what we are storing and how we are retrieving the data.
You can read about data structure sizes here: https://neo4j.com/developer/kb/understanding-data-on-disk/.
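To make the cost of long values concrete, here is a small Python sketch of a chained block store. It models only the idea of the strings property store: the 8-byte BLOCK_PAYLOAD, the dictionary-based store, and the function names are assumptions chosen for the example, and real block sizes differ. Inline storage of short values in the property record itself is not modeled.

```python
from typing import Dict, List, Optional, Tuple

BLOCK_PAYLOAD = 8  # hypothetical payload bytes per block; real block sizes differ

# block ID -> (payload bytes, ID of the next block or None)
StringStore = Dict[int, Tuple[bytes, Optional[int]]]


def store_string(store: StringStore, value: str) -> int:
    """Split the value into fixed-size blocks chained by next pointers; return the first block ID."""
    data = value.encode("utf-8")
    chunks = [data[i:i + BLOCK_PAYLOAD] for i in range(0, len(data), BLOCK_PAYLOAD)] or [b""]
    first_id = len(store)
    for i, chunk in enumerate(chunks):
        next_id = first_id + i + 1 if i + 1 < len(chunks) else None
        store[first_id + i] = (chunk, next_id)
    return first_id


def load_string(store: StringStore, first_id: int) -> Tuple[str, int]:
    """Follow the chain and return the value together with the number of blocks read."""
    parts: List[bytes] = []
    block_id: Optional[int] = first_id
    reads = 0
    while block_id is not None:
        chunk, block_id = store[block_id]
        parts.append(chunk)
        reads += 1
    return b"".join(parts).decode("utf-8"), reads


store: StringStore = {}
short_id = store_string(store, "Jon")
long_id = store_string(store, "x" * 50)

print(load_string(store, short_id))     # ('Jon', 1)
print(load_string(store, long_id)[1])   # 7
```

With these assumed sizes, the 50-character value needs seven block reads where the short value needs one, which is the effect the note above warns about.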
Next, we will take a look at how Neo4j uses memory to execute queries.
Neo4j splits the memory into three different segments, given as follows:
When a query is executed, a transaction state is initiated in heap memory, and it goes to the page cache to retrieve the data. The query execution always goes to the page cache to get the nodes, relationships, and properties. If a node, relationship, or property is not found in the page cache, it causes a page fault that will load the required entity from the corresponding store. Since the page cache loads not just one data structure but a single page block around 4 KB, it may be reading more data than required.
You can read more about these aspects at https://www.graphable.ai/blog/neo4j-performance/ and https://maxdemarzi.com/2012/08/13/neo4j-internals/.
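The page fault behavior described above can be sketched as follows. This is a toy model, not Neo4j's page cache: the PageCache class is hypothetical, RECORD_SIZE is an assumed value, PAGE_SIZE simply reuses the figure quoted in the text, and eviction is not modeled. It shows that records stored close together share a page, so reading one of them brings its neighbors into memory as well.

```python
from typing import Set

PAGE_SIZE = 4096   # page size as quoted in the text; treat as illustrative
RECORD_SIZE = 16   # hypothetical fixed record size


class PageCache:
    """Tracks which pages are in memory and counts page faults (no eviction)."""

    def __init__(self) -> None:
        self.loaded: Set[int] = set()
        self.faults = 0

    def read_record(self, record_id: int) -> int:
        """Return the page holding the record, loading the whole page on a miss."""
        page = (record_id * RECORD_SIZE) // PAGE_SIZE
        if page not in self.loaded:
            self.faults += 1  # page fault: the full page is fetched from the store file
            self.loaded.add(page)
        return page


cache = PageCache()
for record_id in (0, 1, 2, 300):
    cache.read_record(record_id)

print(cache.faults)  # 2
```

Records 0, 1, and 2 share the first page, so only the first read faults; record 300 lies on the next page and causes the second fault.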
Now that we have taken a brief look at how Neo4j stores data, let’s look at some query patterns and review what a good query and a bad query are.
Let’s review a few query patterns and see how a database will try to execute them to understand how to write optimal queries:
MATCH (Mango {color: 'Yellow'}) return count(*)
Here, we forgot to add : before Mango, so even if Mango is a valid label, the database treats it as a variable name. We read the query as return the count of yellow mangoes. Say we have mangoes, oranges, and apples in our cart. Because the pattern carries no label, the database goes through all the fruits, checks each one for a color property with the value Yellow, and returns the count of the matches.
Tip
Look for Neo4j Browser warnings before running the queries. Neo4j Browser is very good at highlighting these kinds of errors.
MATCH (p:Patient)--(e:Encounter) RETURN p,e
There are two issues here. First, no direction is provided for the relationship. Second, there is no relationship type provided. Let’s see whether we can explain, in layman’s terms, the work the database needs to do here. Let’s say there is a junction with four incoming roads and three outgoing roads. There is a post office one block away. If someone asks us to find the post office one block away, since we don’t know where it is, we have to traverse all the roads by one block to find the post office. That’s what the database is doing here. It is traversing one hop in all directions and giving you data. Let’s see how this changes if a direction was provided. If we are looking for the post office one block away on the outgoing roads, we would be looking at only three roads instead of seven roads. This is the same with the database. Let’s see how this changes if there was more information provided. Out of the three roads, let’s say two are going east and one is going west. If we are asked to find a post office one block away on eastward outgoing roads, then we will be looking at only two roads instead of three. Similarly, if we provide the relationship type, then the amount of work the database does to retrieve data is also less.
Tip
Look at the query profile to understand the amount of work the database is doing. We should always specify the direction and relationship type to get better-performing queries.
MATCH path=(:Patient)-[*]-(:Drug) RETURN path
Again, if we use the layman example from before, this is like trying to find a post office from a given point. This means we have to traverse in all directions and find all the paths that connect this point and a post office. Say you are at a point in a city and traversing all the roads in all directions to find all the paths; you can imagine how much time and effort it would take. Almost the same thing is happening here with the database. These kinds of queries should be avoided at all costs. This query should have used relationship types, a direction, and a limit on the path length to be able to respond in time.
Tip
You can prevent out-of-memory exceptions, and the outages they cause, by using the dbms.memory.transaction.global_max_size configuration. This setting makes sure that all queries together do not use more memory than this value. It should be a bit smaller than the configured maximum heap size. You can also use dbms.memory.transaction.max_size to make sure a given query does not use more than that amount of memory. A query that tries to use more memory is terminated.
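The junction analogy used for the second pattern above can be turned into a small counting exercise. The Python sketch below is only an illustration of the reasoning, not of how the Cypher runtime is implemented: the RELS list and the EAST and WEST type names are hypothetical and mirror the four incoming and three outgoing roads from the example.

```python
from typing import List, Optional, Tuple

# Hypothetical relationships attached to one node, as (type, direction) pairs,
# mirroring the junction with four incoming and three outgoing roads.
RELS: List[Tuple[str, str]] = [
    ("WEST", "IN"), ("WEST", "IN"), ("EAST", "IN"), ("EAST", "IN"),
    ("EAST", "OUT"), ("EAST", "OUT"), ("WEST", "OUT"),
]


def candidates(direction: Optional[str] = None, rel_type: Optional[str] = None) -> int:
    """Count the relationships that must be expanded for a given pattern."""
    return sum(
        1
        for t, d in RELS
        if (direction is None or d == direction) and (rel_type is None or t == rel_type)
    )


print(candidates())               # 7: no direction, no type
print(candidates("OUT"))          # 3: direction only
print(candidates("OUT", "EAST"))  # 2: direction and type
```

Each piece of information added to the pattern shrinks the set of relationships that has to be followed, and the saving repeats at every hop of a longer path.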
Cypher queries can be described as anchor and traverse. Anchoring means finding a node from which we start our traversal of the graph. Traversing from an anchor is the strongest feature of Neo4j.
In general, the following rules help to write better queries.
Next, we will take a look at troubleshooting a few common issues.
When you are troubleshooting, logs are your friends. You must take a look at the query log and debug log files to identify any issues. Please note that query logs are not available in the Community Edition of Neo4j. If you are using Neo4j Desktop to create a database and test it, then you are using a single-user enterprise license, so you will have access to query logs. Here, we will take a look at what information is available for troubleshooting issues and how we can fix them:
Let’s take a look at a few of the common issues faced during query execution and how to address them:
The best way to resolve this issue is to change the query to avoid any EAGER steps. Please read this article about it: https://medium.com/neo4j/cypher-sleuthing-the-eager-operator-84a64d91a452. Another option is to use client drivers and avoid using LOAD CSV in production environments. This should be your approach for stable applications.
MERGE (p:Person {id:1})-[:LIVES_AT]->(:Address {id:"A1"})
Here, even if the Person and Address nodes exist, if there is not a relationship between them, then MERGE will try to create the whole path, which means creating the Person and Address nodes again.
MERGE (p:Person {id:1})
MERGE (a:Address {id:"A1"})
MERGE (p)-[:LIVES_AT]->(a)
This query is doing MERGE on the nodes first and separates the relationship creation using another MERGE statement. Since they are separated, this query will not attempt to create duplicate nodes.
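The difference between the two forms can be modeled with a toy graph in Python. This is a deliberately simplified sketch of the semantics described above, not of how Neo4j implements MERGE: the Graph class and its methods are hypothetical, and nodes are reduced to a (label, id) pair.

```python
from typing import List, Set, Tuple

Node = Tuple[str, object]  # (label, id property)


class Graph:
    def __init__(self) -> None:
        self.nodes: List[Node] = []                   # a list, so duplicates are possible
        self.rels: Set[Tuple[int, str, int]] = set()  # (source index, type, target index)

    def merge_node(self, label: str, key: object) -> int:
        """MERGE on a single node: reuse a match, otherwise create it."""
        node = (label, key)
        if node in self.nodes:
            return self.nodes.index(node)
        self.nodes.append(node)
        return len(self.nodes) - 1

    def merge_rel(self, src: int, rel_type: str, dst: int) -> None:
        """MERGE on a relationship between two nodes that are already bound."""
        self.rels.add((src, rel_type, dst))

    def merge_path(self, src: Node, rel_type: str, dst: Node) -> None:
        """MERGE on a whole pattern: either all of it matches or all of it is created."""
        for s, t, d in self.rels:
            if t == rel_type and self.nodes[s] == src and self.nodes[d] == dst:
                return
        self.nodes.extend([src, dst])
        self.rels.add((len(self.nodes) - 2, rel_type, len(self.nodes) - 1))


# Both nodes exist but are not connected; merging the whole path duplicates them.
g1 = Graph()
g1.merge_node("Person", 1)
g1.merge_node("Address", "A1")
g1.merge_path(("Person", 1), "LIVES_AT", ("Address", "A1"))
print(len(g1.nodes))  # 4

# Merging the nodes first and the relationship afterwards keeps them unique.
g2 = Graph()
p = g2.merge_node("Person", 1)
a = g2.merge_node("Address", "A1")
g2.merge_rel(p, "LIVES_AT", a)
print(len(g2.nodes))  # 2
```

Running the separated form a second time leaves the toy graph unchanged, which is the behavior we want from an idempotent load.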
Note
This option is only available in Enterprise mode. Also, from 5.0 onward, the slotted runtime is the default option in Community Edition, and interpreted runtime is not available. You can read more about these options at https://neo4j.com/docs/cypher-manual/4.4/query-tuning/how-do-i-profile-a-query/.
Next, let’s take a look at what’s new in Cypher 5.0.
In version 5.0, multiple changes have been made to the Cypher language. You can read about all these changes at https://neo4j.com/docs/cypher-manual/current/deprecations-additions-removals-compatibility/. We will take a look at a few of the changes that impact Cypher queries.
The first important change to note is index creation. In version 5.0, the indexes are separated to represent the different types so that the indexes can be more performant. The indexing types that are available in version 5.0 are listed here:
You can read more about the new index types at https://neo4j.com/docs/cypher-manual/current/indexes-for-search-performance/.
Another change that could be important for developers applies to label filtering and relation type filtering, along with the WHERE clause. It is possible to use logical predicates for node label and relationship types.
Here’s a sample query that demonstrates this aspect:
MATCH (n: A&(B|C)&!D|E) RETURN n
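Because ! binds more tightly than &, which in turn binds more tightly than |, this expression reads as (A&(B|C)&!D)|E. The query returns every node that has the label E, as well as every node that has the label A, at least one of the labels B or C, and not the label D.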
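To make the grouping in this label expression explicit, the same predicate can be written as a Python function over a set of labels. This is only a sketch for checking your reading of an expression: the matches function is hypothetical, and it relies on Python's not, and, or precedence lining up with the !, &, | precedence of label expressions.

```python
from typing import Set


def matches(labels: Set[str]) -> bool:
    """Python equivalent of the label expression A&(B|C)&!D|E.

    Python evaluates `not` before `and` before `or`, which mirrors
    the ordering of !, & and | in the label expression.
    """
    has = labels.__contains__
    return has("A") and (has("B") or has("C")) and not has("D") or has("E")


print(matches({"A", "B"}))       # True
print(matches({"A", "B", "D"}))  # False
print(matches({"D", "E"}))       # True
print(matches({"A"}))            # False
```

A node labeled both D and E still matches, because E alone satisfies the right-hand side of the top-level |.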
It is also possible to use the same syntax for relationships.
Here’s a sample query that demonstrates this aspect:
MATCH p=()-[: A&(B|C)&!D|E]->() RETURN p
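The logic here is very similar to how node label filtering works. Keep in mind that a relationship has exactly one type, so a conjunction of two different types, such as A&B, can never match. In this particular expression, only relationships of type E can match.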
Another change that is interesting is that we can use the WHERE clause in line with node labels and relationship types.
Here’s an example of inline usage with nodes:
MATCH (a:Person WHERE a.name = 'Rob') -[:KNOWS]->(b:Person WHERE b.age > 25) RETURN b.name
We can see the WHERE clause is used in line with nodes.
The same syntax also applies to relationships:
MATCH (a:Person {name: 'Tom'}) RETURN [(a) -[r:KNOWS WHERE r.since < 2020]->(b:Person) | r.since] AS years
We can see that we are able to use the WHERE clause in line with the relationship that is inside a list comprehension. This new syntax can help us build complex queries with ease.
You can read more about this syntax and examples at https://neo4j.com/docs/cypher-manual/current/clauses/where/.
One last change affects how users add settings to the neo4j.conf file: the introduction of a new configuration parameter called server.config.strict_validation.enabled, which is set to true by default. With this setting enabled, the database instance fails to start if neo4j.conf contains settings in namespaces that are not part of the core database configuration, such as apoc, or if a setting is repeated multiple times. This is more of a security feature. As a result, adding a new setting such as apoc.import.file.enabled=true to neo4j.conf would cause a problem.
There are two options available when you run into this kind of scenario:
server.config.strict_validation.enabled=false
A sample apoc.conf configuration would look like this:
apoc.import.file.enabled=true
apoc.import.file.use_neo4j_config=false
Please note that in the new configuration file, you need to add the apoc.import.file.use_neo4j_config=false config for the apoc plugin to use this new configuration file. If not, it looks for the configuration in the neo4j.conf file.
Now, let’s summarize everything we have learned.
In this chapter, we took a deeper look at Neo4j internals to understand how a database works to execute queries. We also reviewed a few query patterns and saw the right and wrong ways to build queries and looked at troubleshooting common issues.
Cypher is an easy language to learn compared to SQL. However, it takes a bit of effort to get the most out of it. One thing to remember is that Neo4j is schemaless. This gives us great flexibility when it comes to data modeling. If your application use case starts changing, the current data model becomes too limiting, and your queries get slower, there is no need to create a completely new model. You can start adapting the existing model by adding new model concepts, thus keeping the same graph for the old and new functionality. Once you are satisfied with the new model changes, you can remove the remnants of the old model that are no longer required. Combining this kind of model flexibility with the simplicity and power of Cypher makes it easy to build effective and complex applications.
Furthermore, openCypher, which is the open source version of the Cypher language, is being adopted by other graph databases such as Amazon Neptune. So, by learning Cypher, your knowledge is not limited to Neo4j but can also be leveraged to work with other databases. However, there might be subtle nuances that you need to be aware of to get the most out of different types of graph databases.
In short, to become an effective Cypher query developer, understand the domain and take a look at the graph database capabilities and graph modeling. Also, become familiar with capabilities such as EXPLAIN and PROFILE along with logs that are available to be able to identify and fix any issues.