Migration techniques for SQL users

Neo4j is relatively young, while most organizations moving into business intelligence and analytics have decades of data accumulated in SQL databases. Rather than moving or transforming all of that data at once, it is less disruptive to run a graph database alongside the existing one.

Handling dual data stores

The two databases in our system are arranged so that the already-in-place MySQL database remains the primary store, while Neo4j acts as secondary storage and works on small, related subsets of the data. This can be done in two specific ways:

  • Online mode, for transactions that drive essential business decisions, are needed in real time, and operate on the current data
  • Batch mode, in which we collect the required data and process it when feasible (a sketch of both modes follows this list)
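As a rough illustration of these two modes, the following sketch shows how the same change from the primary store could either be applied immediately or queued for later processing. The ChangeEvent, OnlineSync, and BatchSync classes are purely illustrative and not part of any Neo4j API:

import java.util.ArrayList;
import java.util.List;

//A change captured from the primary (MySQL) store; entity and id are illustrative fields.
class ChangeEvent {
    final String entity;
    final long id;
    ChangeEvent(String entity, long id) { this.entity = entity; this.id = id; }
}

//Online mode: every change is applied to Neo4j as soon as it happens.
class OnlineSync {
    void apply(ChangeEvent e) {
        System.out.println("Syncing " + e.entity + " #" + e.id + " immediately");
    }
}

//Batch mode: changes are collected and processed when feasible.
class BatchSync {
    private final List<ChangeEvent> pending = new ArrayList<>();
    void record(ChangeEvent e) { pending.add(e); }
    void flush() {
        for (ChangeEvent e : pending) {
            System.out.println("Syncing " + e.entity + " #" + e.id + " in a batch");
        }
        pending.clear();
    }
}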

This will require us to first load Neo4j with the historical data already present in the primary database and then adapt the system to keep the two data stores in sync.

We will try to avoid data export between the two data stores and design the system without making assumptions about the underlying data model.

Analyzing the model

Let's use a simple entity set of people buying cars sold by dealers to illustrate the process. You can adapt the process to your existing data setup. The features of the objects in our domain are outlined as follows:

The SQL schema of an existing database

We can represent the corresponding model in Neo4j as a directed acyclic graph, which handles persistence in the graph database. Mutating Cypher is used to transform the data into nodes and relationships, falling back to the Neo4j API in certain cases where complex objects must be handled. Each entity relates back to the underlying database through an ID that acts as the system's primary key, and indexing operations are performed on this key. The corresponding graph model is as follows:

The corresponding representation in a graph database

When the graph modeling is complete, our application becomes independent of our primary data store.
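As a concrete illustration of the mutating Cypher approach described above, here is a minimal sketch, assuming an embedded Neo4j 3.x GraphDatabaseService and a uniqueness constraint or index on the id property of Person nodes. The PersonGraphWriter class and its save method are illustrative, not part of the Neo4j API:

import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

//Persists a person through mutating Cypher, keyed by the SQL primary key.
class PersonGraphWriter {
    private final GraphDatabaseService graphDb;

    PersonGraphWriter(GraphDatabaseService graphDb) {
        this.graphDb = graphDb;
    }

    void save(long id, String name) {
        Map<String, Object> params = new HashMap<>();
        params.put("id", id);
        params.put("name", name);
        try (Transaction tx = graphDb.beginTx()) {
            //MERGE finds the node by its indexed id and creates it if missing,
            //so repeated imports update the entity instead of duplicating it.
            graphDb.execute("MERGE (p:Person {id: $id}) SET p.name = $name", params);
            tx.success();
        }
    }
}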

Initial import

We now have to import our data, transforming it into our graph objects as we store it. We obtain the required data with SQL queries that read from the database, by requesting an existing API, or from a previously exported dataset.

//Update a person node or create one. If created, its id is indexed.
SELECT name, id from person where ...
Person person = new Person(res.getString("name"), res.getInt("id"));
person.save();

//Update a car node or create one. If created, its id is indexed.
SELECT name, id from car where ...
Car car = new Car(res.getString("name"), res.getInt("id"));
car.save();

//Update a dealer node or create one. An "affiliation" node is created if it does not already exist, and a relationship is created from the dealer to the affiliation. The ids of both dealer and affiliation are indexed.
SELECT name, id, affiliation from dealers where ...
Dealer dealer = new Dealer(res.getString("name"), res.getInt("id"));
dealer.setAffiliation(res.getString("affiliation"));
dealer.save();

//A relationship is created from the person to the car, and the date of buying is set as a property of that relationship.
SELECT person.id, car.id, buying_date from Purchase_Table
Person person = repository.getById(res.getInt("person.id"));
person.buyCar(res.getInt("car.id"), res.getDate("buying_date"));

Note that the preceding queries are abstract and will not run standalone. They illustrate the method of integrating with a relational database. You can change them according to your relational database schema.
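The snippets above read their values from a JDBC ResultSet named res without showing where it comes from. The following is a minimal sketch of how the person import could be driven over JDBC; the connection URL, the credentials, and the Person class are placeholders for your own setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

class PersonImport {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/sales", "user", "password");
             Statement stmt = conn.createStatement();
             //Complete the WHERE clause according to your own schema.
             ResultSet res = stmt.executeQuery("SELECT name, id FROM person")) {
            while (res.next()) {
                //Build the graph object from the current row and persist it
                //to Neo4j, exactly as in the person snippet above.
                Person person = new Person(res.getString("name"), res.getInt("id"));
                person.save();
            }
        }
    }
}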

Keeping data in sync

Having successfully imported the data, we now need to keep the two systems in sync. For this purpose, we can schedule a cron job that performs the previous operations periodically. You can also define an event-based trigger that reports updates in the primary system, such as cars being bought or new customers joining, and incorporates them into the Neo4j application.
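If an operating-system cron job is not convenient, the same periodic batch sync can be sketched inside the application itself, for example with a ScheduledExecutorService. The runImport method here merely stands in for the import routines shown earlier, and the 15-minute interval is an arbitrary illustrative choice:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SyncScheduler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        //Rerun the import logic at a fixed interval.
        scheduler.scheduleAtFixedRate(SyncScheduler::runImport, 0, 15, TimeUnit.MINUTES);
    }

    static void runImport() {
        //Re-read changed rows from MySQL and merge them into Neo4j,
        //using the same queries and save() calls as in the initial import.
    }
}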

This can be implemented with simple building blocks, such as message queues, where you define the type of message to be consumed by the secondary database system. Regardless of the content of the message, our system should be able to read and parse it and apply it to our business application logic.
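As a broker-agnostic sketch of this idea, the handler below assumes the queue delivers one plain-text payload per purchase in a hypothetical "PURCHASE,personId,carId,epochMillis" layout; the message format and the handler class are illustrative only:

import java.util.Date;

class PurchaseMessageHandler {
    //Called by the queue consumer (JMS listener, polling loop, and so on)
    //with the raw payload of a single message.
    void onMessage(String payload) {
        String[] parts = payload.split(",");
        if (!"PURCHASE".equals(parts[0])) {
            return; //Not a message this handler is interested in.
        }
        int personId = Integer.parseInt(parts[1]);
        int carId = Integer.parseInt(parts[2]);
        Date boughtOn = new Date(Long.parseLong(parts[3]));
        //Translate the event into the same graph update as in the import:
        //look the person up by its indexed id and create the purchase
        //relationship to the car, with boughtOn as a property.
        System.out.println("Recording purchase of car #" + carId
                + " by person #" + personId + " on " + boughtOn);
    }
}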

The result

The applications are loosely coupled, and we have used an efficient parsing approach to adapt the data between the two formats. Although this technique works well in most situations, the initial import might take somewhat longer because of its transactional nature; however, the initial import is not a frequent operation. The event-based sync is the better approach in terms of performance.

You need an in-depth understanding of the data patterns in your application to decide which technique is suitable. For one-time migrations of large datasets, there are several tools available, such as the batch importer (https://github.com/jexp/batch-import) or the REST batch API of a database server that runs Neo4j.
