Getting ready

The power of nested structures goes far beyond traditional use cases, though. Hierarchical data has always been difficult to represent in highly normalized databases, because the data has to be joined across tables whenever it is needed. This approach does provide flexibility. Let's understand it with the example we covered in the previous recipe. In the Yelp dataset, a user reviews a business, and each review is a record in yelp_academic_dataset_review.json. In reality, a user reviews multiple businesses and a business is reviewed by multiple users. One could argue that this is a standard many-to-many (NxN) relationship between entities, so what's the big deal here? The challenge lies in how distributed systems operate: to make a join happen, the data needs to be shuffled over the network, which is very costly.

A high degree of normalization definitely saves some disk space and minimizes redundancy, but in the big data world, neither is a real issue. The real issue is latency. The question that arises is: how should we represent nested data? Should users be nested inside businesses, or businesses inside users? There is no perfect answer here, as it depends on your query. If you would like to query both ways, you'll need two nested structures. Once a structure is created, we can retain default parallelism at the node level, and no shuffle is required.
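The two nesting directions can be sketched in plain Python. This is only an illustration, not the recipe's own code: the sample values are invented, the record is reduced to three fields (user_id, business_id, and stars), and the helper nest is a hypothetical name introduced here.

```python
from collections import defaultdict

# Invented sample records, reduced to three fields for illustration.
reviews = [
    {"user_id": "u1", "business_id": "b1", "stars": 5},
    {"user_id": "u1", "business_id": "b2", "stars": 3},
    {"user_id": "u2", "business_id": "b1", "stars": 4},
]


def nest(records, outer_key, inner_key):
    """Group flat review records under outer_key (hypothetical helper)."""
    nested = defaultdict(list)
    for record in records:
        nested[record[outer_key]].append(
            {inner_key: record[inner_key], "stars": record["stars"]}
        )
    return dict(nested)


# One structure per query direction.
users_by_business = nest(reviews, "business_id", "user_id")
businesses_by_user = nest(reviews, "user_id", "business_id")

print(users_by_business["b1"])
print(businesses_by_user["u1"])
```

Each structure answers one direction of the question with a single lookup and no join, which is why querying both ways requires keeping both.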

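One question that arises is: how are nested structures different from denormalized data structures? The difference is that much less data is stored, which leads to efficiencies. A flat denormalized table repeats the parent's attributes on every child row, whereas a nested structure stores the parent once and keeps its children inside it.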
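A small sketch makes the storage difference concrete. It assumes one business with three reviews; all names and values are invented, and serialized JSON length is used only as a rough stand-in for storage size.

```python
import json

# Invented parent record and child records for illustration.
business = {"business_id": "b1", "name": "Example Cafe", "city": "Example City"}
reviews = [
    {"user_id": "u1", "stars": 5},
    {"user_id": "u2", "stars": 4},
    {"user_id": "u3", "stars": 2},
]

# Denormalized: the business attributes are copied onto every review row.
denormalized = [{**business, **review} for review in reviews]

# Nested: the business is stored once, with its reviews inside it.
nested = {**business, "reviews": reviews}

flat_json = json.dumps(denormalized)
nested_json = json.dumps(nested)

# Count how often the business name is stored in each representation.
name_copies_flat = flat_json.count(business["name"])
name_copies_nested = nested_json.count(business["name"])

print(name_copies_flat, name_copies_nested)
print(len(flat_json), len(nested_json))
```

The gap widens as the number of children per parent grows, because the flat form repeats every parent attribute once per child.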
