Hunk can be used for more than analytics on data stored in Hadoop. In this chapter we will discover other options using special integration applications. These come from the https://splunkbase.splunk.com/ portal, which hosts hundreds of published applications. This chapter is devoted to integration between Hunk and MongoDB, the popular document-oriented NoSQL store.
There are many pros and cons to using Mongo. It's a great choice when you want simple and fairly fast persistent storage with a nice JavaScript interface for querying the stored data. We recommend starting with Mongo if you don't really need a strict SQL schema and your data volumes are measured in terabytes at most. Mongo is amazingly simple compared to the whole Hadoop ecosystem; it's probably the right option for starting to explore the denormalized NoSQL world.
We assume Mongo is already installed and ready to use; the installation itself is not described here. We use Mongo version 3.0.
You will install the special Hunk app that integrates Mongo and Hunk.
Visit https://splunkbase.splunk.com/app/1810/#/documentation and download the app. You should use the VM browser to download it:
A Mongo provider is created and used to access Mongo data. Go to the Virtual Indexes tab and see the created local-mongodb provider:
You can change the property named vix.mongodb.host if you want to connect to some other Mongo instance.
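For reference, provider properties like this one live in a stanza in indexes.conf. Here is a minimal sketch, assuming the default provider name local-mongodb created by the app; the exact stanza layout and any additional settings may differ between app versions, and the hostname below is a placeholder:

```ini
# indexes.conf -- Mongo provider stanza (sketch)
[provider:local-mongodb]
# Change this property to connect to some other Mongo instance
vix.mongodb.host = mongodb.example.com
```

Equivalently, the property can usually be changed from the provider's configuration screen in the UI.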
Now it's time to create virtual indexes based on the Mongo collections. Due to a bug, you have to do this manually:
There is a sample of data collected by the recommendation engine backend. When a user clicks on a recommendation, the event is recorded and sent to Mongo. That data is later used to self-tune the recommendation engine. Data is stored in daily collections; Mongo makes it easy to create such collections, which helps to partition the data. The best approach is to think about data partitioning in advance.
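To make the daily-collection idea concrete, here is a minimal sketch in Python of how a backend might route click events into per-day collections. The clicks_YYYYMMDD naming scheme and the pymongo wiring are assumptions for illustration; the actual backend could use any naming convention:

```python
from datetime import datetime, timezone

def daily_collection_name(ts: int) -> str:
    """Derive a per-day collection name from a Unix timestamp (UTC)."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc)
    return "clicks_" + day.strftime("%Y%m%d")

# A click event shaped like the recommendation data in this chapter
event = {"_timestamp": 1423080002, "shop_id": 173, "service": 2}

# The actual write would go through pymongo (requires a running Mongo):
# from pymongo import MongoClient
# db = MongoClient("localhost", 27017)["recommendations"]
# db[daily_collection_name(event["_timestamp"])].insert_one(event)

print(daily_collection_name(event["_timestamp"]))  # clicks_20150204
```

Because each day lands in its own collection, old days can be dropped or archived wholesale, and a Hunk virtual index can be pointed at exactly the date range it needs.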
Let's explore data schemas. The schema describes the document stored in MongoDB and extracted to Hadoop:
{
    "_timestamp": 1423080002,
    "block_id": 4,
    "cross_id": "896bba91c21c620b0902fbec05b3246bce21859c",
    "idvisitor": "783852c991fbefb8",
    "is_napoleon": 2,
    "original_id": null,
    "rec": 2291655,
    "service": 2,
    "shop_id": 173,
    "target_site_id": 0,
    "time": 1423080002,
    "type": 1
}
We are interested in these fields:
_timestamp: When the recommendation click event happened
shop_id: Where the recommendation has been displayed
service: Which service provided the recommendation

Let's see how the data is generated. A user comes to an e-commerce site, and his browser gets a cookie named cross_id. The cookie gives us a chance to track the user's interaction with the site: what pages he visits and what items he clicks. A service shows recommendations to the user based on this site activity, and each recommendation has a unique ID, rec_id. The service stores the list of recommendations that were displayed to the user and captures a click event when the user clicks a promoted item from the recommendation set. As a result, we know exactly which recommendations (with the unique key rec_id) the user saw and clicked.
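Once events like these land in Mongo, the self-tuning boils down to simple aggregations over them. Here is a small sketch in plain Python; the sample events and their field values are made up, and in practice Hunk would compute the equivalent of a stats count by shop_id search over the virtual index:

```python
from collections import Counter

# Hypothetical click events with the fields described above
events = [
    {"rec_id": 2291655, "shop_id": 173, "service": 2},
    {"rec_id": 2291655, "shop_id": 173, "service": 2},
    {"rec_id": 1100042, "shop_id": 90, "service": 1},
]

# How many clicks did each shop's recommendation blocks collect?
clicks_per_shop = Counter(e["shop_id"] for e in events)

# Which individual recommendations were clicked, and how often?
clicks_per_rec = Counter(e["rec_id"] for e in events)

print(clicks_per_shop[173], clicks_per_rec[2291655])  # 2 2
```

These per-recommendation click counts are exactly the feedback signal the engine needs to promote recommendations that work and demote those that don't.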