Appendix A. Big Data Sets

If you really want to see the real power of Big Data solutions based on the Hadoop Distributed File System (HDFS), you will have to choose the right set of data. If you analyze files of merely a few KB on this platform, it will take much longer than it would on a conventional database system. As the data grows into GBs and TBs, and provided there are enough nodes in the cluster, you will start seeing the real benefit of HDFS-based solutions.

Data preparation is an important step in a Big Data solution. You have to harmonize the various data sources by integrating them seamlessly, using an appropriate ETL methodology, so that the integrated data can be easily analyzed by a Big Data solution. If you know your data well, you can identify patterns in it more easily during data discovery.

Now, the challenge is to get a Big Data sample from the public domain without any copyright issues. If you have your own large dataset, you are lucky. If you don't, there is no need to curse your luck; many gigantic datasets are available in the public sphere, covering a variety of data such as social media, science and research, government, the private sector, and so on. Although it's easy to find sites hosting free data through Google or Quora, for quick reference, this book shares a few links to sites hosting these public datasets. Please do not forget to read the usage terms carefully before using each source to avoid any infringement.

Freebase

Freebase is a collection of crowdsourced datasets. At the time of writing this book, the Freebase data dump had reached 88 GB in size. Freebase is a part of Google.

Visit the Freebase website at https://developers.google.com/freebase/data. For the terms and conditions of the services, visit https://developers.google.com/freebase/terms.

Freebase uses the Turtle data format, a serialization of the Resource Description Framework (RDF), which is a Semantic Web metadata standard.

For more information on RDF, visit http://www.rdfabout.com/intro/.
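To get a feel for what RDF data looks like before loading a large dump into a cluster, it can help to read a few statements with a small script. The following is only a minimal sketch, not a full Turtle parser: it assumes that each line holds one complete statement made of a subject, a predicate, and an object separated by whitespace and terminated by a period, which is a simplification of what Turtle allows. The function names (parse_triples, count_predicates, open_dump) and the example.org identifiers are hypothetical and used for illustration only.

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]


def parse_triples(lines: Iterable[str]) -> Iterator[Triple]:
    """Yield (subject, predicate, object) from one-statement-per-line text.

    Assumption: every statement sits on a single line and ends with '.'.
    Blank lines, comments (#), and directives (@prefix, @base) are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        if line.endswith("."):
            # Drop the statement terminator and the whitespace before it.
            line = line[:-1].rstrip()
        # Split on the first two runs of whitespace only, so that an
        # object literal containing spaces stays in one piece.
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # not a complete statement under our assumption
        yield parts[0], parts[1], parts[2]


def count_predicates(lines: Iterable[str]) -> Counter:
    """Count how often each predicate occurs, a quick way to profile data."""
    return Counter(predicate for _, predicate, _ in parse_triples(lines))


def open_dump(path: str):
    """Open a gzip-compressed dump as text; the path is supplied by you."""
    return gzip.open(path, "rt", encoding="utf-8")


# Hypothetical sample statements, made up for this illustration.
SAMPLE = [
    '<http://example.org/s1>\t<http://example.org/name>\t"Alpha"@en\t.',
    '<http://example.org/s1>\t<http://example.org/type>\t<http://example.org/Thing>\t.',
    '<http://example.org/s2>\t<http://example.org/name>\t"Beta gamma"@en\t.',
]

if __name__ == "__main__":
    for predicate, count in count_predicates(SAMPLE).most_common():
        print(predicate, count)
```

Because parse_triples is a generator, it reads one line at a time, so the same code can be pointed at a compressed file through open_dump without loading the whole file into memory. For real analysis of a dump of this size, a proper RDF library or an HDFS-based job would be the more suitable choice.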
