Download a sample big data file, or extract logs from your systems using a tool such as Flume. For this book, we will download the dataset from the following URL:
http://www.seanlahman.com/?s=lahman591-csv.zip
Extract the ZIP file.
Upload the data file to HDFS by following these steps:
Navigate to /user/maria_dev and click on the Upload button.
Select the Batting.csv file.
Create an intermediate table using the following command:
create table intermediate_batting (col_value STRING);
This creates the intermediate_batting table under the default database.
Load the Batting.csv data file into the intermediate_batting table:
load data inpath '/user/maria_dev/Batting.csv' overwrite into table intermediate_batting;
Create the batting table using the following command:
create table batting (player_id STRING, year INT, runs INT);
Extract data from the intermediate_batting table to the batting table using the following command:
insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) runs
from intermediate_batting;
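To see why the patterns above pick out the first, second, and ninth fields, the following is a minimal Python sketch of the same regular expression. It assumes Python's re module treats the repeated capture group the way Hive's Java-based regexp_extract does (the group keeps the value from its last repetition). The extract_field helper and the sample row are hypothetical and used only for illustration.

```python
import re


def extract_field(line, n):
    """Return the n-th (1-based) comma-separated field of a CSV line.

    Mirrors the Hive pattern '^(?:([^,]*),?){n}': the non-capturing group
    consumes one field plus an optional comma and is repeated n times, so
    capture group 1 ends up holding the field from the last repetition.
    """
    pattern = r'^(?:([^,]*),?){%d}' % n
    match = re.match(pattern, line)
    return match.group(1) if match else ""


# Hypothetical row, laid out like the first columns of Batting.csv.
sample_row = "playerx01,2004,1,SFN,NL,11,11,0,5,0"

player_id = extract_field(sample_row, 1)
year = extract_field(sample_row, 2)
runs = extract_field(sample_row, 9)
print(player_id, year, runs)
```

Changing the repetition count is all it takes to target a different column, which is why the three expressions in the query differ only in the number inside the braces.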
Now that we have the table in Hadoop, we can start creating a MicroStrategy report based on it.
This gives you two data access options, as follows:
Select Connect Live and create a dashboard based on the data imported.
With MicroStrategy 10, users have the ability to prepare data. In the previous section, when we created a dashboard using data from Hadoop, we were presented with a data preparation, or data wrangling, step, which allows business users to explore the data and improve its quality before it is imported into MicroStrategy. Examples of data preparation include:
The following screenshot presents data wrangling:
So, whichever source the user is importing data from, they can still prepare it without ETL and data modeling.
For example, suppose the data loaded from a source stores coordinates in a single column, but we want two separate columns for them. We can split the column using data wrangling.
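The split itself is done in the MicroStrategy data wrangling interface, but the underlying transformation can be sketched in a few lines of Python. This is only an illustration: it assumes each value is a latitude and a longitude separated by a comma, which the text does not specify, and the split_coordinates helper and sample values are hypothetical.

```python
from typing import List, Tuple


def split_coordinates(value: str, sep: str = ",") -> Tuple[float, float]:
    """Split one 'latitude, longitude' string into two numeric values."""
    # Split only on the first separator so each value yields exactly two parts.
    lat_text, lon_text = value.split(sep, 1)
    return float(lat_text.strip()), float(lon_text.strip())


# Hypothetical single-column input, assumed to be 'latitude, longitude'.
coordinates: List[str] = ["40.7128, -74.0060", "51.5074, -0.1278"]

# Build the two output columns from the single input column.
latitudes = [split_coordinates(c)[0] for c in coordinates]
longitudes = [split_coordinates(c)[1] for c in coordinates]
print(latitudes)
print(longitudes)
```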
The following screenshot shows data loaded from source:
Use the data wrangle functionality to prepare data for reporting:
Output columns will be displayed as follows: