Chapter 8
In This Chapter
Installing and activating Power Query
Spelling out the Power Query basics
Understanding Query steps
Managing existing queries
Overviewing query actions
In information management, the term ETL (Extract, Transform, Load) refers to the three separate functions typically required to integrate disparate data sources: extract, transform, and load. The extraction function refers to the reading of data from a specified source and extracting a desired subset of data. The transformation function refers to the cleaning, shaping, and aggregating of data to convert it to the desired structure. The loading function refers to the actual importing or writing of the resulting data to a target location.
Excel analysts have been manually performing ETL processes for years — although they rarely call it ETL. Every day, millions of Excel users manually pull data from a source location, manipulate that data, and integrate it into their reporting. This process requires lots of manual effort.
Power Query enhances the ETL experience by offering an intuitive mechanism to extract data from a wide variety of sources, perform complex transformations on that data, and then load the data into a workbook or the Internal Data Model.
In this chapter, you explore the basics of the Power Query Add-in. You also get a glimpse of how it can help you save time and automate the steps needed to ensure that clean data is imported into your reporting models.
In Excel 2016, Power Query isn’t an add-in — it’s a native feature of Excel, just like charts and pivot tables are native features. If you’re working with Excel 2016, you don’t have to install any additional components. You’ll find Power Query in Excel 2016 hidden on the Data tab, in the Get & Transform group (see Figure 8-1).
If you’re working with Excel 2010 or Excel 2013, you need to explicitly download and install the Power Query add-in. As of this writing, the Power Query add-in is available to you only if you have one of these editions of Office or Excel:
If you have any of these editions, you can install and activate the Power Query add-in. Simply enter the search term Excel Power Query add-in into your favorite search engine to find the free installation package.
Note that Microsoft offers Power Query for both Excel 2010 and Excel 2013 in both 32- and 64-bit platforms. Be sure to download the version that matches your version of Excel as well as the platform on which your PC is running.
After the add-in is installed, activate it by following these steps:
Open Excel and look for the Power Query command on the Insert tab (see Figure 8-2).
If you see it, the Power Query add-in is already activated. You can skip the remaining steps.
A successful install results in a new Power Query tab on the Excel Ribbon.
In this section, I walk you through a simple example of using Power Query. Imagine that you need to import Microsoft Corporation stock prices from the past 30 days by using Yahoo! Finance. For this scenario, you need to perform a web query to pull the data you need from Yahoo! Finance.
To start the query, follow these steps:
If you have Excel 2016, select the New Query command on the Data tab, and then select From Other Sources ⇒ From Web (see Figure 8-3).
If you’re working with Excel 2010 or Excel 2013, click the Power Query tab and select the From Web command.
In the dialog box that appears, enter the URL for the data you need, as shown in Figure 8-4.
In this example, you type http://finance.yahoo.com/q/hp?s=MSFT
.
After a bit of gyrating, the Navigator pane shown in Figure 8-5 appears. You can select the data source that you want to extract. Click on each table to see a preview of the data.
In this case, Table 4 holds the historical stock data you need, so click Table 4 in the list box on the left and then click the Edit button.
You may have noticed that the Navigator pane, shown in Figure 8-5, offers a Load button (next to the Edit button). You can use this button to skip any editing and import your targeted data as is. If you’re sure that you won’t need to transform or shape your data in any way, click the Load button to import the data directly into the data model or a spreadsheet in your workbook.
Excel has another From Web command button, on the Data tab in the Get External Data group. This unfortunate duplicate command is the legacy web-scraping capability found in all Excel versions since Excel 2000.
The Power Query version of the From Web command (choose New Query ⇒ From Other Sources ⇒ From Web) goes beyond simple web scraping. Power Query can pull data from advanced web pages and then manipulate it. Make sure you’re using the correct feature when pulling data from the web.
When you click the Edit button, Power Query activates a new Query Editor window, which contains its own Ribbon and a preview pane that shows a preview of the data (see Figure 8-6). You can apply certain actions to shape, clean, and transform the data before importing.
The idea is to work with each column shown in the Query Editor, applying the necessary actions that will give you the data and structure you need. You can dive deeper into column actions later in this chapter. For now, continue toward the goal of getting the last 30 days of stock prices for Microsoft Corporation.
Remove all unnecessary columns by right-clicking each one and selecting Remove.
(Besides the Date field, the only other columns you need are the High, Low, and Close fields.)
Alternatively, you can hold down the Ctrl key on the keyboard, select the columns you want to keep, right-click any selected column, and then choose Remove Other Columns (see Figure 8-8).
To ensure that the High, Low, and Close fields are formatted as proper numbers, hold down the Ctrl key on the keyboard, select the three columns, and right-click and choose Change Type ⇒ Decimal Number.
After you do this, you may notice that some of the rows show the word Error. These are rows that contain text values that could not be converted.
After all errors are removed, add a Week Of field that displays which week each date in the table belongs to.
Here’s how to do this:
Right-click the Date field and select the Duplicate Column option.
A new column is added to the preview.
Select the Transform tab on the Power Query Ribbon, look to the right to find the Date & Time Column group, and then choose Date ⇒ Week ⇒ Start of the Week, as shown in Figure 8-10.
Excel transforms the date to display the start of the week for a given date.
When you’ve finished configuring your Power Query feed, save and output the results.
To do this, click the Close & Load drop-down menu on the Home tab of the Power Query Ribbon to reveal the two options shown in Figure 8-11:
At this point, you have a table similar to the one shown in Figure 8-12, which can be used to produce the pivot table you need.
Take a moment to appreciate what Power Query allowed you to do just now. With a few clicks, you searched the Internet, found some base data, shaped the data to keep only the columns you needed, and even manipulated that data to add an extra Week Of dimension to the base data. This is what Power Query is about: enabling you to easily extract, filter, and reshape data without the need for any programmatic coding skills.
Power Query uses its own formula language (known as the “M” language) to codify your queries. As with macro recording, each action you take when working with Power Query results in a line of code being written into a query step. Query steps are embedded M code that allow your actions to be repeated each time you refresh your Power Query data.
You can see the query steps for your queries by activating the Query Settings pane. While in the Query Editor window, you choose View ⇒ Query Settings. You can also place a check mark in the Formula Bar option to enhance your analysis of each step with a formula bar that displays the syntax for the given step.
The Query Settings pane appears to the right of the preview pane, as shown in Figure 8-13. The formula bar is located directly above the preview pane.
Each query step represents an action you took to get to a data table. You can click on any step to see the underlying M code in the Power Query formula bar. For example, clicking the step called Removed Errors reveals the code for that step in the formula bar.
You can right-click on any step to see a menu of options for managing your query steps. Figure 8-14 illustrates the following options:
Power Query data is in no way connected to the source data used to extract it. A Power Query data table is merely a snapshot. In other words, as the source data changes, Power Query doesn’t automatically keep up with the changes; you need to intentionally refresh your query.
If you chose to load your Power Query results to an Excel table in the existing workbook, you can manually refresh by right-clicking on the table and selecting the Refresh option.
If you chose to load your Power Query data to the Internal Data Model, you need to open the Power Pivot window, select your Power Query data, and then click the Refresh command on the Home tab of the Power Query window.
To get a bit more automated with the refreshing of queries, you can configure your data sources to automatically refresh the Power Query data. To do so, follow these steps:
From the Data tab on the Excel Ribbon, select the Connections command.
The Workbook Connections dialog box appears.
Select the Power Query data connection you want to refresh and then click the Properties button.
The Properties dialog box opens.
These refresh options are useful when you want to ensure that your customers are working with the latest data. Of course, setting these options does not preclude the ability the manually refresh the data using the Refresh command on the Home tab.
As you add various queries to a workbook, you need a way to manage them. Excel accommodates this need by offering the Workbook Queries pane, which enables you to edit, duplicate, refresh, and generally manage all existing queries in the workbook. Open the Workbook Queries pane by selecting the Show Queries command on the Data tab of the Excel ribbon.
You need to find the query you want to work with and then right-click it to take any one of the actions described in the following list (see Figure 8-15):
The Workbook Queries pane is especially useful when your workbook contains several queries. Think of it as a kind of table of contents that allows you to easily find and interact with the queries in your workbook.
Right-clicking a column in the Query Editor opens a context menu that shows a full list of the actions you can take. You can also apply certain actions to multiple columns at one time by selecting two or more columns before right-clicking. Figure 8-16 shows the available column-level actions, and Table 8-1 describes their purpose, as well as a few other actions that are available only on the Query Editor Ribbon.
Table 8-1 Column-Level Actions
Action |
Purpose |
Available with Multiple Columns? |
Remove |
Remove the selected column from the Power Query data. |
Yes |
Remove Other Columns |
Remove all non-selected columns from the Power Query data. |
Yes |
Duplicate Column |
Create a duplicate of the selected column as a new column placed on the far right end of the table. The name given to the new column is Copy of X, where X is the name of the original column. |
No |
Remove Duplicates |
Remove all rows from the selected column where the values duplicate earlier values. The row with the first occurrence of a value isn’t removed. |
Yes |
Remove Errors |
Remove rows containing errors in the selected column. |
Yes |
Change Type |
Change the data type of the selected column to any of these types: Binary, Date, Date/Time, Date/Time/Timezone, Duration, Logical, Number, Text, Time, or Using Locale (which localizes data types to the country you specify). |
Yes |
Transform |
Change the way values in the column are rendered. You can choose from the following options: Lowercase, Uppercase, Capitalize Each Word, Trim, Clean, Length, JSON, and XML. If the values in the column are date/time values, the options are Date, Time, Day, Month, Year, or Day of Week. If the values in the column are number values, the options are Round, Absolute Value, Factorial, Base-10 Logarithm, Natural Logarithm, Power, and Square Root. |
Yes |
Replace Values |
Replace one value in the selected column with another specified value. |
Yes |
Replace Errors |
Replace unsightly error values with your own, friendlier text. |
Yes |
Group By |
Aggregate data by row values. For example, you can group by state and either count the number of cities in each state or sum the population of each state. |
Yes |
Fill |
Fill empty cells in the column with the value of the first non-empty cell. You have the option to fill up or fill down. |
Yes |
Unpivot Columns |
Transpose the selected columns from column-oriented to row-oriented or vice versa. |
Yes |
Rename |
Rename the selected column to a name you specify. |
No |
Move |
Move the selected column to a different location in the table. You have these choices for moving the column: Left, Right, To Beginning, and To End. |
Yes |
Drill Down |
Navigate to the contents of the column. This option is used with tables that contain metadata representing embedded information. |
No |
Add as New Query |
Create a new query with the content of the column, by referencing the original query in the new one. The name of the new query is the same as the column header of the selected column. |
No |
Split Column (Ribbon only) |
Split the value of a single column into two or more columns, based on a number of characters or a given delimiter, such as a comma, semicolon, or tab. |
No |
Merge Column (Ribbon only) |
Merge the values of two or more columns into a single column that contains a specified delimiter, such as a comma, semicolon, or tab. |
Yes |
While you’re in the Query Editor, Power Query lets you apply certain actions to an entire data table. You can see the available table-level actions by clicking the Table Actions icon, shown in Figure 8-17.
Table 8-2 lists the table-level actions and describes the primary purpose of each one.
Table 8-2 Table-Level Actions
Action |
Purpose |
Use First Row as Headers |
Replace each table header name with the values in the first row of each column. |
Add Custom Column |
Insert a new column after the last column of the table. The values in the new column are determined by the value or formula you define. |
Add Index Column |
Insert a new column containing a sequential list of numbers starting from 1, 0, or another specified value you define. |
Choose Columns |
Choose the columns you want to keep in the query results. |
Remove Duplicates |
Remove all rows where the values in the selected columns duplicate earlier values. The row with the first occurrence of a value set isn’t removed. |
Keep Top Rows |
Remove all but the top N number of rows. You specify the number threshold. |
Keep Bottom Rows |
Remove all but the bottom N number of rows. You specify the number threshold. |
Keep Range of Rows |
Remove all rows except the ones that fall within a range you specify. |
Remove Top Rows |
Remove the top N rows from the table. |
Remove Bottom Rows |
Remove the bottom N rows from the table. |
Remove Alternate Rows |
Remove alternate rows from the table, starting at the first row to remove and specifying the number of rows to remove and the number of rows to keep. |
Remove Errors |
Remove rows containing errors in the selected columns. |
Merge Queries |
Create a new query that merges the current table with another query in the workbook by matching specified columns. |
Append Queries |
Create a new query that appends the results of another query in the workbook to the current table. |