Chapter 9
In This Chapter
Extracting data from files
Getting data from external databases
Importing from other nonstandard data systems
Changing data source settings
Microsoft has invested a great deal of time and resources in ensuring that Power Query has the ability to connect to a wide array of data sources. Whether you need to pull data from an external website, a text file, a database system, Facebook, or a web service, Power Query can accommodate most, if not all, of your source data needs.
You can see all available connection types by clicking on the New Query drop-down arrow on the Data tab. As Figure 9-1 illustrates, Power Query offers the ability to pull from a wide array of data sources, as described in this list:
In this chapter, I help you explore the various connection types that can be leveraged to import external data.
Organizational data is often stored in files such as text files, CSV files, and even other Excel workbooks. It’s not uncommon to use these kinds of files as data sources for data analysis. Power Query offers several connection types that enable the importing of data from external files.
You can import data from other Excel workbooks by selecting Data ⇒ New Query ⇒ From File ⇒ From Workbook from the Excel Ribbon.
Excel opens the Import Data dialog box, shown in Figure 9-2. Use this dialog box to browse for the Excel file you want to work with. Note that you can import any kind of Excel file, including macro-enabled workbooks and template workbooks.
After you’ve selected a file, the Navigator pane activates (see Figure 9-3), showing you all the data sources available in the workbook.
The idea here is to select the data source you want and then either load or edit the data using the buttons at the bottom of the Navigator pane. Click the Load button to skip any editing and import your targeted data as is. Click the Edit button if you want to transform or shape before completing the import.
In terms of Excel workbooks, a data source is either a worksheet or a defined named range. The icons next to each data source let you distinguish which sources are worksheets and which are named ranges. In Figure 9-3, the source named MyNamedRange is a defined named range, and the source named National Parks is a worksheet.
You can import multiple sources at a time by selecting the Select Multiple Items check box and then placing a check mark next to each worksheet and named range that you want imported.
Text files are commonly used to store and distribute data because of their inherent ability to hold many thousands of bytes of data without having an inflated file size. Text files can do this by foregoing all the fancy formatting, leaving only the text.
A comma-separated value (CSV) file is a kind of text file that contains commas to delimit (separate) values into columns of data.
To import a text file, select Data ⇒ New Query ⇒ From File ⇒ From Text on the Excel Ribbon. Excel opens the Import Data dialog box, where you can browse for, and select, a text file.
Power Query opens the Query Editor to show you the contents of the text file you just imported. As you can see in Figure 9-4, text files are imported as a table with one column containing a row for each line in the text file.
The idea here is to apply any changes you want to make to the data and then click the Close & Load command on the Home tab to complete the import.
To import a CSV file, go to the Excel Ribbon and select Data ⇒ New Query ⇒ From File ⇒ From CSV. Excel opens the Import Data dialog box, where you can browse for and select your target CSV file.
Power Query opens the Query Editor to show you the contents of the CSV file you just imported. Power Query excels at recognizing the correct delimiters in CSV files and typically does a good job of importing the data correctly.
For example, Row 5 in the sample CSV file illustrated in Figure 9-5 contains the value Johnson, Kimberly. Power Query contains the intelligence to know that the comma in that value is not an actual delimiter. So all the columns are separated correctly.
You can use the Query Editor to apply any edits you may need and then click the Close & Load command on the Home tab to import the CSV data.
XML files are a family of text files that contain data wrapped in markup (tags that denote structure and meaning). These tags essentially make XML files machine-readable, which essentially means that any application or web-based solution designed to read XML files can discern the structure and content of the data within.
The markup tags found in XML files are quite robust and make importing XML files a bit tricky. Without getting too geeky, some XML files are attribute-based: They contain simple markup defining a table of columns and rows. Other XML files are element-based: They contain a wide array of complex markup defining intricate hierarchical tables with many levels of data.
Power Query has the built-in intelligence to handle any kinds of XML file just fine, but you often need to dig for the data you want imported.
Depending on the markup within, Power Query starts you off with either the Navigator pane or the Query Editor. Although this can be a bit jarring at first, you can quickly get the gist of drilling into the data you need.
You can start importing an XML file by going to the Excel Ribbon and selecting Data ⇒ New Query ⇒ From File ⇒ From XML. Excel opens the Import Data dialog box, where you can browse for, and select, the target XML file.
www.datapigtechnologies.com/FinalXMLOutput.xml
If the XML file you’ve pointed to is attribute-based, Power Query opens the Query Editor and shows the highest-level content it finds. You often see a single row of high-level attributes, similar to Figure 9-6.
You can get to the data within the XML file in one of two ways:
After you drill into the data you need, you can click the Close & Load command on the Home tab to complete the import.
If the XML file you’ve pointed to is element-based, Power Query opens the Navigator pane. As you can see in Figure 9-8, you need to drill into a varying number of elements to get to the element that contains the data you’re looking for.
After you find the correct data layer, you can select the data source you want and then use the buttons at the bottom of the Navigator pane to either load or edit the data.
Power Query has the ability to use the Windows file system as a data source, enabling you to import a list of folder contents for a specified directory. This comes in handy when you need to create a list of all the files in a particular folder.
From the Excel Ribbon, select Data ⇒ New Query ⇒ From File ⇒ From Folder. The dialog box shown in Figure 9-9 opens, asking you to enter or browse for the folder (directory) you want to use.
Power Query opens the Query Editor with a new table containing the contents of the specified folder. As Figure 9-10 illustrates, this new table details the key attributes for each file, such as filename, file extension, date created, and date modified. You can even click the Expand icon in the Attributes field and choose to display some of the more advanced attributes for each file.
After you have all the attributes you need, you can click the Close & Load command on the Home tab to complete the import.
In smart organizations, the task of data management is not performed by Excel; rather, it’s performed primarily by database systems such as Microsoft Access and SQL Server. Databases like these not only store millions of rows of data, but also ensure data integrity and allow for the rapid search and retrieval of data by way of queries and views.
Power Query offers options to connect to a wide array of database types. Microsoft has been keen to add connection types for as many commonly used databases as it can.
Choose Data ⇒ New Query ⇒ From Database and you see the list of databases shown in Figure 9-11. Power Query has the ability to connect to virtually any database commonly used today: SQL Server, Microsoft Access, Oracle, MySQL, etc.
If your organization has a Microsoft Azure cloud database or a subscription to Microsoft Azure Marketplace, an entire set of connection types is designed to import data from Azure databases (see Figure 9-12). You can get to these connection types by choosing Data ⇒ New Query ⇒ From Azure.
If you’re using a unique, nonstandard database system that isn’t listed under From Database (refer to Figure 9-11) or From Azure (refer to Figure 9-12), not to worry: As long as your database system can be connected to via an ODBC connection string, Power Query can connect to it.
Choose Data ⇒ New Query ⇒ From Other Data Sources to see a list of other connection types. Click the From ODBC option shown in Figure 9-13 to start a connection to your unique database via an ODBC connection string.
In addition to ODBC, Figure 9-13 illustrates other kinds of data systems that can be leveraged by Power Query.
Some of these data systems (SharePoint, Dynamics CRM, Salesforce, and Microsoft Exchange) are popular systems that are used in many organizations to store data, track sales opportunities, and manage emails. Other systems, such as OData Feeds and Hadoop, are less-common services used to work with very large volumes of data. These are often mentioned in conversations about big data. And of course, the From Web option (demonstrated in Chapter 8) is an integral connection type for any analyst who leverages data from the internet.
Clicking any of these connections opens a set of dialog boxes customized for the selected connection. These dialog boxes ask for the basic parameters that Power Query needs in order to connect to the specified data source; parameters such as file path, URL, server name, and credentials.
Each connection type requires its own, unique set of parameters, so each of their dialog boxes is different. Luckily, Power Query rarely needs more than a handful of parameters to connect to any single data source, so the dialog boxes are relatively intuitive and hassle-free.
It would be redundant to walk through the process of connection to every type of database available. However, it would be useful to walk through the basic steps of connecting a database.
Here are the steps for connecting to one of the more ubiquitous database systems — Microsoft Access:
Browse for your target database. You can use the Facility Services.accdb database, found in the sample files for this book.
After Power Query connects to the database, the Navigator pane, shown in Figure 9-14, activates. There, you see all database objects available to you, including tables and views (or queries, in Access lingo).
Click the Sales_By_Employee view.
The Navigator pane displays a preview of the Sales_By_Employee data. If you want to transform or shape this data, click the Edit button. In this case, the data looks fine as is.
Click the Load button to complete the import.
After a bit of processing, Power Query loads the data to a new Excel worksheet and adds the new query to the Workbook Queries pane, as shown in Figure 9-15.
It’s a best practice to use views whenever possible. Views are often cleaner data sets because they’re already optimized to include only the columns and data that are necessary. (This improves query performance and helps minimize the workbook’s file size.) In addition, you don’t need to have an intimate knowledge of the database architecture. Someone with that knowledge has already done the work for you — joined the correct tables, applied the appropriate business rules, and optimized output, for example.
Every time you connect to any web-based data source or data source that requires some level of credentials, Power Query caches (stores) the settings for that data source.
Suppose that you connect to a SQL Server database, enter all your credentials, and import the data you need. At the moment of successful connection, Power Query caches information about that connection in a file located on your local PC. It includes the connection string, username, password, and privacy settings, for example.
The purpose of all this caching is so that you don’t have to reenter credentials every time you need to refresh your queries. That’s nifty, but what happens when your credentials are changed? Well, the short answer is those queries will fail until the data source settings are updated.
You can edit data source settings by activating the Data Source Settings dialog box. To do so, choose Data ⇒ New Query ⇒ Data Source Settings, as demonstrated in Figure 9-16.
The Data Source Settings dialog box, shown in Figure 9-17, contains a list of all credentials-based data sources previously used in queries. Select the data source you need to change, and then click the Edit button.
Another dialog box opens — this time, specific to the data source you selected (see Figure 9-18). This dialog box enables you to edit credentials as well as other data privacy settings.
Click the Edit button to make changes to the credentials for the data source. The credentials editing screen will differ based on the data source you’re working with, but again, the input dialog boxes are relatively intuitive and easy to update.
Power Query caches data source settings in a file located on your local PC. Even though you may have deleted a particular query, the data source setting is retained for possible future use. This can lead to a cluttered list of old and current data sources. You can clean out old items by selecting the data source in the Data Source Settings dialog box and clicking the Delete button.