CHAPTER 18
Multivariate Data Analysis

This chapter continues our review of methods for data analysis: the ways and tools we can use to get data to give up the insights we are seeking, insights that will enable us to make better decisions. The tools we will discuss in this chapter are parsimonious in output but huge in insight. Just a few statistics from each tool give us the direction we need. Used properly, these tools are very efficient and powerful.

Multivariate Analysis Procedures

The term multivariate analysis refers to the simultaneous analysis of multiple measurements on each individual or object being studied.¹ Some experts consider any simultaneous statistical analysis of more than two variables to be multivariate analysis. Multivariate analysis procedures are extensions of the univariate and bivariate statistical procedures discussed in Chapters 16 and 17.

A number of techniques fall under the heading of multivariate analysis procedures. In this chapter, we will consider five of these techniques:

Multiple regression analysis
Multiple discriminant analysis
Cluster analysis
Factor analysis
Conjoint analysis

Practicing Marketing Research

Data Scientist: The Sexiest Job of the Twenty-First Century²

Statisticians have been in high demand for many years due to the rapid increase in corporate databases throughout the world. However, the rise of social media and mobile applications has created a flood of information content that overwhelms traditional data storage and analyses. In May 2013, independent research organization SINTEF stated that “a full 90 percent of all the data in the world has been generated over the last two years. The Internet companies are awash with data that can be grouped and utilized.”³ Companies such as Yahoo, Google, Facebook, and eBay have had to invent new ways of capturing and processing the massive amount of information they receive on a daily basis. Much of this information lacks the form and structure of traditional databases, requiring new types of analysts and new methods of analysis. These analysts are referred to as data scientists because they work with a variety of different data elements in an exploratory fashion, much like an engineer or scientist. What once was megabytes and gigabytes of facts and figures is now terabytes, petabytes, and exabytes of data that include images, text, video, tweets, blogs, online and offline behaviors, global positioning coordinates, and so on.

One of the key tasks of a data scientist is to make sense of these disparate sources that cannot easily be linked together. All the raw data in the world, in and of itself, is not very useful until patterns and relationships can be identified. Data scientists must use cutting-edge tools, open-source programs, and custom programming to manipulate available information. Beyond these technical skills, they also must understand their company’s underlying business model in order to separate meaningful relationships from random patterns and coincidental correlations. Furthermore, data scientists need to be able to prepare clear and concise visuals that connect all these pieces into a coherent story that senior management can comprehend and act upon. Universities have not yet caught up to the requirements of this new role, so corporations are often training employees from within and drawing from other disciplines such as computer science, math, economics, biology, and other sciences. Introverts with limited social skills may be excellent database administrators or statisticians, but they may lack the communication skills required to interact with customers and various departments throughout an organization on their way to uncovering meaningful opportunities and solutions.

Google, for example, employs data scientists to continuously find new ways to deliver the right ads to the most receptive people to maximize advertising effectiveness and ad revenue. LinkedIn studies existing relationships and connections to recommend the best groups for members to join and to identify individuals they should add to their networks. Amazon is constantly revising its algorithms for product recommendations, ad placement, and special offers to provide maximum value to its customers. All these companies and thousands of others seek data scientists to address these questions and to find opportunities to address needs that are not yet served and often unrecognized at the time of their discovery.

As many data scientists point out, however, the use of statistics is not without its challenges. If not carefully analyzed, the massive amount of data available can overwhelm statistical models, and even strong statistical correlations in the data do not necessarily indicate a causal relationship. Still, as the explosion of available data continues, the ability to identify mathematically abnormal relationships in the data creates a wealth of opportunities. Organizations need to be able to properly explain these abnormalities though, and that’s why they need the best analysts and communicators. These skill sets are quickly becoming the kind you can take to the bank!

Questions

Where have you recently seen the use of statistics and data analysis where you might not have expected it? How was it being used?
Top companies are finding that many talented analysts and statisticians actually have backgrounds in other disciplines such as economics, mathematics, and computer sciences. How do you think these disciplines relate to and inform the approach to data analysis?

You may have been exposed to multiple regression analysis in introductory statistics courses. The remaining procedures are less widely studied. Summary descriptions of the techniques are provided in Exhibit 18.1.

EXHIBIT 18.1 Brief Descriptions of Multivariate Analysis Procedures

Multiple regression analysis	Enables the researcher to predict the level or magnitude of a dependent variable based on the levels of more than one independent variable.
Multiple discriminant analysis	Enables the researcher to predict group membership on the basis of two or more independent variables.
Cluster analysis	Is a procedure for identifying subgroups of individuals or items that are homogeneous within subgroups and different from other subgroups.
Factor analysis	Permits the analyst to reduce a set of variables to a smaller set of factors or composite variables by identifying underlying dimensions in the data.
Conjoint analysis	Provides a basis for estimating the utility that consumers associate with different product features or attributes.

Although awareness of multivariate techniques is far from universal, they have been around for decades and have been widely used for a variety of commercial purposes. Fair Isaac & Co. has built a $740 million business around the commercial use of multivariate techniques.⁴ The firm and its clients have found that they can predict with surprising accuracy who will pay their bills on time, who will pay late, and who will not pay at all. The federal government uses secret formulas, based on the firm’s analyses, to identify tax evaders. Fair Isaac has also shown that results from its multivariate analyses help in identifying the best sales prospects.

Multivariate Software

The computational requirements for the various multivariate procedures discussed in this chapter are substantial. As a practical matter, running the various types of analyses presented requires a computer and appropriate software. Until the late 1980s, most types of multivariate analyses discussed in this chapter were done on mainframe or minicomputers because personal computers were limited in power, memory, storage capacity, and range of software available. Those limitations are in the past. Personal computers have the power to handle just about any problem that a marketing researcher might encounter. Most problems can be solved in a matter of seconds, and a wide variety of outstanding software is available for multivariate analysis. SPSS is the package most widely used by professional marketing researchers.

SPSS includes a full range of software modules for integrated database creation and management, data transformation and manipulation, graphing, descriptive statistics, and multivariate procedures. It has an easy-to-use, graphical interface. Additional information on the SPSS product line can be found at http://www.spss.com/software/statistics and http://www.spss.com/software/modeler. A number of other useful resources are available at the SPSS site:

Technical support, product information, FAQs (frequently asked questions), various downloads, and product reviews
Examples of successful applications of multivariate analysis to solve real business problems
Discussions of data mining and data warehousing applications

As we move into discussing analytical techniques, don’t forget that we first have to capture the data that feed our models. This is often the bigger challenge, as discussed in the Practicing Marketing Research feature from the banking industry that follows. It is particularly true in the realm of Big Data: we employ many of the same analytical techniques discussed in this chapter, but only after the data have been captured and put into a form we can use. More on this issue later in the chapter.

Practicing Marketing Research

Mastering Data Management⁵

Certain banks are farther along than others in being able to drive customer acquisition through data and analytics, and those institutions have started to master their own data management, says Chandan Sharma, global managing director of financial services marketing for Verizon Enterprise Solutions.

“They’ve elevated data management to a high-level . . . they also recognize the importance of creating these functions around data management in the right place within their organization,” he observes. “It’s then easy to have a cross-enterprise view of the customer.”

Sioux Falls, South Dakota–based Great Western Bank ($9 billion in assets) is one example of an institution that has done extensive work around data management, and is now using that work as a stage to launch new customer acquisition and marketing initiatives.

“Our biggest challenge in using data and analytics for marketing and customer acquisition has been data quality, making sure that the codes are there for different documents and records,” says Ron Van Zanten, the bank’s vice president of data quality.

To address this issue, the bank created a data committee with members from different teams across the organization, Van Zanten shares. The committee, which reports up into Great Western’s business intelligence operations council, created standard definitions that teams across the organization now use for different tiers, pricing, and terms on accounts. Those definitions are now standardized across Great Western’s various systems, Van Zanten reports.

That data quality work has served to gain buy-in from employees across the organization by building trust in the bank’s data, Van Zanten adds. “If you have people in your organization who look at a report, and they say, ‘My loan numbers don’t match up,’ and that’s true, then it brings about doubt across the organization in what you’re doing,” he explains. “We can now validate our data, and the work that we’re doing with it.”

Gaining that buy-in across the organization isn’t always easy. Van Zanten’s team has been working to centralize Great Western’s data in its data warehouse. Sometimes parts of an organization can be reluctant to give up their data, he notes.

“We’ve taken the ‘source of truth’ from different silos and put it in our warehouse . . . some people have to give up the keys to their kingdoms. But this is now freeing up our staff to do new things, instead of producing the same old reports,” he remarks.

With this clean, well-defined data in hand, Great Western has started to cast a net for more profitable customers to help grow the bank. Great Western used to give away free checking to new customers in its branches, but now with the knowledge gained through its data management, the bank is working to entice customers who will have a more sticky relationship with the bank and buy value-add services, Van Zanten says.

Recently, Great Western started purchasing demographic data from Experian to add to its own data and started to build profiles of what profitable customers look like and how to market to them, Van Zanten reports.

“We’ve built a new system so when a new customer opens an account, we can see what similar customers like. If they open a debit card, we can offer things like direct deposit and push e-statements . . . and entice them to a more sticky relationship,” he adds.

The bank has also worked to better price its products for profitability by factoring in the cost of expenses on products into their price. “We can bake in the cost of funds transfers and operational costs, and take direct income and expenses on accounts, and then post those against savings and loans,” Van Zanten explains.

Great Western can now assign a numeric digit for onboarding a particular account, such as a consumer checking or a small business account, and fully understand the cost of servicing that account.

Multiple Regression Analysis

Researchers use multiple regression analysis when their goal is to examine the relationship between two or more metric predictor (independent) variables and one metric dependent (criterion) variable.⁶ Under certain circumstances, described later in this section, nominal predictor variables can be used if they are recoded as binary variables.

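Multiple regression analysis is an extension of bivariate regression, discussed in Chapter 17. Instead of fitting a straight line to observations in a two-dimensional space, multiple regression analysis fits a plane to observations in a multidimensional space. The output obtained and the interpretation are essentially the same as for bivariate regression. The general equation for multiple regression is as follows:

$$Y = a + b_1X_1 + b_2X_2 + \cdots + b_nX_n$$

where Y is the dependent (criterion) variable, a is the estimated constant, b₁ through bₙ are the coefficients associated with the predictor variables, and X₁ through Xₙ are the predictor (independent) variables.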

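For example, consider the following regression equation (in which b₁ and b₂ have been estimated by means of regression analysis and a denotes the estimated constant):

$$\hat{Y} = a + 17X_1 + 22X_2$$

where Ŷ is estimated sales in units, X₁ is advertising expenditures in dollars, and X₂ is the number of salespersons.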

This equation indicates that sales increase by 17 units for every $1 increase in advertising and 22 units for every one-unit increase in number of salespersons.
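The interpretation above can be checked with a small least-squares fit. The sketch below is illustrative only: the six observations are made up, the constant of 200 is an assumed value, and sales are generated without noise so that the fit recovers the coefficients exactly. The function names are hypothetical, and a statistical package would normally do this work.

```python
from fractions import Fraction

def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination in exact arithmetic."""
    n = len(rhs)
    # Augmented matrix [matrix | rhs], stored as exact fractions.
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Bring a row with a nonzero entry in this column into pivot position.
        pivot_row = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        # Clear this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def fit_multiple_regression(xs, ys):
    """Least-squares estimates [a, b1, ..., bn] from the normal equations."""
    rows = [[1] + list(x) for x in xs]  # the leading 1 carries the constant a
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical observations: (advertising dollars, number of salespersons).
observations = [(100, 2), (150, 3), (200, 3), (250, 5), (300, 4), (400, 6)]
# Sales generated without noise; the constant 200 is an assumed value.
sales = [200 + 17 * adv + 22 * reps for adv, reps in observations]

a, b1, b2 = fit_multiple_regression(observations, sales)

def predict(advertising, salespersons):
    """Estimated sales in units for given levels of the two predictors."""
    return a + b1 * advertising + b2 * salespersons

# One more advertising dollar, holding salespersons fixed, adds b1 units.
extra_units_per_dollar = predict(101, 2) - predict(100, 2)
print(a, b1, b2, extra_units_per_dollar)
```

Because each coefficient describes the change in sales for a one-unit change in its own variable with the other held fixed, the difference between the two predictions equals b₁.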

Applications of Multiple Regression Analysis

There are many possible applications of multiple regression analysis in marketing research:

Estimating the effects of various marketing mix variables on sales or market share.
Estimating the relationship between various demographic or psychographic factors and the frequency with which certain service businesses are visited.
Determining the relative influence of individual satisfaction elements on overall satisfaction.
Quantifying the relationship between various classification variables, such as age and income, and overall attitude toward a product or service.
Determining which variables are predictive of sales of a particular product or service.

Multiple regression analysis can serve one or a combination of two basic purposes: (1) predicting the level of the dependent variable, based on given levels of the independent variables, and (2) understanding the relationship between the independent variables and the dependent variable.

Multiple Regression Analysis Measures

In the discussion of bivariate regression in Chapter 17, a statistic referred to as the coefficient of determination, or R², was identified as one of the outputs of regression analysis. This statistic can assume values from 0 to 1 and provides a measure of the percentage of the variation in the dependent variable that is explained by variation in the independent variables. For example, if R² in a given regression analysis is calculated to be .75, this means that 75 percent of the variation in the dependent variable is explained by variation in the independent variables. The analyst would always like to have a calculated R² close to 1. Frequently, variables are added to a regression model to see what effect they have on the R² value.

As models get larger, with more independent (predictor) variables, it is wise to look at a variation of the R² statistic called adjusted R² as the measure of fit for a regression model. The standard R² value tends to increase with every predictor variable that is added to the model, regardless of whether that variable truly adds to the explanatory power of the model. The adjusted R² corrects the coefficient of determination based on the relationship between the number of predictor variables and the overall sample size, producing a more rational estimate of model fit when several independent variables are included. The adjusted R² will always be less than or equal to R². It is close to the standard measure when the amount of sample per independent variable is large, and it can even be negative when the sample size is very small and many predictors are included in the model.
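The adjustment is usually written as adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the sample size and k is the number of predictor variables. The short sketch below applies that common form to two made-up cases; the function name and the input values are illustrative assumptions, not results from any study.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    degrees_of_freedom = n_observations - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n_observations - 1) / degrees_of_freedom

# Plenty of sample per predictor: the adjustment is small.
large_sample = adjusted_r_squared(0.75, n_observations=100, n_predictors=3)

# Very small sample with many predictors: the adjusted value turns negative.
small_sample = adjusted_r_squared(0.30, n_observations=10, n_predictors=6)

print(large_sample, small_sample)
```

In the first case the adjusted value stays close to the unadjusted .75; in the second, six predictors on ten observations push it below zero.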

The b values, or regression coefficients, are estimates of the effect of individual independent variables on the dependent variable. It is appropriate to determine the likelihood that each individual b value is the result of chance. This calculation is part of the output provided by virtually all statistical software packages. Typically, these packages compute the probability of incorrectly rejecting the null hypothesis of bₙ = 0.

Dummy Variables

In some situations, the analyst needs to include nominally scaled independent variables such as gender, marital status, occupation, and race in a multiple regression analysis.

Photo: Multiple regression analysis can be used to estimate the relationship between various demographic or psychographic factors and the frequency with which a service business is hired.

Dummy variables can be created for this purpose. Dichotomous, nominally scaled independent variables can be transformed into dummy variables by coding one value (e.g., female) as 0 and the other (e.g., male) as 1. For nominally scaled independent variables that can assume more than two values, a slightly different approach is required. If there are K categories, K − 1 dummy variables are needed to uniquely identify every category. (Including K dummy variables would overidentify the model, since the last category is already represented by 0s on the previous K − 1 variables.) Consider a question regarding racial group with three possible answers: African American, Hispanic, or Caucasian. Binary or dummy variable coding of responses requires the use of two dummy variables, X₁ and X₂, which might be coded as follows:

	X₁	X₂
If person is African American	1	0
If person is Hispanic	0	1
If person is Caucasian	0	0
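The K − 1 coding scheme in the table can be sketched in a few lines. The function name is hypothetical, and treating the last listed category as the reference group (all zeros) is simply the convention used in the table above.

```python
def dummy_code(responses, categories):
    """Code each response as K - 1 dummy variables.

    The last entry in `categories` is the reference group and is
    represented by zeros on every dummy variable.
    """
    indicator_categories = categories[:-1]
    coded = []
    for response in responses:
        if response not in categories:
            raise ValueError(f"unknown category: {response!r}")
        coded.append([1 if response == c else 0 for c in indicator_categories])
    return coded

categories = ["African American", "Hispanic", "Caucasian"]
responses = ["African American", "Hispanic", "Caucasian"]

# Each row holds [X1, X2] for one respondent.
coded_rows = dummy_code(responses, categories)
print(coded_rows)
```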

Potential Use and Interpretation Problems

The analyst must be sensitive to certain problems that may be encountered in the use and interpretation of multiple regression analysis results. These problems are summarized in the following sections.

Collinearity

One of the key assumptions of multiple regression analysis is that the independent variables are not correlated (collinear) with each other.⁷ If they are correlated, the predicted Y value remains unbiased, but the estimated b values (regression coefficients) will have inflated standard errors and will be inaccurate and unstable. Larger than expected coefficients for some b values are compensated for by smaller than expected coefficients for others. This is why, with collinearity, you can get sign reversals and wide variations in b values and still produce reliable estimates of Y.

The simplest way to check for collinearity is to examine the matrix showing the correlations between each variable in the analysis. One rule of thumb is to look for correlations between independent variables of .30 or greater. If correlations of this magnitude exist, then the analyst should check for distortions of the b values. One way to do this is to run regressions with the two or more collinear variables included and then run regressions again with the individual variables. The b values in the regression with all variables in the equation should be similar to the b values computed for the variables run separately.
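The correlation screen described above can be sketched as follows. The three predictor series are made up for illustration, the function names are hypothetical, and comparing the absolute value of each correlation with the .30 rule of thumb is an assumption (so that strong negative correlations are flagged as well).

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

def flag_collinear_pairs(predictors, threshold=0.30):
    """Return (name, name, r) for every predictor pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical predictor values for five sales territories.
predictors = {
    "advertising": [10, 20, 30, 40, 50],
    "salespersons": [2, 4, 5, 8, 11],
    "price": [5, 3, 5, 3, 5],
}

flagged_pairs = flag_collinear_pairs(predictors)
for name_a, name_b, r in flagged_pairs:
    print(f"{name_a} and {name_b}: r = {r:.2f}")
```

A flagged pair is only a prompt for the follow-up check described above: rerun the regression with the variables entered separately and compare the b values.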

A number of strategies can be used to deal with collinearity. Two of the most commonly used strategies are (1) to drop one of the variables from the analysis if two variables are heavily correlated with each other and (2) to combine the correlated variables in some fashion (e.g., create an index or use factor analysis to combine related variables) to form a new composite independent variable, which can be used in subsequent regression analyses.

Causation

Although regression analysis can show that variables are associated or correlated with each other, it cannot prove causation. Causal relationships can be confirmed only by other means (see Chapter 9). A strong logical or theoretical basis must be developed to support the idea that a causal relationship exists between the independent variables and the dependent variable. However, even a strong logical base and supporting statistical results demonstrating correlation are only indicators of causation.

Standardizing Regression Coefficients

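The magnitudes of the regression coefficients associated with the various independent variables can be compared directly only if the scaling of coefficients is in the same units or if the data have been standardized. Consider an estimated equation in which the coefficients for advertising expenditures in dollars (X₁) and number of salespersons (X₂) are identical:

$$\hat{Y} = a + bX_1 + bX_2$$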

At first glance, it appears that an additional dollar spent on advertising and another salesperson added to the salesforce have equal effects on sales. However, this is not true because X₁ and X₂ are measured in different kinds of units. Direct comparison of regression coefficients requires that all independent variables be measured in the same units (e.g., dollars or thousands of dollars) or that the data be standardized. Standardization is achieved by taking each number in a series, subtracting the mean of the series from the number, and dividing the result by the standard deviation of the series. This process converts any set of numbers to a new set with a mean of 0 and a standard deviation of 1. The formula for the standardization process is as follows:

$$z_i = \frac{X_i - \bar{X}}{S}$$

where Xᵢ is an individual number from the series, X̄ is the mean of the series, and S is the standard deviation of the series.
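A brief sketch of the standardization step, using two made-up series measured in very different units. The use of the sample standard deviation (`statistics.stdev`) is an assumption; the text does not say whether the sample or population form is intended.

```python
from statistics import mean, stdev

def standardize(series):
    """Subtract the series mean and divide by its standard deviation."""
    series_mean = mean(series)
    series_sd = stdev(series)  # sample standard deviation (an assumption)
    return [(value - series_mean) / series_sd for value in series]

# Hypothetical data: advertising in dollars, salesforce in head count.
advertising_dollars = [1000, 1500, 2000, 2500, 3000]
salespersons = [2, 3, 4, 5, 6]

z_advertising = standardize(advertising_dollars)
z_salespersons = standardize(salespersons)

print([round(z, 3) for z in z_advertising])
print([round(z, 3) for z in z_salespersons])
```

Although the raw series differ by orders of magnitude, both are now expressed in standard deviation units with a mean of 0 and a standard deviation of 1, so coefficients estimated on them can be compared directly.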

Sample Size

The value of R² is influenced by the number of predictor variables relative to sample size.⁸ Several different rules of thumb have been proposed; they suggest that the number of observations should be equal to at least 10 to 15 times the number of predictor variables. For the preceding example (sales volume as a function of advertising expenditures and number of salespersons) with two predictor variables, a minimum of 20 to 30 observations would be required.
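The rule of thumb reduces to simple arithmetic, sketched here with a hypothetical helper; the multiples of 10 and 15 are the ones quoted above.

```python
def minimum_observations(n_predictors, low_multiple=10, high_multiple=15):
    """Range of minimum sample sizes suggested by the 10-to-15 rule of thumb."""
    return n_predictors * low_multiple, n_predictors * high_multiple

# Two predictors: advertising expenditures and number of salespersons.
low, high = minimum_observations(2)
print(f"At least {low} to {high} observations")
```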

Multiple Discriminant Analysis

Although multiple discriminant analysis is similar to multiple regression analysis,⁹ there are important differences. In the case of multiple regression analysis, the dependent variable must be metric; in multiple discriminant analysis, the dependent variable is nominal or categorical in nature. For example, the dependent variable might be usage status for a particular product or service. A particular respondent who uses the product or service might be assigned a code of 1 for the dependent variable, and a respondent who does not use it might be assigned a code of 2. Independent variables might include various metric measures, such as age, income, and number of years of education. The goals of multiple discriminant analysis are as follows:

Determine if there are statistically significant differences between the average discriminant score profiles of two (or more) groups (in this case, users and nonusers).
Establish a model for classifying individuals or objects into groups on the basis of their values on the independent variables. The resulting table of actual versus predicted group membership is called a classification matrix.
Determine how much of the difference in the average score profiles of the groups is accounted for by each independent variable.

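The general discriminant analysis equation follows:

$$Z = b_1X_1 + b_2X_2 + \cdots + b_nX_n$$

where Z is the discriminant score, b₁ through bₙ are the discriminant weights, and X₁ through Xₙ are the independent variables.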

The discriminant score, usually referred to as the Z score, is the score derived for each individual or object by means of the equation. This score is the basis for predicting the group to which the particular object or individual belongs. Discriminant weights, often referred to as discriminant coefficients, are computed by means of the discriminant analysis program. The size of the discriminant weight (or coefficient) associated with a particular independent variable is determined by the variance structure of the variables in the equation. Independent variables with large discriminatory power (large differences between groups) have large weights and those with little discriminatory power have small weights.

The goal of discriminant analysis is the prediction of a categorical variable. The analyst must decide which variables would be expected to be associated with the probability of a person or object falling into one of two or more groups or categories. In a statistical sense, the problem of analyzing the nature of group differences involves finding a linear combination of independent variables (the discriminant function) that shows large differences in group means. Multiple discriminant analysis outperforms multiple regression analysis in some applications where they are both appropriate.
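The idea of a linear combination that separates group means can be illustrated with a two-group, two-variable sketch. It uses Fisher's approach (weights proportional to the inverse pooled covariance matrix times the difference in group means), which is one common way to obtain discriminant weights and is not necessarily what a given package does. The respondents, their ages and incomes, the midpoint cutoff, and all function names are assumptions made for illustration.

```python
def mean_vector(rows):
    """Mean of each variable across the rows of one group."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]

def pooled_covariance(group_a, group_b):
    """Pooled within-group covariance matrix for two predictors (2 x 2)."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (group_a, group_b):
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    total[i][j] += d[i] * d[j]
    dof = len(group_a) + len(group_b) - 2
    return [[v / dof for v in row] for row in total]

def discriminant_weights(group_a, group_b):
    """Fisher weights: inverse pooled covariance times (mean_a - mean_b)."""
    s = pooled_covariance(group_a, group_b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    b1 = (s[1][1] * diff[0] - s[0][1] * diff[1]) / det
    b2 = (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det
    return [b1, b2]

def z_score(weights, x):
    """Discriminant score Z = b1*X1 + b2*X2."""
    return sum(b * v for b, v in zip(weights, x))

# Hypothetical respondents: (age, income in $000). Code 1 = user, 2 = nonuser.
users = [(25, 60), (30, 72), (28, 65), (35, 80), (32, 70)]
nonusers = [(45, 40), (50, 38), (48, 45), (55, 50), (52, 42)]

weights = discriminant_weights(users, nonusers)
# Cutoff halfway between the two groups' average Z scores (assumes equal priors).
cutoff = (z_score(weights, mean_vector(users))
          + z_score(weights, mean_vector(nonusers))) / 2

# Classification matrix: rows are actual groups, columns are predicted groups.
matrix = {1: {1: 0, 2: 0}, 2: {1: 0, 2: 0}}
for actual, rows in ((1, users), (2, nonusers)):
    for x in rows:
        predicted = 1 if z_score(weights, x) >= cutoff else 2
        matrix[actual][predicted] += 1

print(weights, cutoff, matrix)
```

The diagonal of the classification matrix counts respondents assigned to their actual group. Because the same respondents are used to estimate the weights and to test them, this in-sample hit rate will tend to flatter the model.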

Applications of Multiple Discriminant Analysis

Discriminant analysis can be used to answer many questions in marketing research:

How are consumers who purchase various brands different from those who do not purchase those brands?
How do we target likely buyers for a new product from our database of existing customers in order to conduct the most effective prelaunch marketing campaign?
How do consumers who frequent one fast-food restaurant differ in demographic and lifestyle characteristics from those who frequent another fast-food restaurant?
How do consumers who have chosen either indemnity insurance, HMO coverage, or PPO coverage differ from one another in regard to healthcare use, perceptions, and attitudes?

Cluster Analysis

The term cluster analysis generally refers to statistical procedures used to identify objects or people that are similar in regard to certain variables or measurements. The purpose of cluster analysis is to classify objects or people into some number of mutually exclusive and exhaustive groups so that those within a group are as similar as possible to one another (this is true in general, but techniques such as fuzzy clustering compute probabilities of membership rather than assigning records uniquely to a single group).¹⁰ In other words, clusters should be homogeneous internally (within cluster) and heterogeneous externally (between clusters).

Procedures for Clustering

A number of different procedures (based on somewhat different mathematical and computer routines) are available for clustering people or objects. Some examples of clustering techniques include K-means, two-stage, nearest neighbor, decision trees, ensemble analysis, random forest, BIRCH, and self-organizing neural networks. However, the general approach underlying all of these procedures involves measuring the similarities among people or objects in regard to their values on the variables used for clustering.¹¹ Similarities among the people or objects being clustered are normally determined on the basis of some type of distance measure. This approach is best illustrated graphically. Suppose an analyst wants to group, or cluster, consumers on the basis of two variables: monthly frequency of eating out and monthly frequency of eating at fast-food restaurants. Observations on the two variables are plotted in a two-dimensional graph in Exhibit 18.2. Each dot indicates the position of one consumer in regard to the two variables. The distance between any pair of points is inversely related to how similar the corresponding individuals are when the two variables are considered together (the closer the dots, the more similar the individuals). In Exhibit 18.2, consumer X is more like consumer Y than like either Z or W.
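One common distance measure is straight-line (Euclidean) distance. In the sketch below, the coordinates assigned to consumers X, Y, Z, and W are invented for illustration and are not read from Exhibit 18.2.

```python
from math import dist

# Hypothetical (eating out per month, fast-food visits per month).
consumers = {
    "X": (4, 1),
    "Y": (5, 2),
    "Z": (12, 3),
    "W": (14, 12),
}

# Euclidean distance from X to every other consumer.
distances_from_x = {
    name: dist(consumers["X"], point)
    for name, point in consumers.items()
    if name != "X"
}

# The smallest distance identifies the consumer most similar to X.
most_similar_to_x = min(distances_from_x, key=distances_from_x.get)
print(distances_from_x, most_similar_to_x)
```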

EXHIBIT 18.2 Cluster Analysis Based on Two Variables

Inspection of Exhibit 18.2 suggests that three distinct clusters emerge on the basis of simultaneously considering frequency of eating out and frequency of eating at fast-food restaurants:

Cluster 1 includes those people who neither eat out frequently nor eat at fast-food restaurants frequently.
Cluster 2 includes consumers who frequently eat out but seldom eat at fast-food restaurants.
Cluster 3 includes people who frequently eat out and also frequently eat at fast-food restaurants.

The fast-food company can see that its best targets are to be found among those who, in general, eat out frequently and eat at fast-food restaurants specifically. To provide more insight for the client, the analyst should develop demographic, psychographic, and behavioral profiles of consumers in cluster 3.

Photo: Clustering people according to how frequently and where they eat out is a way of identifying a particular consumer base. An upscale restaurant can see that its customers fall into cluster 2 and possibly cluster 3 in Exhibit 18.2.

As shown in Exhibit 18.2, clusters can be developed from scatterplots. However, this time-consuming, trial-and-error procedure becomes more tedious as the number of variables used to develop the clusters or the number of objects or persons being clustered increases. You can readily visualize a problem with two variables and fewer than 100 objects. Once the number of variables increases to three and the number of observations increases to 500 or more, visualization becomes impossible. Fortunately, computer algorithms are available to perform this more complex type of cluster analysis. The mechanics of these algorithms are complicated and beyond the scope of this discussion. The basic idea behind most of them is to start with some arbitrary cluster boundaries and modify the boundaries until a point is reached where the average interpoint distances within clusters are as small as possible relative to average distances between clusters.
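The basic idea of starting from arbitrary cluster positions and adjusting them can be sketched with a bare-bones K-means routine. This is a minimal illustration rather than a production algorithm: the twelve consumers and the three starting centroids are made up, and the function name is hypothetical.

```python
from math import dist

def k_means(points, centroids, max_iterations=100):
    """Alternate between assigning points and moving centroids until stable."""
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # no centroid moved, so the assignments are final
        centroids = new_centroids
    return centroids, clusters

# Hypothetical (eating out per month, fast-food visits per month).
consumers = [
    (2, 1), (3, 1), (2, 2), (4, 2),          # seldom eat out
    (14, 2), (15, 3), (16, 2), (13, 1),      # eat out often, little fast food
    (15, 12), (16, 14), (14, 13), (17, 12),  # eat out often, mostly fast food
]

# Arbitrary starting positions for three clusters.
starting_centroids = [(0, 0), (10, 0), (10, 10)]

final_centroids, final_clusters = k_means(consumers, starting_centroids)
print(final_centroids)
print([len(members) for members in final_clusters])
```

With data this cleanly separated the routine settles after two passes. Real survey data are rarely so tidy, and the solution can depend on the starting positions and on the number of clusters requested.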

Additional discussion of using cluster analysis for market segmentation is provided in the accompanying Practicing Marketing Research feature.

Practicing Marketing Research

How to Segment a Market Using Cluster Analysis

Mike Foytik, Chief Technical Officer, DSS Research

Although many alternative statistical techniques have come and gone over the years, traditional K-means cluster analysis has proven to be a reliable and efficient way of segmenting any market. Cluster analysis is the one approach that is readily available in all statistical packages and provides satisfactory results in all but the most extreme circumstances.

Before running cluster analysis, or any other form of segmentation, you must first determine what “basis” variables should be chosen to define the segments. Your segmentation solution can only be as good as the variables you choose to build that solution. In order to select the right basis variables, review the overall objectives of the segmentation. Whatever insights you hope to uncover in the marketplace or characteristics you are trying to reveal must be present in your basis variables. If the objective is to identify the best customers for a new product you are offering, product preference must be included as a basis variable along with attitudes or demographic characteristics that might be correlated with their product preference.

Research objectives, past experience, knowledge of the market, qualitative research, and analysis of the available data may all be used to identify the best basis variable candidates. Unless you have a very specific, narrowly defined objective for segmenting your market, it is generally better to start with a wide array of basis variables and narrow the selection through analyses and testing.

Once your basis variables are selected, various data manipulation and transformations are needed to prepare your dataset for cluster analysis. If using basis variables with different rating scales, standardization is in order to prevent a larger scale from dominating the clusters. Dummy variables, log and nonlinear transformations, and composite variables may need to be created to accentuate patterns in the data or to rescale items with large magnitude. You should look at the response distribution of potential basis variables to ensure there is enough variation on each item to effectively differentiate respondents.
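To see why standardization matters, consider the sketch below. It converts two hypothetical basis variables measured on very different scales (a 1-to-5 rating and income in thousands of dollars) to z-scores. The variable names and values are invented for illustration, and the use of the population standard deviation is an assumption; some packages use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert raw values to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        # No variation: this item cannot differentiate respondents.
        raise ValueError("variable has no variation")
    return [(v - m) / s for v in values]

# Hypothetical basis variables for five respondents.
satisfaction = [1, 2, 3, 4, 5]        # 1-5 rating scale
income = [20, 40, 60, 80, 100]        # thousands of dollars

z_satisfaction = standardize(satisfaction)
z_income = standardize(income)
print(z_satisfaction)
print(z_income)
```

After standardization the two variables are on the same footing, so neither dominates the distance calculations used to form clusters.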

The more items included as basis variables related to a common theme, the more the resulting solution will be focused on that theme. To avoid including too many related items as basis variables, factor analysis or correlation analysis can be used to select a representative subset from a large array of potential basis variables. We have found it more beneficial to select a subset of raw variables from a factor solution rather than using the factor scores themselves as basis variables.

Once your basis variables are determined, running a cluster analysis is very easy. Just input the basis variables, select the number of clusters you wish to identify in your data and you are on your way. However, evaluating the results and identifying the best solution is a skill that can only be developed through trial and error. The better you understand what you are looking for, the more likely you are to arrive at a solution that meets your needs. There are different metrics that can help you determine how many clusters/segments should be in your final solution, but there is no single criterion that gives the absolute best results every time.

You can narrow your set of potential solutions by using heuristics such as requiring a minimum segment size that is at least 10 percent of your overall sample and opting for the solution with the fewest segments when there does not appear to be a clear-cut advantage between two or more options. If an extremely small segment (usually less than 1 percent of your sample) keeps appearing in your output, consider treating the members of that segment as outliers and removing them from the analysis. To modify an existing solution that does not quite meet your objectives, you have several options: swap some basis variables with other correlated measures; remove items that do not appear to be contributing to the overall results; add related items to strengthen the impact of a particular characteristic you believe is underrepresented in the solution; or search for new basis variables to realign the segments along a dimension not showing up in the current solutions.

Analysis of variance (ANOVA) is an excellent tool for evaluating the results of any segmentation solution being considered. First, use ANOVA to ensure you have sufficient variation among all your basis variables. Then apply ANOVA to all relevant survey questions and external data points to determine how well the solution differentiates respondents on all items of interest. By highlighting the highest and lowest items on each survey question that produce significant variation across segments (high F-value on ANOVA test), you can quickly focus your attention on the items that best differentiate and define each segment. Once you can attach a meaningful name or persona to each segment and those segments address your overall objectives, you have a solution worth considering.
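The F-value mentioned above can be computed directly for any survey item once respondents have been assigned to segments. The following sketch calculates a one-way ANOVA F statistic for two hypothetical survey items across three hypothetical segments. It returns only the F-value, not a significance level, and the data are invented for illustration.

```python
def group_mean(values):
    return sum(values) / len(values)

def anova_f(groups):
    """One-way ANOVA F statistic: between-segment variance divided by
    within-segment variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = group_mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (group_mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - group_mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical ratings on two survey items, split by segment.
item_a = [[5, 5, 4], [2, 1, 2], [3, 3, 4]]   # segment means differ sharply
item_b = [[3, 4, 2], [3, 2, 4], [4, 3, 2]]   # segment means are identical

f_a = anova_f(item_a)
f_b = anova_f(item_b)
print(f_a, f_b)
```

An item like the first, with a high F-value, helps define the segments. An item like the second does not differentiate them at all.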

Don’t give up too quickly if a meaningful solution is not immediately identified. Try lots of data runs, rethink your basis variables and look for ways to pull parts of one solution you like together with segments from another solution that has positive traits.

Factor Analysis

The purpose of factor analysis is data simplification.¹² The objective is to summarize the information contained in a large number of metric measures (e.g., rating scales) with a smaller number of summary measures, called factors. As with cluster analysis, there is no dependent variable.

Many phenomena of interest to marketing researchers are actually composites, or combinations, of a number of measures. These concepts are often measured by means of rating questions. For instance, in assessing consumer response to a new automobile, a general concept such as “luxury” might be measured by asking respondents to rate different cars on attributes such as “quiet ride,” “smooth ride,” or “plush carpeting.” The product designer wants to produce an automobile that is perceived as luxurious but knows that a variety of features probably contribute to this general perception. Each attribute rated should measure a slightly different facet of luxury. The set of measures should provide a better representation of the concept than a single global rating of “luxury.”

Several measures of a concept can be added together to develop a composite score or to compute an average score on the concept. Exhibit 18.3 shows data on six consumers who each rated an automobile on four characteristics. You can see that those respondents who gave higher ratings on “smooth ride” also tended to give higher ratings on “quiet ride.” A similar pattern is evident in the ratings of “acceleration” and “handling.” These four measures can be combined into two summary measures by averaging the pairs of ratings. The resulting summary measures might be called “luxury” and “performance” (see Exhibit 18.4).

EXHIBIT 18.3 Importance Ratings of Luxury Automobile Features

Respondent	Smooth Ride	Quiet Ride	Acceleration	Handling
Bob	5	4	2	1
Roy	4	3	2	1
Hank	4	3	3	2
Janet	5	5	2	2
Jane	4	3	2	1
Ann	5	5	3	2
Average	4.50	3.83	2.33	1.50

EXHIBIT 18.4 Average Ratings of Two Factors

Respondent	Luxury	Performance
Bob	4.5	1.5
Roy	3.5	1.5
Hank	3.5	2.5
Janet	5.0	2.0
Jane	3.5	1.5
Ann	5.0	2.5
Average	4.17	1.92

Factor Scores

Factor analysis produces one or more factors, or composite variables, when applied to a number of variables. A factor, technically defined, is a linear combination of variables. It is a weighted summary score of a set of related variables, similar to the composite derived by averaging the measures. However, in factor analysis, each measure is first weighted according to how much it contributes to the variation of each factor.

In factor analysis, a factor score is calculated on each factor for each subject in the data set. For example, in a factor analysis with two factors, the following equations might be used to determine factor scores:

With these formulas, two factor scores can be calculated for each respondent by substituting the ratings she or he gave on variables A₁ through A₄ into each equation. The coefficients in the equations are the factor scoring coefficients to be applied to each respondent’s ratings. For example, Bob’s factor scores (using his ratings from Exhibit 18.3) are computed as follows:

In the first equation, the factor scoring coefficients, or weights, for A₁ and A₂ (.40 and .30) are large, whereas the weights for A₃ and A₄ are small. The small weights on A₃ and A₄ indicate that these variables contribute little to score variations on factor 1 (F₁). Regardless of the ratings a respondent gives to A₃ and A₄, they have little effect on his or her score on F₁. However, variables A₃ and A₄ make a large contribution to the second factor score (F₂), whereas A₁ and A₂ have little effect. These two equations show that variables A₁ and A₂ are relatively independent of A₃ and A₄ because each variable takes on large values in only one scoring equation.
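The scoring calculation can be illustrated with a short sketch using the ratings in Exhibit 18.3. Only the weights of .40 and .30 for A₁ and A₂ on factor 1 are taken from the discussion above. Every other coefficient in the code is an assumed value chosen to be consistent with the description (small weights for A₃ and A₄ on factor 1, large weights for them on factor 2).

```python
# Ratings from Exhibit 18.3, in the order
# (A1 smooth ride, A2 quiet ride, A3 acceleration, A4 handling).
ratings = {
    "Bob": (5, 4, 2, 1), "Roy": (4, 3, 2, 1), "Hank": (4, 3, 3, 2),
    "Janet": (5, 5, 2, 2), "Jane": (4, 3, 2, 1), "Ann": (5, 5, 3, 2),
}

# Factor scoring coefficients. Only 0.40 and 0.30 (A1 and A2 on factor 1)
# come from the text; all other weights are assumed for illustration.
weights_f1 = (0.40, 0.30, 0.02, 0.05)
weights_f2 = (0.01, 0.04, 0.45, 0.37)

def factor_score(values, weights):
    """Weighted sum of a respondent's ratings."""
    return sum(w * v for w, v in zip(weights, values))

scores = {
    name: (factor_score(r, weights_f1), factor_score(r, weights_f2))
    for name, r in ratings.items()
}
for name, (f1, f2) in scores.items():
    print(f"{name}: F1 = {f1:.2f}, F2 = {f2:.2f}")
```

Because the weights on A₃ and A₄ in the first equation are small, changing a respondent's ratings on those two variables barely moves his or her score on factor 1.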

The relative sizes of the scoring coefficients are also of interest. Variable A₁ (with a weight of .40) is a more important contributor to factor 1 variation than is A₂ (with a smaller weight of .30). This finding may be very important to the product designer when evaluating the implications of various design changes. For example, the product manager might want to improve the perceived luxury of the car through product redesign or advertising. The product manager may know, based on other research, that a certain expenditure on redesign will result in an improvement of the average rating on “smooth ride” from 4.3 to 4.8. This research may also show that the same expenditure will produce a half-point improvement in ratings on “quiet ride.” The factor analysis shows that perceived luxury will be enhanced to a greater extent by increasing ratings on “smooth ride” than by increasing ratings on “quiet ride” by the same amount.

Factor Loadings

The nature of the factors derived can be determined by examining the factor loadings. Using the scoring equations presented earlier, a pair of factor scores (F₁ and F₂) is calculated for each respondent. Factor loadings are determined by calculating the correlation (from −1 to +1) between each factor score (F₁ and F₂) and each of the original ratings variables. Each correlation coefficient represents the loading of the associated variable on the particular factor. If A₁ is closely associated with factor 1, the loading or correlation will be high, as shown for the sample problem in Exhibit 18.5. Because the loadings are correlation coefficients, values near +1 indicate a close positive association and values near −1 a close negative association. Variables A₁ and A₂ are closely associated (highly correlated) with scores on factor 1, and variables A₃ and A₄ are closely associated with scores on factor 2.

EXHIBIT 18.5 Factor Loadings for Two Factors

	Correlation with
Variable	Factor 1	Factor 2
A₁	.85	.10
A₂	.76	.06
A₃	.06	.89
A₄	.04	.79

Stated another way, variables A₁ and A₂ have high loadings on factor 1 and serve to define the factor; variables A₃ and A₄ have high loadings on and define factor 2.
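A loading is simply a correlation, so it can be computed by hand once factor scores exist. The sketch below does this for the six respondents in Exhibit 18.3. The scoring weights are assumed for illustration (only .40 and .30 appear in the text), and with so few respondents the resulting loadings will not match the values in Exhibit 18.5. The point is the mechanics, not the numbers.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Rows are respondents (Exhibit 18.3); columns are A1..A4.
ratings = [(5, 4, 2, 1), (4, 3, 2, 1), (4, 3, 3, 2),
           (5, 5, 2, 2), (4, 3, 2, 1), (5, 5, 3, 2)]

# Assumed scoring weights (illustrative; only 0.40 and 0.30 are from the text).
weights = {"F1": (0.40, 0.30, 0.02, 0.05), "F2": (0.01, 0.04, 0.45, 0.37)}

# Factor scores for every respondent on each factor.
scores = {f: [sum(w * v for w, v in zip(ws, row)) for row in ratings]
          for f, ws in weights.items()}

# Loading = correlation between an original variable and a factor's scores.
columns = list(zip(*ratings))
loadings = {f"A{i + 1}": (pearson(col, scores["F1"]), pearson(col, scores["F2"]))
            for i, col in enumerate(columns)}

for variable, (l1, l2) in loadings.items():
    print(f"{variable}: factor 1 = {l1:.2f}, factor 2 = {l2:.2f}")
```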

Naming Factors

Once each factor’s defining variables have been identified, the next step is to name the factors. This is a somewhat subjective step, combining intuition and knowledge of the variables with an inspection of the variables that have high loadings on each factor. Usually, a certain consistency exists among the variables that load highly on a given factor. For instance, it is not surprising to see that the ratings on “smooth ride” and “quiet ride” both load on the same factor. Although we have chosen to name this factor “luxury,” another analyst, looking at the same result, might decide to name the factor “prestige.”

Number of Factors to Retain

In factor analysis, the analyst is confronted with a decision regarding how many factors to retain. The final result can include from one factor to as many factors as there are variables. The decision is often made by looking at the percentage of the variation in the original data that is explained by each factor.

There are many different decision rules for choosing the number of factors to retain. Probably the most appropriate decision rule is to stop factoring when additional factors no longer make sense. The first factors extracted are likely to exhibit logical consistency; later factors are usually harder to interpret, for they are more likely to contain a large amount of random variation.

Conjoint Analysis

Conjoint analysis is a popular multivariate procedure used by marketers to help determine what features a new product or service should include and how it should be priced. It can be argued that conjoint analysis has become popular because it is a more powerful, more flexible, and often less expensive way to address these important issues than is the traditional concept testing approach.¹³

Conjoint analysis is not a completely standardized procedure.¹⁴ A typical conjoint analysis application involves presenting various product or service combinations in a carefully controlled exercise, then estimating the relative value of each feature tested based on how people reacted to the different combinations presented. “Reactions” may be captured as rankings, rating, likelihood to purchase or by some other means depending on the approach being used. The type of conjoint approach (e.g., ratings-based, discrete choice, graded pairs, dual choice, full profile, partial profile, adaptive choice, etc.) affects how the exercise is presented and what statistical procedures are most appropriate for analyzing the results. Fortunately, conjoint analysis is not difficult to understand conceptually, as we demonstrate in the following example concerning the attributes of golf balls.

Example of Conjoint Analysis

Put yourself in the position of a product manager for Titleist, a major manufacturer of golf balls. From focus groups recently conducted, past research studies of various types, and your own personal experience as a golfer, you know that golfers tend to evaluate golf balls in terms of three important features or attributes: average driving distance, average ball life, and price.

You also recognize a range of feasible possibilities for each of these features or attributes, as follows:

Average driving distance
- 10 yards more than the golfer’s average
- Same as the golfer’s average
- 10 yards less than the golfer’s average
Average ball life
- 54 holes
- 36 holes
- 18 holes
Price per ball
- $2.00
- $2.50
- $3.00

From the perspective of potential purchasers, these attributes have a natural order (i.e., longer distance and longer ball life are always preferred over shorter options), so we can easily identify the ideal configuration. This is not always the case when dealing with attributes such as brand, physical appearance, or color. For this example, the consumer’s ideal golf ball would have the following characteristics:

Average driving distance—10 yards above average
Average ball life—54 holes
Price—$2.00

From the manufacturer’s perspective, which is based on manufacturing cost, the ideal golf ball would probably have these characteristics:

Average driving distance—10 yards below average
Average ball life—18 holes
Price—$3.00

Photo illustration of a golf ball near the hole, with a golfer holding a club in the background. Caption: Conjoint analysis could be used by a manufacturer of golf balls to determine the relative importance of these three features of a golf ball and to see which ball meets the most needs of both consumer and manufacturer.

This golf ball profile is based on the fact that it costs less to produce a ball that travels a shorter distance and has a shorter life. The company confronts the eternal marketing dilemma: the company would sell a lot of golf balls, but would go broke if it produced and sold the ideal ball from the golfer’s perspective. However, the company would sell very few balls if it produced and sold the ideal ball from the manufacturer’s perspective. As always, the “best” golf ball from a business perspective lies somewhere between the two extremes.

A traditional approach to this problem might produce information of the type displayed in Exhibit 18.6. As you can see, this information does not provide new insights regarding which ball should be produced. The preferred driving distance is 10 yards above average and the preferred average ball life is 54 holes. These results are obvious without any additional research.

EXHIBIT 18.6 Traditional Nonconjoint Rankings of Distance and Ball Life Attributes

Average Driving Distance		Average Ball Life
Rank	Level	Rank	Level
1	275 yards	1	54 holes
2	250 yards	2	36 holes
3	225 yards	3	18 holes

Considering Features Conjointly

In conjoint analysis, rather than having respondents evaluate features individually, the analyst asks them to evaluate features conjointly or in combination so that advantages for one attribute can only be chosen at the expense of another attribute. The results of asking two different golfers to rank different combinations of “average driving distance” and “average ball life” conjointly are shown in Exhibits 18.7 and 18.8.

EXHIBIT 18.7 Conjoint Rankings of Combinations of Distance and Ball Life for Golfer 1

	Ball Life
Distance	54 holes	36 holes	18 holes
275 yards	1	2	4
250 yards	3	5	7
225 yards	6	8	9

EXHIBIT 18.8 Conjoint Rankings of Combinations of Distance and Ball Life for Golfer 2

	Ball Life
Distance	54 holes	36 holes	18 holes
275 yards	1	3	6
250 yards	2	5	8
225 yards	4	7	9

As expected, both golfers agree on the most and least preferred balls. However, analysis of their second through eighth rankings makes it clear that the first golfer is willing to trade off ball life for distance (accept a shorter ball life for longer distance), while the second golfer is willing to trade off distance for longer ball life (accept shorter distance for a longer ball life).

This type of information is the essence of the special insight offered by conjoint analysis. The technique permits marketers to see which product attribute or feature potential customers are willing to trade off (accept less of) to obtain more of another attribute or feature. People make these kinds of purchasing decisions every day (e.g., they may choose to pay a higher price for a product at a local market for the convenience of shopping there).

Estimating Utilities

The next step is to calculate a set of values, or utilities, for the three levels of price, the three levels of driving distance, and the three levels of ball life in such a way that, when they are combined in a particular mix of price, ball life, and driving distance, they predict each golfer’s rank order for that combination. Estimated utilities for golfer 1 are shown in Exhibit 18.9. As you can see, this set of numbers reproduces the original rankings closely: ordering the nine combined utilities from highest to lowest matches golfer 1’s ranks in every cell except the sixth- and seventh-ranked balls (225 yards with 54 holes, utility 50; 250 yards with 18 holes, utility 60), which are transposed. The relationship among these numbers or utilities is fixed, though there is some arbitrariness in their magnitude or scale. In other words, the utilities shown in Exhibit 18.9 can be multiplied or divided by any positive constant and the same relative results will be obtained. Utilities for this simple example can be computed using ordinary least squares regression, but the exact procedures for estimating utilities of more complex exercises are beyond the scope of this discussion. They are normally calculated by using procedures related to regression, analysis of variance, linear programming, logit, or hierarchical Bayes analysis.

EXHIBIT 18.9 Ranks (in parentheses) and Combined Metric Utilities for Golfer 1—Distance and Ball Life

	Ball Life
Distance	54 holes	36 holes	18 holes
275 yards	(1)	(2)	(4)
	150	125	100
250 yards	(3)	(5)	(7)
	110	85	60
225 yards	(6)	(8)	(9)
	50	25	0
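The combined utilities in Exhibit 18.9 are additive: each cell is the sum of one value for the distance level and one for the ball-life level. The sketch below uses the individual values that reproduce those sums (100, 60, and 0 for distance; 50, 25, and 0 for ball life), rebuilds the table, converts the sums to a predicted rank order, and lists any cells where the predicted rank differs from the rank golfer 1 actually gave. The variable names are illustrative.

```python
# Utilities for each attribute level (these sum to the cells of Exhibit 18.9).
distance = {"275 yards": 100, "250 yards": 60, "225 yards": 0}
ball_life = {"54 holes": 50, "36 holes": 25, "18 holes": 0}

# Golfer 1's stated ranks from Exhibit 18.7.
stated_rank = {
    ("275 yards", "54 holes"): 1, ("275 yards", "36 holes"): 2, ("275 yards", "18 holes"): 4,
    ("250 yards", "54 holes"): 3, ("250 yards", "36 holes"): 5, ("250 yards", "18 holes"): 7,
    ("225 yards", "54 holes"): 6, ("225 yards", "36 holes"): 8, ("225 yards", "18 holes"): 9,
}

# Combined utility of each ball = distance utility + ball-life utility.
combined = {(d, l): distance[d] + ball_life[l] for d in distance for l in ball_life}

# Predicted rank: 1 for the highest combined utility, 9 for the lowest.
by_utility = sorted(combined, key=combined.get, reverse=True)
predicted_rank = {combo: position for position, combo in enumerate(by_utility, start=1)}

# Cells where the additive model and the stated ranking disagree.
mismatches = sorted(c for c in stated_rank if predicted_rank[c] != stated_rank[c])

for combo in by_utility:
    print(combo, combined[combo], predicted_rank[combo], stated_rank[combo])
print("Mismatched cells:", mismatches)
```

A check of this kind is a quick way to judge how well a set of estimated utilities fits the trade-off data before using them for prediction.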

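The trade-offs that golfer 1 is willing to make between “ball life” and “price” are shown in Exhibit 18.10. This information can be used to estimate a set of utilities for “price” that can be added to those for “ball life” to approximate the rankings for golfer 1, as shown in Exhibit 18.11. The fit is looser here than in Exhibit 18.9: the price utilities are small relative to those for ball life, so the combined utilities do not reproduce every rank in Exhibit 18.10.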

EXHIBIT 18.10 Conjoint Rankings of Combinations of Price and Ball Life for Golfer 1

	Ball Life
Price	54 holes	36 holes	18 holes
$2.00	1	2	4
$2.50	3	5	7
$3.00	6	8	9

EXHIBIT 18.11 Ranks (in parentheses) and Combined Metric Utilities for Golfer 1—Price and Ball Life

	Ball Life
Price	54 holes	36 holes	18 holes
$2.00	(1)	(2)	(4)
	70	45	20
$2.50	(3)	(5)	(7)
	55	30	5
$3.00	(6)	(8)	(9)
	50	25	0

This step produces a complete set of utilities for all levels of the three features or attributes that successfully capture golfer 1’s trade-offs. These utilities are shown in Exhibit 18.12.

EXHIBIT 18.12 Complete Set of Estimated Utilities for Golfer 1

Distance		Ball Life		Price
Level	Utility	Level	Utility	Level	Utility
275 yards	100	54 holes	50	$2.00	20
250 yards	60	36 holes	25	$2.50	5
225 yards	0	18 holes	0	$3.00	0

Simulating Buyer Choice

For various reasons, the firm might be in a position to produce only 2 of the 27 golf balls that are possible with each of the three levels of the three attributes. The possibilities are shown in Exhibit 18.13. If the calculated utilities for golfer 1 are applied to the two golf balls the firm is able to make, then the results are the total utilities shown in Exhibit 18.14. These results indicate that golfer 1 will prefer the ball with the longer life over the one with the greater distance because it has a higher total utility. The analyst need only repeat this process for a representative sample of golfers to estimate potential market shares for the two balls. In addition, the analysis can be extended to cover other golf ball combinations.

EXHIBIT 18.13 Ball Profiles for Simulation

Attribute	Distance Ball	Long-Life Ball
Distance (yards)	275	250
Life (holes)	18	54
Price	$2.50	$3.00

EXHIBIT 18.14 Estimated Total Utilities for the Two Sample Profiles

	Distance Ball		Long-Life Ball
Attribute	Level	Utility	Level	Utility
Distance (yards)	275	100	250	60
Life (holes)	18	0	54	50
Price	$2.50	5	$3.00	0
Total utility		105		110
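The simulation step amounts to looking up one utility per attribute for each ball and adding them. The sketch below applies golfer 1's utilities from Exhibit 18.12 to the two profiles in Exhibit 18.13. The function and variable names are illustrative, and the "first choice" rule used here (the golfer picks the ball with the highest total utility) is one simple assumption about how utilities translate into choice.

```python
# Golfer 1's estimated utilities (Exhibit 18.12).
golfer_1 = {
    "distance": {"275 yards": 100, "250 yards": 60, "225 yards": 0},
    "life": {"54 holes": 50, "36 holes": 25, "18 holes": 0},
    "price": {"$2.00": 20, "$2.50": 5, "$3.00": 0},
}

# The two balls the firm is able to produce (Exhibit 18.13).
profiles = {
    "Distance Ball": {"distance": "275 yards", "life": "18 holes", "price": "$2.50"},
    "Long-Life Ball": {"distance": "250 yards", "life": "54 holes", "price": "$3.00"},
}

def total_utility(utilities, profile):
    """Sum the utility of each attribute level in a product profile."""
    return sum(utilities[attribute][level] for attribute, level in profile.items())

def preferred_ball(utilities, candidates):
    """First-choice rule: pick the profile with the highest total utility."""
    return max(candidates, key=lambda name: total_utility(utilities, candidates[name]))

totals = {name: total_utility(golfer_1, p) for name, p in profiles.items()}
choice = preferred_ball(golfer_1, profiles)
print(totals)
print(choice)
```

To estimate market shares, the same `preferred_ball` call would be repeated with each sampled golfer's own utilities and the choices tallied.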

The three steps discussed here—collecting trade-off data, using the data to estimate buyer preference structures, and predicting choice—are the basis of any conjoint analysis application. Although the trade-off matrix approach is simple, useful for explaining conjoint analysis, and effective for problems with small numbers of attributes, it is seldom used in the real world.

One of the most common approaches to conducting conjoint analysis is the use of a discrete choice or choice-based conjoint exercise. Two or more products are shown side-by-side with details provided on each key attribute being tested. Respondents are asked to select a single product from among the options shown. The exercise is repeated multiple times in order to present a wide variety of product designs, but no individual sees more than a fraction of the sometimes thousands or even millions of possible product combinations.

Computer-driven exercises might further adapt the exercise to each respondent, based on prior answers and personal demographics, to spend more time on the factors that seem to be driving product choice. Menu-based conjoint analysis can replicate the choices consumers make when choosing between “value meals” and a la carte items from a restaurant menu. Other computer-driven exercises allow respondents to design their own product with appropriate design constraints and pricing factored in to each option chosen, much the way consumers configure their own Dell computer online or select upgrades for a new car. These and many other approaches can be used to capture the information needed for estimating respondent utilities when designed, executed, and analyzed properly.

As suggested earlier, there is much more to conjoint analysis than has been discussed in this section. However, if you understand this simple example, then you understand the basic concepts that underlie conjoint analysis.

Limitations of Conjoint Analysis

Like many research techniques, conjoint analysis suffers from a certain degree of artificiality. Respondents may be more deliberate in their choice processes in this context than in a real situation. The survey may provide more product information than respondents would get in a real market situation. If key attributes or popular options within key attributes are excluded from the study, demand estimates could be severely impacted. Testing too many attributes or features will diminish the amount of attention that can be given to each individual’s most desired features, reducing measurement precision. The presentation of information (e.g., the order in which attributes are listed; whether pictures are used for some attributes, but not others; how price is displayed; etc.) can greatly impact what features respondents focus on and, ultimately, how they make their decisions. It is important to either be as neutral as possible in the presentation of a conjoint exercise or else try to replicate how the product or service is actually evaluated and compared in the marketplace in order to avoid biasing results.

Finally, it is important to remember that the advertising and promotion of any new product or service can lead to consumer perceptions that are very different from those created via factual descriptions used in a survey. Also keep in mind that consumers can’t purchase something they don’t know exists, so conjoint analysis operates under the assumptions of full awareness, unrestricted access, and complete knowledge of all product features.

Big Data and Hadoop

Big Data is the term used to describe very large and complex data sets. Companies have been collecting transaction-based information since the beginning of the computer age. However, the sheer volume of information has grown exponentially in recent years, and the types of information now being generated do not easily fit into traditional hierarchical database structures. Big Data describes the new data capture and management approaches that are designed to handle the higher volume, faster acquisition rates, and broader array of data types. Most of the tools for Big Data are still evolving, and individuals with the skills to capitalize on them are in short supply.

Hadoop is an open-source platform distributed by Apache for managing large amounts of information across hundreds or thousands of networked computers. Each computer works independently on a small portion of the total dataset so that a task such as clustering several billion records can be handled in a fraction of the time taken for more conventional database structures. There are numerous backup copies of each data chunk, so that any failure can be immediately picked up by another computer with access to the same information. Google and Yahoo have had a hand in developing the platform and underlying technology for Hadoop as they sought ways to store and access the vast array of search information they were collecting.

Today, many companies that deal with Big Data—such as Amazon, eBay, Facebook, Google, IBM, LinkedIn, Spotify, Twitter, and Yahoo—use Hadoop to manage their information.

Predictive Analytics¹⁵

Predictive analytics describes a wide array of tools and techniques that are used to extract and analyze information from data sets. Statistics, machine learning, database management, and computer programming all play a part in identifying patterns and transforming data into insights. It is an increasingly important set of tools for businesses to transform the exponentially growing quantities of digital data into business intelligence as firms seek informational advantages to improve efficiency and effectiveness. Predictive analytics can apply to Big Data or traditional databases, observational data like loyalty card usage, Internet sources like social media text, and Web tracking data or primary survey research results. Fraud detection, trend analyses, targeted direct marketing, predicting heavy users, and identifying likely buyers are just some of the applications for predictive analytics.

Practicing Marketing Research

A Framework for Practical Pricing Research¹⁶

“How much should I charge for my product?” This is one of the most important and difficult questions facing a marketer. Charging too much (thus attracting too few customers) could be as costly as charging too little (leaving money on the table). The good news is that research can help marketers understand consumers’ willingness to pay (WTP) for a product. But a practical issue researchers face is deciding on the right pricing approach to use. So many survey-based approaches are available (monadic, sequential monadic, conjoint analysis, and so on) that it is easy to be confused. What would be useful for a practical researcher is a simple framework for thinking about pricing research.

At its core, what a researcher is really trying to understand is what a consumer is willing to pay for a given product. Broadly speaking, this can be approached in two ways—direct and indirect elicitation. In the direct approach, a product description of some kind is provided (without a price), and consumers are asked what price they are willing to pay for that product. In the indirect approach, a product is provided (with a price) and consumers are asked about their likelihood to purchase that product. Pretty much all survey-based pricing approaches are some variation of these two approaches.

Which one should a researcher use and when? There is a bit more nuance and detail involved in that decision, so let’s take a closer look.

Direct Elicitation Pricing

In the simplest variant of direct elicitation, a single product (or service) description is provided and consumers are asked (in an open-ended manner) for the price they are willing to pay. Though the implication is that they are willing to pay a certain price to purchase the product, it is not stated explicitly, hence the focus is squarely on the price. This can subtly place more emphasis on price rather than the value provided by the product. But an advantage of this method is that it requires very little effort from the consumer. Since an open-ended question is asked, a wide range of responses is possible. Calculating the WTP as the average price is straightforward. Alternatively, guidance could be provided in the form of a range of prices for the consumer to choose from.

There is research to show that direct elicitation tends to provide biased responses (perhaps because of the focus on price). So, the situations where this approach is used should be chosen with care. The most obvious case is early in the product development process. A company may have developed a new product concept and is interested in the price the market would pay. The concept may not be fully developed and hence the features would be ill-defined. Trying to get a precise measure of WTP would not be appropriate, so a method that gets at ballpark pricing would be sufficient. In such cases, direct elicitation of WTP would be simple to administer and also efficient from a research cost perspective.

Since the direct approaches focus on the price of the product, there is no information on what happens when features change, the impact of competition or how WTP translates to sales. Given these drawbacks, the direct approaches are more useful in the early exploratory stage of product development where the priority is to get a ballpark price range.

Indirect Elicitation Pricing

The most straightforward change that would make an approach indirect is to attach a price to the product description and ask how likely a consumer is to purchase that product. This small change shifts the focus from price to the value inherent in the product. Further, it provides a better (although still biased) view of sales this product will likely garner and therefore is a more useful metric for the marketer. Asking about likelihood to purchase makes less sense early in the product development cycle. The more clearly defined the features are (i.e., the further the product is in development), the easier it will be for the consumer to provide a realistic answer.

Though it moves the focus from price to purchase (while still providing pricing information for the researcher), this method still suffers from the other flaws mentioned before. But it is at least possible to get at the issue of price sensitivity by using a monadic approach (sometimes called A/B testing). Here, the same product is shown to two similar groups of consumers at different prices and demand is estimated. When more than two groups are used it can produce a nice (downward-sloping) demand curve with a useful property—identification of potential kinks or nonlinearities that can suggest interesting price points. The downside is that it comes at a cost in terms of the sample size required across all the groups in order to get robust results.
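The monadic approach can be illustrated with a small calculation. In the sketch below, three similar groups each see the same product at a different price. The prices, group sizes, and purchase counts are all hypothetical. The code estimates demand in each cell and then looks at how steeply demand falls between adjacent price points, since an unusually steep segment may point to a kink in the curve.

```python
# Hypothetical monadic test: each price is shown to a separate, similar group.
# price: (respondents likely to buy, respondents in the cell)
cells = {2.00: (120, 200), 2.50: (100, 200), 3.00: (60, 200)}

prices = sorted(cells)

# Estimated demand = share of each cell that would buy at that price.
demand = {p: cells[p][0] / cells[p][1] for p in prices}

# Change in demand per dollar between adjacent price points. A segment much
# steeper than its neighbors hints at a kink in the demand curve.
slopes = {
    (low, high): (demand[high] - demand[low]) / (high - low)
    for low, high in zip(prices, prices[1:])
}
steepest = min(slopes, key=slopes.get)

for p in prices:
    print(f"${p:.2f}: {demand[p]:.0%} would buy")
print("Steepest drop between", steepest)
```

With only three cells the picture is coarse. Adding price points sharpens the curve but, as noted above, raises the total sample size required.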

One variant used in practice is called sequential monadic or price laddering (similar to the idea of contingent valuation in the academic literature). Here a single cell is used and if a respondent indicates a low willingness to buy at the given price, a lower price (or two) is offered. The increase in demand across the prices indicates sensitivity, though of course, the later price estimates are biased because of prior exposure.

None of the methods mentioned so far get at the root of the problem: the relative realism of the pricing research. Direct elicitation of WTP is the least realistic and provides the least information. Use of purchase likelihood is somewhat better but does not take competition into account (as happens in real life). To get over this hurdle one could place the target product in a competitive setting (such as a simulated grocery shelf) and record how often it is chosen. But now there are several additional variables introduced into the mix and we cannot be certain about their impact on the demand for our product. What is really needed is an approach that maintains this realistic setting but still provides pricing information in a systematic and effective manner.

That is exactly what conjoint analysis does.

Conjoint Analysis and Pricing

This is really a family of techniques but the most popular variant is called discrete choice. As the name implies, consumers are shown sets of products and asked to choose the one they are most likely to buy. This is quite similar to the behavior they would exhibit in a real buying situation. To make the process even more realistic, the choice task usually includes a “None” or no-choice option, which can help increase the accuracy of demand and hence price estimation.

In a typical conjoint exercise, products are described by attributes (often including brand and price), each with two or more levels. By combining various attribute levels, products can be formed and displayed to consumers as choice options. Choices made by consumers provide information on what is important to them. The choice tasks are created using an experimental design so as to extract maximum information. For example, a high-quality, high-price product might be shown with a low-quality, low-price product. There is no obvious “right” answer and hence the choice made by the consumer provides information on what she or he values. But if the choice had been between a high-quality, low-price product and a low-quality, high-price product then the information value of the choice is minimal (as everyone would choose the former). By providing the consumer with a series of such choices and forcing her or him to think and trade off between features, conjoint analysis is able to gather information on what is truly important to consumers.

Price is one of the features included in the exercise but not the only one. It is combined with other features and together they are displayed as a set of complete products, thus reducing the focus on price as compared to the direct elicitation methods. Thus, the demand estimated at various price points through this approach tends to be more accurate. The output is usually provided in the form of product shares which can be easily understood by all constituents. A simulator can generate what-if scenarios when product features and prices are changed, thus providing the kind of marketplace simulation not possible with any other pricing approach.
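The idea of a what-if simulator can be sketched in a few lines. The example below assumes a logit share-of-preference rule, a common choice for discrete choice data that the discussion above does not itself specify. All utility values, attribute levels, and product names are hypothetical and chosen only to show the mechanics.

```python
import math

# Hypothetical part-worth utilities (illustrative values, not estimates from any study).
UTILITIES = {
    "brand": {"A": 0.5, "B": 0.0},
    "quality": {"high": 0.8, "low": 0.0},
    "price": {"$10": 0.6, "$12": 0.0, "$14": -0.9},
}
NONE_UTILITY = 0.2  # assumed utility of the "None" (no-purchase) option

def product_utility(product):
    """Sum the part-worths of the levels that make up one product."""
    return sum(UTILITIES[attr][level] for attr, level in product.items())

def simulate_shares(products):
    """Logit share of preference for each product plus the None option."""
    scores = {name: math.exp(product_utility(p)) for name, p in products.items()}
    scores["None"] = math.exp(NONE_UTILITY)
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}

base = {
    "ours": {"brand": "A", "quality": "high", "price": "$12"},
    "rival": {"brand": "B", "quality": "low", "price": "$10"},
}
# What-if scenario: raise our price from $12 to $14, everything else unchanged.
what_if = {**base, "ours": {**base["ours"], "price": "$14"}}

for label, scenario in (("base", base), ("price rise", what_if)):
    shares = simulate_shares(scenario)
    print(label, {name: round(share, 3) for name, share in shares.items()})
```

Comparing the two scenarios shows where the lost share goes: some moves to the rival product and some to the None option, which is the kind of output a marketer can read directly as product shares.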

The conjoint approach does have some disadvantages. Multiple screens of products need to be shown to respondents and if the number of attributes is large it can make the exercise tedious. Though the method is robust and is shown to have practical value, it is complex in terms of design and analysis and usually requires specialized support. Hence, it is not as simple as using direct elicitation and reporting a single WTP number.

Recommendations for Pricing Research

We started with this question: “How much should I charge for my product?” To identify the appropriate research needed to answer this question, we first need to understand where the product is in the development process. Early in development, when the product is mostly conceptual and fuzzy, accurate pricing information is neither attainable nor desirable. So a direct elicitation approach may be best, while being cognizant of the ballpark nature of the pricing. This also has the advantage of keeping the research simple, quick and economical.

If it is later in product development when the features are firmed up, conjoint analysis (generally discrete choice) would not only provide good information on pricing but also identify attractiveness of various product features. In fact, there are two ways in which conjoint analysis can be used, if we can consider the middle and final stages of product development to be distinct.

In the middle stage, conjoint analysis can be focused on the features and price of the target product and not on brand name and competitive dynamics. Survey respondents make trade-offs and identify important features, indicating their willingness to pay. If the company is planning on introducing a new or modified feature, this stage can identify whether it has inherent value and how much. Longer lists of features may be more appropriate in this stage.

Ultimately, there are many ways of doing pricing research using surveys but thinking about it systematically and using the product development framework can help a researcher choose the right approach.

Questions

Under what circumstances is conjoint analysis a good choice for pricing research? Why do you say that?
What are some of the methods for direct elicitation of possible pricing for a product? What are the principal disadvantages of the direct elicitation approach?

Using Predictive Analytics

Acquiring a Data Set

Before applying predictive analytics, an organization must assemble a target data set relevant to the problem of interest. Predictive analytics can only uncover patterns and relationships that exist in the available data. Typically, the data set must be large enough to include all the patterns and combinations that are likely to be found in the real world.

In the past, assembling such large data sets was very costly and time consuming. Today, most companies capture terabytes of information on their customers as a normal course of business, and many social media companies provide access to massive amounts of data in real time for anyone to tap into. In addition, third-party vendors provide a wide variety of data elements that can be purchased for just about any household or company in the United States.

Pre-processing

Once assembled, the data set must be cleaned in a process where observations that contain excessive noise, errors, or missing data are edited or excluded. Data transformations may be used to smooth out irregular distributions and minimize extreme values. Missing values are often imputed from comparable records, or predictive models are built to fill in the missing information. Linking multiple data sets is also part of pre-processing the available data.
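The pre-processing steps just described can be sketched with nothing more than the Python standard library. The records, the tolerance for missing fields, and the income cap below are all hypothetical choices made for illustration. Median imputation and capping are only two of many possible treatments.

```python
from statistics import median

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "visits": 3},
    {"age": None, "income": 61000, "visits": 5},
    {"age": 45, "income": None, "visits": 2},
    {"age": 29, "income": 950000, "visits": 4},      # extreme income
    {"age": None, "income": None, "visits": None},   # too incomplete to keep
]

def clean(records, max_missing=1, income_cap=150000):
    fields = list(records[0])
    # 1. Exclude observations with more missing fields than we tolerate.
    kept = [r for r in records if sum(r[f] is None for f in fields) <= max_missing]
    # 2. Impute remaining gaps with the median of the observed values.
    medians = {f: median(r[f] for r in kept if r[f] is not None) for f in fields}
    cleaned = [{f: medians[f] if r[f] is None else r[f] for f in fields} for r in kept]
    # 3. Cap extreme incomes so one record cannot dominate later analysis.
    for r in cleaned:
        r["income"] = min(r["income"], income_cap)
    return cleaned

for row in clean(records):
    print(row)
```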

Modeling

A variety of techniques may be employed as part of the modeling process:

Clustering. This is a task of discovering groups and structures in the data that are similar on certain, selected sets of variables. These are groupings that are not obvious and are not based on a single set of variables or small number of items. Clustering normally requires evaluating numerous solutions before finding the best option. Cluster analysis, one of the techniques discussed earlier in this chapter, is commonly used to reveal hidden groupings or identify unexpected associations.
Classification. Readily available information such as demographics and geography might be used to classify individuals on key behaviors such as purchase frequency or product preference. Proprietary information such as online ads viewed or previous products purchased can be very effective at predicting future behaviors whenever such information is available. Customer segments identified through clustering might also be modeled in order to predict the segment to which new customers and prospects belong. Successful models can be applied to new customers and records that could not be processed directly due to missing data.
Estimation. Measures such as risk scores, fraud detection scores, retention rates, lifetime value, and likelihood to purchase may be calculated for individuals or groups. These calculations can be used to predict future outcomes based on limited present-day data. They can also be used to monitor individuals or groups in order to detect changes in behavior that allow the organization to react before customers or revenues are lost.
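To make the first two tasks concrete, the sketch below runs a bare-bones K-means clustering on a handful of customers and then classifies a new customer into the nearest segment. The customer data and starting centers are hypothetical, and the code is a teaching sketch rather than a substitute for the commercial packages. In practice the variables would usually be standardized first, and many starting points and numbers of clusters would be evaluated.

```python
# Hypothetical customers: (purchases per year, average basket in dollars).
customers = [(2, 20), (3, 25), (4, 22), (20, 80), (22, 85), (19, 90)]

def distance_sq(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise average of a group of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def k_means(points, centers, iterations=10):
    """Plain K-means: assign each point to its nearest center, then recompute centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: distance_sq(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [mean_point(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

def classify(point, centers):
    """Assign a new record to the segment whose center is closest."""
    return min(range(len(centers)), key=lambda i: distance_sq(point, centers[i]))

# Starting centers are picked by hand here purely to keep the example repeatable.
centers, groups = k_means(customers, centers=[customers[0], customers[3]])
print("segment centers:", centers)
print("new customer (18, 70) falls in segment", classify((18, 70), centers))
```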

Validating Results

A final step of knowledge discovery from the target data and modeling is to attempt to verify the patterns produced by the predictive modeling algorithms in a wider data set. Not every pattern and relationship identified in previous steps turns out to be valid in the real world. In the evaluation process, the patterns or models identified in the target data set are applied to a test data set that was not used to develop the predictive modeling algorithm. The resulting output is compared to the desired output.

For example, an algorithm to predict those most likely to respond to a mail offer would be developed, or trained, on certain past mail offers. Once trained, the algorithm would be applied to other mailings not used in its development or to actual results from a recently completed mailing. If the predictive model does not meet the desired accuracy standards, then it is necessary to go through the previous steps again in order to develop an algorithm or model with the desired level of accuracy.
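The holdout logic in the mailing example can be sketched as follows. The mailing history is synthetic, the "model" is deliberately trivial (a single purchase-count threshold), and the 0.85 accuracy standard is an assumed figure. The point is only the sequence: fit on one set of records, then judge the result on records that played no part in the fitting.

```python
# Synthetic mailing history: (prior purchases, responded?). Every tenth record is
# flipped to mimic noise. These are made-up data for illustration only.
history = [(i % 6, (i % 6 >= 3) != (i % 10 == 0)) for i in range(120)]

# The test records are set aside and never used for fitting.
train, test = history[:80], history[80:]

def accuracy(threshold, data):
    """Share of records where 'responds if purchases >= threshold' matches reality."""
    hits = sum((purchases >= threshold) == responded for purchases, responded in data)
    return hits / len(data)

def fit(data):
    """'Train' the model: pick the threshold that best fits the training data."""
    return max(range(7), key=lambda t: accuracy(t, data))

REQUIRED_ACCURACY = 0.85  # assumed accuracy standard

threshold = fit(train)
train_acc = accuracy(threshold, train)
test_acc = accuracy(threshold, test)
print(f"threshold={threshold}, train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
if test_acc < REQUIRED_ACCURACY:
    print("Below standard: revisit the data and modeling steps.")
else:
    print("Model meets the standard on data it has not seen.")
```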

Applying the Results

Once the models and calculations are in place and have been validated, they are applied to existing and future customer records to improve the efficiency and effectiveness of marketing efforts. For example, specific information captured from a new sales inquiry can be used to classify an individual into the correct market segment. Based on their market segment, the most appropriate product offering can be prepared and the marketing messages can be adjusted to most resonate with that individual. Purchasing prospect lists with specific information appended to each record allows an organization to avoid wasting marketing dollars on unlikely purchasers (based on applied predictive models) and focus resources on the most likely buyers and those with the greatest potential lifetime value.

The Practicing Marketing Research feature below provides an example of how predictive modeling is used by a major retailer and also touches on the privacy issues discussed in the next section.

Practicing Marketing Research

How Target Figured Out a Teen Girl Was Pregnant Before Her Father Did¹⁷

Every time you go shopping, you share intimate details about your consumption patterns with retailers. And many of those retailers are studying those details to figure out what you like, what you need, and which coupons are most likely to make you happy. Target, for example, has figured out how to data-mine its way into your womb, to figure out whether you have a baby on the way long before you need to start buying diapers.

Charles Duhigg outlines in the New York Times how Target tries to hook parents-to-be at that crucial moment before they turn into rampant—and loyal—buyers of all things pastel, plastic, and miniature. He talked to Target statistician Andrew Pole—before Target freaked out and cut off all communications—about the clues to a customer’s impending bundle of joy. Target assigns all customers a Guest ID number, tied to their credit card, name, or e-mail address, that becomes a bucket that stores a history of everything they’ve bought and any demographic information Target has collected from them or bought from other sources. Using that, Pole looked at historical buying data for all the ladies who had signed up for Target baby registries in the past.

He ran a variety of analyses, and some useful patterns emerged. Lots of people buy lotion, but one of Pole’s colleagues noticed that women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium, and zinc. Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date.

As Pole’s computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.

Take a fictional Target shopper named Jenny Ward, who is 23, lives in Atlanta and in March bought cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and magnesium supplements, and a bright blue rug. There’s, say, an 87 percent chance that she’s pregnant, and that her delivery date is sometime in late August.

And perhaps that it’s a boy, based on the color of that rug?

So Target started sending coupons for baby items to customers according to their pregnancy scores. An angry man went into a Target outside of Minneapolis, demanding to talk to a manager. “My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture, and pictures of smiling infants. The manager apologized, and then called a few days later to apologize again. On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”

What Target discovered fairly quickly is that it creeped people out that the company knew about their pregnancies in advance.

“If we send someone a catalog and say, ‘Congratulations on your first child!’ and they’ve never told us they’re pregnant, that’s going to make some people uncomfortable,” Pole told me. “We are very conservative about compliance with all privacy laws. But even if you’re following the law, you can do things where people get queasy.”

So Target got sneakier about sending the coupons. The company can create personalized booklets; instead of sending people with high pregnancy scores books of coupons solely for diapers, rattles, strollers, and the Go the F*** to Sleep book, they more subtly spread them about:

“Then we started mixing in all these ads for things we knew pregnant women would never buy, so the baby ads looked random. We’d put an ad for a lawn mower next to diapers. We’d put a coupon for wineglasses next to infant clothes. That way, it looked like all the products were chosen by chance.

“And we found out that as long as a pregnant woman thinks she hasn’t been spied on, she’ll use the coupons. She just assumes that everyone else on her block got the same mailer for diapers and cribs. As long as we don’t spook her, it works.”

So the Target philosophy toward expecting parents might be similar to the first-date philosophy: Even if you’ve fully stalked the person on Facebook and Google beforehand, pretend like you know less than you do so as not to creep the person out.

Duhigg suggests that Target’s gangbusters revenue growth in the 2000s—$44 billion in 2002, when Pole was hired, to $67 billion in 2010—is attributable to Pole’s helping the retail giant corner the baby-on-board market, citing company president Gregg Steinhafel boasting to investors about the company’s “heightened focus on items and categories that appeal to specific guest segments such as mom and baby.”

Privacy Concerns and Ethics

Most believe that predictive modeling is ethically neutral. However, the ways in which data are collected for predictive modeling and the types of data acquired can raise questions regarding privacy, legality, and ethics. For example, monitoring telephone calls and Internet usage for national security or law enforcement purposes has raised privacy concerns.

Commercial Predictive Modeling Software and Applications

Database providers such as Oracle and Microsoft provide tools optimized for their platforms. Hadoop, a popular Big Data platform, has a variety of open-source and commercial tools available. There is an increasing number of highly integrated packages for predictive modeling, including:

Angoss KnowledgeSTUDIO
Clarabridge
RapidMiner
SAS Enterprise Miner
SPSS Modeler
STATISTICA Data Miner

Summary

Multivariate analysis refers to the simultaneous analysis of multiple measurements on each individual or object being studied. Some of the more popular multivariate techniques include multiple regression analysis, multiple discriminant analysis, cluster analysis, factor analysis, and conjoint analysis.

Multiple regression analysis enables the researcher to predict the magnitude of a dependent variable based on the levels of more than one independent variable. Multiple regression fits a plane to observations in a multidimensional space. One statistic that results from multiple regression analysis is called the coefficient of determination, or R². The value of this statistic ranges from 0 to 1. It provides a measure of the percentage of the variation in the dependent variable that is explained by variation in the independent variables. The b values, or regression coefficients, indicate the effect of the individual independent variables on the dependent variable.

Whereas multiple regression analysis requires that the dependent variable be metric, multiple discriminant analysis uses a dependent variable that is nominal or categorical in nature. Discriminant analysis can be used to determine if statistically significant differences exist between the average discriminant score profiles of two (or more) groups. The technique can also be used to establish a model for classifying individuals or objects into groups on the basis of their scores on the independent variables. Finally, discriminant analysis can be used to determine how much of the difference in the average score profiles of the groups is accounted for by each independent variable. The discriminant score, called a Z score, is derived for each individual or object by means of the discriminant equation.

Cluster analysis enables a researcher to identify subgroups of individuals or objects that are homogeneous within the subgroup, yet different from other subgroups. Cluster analysis requires that all independent variables be metric, but there is no specification of a dependent variable. Cluster analysis is an excellent means for operationalizing the concept of market segmentation.

The purpose of factor analysis is to simplify massive amounts of data. The objective is to summarize the information contained in a large number of metric measures such as rating scales with a smaller number of summary measures called factors. As in cluster analysis, there is no dependent variable in factor analysis. Factor analysis produces factors, each of which is a weighted composite of a set of related variables. Each measure is weighted according to how much it contributes to the variation of each factor. Factor loadings are determined by calculating the correlation coefficient between factor scores and the original input variables. By examining which variables load heavily on a given factor, the researcher can subjectively name that factor.

Perceptual maps can be produced by means of factor analysis, multidimensional scaling, discriminant analysis, or correspondence analysis. The maps provide a visual representation of how brands, products, companies, and other objects are perceived relative to each other on key features such as quality and value. All the approaches require, as input, consumer evaluations or ratings of the objects in question on some set of key characteristics.

Conjoint analysis is a technique that can be used to measure the trade-offs potential buyers make on the basis of the features of each product or service available to them. The technique permits the researcher to determine the relative value of each level of each feature. These estimated values are called utilities and can be used as a basis for simulating consumer choice.

Predictive modeling draws on statistics, machine learning, artificial intelligence, and computer programming to identify patterns in market data sets. It is becoming increasingly important as the available data grow exponentially.

Key Terms

causation

classification matrix

cluster analysis

coefficient of determination

collinearity

conjoint analysis

discriminant coefficient

discriminant score

dummy variables

factor

factor analysis

factor loading

K-means cluster analysis

metric scale

multiple discriminant analysis

multiple regression analysis

multivariate analysis

nominal or categorical

regression coefficients

scaling of coefficients

utilities

Questions for Review & Critical Thinking

Distinguish between multiple discriminant analysis and cluster analysis. Give several examples of situations in which each might be used.
What purpose does multiple regression analysis serve? Give an example of how it might be used in marketing research. How is the strength of multiple regression measures of association determined?
What is a dummy variable? Give an example using a dummy variable.
Describe the potential problem of collinearity in multiple regression. How might a researcher test for collinearity? If collinearity is a problem, what should the researcher do?
A sales manager examined age data, education level, a personality factor that indicated level of introvertedness/extrovertedness, and level of sales attained by the company’s 120-person salesforce. The technique used was multiple regression analysis. After analyzing the data, the sales manager said, “It is apparent to me that the higher the level of education and the greater the degree of extrovertedness a salesperson has, the higher will be an individual’s level of sales. In other words, a good education and being extroverted cause a person to sell more.” Would you agree or disagree with the sales manager’s conclusions? Why?

The factors produced and the results of the factor loadings from factor analysis are mathematical constructs. It is the task of the researcher to make sense out of these factors. The following table lists four factors produced from a study of cable TV viewers. What label would you put on each of these four factors? Why?

		Factor Loading
Factor 1	I don’t like the way cable TV movie channels repeat the movies over and over.	.79
	The movie channels on cable need to spread their movies out (longer times between repeats).	.75
	I think the cable movie channels just run the same things over and over and over.	.73
	After a while, you’ve seen all the pay movies, so why keep cable service.	.53
Factor 2	I love to watch love stories.	.76
	I like a TV show that is sensitive and emotional.	.73
	Sometimes I cry when I watch movies on TV.	.65
	I like to watch “made for TV” movies.	.54
Factor 3	I like the religious programs on TV (negative correlation).	−.76
	I don’t think TV evangelism is good.	.75
	I do not like religious programs.	.61
Factor 4	I would rather watch movies at home than go to the movies.	.63
	I like cable because you don’t have to go out to see the movies.	.55
	I prefer cable TV movies because movie theaters are too expensive.	.46

The following table is a discriminant analysis that examines responses to various attitudinal questions from cable TV users, former cable TV users, and people who have never used cable TV. Looking at the various discriminant weights, what can you say about each of the three groups?

		Discriminant Weights
		Users	Formers	Nevers
Users
A19	Easygoing on repairs	−.40
A18	No repair service	−.34
A7	Breakdown complainers	+.30
A5	Too many choices	−.27
A13	Antisports	−.24
A10	Antireligious	+.17
Formers
A4	Burned out on repeats		+.22
A18	No repair service		+.19
H12	Card/board game player		+.18
H1	High-brow		−.18
H3	Party hog		+.15
A9	DVD preference		+.16
Nevers
A7	Breakdown complainer			−.29
A19	Easygoing on repairs			+.26
A5	Too many choices			+.23
A13	Antisports			+.21
A10	Antireligious			−.19

The following table shows regression coefficients for two dependent variables. The first dependent variable is willingness to spend money for cable TV. The independent variables are responses to attitudinal statements. The second dependent variable is stated desire never to allow cable TV in their homes. By examining the regression coefficients, what can you say about persons willing to spend money for cable TV and those who will not allow cable TV in their homes?

	Regression Coefficients
Willing to Spend Money for Cable TV
Easygoing on cable repairs	−3.04
Cable movie watcher	2.81
Comedy watcher	2.73
Early to bed	−2.62
Breakdown complainer	2.25
Lovelorn	2.18
Burned out on repeats	−2.06
Never Allow Cable TV in Home
Antisports	0.37
Object to sex	0.47
Too many choices	0.88

Explain what predictive analytics encompasses. Provide examples of some marketing problems to which you might apply predictive analytics.
Describe the steps in the predictive analytics process.
What is Hadoop? How does it relate to Big Data?

Working the Net

A good discussion of cluster analysis can be found at http://faculty.darden.virginia.edu/GBUS8630/doc/M-0748.pdf.
For some easy to digest and comprehensive information on multivariate analysis, including how to run these analyses in SPSS, visit http://core.ecu.edu/psyc/wuenschk/spss/SPSS-MV.htm.

Real-Life Research

18.1 Satisfaction Research for Pizza Quik

The Problem Pizza Quik is a regional chain of pizza restaurants operating in seven states in the Midwest. Pizza Quik has adopted a total quality management (or TQM) orientation.¹⁸ As part of this orientation, the firm is committed to the idea of market-driven quality. That is, it intends to conduct a research project to address the issue of how its customers define quality and to learn from the customers themselves what they expect in regard to quality.

Research Objectives The objectives of the proposed research are:

To identify the key determinants of customer satisfaction.
To measure current customer satisfaction levels on those key satisfaction determinants.
To determine the relative importance of each key satisfaction determinant in deriving overall satisfaction.
To provide recommendations to management regarding where to direct the company’s efforts.

Methodology The first objective was met by means of qualitative research. A series of focus groups were conducted with customers to determine which attributes of Pizza Quik’s product and service are most important to them. Based on this analysis, the following attributes were identified:

Overall quality of food
Variety of menu items
Friendliness of Pizza Quik’s employees
Provision of good value for the money
Speed of service

In the second stage of the research, central-location telephone interviews were conducted with 1,200 randomly selected individuals who had purchased or eaten at a Pizza Quik restaurant (in the restaurant or take-out) in the past 30 days. Key information garnered in the survey included:

Overall rating of satisfaction with Pizza Quik on a 10-point scale (1 = poor and 10 = excellent).
Rating of Pizza Quik on the five key satisfaction attributes identified in the qualitative research, using the same 10-point scale as for overall satisfaction.
Demographic characteristics.

Results And Analysis Extensive cross tabulations and other traditional statistical analyses were conducted. A key part of the analysis was to estimate a regression model with overall satisfaction as the dependent variable and satisfaction with key product and service attributes as the predictors. The results of this analysis were:

where

Average ratings on the 10-point scale for overall satisfaction and the five key attributes were:

The regression coefficients provide estimates of the relative importance of the different attributes in determining overall satisfaction. The results show that X₅ (rating of speed of service) is the most important driver of overall satisfaction. The results also indicate that a one-unit increase in average rating on speed of service will produce an increase of .57 in average satisfaction rating. For example, the current average rating on speed of service is 8.2. If, by providing faster service, Pizza Quik could increase this rating to 9.2, then it would expect the average satisfaction rating to increase to 7.87. X₁ (rating of food quality) and X₄ (rating of value) are not far behind speed of service in their effect on overall satisfaction according to the regression estimates. At the other extreme, X₂ (rating of variety of menu) is least important in determining overall satisfaction, and X₃ (rating of friendliness of employees) is in between in importance.
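The arithmetic behind the speed-of-service example can be checked with a short calculation. The coefficient (.57), the current speed rating (8.2), and the predicted satisfaction rating (7.87) come from the paragraph above. The current overall satisfaction average is not quoted there, so the sketch infers it from those figures, and it assumes that ratings on the other four attributes stay unchanged.

```python
B_SPEED = 0.57             # coefficient on X5 (speed of service), from the text
SPEED_NOW = 8.2            # current average speed rating, from the text
SATISFACTION_AFTER = 7.87  # predicted overall average at a speed rating of 9.2, from the text

# Inferred rather than quoted: the current overall average implied by those figures.
SATISFACTION_NOW = SATISFACTION_AFTER - B_SPEED * (9.2 - SPEED_NOW)

def predicted_satisfaction(new_speed_rating):
    """Predicted overall average if only the speed rating changes."""
    return SATISFACTION_NOW + B_SPEED * (new_speed_rating - SPEED_NOW)

for rating in (8.2, 8.7, 9.2):
    print(f"speed rating {rating:.1f} -> predicted satisfaction "
          f"{predicted_satisfaction(rating):.2f}")
```

The same function shows what a smaller improvement would be worth: each tenth of a point gained on speed of service adds about .057 to the predicted overall rating.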

The performance ratings provide a different picture. According to the average ratings, customers believe Pizza Quik is doing the best job on X₃ (friendliness of employees) and the worst job on X₁ (food quality).

Questions

Plot the importance and performance scores in a matrix. One axis would be importance from low to high and the other would be performance from low to high.
Which quadrant should you pay the most attention to? Why?
Which quadrant or quadrants should you pay the least attention to? Why?
Based on your analysis, where would you advise the company to focus its effort? What is the rationale behind this advice?

18.2 Acme Car Wash Systems

Acme Car Wash Systems franchises car washes throughout the United States. Currently, 872 car washes franchised by Acme are in operation. As part of its service to franchisees, Acme runs a national marketing and advertising campaign.

Carl Bahn is the senior vice president in charge of marketing for Acme. He is currently in the process of designing the marketing and advertising campaigns for the upcoming year. Bahn believes that it is time for Acme to take a more careful look at user segments in the market. Based on other analysis, he and his associates at Acme have decided that the upcoming campaign should target the heavy user market. Also, by reference to other research, Acme has defined heavy car wash users as those individuals who have their cars washed at a car wash facility three or more times per month on average. Light users are defined as those who use such a facility less than three times a month but at least four times a year. Nonusers are defined as those who use such a facility less than four times per year. Bahn and his associates are currently in the process of attempting to identify those factors that discriminate between heavy and light users. In the first stage of this analysis, they conducted interviews with 50 customers at 100 of their locations for a total of 5,000 interviews. Cross tabulation of the classification variables by frequency of use suggests that four variables may be predictive of usage heaviness: vehicle owner age, annual income of vehicle owner, age of vehicle, and socioeconomic status of vehicle owner (based on an index of socioeconomic variables).

Acme retained a marketing research firm called Marketing Metrics to do further analysis for the company. Marketing Metrics evaluated the situation and decided to use multiple discriminant analysis to further analyze the survey results and identify the relative importance of each of the four variables in determining whether a particular individual is a heavy or light user. The firm obtained the following results:

Questions

What would you tell Bahn about the importance of each of the predictor variables?
What recommendations would you make to him about the type of people Acme should target based on its interest in communicating with heavy users?

Appendix: Role of Marketing Research in the Organization and Ethical Issues

Marketing Research across the Organization

The question of data interpretation is not fully resolved in business today. Someone must still look at the data and decide what they really mean. Often this is done by the people in marketing research. Defend the proposition that persons in engineering, finance, and production should interpret all marketing research data when the survey results affect their operations. What are the arguments against this position?
Marketing research data analysis for a large electric utility found that confidence in the abilities of the repairperson is the customers’ primary determinant of their satisfaction or dissatisfaction with the electric utility. Armed with these findings, the utility embarked on a major advertising campaign extolling the heroic characteristics of the electric utility repairperson. The repair people hated the campaign. They knew that they couldn’t live up to the customer expectations created by the advertising. What should have been done differently?
When marketing research is used in strategic planning, it often plays a role in determining long-term opportunities and threats in the external environment. Threats, for example, may come from competitors’ perceived future actions, new competitors, governmental policies, changing consumer tastes, or a variety of other sources. Management’s strategic decisions will determine the long-term profitability, and perhaps even the survival, of the firm. Most top managers are not marketing researchers or statisticians; therefore, they need to know how much confidence they can put into the data. Stated differently, when marketing researchers present statistical results, conclusions, and recommendations, they must understand top management’s tolerance for ambiguity and imprecision. Why? How might this understanding affect what marketing researchers present to management? Under what circumstances might the level of top management’s tolerance for ambiguity and imprecision shift?

Ethical Dilemma: Branding the Black Box in Marketing Research

Marketing research discovered branding in the mid-1980s and it experienced phenomenal growth in the 1990s, which continues today. Go to virtually any large marketing research firm’s website, and you’ll see a vast array of branded research products for everything from market segmentation to customer value analysis, all topped off with a diminutive ℠, ™, or ®. Here’s just a sample: MARC’s Designor℠, Market Facts’ Brand Vision®, Maritz Research’s 80/203 Relationship Manager, and Total Research’s TRBC™, a scale bias correction algorithm.

A common denominator across some of these products is that they are proprietary, which means the firms won’t disclose exactly how they work. That’s why they’re also known pejoratively as black boxes. Because a black box method is proprietary, a company is able to protect its product development investment. And if customers perceive added value in the approach, suppliers can charge a premium price to boot. (Black boxes and brand names are not synonymous. Almost all proprietary methods have a clever brand name, but there are also brand names attached to research methods that are not proprietary.)

At least two factors have given rise to this branding frenzy. First, competitive pressures force organizations to seek new ways to differentiate their product offerings from those of their competitors. Second, many large research companies are publicly held, and publicly held companies are under constant pressure to increase sales and profits each quarter. One way to do this is to charge a premium price for services. If a company has a proprietary method for doing a marketing segmentation study, presumably it can charge more for this approach than another firm using publicly available software such as SPSS or SAS. Ironically, it is possible that some black boxes are perfectly standard software such as SPSS and SAS; but if their proponents won’t say how they work, or which techniques are used, these methods are still black boxes.

Questions

Is the use of branded black box models unethical?
Should marketing research suppliers be forced to explain to clients how their proprietary models work?
Should firms be required by law to conduct validity and reliability tests on their models to demonstrate that they are better than nonproprietary models?

Source: Terry Grapentine, “You Can’t Take Human Nature Out of Black Boxes,” Marketing Research (Winter 2001), p. 21.

SPSS Exercises for Chapter 18

Exercise 1: Multivariate Regression

This exercise uses multivariate regression to explain and predict how many movies a respondent attends in a month.

Go to the website for the text and download the Movie database.
Open the database in SPSS and view the variables under Variable View. We will be using the independent variables Q2 Q4 Q6 Q8a Q8b Q8c Q8d Q9 Q10 Q12 and Q13 to predict the dependent variable Q3.
We are including the variables Q4 and Q6 as is. Strictly speaking, is this proper? What might you want to do instead and why? Why might you decide to leave a variable in bins instead? Would it ever be proper to use a variable like Q11 as is?
Go to Analyze → Descriptive Statistics → Descriptives and move Q3 Q2 Q4 Q6 Q8a Q8b Q8c Q8d Q9 Q10 Q12 and Q13 to the Variable(s) box and click OK. Multivariate techniques require that every variable have a legitimate value. If a respondent did not answer every question, then the analyst must either ignore the observation entirely or impute estimates for the missing values. The default for statistical software is to ignore those observations automatically. We will not do imputation for this exercise.
1. What will the sample size be for later multivariate techniques?
2. Is this sample size large enough for multivariate regression?
3. What would some possible problems be if the sample size were not large enough?
4. Are the minimum and maximum values for each variable within the proper range? A value that is out of range would indicate either a data input error or a user-defined missing value like “Refused” or “Don’t Know.” Data input errors should be corrected or deleted. User-defined missing values should be declared in SPSS.
5. Are all the variables within the proper range?
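
The effect of ignoring incomplete observations (often called listwise deletion) on sample size can be sketched outside SPSS. This is a minimal Python illustration under stated assumptions: the rows are made-up values, missing answers are represented as None, and the function name is hypothetical.

```python
def listwise_delete(rows, variables):
    """Keep only rows that have a non-missing value for every listed variable."""
    return [row for row in rows
            if all(row.get(var) is not None for var in variables)]


# Made-up responses for illustration only; None marks an unanswered question.
rows = [
    {"Q3": 2, "Q2": 4, "Q4": 1},
    {"Q3": 1, "Q2": None, "Q4": 2},
    {"Q3": 3, "Q2": 5, "Q4": None},
    {"Q3": 0, "Q2": 2, "Q4": 3},
]

complete = listwise_delete(rows, ["Q3", "Q2", "Q4"])
fewer_vars = listwise_delete(rows, ["Q3", "Q4"])

print(len(rows), len(complete), len(fewer_vars))
```

In this sketch, dropping a variable from the list lets more rows survive, which is why the usable sample size depends on which variables are included in the model.
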
- Go to Analyze → Regression → Linear.
- Move Q3 to Dependent.
- Move Q2 Q4 Q6 Q8a Q8b Q8c Q8d Q9 Q10 Q12 Q13 to Independent(s).
- Change Method to Stepwise.
- Click OK.
1. Which independent variables did the stepwise regression select? Why not the rest?
2. Is each variable chosen significant?
3. Are the variables that have not been chosen necessarily insignificant?
4. Is the model significant?
5. Does this method guarantee that you get the “best” model?
Go to Analyze → Descriptive Statistics → Descriptives and remove Q6 Q8a Q8b Q8c Q8d Q9 Q10 and Q12 from the Variable(s) box, so that only Q3 Q2 Q4 and Q13 remain in the box and then click OK. What is the sample size now?
- Go to Analyze → Regression → Linear.
- Move Q3 to Dependent.
- Remove Q6 Q8a Q8b Q8c Q8d Q9 Q10 and Q12 from Independent(s) so that only Q2 Q4 and Q13 remain.
- Change Method to Enter.
- Click OK.
1. How and why does this model differ from the model based on stepwise regression?
2. Which model is better?

Interpretation

How does stated importance affect the number of times one attends movies?
How does spending money on snacks affect the number of times one attends movies?
How does student classification affect the number of times one attends movies?
If a sophomore thought that going to the movies was somewhat important and typically spent $12 on snacks, how many times per month would he or she attend movies, based on this model?
Do any of the variables, according to the results, appear to have an effect on the number of times one attends movies, or does it seem that other factors not covered in this survey are driving movie attendance?
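
A prediction from a fitted regression model is the intercept plus each coefficient multiplied by the respondent's value on that variable. The sketch below shows only the mechanics. The coefficients, the variable names, and the numeric codes for importance and classification are placeholders invented for illustration; substitute the values from your own SPSS coefficients table and the coding used in the Movie database.

```python
def predict(intercept, coefficients, values):
    """Return intercept + sum of (coefficient * value) over all predictors."""
    return intercept + sum(coefficients[name] * values[name]
                           for name in coefficients)


# Placeholder estimates, NOT results from the Movie database.
intercept = 1.0
coefficients = {"importance": 0.5, "snack_spend": 0.05, "classification": -0.1}

# Hypothetical coding: importance rating 3, $12 on snacks, classification code 2.
respondent = {"importance": 3, "snack_spend": 12, "classification": 2}

predicted_visits = predict(intercept, coefficients, respondent)
print(round(predicted_visits, 2))
```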

Exercise 2: Factor Analysis

This exercise uses factor analysis to explore how survey respondents consider various aspects of a theater visit.

Go to the website for the text and download the Movie database.
Open the database in SPSS and view the variables under Variable View. Notice that Question 5 has 9 importance rating items.
Go to Analyze → Descriptive Statistics → Descriptives and move Q5a through Q5i to the Variable(s) box and click OK.
1. Which item is the most important?
2. Which item is the least important?
  Multivariate techniques require that every variable have a legitimate value. If a respondent did not answer every question, then the analyst must either ignore the observation entirely or impute estimates for the missing values. The default for statistical software is to ignore those observations automatically. We will not get involved with imputation for this exercise.
1. What will the sample size be for later multivariate techniques?
2. Is this sample size large enough for factor analysis?
3. What would some possible problems be if the sample size were not large enough?
4. It is a good idea to check that the minimum and maximum values for each variable are within the proper range. A value that is out of range indicates either a data input error or a user-defined missing value such as “Refused” or “Don’t Know.” Data input errors should be corrected or deleted. User-defined missing values should be declared in SPSS.
5. Are all the variables within the proper range?
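
The range check described above can be sketched in a few lines. This is an illustrative Python example, not an SPSS procedure; the 1-to-5 rating scale and the code 9 for a refused answer are assumptions, since the actual coding is defined in the Movie database.

```python
def out_of_range(rows, valid_ranges):
    """Return (row index, variable, value) for every value outside its range."""
    problems = []
    for index, row in enumerate(rows):
        for var, (low, high) in valid_ranges.items():
            value = row.get(var)
            if value is not None and not (low <= value <= high):
                problems.append((index, var, value))
    return problems


# Assumed 1-5 importance scale; 9 stands in for an undeclared "Refused" code.
valid_ranges = {"Q5a": (1, 5), "Q5b": (1, 5)}
rows = [
    {"Q5a": 4, "Q5b": 2},
    {"Q5a": 5, "Q5b": 9},
    {"Q5a": 1, "Q5b": None},
]

problems = out_of_range(rows, valid_ranges)
print(problems)
```

Any value flagged this way would need to be corrected, deleted, or declared as user-defined missing before running the analysis.
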
Go to Analyze → Correlate → Bivariate and move Q5a through Q5i to the Variables box and click OK.
Examine the resulting correlation matrix.
1. Other than the 1’s down the main diagonal of the matrix, what is the highest correlation in absolute value?
2. Does any variable “just not fit” with the others?
3. Does multicollinearity appear to exist among some of the items?
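
Finding the largest off-diagonal correlation in absolute value can be illustrated as follows. This is a minimal Python sketch with made-up ratings; the function names are hypothetical, and the variable names only mirror the exercise.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def strongest_pair(data):
    """Return (var1, var2, r) for the pair with the largest |r|."""
    names = list(data)
    best = None
    for i, first in enumerate(names):
        for second in names[i + 1:]:       # skip the diagonal and duplicates
            r = pearson(data[first], data[second])
            if best is None or abs(r) > abs(best[2]):
                best = (first, second, r)
    return best


# Made-up ratings for illustration only.
data = {
    "Q5a": [1, 2, 3, 4, 5],
    "Q5b": [1, 3, 2, 5, 4],
    "Q5c": [5, 4, 3, 2, 2],
}

first, second, r = strongest_pair(data)
print(first, second, round(r, 3))
```

Comparing absolute values matters because a strong negative correlation, as in this made-up data, signals just as much shared variance as a strong positive one.
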
Go to Analyze → Data Reduction → Factor.
Move Q5a through Q5i to the Variables box.
Click the Rotation button, place a check in front of “Varimax,” and click Continue.
Click the Options button.
Place a check in front of “Sorted by size.”
Place a check in front of “Suppress absolute values less than” and set the value after it to .25.
Click Continue.
Click OK.
SPSS produces a lot of output for Factor Analysis. It is possible to create much more output than we have generated here by setting various subcommands and options.
1. How many factors did SPSS create?
2. Why did it stop at that number?
3. How could you change the defaults to create a different number of factors?
4. Go to the output entitled Total Variance Explained. How much variance was explained in this Factor Analysis?
5. Go to the output entitled Rotated Component Matrix. Why are some elements in this matrix blank?
6. Do the components or factors make sense?
7. Can you identify a common theme for each component or factor?
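
The relationship between eigenvalues, the number of factors retained, and the variance explained can be sketched as follows. This assumes the common rule of retaining components whose eigenvalues exceed 1; the eigenvalues are hypothetical numbers chosen to sum to 9 (one unit of variance per standardized item), not output from the Movie database.

```python
def summarize_factors(eigenvalues, cutoff=1.0):
    """Return (number of factors retained, percent of total variance explained).

    Assumption: a factor is retained when its eigenvalue exceeds the cutoff.
    """
    total = sum(eigenvalues)
    retained = [value for value in eigenvalues if value > cutoff]
    percent_explained = 100 * sum(retained) / total
    return len(retained), percent_explained


# Hypothetical eigenvalues for nine standardized items.
eigenvalues = [3.0, 1.5, 1.2, 0.9, 0.8, 0.6, 0.5, 0.3, 0.2]

n_factors, percent_explained = summarize_factors(eigenvalues)
print(n_factors, round(percent_explained, 1))
```

Changing the cutoff, or fixing the number of factors directly, changes both the number of factors retained and the share of variance they explain.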

Interpretation

Is this a good factor solution? Why do you say that?
How might you create a better factor solution?
What understanding has this analysis helped you gain about how moviegoers perceive their movie-going experience?
What recommendations would you give a manager of a movie house based on this analysis?
