
Data Sets
Smaller Data Sets
Large Case Data Sets

At the end of most chapters, we provide several exercises, usually based on one or more data sets used in that chapter or in another chapter. The main purpose of these exercises is to reinforce and extend the mechanics of the chapter's techniques. Such exercises, however, do little to develop the most difficult aspect of the statistical problem-solving process: deciding whether a technique is appropriate. We have taken a twofold approach to address this critical problem-solving step.

First, we have eight smaller data sets that can be assigned at the end of Chapters 2‒9. Second, we provide six rich case data sets, each with numerous observations, numerous variables, or both. These data sets generally require more time to analyze and could be appropriate for semester-long assignments. Both types of data sets come from real-world data. In some cases, the data set is not appropriate for the technique(s) covered in the chapter. In others, it is very appropriate, although the results may or may not provide any real benefit. We leave it to the instructor to decide how to assign these data sets.

In the next section, we provide a brief description of each data set.

Smaller Data Sets

City Ranking


We see it all the time: a listing of the top ten cities to live in. How do they come up with these rankings? We have 367 metropolitan areas and several descriptive variables: Population, Cost of Living, % Creative Class, median household income (Med HH Income), percentage income growth (% Inc Growth), State, and metropolitan area (Metro Area). How do these variables affect whether a city is considered ideal? And how do we rank the cities from best to worst?

Credit Card Statistics

File: Credit Card

We have 423,000 observations from a credit card company on its customers’ purchasing patterns by month for the year 2009. The variables include accountnumber, Year_Mo, sum, productname, segmentdescription, categoryid, and merchantname.

Can you find any patterns or relationships that affect credit card sales?

Crime Data


The FBI compiles an annual report of the volume and rate of crime offenses for the nation, the states, and individual agencies. This report also includes arrest, clearance, and law enforcement employee data. The variables are listed below.



Murder Rate, Total (Violent+Property), Rape Rate, Robbery Rate, Agg-Aslt Rate, Total Rate, Burglary Rate, New Violent, Violent Rate, Larceny Rate, New Property, Property Rate, and MVTheft Rate.





The data set contains annual data from 1973 to 1999. Can you find any patterns or relationships in this crime data that will be helpful to the FBI and/or local police?


File: Freshman

A major issue that most universities face is minimizing the number of students who drop out, leave, or transfer. The data set contains information on 100 college students who have just completed their freshman year. Variables in the file include College GPA, Miles from Home, College (within the university), Accommodations (dormitory or off-campus housing), Years Off (time off between high school and college), Part-time Work Hours, Attends Office Hours, and High School GPA. The university hopes to understand which variables contribute to whether a student will fail during freshman year and leave the school, or succeed and return for sophomore year.

Health Care Trends

File: Healthtrends

Trends in health care are extremely useful to public policy agencies and to health care industry companies such as pharmaceutical companies. Several companies are repositories of data on different aspects of the health care industry (e.g., physician visits or prescriptions). The data set contains 3129 records that give a macro-level snapshot of physician visits over three months. The variables included are medical procedure (Procedures), medical diagnosis (Diagnosis), and patient count by week. Given this data set, answer the following questions:

1. Why are patients being treated (diagnoses)?

2. How are patients being treated (procedures)?

3. Are there trends in treatment?

Massachusetts Housing

File: MassHousing

Although lagging behind business, federal and state governments are increasing their use of business analytics. In this data set, we have, for 506 towns in the metropolitan Boston area, the crime rates and the following associated variables:

crim: per capita crime rate by town

zn: proportion of residential land zoned for lots over 25,000 sq. ft.

indus: proportion of non-retail business acres per town

chas: Charles River dummy variable (1 if tract bounds river; 0 otherwise)

nox: nitric oxide concentration (parts per 10 million)

rooms: average number of rooms per dwelling

age: proportion of owner-occupied units built prior to 1940

distance: weighted distances to five Boston employment centers

radial: index of accessibility to radial highways

tax: full-value, property-tax rate per $10,000

pt: pupil-teacher ratio by town

b: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town

lstat: % lower status of the population

mvalue: median value of owner-occupied homes in $1000s

The state of Massachusetts is interested in understanding which factors may have an impact on crime rates.

Equality Promotion

File: Promotion

To be eligible for promotion to lieutenant or captain, a candidate must achieve an overall score of 70% under the fire department’s union-weighted scoring, in which the written exam counts for 60% and the oral exam for 40%. The data set has the results for all 118 firefighters who took the exam: 77 for promotion to lieutenant and the remaining 41 for captain. The variables in the file are:

Race: W = white, H = Hispanic, B = black

Position: Captain or lieutenant

Oral: Oral exam score

Written: Written exam score

Combine: Weighted total score, with 60% written and 40% oral

At the time of the exam, 8 lieutenant and 7 captain positions were available. First, assist the fire department management by identifying the top 10 candidates for each position. Second, the exams were not certified, so there may be concern about fairness and, in particular, about reverse discrimination. Examine whether the data support such concerns.
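As a rough illustration of the scoring rule described above, the following Python sketch computes the weighted total (60% written, 40% oral), applies the 70% pass mark, and ranks the passing candidates for one position. The candidate records, the `id` field, and the function names are hypothetical; the real values are in the Promotion file, and the rounding and tie handling shown here are assumptions rather than the department's actual procedure.

```python
# Hypothetical records for illustration only; the real data are in the Promotion file.
candidates = [
    {"id": "A", "Position": "Lieutenant", "Written": 80.0, "Oral": 70.0},
    {"id": "B", "Position": "Lieutenant", "Written": 60.0, "Oral": 75.0},
    {"id": "C", "Position": "Captain", "Written": 90.0, "Oral": 85.0},
    {"id": "D", "Position": "Lieutenant", "Written": 72.0, "Oral": 91.0},
]


def combined_score(written, oral, w_written=0.60, w_oral=0.40):
    """Weighted total score: 60% written plus 40% oral by default."""
    return w_written * written + w_oral * oral


def rank_candidates(records, position, pass_mark=70.0, top_n=10):
    """Return up to top_n passing candidates for one position, best first."""
    scored = [
        (r["id"], round(combined_score(r["Written"], r["Oral"]), 2))
        for r in records
        if r["Position"].lower() == position.lower()
    ]
    # Keep only candidates who reach the overall pass mark (assumed inclusive).
    passing = [(cid, score) for cid, score in scored if score >= pass_mark]
    return sorted(passing, key=lambda pair: pair[1], reverse=True)[:top_n]


lieutenants = rank_candidates(candidates, "Lieutenant")
print(lieutenants)
```

In this made-up sample, candidate B scores 66 overall and is excluded by the pass mark, even though B's oral score is higher than A's. Changing `w_written` and `w_oral` is one simple way to see how sensitive the ranking is to the weighting.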

Titanic Survivors

File: Titanic Passengers

What may have affected the survival of passengers of the Titanic? The data set contains information on 1309 individual passengers (not the crew, only passengers). The passenger variables are:

Passenger Class, Survived, Name, Sex, Age, Siblings and Spouses, Parents and Children, Ticket #, Fare, Cabin, Port, Lifeboat, Body, Home/Destination, and Midpoint age.

The question to be addressed is: Did any of the passenger variables have an effect on their survival?

Large Case Data Sets



Over the past several years, the fresh fruit and vegetable industries have experienced a significant increase in the propensity of consumers to purchase local and organic food. In particular, from 1997 to 2008, the consumption of organic foods and beverages increased from $3.6 billion to $21.1 billion. Several interrelated factors have been the driving forces behind these trends, for example: concern for healthy foods, desire for better-tasting foods, concern over chemicals/pesticides in food, and simply providing support to local industry. Apples are among the fruits most frequently produced and purchased locally, so the study focused on apples.

An online survey of adult residents of Pennsylvania was conducted in 2009 and resulted in 1224 completed surveys. Because of Pennsylvania’s mix of urban, suburban, and rural environments and of industrial and agricultural commerce, and because the state is a major producer and consumer of apples, the state is viewed as a good representative sample. The major objectives of the survey were to evaluate the market opportunities and profitability of organic farming and to identify the factors that influence consumer purchasing of organic apples. The file Applesurveya.xlsx contains two worksheets, one with the survey results (survey responses) and one that describes the survey questions (Survey questions). The survey results are also in [...]. The Word file Organic Apple Surveya contains the survey instrument.

Note: There is no question 53. Yes/no responses are coded 1 = Yes and 2 = No; check boxes are coded 1 = checked and 0 = not checked.
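Because the two coding schemes differ (2 means No for a yes/no question, while 0 means not checked for a check box), it can help to recode the responses before analysis. The following Python sketch shows one way to do so; the column names and the sample row are hypothetical and are not taken from the survey file.

```python
# Coding schemes stated in the note above.
YES_NO = {1: True, 2: False}      # 1 = Yes, 2 = No
CHECKBOX = {1: True, 0: False}    # 1 = checked, 0 = not checked


def recode(value, mapping):
    """Map a raw survey code to a boolean; unknown or blank codes become None."""
    return mapping.get(value)


# Hypothetical column names and responses, for illustration only.
raw_row = {"q1": 1, "q2": 2, "q7_optA": 0, "q7_optB": 1}
yes_no_columns = {"q1", "q2"}

clean_row = {
    column: recode(value, YES_NO if column in yes_no_columns else CHECKBOX)
    for column, value in raw_row.items()
}
print(clean_row)
```

Mapping unexpected codes to `None` rather than to `False` keeps missing answers distinguishable from genuine "No" or "not checked" answers.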

Bank Churn

File: CC

The highly competitive banking industry has been a leader in using business analytics. Two areas of application have been using data to identify characteristics that help in retaining current customers and in obtaining new customers. Davenport and Harris noted in their book Competing on Analytics¹ that Capital One is one of the industry’s leaders in using its data. Common measures used in the industry are churn and churn rate. Churn is either yes or no: yes if a customer leaves the current service and no if the customer stays. Churn rate is the percentage of customers who leave or stop using a service during a certain period of time. The JMP file CC Churn.jmp² contains a bank’s customer data with 245,465 observations and several descriptive characteristics: Churn Flag, cust id, Average Daily Balance, Interest Paid, Cash Advances, Balance Transferred, Marital Status, Occupation Group, Age of Account (Months), Age Group, LTV Group (lifetime value group), Bill Cycle, Customer Type, Gender, Customer Value, and Credit Limit.

Can this information be used to predict churn?
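Before building a predictive model, it is usually worth computing the baseline churn rate defined above. The following Python sketch assumes the churn flag is stored as the text values "yes" and "no"; the actual encoding of Churn Flag in the file may differ, and the sample values are made up.

```python
def churn_rate(flags):
    """Percentage of customers whose churn flag is 'yes'."""
    flags = list(flags)
    if not flags:
        raise ValueError("churn rate is undefined for an empty customer list")
    churned = sum(1 for flag in flags if flag.strip().lower() == "yes")
    return 100.0 * churned / len(flags)


# Hypothetical flags for five customers over one period.
sample_flags = ["yes", "no", "no", "yes", "no"]
rate = churn_rate(sample_flags)
print(f"Churn rate: {rate:.1f}%")
```

The baseline rate matters when judging a model: if only a small share of customers churn, a model that always predicts "no" will look accurate while being useless.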



University admissions offices have dramatically changed in recent years in how they contact, communicate with, and discover prospective students. For most students, their initial contacts with the university are over the Internet, whether through the university’s website, Facebook, or even Second Gen. The old approach of just visiting certain high schools and going to college fairs is simply not enough. Every admissions office is exploring ways to obtain a competitive advantage. The data set has 39,441 observations for Fall 2006 to Fall 2010. This undergraduate admissions data is from a small university (total student population, undergraduate and graduate, of just under 9000). The variables are:

Academic Period, Gender Description, Merit-Based Financial Aid, Unique ID, Secondary School, Residency Indicator, State Province, High School GPA, Common Application- Paper, Nation Description, Act English, Common Application, Student Level, Act Math, Saint Joseph’s Online Application, Student Population, Act Reading, Common Application Upload, Application Date, Act Science Reasoning, Saint Joseph’s Paper Application, Admissions Population Description, Act Composite, Residency Description, Sat Verbal, College Description, Sat Mathematics, Major Description, Sat Total Score, ACRK Index, Institutional Aid Offered, Class Rank, Legacy Description, Class Size, Citizenship Description, Class Rank Percentile, Religion Description, Nation Of Birth Description, Admissions Athlete, and Need-Based Financial Aid.

More university admissions offices are analyzing their admissions data to better understand not only the students who enroll at the university but also those who do not. Can you help them?

Home Equity


The banking industry is one of the leaders in applying business analytics as part of its operations. The file hmeq.jmp³ contains data on 5960 customers. The variables are:

Default: 1 = defaulted on loan; 0 = paid loan in full

Loan: Amount of loan requested

Mortgage: Amount due on existing mortgage

Value: Current value of property

Reason: Reason for the loan request: HomeImp = home improvement; DebtCon = debt consolidation

Job: Six occupational categories

YOJ: Years at present job

Derogatories: Number of major derogatory reports

Delinquencies: Number of delinquent credit lines

CLAge: Age of oldest credit line in months

Inquiries: Number of recent credit inquiries

CLNo: Number of existing credit lines

DEBTINC: Debt-to-income ratio

In this scenario, a bank would like to use their customer information to assist them in determining whether or not they should approve a home equity loan.


File: Pharm.xls

EMD Research is one of the leading healthcare market research companies, providing pharmaceutical companies and investors with forecasts and trends of drug usage. The file Pharm.xls contains weekly data for five cholesterol drugs (Crestor, Lipitor, Vytorin, Zetia, and Zocor) at several strengths (i.e., dosage levels) over 2¼ years.⁴ Besides possibly improving their 3-month forecasts, can you discover any significant trends or findings in the cholesterol drug market?

Cell Phone Churn


Cell phone companies are constantly bombarding us with new phones, programs, and other offerings so that we will leave our current company and switch to theirs. A term describing such behavior is churn. In the file, we have a data set from a cell phone company with data describing 3,333 customers and their cell phone characteristics:⁵

Account Length, Area Code, Int’l Plan, VMail Plan, VMail Message, Day Mins, Day Calls, Day Charge, Eve Mins, Eve Calls, Eve Charge, Night Mins, Night Calls, Night Charge, Intl Mins, Intl Calls, Intl Charge, and CustServ Calls.



Using this data, can you assist this cell phone company in predicting churn?


1 Competing on Analytics: The New Science of Winning, Thomas H. Davenport and Jeanne G. Harris, Harvard Business School Press, 2007.

2 Thanks to Chuck Pirrello of SAS for providing the data set.

3 Thanks to Tom Bohannon of SAS for providing the data set.

4 Competing on Analytics: The New Science of Winning, Thomas H. Davenport and Jeanne G. Harris, Harvard Business School Press, 2007.

5 Thanks to Tom Bohannon of SAS for providing the data set.
