Appendix
Data Sets
Smaller Data Sets
Large Case Data Sets
At the end of most chapters, we provide several exercises, usually based on one or more data sets used in that chapter or in another chapter. The main purpose of these exercises is to reinforce or extend the mechanics of the chapter's techniques. However, we believe such exercises do not improve the most difficult aspect of the statistical problem-solving process: deciding whether a technique is appropriate or not. We have taken a twofold approach to address this critical problem-solving step.
First, we have eight smaller data sets that can be assigned at the end of Chapters 2‒9. Second, we provide six rich case data sets with numerous observations, numerous variables, or both. These larger data sets generally require more time to analyze and could be appropriate for semester-long assignments. Both types of data sets come from real-world data. In some cases, the data set is not appropriate for the technique(s) covered in the chapter. In other cases, the technique is very appropriate, but the results may or may not provide any real benefit. We leave it up to the instructor as to how to assign these data sets.
In the next section, we provide a brief description of each data set.
Smaller Data Sets
City Ranking
File: CityRanking.jmp
We see it all the time: a listing of the top ten cities to live in. How do they come up with these rankings? We have 367 metropolitan areas and several descriptive variables: Population, Cost of Living, % Creative Class, median household income (Med HH Income), percentage income growth (% Inc Growth), State, and metropolitan area (Metro Area). How do these variables affect whether a city is considered ideal? And how do we rank the cities, from best to worst?
Credit Card Statistics
File: Credit Card Stats.jmp
We have 423,000 observations from a credit card company on their customers’ purchasing patterns by month for the year 2009. The variables include:
accountnumber, Year_Mo, sum, productname, segmentdescription, categoryid, and merchantname. Can you find any patterns or relationships that affect credit card sales?
Crime Data
File: Crimedata.jmp
The FBI compiles an annual report of the volume and rate of crime offenses for the nation, the states, and individual agencies. This report also includes arrest, clearance, and law enforcement employee data. The entire list of variables is shown in the table below.
Region | Murder Rate | Total (Violent+Property) | Burglary
State | Rape Rate | Violent | Larceny
Year | Robbery Rate | Property | MVTheft
Population | Agg-Aslt Rate | Murder | NewTotal
Total Rate | Burglary Rate | Rape | New Violent
Violent Rate | Larceny Rate | Robbery | New Property
Property Rate | MVTheft Rate | Agg_aslt |
The data set has this annual data from 1973 to 1999. Can you find any patterns or relationships in this crime data that will be helpful to the FBI and/or local police?
Retention
File: Freshman
A major issue that most universities face is minimizing the number of students who drop out, leave, or transfer. The data set contains information on 100 college students who have just completed their freshman year. Variables in the file include College GPA, Miles from Home, College (within the university), Accommodations (dormitory or off-campus housing), Years Off (time off between high school and college), Part-time Work Hours, Attends Office Hours, and High School GPA. The university hopes to understand which variables contribute to whether a student will fail during freshman year and leave the school or succeed and return for their sophomore year.
Health Care Trends
File: Healthtrends
Trends in health care are extremely useful to public policy agencies and to health care companies such as pharmaceutical firms. Several companies act as repositories of data on different aspects of the health care industry, such as physician visits or prescriptions. The data set contains 3129 records and provides a macro-level snapshot of physician visits over three months. The variables included are medical procedure (Procedures), medical diagnosis (Diagnosis), and patient count by week. Given this data set, answer the following questions:
1. Why are patients being treated (diagnoses)?
2. How are patients being treated (procedures)?
3. Are there trends in treatment?
Massachusetts Housing
File: MassHousing
Although lagging behind business, federal and state governments are increasing their use of business analytics. In this data set, we have, for 506 towns in the metropolitan Boston area, the crime rates and the following associated variables:
crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq. ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
nox: nitric oxide concentration (parts per 10 million)
rooms: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
distance: weighted distances to five Boston employment centers
radial: index of accessibility to radial highways
tax: full-value, property-tax rate per $10,000
pt: pupil-teacher ratio by town
b: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
lstat: % lower status of the population
mvalue: median value of owner-occupied homes in $1000s
The state of Massachusetts is interested in understanding which factors may have an impact on crime rates.
Equality Promotion
File: Promotion
To be eligible for promotion to lieutenant or captain, candidates take a written exam and an oral exam. Under the weighting agreed with the fire department's union, the written exam counts for 60% of the overall score and the oral exam for 40%. An overall score of 70% must be achieved. The data set has the results for all 118 firefighters who took the exam: 77 for promotion to lieutenant, and the remaining 41 for captain. The variables in the file are:
Race: W = white, H = Hispanic, B = black
Position: Captain or lieutenant
Oral: Oral exam score
Written: Written exam score
Combine: Weighted total score, with 60% written and 40% oral
At the time of the exam, 8 lieutenant and 7 captain positions were available. First, assist the fire department management and identify the top 10 candidates for each position. Second, the exams were not certified, so there may be concern about fairness and, in particular, concern about reverse discrimination.
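The weighted score and the ranking task described above can be sketched in Python. This is only an illustration under stated assumptions: the records below are invented, the field names mirror the variable list (Position, Oral, Written, Combine), the Name field is hypothetical, and applying the 70% cutoff before ranking is one possible reading of the eligibility rule.

```python
# Minimal sketch: weighted exam score (60% written, 40% oral) and top-n ranking.
# The sample records and the "Name" field are hypothetical, not from the data set.

def combined_score(written, oral, pct_written=60, pct_oral=40):
    """Return the weighted total score, with weights given in percentage points."""
    return (pct_written * written + pct_oral * oral) / 100


def top_candidates(candidates, position, n=10, passing=70):
    """Rank candidates for one position by combined score, highest first.

    Assumption: only candidates at or above the passing score are eligible.
    Position labels are compared case-insensitively.
    """
    scored = [
        dict(c, Combine=combined_score(c["Written"], c["Oral"]))
        for c in candidates
        if c["Position"].lower() == position.lower()
    ]
    eligible = [c for c in scored if c["Combine"] >= passing]
    return sorted(eligible, key=lambda c: c["Combine"], reverse=True)[:n]


# Hypothetical records for illustration only.
sample = [
    {"Name": "A", "Position": "Lieutenant", "Written": 80, "Oral": 70},
    {"Name": "B", "Position": "Lieutenant", "Written": 60, "Oral": 75},
    {"Name": "C", "Position": "Captain", "Written": 90, "Oral": 85},
    {"Name": "D", "Position": "Lieutenant", "Written": 85, "Oral": 90},
]

for row in top_candidates(sample, "Lieutenant"):
    print(row["Name"], row["Combine"])
```

In this invented sample, candidate B falls below the 70% cutoff and is excluded from the lieutenant ranking. Comparing pass rates across the Race variable would be a natural next step for the fairness question, but that analysis is left to the reader.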
Titanic Survivors
File: Titanic Passengers
What may have affected the survival of passengers of the Titanic? The data set contains information on 1309 individual passengers (not the crew, only passengers). The passenger variables are:
Passenger Class, Survived, Name, Sex, Age, Siblings and Spouses, Parents and Children, Ticket #, Fare, Cabin, Port, Lifeboat, Body, Home/Destination, and Midpoint age.
The question to be addressed is: Did any of the passenger variables have an effect on their survival?
Large Case Data Sets
Apples
File: Applessurvey.jmp
Over the past several years, the fresh fruit and vegetable industries have experienced a significant increase in the propensity of consumers to purchase local and organic food. In particular, from 1997 to 2008, the consumption of organic foods and beverages increased from $3.6 billion to $21.1 billion. Several interrelated factors have been the driving forces behind these trends. Examples are concern for healthy foods, desire for better-tasting foods, concern over chemicals and pesticides in food, and simply providing support to local industry. Apples are among the fruits most frequently produced locally and purchased, so the study focused on apples.
An online survey of adult residents of Pennsylvania was conducted in 2009 and resulted in 1224 completed surveys. Because of Pennsylvania's diversity of urban, suburban, and rural environments and of industrial and agricultural commerce, and because the state is a major producer and consumer of apples, the state is viewed as a good representative sample. The major objectives of the survey were to evaluate the market opportunities and profitability of organic farming and to identify the factors that influence consumer purchasing of organic apples. The file Applesurveya.xlsx contains two worksheets, one with the survey results (survey responses) and one that describes the survey questions (Survey questions). The survey results are also in Applessurvey.jmp. The survey instrument is in the Word file Organic Apple Surveya.
Note: There is no question 53. Responses are coded 1 = Yes and 2 = No; check boxes are coded 1 = checked and 0 = not checked.
Bank Churn
File: CC Churn.jmp
The highly competitive banking industry has been a leader in using business analytics. Two areas of application have been using customer data to identify characteristics that help in retaining current customers and in obtaining new customers. Davenport and Harris note in their book Competing on Analytics [1] that Capital One is one of the industry's leaders in using its data. Common measures used in the industry are churn and churn rate. Churn is either yes or no: yes if a customer leaves the current service and no if the customer stays. Churn rate is the percentage of customers who leave or stop using a service during a certain period of time. The JMP file CC Churn.jmp [2] contains a bank's customer data with 245,465 observations and several descriptive characteristics: Churn Flag, cust id, Average Daily Balance, Interest Paid, Cash Advances, Balance Transferred, Marital Status, Occupation Group, Age of Account (Months), Age Group, LTV Group (lifetime value group), Bill Cycle, Customer Type, Gender, Customer Value, and Credit Limit.
Can this information be used to predict churn?
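As a starting point, the churn rate defined above can be computed overall and within groups. The sketch below is a minimal illustration in Python, assuming the churn flag is coded as "yes"/"no" strings; the actual coding of Churn Flag in CC Churn.jmp is not stated here, and the sample records are invented.

```python
# Minimal sketch: churn rate as the percentage of customers who left.
# Assumption: churn is coded "yes"/"no"; the real coding in the file may differ.
from collections import defaultdict


def is_churn(flag):
    """Interpret a churn flag; only a case-insensitive 'yes' counts as churn."""
    return str(flag).strip().lower() == "yes"


def churn_rate(flags):
    """Return the percentage of flags that indicate churn."""
    flags = list(flags)
    if not flags:
        raise ValueError("churn rate is undefined for an empty group")
    return 100.0 * sum(is_churn(f) for f in flags) / len(flags)


def churn_rate_by_group(records, group_key, flag_key="Churn Flag"):
    """Return {group value: churn rate} for a list of dict records."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[flag_key])
    return {group: churn_rate(flags) for group, flags in groups.items()}


# Hypothetical records for illustration only.
customers = [
    {"Gender": "F", "Churn Flag": "yes"},
    {"Gender": "F", "Churn Flag": "no"},
    {"Gender": "M", "Churn Flag": "no"},
    {"Gender": "M", "Churn Flag": "no"},
    {"Gender": "M", "Churn Flag": "yes"},
]

print(churn_rate(c["Churn Flag"] for c in customers))
print(churn_rate_by_group(customers, "Gender"))
```

Large differences in churn rate between groups suggest variables worth including in a predictive model, although such comparisons alone do not establish that a variable drives churn.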
Enrollment
File: Enrollment.jmp
University admissions offices have dramatically changed over recent years in how they contact, communicate with, and uncover prospective students. For most students, their initial contacts with the university are over the Internet, whether through the university's website, Facebook, or even Second Gen. Just visiting certain high schools and going to college fairs, as in the old days, is simply no longer enough. Every admissions office is exploring ways to obtain a competitive advantage. The data set Enrollment.jmp has 39,441 observations for Fall 2006 to Fall 2010. This undergraduate admissions data is from a small university (total student population, undergraduate and graduate, of just less than 9000). The variables are:
Academic Period | Gender Description | Merit-Based Financial Aid
Unique ID | Secondary School | Residency Indicator
State Province | High School GPA | Common Application- Paper
Nation Description | Act English | Common Application
Student Level | Act Math | Saint Joseph’s Online Application
Student Population | Act Reading | Common Application Upload
Application Date | Act Science Reasoning | Saint Joseph’s Paper Application
Admissions Population Description | Act Composite | Pre-Dental
Residency Description | Sat Verbal | Pre-Law
College Description | Sat Mathematics | Pre-Med
Major Description | Sat Total Score | Pre-Veterinarian
Applied | ACRK Index |
Admitted | Institutional Aid Offered |
Enrolled | Class Rank |
Legacy Description | Class Size |
Citizenship Description | Class Rank Percentile |
Religion Description | Nation Of Birth Description |
SOC | Admissions Athlete |
Ethnicity | Need-Based Financial Aid |
More university admissions offices are analyzing their admissions data to better understand not only the students who do enroll at the university but also those who do not. Can you help them?
Home Equity
File: hmeq.jmp
The banking industry is one of the leaders in applying business analytics as part of its operations. The data available is from 5960 customers in the file hmeq.jmp [3]. The variables are:
Default: 1 = defaulted on loan; 0 = paid loan in full
Loan: Amount of loan requested
Mortgage: Amount due on existing mortgage
Value: Current value of property
Reason: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation)
Job: Six occupational categories
YOJ: Years at present job
Derogatories: Number of major derogatory reports
Delinquencies: Number of delinquent credit lines
CLAge: Age of oldest credit line in months
Inquiries: Number of recent credit inquiries
CLNo: Number of existing credit lines
DEBTINC: Debt-to-income ratio
In this scenario, a bank would like to use their customer information to assist them in determining whether or not they should approve a home equity loan.
Pharmaceuticals
File: Pharm.xls
EMD Research is one of the leading healthcare market research companies providing pharmaceutical companies and investors with forecasts and trends of drug usage. The file Pharm.xls contains weekly data for five cholesterol drugs (Crestor, Lipitor, Vytorin, Zetia, and Zocor) and for several strengths (i.e., dosage levels) over 2 ¼ years [4]. Besides trying to improve their 3-month forecasts, can you discover any significant trends or findings in the cholesterol drug market?
Cell Phone Churn
File: churn.jmp
Cell phone companies are constantly bombarding us with new phones, programs, and other offerings so that we will leave our current company and switch to theirs. A term describing such behavior is churn. In the file churn.jmp, we have a data set from a cell phone company with data describing 3,333 customers and their cell phone characteristics [5]:
Account Length | Eve Mins
Area Code | Eve Calls
Phone | Eve Charge
Int’l Plan | Night Mins
VMail Plan | Night Calls
E_VMAIL_PLAN | Night Charge
D_VMAIL_PLAN | Intl Mins
VMail Message | Intl Calls
Day Mins | Intl Charge
Day Calls | CustServ Calls
Day Charge | Churn?
Using this data, can you assist this cell phone company in predicting churn?
(Endnotes)
1 Competing on Analytics: The New Science of Winning, Thomas H. Davenport and Jeanne G. Harris, Harvard Business School Press, 2007.
2 Thanks to Chuck Pirrello of SAS for providing the data set.
3 Thanks to Tom Bohannon of SAS for providing the data set.
4 Competing on Analytics: The New Science of Winning, Thomas H. Davenport and Jeanne G. Harris, Harvard Business School Press, 2007.