
Appendix
Data Sets
Smaller Data Sets
Large Case Data Sets

At the end of most chapters, we provide several exercises, usually based on data sets used in that chapter or in another chapter. The main purpose of these exercises is to improve and expand on the mechanics of the chapter's techniques. However, we believe these exercises do not develop the most difficult aspect of the statistical problem-solving process: deciding whether a technique is appropriate. We have taken a twofold approach to address this critical problem-solving step.

First, we provide eight smaller data sets that can be assigned at the end of Chapters 2‒9. Second, we provide six rich case data sets, with numerous observations, numerous variables, or both. These case data sets generally require more time to analyze and could be appropriate for semester-long assignments. Both types of data sets come from real-world data. In some cases, the data set is not appropriate for the technique(s) covered in the chapter. In others, the technique is very appropriate, yet the results may or may not provide any real benefit. We leave it to the instructor to decide how to assign these data sets.

In the next section, we provide a brief description of each data set.

Smaller Data Sets

City Ranking

File: CityRanking.jmp

We see it all the time: a listing of the top ten cities to live in. How do they come up with these rankings? We have 367 metropolitan areas and several descriptive variables: Population, Cost of Living, % Creative Class, median household income (Med HH Income), percentage income growth (% Inc Growth), State, and metropolitan area (Metro Area). How do these variables affect whether a city is considered ideal? And how do we rank the cities, from best to worst?

Credit Card Statistics

File: Credit Card Stats.jmp

We have 423,000 observations from a credit card company on its customers’ purchasing patterns by month for the year 2009. The variables include accountnumber, Year_Mo, sum, productname, segmentdescription, categoryid, and merchantname. Can you find any patterns or relationships that affect credit card sales?

Crime Data

File: Crimedata.jmp

The FBI compiles an annual report of the volume and rate of crime offenses for the nation, the states, and individual agencies. This report also includes arrest, clearance, and law enforcement employee data. The full list of variables is shown in the table below.

Region          Murder Rate     Total (Violent+Property)   Burglary
State           Rape Rate       Violent                    Larceny
Year            Robbery Rate    Property                   MVTheft
Population      Agg-Aslt Rate   Murder                     NewTotal
Total Rate      Burglary Rate   Rape                       New Violent
Violent Rate    Larceny Rate    Robbery                    New Property
Property Rate   MVTheft Rate    Agg_aslt

The data set contains annual data from 1973 to 1999. Can you find any patterns or relationships in this crime data that would be helpful to the FBI and/or local police?

Retention

File: Freshman

A major issue that most universities face is minimizing the number of students who drop out, leave, or transfer. The data set contains information on 100 college students who have just completed their freshman year. Variables in the file include College GPA, Miles from Home, College (within the university), Accommodations (dormitory or off-campus housing), Years Off (time off between high school and college), Part-time Work Hours, Attends Office Hours, and High School GPA. The university hopes to understand which variables contribute to whether a student will fail during freshman year and leave the school, or succeed and return for sophomore year.

Health Care Trends

File: Healthtrends

Trends in health care are extremely useful to public policy agencies and to health care companies such as pharmaceutical companies. Several companies are repositories of data on different aspects of the health care industry, e.g., physician visits or prescriptions. The data set contains 3129 records that give a macro-level snapshot of physician visits over three months. The variables are medical procedure (Procedures), medical diagnosis (Diagnosis), and patient count by week. Given this data set, answer the following questions:

1. Why are patients being treated (diagnoses)?

2. How are patients being treated (procedures)?

3. Are there trends in treatment?

Massachusetts Housing

File: MassHousing

Although lagging behind business, federal and state governments are increasing their use of business analytics. In this data set, we have, for 506 towns in the metropolitan Boston area, the crime rates and the following associated variables:

crim: per capita crime rate by town

zn: proportion of residential land zoned for lots over 25,000 sq. ft.

indus: proportion of non-retail business acres per town

chas: Charles River dummy variable (1 if tract bounds river; 0 otherwise)

nox: nitric oxide concentration (parts per 10 million)

rooms: average number of rooms per dwelling

age: proportion of owner-occupied units built prior to 1940

distance: weighted distances to five Boston employment centers

radial: index of accessibility to radial highways

tax: full-value, property-tax rate per $10,000

pt: pupil-teacher ratio by town

b: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town

lstat: % lower status of the population

mvalue: median value of owner-occupied homes in $1000s

The state of Massachusetts is interested in understanding which factors may have an impact on crime rates.
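
The b variable listed above is a squared transformation of the underlying proportion Bk. A minimal Python sketch, assuming the expression is read as 1000(Bk − 0.63)^2 (the function name b_value is hypothetical and not part of the data set), shows that two different proportions can give the same b, so Bk cannot be recovered uniquely from b:

```python
import math


def b_value(bk: float) -> float:
    """Return 1000 * (bk - 0.63)**2 for a proportion bk in [0, 1]."""
    if not 0.0 <= bk <= 1.0:
        raise ValueError("bk must be a proportion between 0 and 1")
    return 1000.0 * (bk - 0.63) ** 2


# Proportions equally far from 0.63 map to (numerically) the same b.
low = b_value(0.53)
high = b_value(0.73)
print(math.isclose(low, high))

# b is zero when bk equals 0.63 and grows as bk moves away from it.
print(b_value(0.63), b_value(0.0), b_value(1.0))
```

When b is used as a predictor of crim, a small value of b therefore indicates that Bk is close to 0.63, not that Bk itself is small.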

Equality Promotion

File: Promotion

Candidates for promotion to lieutenant or captain take a written exam and an oral exam. Under the fire department’s union-weighted scoring, the written exam counts for 60% of the overall score and the oral exam for 40%, and an overall score of at least 70% must be achieved to be eligible. The data set has the results for all 118 firefighters who took the exam: 77 for promotion to lieutenant and the remaining 41 for captain. The variables in the file are:

Race: W = white, H = Hispanic, B = black

Position: Captain or Lieutenant

Oral: Oral exam score

Written: Written exam score

Combine: Weighted total score, with 60% written and 40% oral

At the time of the exam, 8 lieutenant and 7 captain positions were available. First, assist the fire department management by identifying the top 10 candidates for each position. Second, because the exams were not certified, there may be concern about fairness and, in particular, about reverse discrimination; examine whether the results give grounds for such concern.
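
The weighted score and the ranking task described above can be sketched in Python. This is only an illustration under stated assumptions: exam scores are taken to be on a 0 to 100 scale, candidates below the 70% overall threshold are excluded before ranking, and the function names and sample rows are hypothetical (the rows are made up, not taken from the Promotion file).

```python
from collections import defaultdict

WRITTEN_WEIGHT = 0.60  # weight of the written exam, as stated in the text
ORAL_WEIGHT = 0.40     # weight of the oral exam, as stated in the text
PASSING_SCORE = 70.0   # minimum overall score (assumes a 0-100 scale)


def combined_score(written: float, oral: float) -> float:
    """Weighted total score: 60% written plus 40% oral."""
    return WRITTEN_WEIGHT * written + ORAL_WEIGHT * oral


def top_candidates(records, n):
    """Return the n highest-scoring passing candidates for each position."""
    by_position = defaultdict(list)
    for rec in records:
        score = combined_score(rec["Written"], rec["Oral"])
        if score >= PASSING_SCORE:
            by_position[rec["Position"]].append({**rec, "Combine": score})
    return {
        position: sorted(cands, key=lambda r: r["Combine"], reverse=True)[:n]
        for position, cands in by_position.items()
    }


# Made-up example rows using the variable names listed in the text.
sample = [
    {"Race": "W", "Position": "Captain", "Oral": 80.0, "Written": 90.0},
    {"Race": "H", "Position": "Captain", "Oral": 70.0, "Written": 60.0},
    {"Race": "B", "Position": "Lieutenant", "Oral": 90.0, "Written": 75.0},
    {"Race": "W", "Position": "Lieutenant", "Oral": 60.0, "Written": 85.0},
]

ranked = top_candidates(sample, n=10)
for position, cands in ranked.items():
    for rank, cand in enumerate(cands, start=1):
        print(position, rank, cand["Race"], round(cand["Combine"], 1))
```

Comparing the recomputed score with the Combine column in the file is a quick check that the weighting has been understood correctly. Ties at the cutoff are left in input order here; how they should be broken is a decision for the analyst.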

Titanic Survivors

File: Titanic Passengers

What may have affected the survival of passengers of the Titanic? The data set contains information on 1309 individual passengers (passengers only, not crew). The passenger variables are:

Passenger Class, Survived, Name, Sex, Age, Siblings and Spouses, Parents and Children, Ticket #, Fare, Cabin, Port, Lifeboat, Body, Home/Destination, and Midpoint age.

The question to be addressed is: Did any of the passenger variables have an effect on survival?

Large Case Data Sets

Apples

File: Applessurvey.jmp

Over the past several years, the fresh fruit and vegetable industries have experienced a significant increase in the propensity of consumers to purchase local and organic food. In particular, from 1997 to 2008, the consumption of organic foods and beverages increased from $3.6 billion to $21.1 billion. Several interrelated factors have been the driving forces behind these trends, for example: concern for healthy foods, desire for better-tasting foods, concern over chemicals/pesticides in food, and simply providing support to local industry. Apples are among the fruits most frequently produced and purchased locally, so the study focused on apples.

An online survey of adult residents of Pennsylvania was conducted in 2009 and resulted in 1224 completed surveys. Because of Pennsylvania’s diversity of urban, suburban, and rural environments and of industrial and agricultural commerce, and because the state is a major producer and consumer of apples, the state is viewed as a good representative sample. The major objectives of the survey were to evaluate the market opportunities and profitability of organic farming and to identify the factors that influence consumer purchasing of organic apples. The file Applesurveya.xlsx contains two worksheets: one with the survey results (survey responses) and one that describes the survey questions (Survey questions). The survey results are also in Applessurvey.jmp. The Word file Organic Apple Surveya contains the survey instrument.

Note: There is no question 53. Yes/no responses are coded 1 = Yes and 2 = No; check boxes are coded 1 = checked and 0 = not checked.

Bank Churn

File: CC Churn.jmp

The highly competitive banking industry has been a leader in using business analytics. Two areas of application have been using customer data to identify the characteristics associated with retaining current customers and with obtaining new customers. Davenport and Harris note in their book Competing on Analytics¹ that Capital One is one of the industry’s leaders in using its data. Common measures used in the industry are churn and churn rate. Churn is either yes or no: yes if a customer leaves the current service and no if the customer stays. Churn rate is the percentage of customers that leave or stop using a service during a certain period of time. The JMP file CC Churn.jmp² contains a bank’s customer data with 245,465 observations and several descriptive characteristics: Churn Flag, cust id, Average Daily Balance, Interest Paid, Cash Advances, Balance Transferred, Marital Status, Occupation Group, Age of Account (Months), Age Group, LTV Group (lifetime value group), Bill Cycle, Customer Type, Gender, Customer Value, and Credit Limit.

Can this information be used to predict churn?
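
As a small illustration of the churn rate definition above, the following Python sketch computes the percentage of customers that churned from a list of yes/no flags. The function name and the sample values are hypothetical, and the "yes"/"no" coding is an assumption; the Churn Flag column in the actual file may be coded differently.

```python
def churn_rate(churn_flags):
    """Percentage of customers whose churn flag is "yes".

    Assumes each flag is the string "yes" (customer left) or "no"
    (customer stayed); this coding is an assumption, not the file's.
    """
    flags = list(churn_flags)
    if not flags:
        raise ValueError("at least one customer is required")
    churned = sum(1 for flag in flags if flag == "yes")
    return 100.0 * churned / len(flags)


# Made-up flags for eight customers observed over one period.
sample_flags = ["yes", "no", "no", "no", "yes", "no", "no", "no"]
print(churn_rate(sample_flags))  # 2 of 8 customers churned: 25.0
```

Computing this overall rate first gives a baseline against which any predictive model of churn can be compared.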

Enrollment

File: Enrollment.jmp

University admissions offices have dramatically changed in recent years in how they contact, communicate with, and uncover prospective students. For most students, their initial contact with the university is over the Internet, whether through the university’s website, Facebook, or even Second Gen. Simply visiting certain high schools and attending college fairs, as in the old days, is no longer enough. Every admissions office is exploring ways to obtain a competitive advantage. The data set Enrollment.jmp has 39,441 observations for Fall 2006 to Fall 2010. This undergraduate admissions data is from a small university (total student population, undergraduate and graduate, of just under 9000). The variables are:

Academic Period                     Gender Description            Merit-Based Financial Aid
Unique ID                           Secondary School              Residency Indicator
State Province                      High School GPA               Common Application- Paper
Nation Description                  Act English                   Common Application
Student Level                       Act Math                      Saint Joseph’s Online Application
Student Population                  Act Reading                   Common Application Upload
Application Date                    Act Science Reasoning         Saint Joseph’s Paper Application
Admissions Population Description   Act Composite                 Pre-Dental
Residency Description               Sat Verbal                    Pre-Law
College Description                 Sat Mathematics               Pre-Med
Major Description                   Sat Total Score               Pre-Veterinarian
Applied                             ACRK Index
Admitted                            Institutional Aid Offered
Enrolled                            Class Rank
Legacy Description                  Class Size
Citizenship Description             Class Rank Percentile
Religion Description                Nation Of Birth Description
SOC                                 Admissions Athlete
Ethnicity                           Need-Based Financial Aid

More university admissions offices are analyzing their admissions data to better understand not only the students who enroll at the university but also those who do not. Can you help them?

Home Equity

File: hmeq.jmp

The banking industry is one of the leaders in applying business analytics as part of its operations. The data available is from 5960 customers in the file hmeq.jmp³. The variables are:

Default: 1 = defaulted on loan; 0 = paid loan in full

Loan: Amount of loan requested

Mortgage: Amount due on existing mortgage

Value: Current value of property

Reason: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation)

Job: Six occupational categories

YOJ: Years at present job

Derogatories: Number of major derogatory reports

Delinquencies: Number of delinquent credit lines

CLAge: Age of oldest credit line in months

Inquiries: Number of recent credit inquiries

CLNo: Number of existing credit lines

DEBTINC: Debt-to-income ratio

In this scenario, a bank would like to use its customer information to help determine whether or not it should approve a home equity loan.

Pharmaceuticals

File: Pharm.xls

EMD Research is one of the leading healthcare market research companies providing pharmaceutical companies and investors with forecasts and trends of drug usage. The file Pharm.xls contains weekly data for five cholesterol drugs (Crestor, Lipitor, Vytorin, Zetia, and Zocor) at several strengths (i.e., dosage levels) covering 2¼ years⁴. Besides trying to improve their 3-month forecasts, can you discover any significant trends or findings in the cholesterol drug market?

Cell Phone Churn

File: churn.jmp

Cell phone companies are constantly bombarding us with new phones, programs, and other offerings so that we will leave our current company and switch to theirs. A term describing such behavior is churn. In the file churn.jmp, we have a data set from a cell phone company with data describing 3,333 customers and their cell phone characteristics⁵:

Account Length    Eve Mins
Area Code         Eve Calls
Phone             Eve Charge
Int’l Plan        Night Mins
VMail Plan        Night Calls
E_VMAIL_PLAN      Night Charge
D_VMAIL_PLAN      Intl Mins
VMail Message     Intl Calls
Day Mins          Intl Charge
Day Calls         CustServ Calls
Day Charge        Churn?

Using this data, can you assist this cell phone company in predicting churn?

(Endnotes)

1 Competing on Analytics: The New Science of Winning, Thomas H. Davenport and Jeanne G. Harris, Harvard Business School Press, 2007.

2 Thanks to Chuck Pirrello of SAS for providing the data set.

3 Thanks to Tom Bohannon of SAS for providing the data set.

4 Competing on Analytics: The New Science of Winning, Thomas H. Davenport and Jeanne G. Harris, Harvard Business School Press, 2007.

5 Thanks to Tom Bohannon of SAS for providing the data set.
