R refers to categorical variables as factors. The cut() function lets us break a continuous numerical variable into ranges and treat those ranges as the levels of a factor (that is, as a categorical variable). It can also be used to group a variable such as a year of enrollment into larger bins.
    > clinical.trial <- data.frame(patient = 1:1000,
    +                              age = rnorm(1000, mean = 50, sd = 5),
    +                              year.enroll = sample(paste("19", 80:99, sep = ""),
    +                                                   1000, replace = TRUE))
    > summary(clinical.trial)
        patient            age         year.enroll
     Min.   :   1.0   Min.   :31.14   1995   : 61
     1st Qu.: 250.8   1st Qu.:46.77   1989   : 60
     Median : 500.5   Median :50.14   1985   : 57
     Mean   : 500.5   Mean   :50.14   1988   : 57
     3rd Qu.: 750.2   3rd Qu.:53.50   1990   : 56
     Max.   :1000.0   Max.   :70.15   1991   : 55
                                      (Other):654
    > ctcut <- cut(clinical.trial$age, breaks = 5)
    > table(ctcut)
    ctcut
    (31.1,38.9] (38.9,46.7] (46.7,54.6] (54.6,62.4] (62.4,70.2]
             15         232         558         186           9
The reference for the preceding data can be found at: http://www.r-bloggers.com/r-function-of-the-day-cut/.
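To make the arithmetic behind breaks = 5 concrete, here is a minimal Python sketch (NumPy only) of equal-width binning: the span from the minimum to the maximum is split into five intervals of equal width, and the values falling in each interval are counted. The helper name equal_width_edges and the sample ages are made up for illustration. Note one difference from the R output above: np.histogram closes its bins on the left, whereas R's cut() closes them on the right by default and slightly widens the outermost limits so that the extreme values still fall inside a bin.

```python
import numpy as np


def equal_width_edges(values, n_bins):
    """Return n_bins + 1 equally spaced edges spanning min(values)..max(values).

    Hypothetical helper for illustration; it is not part of R or pandas.
    """
    lo, hi = float(np.min(values)), float(np.max(values))
    return np.linspace(lo, hi, n_bins + 1)


# Illustrative ages (not taken from the clinical.trial data above).
ages = np.array([31.0, 38.5, 44.0, 47.2, 50.1, 52.3, 55.9, 61.0, 66.4, 71.0])

edges = equal_width_edges(ages, 5)

# Count values per bin. Bins are half-open [a, b), except the last, which is [a, b].
counts, _ = np.histogram(ages, bins=edges)

print(edges)
print(counts)
```

Every value lands in exactly one bin, so the counts always add up to the number of observations, just as the five counts in the R table add up to 1000.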
Here is the pandas equivalent of the cut() function described earlier, pd.cut() (this applies only to version 0.15 and later):
    In [79]: pd.set_option('precision',4)
             clinical_trial=pd.DataFrame({'patient':range(1,1001),
                                          'age' : np.random.normal(50,5,size=1000),
                                          'year_enroll': [str(x) for x in
                                                          np.random.choice(range(1980,2000),
                                                                           size=1000,replace=True)]})

    In [80]: clinical_trial.describe()
    Out[80]:             age   patient
             count  1000.000  1000.000
             mean     50.089   500.500
             std       4.909   288.819
             min      29.944     1.000
             25%      46.572   250.750
             50%      50.314   500.500
             75%      53.320   750.250
             max      63.458  1000.000

    In [81]: clinical_trial.describe(include=['O'])
    Out[81]:        year_enroll
             count         1000
             unique          20
             top           1992
             freq            62

    In [82]: clinical_trial.year_enroll.value_counts()[:6]
    Out[82]: 1992    62
             1985    61
             1986    59
             1994    59
             1983    58
             1991    58
             dtype: int64

    In [83]: ctcut=pd.cut(clinical_trial['age'], 5)

    In [84]: ctcut.head()
    Out[84]: 0    (43.349, 50.052]
             1    (50.052, 56.755]
             2    (50.052, 56.755]
             3    (43.349, 50.052]
             4    (50.052, 56.755]
             Name: age, dtype: category
             Categories (5, object): [(29.91, 36.646] < (36.646, 43.349] < (43.349, 50.052] < (50.052, 56.755] < (56.755, 63.458]]

    In [85]: ctcut.value_counts().sort_index()
    Out[85]: (29.91, 36.646]       3
             (36.646, 43.349]     82
             (43.349, 50.052]    396
             (50.052, 56.755]    434
             (56.755, 63.458]     85
             dtype: int64
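Besides asking for a number of equal-width bins, pd.cut() also accepts explicit bin edges and labels, which is one way to group a variable such as the enrollment year into larger bins. The following is a minimal, self-contained sketch under a few assumptions: the seed, the age edges, and the label strings are arbitrary choices for illustration, and year_enroll is kept as an integer column here (rather than a string, as in the session above) so that it can be passed to pd.cut() directly.

```python
import numpy as np
import pandas as pd

# Fixed seed chosen arbitrarily so that repeated runs give the same data.
rng = np.random.default_rng(0)

clinical_trial = pd.DataFrame({
    "patient": range(1, 1001),
    "age": rng.normal(50, 5, size=1000),
    # Integers 1980..1999 (the upper bound of rng.integers is exclusive).
    "year_enroll": rng.integers(1980, 2000, size=1000),
})

# Five equal-width bins; retbins=True also returns the computed edges.
age_equal, edges = pd.cut(clinical_trial["age"], 5, retbins=True)

# Explicit edges with readable labels (edges and labels are illustrative).
# With the default right=True the bins are (0, 40], (40, 50], (50, 60], (60, inf].
age_edges = [0, 40, 50, 60, np.inf]
age_labels = ["under 40", "40-50", "50-60", "over 60"]
clinical_trial["age_group"] = pd.cut(
    clinical_trial["age"], bins=age_edges, labels=age_labels
)

# Group enrollment years into decades.
# right=False gives left-closed bins: [1980, 1990) and [1990, 2000).
clinical_trial["decade"] = pd.cut(
    clinical_trial["year_enroll"],
    bins=[1980, 1990, 2000],
    right=False,
    labels=["1980s", "1990s"],
)

print(clinical_trial["age_group"].value_counts().sort_index())
print(clinical_trial["decade"].value_counts().sort_index())
```

Explicit edges make the resulting categories stable across datasets, whereas equal-width bins depend on the minimum and maximum of whatever sample is passed in.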