Problems

A patient is tested for having a virus V. The accuracy of the test is 98%. This virus V is currently present in 4 out of 100 people in the region of the patient:

a) What is the probability that a patient suffers from the virus V if they tested positive?
b) What is the probability that a patient can still suffer from the disease if the result of the test was negative?

Apart from testing patients for the virus V (as in the previous problem), a doctor usually also checks for other symptoms. According to a doctor, about 85% of patients with symptoms such as fever, nausea, abdominal discomfort, and malaise suffer from the virus V:

a) What is the probability that a patient is suffering from the virus V if they have the symptoms mentioned above and their test result for the virus V is positive?
b) How likely is it that the patient is suffering from the virus V if they have the symptoms mentioned above, but the result of the test is negative?

On a certain island, 1 in 2 tsunamis are preceded by an earthquake. There have been 4 tsunamis and 6 earthquakes in the past 100 years. A seismological station recorded an earthquake in the ocean near the island. What is the probability that it will result in a tsunami?

Patients are tested with four independent tests on whether they have a certain illness:

Test1 positive	Test2 positive	Test3 positive	Test4 positive	Illness
Yes	Yes	Yes	No	Yes
Yes	Yes	No	Yes	Yes
No	Yes	No	No	No
Yes	No	No	No	No
No	No	No	No	No
Yes	Yes	Yes	Yes	Yes
Yes	No	Yes	Yes	Yes
No	Yes	No	No	No
No	No	Yes	Yes	No
Yes	Yes	No	Yes	Yes
Yes	No	Yes	No	Yes
Yes	No	No	Yes	Yes
No	Yes	Yes	No	?

We have taken a new patient, for whom the second and third tests are positive and the first and fourth are negative. What is the probability that they suffer from the illness?

We are given the following table of which words an email contains and whether it is spam or not:

Money	Free	Rich	Naughty	Secret	Spam
No	No	Yes	No	Yes	Yes
Yes	Yes	Yes	No	No	Yes
No	No	No	No	No	No
No	Yes	No	No	No	Yes
Yes	No	No	No	No	No
No	Yes	No	Yes	Yes	Yes
No	Yes	No	Yes	No	Yes
No	No	No	Yes	No	Yes
No	Yes	No	No	No	No
No	No	No	No	Yes	No
Yes	Yes	Yes	No	Yes	Yes
Yes	No	No	No	Yes	Yes
No	Yes	Yes	No	No	No
Yes	No	Yes	No	Yes	?

a) What is the result of the naive Bayes algorithm when given an email that contains the words money, rich, and secret, but does not contain the words free and naughty?
b) Do you agree with the result of the algorithm? Is the naive Bayes algorithm, as used here, a good method to classify email? Justify your answers.

Gender classification. Assume we are given the following data about 10 people:

Height in cm	Weight in kg	Hair length	Gender
180	75	Short	Male
174	71	Short	Male
184	83	Short	Male
168	63	Short	Male
178	70	Long	Male
170	59	Long	Female
164	53	Short	Female
155	46	Long	Female
162	52	Long	Female
166	55	Long	Female
172	60	Long	?

What is the probability that the 11th person with a height of 172cm, weight of 60kg, and long hair is a man?

Analysis:

Before the patient is given the test, the probability that they suffer from the virus is 4%, P(virus)=4%=0.04. The accuracy of the test is test_accuracy=98%=0.98, and we take it to apply both to patients with the virus and to patients without it. We apply the formula from the medical test example:

P(test_positive)=P(test_positive|virus)*P(virus)+P(test_positive|no_virus)*P(no_virus)

= test_accuracy*P(virus)+(1-test_accuracy)*(1-P(virus))

= 2*test_accuracy*P(virus)+1-test_accuracy-P(virus)

Therefore, we have the following:

a) P(virus|test_positive)=P(test_positive|virus)*P(virus)/P(test_positive)

=test_accuracy*P(virus)/P(test_positive)

=test_accuracy*P(virus)/[2*test_accuracy*P(virus)+1-test_accuracy-P(virus)]

=0.98*0.04/[2*0.98*0.04+1-0.98-0.04]=0.67123287671~67%

Therefore, there is a probability of about 67% that a patient suffers from the virus V if the result of the test is positive.

b) P(virus|test_negative)=P(test_negative|virus)*P(virus)/P(test_negative)

=(1-test_accuracy)*P(virus)/[1-P(test_positive)]

=(1-test_accuracy)*P(virus)/[1-2*test_accuracy*P(virus)-1+test_accuracy+P(virus)]

=(1-test_accuracy)*P(virus)/[test_accuracy+P(virus)-2*test_accuracy*P(virus)]

=(1-0.98)*0.04/[0.98+0.04-2*0.98*0.04]=0.000849617672~0.08%

If the test is negative, a patient can still suffer from the virus V with a probability of 0.08%.
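Both results can be checked numerically. The following is a minimal Python sketch rather than part of the original solution; the function name `posterior_given_test` is hypothetical, and it assumes, as the derivation above does, that the 98% accuracy applies equally to patients with and without the virus.

```python
def posterior_given_test(prior, accuracy, positive):
    """Return P(virus | test result) using Bayes' theorem.

    Assumes the test is correct with probability `accuracy`
    for both infected and healthy patients.
    """
    # Likelihood of the observed result with and without the virus.
    like_virus = accuracy if positive else 1 - accuracy
    like_healthy = 1 - accuracy if positive else accuracy
    numerator = like_virus * prior
    return numerator / (numerator + like_healthy * (1 - prior))


p_pos = posterior_given_test(0.04, 0.98, positive=True)   # part a)
p_neg = posterior_given_test(0.04, 0.98, positive=False)  # part b)
print(p_pos, p_neg)
```

The denominator here is written as a sum of the two joint probabilities instead of the expanded form used above; the two expressions are algebraically the same.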

For the second problem, we can assume that the symptoms and a positive test result are conditionally independent events given that a patient suffers from the virus V. The variables we have are the following:

P(virus)=0.04

test_accuracy=0.98

symptoms_accuracy=85%=0.85

Since we have two conditionally independent random variables, we will use the extended Bayes' theorem:

a) Let R=P(test_positive|virus)*P(symptoms|virus)*P(virus)

=test_accuracy*symptoms_accuracy*P(virus)

=0.98*0.85*0.04=0.03332

~R=P(test_positive|~virus)*P(symptoms|~virus)*P(~virus)

=(1-test_accuracy)*(1-symptoms_accuracy)*(1-P(virus))

=(1-0.98)*(1-0.85)*(1-0.04)=0.00288

Then P(virus|test_positive,symptoms) = R/[R+~R]

=0.03332/[0.03332+0.00288]=0.92044198895~92%.

So, the patient with the symptoms for virus V and the positive test result for virus V suffers from the virus with a probability of approximately 92%.

Note that in the previous problem, we learned that a patient suffers from the virus with a probability of only about 67% given that the result of the test was positive. After adding a second, conditionally independent observation, the confidence increased to 92%, even though the symptom assessment is only 85% reliable. This suggests that it is usually a good idea to include as many conditionally independent observations as are available when calculating the posterior probability.

b) Here, the patient has the symptoms for the virus V, but the result of the test is negative. Thus we have the following:

R=P(test_negative|virus)*P(symptoms|virus)*P(virus)

=(1-test_accuracy)*symptoms_accuracy*P(virus)

=(1-0.98)*0.85*0.04=0.00068

~R=P(test_negative|~virus)*P(symptoms|~virus)*P(~virus)

=test_accuracy*(1-symptoms_accuracy)*(1-P(virus))

=0.98*(1-0.85)*(1-0.04)=0.14112

Thus P(virus|test_negative,symptoms)=R/[R+~R]

=0.00068/[0.00068+0.14112]=0.0047954866~0.48%

Thus, a patient who tested negative but has the symptoms of the virus V has a probability of about 0.48% of having the virus.
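The R/[R+~R] calculation for both parts can be sketched in Python as follows. This is an illustrative sketch under the same assumptions as the analysis (conditional independence, and symptoms_accuracy used as P(symptoms|virus) with 1-symptoms_accuracy as P(symptoms|~virus)); the function name `posterior` is hypothetical.

```python
import math


def posterior(prior, likelihoods_if_true, likelihoods_if_false):
    """Extended Bayes' theorem for conditionally independent observations.

    likelihoods_if_true:  P(observation_i | hypothesis) for each observation
    likelihoods_if_false: P(observation_i | not hypothesis) for each observation
    """
    r = prior * math.prod(likelihoods_if_true)             # R
    not_r = (1 - prior) * math.prod(likelihoods_if_false)  # ~R
    return r / (r + not_r)


p_virus, test_accuracy, symptoms_accuracy = 0.04, 0.98, 0.85

# a) positive test and symptoms
p_a = posterior(p_virus,
                [test_accuracy, symptoms_accuracy],
                [1 - test_accuracy, 1 - symptoms_accuracy])

# b) negative test and symptoms
p_b = posterior(p_virus,
                [1 - test_accuracy, symptoms_accuracy],
                [test_accuracy, 1 - symptoms_accuracy])

print(p_a, p_b)
```

Passing the likelihoods as lists keeps the function usable for any number of observations, not only two.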

For the tsunami problem, we apply the basic form of Bayes' theorem:

P(tsunami|earthquake)=P(earthquake|tsunami)*P(tsunami)/P(earthquake)

~0.5*(4/(365*100))/(6/(365*100))

~0.5*4/6~1/3=33%

There is a chance of 33% that there will be a tsunami following the recorded earthquake.

Note that here we set P(tsunami) to be the probability of a tsunami happening on some particular day out of the days in the past 100 years. We used a day as the unit to calculate the probability P(earthquake) as well. If we changed the unit to an hour, week, month, and so on for both P(tsunami) and P(earthquake), the result would still be the same. What is important in the calculation is the ratio P(tsunami):P(earthquake)=4:6=2/3:1, that is, a tsunami is 2/3 as likely to happen as an earthquake.
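The independence from the chosen time unit can be illustrated with a short Python sketch. The function name and the particular units tried are choices made for this illustration only; like the calculation above, it ignores leap years.

```python
def p_tsunami_given_earthquake(periods_per_year, years=100, tsunamis=4,
                               earthquakes=6, p_earthquake_given_tsunami=0.5):
    """Bayes' theorem with probabilities measured per time period."""
    n_periods = periods_per_year * years
    p_tsunami = tsunamis / n_periods
    p_earthquake = earthquakes / n_periods
    return p_earthquake_given_tsunami * p_tsunami / p_earthquake


# Try several time units; the number of periods cancels out each time.
results = {
    unit: p_tsunami_given_earthquake(per_year)
    for unit, per_year in [("day", 365), ("week", 52), ("month", 12)]
}
print(results)
```

Each entry comes out at about 1/3, since the number of periods appears in both P(tsunami) and P(earthquake) and cancels in the quotient.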

For the problem with the four independent tests, we put the data into the program for calculating the posterior probability from the observations and get the following answer:

[['No', 'Yes', 'Yes', 'No', {'Yes': 0.0, 'No': 1.0}]]

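By this calculation, the patient should not suffer from the illness. However, a probability of 1.0 for No seems too confident. It arises because no patient with the illness in the table has a negative first test, so the estimate of P(Test1 negative|illness) is 0. It may be a good idea to collect more data to get a more precise estimate of the probability that the patient is healthy.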
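The program referred to above is not reproduced in this section. As a stand-in, here is a minimal naive Bayes sketch for categorical data; the function name `naive_bayes` and the one-letter Y/N encoding of the table rows are assumptions made for brevity, so the output keys are 'Y' and 'N' rather than 'Yes' and 'No'.

```python
from collections import Counter

# Each row: Test1, Test2, Test3, Test4, Illness (Y = Yes, N = No).
patients = ["YYYNY", "YYNYY", "NYNNN", "YNNNN", "NNNNN", "YYYYY",
            "YNYYY", "NYNNN", "NNYYN", "YYNYY", "YNYNY", "YNNYY"]


def naive_bayes(rows, query):
    """Return {label: posterior} for the feature values in `query`."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(label)
        for i, value in enumerate(query):
            matches = sum(1 for row in rows
                          if row[-1] == label and row[i] == value)
            score *= matches / n_label  # P(feature_i = value | label)
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


result = naive_bayes(patients, "NYYN")
print(result)
```

The posterior for Y is exactly 0 because the count of ill patients with a negative first test is 0, which zeroes the whole product for that class.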

a) The result of the algorithm is as follows:

[['Yes', 'No', 'Yes', 'No', 'Yes', {'Yes': 0.8459918784779665, 'No': 0.15400812152203341}]]

So, according to the naive Bayes algorithm, when applied to the data in the table, the email is spam with the probability of about 85%.

b) The method may not be very reliable here, since the occurrences of certain words in a spam email are not independent. For example, a spam email containing the word money would likely try to convince its victim that they could somehow get money from the spammer, and thus other words such as rich, secret, or free are more likely to be present in such an email as well. A nearest neighbor algorithm would seem to perform better at spam email classification. One could compare the methods using cross-validation.
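The figure in part a) can be reproduced with exact arithmetic. The sketch below is not the program used above; it uses Python's `fractions` module, a hypothetical helper `joint`, and a one-letter Y/N encoding of the table chosen for this illustration.

```python
from fractions import Fraction

# Each row: Money, Free, Rich, Naughty, Secret, Spam (Y = Yes, N = No).
emails = ["NNYNYY", "YYYNNY", "NNNNNN", "NYNNNY", "YNNNNN", "NYNYYY",
          "NYNYNY", "NNNYNY", "NYNNNN", "NNNNYN", "YYYNYY", "YNNNYY",
          "NYYNNN"]
query = "YNYNY"  # money, not free, rich, not naughty, secret


def joint(label):
    """P(query words | label) * P(label) under the naive assumption."""
    group = [e for e in emails if e[-1] == label]
    score = Fraction(len(group), len(emails))  # prior P(label)
    for i, value in enumerate(query):
        # Fraction of emails in this class with the queried word value.
        score *= Fraction(sum(e[i] == value for e in group), len(group))
    return score


r, not_r = joint("Y"), joint("N")
p_spam = r / (r + not_r)
print(r, not_r, p_spam, float(p_spam))
```

With this table, R=135/13312 and ~R=3/1625, so the posterior is exactly 5625/6649, which is about 0.846.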

For the gender classification problem, we will use the extended Bayes' theorem for both continuous and discrete random variables:

P(male|height=172cm,weight=60kg,hair=long)=R/[R+~R]

where R=P(height=172cm|male)*P(weight=60kg|male)*P(hair=long|male)*P(male)

~R=P(height=172cm|female)*P(weight=60kg|female)*P(hair=long|female)*P(female)

Let us summarize the given information in the following tables, where the variances are sample variances (dividing by n-1):

Gender	Mean of height	Variance of height
Male	176.8	37.2
Female	163.4	30.8

Gender	Mean of weight	Variance of weight
Male	72.4	53.8
Female	53	22.5

From this data, let us compute the other values needed to determine the final probability of the person being male. The height and the weight are modeled with a normal distribution:

P(height=172cm|male)=exp[-(172-176.8)^2/(2*37.2)]/[sqrt(2*37.2*π)]=0.04798962999

P(weight=60kg|male)=exp[-(60-72.4)^2/(2*53.8)]/[sqrt(2*53.8*π)]=0.01302907931

P(hair=long|male)=0.2

P(male)=0.5 by assumption

P(height=172cm|female)=exp[-(172-163.4)^2/(2*30.8)]/[sqrt(2*30.8*π)]=0.02163711333

P(weight=60kg|female)=exp[-(60-53)^2/(2*22.5)]/[sqrt(2*22.5*π)]=0.02830872899

P(hair=long|female)=0.8

P(female)=0.5 by assumption

Hence, we have the following:

R=0.04798962999*0.01302907931*0.2*0.5=0.00006252606

~R=0.02163711333*0.02830872899*0.8*0.5=0.00024500767

P(male|height=172cm,weight=60kg,hair=long)

=0.00006252606/[0.00006252606+0.00024500767]=0.2033144787~20.3%

Therefore, the person with a height of 172 cm, a weight of 60 kg, and long hair is male with a probability of about 20.3%. Thus, they are more likely to be female.
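The whole calculation can be sketched in Python, computing the means and variances directly from the table. This is an illustrative sketch only; the helper names `gaussian_pdf` and `joint` are hypothetical. It uses `statistics.variance`, which is the sample variance, and the prior of 0.5 assumed above.

```python
import math
from statistics import mean, variance  # variance() divides by n - 1

# Each row: height in cm, weight in kg, hair length, gender.
people = [
    (180, 75, "Short", "Male"), (174, 71, "Short", "Male"),
    (184, 83, "Short", "Male"), (168, 63, "Short", "Male"),
    (178, 70, "Long", "Male"), (170, 59, "Long", "Female"),
    (164, 53, "Short", "Female"), (155, 46, "Long", "Female"),
    (162, 52, "Long", "Female"), (166, 55, "Long", "Female"),
]


def gaussian_pdf(x, mu, var):
    """Density of the normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def joint(gender, height, weight, hair, prior=0.5):
    """Product of the class-conditional terms and the prior (R or ~R)."""
    group = [p for p in people if p[3] == gender]
    heights = [p[0] for p in group]
    weights = [p[1] for p in group]
    # Discrete feature: relative frequency of the hair length in the class.
    p_hair = sum(p[2] == hair for p in group) / len(group)
    return (gaussian_pdf(height, mean(heights), variance(heights))
            * gaussian_pdf(weight, mean(weights), variance(weights))
            * p_hair * prior)


r = joint("Male", 172, 60, "Long")
not_r = joint("Female", 172, 60, "Long")
p_male = r / (r + not_r)
print(p_male)
```

The continuous features contribute density values and the discrete feature contributes a frequency; mixing them is fine here because the same terms appear in both R and ~R before normalizing.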
