Problems

  1. A patient is tested for a virus V. The test is 98% accurate. The virus V is currently present in 4 out of 100 people in the patient's region:
  • a) What is the probability that a patient suffers from the virus V if they tested positive?
  • b) What is the probability that a patient still suffers from the virus V if the result of the test was negative?
  2. Apart from testing patients for the virus V (as in problem 1), a doctor usually also checks for other symptoms. According to a doctor, about 85% of patients with symptoms such as fever, nausea, abdominal discomfort, and malaise suffer from the virus V:
  • a) What is the probability that a patient is suffering from the virus V if they have the symptoms mentioned above and their test result for the virus V is positive?
  • b) How likely is it that the patient is suffering from the virus V if they have the symptoms mentioned above, but the result of the test is negative?
  3. On a certain island, 1 in 2 tsunamis is preceded by an earthquake. There have been 4 tsunamis and 6 earthquakes in the past 100 years. A seismological station recorded an earthquake in the ocean near the island. What is the probability that it will result in a tsunami?
  4. Patients are tested with four independent tests on whether they have a certain illness:

| Test1 positive | Test2 positive | Test3 positive | Test4 positive | Illness |
|---|---|---|---|---|
| Yes | Yes | Yes | No | Yes |
| Yes | Yes | No | Yes | Yes |
| No | Yes | No | No | No |
| Yes | No | No | No | No |
| No | No | No | No | No |
| Yes | Yes | Yes | Yes | Yes |
| Yes | No | Yes | Yes | Yes |
| No | Yes | No | No | No |
| No | No | Yes | Yes | No |
| Yes | Yes | No | Yes | Yes |
| Yes | No | Yes | No | Yes |
| Yes | No | No | Yes | Yes |
| No | Yes | Yes | No | ? |

For a new patient, the second and third tests are positive and the first and fourth are negative. What is the probability that they suffer from the illness?

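  5. We are given the following table of which words an email contains and whether it is spam or not:

| Money | Free | Rich | Naughty | Secret | Spam |
|---|---|---|---|---|---|
| No | No | Yes | No | Yes | Yes |
| Yes | Yes | Yes | No | No | Yes |
| No | No | No | No | No | No |
| No | Yes | No | No | No | Yes |
| Yes | No | No | No | No | No |
| No | Yes | No | Yes | Yes | Yes |
| No | Yes | No | Yes | No | Yes |
| No | No | No | Yes | No | Yes |
| No | Yes | No | No | No | No |
| No | No | No | No | Yes | No |
| Yes | Yes | Yes | No | Yes | Yes |
| Yes | No | No | No | Yes | Yes |
| No | Yes | Yes | No | No | No |
| Yes | No | Yes | No | Yes | ? |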

  • a) What is the result of the naive Bayes algorithm when given an email that contains the words money, rich, and secret, but does not contain the words free and naughty?
  • b) Do you agree with the result of the algorithm? Is the naive Bayes algorithm, as used here, a good method to classify email? Justify your answers.
  6. Gender classification. Assume we are given the following data about 10 people:

| Height in cm | Weight in kg | Hair length | Gender |
|---|---|---|---|
| 180 | 75 | Short | Male |
| 174 | 71 | Short | Male |
| 184 | 83 | Short | Male |
| 168 | 63 | Short | Male |
| 178 | 70 | Long | Male |
| 170 | 59 | Long | Female |
| 164 | 53 | Short | Female |
| 155 | 46 | Long | Female |
| 162 | 52 | Long | Female |
| 166 | 55 | Long | Female |
| 172 | 60 | Long | ? |

What is the probability that the 11th person with a height of 172cm, weight of 60kg, and long hair is a man?

Analysis:

  1. Before the patient is given the test, the probability that they suffer from the virus is 4%, P(virus)=4%=0.04. The accuracy of the test is test_accuracy=98%=0.98. We apply the formula from the medical test example:

P(test_positive)=P(test_positive|virus)*P(virus)+P(test_positive|no_virus)*P(no_virus)

= test_accuracy*P(virus)+(1-test_accuracy)*(1-P(virus))

= 2*test_accuracy*P(virus)+1-test_accuracy-P(virus)

Therefore, we have the following:

  • a) P(virus|test_positive)=P(test_positive|virus)*P(virus)/P(test_positive)

=test_accuracy*P(virus)/P(test_positive)

=test_accuracy*P(virus)/[2*test_accuracy*P(virus)+1-test_accuracy-P(virus)]

=0.98*0.04/[2*0.98*0.04+1-0.98-0.04]=0.67123287671~67%

Therefore, there is a probability of about 67% that a patient suffers from the virus V if the result of the test is positive.

  • b) P(virus|test_negative)=P(test_negative|virus)*P(virus)/P(test_negative)

=(1-test_accuracy)*P(virus)/[1-P(test_positive)]

=(1-test_accuracy)*P(virus)/[1-2*test_accuracy*P(virus)-1+test_accuracy+P(virus)]

=(1-test_accuracy)*P(virus)/[test_accuracy+P(virus)-2*test_accuracy*P(virus)]

=(1-0.98)*0.04/[0.98+0.04-2*0.98*0.04]=0.000849617672~0.08%

If the test is negative, a patient can still suffer from the virus V with a probability of 0.08%.
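
The arithmetic for both parts can be checked with a short script. This is a minimal sketch, assuming (as the analysis above does) that the 98% accuracy applies equally to infected and healthy patients. The function name `posterior_given_test` is a hypothetical helper introduced here for illustration.

```python
def posterior_given_test(prior, accuracy, test_positive):
    """Return P(virus | test result) using Bayes' theorem.

    Assumption: the test is equally accurate for infected and healthy
    patients, i.e. P(positive | virus) = P(negative | no virus) = accuracy.
    """
    if test_positive:
        likelihood_virus = accuracy          # P(positive | virus)
        likelihood_healthy = 1 - accuracy    # P(positive | no virus)
    else:
        likelihood_virus = 1 - accuracy      # P(negative | virus)
        likelihood_healthy = accuracy        # P(negative | no virus)
    numerator = likelihood_virus * prior
    # Total probability of the observed test result.
    evidence = numerator + likelihood_healthy * (1 - prior)
    return numerator / evidence


p_pos = posterior_given_test(prior=0.04, accuracy=0.98, test_positive=True)
p_neg = posterior_given_test(prior=0.04, accuracy=0.98, test_positive=False)
print(f"P(virus | positive) = {p_pos:.4f}")
print(f"P(virus | negative) = {p_neg:.6f}")
```

The two printed values should agree with the roughly 67% and 0.08% derived above.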

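  2. Here, we treat the symptom assessment as a second test with an accuracy of 85%, that is, we assume P(symptoms|virus)=0.85 and P(symptoms|~virus)=0.15. We also assume that the symptoms and a positive test result are conditionally independent events given that a patient suffers from the virus V. The variables we have are the following:

P(virus)=0.04

test_accuracy=0.98

symptoms_accuracy=85%=0.85

Since we have two conditionally independent pieces of evidence, we will use an extended Bayes' theorem: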

  • a) Let R=P(test_positive|virus)*P(symptoms|virus)*P(virus)

=test_accuracy*symptoms_accuracy*P(virus)

=0.98*0.85*0.04=0.03332

~R=P(test_positive|~virus)*P(symptoms|~virus)*P(~virus)

=(1-test_accuracy)*(1-symptoms_accuracy)*(1-P(virus))

=(1-0.98)*(1-0.85)*(1-0.04)=0.00288

Then P(virus|test_positive,symptoms) = R/[R+~R]

=0.03332/[0.03332+0.00288]=0.92044198895~92%.

So, a patient with the symptoms of the virus V and a positive test result suffers from the virus with a probability of approximately 92%.

Note that in the previous question, we learned that a patient suffers from the virus with a probability of only about 67% given a positive test result. After adding another independent random variable, the confidence increased to 92%, even though the symptom assessment is only 85% reliable. This suggests that it is usually a good idea to include as many independent random variables as possible when calculating the posterior probability, since each one can improve its accuracy and confidence.
  • b) Here, the patient has the symptoms for the virus V, but the result of the test is negative. Thus we have the following:

R=P(test_negative|virus)*P(symptoms|virus)*P(virus)

=(1-test_accuracy)*symptoms_accuracy*P(virus)

=(1-0.98)*0.85*0.04=0.00068

~R=P(test_negative|~virus)*P(symptoms|~virus)*P(~virus)

=test_accuracy*(1-symptoms_accuracy)*(1-P(virus))

=0.98*(1-0.85)*(1-0.04)=0.14112

Thus P(virus|test_negative,symptoms)=R/[R+~R]

=0.00068/[0.00068+0.14112]=0.0047954866~0.48%

Thus, a patient who tested negative but has the symptoms of the virus V has a probability of about 0.48% of having the virus.
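
The R/[R+~R] computation generalizes to any number of conditionally independent pieces of evidence. The following is a minimal sketch under the same assumptions as the analysis above (the symptom assessment is treated like a test with 85% accuracy). The helper name `posterior` and the constant names are introduced here for illustration only.

```python
from math import prod


def posterior(prior, likelihoods_if_true, likelihoods_if_false):
    """Extended Bayes' theorem for conditionally independent evidence.

    likelihoods_if_true[i]  = P(evidence_i | hypothesis)
    likelihoods_if_false[i] = P(evidence_i | not hypothesis)
    """
    r = prior * prod(likelihoods_if_true)              # R in the text
    not_r = (1 - prior) * prod(likelihoods_if_false)   # ~R in the text
    return r / (r + not_r)


P_VIRUS = 0.04
TEST_ACCURACY = 0.98
SYMPTOMS_ACCURACY = 0.85  # assumed to act like a test accuracy

# a) positive test and symptoms
positive_with_symptoms = posterior(
    P_VIRUS,
    [TEST_ACCURACY, SYMPTOMS_ACCURACY],
    [1 - TEST_ACCURACY, 1 - SYMPTOMS_ACCURACY],
)
# b) negative test and symptoms
negative_with_symptoms = posterior(
    P_VIRUS,
    [1 - TEST_ACCURACY, SYMPTOMS_ACCURACY],
    [TEST_ACCURACY, 1 - SYMPTOMS_ACCURACY],
)
print(f"{positive_with_symptoms:.4f} {negative_with_symptoms:.6f}")
```

Passing the likelihoods as lists keeps the function independent of how many evidence variables are used, which mirrors the remark that further independent variables can be added.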

  3. We apply the basic form of Bayes' theorem:

P(tsunami|earthquake)=P(earthquake|tsunami)*P(tsunami)/P(earthquake)

~0.5*(4/(365*100))/(6/(365*100))

~0.5*4/6~1/3=33%

There is a chance of about 33% that a tsunami will follow the recorded earthquake.

Note that here we set P(tsunami) to be the probability of a tsunami happening on a particular day out of the days in the past 100 years. We used a day as the unit to calculate P(earthquake) as well. If we changed the unit to an hour, week, month, and so on for both P(tsunami) and P(earthquake), the result would still be the same. What matters in the calculation is the ratio P(tsunami):P(earthquake)=4:6=2/3:1, that is, a tsunami is 2/3 as likely to happen as an earthquake.
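
The claim that the time unit cancels out can be checked numerically. This is a small sketch: the function name `p_tsunami_given_earthquake` and the period counts per year (365 days, 52 weeks, 12 months) are illustrative assumptions, not part of the original solution.

```python
def p_tsunami_given_earthquake(periods_per_year, years=100, tsunamis=4,
                               earthquakes=6, p_earthquake_given_tsunami=0.5):
    """Bayes' theorem with event probabilities measured per time period."""
    periods = periods_per_year * years
    p_tsunami = tsunamis / periods        # P(tsunami) in one period
    p_earthquake = earthquakes / periods  # P(earthquake) in one period
    return p_earthquake_given_tsunami * p_tsunami / p_earthquake


# Illustrative units; the number of periods cancels in the ratio.
UNITS = {"day": 365, "week": 52, "month": 12}
results = {unit: p_tsunami_given_earthquake(n) for unit, n in UNITS.items()}
for unit, p in results.items():
    print(f"unit={unit:<5} P(tsunami | earthquake) = {p:.4f}")
```

Each unit should give the same value of about 1/3, because only the ratio of the two event counts enters the result.
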
  4. We put the data into the program for calculating the posterior probability from the observations and get the following answer:

[['No', 'Yes', 'Yes', 'No', {'Yes': 0.0, 'No': 1.0}]]

By this calculation, the patient does not suffer from the illness. However, a probability of exactly 1.0 for No is suspiciously high. It arises because none of the ill patients in the data had a negative first test, so the product for Yes is zero. It would be a good idea to collect more data to get a more precise estimate of the probability that the patient is healthy.
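
To see where a posterior of exactly 0.0 for Yes comes from, it helps to look at the individual conditional frequencies that naive Bayes multiplies together. The sketch below is not the book's program. It re-enters the twelve labeled rows of the table by hand and uses a hypothetical helper, `conditional_frequencies`.

```python
# Labeled rows of the table: (Test1, Test2, Test3, Test4, Illness).
DATA = [
    ("Yes", "Yes", "Yes", "No", "Yes"),
    ("Yes", "Yes", "No", "Yes", "Yes"),
    ("No", "Yes", "No", "No", "No"),
    ("Yes", "No", "No", "No", "No"),
    ("No", "No", "No", "No", "No"),
    ("Yes", "Yes", "Yes", "Yes", "Yes"),
    ("Yes", "No", "Yes", "Yes", "Yes"),
    ("No", "Yes", "No", "No", "No"),
    ("No", "No", "Yes", "Yes", "No"),
    ("Yes", "Yes", "No", "Yes", "Yes"),
    ("Yes", "No", "Yes", "No", "Yes"),
    ("Yes", "No", "No", "Yes", "Yes"),
]
QUERY = ("No", "Yes", "Yes", "No")  # the new patient's four test results


def conditional_frequencies(data, query, label):
    """Return P(test_i = query_i | Illness = label) for each test i."""
    rows = [row for row in data if row[-1] == label]
    return [
        sum(row[i] == value for row in rows) / len(rows)
        for i, value in enumerate(query)
    ]


freq_ill = conditional_frequencies(DATA, QUERY, "Yes")
freq_healthy = conditional_frequencies(DATA, QUERY, "No")
for i, (p_ill, p_healthy) in enumerate(zip(freq_ill, freq_healthy), start=1):
    print(f"Test{i}={QUERY[i - 1]}: given ill {p_ill:.3f}, given healthy {p_healthy:.3f}")
```

The first factor for the ill class is zero, because every ill patient in the table tested positive on Test1. A single zero factor forces the whole product, and hence the posterior for Yes, to zero regardless of the other tests.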

  5. a) The result of the algorithm is as follows:

[['Yes', 'No', 'Yes', 'No', 'Yes', {'Yes': 0.8459918784779665, 'No': 0.15400812152203341}]]

So, according to the naive Bayes algorithm applied to the data in the table, the email is spam with a probability of about 85%.

b) This method may not be a good one, since the occurrences of certain words in a spam email are not independent. For example, a spam email containing the word money would likely try to convince its victim that they could somehow get money from the spammer, so other words such as rich, secret, or free are more likely to be present in such an email as well. A nearest neighbor algorithm might perform better at spam email classification. One could compare the methods using cross-validation.
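
The figure in part a) can be reproduced with a few lines of code. The following is a minimal sketch of a count-based naive Bayes classifier, not the book's program. The function name `naive_bayes` is hypothetical, and the thirteen labeled rows of the table are re-entered by hand.

```python
from collections import Counter
from math import prod

# Columns: Money, Free, Rich, Naughty, Secret, Spam (labeled rows only).
EMAILS = [
    ("No", "No", "Yes", "No", "Yes", "Yes"),
    ("Yes", "Yes", "Yes", "No", "No", "Yes"),
    ("No", "No", "No", "No", "No", "No"),
    ("No", "Yes", "No", "No", "No", "Yes"),
    ("Yes", "No", "No", "No", "No", "No"),
    ("No", "Yes", "No", "Yes", "Yes", "Yes"),
    ("No", "Yes", "No", "Yes", "No", "Yes"),
    ("No", "No", "No", "Yes", "No", "Yes"),
    ("No", "Yes", "No", "No", "No", "No"),
    ("No", "No", "No", "No", "Yes", "No"),
    ("Yes", "Yes", "Yes", "No", "Yes", "Yes"),
    ("Yes", "No", "No", "No", "Yes", "Yes"),
    ("No", "Yes", "Yes", "No", "No", "No"),
]
# Email containing money, rich, and secret, but not free or naughty.
QUERY = ("Yes", "No", "Yes", "No", "Yes")


def naive_bayes(data, query):
    """Return the posterior probability of each class label for the query."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for label, count in class_counts.items():
        rows = [row for row in data if row[-1] == label]
        # Product of P(feature_i = query_i | label), estimated by counting.
        likelihood = prod(
            sum(row[i] == value for row in rows) / count
            for i, value in enumerate(query)
        )
        scores[label] = likelihood * count / len(data)  # times the prior
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


posterior = naive_bayes(EMAILS, QUERY)
print(posterior)
```

The word features are multiplied as if they were independent given the class, which is exactly the assumption questioned in part b).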

  6. For this problem, we will use the extended Bayes' theorem for both continuous and discrete random variables:

P(male|height=172cm,weight=60kg,hair=long)=R/[R+~R]

where R=P(height=172cm|male)*P(weight=60kg|male)*P(hair=long|male)*P(male)

~R=P(height=172cm|female)*P(weight=60kg|female)*P(hair=long|female)*P(female)

Let us summarize the given information in the following tables:

| Gender | Mean of height | Variance of height |
|---|---|---|
| Male | 176.8 | 37.2 |
| Female | 163.4 | 30.8 |

| Gender | Mean of weight | Variance of weight |
|---|---|---|
| Male | 72.4 | 53.8 |
| Female | 53 | 22.5 |

From this data, let us determine the other values needed to calculate the final probability of the person being male. The values for height and weight are normal probability densities computed from the means and variances above:

P(height=172cm|male)=exp[-(172-176.8)^2/(2*37.2)]/[sqrt(2*37.2*π)]=0.04798962999

P(weight=60kg|male)=exp[-(60-72.4)^2/(2*53.8)]/[sqrt(2*53.8*π)]=0.01302907931

P(hair=long|male)=0.2

P(male)=0.5 by assumption

P(height=172cm|female)=exp[-(172-163.4)^2/(2*30.8)]/[sqrt(2*30.8*π)]=0.02163711333

P(weight=60kg|female)=exp[-(60-53)^2/(2*22.5)]/[sqrt(2*22.5*π)]=0.02830872899

P(hair=long|female)=0.8

P(female)=0.5 by assumption

Hence, we have the following:

R=0.04798962999*0.01302907931*0.2*0.5=0.00006252606

~R=0.02163711333*0.02830872899*0.8*0.5=0.00024500767

P(male|height=172cm,weight=60kg,hair=long)

=0.00006252606/[0.00006252606+0.00024500767]=0.2033144787~20.3%

Therefore, the person with a height of 172 cm, a weight of 60 kg, and long hair is male with a probability of 20.3%. Thus, they are more likely to be female.
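
The whole calculation, including the means and variances, can be reproduced from the raw data. This is a minimal sketch, assuming a normal density for height and weight within each gender, the sample variance (dividing by n-1, which matches the tabulated variances), and a prior of 0.5 for each gender. The helper names `gaussian_pdf` and `class_score` are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import mean, variance  # variance() is the sample variance

# (height in cm, weight in kg, hair length, gender)
PEOPLE = [
    (180, 75, "Short", "Male"),
    (174, 71, "Short", "Male"),
    (184, 83, "Short", "Male"),
    (168, 63, "Short", "Male"),
    (178, 70, "Long", "Male"),
    (170, 59, "Long", "Female"),
    (164, 53, "Short", "Female"),
    (155, 46, "Long", "Female"),
    (162, 52, "Long", "Female"),
    (166, 55, "Long", "Female"),
]


def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def class_score(people, gender, height, weight, hair, prior=0.5):
    """Unnormalized posterior (R or ~R in the text) for one gender."""
    rows = [p for p in people if p[3] == gender]
    heights = [p[0] for p in rows]
    weights = [p[1] for p in rows]
    p_height = gaussian_pdf(height, mean(heights), variance(heights))
    p_weight = gaussian_pdf(weight, mean(weights), variance(weights))
    p_hair = sum(p[2] == hair for p in rows) / len(rows)
    return p_height * p_weight * p_hair * prior


r = class_score(PEOPLE, "Male", 172, 60, "Long")
not_r = class_score(PEOPLE, "Female", 172, 60, "Long")
p_male = r / (r + not_r)
print(f"P(male | 172cm, 60kg, long hair) = {p_male:.4f}")
```

The continuous features contribute densities rather than probabilities, but their common scale cancels in the ratio R/[R+~R], so the final value is still a probability.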
