Reducing Epidemic Deaths via the Naive Bayes Classifier and Big Data Mining
This essay examines the Naïve Bayes Classifier and Bayes’ Theorem as techniques for big data mining. It introduces a new way of applying these formulae to build a predictive model that quickly determines suitable drug prescriptions during large-scale epidemics. It then evaluates the accuracy of the method through a mock simulation and outlines practical computational techniques that allow it to run on a grand scale. Finally, the essay considers room for enhancement: drawing on other mathematical concepts for a greater range of results, adding relevant attributes to the big data collected, and assessing the method's prospects for long-term sustainability.
With over 12 million Americans misdiagnosed every year, and with the rapid spread of mass epidemics such as Zika, Ebola and H1N1, diagnosing patients quickly and accurately is becoming increasingly difficult. Advances in mathematical computation have made it possible to use both statistical and predictive modelling to filter, compare, and identify relevant information in a large database. The Naive Bayes classifier is a highly promising tool of this kind: it calculates the probability of a certain outcome by considering each attribute in a database independently of the others, and can thus serve as a predictive model for responding to mass epidemics in the field of medicine.
The Naive Bayes method stems from Bayes’ Theorem, developed by the 18th-century British mathematician Thomas Bayes. Bayes’ Theorem computes the conditional probability of an event based on existing information about attributes related to that event. Given class C and attribute A, the probability of class (event) C given A is
P(C|A) = P(A|C)P(C)/P(A)
This theorem can also be applied over several attributes a1, a2, ..., am by substituting them for A in the formula:
P(C|a1, a2...am) = P(a1,a2...am|C)P(C)/P(a1,a2...am)
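As a quick numerical check, the theorem can be sketched in a few lines of Python. The function name `posterior` and the likelihood values are hypothetical, chosen only for illustration; the denominator P(A) is obtained by summing P(A|C)P(C) over the classes (the law of total probability).

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem, P(C|A) = P(A|C) * P(C) / P(A), to every class C.

    priors[c] is P(C = c); likelihoods[c] is P(A | C = c).
    """
    # P(A) by the law of total probability: sum over classes of P(A|C) * P(C)
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    return {c: likelihoods[c] * priors[c] / evidence for c in priors}


# Hypothetical numbers, chosen only for illustration
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {"yes": 0.5, "no": 0.25}
result = posterior(priors, likelihoods)  # approximately yes: 0.75, no: 0.25
```

Because every class is divided by the same P(A), the ordering of the classes is already fixed by the numerators alone.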
The Naive Bayes classifier builds on Bayes’ Theorem by assuming that the attributes are independent of one another given the class C, and by dropping the denominator (which is the same in every calculation for a given set of attributes and thus has no effect on which class scores highest).
P(A|C) = P(a1, a2, ..., am|C) = P(a1|C) * P(a2|C) * ... * P(am|C) = ∏_{k=1}^{m} P(ak|C)
Multiplying the P(A|C) calculated above by P(C) then gives a quantity proportional to our intended outcome P(C|A):
P(C|A) ∝ P(C) * ∏_{k=1}^{m} P(ak|C)
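A minimal sketch of how this product might be computed from raw records follows. The function names and the four toy records are hypothetical; probabilities are estimated by simple relative frequency, which is one straightforward choice rather than the only one.

```python
from collections import Counter, defaultdict
from math import prod


def fit_categorical_nb(records, labels):
    """Estimate P(C) and P(a_k = v | C) by relative frequency."""
    class_counts = Counter(labels)
    priors = {c: count / len(labels) for c, count in class_counts.items()}
    # value_counts[c][k][v] = number of class-c records whose k-th attribute equals v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for record, c in zip(records, labels):
        for k, v in enumerate(record):
            value_counts[c][k][v] += 1

    def conditional(k, v, c):
        return value_counts[c][k][v] / class_counts[c]

    return priors, conditional


def unnormalised_posterior(record, c, priors, conditional):
    """P(C) * product over k of P(a_k | C), which is proportional to P(C | A)."""
    return priors[c] * prod(conditional(k, v, c) for k, v in enumerate(record))


# Hypothetical toy data: (sex, smoker) per patient; label = did the patient improve?
records = [("male", "no"), ("male", "yes"), ("female", "no"), ("female", "no")]
labels = ["yes", "no", "yes", "yes"]

priors, conditional = fit_categorical_nb(records, labels)
scores = {
    c: unnormalised_posterior(("male", "no"), c, priors, conditional)
    for c in priors
}
predicted = max(scores, key=scores.get)
```

The scores are not probabilities until they are divided by their sum, but the class with the largest score is the same either way.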
This formula applies only when the attributes are categorical; however, there are variations for continuous variables.
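The essay does not name a specific variation. One widely used option is Gaussian Naive Bayes, in which each continuous attribute is assumed to be normally distributed within a class and the normal density takes the place of P(ak|C). The sketch below rests on that assumption, and the age values are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev


def gaussian_likelihood(x, values_in_class):
    """Normal density at x, with mean and standard deviation estimated from one class."""
    mu = mean(values_in_class)
    sigma = stdev(values_in_class)  # sample standard deviation
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


# Hypothetical ages of patients who improved (class 'yes')
ages_recovered = [34, 41, 45, 52, 38]

# This density would replace P(age | yes) in the product for a 45-year-old patient
density = gaussian_likelihood(45, ages_recovered)
```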
This paper proposes a new way of applying this classifier to drug prescription. Such an application matters most during epidemics, when big data is generated rapidly and there is a pressing need to find suitable cures for the disease quickly through data mining.
The theory behind the proposed mechanism involves utilising the Naïve Bayes Classifier to quickly compare the probability of success of each drug for patients with different attributes. In essence, the given event (class) is whether or not the patient shows improvement on the drug, with C1 being ‘yes’ and C2 being ‘no’. Attributes a1, a2, ..., am consist of patient characteristics. A mini-example simulating data produced from a pandemic of disease X is provided below:
PATIENT INFORMATION DATA FOR DISEASE X
Taking this small-scale example, let us find the drug most likely to cure a middle-aged male who does not smoke or drink, has a high level of allergies, and is suffering from disease X. To keep the calculations manageable, we assume that doctors have prescribed three antibiotic medications to fight this disease: penicillin, cephalexin, and amoxicillin.
Our aim is to use Naïve Bayes to find the probability of recovery with each medication by changing the ‘drug/medication’ attribute (a5) each time. The program will then compare the results to identify the drug with the highest probability of recovery.
Based on this information, the attributes and classes are assigned as follows:
Attributes = A = (a1, a2, a3, ..., am)
In this case: A = (age=middle-aged, sex=male, smoking/drugs=no, allergies=high, medication=penicillin)
Class = Ci = C1, C2, ..., Cn
In this case: C1 = yes, C2 = no.
Using the Naïve Bayes formula:
P(C1) = 6/10
P(C2) = 4/10
P(age=middle-aged | yes) = 1/6 = 0.167
P(sex=male | yes) = 4/6 = 0.667
P(smoking/drugs=no | yes) = 4/6 = 0.667
P(allergies=high | yes) = 1/6 = 0.167
P(medication=penicillin | yes) = 3/6 = 0.5
Therefore, for penicillin:
P(A | yes) = 0.167 * 0.667 * 0.667 * 0.167 * 0.5 = 0.00620375466
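The arithmetic for penicillin can be reproduced with a short Python check. It uses the same three-decimal values as the text and, for comparison, the underlying fractions; the variable names are illustrative only.

```python
from fractions import Fraction
from math import prod

# P(a_k | yes) as quoted in the text, rounded to three decimals
rounded = [0.167, 0.667, 0.667, 0.167, 0.5]
p_rounded = prod(rounded)

# The same five factors as exact fractions of the six 'yes' cases
exact = [Fraction(1, 6), Fraction(4, 6), Fraction(4, 6), Fraction(1, 6), Fraction(3, 6)]
p_exact = prod(exact)

print(p_rounded)                 # about 0.006204, the figure quoted above
print(p_exact, float(p_exact))   # 1/162, about 0.006173
```

The small gap between the two results comes only from rounding each factor to three decimals before multiplying.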
Generally, when applying Naïve Bayes, P(A|C2) is also calculated to determine whether C1 or C2 will occur.
Although doing so can prove useful in determining the drug with the lowest probability of failure (i.e. the safety of the drug), it is not necessary here, as our prime aim is to determine which value of attribute a5 (i.e. the drug) has the highest probability of success given that outcome C1 holds true.
Completing the calculations with amoxicillin and cephalexin:
For amoxicillin: P(A | yes) = 0.167 * 0.667 * 0.667 * 0.167 * 0.167 = 0.00207205405
For cephalexin: P(A | yes) = 0.167 * 0.667 * 0.667 * 0.167 * 0.333 = 0.0041317006
Comparing the three values, penicillin has the highest probability of curing disease X for a patient with the given attributes A.
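The comparison step can be sketched as follows, using the rounded probabilities from the calculations above. The dictionary layout and variable names are assumptions made for illustration.

```python
from math import prod

# Factors common to all three calculations: age, sex, smoking/drugs, allergies (given 'yes')
shared = prod([0.167, 0.667, 0.667, 0.167])

# P(medication = drug | yes), as used in the three calculations
p_drug_given_yes = {"penicillin": 0.5, "cephalexin": 0.333, "amoxicillin": 0.167}

# P(A | yes) for each choice of a5, then rank from highest to lowest
scores = {drug: shared * p for drug, p in p_drug_given_yes.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
best = ranking[0]
```

Since the first four factors are identical in all three calculations, the ranking in this example is decided by P(medication | yes) alone.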
This process compares favourably with mass statistical regression analysis of big data because it is easy to implement and flexible (it requires no prior knowledge of the data). Furthermore, it considers each attribute individually to produce a personalized outcome indicating which medicine works best for each patient.
The snippet of data provided above exemplifies a much grander concept that could save a significant number of lives, particularly during an epidemic crisis. With growing technology and computational speed, this calculation can be performed by looping over a big database comprising millions of patient attributes and thousands of drugs in a matter of seconds, producing the best-suited medicine(s) for each suffering patient.
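One way such a loop might look is sketched below. The table layout, the function name `best_drug`, and the choice to sum logarithms are all assumptions. Summing logs avoids floating-point underflow when many small probabilities are multiplied, and it preserves the ranking because the logarithm is monotonic. The table holds only the example patient's values, with 0.333 and 0.167 read as 2/6 and 1/6.

```python
from math import log

# P(attribute = value | yes) for the example patient and the three drugs.
# A real table would hold one entry per attribute value seen in the database.
cond_prob_yes = {
    ("age", "middle-aged"): 1 / 6,
    ("sex", "male"): 4 / 6,
    ("smoking/drugs", "no"): 4 / 6,
    ("allergies", "high"): 1 / 6,
    ("medication", "penicillin"): 3 / 6,
    ("medication", "cephalexin"): 2 / 6,
    ("medication", "amoxicillin"): 1 / 6,
}
drugs = ["penicillin", "cephalexin", "amoxicillin"]


def best_drug(patient, drugs, cond_prob):
    """Return the drug maximising the sum of log P(a_k | yes) for this patient."""
    # The patient's own attributes contribute the same amount for every drug
    base = sum(log(cond_prob[(attr, value)]) for attr, value in patient.items())
    # log() raises ValueError on a zero probability, so such entries need handling upstream
    scores = {drug: base + log(cond_prob[("medication", drug)]) for drug in drugs}
    return max(scores, key=scores.get)


# One record here; in practice this list would be read from the patient database
patients = [
    {"age": "middle-aged", "sex": "male", "smoking/drugs": "no", "allergies": "high"},
]
recommendations = [best_drug(p, drugs, cond_prob_yes) for p in patients]
```

Each patient costs one pass over their attributes plus one lookup per drug, so the work grows linearly with the number of patients and drugs.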
This is only the beginning. The mathematical model can be improved further by including permutations and combinations of different drugs to establish which work best, adding self-learning mechanisms, taking side effects and dosage into account as new attributes to minimize harm, and, in the long term, creating a generalized database based on the past records of healthcare institutions for longer-lasting diseases.