Classification

Naive Bayes

Weekly design


Pre-class video

  • Eng ver.
  • Kor ver.
  • Pre-class PPT pdf

Discussion

Discussion #4


Class

What is Naive Bayes?

Naive Bayes is a probabilistic classification algorithm that is based on Bayes’ theorem, which is named after Thomas Bayes, an 18th-century statistician and minister who described how to update the probability of an event using prior knowledge of related events. The goal of the algorithm is to predict the probability of each class label given a set of observed features.

Motivation

The main motivation behind Naive Bayes is its simplicity and efficiency. It is a fast and effective algorithm that can be used for both binary and multi-class classification problems. It is particularly useful in situations where the number of features is large compared to the number of observations, as it requires a relatively small amount of training data to produce accurate predictions.

Assumption

The Naive Bayes algorithm assumes that the features are conditionally independent given the class label, which is where the “naive” in Naive Bayes comes from. This assumption allows the algorithm to simplify the calculations involved in determining the probability of each class label given the observed features.

Bayes’ Theorem

Bayes’ theorem is a fundamental theorem in probability theory that describes the relationship between the conditional probabilities of two events (here, A and B). It states that the probability of event A given event B is equal to the probability of event B given event A multiplied by the probability of event A, divided by the probability of event B. Mathematically, this can be written as:

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]

where:

  • \(P(A|B)\) is the probability of event A given event B (known as the posterior probability)

  • \(P(B|A)\) is the probability of event B given event A (known as the likelihood)

  • \(P(A)\) is the probability of event A (known as the prior probability)

  • \(P(B)\) is the probability of event B (known as the evidence)
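As a quick worked example (with made-up numbers, purely for illustration): suppose 1% of incoming emails are spam, a particular keyword appears in 90% of spam emails but only 5% of legitimate ones, and we want the probability that an email containing the keyword is spam.

# Hypothetical numbers, for illustration only
p_spam <- 0.01                    # P(A): prior probability of spam
p_word_given_spam <- 0.90         # P(B|A): keyword appears in spam
p_word_given_ham  <- 0.05         # P(B|not A): keyword appears in non-spam

# P(B): the evidence, i.e. the total probability of seeing the keyword
p_word <- p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(A|B): posterior probability of spam given the keyword
p_word_given_spam * p_spam / p_word   # roughly 0.15

Even though the keyword is a strong indicator, the low prior keeps the posterior modest.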


The Naive Bayes Algorithm

The Naive Bayes algorithm uses Bayes’ theorem to predict the probability of each class label given a set of observed features. The algorithm assumes that the features are conditionally independent given the class label, which allows the algorithm to simplify the calculations involved in determining the probability of each class label.

Let \(X = (X_1, X_2, ..., X_n)\) represent the set of observed features, and let Y represent the class label. The goal is to predict the probability of each class label given X, i.e. \(P(Y|X)\). Using Bayes’ theorem, we can write:

\[ P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)} \]

where:

  • \(P(Y|X)\) is the posterior probability of Y given X

  • \(P(X|Y)\) is the likelihood of X given Y

  • \(P(Y)\) is the prior probability of Y

  • \(P(X)\) is the evidence


The Naive Bayes algorithm assumes that the features \(X_1, X_2, ..., X_n\) are conditionally independent given Y, which means that:

\[ P(X|Y) = P(X_1|Y) \times P(X_2|Y) \times \ldots \times P(X_n|Y) \]


Using this assumption, we can rewrite the equation for \(P(Y|X)\) as:

\[ P(Y|X) = \frac{P(Y)P(X_1|Y)P(X_2|Y) \cdots P(X_n|Y)}{P(X)} \]


The evidence \(P(X)\) is a constant for a given set of features X, so we can ignore it for the purposes of classification. Therefore, we can simplify the equation to:

\[ P(Y|X) \propto P(Y) \times P(X_1|Y) \times P(X_2|Y) \times \ldots \times P(X_n|Y) \]


The Naive Bayes algorithm calculates the likelihoods \(P(X_i|Y)\) for each feature and class label from the training data, and uses these likelihoods to predict the probability of each class label given a new set of features. The algorithm selects the class label with the highest probability as the predicted class label.
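To make this concrete, here is a minimal sketch (with made-up priors and likelihoods, not estimated from any real dataset) that scores two class labels for a new observation with two binary features, using the proportional form above:

# Hypothetical quantities that would normally be estimated from training data
prior     <- c(spam = 0.3, ham = 0.7)    # P(Y)
lik_word1 <- c(spam = 0.8, ham = 0.1)    # P(X1 = present | Y)
lik_word2 <- c(spam = 0.4, ham = 0.2)    # P(X2 = present | Y)

# Unnormalized posterior: P(Y) * P(X1|Y) * P(X2|Y)
score <- prior * lik_word1 * lik_word2

score / sum(score)         # normalized posterior probabilities
names(which.max(score))    # predicted class label ("spam" here)

Note that the evidence P(X) never has to be computed explicitly; dividing by sum(score) recovers it implicitly.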


Pros & Cons

  • Pros:

    • Simple and easy to implement

    • Fast and efficient

    • Works well with high-dimensional data

    • Robust to irrelevant features

    • Can handle both binary and multi-class classification problems

  • Cons:

    • Assumes independence between features, which may not always be the case

    • Can be sensitive to outliers and imbalanced datasets

    • May not perform well if the training data is insufficient or unrepresentative of the population


Despite its limitations...

Naive Bayes remains a popular algorithm in the field of machine learning and is widely used in a variety of applications, including spam filtering, sentiment analysis, and medical diagnosis. Its simplicity, efficiency, and effectiveness make it a valuable tool in any machine learning practitioner’s toolkit.


Pop-up QZs

  1. What is Naive Bayes?
    1. A regression algorithm
    2. A clustering algorithm
    3. A probabilistic classification algorithm
    4. An unsupervised learning algorithm
  2. What does the “naive” in Naive Bayes refer to?
    1. The fact that the algorithm is simple and easy to implement
    2. The assumption that the features are conditionally independent given the class label
    3. The fact that the algorithm is fast and efficient
    4. The fact that the algorithm works well with high-dimensional data
  3. What is the main motivation behind Naive Bayes?
    1. Its ability to handle imbalanced datasets
    2. Its ability to work well with irrelevant features
    3. Its simplicity and efficiency
    4. Its ability to handle both binary and multi-class classification problems
  4. What are some pros of Naive Bayes?
    1. It is simple and easy to implement
    2. It is fast and efficient
    3. It works well with high-dimensional data
    4. All of the above
  5. What are some cons of Naive Bayes?
    1. It assumes independence between features, which may not always be the case
    2. It can be sensitive to outliers and imbalanced datasets
    3. It may not perform well if the training data is insufficient or unrepresentative of the population
    4. All of the above

Hands-on Practice

For this example, we will be using the “Breast Cancer Wisconsin (Diagnostic)” dataset from the UCI Machine Learning Repository. The dataset contains information about breast cancer tumors, including various measurements such as radius, texture, perimeter, and area.

The dataset contains a total of 569 instances or samples, each with 32 attributes: an ID, the diagnosis, and 30 real-valued features. The target variable (the diagnosis) is binary; below we encode it as 0 (indicating a benign tumor1) or 1 (indicating a malignant tumor). The dataset was originally created by Dr. William H. Wolberg, a physician at the University of Wisconsin Hospitals, Madison, and has been used extensively in research and machine learning competitions.

The 30 features in the dataset are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. FNA is a minimally invasive method for diagnosing breast cancer, in which a thin needle is used to extract cells from the mass, which are then examined under a microscope. The features describe characteristics of the cell nuclei present in the image, such as the radius, texture, area, smoothness, compactness, concavity, symmetry, and fractal dimension.

Here is an overview of the steps we will follow:

  1. Load and preprocess the data

  2. Split the data into training and testing sets

  3. Train the Naive Bayes model using the training set

  4. Use the trained model to predict the classes of the testing set

  5. Evaluate the performance of the model

Let’s start by loading and preprocessing the data:

library(e1071) # Required library for Naive Bayes

# Load the dataset
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", header = FALSE)

# Assign column names to the dataset
colnames(data) <- c("id", "diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean", "compactness_mean", "concavity_mean", "concave.points_mean", "symmetry_mean", "fractal_dimension_mean", "radius_se", "texture_se", "perimeter_se", "area_se", "smoothness_se", "compactness_se", "concavity_se", "concave.points_se", "symmetry_se", "fractal_dimension_se", "radius_worst", "texture_worst", "perimeter_worst", "area_worst", "smoothness_worst", "compactness_worst", "concavity_worst", "concave.points_worst", "symmetry_worst", "fractal_dimension_worst")

# Convert the diagnosis column to a binary variable
data$diagnosis <- ifelse(data$diagnosis == "M", 1, 0)

# Remove the ID column as it is not useful for modeling
data <- data[, -1]

head(data)
  diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean
1         1       17.99        10.38         122.80    1001.0         0.11840
2         1       20.57        17.77         132.90    1326.0         0.08474
3         1       19.69        21.25         130.00    1203.0         0.10960
4         1       11.42        20.38          77.58     386.1         0.14250
5         1       20.29        14.34         135.10    1297.0         0.10030
6         1       12.45        15.70          82.57     477.1         0.12780
  compactness_mean concavity_mean concave.points_mean symmetry_mean
1          0.27760         0.3001             0.14710        0.2419
2          0.07864         0.0869             0.07017        0.1812
3          0.15990         0.1974             0.12790        0.2069
4          0.28390         0.2414             0.10520        0.2597
5          0.13280         0.1980             0.10430        0.1809
6          0.17000         0.1578             0.08089        0.2087
  fractal_dimension_mean radius_se texture_se perimeter_se area_se
1                0.07871    1.0950     0.9053        8.589  153.40
2                0.05667    0.5435     0.7339        3.398   74.08
3                0.05999    0.7456     0.7869        4.585   94.03
4                0.09744    0.4956     1.1560        3.445   27.23
5                0.05883    0.7572     0.7813        5.438   94.44
6                0.07613    0.3345     0.8902        2.217   27.19
  smoothness_se compactness_se concavity_se concave.points_se symmetry_se
1      0.006399        0.04904      0.05373           0.01587     0.03003
2      0.005225        0.01308      0.01860           0.01340     0.01389
3      0.006150        0.04006      0.03832           0.02058     0.02250
4      0.009110        0.07458      0.05661           0.01867     0.05963
5      0.011490        0.02461      0.05688           0.01885     0.01756
6      0.007510        0.03345      0.03672           0.01137     0.02165
  fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst
1             0.006193        25.38         17.33          184.60     2019.0
2             0.003532        24.99         23.41          158.80     1956.0
3             0.004571        23.57         25.53          152.50     1709.0
4             0.009208        14.91         26.50           98.87      567.7
5             0.005115        22.54         16.67          152.20     1575.0
6             0.005082        15.47         23.75          103.40      741.6
  smoothness_worst compactness_worst concavity_worst concave.points_worst
1           0.1622            0.6656          0.7119               0.2654
2           0.1238            0.1866          0.2416               0.1860
3           0.1444            0.4245          0.4504               0.2430
4           0.2098            0.8663          0.6869               0.2575
5           0.1374            0.2050          0.4000               0.1625
6           0.1791            0.5249          0.5355               0.1741
  symmetry_worst fractal_dimension_worst
1         0.4601                 0.11890
2         0.2750                 0.08902
3         0.3613                 0.08758
4         0.6638                 0.17300
5         0.2364                 0.07678
6         0.3985                 0.12440

Next, we will split the data into training and testing sets. We will use 80% of the data for training and 20% for testing.

# Set seed for reproducibility
set.seed(123)

# Split the data into training and testing sets
train.index <- sample(nrow(data), 0.8 * nrow(data))
train <- data[train.index, ]
test <- data[-train.index, ]
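Because Naive Bayes can be sensitive to imbalanced datasets, a quick sanity check of the class proportions in each split is worthwhile (not part of the original workflow, but cheap to do):

# Proportion of benign (0) and malignant (1) cases in each split
prop.table(table(train$diagnosis))
prop.table(table(test$diagnosis))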

Now, we can train the Naive Bayes model using the training set. We will use the naiveBayes function from the e1071 library.

# Train the Naive Bayes model
model <- naiveBayes(diagnosis ~ ., data = train)
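The fitted object stores the quantities discussed in the theory section: the class priors P(Y) and, for each numeric feature, the per-class mean and standard deviation that e1071 uses for a Gaussian likelihood P(X_i|Y). A quick way to inspect them (output omitted):

# Class priors P(Y), estimated from the training set
model$apriori / sum(model$apriori)

# Per-class mean and standard deviation of one feature,
# used for its Gaussian likelihood P(radius_mean | Y)
model$tables$radius_mean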

Next, we can use the trained model to predict the classes of the testing set:

# Use the model to predict the classes of the testing set
predictions <- predict(model, newdata = test)
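By default, predict() returns the most probable class label for each observation; if we want the posterior probabilities P(Y|X) themselves, predict() also accepts type = "raw" (output omitted):

# Posterior class probabilities for the first few test observations
posterior <- predict(model, newdata = test, type = "raw")
head(posterior)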

Finally, we can evaluate the performance of the model using various metrics such as accuracy, precision, and recall.

# Calculate the accuracy of the model
accuracy <- sum(predictions == test$diagnosis) / nrow(test)
cat("Accuracy:", round(accuracy, 2), "\n")
Accuracy: 0.93 
# Calculate the precision and recall of the model
TP <- sum(predictions == 1 & test$diagnosis == 1)  # predicted positive, actually positive
FP <- sum(predictions == 1 & test$diagnosis == 0)  # predicted positive, actually negative
FN <- sum(predictions == 0 & test$diagnosis == 1)  # predicted negative, actually positive

precision <- TP / (TP + FP)
recall <- TP / (TP + FN)

cat("Precision:", round(precision, 2), "\n")
Precision: 0.94 
cat("Recall:", round(recall, 2), "\n")
Recall: 0.91 

In a binary classification problem, a confusion matrix consists of four categories:

                      Positive (Actual)     Negative (Actual)
Positive (Predicted)  True Positive (TP)    False Positive (FP)
Negative (Predicted)  False Negative (FN)   True Negative (TN)
  • True Positive (TP): Number of (actual) positive cases that are correctly (True) classified

  • False Positive (FP): Number of (actual) negative cases that are incorrectly classified as (predicted) positive

  • True Negative (TN): Number of (actual) negative cases that are correctly classified

  • False Negative (FN): Number of (actual) positive cases that are incorrectly classified as (predicted) negative
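In R, the full confusion matrix for our test set can be produced in one line with table(); the exact counts will depend on the random split:

# Rows: predicted class, columns: actual class
table(Predicted = predictions, Actual = test$diagnosis)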

Precision & Recall

  • The precision of the model tells us how many of the positive cases that the model predicted were actually positive,

    • The accuracy of the positive predictions made by the model

      \[ \frac{TP}{TP+FP} \]

  • while the recall tells us how many of the actual positive cases were correctly predicted by the model.

    • The ability of the model to correctly identify all positive instances

      \[ \frac{TP}{TP+FN} \]

  • In general, precision is more important when we want to avoid false positives. For example, in a spam classification task, we would want to have high precision to ensure that legitimate emails are not classified as spam. False positives can be costly and can lead to important messages being missed.

  • On the other hand, recall is more important when we want to avoid false negatives. For example, in a medical diagnosis task, we would want to have high recall to ensure that all instances of a disease are correctly identified, even if it means that some healthy individuals are identified as having the disease. False negatives can be costly and can lead to delayed treatment and potentially life-threatening consequences.

  • In summary, precision and recall are used differently depending on the task and the costs associated with false positives and false negatives.

Footnotes

  1. A benign tumor is a non-cancerous growth that does not spread to other parts of the body. It is made up of cells that are similar in appearance to normal cells, and it usually grows slowly. While a benign tumor is not usually life-threatening, it can still cause health problems depending on its location and size. In some cases, a benign tumor may need to be removed if it is causing symptoms or is at risk of becoming cancerous. Unlike a malignant tumor, which is cancerous and can spread to other parts of the body, a benign tumor does not invade nearby tissues or organs or spread to other parts of the body.↩︎