Probability Theory – The Math of Intelligence #6


Hello world! It's Siraj. And let's laser focus on the role of probability theory in machine learning by building a spam classifier from scratch. Life is full of uncertainty: we try things we think will probably succeed, but we are not certain. Will it rain today? Is it okay for me to dance in public? Should I invest more time in this relationship? Probability theory gives us a framework to model these decisions, and by doing so, we can make them more efficiently. There are branches of math that help us make decisions when we have perfect information, but probability trains us to make decisions where there are observable patterns but also a degree of uncertainty, aka real life. Probability is a measure of how likely something is to happen, and the practice of analysing events governed by probability is called statistics!

A simple example is flipping a coin. There are only two possible outcomes: heads or tails. We can model the probability of heads because we know two things: the number of ways it can happen and the total number of outcomes. We've got a 50% chance in this case, just like how often Bluetooth decides to work. This is a random variable: it denotes something about which we are uncertain, an unpredictable event. It's not a variable in the way algebra uses the word; instead, it has a whole set of possible values, also called the sample space, and the probability of any one value in that set is written P(X = x). Random variables can be either discrete, taking only certain values, or continuous, taking any value within a range.

If we have two possible events, A and B, say we are tossing a coin and throwing a six-sided die, we can measure their probabilities three different ways. Given that the coin lands on heads, what is the probability that the die lands on 4? That's the conditional probability. We can also model the probability that both events occur, like the probability that the coin lands on heads and the die lands on 4; that's the joint probability. And if we want the probability of specific outcomes on their own, like just the coin or just the die, we call that the marginal probability. We make lots of assumptions like this in machine learning, and sometimes they are wrong. Numenta.

So there's this really popular formula called Bayes' theorem that's built on top of the axioms of conditional probability. It's called a theorem because we can prove its truth using logic. It states that for two events A and B, if we know the conditional probability of B given A and the probability of A, we can compute the conditional probability of A given B. In other words, the posterior probability of A given B can be calculated by multiplying the likelihood by the prior and dividing their product by the evidence term. The prior probability of an event, often just called the prior, is the probability calculated from information that is already known. The prior probability of rain on a given day could be set to 0.6 if you know that 60% of the days on that same date have been rainy over the last 100 years. We start with a prior, and when new information arrives we can use it to re-estimate the same probability more accurately. As the Bayesian statistician Lindley once put it, "Grab your glocks when you see..." wait, wrong quote: "today's posterior is tomorrow's prior". We can use this theorem to update probabilities in light of new knowledge. For example: what is the probability that the ride will crash, given that a wooden board breaks? So how is this used in machine learning?
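Before we get to that, here's a minimal numeric sketch of Bayes' theorem in Python, just to make the formula concrete. The 0.6 rain prior comes from the example above; the cloud likelihood and evidence values are made-up numbers purely for illustration, not anything from a real dataset.

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# posterior      = likelihood * prior / evidence

def bayes(prior_a, likelihood_b_given_a, evidence_b):
    """Posterior P(A|B) from the prior, the likelihood, and the evidence."""
    return likelihood_b_given_a * prior_a / evidence_b

# Prior: 60% of days on this date have historically been rainy.
p_rain = 0.6
# Likelihood (assumed for illustration): probability of morning clouds given rain.
p_clouds_given_rain = 0.8
# Evidence (assumed for illustration): overall probability of morning clouds.
p_clouds = 0.7

posterior = bayes(p_rain, p_clouds_given_rain, p_clouds)
print(posterior)  # ~0.686 -- and today's posterior becomes tomorrow's prior
```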
There's a family of linear classifiers based on Bayes' theorem called Naive Bayes classifiers. They tend to perform really well, especially for small sample sizes, which is where they can outperform more powerful alternatives. Naive Bayes classifiers are used in a bunch of different fields, from diagnosing diseases to sentiment analysis to classifying emails as spam, which is what we'll do. They make two big assumptions about the data. The first is that the samples are independent and identically distributed: they act as random variables that are independent of each other and are drawn from the same probability distribution. The second assumption is conditional independence of the features, which means the likelihood of the samples can be estimated directly from the training data instead of evaluating every possible combination of X. So given an N-dimensional feature vector X, we can calculate the class-conditional probability: how likely is it to observe this particular pattern X, given that it belongs to class Y? In practice, that assumption is violated a good amount of the time. Regardless, these classifiers still perform well. To make a prediction with Naive Bayes, we calculate the probability of the instance belonging to each class and select the class with the highest one. This kind of categorical data is a great use case for Naive Bayes.

We'll start by loading up our data file. It's in CSV format, so we can open it with the popular pandas data-processing module and store each line in a DataFrame object using its read_csv function. Each email message is labeled either spam or ham (not spam). We can split the data into a training set to train our model and a testing set to evaluate its prediction capability. For our spam classification problem, in the context of Bayes' theorem, we set A to the event that the email is spam and B to the contents of the email. If the probability that an email is spam is greater than the probability that it's not, we classify it as spam; otherwise we don't. Since Bayes' theorem puts the probability of B in the divisor in both cases, we can drop it from the comparison. Calculating the probability of A and the probability of not A is simple: they're just the fractions of our training set that are spam versus not spam. The harder part is calculating the probability of B given A and the probability of B given not A.

To do this, we'll use the bag-of-words model. That means we treat a piece of text as a bag of unique words, with no attention paid to their ordering. For each word, we calculate the percentage of times it shows up in spam versus not-spam emails. Then, to calculate the conditional probability for an entire email, we just take the product of those per-word conditional probabilities over every word in the email. This is done at classification time, not training time.
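Here's a rough sketch of what that pipeline can look like in Python. It is not the exact code from the demo: the file name 'emails.csv', the 'label' and 'text' column names, the 80/20 split, the whitespace tokenizer, and the alpha smoothing constant are all illustrative assumptions.

```python
import pandas as pd
from collections import Counter

# Load the labeled emails; file and column names are assumed for this sketch.
df = pd.read_csv('emails.csv')

# Simple positional split: first 80% for training, the rest for evaluation.
split = int(len(df) * 0.8)
train, test = df[:split], df[split:]

spam = train[train['label'] == 'spam']
ham = train[train['label'] == 'ham']

# Priors P(spam) and P(not spam): just the class fractions of the training set.
p_spam = len(spam) / len(train)
p_ham = len(ham) / len(train)

def bag_of_words(messages):
    """Count how often each word appears across a set of messages (ordering ignored)."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts

spam_counts = bag_of_words(spam['text'])
ham_counts = bag_of_words(ham['text'])
spam_total = sum(spam_counts.values())
ham_total = sum(ham_counts.values())

def word_likelihood(word, counts, total, alpha=1):
    """P(word | class) with Laplace (add-one) smoothing so unseen words never give 0.
    The class vocabulary size in the denominator is one common smoothing choice."""
    return (counts[word] + alpha) / (total + alpha * len(counts))

def classify(text):
    """Label an email spam if P(spam) * prod P(word|spam) beats the same product for ham."""
    score_spam, score_ham = p_spam, p_ham
    for word in text.lower().split():
        score_spam *= word_likelihood(word, spam_counts, spam_total)
        score_ham *= word_likelihood(word, ham_counts, ham_total)
    return 'spam' if score_spam > score_ham else 'ham'

# Example usage: accuracy on the held-out test set.
accuracy = (test['text'].apply(classify) == test['label']).mean()
print(f'test accuracy: {accuracy:.2%}')
```

One practical note: on long emails the product of many small probabilities can underflow to zero, so a real implementation would usually sum log-probabilities instead of multiplying; the comparison stays the same.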
With these functions we can construct our classifier function, which gets called for every email and uses our previously defined functions to classify it. That's it! Now we can classify new emails as spam or not spam really easily. So what happens if a word in the email we're classifying isn't in our training set? We have to handle this edge case somehow, and the solution is to use something called Laplace smoothing, which we can insert into our code as an alpha variable. It just means we add 1 to every count so it's never zero, because if we didn't, the probability for some unseen word, say 'cup', would be 0, and then the probability of the whole email would become 0 regardless of how many other spammy phrases it contains. Are there improvements we could make to our model? Sure: we could use a richer representation than bag of words, like N-grams, instead of counting individual words. But hey, that's more than enough for this video.

To summarize: probability theory helps us formally model the uncertainty of life, which is awesome. Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. And Naive Bayes classifiers apply Bayes' theorem with independence assumptions between the features.

The Wizard of the Week is Hammad Shaikh. Hammad's notebook demonstrates how to use Principal Component Analysis to visualize a high-dimensional dataset and detect whether a person has diabetes. I'm very impressed with the quality of the documentation, definitely check it out. And the runner-up is Kristian Wichmann, who used 3 different autoencoders to visualize plant data, very cool. This week's challenge is to write your own Naive Bayes classifier on a text dataset with better results than my demo. Details in the README, GitHub links in the comments, winners announced next week. Please subscribe for more programming videos, and for now I've gotta accept uncertainty, so thanks for watching 🙂 !