Logistic Regression & K-NN Algorithm

00:00Hello everyone, welcome to chapter 4, where we will study different
00:16algorithms used for classification, along with an introduction to decision trees.
00:21First we will study logistic regression and the KNN algorithm, which are covered in
00:27this session. Coming to logistic regression, it is a supervised learning
00:34algorithm used for predicting a categorical dependent variable from a given
00:41set of independent variables. Because logistic regression predicts
00:47a categorical dependent variable, the outcome will be a
00:53categorical or discrete value such as yes or no, 0 or 1, true or false, etcetera.
01:02However, instead of giving the exact value 0 or 1, it gives a probabilistic value
01:09that ranges between 0 and 1. In logistic regression, instead of fitting a straight regression
01:17line, we fit an S-shaped logistic function
01:25whose output is bounded by two limiting values, 0 and 1.
01:30The curve of the logistic function indicates the likelihood of something,
01:37for example whether a cell is cancerous or not. The likelihood of
01:44such an outcome is what logistic regression mainly expresses. It provides
01:54probabilities and can classify new data using both continuous and discrete data sets. As I
02:01have said, it gives an S-shaped curve. Here in the figure is the S-shaped
02:10curve, which is called the S-curve. A threshold value
02:15is specified, and the curve runs between the minimum and the maximum value, 0
02:22and 1. Logistic regression can classify observations using different types of data, and we can easily
02:30determine the most effective variables for the classification.
02:35Now we come to the logistic function, which is also called the sigmoid function.
02:42The sigmoid function is a mathematical function used to map the predicted
02:49output, which we have already obtained from the independent variables, to a probability.
02:55That is, the sigmoid function maps any real value into
03:05another value within the range 0 to 1. The output of
03:11logistic regression must be between 0 and 1. You should always see that the value stays
03:17within this range; it cannot go beyond these limits. Because the output flattens out at 0 and 1, the curve
03:27is S-shaped, and this S-shaped curve is why we also call it the sigmoid function.
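The mapping just described can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the lecture: the function name `sigmoid` and the sample inputs are assumptions, and the two-branch form is one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real value z to a value between 0 and 1."""
    # Two branches keep math.exp() from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Illustrative inputs: large negative values approach 0, large positive
# values approach 1, and 0 maps to exactly 0.5 (the middle of the S-curve).
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, round(sigmoid(z), 4))
```

Plotting these values against z would trace out the S-curve from the figure.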
03:35Now, in logistic regression, we use the concept of a threshold value, which
03:40lies between 0 and 1. Why is this threshold value given? Because the output
03:48is a probability between 0 and 1, we need a cut-off value between
03:560 and 1 to turn it into a class. Probabilities above the threshold are assigned to
04:03class 1, and probabilities below it are assigned to class 0. That is why we give a threshold value
04:09in the logistic function. Now, there are certain assumptions as
04:16well, as with the classification algorithms we have already learnt. The first assumption of logistic
04:23regression is that the dependent variable must be categorical in nature,
04:30like I have said: yes or no, true or false, 0 or 1. So, it is
04:36either 0 or 1; this is how the categorical variable should be. Second, the independent variables
04:42should not have multicollinearity. What is multicollinearity? It means the independent variables are highly
04:49correlated with each other, and that should not be the case.
04:54So, these are the assumptions made in logistic regression. Now, the logistic
05:01regression equation: how do we obtain it? The logistic regression equation
05:06can be obtained from the linear regression equation. The mathematical form starts from
05:13the equation of the straight line:
05:20y = b0 + b1x1 + b2x2 + ... + bnxn. In logistic regression, y can be between
05:290 and 1 only. So, we divide y by 1 minus y, which gives y / (1 - y), where y
05:39never goes above 1, since we have already mentioned
05:48that the output value should be between 0 and 1. This ratio
05:54is 0 when y is equal to 0 and infinity when y is equal to 1, so its range is
06:010 to infinity. But we need a range from minus infinity to plus infinity.
06:10So, we take the logarithm of the above form, which gives
06:18log(y / (1 - y)) = b0 + b1x1 + b2x2 + ... + bnxn. This is the general
06:28form of the equation, which you need to understand as the conversion from the linear
06:34regression equation to the logistic one. Now, the types of logistic regression. First
06:42one is binomial. In binomial logistic regression, the dependent variable can have only two possible values,
06:48such as 0 or 1, pass or fail, true or false. All of these
06:57come under binomial logistic regression. The next is multinomial. In multinomial
07:05logistic regression, the dependent variable can have three or more possible unordered
07:12types. Three or more, and just remember the word unordered, for example cats, dogs, sheep.
07:21Is there any relation between cats, dogs and sheep by which we can say that they are
07:26in an ordered category? No, so they come under the unordered category.
07:32Now, next is ordinal logistic regression. In ordinal logistic regression,
07:39the dependent variable can have three or more possible values, and they must be of an ordered type:
07:47three or more possible values with an order, such as low, medium, high. We
07:54know these have an order: low, then medium, then high. So, these
08:01are the three types of logistic regression.
08:05Now, how is it implemented using Python? We will go into Python in later classes.
08:12For now, these are the steps by which logistic regression is implemented in Python. The first
08:18is the data pre-processing step. This is basically where we have a data file
08:24and decide how it is to be converted and which functions are used to process
08:30that particular file. Then, we fit the logistic regression
08:36to the training set. Just like finding the best-fit line in linear regression, we
08:41use the same concept here, fitting the logistic regression to the training set that
08:48we have prepared. After that, we predict the test result. Once the model has been fitted on the training
08:55set, we run it on the test data. We may get the desired output or
09:01we may not, but we will definitely get a result. So, we
09:05are predicting the test results here. Once the test result is obtained, we check
09:12its accuracy: does it really match our expected output? For that
09:19we create a confusion matrix, which we have studied in our previous classes.
09:26After that, once accurate results are obtained, we visualize the test result:
09:33the output is shown in the form of a graph, which is easier for the viewer to understand.
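The first four steps can be sketched end to end in plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the later classes: the toy data (hours studied against pass or fail), the 0.5 threshold, the learning rate and the use of gradient descent to fit the coefficients are all illustrative choices, and the visualization step is left out.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical toy data: hours studied -> fail (0) or pass (1).
train_x = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_x = [0.5, 2.5, 7.5, 9.5]
test_y = [0, 0, 1, 1]

# Step 1: pre-processing. Standardize using training-set statistics only.
mean = sum(train_x) / len(train_x)
std = math.sqrt(sum((x - mean) ** 2 for x in train_x) / len(train_x))

def scale(x):
    return (x - mean) / std

# Step 2: fit b0 and b1 by gradient descent on the average log loss
# (an assumed fitting method; it is one way to do maximum likelihood).
b0, b1, learning_rate = 0.0, 0.0, 0.5
for _ in range(2000):
    grad0 = grad1 = 0.0
    for x, y in zip(train_x, train_y):
        error = sigmoid(b0 + b1 * scale(x)) - y
        grad0 += error
        grad1 += error * scale(x)
    b0 -= learning_rate * grad0 / len(train_x)
    b1 -= learning_rate * grad1 / len(train_x)

# Step 3: predict the test results with an assumed threshold of 0.5.
def predict(x, threshold=0.5):
    return 1 if sigmoid(b0 + b1 * scale(x)) >= threshold else 0

predictions = [predict(x) for x in test_x]

# Step 4: check accuracy with a confusion matrix.
tp = sum(p == 1 and y == 1 for p, y in zip(predictions, test_y))
tn = sum(p == 0 and y == 0 for p, y in zip(predictions, test_y))
fp = sum(p == 1 and y == 0 for p, y in zip(predictions, test_y))
fn = sum(p == 0 and y == 1 for p, y in zip(predictions, test_y))
accuracy = (tp + tn) / len(test_y)

print("confusion matrix [[TN, FP], [FN, TP]]:", [[tn, fp], [fn, tp]])
print("accuracy:", accuracy)
```

In practice a library would normally handle the fitting and the confusion matrix; the hand-written loop is only there to make each step visible.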
09:40So, this is how the implementation of logistic regression is done. Now, coming to
10:52the advantages: it is easy to implement and interpret, and very efficient to train. It makes no assumptions
11:00about the classes, that is, no assumptions about the distribution of classes in feature
11:07space. It can easily be extended to multiple classes, where we use multinomial regression,
11:15and it gives a natural probabilistic view of the class predictions. It also
11:24provides a measure of how appropriate a predictor is, that is, the coefficient size, and also its
11:31direction of association, whether it is positive or negative. It is fast at classifying
11:39unknown records. When we have no idea about the records that we are getting, it
11:44helps us classify those records much faster. There is good accuracy
11:51for many simple data sets when we use logistic regression,
11:58and it performs well when the data set is linearly separable. When the data set is
12:05linearly separable, a linear boundary can separate the classes and give accurate results. We
12:13can interpret the model coefficients as indicators of feature importance. Logistic regression
12:21is less inclined to overfitting, but the problem is overfitting on high-dimensional
12:30data sets. In general it is much less inclined to overfit,
12:37but when it comes to high-dimensional data sets, we can have a problem with overfitting.
12:43So, here we use regularization techniques such as L1 and L2 to avoid overfitting
12:51in high-dimensional data sets. So, these are the points in favour of logistic
12:58regression: it is easy to apply and analyse, which encourages its use
13:05compared with the previous one. Now, the disadvantages. If the number of observations
13:12is less than the number of features, logistic regression should not be used. When the
13:19observations that we have obtained are fewer than the features, we should not use
13:24logistic regression, as it may lead to overfitting.
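The L1 and L2 regularization techniques mentioned above as a guard against overfitting can be sketched as a penalty added to the training loss. This is an illustrative sketch only: the function names `log_loss` and `penalized_loss`, the strength parameter `lam` and the sample numbers are assumptions, and leaving the intercept b0 out of the penalty is a common convention rather than something stated in the lecture.

```python
import math

def log_loss(y_true, probs):
    """Average negative log-likelihood of the true labels."""
    eps = 1e-12  # keeps log() away from exactly 0
    total = 0.0
    for y, p in zip(y_true, probs):
        total += y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
    return -total / len(y_true)

def penalized_loss(y_true, probs, coefs, lam, kind="l2"):
    """Log loss plus an L1 or L2 penalty on the coefficients b1..bn."""
    if kind == "l1":
        penalty = sum(abs(b) for b in coefs)   # pushes some coefficients to zero
    elif kind == "l2":
        penalty = sum(b * b for b in coefs)    # shrinks all coefficients smoothly
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return log_loss(y_true, probs) + lam * penalty

# Hypothetical values: two predictions and two (deliberately large) coefficients.
y_true, probs, coefs = [1, 0], [0.9, 0.1], [3.0, -4.0]
print(penalized_loss(y_true, probs, coefs, lam=0.1, kind="l1"))
print(penalized_loss(y_true, probs, coefs, lam=0.1, kind="l2"))
```

A larger `lam` penalizes large coefficients more strongly, which is what discourages the model from fitting noise when there are many features.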
13:30Next, it constructs linear boundaries. The decision boundary is restricted to a linear
13:36form, and we cannot go beyond that. The major limitation is
13:43the assumption of linearity between the dependent and the independent variables. More precisely, we assume
13:49linearity between the log odds of the dependent variable and the independent variables. It can only be used
13:55to predict discrete functions. Hence, the dependent variable of logistic regression is bound to
14:02a discrete number set. This comes under disadvantages. Non-linear problems cannot
14:09be solved by logistic regression, because its decision surface is linear.
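A small experiment can illustrate the linear-boundary limitation. The XOR data set below is an assumed example, not one from the lecture: with a 0.5 threshold, logistic regression predicts class 1 exactly when b0 + b1x1 + b2x2 is at least 0, so the sketch simply tries a grid of coefficients and counts how many of the four points each linear boundary gets right.

```python
from itertools import product

# XOR: the label is 1 when exactly one of the two inputs is 1.
xor_points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Candidate coefficients from -3.0 to 3.0 in steps of 0.5 (an arbitrary grid).
grid = [i / 2 for i in range(-6, 7)]

best = 0
for b0, b1, b2 in product(grid, repeat=3):
    correct = 0
    for (x1, x2), label in xor_points:
        # sigmoid(score) >= 0.5 is equivalent to score >= 0.
        predicted = 1 if b0 + b1 * x1 + b2 * x2 >= 0 else 0
        correct += predicted == label
    best = max(best, correct)

print("best number of XOR points classified correctly:", best)  # prints 3
```

No straight line separates the XOR classes, so at most three of the four points can be classified correctly, however the coefficients are chosen.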
14:17Logistic regression requires little or no multicollinearity; that is, as we said earlier,
14:24we assume there is no multicollinearity between the independent variables.
14:29This is also one of the disadvantages, since the assumption of no multicollinearity
14:36may not hold in practice. Now, it is tough to obtain complex relationships
14:40using logistic regression. When the relationship becomes much more complex, or the data
14:46sets given to us are much more complex, we will not be able to model them well using logistic
14:53regression. There are much more
14:58powerful algorithms, such as neural networks, which can easily outperform
15:05it on such problems. Now, in linear regression, the independent
15:10and dependent variables are related linearly. Logistic regression instead needs the
15:16independent variables to be linearly related to the log odds, log(p / (1 - p)). This
15:25condition adds complexity when applying the model. So, these are the disadvantages of logistic
15:33regression. To recap, the model has an S-shaped form, we give
15:39a threshold value, and the output range is between 0 and 1. This is the basic concept
15:46needed to understand logistic regression. Now, coming to the k-nearest neighbors algorithm, which
15:51is the next algorithm under supervised learning. It assumes similarity
15:57between the new data that we have received and the data that is already available.
16:04It takes the new case and the available cases and puts the new case into the category
16:10that is most similar among the available categories.
16:15It stores all the available data and classifies new data based on similarity. We are applying
16:23the similarity concept here: how similar is the new data to the data
16:30that is already available? It can be used for regression as well as classification,
16:36but it is mostly used for classification problems. It is a non-parametric algorithm,
16:42which means it does not make any assumptions about the underlying data. It is also called a lazy learner. We have learned
16:48about lazy learners and eager learners. It is a lazy learner because it does not learn from
16:54the training set immediately. It stores the data set, waits for a new data point, and only then works on the
17:00training data. So, this is the basic introduction to k-nearest neighbors.
17:04Now, we have a figure here. This is a new data point, and we already have two
17:11categories of data which have already been classified. When a new data point comes in, we apply
17:16the k-nearest neighbors algorithm to it and try to find out which
17:25category it belongs to: does the new data point belong to category A or
17:33category B? The figure shows the data before KNN and after KNN. After applying
17:39the KNN algorithm, the new data point is assigned to category A. The steps are as follows. First, we select
17:48the number of neighbors, k. We have the data set, and we are going to select a set of neighbors
17:55from it. To do that, we calculate the Euclidean distance:
18:01from the new data point to the points in the data set, we calculate the Euclidean distance,
18:07and we take the k nearest neighbors as per the calculated Euclidean distance.
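The Euclidean distance used in these steps is the straight-line distance between two points: the square root of the sum of the squared coordinate differences. A minimal sketch, where the function name `euclidean_distance` and the sample points are assumptions chosen for illustration:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Illustrative points: a 3-4-5 right triangle, so the distance is exactly 5.
new_point = (0.0, 0.0)
neighbor = (3.0, 4.0)
print(euclidean_distance(new_point, neighbor))  # 5.0
```

The same function works for any number of features, as long as both points have the same length.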
18:12So, we calculate the Euclidean distances and then find which points
18:19are nearest to the new data point. Among these k neighbors, we count the number of data points
18:25in each category, and we assign the new data point to the category where the number
18:33of neighbors is maximum. So, this is how the model works.
20:05Now, an example. We have this new data point, and we have category A and category B,
20:14the two data sets. We choose the number of neighbors; for example, here we choose 5 neighbors
20:20of the new data point. Leaving out the new data point itself, we choose a
20:26total of 5 points from across category A and category B by calculating the Euclidean distance
20:32between the new point and the data points. This is how the Euclidean distance
20:39is used: we get the nearest neighbors as 3 neighbors from category A and 2 from category B.
20:47So, when we calculate the Euclidean distance here, we get 3 data points
20:52from category A and 2 data points from category B, using the distance formula, which
20:58we already know. This is how the figure looks: from A we have got 3 points
21:04and from B we have got 2, and the new data point is between these two groups.
21:10Once the algorithm has been applied, what do we get? We understand that the new data point
21:18belongs to category A, since 3 of its 5 neighbors are from A. Now, there is no particular way to determine the best value
21:23for k, so we keep trying values until we find the best of them. Usually we
21:30take the value to be 5. A very low value, such as k = 1 or k = 2, is best avoided:
21:37if you take only 1 or 2 points, noise and outliers have a large effect and we will not get accurate
21:42results. So, at least 5 neighbors are usually considered here, and from them we determine
21:49to which category the new data point is assigned.
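The worked example can be sketched in Python. The coordinates below are hypothetical, chosen so that the 5 nearest neighbors of the new point are 3 from category A and 2 from category B as in the figure; the function name `knn_classify` and the list of k values tried are also assumptions.

```python
import math
from collections import Counter

def knn_classify(training_data, new_point, k=5):
    """Return the majority category among the k nearest neighbors, plus the vote counts."""
    def distance(item):
        point, _label = item
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(point, new_point)))

    nearest = sorted(training_data, key=distance)[:k]
    votes = Counter(label for _point, label in nearest)
    return votes.most_common(1)[0][0], votes

# Hypothetical, already classified data points.
training_data = [
    ((1.0, 1.0), "A"), ((2.0, 2.0), "A"), ((2.0, 3.0), "A"), ((3.0, 2.0), "A"),
    ((4.0, 3.0), "B"), ((3.0, 4.0), "B"), ((5.0, 5.0), "B"), ((6.0, 5.0), "B"),
]
new_point = (2.8, 2.7)

category, votes = knn_classify(training_data, new_point, k=5)
print(category, dict(votes))  # A {'A': 3, 'B': 2}

# Since there is no fixed rule for k, one simple approach is to try several values.
for k in (1, 3, 5, 7):
    print(k, knn_classify(training_data, new_point, k)[0])
```

Odd values of k avoid tied votes when there are only two categories. Note that nothing is learned in advance: all the distance calculations happen only when the new point arrives, which is the lazy-learner behaviour described earlier.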
21:56The implementation in Python also follows the same steps. The data
22:01pre-processing step is used, we fit the KNN algorithm to the
22:06training set, we predict the test result, we test the accuracy of the result,
22:11and we visualize the test result. The advantages are that it is simple, it is robust
22:17to noisy training data, and it can be more effective if the training data
22:22set is large. Coming to the disadvantages, we always
22:27need to determine the value of k, which may be complex at times, and the computation
22:33cost is high because we have to calculate the distance between the new point and all the training data points. So,
22:39it is also more time consuming. Now, the difference between linear regression and logistic regression.
22:44Linear regression is used to predict continuous values, and logistic regression is used to predict categorical
22:50values. Linear regression is used for solving regression problems, and logistic regression for solving classification
22:56problems. So in linear regression the output is continuous, and in logistic
23:00regression the output is categorical. In linear regression we find the best-fit line;
23:07in logistic regression we use the S-curve. Finally, the least squares estimation method is
23:14used to fit linear regression, and in logistic regression we use maximum likelihood
23:20estimation. With this we have completed this session. Thank you so much.