Linear Regression - House Price Prediction
Linear regression is one of the world's oldest machine learning concepts. Invented in the early nineteenth century, it is still one of the more venerable methods of understanding the relationship between input and output.
The ideas behind linear regression are familiar to us all. We feel that some things are correlated with one another. Sometimes they are causal in nature. There exists a very fine line between correlation and causation. For example, summer sees more sales of ice cream and cold beverages, while winter sees more sales of hot cocoa and coffee. We could say that the seasons themselves cause the amount of sales, that they're causal in nature. But are they really?
Without further analysis, the best thing we can say is that they are correlated with one another. The phenomenon of summer is connected to the phenomenon of greater-than-the-rest-of-the-year sales of cold drinks and ice cream. The phenomenon of winter is connected, somehow, to the phenomenon of greater-than-the-rest-of-the-year sales of hot beverages.
Understanding the relationship between things is what linear regression, at its core, is all about. There can be many lenses through which linear regression may be viewed, but we will be viewing it through a machine learning lens. That is to say, we wish to build a machine learning model that will accurately predict the results, given some input.
The desire to use correlation for predictive purposes was indeed the very reason why linear regression was invented in the first place. Francis Galton, who was coincidentally Charles Darwin's cousin, hailed from an upper-class family whose lineage included doctors. He had given up his medical studies after a nervous breakdown and began travelling the world as a geographer. This was back when being a geographer was the coolest job (much like being a data scientist today). However, it was said that Galton hadn't the mettle of Darwin, and soon he gave up the idea of travelling around the world, soured by experiences in Africa. Having inherited his wealth after his father died, Galton dabbled in all things that tickled his fancy, including biology.
The publication of his cousin's magnum opus, On the Origin of Species, made Galton double down on his pursuits in biology and ultimately, eugenics. Galton experimented, rather coincidentally in the same manner as Mendel, on peas. He had wanted to predict the characteristics of the offspring plants when only information about the parent plants' characteristics was available. He realized that the characteristics of the offspring often lay somewhere in between those of the parent plants. When Galton realized that he could derive a mathematical equation that represented inheritance using elliptical curve fitting, he invented regression.
The reasoning behind regression was simple: there was a driving force, a signal of sorts, that led the characteristics of the offspring plants to go towards the curve he had fitted. If that was the case, it meant that the driving force obeyed some mathematical law. And if it did obey a mathematical law, then it could be used for prediction, Galton reasoned. To further refine his ideas, he sought the help of the mathematician Karl Pearson.
It took Galton and Pearson a few more attempts to refine the concept and quantify the trends. But ultimately they adopted a least-squares methodology for fitting the curves.
Even to this day, when linear regression is mentioned, it can be safely assumed that a least-squares model will be used, which is precisely what we will be doing.
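To make the least-squares idea concrete, here is a minimal sketch in plain Go of fitting a straight line to paired observations. It is not the implementation used later in this chapter. The function name fitLine and the sample areas and prices are hypothetical, made up purely for illustration. The sketch uses the textbook closed-form solution for a single input variable: the slope is the sum of the products of the deviations of x and y from their means, divided by the sum of the squared deviations of x, and the intercept follows from the two means.

```go
package main

import (
	"errors"
	"fmt"
)

// fitLine is a hypothetical helper (not part of any library). It returns the
// slope and intercept of the line y = slope*x + intercept that minimizes the
// sum of squared vertical distances to the given points.
func fitLine(xs, ys []float64) (slope, intercept float64, err error) {
	if len(xs) != len(ys) || len(xs) < 2 {
		return 0, 0, errors.New("need at least two paired observations")
	}

	// Compute the mean of each variable.
	n := float64(len(xs))
	var sumX, sumY float64
	for i := range xs {
		sumX += xs[i]
		sumY += ys[i]
	}
	meanX, meanY := sumX/n, sumY/n

	// Accumulate the co-deviation of x and y, and the squared deviation of x.
	var sxy, sxx float64
	for i := range xs {
		dx := xs[i] - meanX
		sxy += dx * (ys[i] - meanY)
		sxx += dx * dx
	}
	if sxx == 0 {
		return 0, 0, errors.New("all x values are identical; slope is undefined")
	}

	slope = sxy / sxx
	intercept = meanY - slope*meanX
	return slope, intercept, nil
}

func main() {
	// Made-up data: floor area and sale price, for illustration only.
	areas := []float64{50, 80, 100, 120, 150}
	prices := []float64{150, 240, 310, 350, 460}

	slope, intercept, err := fitLine(areas, prices)
	if err != nil {
		fmt.Println("fit failed:", err)
		return
	}
	fmt.Printf("price = %.3f * area + %.3f\n", slope, intercept)

	// Once fitted, the line can be used to predict an unseen input.
	fmt.Printf("predicted price for area 110: %.3f\n", slope*110+intercept)
}
```

The closed form only covers one input variable. With several inputs, as in a real house price dataset, the same least-squares criterion is usually solved with matrix methods instead.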
We will be performing exploratory data analysis—this will allow us to understand the data better. Along the way, we will build and use the data structures necessary for a machine learning project. We will rely heavily on Gonum's plotting libraries for that. After that, we will run a linear regression, interpret the results, and identify the strengths and weaknesses of this technique of machine learning.