Machine Learning with Scala Quick Start Guide

上QQ阅读APP看书，第一时间看更新

General issues in machine learning models

When we use this input data for the training, validation, and testing, usually the learning algorithms cannot learn 100% accurately, which involves training, validation, and test error (or loss). There are two types of error that one can encounter in a machine learning model:

Irreducible error
Reducible error

The irreducible error cannot be reduced even with the most robust and sophisticated model. However, the reducible error, which has two components, called bias and variance, can be reduced. Therefore, to understand the model (that is, prediction errors), we need to focus on bias and variance only:

Bias means how far the predicted value are from the actual values. Usually, if the average predicted values are very different from the actual values (labels), then the bias is higher.
An ML model will have a high bias because it can't model the relationship between input and output variables (can't capture the complexity of data well) and becomes very simple. Thus, a too-simple model with high variance causes underfitting of the data.

The following diagram gives some high-level insights and also shows what a just-right fit model should look like:

Variance signifies the variability between the predicted values and the actual values (how scattered they are).

Identifying high bias and high variance: If the model has a high training error as well as the validation error or test error is the same as the training error, the model has high bias. On the other hand, if the model has low training error but has high validation or high test error, the model has a high variance.

An ML model usually performs very well on the training set but doesn't work well on the test set (because of high error rates). Ultimately, it results in an underfit model. We can recap the overfitting and underfitting once more:

Underfitting: If your training and validation error are both relatively equal and very high, then your model is most likely underfitting your training data.
Overfitting: If your training error is low and your validation error is high, then your model is most likely overfitting your training data. The just-rightfit model learns very well and performs better on unseen data too.

Bias-variance trade-off: The high bias and high variance issue is often called bias-variance trade-off, because a model cannot be too complex or too simple at the same time. Ideally, we would strive for the best model that has both low bias and low variance.

Now we know the basic working principle of an ML algorithm. However, based on problem type and the method used to solve a problem, ML tasks can be different, for example, supervised learning, unsupervised learning, and reinforcement learning. We'll discuss these learning tasks in more detail in the next section.