Hands-On Predictive Analytics with Python

Make explicit the data that will be required

Once the output of the model has been defined, you should make explicit which data will be required to solve the problem and produce the predictions you intend: which data sources you need access to, in what format the data is needed, how much data is needed, and so on. You may well have a list of requirements, but in predictive analytics (as in life) most of the time you don't get what you want, or even what you need; you get what the circumstances allow. In most cases you will have no say over which data is available: the data is already there, and you will have to work with it, period. For example, you may want 12 months of customer data to build a credit card default model; however, the people in charge of the data may tell you, "We only have 6 months of historical data."

At this stage, you must also discuss with the key stakeholders which data they consider relevant from a business perspective. These discussions must be as clear as possible: you don't want to work hard developing a credit card default model based on historical payment data, only to find out that the stakeholders wanted the customers' demographic characteristics included.
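
The two checks discussed in this section, how much history is actually available and whether the fields the stakeholders care about are present, can be written down as a small requirements check. The following is only a minimal sketch using the standard library: the record layout, the field names (`customer_id`, `statement_date`, `payment_amount`, `age`), and the helper functions are hypothetical and chosen purely for illustration.

```python
from datetime import date


def months_of_history(first: date, last: date) -> int:
    """Count the calendar months spanned, including both end months."""
    return (last.year - first.year) * 12 + (last.month - first.month) + 1


def check_requirements(records, required_months, required_fields):
    """Compare the available data against an explicit list of requirements.

    `records` is assumed to be a non-empty list of dicts, each with a
    `statement_date` key holding a `datetime.date` (hypothetical layout).
    """
    if not records:
        raise ValueError("no records to check")

    dates = [r["statement_date"] for r in records]
    available_months = months_of_history(min(dates), max(dates))

    # Collect every field that appears in at least one record.
    present_fields = set().union(*(r.keys() for r in records))
    missing_fields = sorted(set(required_fields) - present_fields)

    return {
        "available_months": available_months,
        "enough_history": available_months >= required_months,
        "missing_fields": missing_fields,
    }


# Made-up sample: six months of payment data and no demographic fields.
records = [
    {"customer_id": 1, "statement_date": date(2023, 1, 31), "payment_amount": 120.0},
    {"customer_id": 1, "statement_date": date(2023, 6, 30), "payment_amount": 80.0},
]

report = check_requirements(
    records,
    required_months=12,
    required_fields=["customer_id", "statement_date", "payment_amount", "age"],
)
print(report)
```

With this sample the report shows 6 available months against the 12 requested, and lists `age` as a missing field, which is the kind of gap worth raising with the data owners and stakeholders before any modeling starts.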