Statistics for Data Science
上QQ阅读APP看书,第一时间看更新

New data, new source

What happens when new data or a new data source becomes available or is presented?

Here, new data usually means that more current or more up-to-date data has become available. An example of this might be receiving a file each morning of the latest month-to-date sales transactions, usually referred to as an actual update.

In the business world, data can be either real (actual) as in the case of an authenticated sale, or sale transaction entered in an order processing system, or supposed as in the case of an organization forecasting a future (not yet actually occurred) sale or transaction.

You may receive files of data periodically from an online transactions processing system, which provide the daily sales or sales figures from the first of the month to the current date. You'd want your business reports to show the total sales numbers that include the most recent sales transactions.

The idea of a new data source is different. If we use the same sort of analogy as we used previously, an example of this might be a file of sales transactions from a company that a parent company newly acquired. Perhaps another example would be receiving data reporting the results of a recent online survey. This is the information that's collected with a specific purpose in mind and typically is not (but could be) a routine event.

Machine (and otherwise) data is accumulating even as you are reading this, providing new and interesting data sources creating a market for data to be consumed. One interesting example might be Amazon Web Services ( https://aws.amazon.com/datasets/). Here, you can find massive resources of public data, including the 1000 Genomes Project (the attempt to build the most comprehensive database of human genetic information) as well as NASA's database of satellite imagery of the Earth.

In the previous scenarios, a data developer would most likely be (should be) expecting updated files and have implemented the Extract, Transform, and Load (ETL) processes to automatically process the data, handle any exceptions, and ensure that all the appropriate reports reflect the latest, correct information. Data developers would also deal with transitioning a sales file from a newly acquired company but probably would not be a primary resource for dealing with survey results (or the 1000 Genomes Project).

Data scientists are not involved in the daily processing of data (such as sales) but will be directly responsible for a survey results project. That is, the data scientist is almost always hands-on with initiatives such as researching and acquiring new sources of information for projects involving surveying. Data scientists most likely would have input even in the designing of surveys as they are the ones who will be using that data in their analysis.