Machine Learning for Algorithmic Trading
Sources of alternative data

Alternative datasets are generated by many sources but can be classified at a high level as predominantly produced by:

  • Individuals who post on social media, review products, or use search engines
  • Businesses that record commercial transactions (in particular, credit card payments) or capture supply-chain activity as intermediaries
  • Sensors that, among many other things, capture economic activity through images from satellites or security cameras, or through movement patterns recorded by infrastructure such as cell phone towers

The nature of alternative data continues to evolve rapidly as new data sources become available and sources previously labeled "alternative" become part of the mainstream. The Baltic Dry Index (BDI), for instance, assembles data from several hundred shipping companies to approximate the supply/demand of dry bulk carriers and is now available on the Bloomberg Terminal.

Alternative data includes raw data as well as data that is aggregated or has been processed in some form to add value. For instance, some providers aim to extract tradeable signals, such as sentiment scores. We will address the various types of providers in Chapter 4, Financial Feature Engineering – How to Research Alpha Factors.

Alternative data sources differ in crucial respects that determine their value or signal content for algorithmic trading strategies. We will address these aspects in the next section after looking at the main sources in this one.

Individuals

Individuals automatically create electronic data through their online activities, as well as through offline activities that are captured electronically and often linked to online identities. Data generated by individuals is frequently unstructured, in text, image, or video formats, is disseminated through multiple platforms, and includes:

  • Social media posts, such as opinions or reactions on general-purpose sites such as Twitter, Facebook, or LinkedIn, or business-review sites such as Glassdoor or Yelp
  • E-commerce activity that reflects an interest in or the perception of products on sites like Amazon or Wayfair
  • Search engine activity using platforms such as Google or Bing
  • Mobile app usage, downloads, and reviews
  • Personal data such as messaging traffic

The analysis of social media sentiment has become very popular because it can be applied to individual stocks, industry baskets, or market indices. The most common source is Twitter, followed by various news vendors and blog sites. Supply is competitive, and prices are lower because the data is often obtained through increasingly commoditized web scraping. Reliable social media datasets that include blogs, tweets, or videos typically have less than 5 years of history, given how recently consumers have adopted these tools at scale. Search history, in contrast, is available from 2004.
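To illustrate how post-level sentiment can be rolled up to individual stocks, here is a minimal sketch that averages scores per day and ticker. The sample posts, the score range of -1 to 1, and the function name `daily_sentiment` are assumptions made for illustration; they do not reflect any particular vendor's format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scored posts: (date, ticker, sentiment score in [-1, 1]).
posts = [
    ("2020-01-02", "AAPL", 0.6),
    ("2020-01-02", "AAPL", -0.2),
    ("2020-01-02", "MSFT", 0.4),
    ("2020-01-03", "AAPL", 0.1),
]


def daily_sentiment(scored_posts):
    """Average post-level scores for each (date, ticker) pair."""
    buckets = defaultdict(list)
    for day, ticker, score in scored_posts:
        buckets[(day, ticker)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


signal = daily_sentiment(posts)
for (day, ticker), score in sorted(signal.items()):
    print(day, ticker, round(score, 2))
```

The same grouping could be extended to industry baskets or indices by mapping each ticker to its group before averaging.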

Business processes

Businesses and public entities produce and collect many valuable sources of alternative data. Data that results from business processes often has more structure than that generated by individuals. It can be very effective as a leading indicator for activity that is otherwise available only at a much lower frequency.

Data generated by business processes includes:

  • Payment card transaction data possibly available for purchase from processors and financial institutions
  • Company exhaust data produced by ordinary digitized activity or record-keeping, such as banking records, cashier scanner data, or supply chain orders
  • Trade flow and market microstructure data (such as L2 and L3 order book data, illustrated by the NASDAQ ITCH tick data example in Chapter 2, Market and Fundamental Data – Sources and Techniques)
  • Company payments monitored by credit rating agencies or financial institutions to assess liquidity and creditworthiness

Credit card transactions and company exhaust data, such as point-of-sale data, are among the most reliable and predictive datasets. Credit card data is available with around 10 years of history and, at different lags, almost up to real time, while corporate earnings are reported quarterly with a 2.5-week lag. The time horizon and reporting lag for company exhaust data vary widely, depending on the source. Market microstructure datasets have over 15 years of history compared to sell-side flow data, which typically has fewer than 5 years of consistent history.
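The value of a shorter reporting lag can be made concrete with a little date arithmetic. The sketch below compares when quarterly earnings and card data for the same period would become available. It assumes the 2.5-week earnings lag is rounded up to 18 days and uses a hypothetical 3-day lag for card data; both constants and the helper `available_on` are illustrative only.

```python
from datetime import date, timedelta

# Assumptions for illustration only:
EARNINGS_LAG_DAYS = 18  # 2.5 weeks, rounded up to whole days
CARD_LAG_DAYS = 3       # hypothetical lag for a card-transaction feed


def available_on(period_end, lag_days):
    """Return the first date on which data for a period can be used."""
    return period_end + timedelta(days=lag_days)


quarter_end = date(2020, 3, 31)
earnings_available = available_on(quarter_end, EARNINGS_LAG_DAYS)
card_available = available_on(quarter_end, CARD_LAG_DAYS)

# Number of days the card data leads the earnings report.
head_start_days = (earnings_available - card_available).days
print(earnings_available, card_available, head_start_days)
```

Tracking an availability date per observation like this also helps avoid look-ahead bias when the data is later used in a backtest.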

Sensors

Networked sensors embedded in a broad range of devices are among the most rapidly growing data sources, driven by the proliferation of smartphones and the reduction in the cost of satellite technologies.

This category of alternative data is typically very unstructured and often significantly larger in volume than data generated by individuals or business processes, and it poses much tougher processing challenges. Key alternative data sources in this category include:

  • Satellite imaging to monitor economic activity, such as construction, shipping, or commodity supply
  • Geolocation data to track traffic in retail stores, such as using volunteered smartphone data, or on transport routes, such as on ships or trucks
  • Cameras positioned at a location of interest
  • Weather and pollution sensors

The Internet of Things (IoT) will further accelerate the large-scale collection of this type of alternative data by embedding networked microprocessors into personal and commercial electronic devices, such as home appliances, as well as into public spaces and industrial production processes.

Sensor-based alternative data that contains satellite images, mobile app usage, or cellular-location tracking is typically available with a 3- to 4-year history.

Satellites

The resources and timelines required to launch a geospatial imaging satellite have dropped dramatically; instead of tens of millions of dollars and years of preparation, the cost has fallen to around $100,000 to place a small satellite as a secondary payload into a low Earth orbit. Hence, companies can obtain much higher-frequency coverage (currently about daily) of specific locations using entire fleets of satellites.

Use cases include monitoring economic activity that can be captured using aerial coverage, such as agricultural and mineral production and shipments, or the construction of commercial or residential buildings or ships; industrial incidents, such as fires; or car and foot traffic at locations of interest. Related sensor data is contributed by drones that are used in agriculture to monitor crops using infrared light.

Several challenges often need to be addressed before satellite image data can be reliably used in ML models. In addition to substantial preprocessing, these include accounting for weather conditions such as cloud cover and seasonal effects around holidays. Satellites may also offer only irregular coverage of specific locations that could affect the quality of the predictive signals.
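As a rough illustration of two of these adjustments, the sketch below discards observations with heavy cloud cover and then compares the same calendar week year over year to dampen seasonal effects. The observation tuples, the 0.3 cloud threshold, and the car counts are hypothetical, and the year-over-year comparison is just one simple way to handle seasonality.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (year, ISO week, cloud fraction, car count).
observations = [
    (2019, 50, 0.10, 400),
    (2019, 51, 0.05, 520),
    (2020, 50, 0.80, 150),  # mostly cloudy, so the count is unreliable
    (2020, 50, 0.15, 440),
    (2020, 51, 0.20, 546),
]

MAX_CLOUD = 0.3  # assumed threshold; tune per data source


def weekly_counts(obs, max_cloud=MAX_CLOUD):
    """Average car counts per (year, week), skipping cloudy images."""
    clear = defaultdict(list)
    for year, week, cloud, count in obs:
        if cloud <= max_cloud:
            clear[(year, week)].append(count)
    return {key: mean(counts) for key, counts in clear.items()}


def yoy_change(weekly, year, week):
    """Relative change versus the same week one year earlier, if both exist."""
    current = weekly.get((year, week))
    prior = weekly.get((year - 1, week))
    if current is None or prior is None:
        return None  # irregular coverage: no comparable observation
    return current / prior - 1


weekly = weekly_counts(observations)
change_w50 = yoy_change(weekly, 2020, 50)
change_w51 = yoy_change(weekly, 2020, 51)
```

Returning `None` for missing weeks makes the gaps caused by irregular coverage explicit instead of silently filling them.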

Geolocation data

Geolocation data is another rapidly growing category of alternative data generated by sensors. A familiar source is smartphones, through which individuals voluntarily share their geographic location via an application, or whose wireless signals, such as GPS, CDMA, or Wi-Fi, can be used to measure foot traffic around places of interest, such as stores, restaurants, or event venues.

Furthermore, an increasing number of airports, shopping malls, and retail stores have installed sensors that track the number and movements of customers. While the original motivation to deploy these sensors was often to measure the impact of marketing activity, the resulting data can also be used to estimate foot traffic or sales. Sensors to capture geolocation data include 3D stereo video and thermal imaging, which lowers privacy concerns while still working well with moving objects. There are also sensors attached to ceilings, as well as pressure-sensitive mats. Some providers combine multiple sensors, including vision, audio, and cellphone location, for a comprehensive account of the shopper journey that covers not only the count and duration of visits but also conversion and repeat visits.
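The shopper-journey measures mentioned above (visit counts, visit duration, and repeat visits) can be sketched from simple visit records. The record layout, an anonymized device identifier paired with minutes spent in the store, and the function `visit_metrics` are assumptions for illustration rather than a provider's actual schema.

```python
from collections import Counter

# Hypothetical visit records: (anonymized device id, minutes in store).
visits = [("a", 12), ("b", 5), ("a", 20), ("c", 8), ("b", 15)]


def visit_metrics(records):
    """Summarize visit count, dwell time, and repeat-visitor share."""
    visits_per_device = Counter(device for device, _ in records)
    total_visits = len(records)
    unique_visitors = len(visits_per_device)
    repeat_visitors = sum(1 for n in visits_per_device.values() if n > 1)
    return {
        "visits": total_visits,
        "unique_visitors": unique_visitors,
        "avg_dwell_minutes": sum(m for _, m in records) / total_visits,
        "repeat_visitor_share": repeat_visitors / unique_visitors,
    }


metrics = visit_metrics(visits)
print(metrics)
```

Conversion would additionally require linking visits to transactions, which this sketch does not attempt.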