Building Accurate Predictive Models “without Data”

When developing any type of predictive model, it is normal to think of “data” as a set of numbers, typically measurements based on observations. It is commonly held that the accuracy of a model depends entirely on the quality and quantity of the available data, as reflected in sayings such as “garbage in, garbage out.”

There is no doubt that the more information is used in building a model, the more accurate the model is likely to be. However, the notion that quantitative, numerical data are the only type of information needed to build an accurate model is flawed. In fact, I believe that the typical business obsession with numeric data can do more damage than good.

To support my claim, suppose you were asked to conduct a study to find ways to reduce traffic jams. A data-driven approach may start by collecting large amounts of data about current traffic, e.g.: average, minimum and maximum speed; traffic volume; amount of time spent standing still in a traffic jam; and so on. These data could then be used to identify correlations. For instance, what is the relationship between time of day and average speed? Or between the number of cars on the road and the severity of traffic jams?

The problem is that this type of modeling seeks relationships between observed variables but fails to account for some crucial information that we, as “domain experts,” possess: when there is nobody in front of us, we accelerate up to some comfortable speed (which may be a bit higher for some of us than others…); when there is someone slow in front of us, we slow down and, if there are lanes open, we may switch lanes. This kind of information is critical because it is at the heart of what creates traffic. And yet, typical data-driven approaches would have no way whatsoever of including this domain knowledge along with the numerical data.

In contrast, Agent-Based Simulation (ABS) can include “qualitative” information that is not numerical in nature, but that nonetheless can and should be used in building and calibrating a model. In fact, it is possible to build a simple model of traffic jams that uses only the qualitative information we described. NetLogo, the popular agent-based simulation environment developed and maintained by Northwestern University, includes in its models library a simulation of how traffic jams develop even on a one-lane “loop” road. Based on very simple assumptions, this simulation can exhibit traffic jams that are eerily similar to those observed in the real world (see below).
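To make the idea concrete, here is a minimal Python sketch of a one-lane loop road driven only by the qualitative rules described above: speed up when the road ahead is open, slow down when someone is close in front. This is not the NetLogo model's code. The function name simulate_loop_road, every parameter value, the look_ahead distance, and the min_gap safety rule (a car never drives into the car ahead) are assumptions chosen purely for illustration.

```python
import random


def simulate_loop_road(n_cars=20, road_length=100.0, steps=500,
                       accel=0.005, decel=0.03, speed_limit=1.0,
                       look_ahead=2.0, min_gap=0.5, seed=0):
    """Toy one-lane loop road; every parameter value is an illustrative assumption."""
    rng = random.Random(seed)
    spacing = road_length / n_cars
    # Cars start evenly spaced; car i + 1 is always the car ahead of car i.
    positions = [i * spacing for i in range(n_cars)]
    speeds = [rng.uniform(0.0, speed_limit) for _ in range(n_cars)]
    slow_counts = []  # per step: number of cars below half the speed limit

    for _ in range(steps):
        new_speeds = []
        for i in range(n_cars):
            ahead = (i + 1) % n_cars
            gap = (positions[ahead] - positions[i]) % road_length
            if gap < look_ahead:
                # Someone is close in front: drop just below their speed.
                v = speeds[ahead] - decel
            else:
                # Open road: speed up toward the comfortable speed.
                v = speeds[i] + accel
            v = min(max(v, 0.0), speed_limit)
            # Added safety assumption: never close to within min_gap of the car ahead.
            v = min(v, max(gap - min_gap, 0.0))
            new_speeds.append(v)
        # All cars update at once, using the speeds from the previous step.
        speeds = new_speeds
        positions = [(p + v) % road_length for p, v in zip(positions, speeds)]
        slow_counts.append(sum(v < 0.5 * speed_limit for v in speeds))

    return positions, speeds, slow_counts


if __name__ == "__main__":
    _, final_speeds, slow_counts = simulate_loop_road()
    print("slow cars at the last step:", slow_counts[-1])
    print("mean final speed:", sum(final_speeds) / len(final_speeds))
```

Note that no measured traffic data goes into this sketch, only behavioral rules. Parameters such as accel and decel can be varied between runs to see how driver behavior changes the number of slow cars over time.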

In general I would not advocate developing predictive models strictly with qualitative information. Rather, I suggest that one should look to enhance a model built on qualitative information by collecting relevant quantitative data. For instance, playing with the NetLogo model reveals that sharp deceleration (sudden braking) is much more likely to cause traffic jams than rapid acceleration. Armed with this information, we could conduct a study to measure how much and how hard people decelerate, use the quantitative data to increase the accuracy of the model, and then determine how much improvement could be achieved by training drivers to slow down more gradually.

I stated earlier that the common obsession with numeric data can do more damage than good. Many of today’s analysts and business leaders believe that the only way to increase predictive accuracy is to collect ever larger amounts of numeric data. Sadly, virtually all data-driven analytical approaches offer no way to include qualitative information such as domain expertise. It is ironic that the very business leaders who best understand the qualitative aspects of the systems they manage often discard that knowledge and rely instead on quantitative analytical approaches that have very little bearing on how the real world works!

I would ask those business leaders the following question: if you hired a firm to conduct a large study, and that firm told you they were going to throw away half of their results before conducting their analysis, would you find that acceptable? If not, why would you accept any approach that largely or completely ignores what is probably the most valuable information you possess? It seems fairly clear that any model that makes use of both quantitative data and qualitative domain knowledge will always, always, always outperform any model that uses only one of these types of information. Agent-Based Simulation is the only approach I know that systematically makes use of both types of information.

Interested in learning more? Download our free whitepaper, “Agent-Based Modeling: Methods and Techniques for Simulating Human Systems.”



Icosystem Corporation • 222 Third Street, Suite 0142, Cambridge, MA 02142 • Voice: (617) 520 1000
