Opening the machine learning black box

              Lucas Mentch

Lucas Mentch builds research tools. Mentch, Assistant Professor in the Department of Statistics, relies on Center for Research Computing resources to explore what he describes as the intersection of statistics and machine learning, creating models for a wide range of fields – ecology, criminal forensics, and sports analytics among them.

“I enjoy thinking about what other researchers want to do,” he explains. “What questions they want to ask, what they want to measure. I ask if it is possible to quantify uncertainty in the measurement.”

Classical statistics and machine learning both complement and contradict one another. Given enough data, machine learning methods have reached a level where they can often predict much more accurately than classical statistical models. But machine learning algorithms can be impenetrable to the kinds of questions classical statistics asks about the relative impact of individual variables.

“Machine learning may often be more accurate,” Mentch explains. “But it’s rarely obvious why. The model is often a black box where it is not always possible to directly evaluate the impact or contribution made by which sets of variables.”

“But machine learning methods are essential as volumes of data grow. Every field faces the problem of getting a hand on data, so I get to work in a lot of very different areas.” Collaborators include Pitt’s School of Health and Rehabilitation Sciences, on predicting the likelihood of injuries, and Harvard and the University of North Carolina, on forecasting disease progression for IBD patients using data from wearable devices.

       Above: The maps illustrate the differences between the original predicted occurrences
       of tree swallows and those the random forests calculate at nine time points throughout the fall.
       Red indicates larger predictions from the original; grey indicates roughly equal predictions.

Mentch and graduate student Tim Coleman have recently focused on refining tools for a particular machine learning model known as random forests. The tools they develop allow researchers to test hypotheses about the significance of individual variables. Mentch and Coleman sought to develop these new tests in a manner that is both statistically valid and computationally efficient.

Random forest algorithms build many individual decision trees from the data and group them together to form an ensemble. Each individual decision tree generates a prediction, and the answer agreed upon by the largest number of trees becomes the final prediction. The test developed by Mentch and Coleman compares the accuracy of these predictions to those produced by another ensemble that is purposefully built with some information missing. If the two sets of predictions are similarly accurate, this suggests that the missing information is not of much predictive importance. If, however, the ensemble built with the full original data performs significantly better, this gives scientists reason to believe that those variables play an important role.
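The comparison described above can be illustrated with a toy sketch. This is not Mentch and Coleman's procedure, and it includes no formal hypothesis test. It only shows the idea of voting ensembles built with and without a given variable. Everything in it is an assumption made for illustration: the synthetic data (feature 0 determines the label, feature 1 is pure noise), the use of one-split trees ("stumps"), the fixed threshold grid, and all function names.

```python
import random
from collections import Counter

def majority(labels, default=0):
    """Most common label, or a default when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(X, y, features, rng):
    """Fit a one-split tree on a bootstrap sample, using only `features`."""
    idx = [rng.randrange(len(X)) for _ in X]        # bootstrap resample
    best = None
    for f in features:
        for t in [k / 10 for k in range(1, 10)]:    # assumed threshold grid
            left = [y[i] for i in idx if X[i][f] <= t]
            right = [y[i] for i in idx if X[i][f] > t]
            l_lab, r_lab = majority(left), majority(right)
            hits = left.count(l_lab) + right.count(r_lab)
            if best is None or hits > best[0]:
                best = (hits, f, t, l_lab, r_lab)
    return best[1:]                                 # (feature, threshold, left, right)

def forest_predict(forest, row):
    """Each tree votes; the most common vote is the ensemble's prediction."""
    votes = [l if row[f] <= t else r for f, t, l, r in forest]
    return majority(votes)

def accuracy(forest, X, y):
    """Fraction of rows the ensemble labels correctly."""
    return sum(forest_predict(forest, r) == lab for r, lab in zip(X, y)) / len(y)

rng = random.Random(0)

def make_data(n):
    """Synthetic data: the label depends on feature 0 only; feature 1 is noise."""
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [int(row[0] > 0.5) for row in X]
    return X, y

X_train, y_train = make_data(200)
X_test, y_test = make_data(200)

def build(features, n_trees=25):
    """Build an ensemble that is only allowed to see the listed features."""
    return [fit_stump(X_train, y_train, features, rng) for _ in range(n_trees)]

acc_full = accuracy(build([0, 1]), X_test, y_test)   # all information
acc_no_x0 = accuracy(build([1]), X_test, y_test)     # informative feature withheld
acc_no_x1 = accuracy(build([0]), X_test, y_test)     # noise feature withheld

print(f"full: {acc_full:.2f}  without x0: {acc_no_x0:.2f}  without x1: {acc_no_x1:.2f}")
```

In this setup, withholding the noise feature should leave accuracy essentially unchanged, while withholding the informative feature should drop it to roughly chance. Deciding whether an observed gap is statistically significant is the harder problem the article describes, and this sketch does not attempt it.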

Mentch describes the method. “As statisticians, we’re generally most interested in the long-run performance of these tests. We build tools for machine learning models that require thousands of parallel iterations, but we have to repeat the entire procedure thousands of times to evaluate the performance of the tests. That’s a lot of computing, which is why CRC is so important to us.”

In one project, a random forest model was trained on bird population data from the Cornell University Lab of Ornithology’s eBird project. eBird relies on the observations of citizen-scientist birders to build databases based on more than 100 million bird sightings each year, documenting bird distribution, abundance, and habitat use.

Mentch and Coleman worked with data related to the migration of tree swallows on the East Coast of the United States. In some years, bird watchers reported an earlier-than-expected autumn migration of tree swallows south from New England to the Chesapeake Bay. Tree swallows migrate opportunistically based on local conditions and availability of food; researchers speculated that colder weather and high mortality in the northern range may have contributed to the early migration.

“eBird creates an amazing level of detail, so the migration presented many complex forms of local variation that contribute to a larger pattern,” Mentch explains. “We can use machine learning methods to predict when the swallows will be at any given point and try to estimate the probability that you will see one in that area at that time, but why is the prediction higher or lower? The challenge was not only to build a model that produced accurate predictions, but also to develop testing procedures that would allow us to isolate the effects of individual variables.”

In the end, Mentch and Coleman did uncover some evidence that slight temperature variations could have played a role in the migration changes. From a biological perspective, even slight temperature fluctuations could cause movement among the insects on which the birds feed, and if the insects move, the birds follow.

Mentch says he was pleasantly surprised when he came to Pitt and began working with CRC consultants, especially research associate professor Kim Wong, who customized and debugged R software for the project. “It’s not like this at every university. CRC is one of the few places where you can ask a question and the consultants will investigate, instead of the researcher being forced to take time away from the project to work out a computational problem. If Kim didn’t know the answer, he found out. I feel fortunate.”

Brian Connelly
Pitt Center for Research Computing

Friday, September 4, 2020