The election prediction business is one small aspect of a far-reaching change across industries that have become increasingly obsessed with data, its value and the potential to mine it for cost-saving and profit-making insights. Data science is a behind-the-scenes technology that quietly drives everything from the ads that people see online to billion-dollar acquisition deals.
Examples stretch from Silicon Valley to the industrial heartland. Microsoft, for instance, is paying $26 billion for LinkedIn largely for its database of personal profiles and business connections on more than 400 million people. General Electric, the nation’s largest manufacturer, is betting big that data-generating sensors and software can increase the efficiency and profitability of its jet engines and other machinery.
But data science is a technological advance with trade-offs. It can see things as never before, but it can also be a blunt instrument, missing context and nuance. All kinds of companies and institutions use data quietly and behind the scenes to make predictions about human behavior. But only occasionally, as with Tuesday’s election results, do consumers get a glimpse of how these formulas work and the extent to which they can go wrong.
Google Flu Trends, for instance, looked to be a triumph of big data prescience, tracking flu outbreaks based on trends in flu-related search terms. But in the 2012-13 flu season it greatly overstated the number of cases.
This year, Facebook’s algorithm took down the image, posted by a Norwegian author, of a naked 9-year-old girl fleeing napalm bombs. The software code saw a violation of the social network’s policy prohibiting child pornography, not an iconic photo of the Vietnam War and human suffering.
And a Microsoft chat bot, intended to learn “conversational understanding” by mining online text, was quickly retired this year after its machine-learning algorithm began generating racist comments.
Even well-meaning attempts to harness data analysis for the greater good can backfire. Two years ago, the Samaritans, a suicide-prevention group in Britain, developed a free app to notify people whenever someone they followed on Twitter posted potentially suicidal phrases like “hate myself” or “tired of being alone.” The group quickly removed the app after complaints from people who warned that it could be misused to harass users at their most vulnerable moments.
This week’s failed election predictions suggest that the rush to exploit data may have outstripped the ability to recognize its limits.
“State polls were off in a way that has not been seen in previous presidential election years,” said Sam Wang, a neuroscience professor at Princeton University who is a co-founder of the Princeton Election Consortium. He speculated that polls may have failed to capture Republican loyalists who initially vowed not to vote for Mr. Trump, but changed their minds in the voting booth.
Beyond election night, there are broader lessons that raise questions about the rush to embrace data-driven decision-making across the economy and society.
The enthusiasm for big data has been fueled by the success stories of Silicon Valley giants born on the internet, like Google, Amazon and Facebook. The digital powerhouses harvest vast amounts of user data using clever software for search, social networks and online commerce. Data is the fuel, and algorithms borrowed from the tool kit of artificial intelligence, notably machine learning, are the engine.
The early commercial use for the technology has been to improve the odds of making a sale — through targeted ads, personalized marketing and product recommendations. But big-data decision-making is increasingly being embraced in every industry, and to make higher-stakes decisions that crucially affect people’s lives — like helping to make medical diagnoses, hiring choices and loan approvals.
The danger, data experts say, lies in trusting the data analysis too much without grasping its limitations and the potentially flawed assumptions of the people who build predictive models.
The technology can be, and is, enormously useful. “But the key thing to understand is that data science is a tool that is not necessarily going to give you answers, but probabilities,” said Erik Brynjolfsson, a professor at the Sloan School of Management at the Massachusetts Institute of Technology.
Mr. Brynjolfsson said that people often do not understand that if the chance that something will happen is 70 percent, that means there is a 30 percent chance it will not occur. The election performance, he said, is “not really a shock to data science and statistics. It’s how it works.”
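Mr. Brynjolfsson’s point about probabilities can be checked with a short simulation. The sketch below is an illustration only, not any forecaster’s model; the function name `share_of_trials_with_event`, the fixed random seed and the 100,000 trials are assumptions chosen for the example.

```python
import random


def share_of_trials_with_event(p_event: float, trials: int, seed: int = 0) -> float:
    """Return the fraction of simulated trials in which the event occurred."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(trials) if rng.random() < p_event)
    return hits / trials


# A forecast that says "70 percent" should still miss in roughly 3 of 10 trials.
share = share_of_trials_with_event(0.70, 100_000)
print(f"Event occurred in {share:.1%} of trials and failed to occur in {1 - share:.1%}")
```

Seen this way, an outcome that a forecast rated at 30 percent is unusual but far from impossible.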
So, what happened with the election data and algorithms? The answer, it seems, is a combination of the shortcomings of polling, analysis and interpretation, perhaps both in how the numbers were presented and how they were understood by the public.
Mr. Silver, the founder of FiveThirtyEight, did not immediately respond to an email seeking comment. Amanda Cox, the editor of The Upshot, and Mr. Wang of the Princeton Election Consortium said state polling errors were largely to blame for the underestimates of Mr. Trump’s chances of winning.
In addition to the polling errors, data scientists said the inherent weakness of election models might have caused some forecasting errors. Before an election, forecasters use a combination of historical polls and recent polling data to predict a candidate’s chance of winning. Some may also factor in other variables, such as giving higher weight to a candidate who is an incumbent.
But even with decades of polls to analyze, it is difficult for forecasters to predict accurately a candidate’s chance of winning the presidency months or even weeks ahead of time. Dr. Mutalik of Yale compared election modeling to weather forecasting.
“Even with the best models, it is difficult to predict the weather more than 10 days out because there are so many small changes that can cause big changes,” Dr. Mutalik said. “In mathematics, this is known as chaos.”
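The sensitivity Dr. Mutalik describes can be shown with a standard textbook example of chaos, the logistic map. This is a minimal sketch and not a weather or election model; the choice of the logistic map, the starting value 0.2 and the nudge of one part in a billion are all assumptions made for illustration.

```python
def logistic_orbit(x0: float, r: float = 4.0, steps: int = 60) -> list[float]:
    """Iterate x -> r * x * (1 - x) and return every value visited."""
    values = [x0]
    for _ in range(steps):
        values.append(r * values[-1] * (1 - values[-1]))
    return values


# Two starting points that differ by one part in a billion.
orbit_a = logistic_orbit(0.2)
orbit_b = logistic_orbit(0.2 + 1e-9)

# Distance between the two orbits at each step.
gaps = [abs(a - b) for a, b in zip(orbit_a, orbit_b)]

for step in (0, 10, 20, 30, 40):
    print(f"step {step:2d}: gap = {gaps[step]:.3e}")
```

The gap starts out negligible and grows until the two runs no longer resemble each other, which is why small measurement errors limit how far ahead such systems can be predicted.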
But, unlike weather prediction, current election models tend to take into account only several decades’ worth of data. And changing the parameters of that data set can also significantly affect calculations.
The FiveThirtyEight model, for instance, is calibrated based on general elections since 1972, a year when state polling began to increase. On Oct. 24, that model put Mrs. Clinton’s chances of winning at 85 percent. But when the site experimentally recalibrated the model based on more recent polls, dating back just to 2000, Mrs. Clinton’s chances rose to 95 percent, Mr. Silver wrote on his blog.
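One way to see why the calibration window matters is to treat past polling errors as a measure of uncertainty. The sketch below is not FiveThirtyEight’s method. It assumes polling errors follow a normal distribution centered on zero, and every number in `HYPOTHETICAL_ERRORS`, along with the 3-point lead, is invented for illustration.

```python
import math
from statistics import NormalDist

# Invented polling errors in percentage points (poll margin minus actual margin).
# These are NOT real historical figures; they only illustrate the window effect.
HYPOTHETICAL_ERRORS = {
    1972: 4.0, 1976: -3.5, 1980: 5.0, 1984: 3.0, 1988: -4.0, 1992: 3.5,
    1996: 4.5, 2000: -2.0, 2004: 1.0, 2008: -1.5, 2012: 2.0,
}


def errors_since(first_year: int) -> list[float]:
    """Select the errors from elections in or after first_year."""
    return [err for year, err in HYPOTHETICAL_ERRORS.items() if year >= first_year]


def win_probability(poll_lead: float, past_errors: list[float]) -> float:
    """Chance the leader still wins if errors are normal with the past RMS size."""
    # Root-mean-square error, assuming past errors are centered on zero.
    sigma = math.sqrt(sum(e * e for e in past_errors) / len(past_errors))
    # The leader wins when the polling error is smaller than the lead.
    return NormalDist(mu=0.0, sigma=sigma).cdf(poll_lead)


lead = 3.0  # hypothetical poll lead in percentage points
for start in (1972, 2000):
    p = win_probability(lead, errors_since(start))
    print(f"calibrated on elections since {start}: win probability {p:.0%}")
```

In this toy setup the shorter window contains smaller errors, so the same poll lead produces a more confident forecast. The direction of the shift depends entirely on which errors fall inside the window.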
In this presidential election, analysts said, the other big problem was that some state polls were wrong. Recent polls from Wisconsin, for instance, put Mrs. Clinton well ahead of Mr. Trump. And election forecasts relied on that information for their predictions. Britain encountered similar lapses when polls mistakenly predicted that the nation would vote in June to stay in the European Union.
“If we could go back to the world of reporting being about the candidates and the parties and the issues at stake instead of the incessant coverage of every little blip in the polls, we would all be better off,” said Thomas E. Mann, an election expert at the Brookings Institution. “They are addictive, and it takes the eye off the ball.”