Some companies have begun relying more on computer-administered tests than human interviewers to find the best applicants. New research by Harvard Business School Assistant Professor Danielle Li and colleagues suggests that in this case, we may have to score one for the machine.
Job testing was popular in the 1950s and ’60s as a way of sifting through bulging applicant pools. After researchers questioned its reliability, testing fell out of use in favor of personal interviews. Now, with the emergence of big data, machine testing has come back in sophisticated new forms.
Testing companies use an array of custom-designed assessments, including personality tests, skills assessments, math and logic problems, and judgment tests based on hypothetical work situations. Results are scored by proprietary algorithms and machine-learning models to predict which candidates will do best in a particular position.
The question is, how much should companies weight this information versus the more subjective impression gleaned from job interviews?
“Essentially firms [are] trying to figure out how to best allocate resources,” says Li, who has studied how companies make organizational decisions in industries such as health care and education. “They are figuring out how to use the information of managers and combine it with this new technology.”
Testing the testers
To come up with an answer, Li obtained data from a testing firm, analyzing it along with Mitchell Hoffman of the University of Toronto’s Rotman School of Management and Lisa B. Kahn of Yale School of Management; they present their findings in a new working paper, “Discretion in Hiring.”
The data included test results for a specific low-skilled job across 15 different companies in a variety of industries. (The researchers agreed to keep the actual job description confidential, but say it is similar in nature to data-entry, standardized test grading, or call center work.)
Crucially, the testing firm also tracked how long the applicants who were eventually hired stayed in their positions. Li and her colleagues used that tenure as a proxy for job performance, reasoning that workers who did better in a position were apt to stay longer.
When they crunched the numbers, they found that once computer testing was introduced at a company, workers ended up staying on the job for an average of 15 percent longer.
The result indicates that mechanical testing can be helpful in hiring decisions, but on its own it doesn’t show whether testing is better than human judgment.
“The main question is, what should you be doing with this information?” says Li. “Should you have hard-and-fast rules about using testing to weed through applicants, or should you allow managers discretion to ignore the test results if they choose to?”
The question is a thorny one for firms, given the variability and unverifiability of interview assessments. A candidate’s test score of 93 is a straightforward measure of fitness. But a manager might be enthusiastic about a candidate for any number of reasons. “If you see managers making a lot of exceptions from the test, it could be because they are well-informed about what will make someone successful, or it could be that they are hiring someone from their hometown”—or any number of other biases that have nothing to do with job performance.
To test this, Li and her fellow researchers used categories devised by the testing firm to divide applicants into green, yellow, and red, according to how well they scored on the hiring test (green being highest, yellow average, and red lowest). They then looked at how many exceptions managers made from the test results—hiring a yellow applicant over a green one, or a red applicant over a yellow one. Finally, they compared the average tenure of applicants hired as exceptions with those hired by the rules.
On average, managers made exceptions from the test 20 percent of the time. And there was a stark negative correlation between how often managers made exceptions and worker tenure: workers hired by managers who made the fewest exceptions (those in the bottom quartile of exception rates) stayed an average of 120 days, while those hired by managers who made the most exceptions (the top quartile) stayed only 100 days.
In an even more explicit apples-to-apples comparison, the researchers considered two applicants up for a position at the same time, one yellow and one green. In cases where the yellow applicant was hired as an exception and the passed-over green applicant was later hired as well, they put the two head-to-head to see which one stayed longer. They found that the passed-over green workers were superior, staying an average of 8 percent longer, implying that the manager would have been better off hiring the green worker in the first place rather than making the exception.
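To make the logic of these two comparisons concrete, here is a minimal sketch in Python (using pandas) run on made-up data. The column names, score bands, the median split, and every number below are illustrative assumptions, not the researchers’ dataset, thresholds, or code.

```python
# A minimal sketch of the two comparisons described above, run on made-up data.
# Every column name, score band, and number here is an illustrative assumption,
# not the researchers' actual dataset or code.
import pandas as pd

# Hypothetical hires: test band (green > yellow > red), the hiring manager,
# whether the hire was an exception to the test ranking, and days of tenure
# (the study's proxy for job performance).
hires = pd.DataFrame({
    "manager":     ["A", "A", "A", "B", "B", "B", "C", "C"],
    "band":        ["green", "green", "yellow", "yellow", "red", "green", "green", "yellow"],
    "exception":   [False, False, True, True, True, False, False, True],
    "tenure_days": [130, 118, 95, 90, 70, 125, 140, 88],
})

# 1) Exception rates, and tenure for low- vs. high-exception managers.
exception_rate = hires.groupby("manager")["exception"].mean()
print("Share of hires made as exceptions:", hires["exception"].mean())

# The paper compares bottom- and top-quartile managers; with this toy sample,
# a median split on each manager's exception rate stands in for that idea.
hires["high_exception_mgr"] = hires["manager"].map(exception_rate) > exception_rate.median()
print(hires.groupby("high_exception_mgr")["tenure_days"].mean())

# 2) Head-to-head comparison: a yellow applicant hired as an exception vs. the
# green applicant passed over at the same time (and hired later), pair by pair.
pairs = pd.DataFrame({
    "yellow_exception_tenure":  [95, 88, 90],   # hypothetical matched pairs
    "green_passed_over_tenure": [104, 96, 97],
})
gap = pairs["green_passed_over_tenure"].mean() / pairs["yellow_exception_tenure"].mean() - 1
print(f"Passed-over green workers stayed about {gap:.0%} longer in this toy example.")
```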
Why are humans so fallible?
It’s hard to say from the data exactly what mistakes managers are making in their hiring decisions, but they are probably not intentionally hiring applicants they know won’t be as good at the job.
“My sense is that managers are probably doing their best to hire the people they believe will be the best candidates,” says Li. “But they are not as good at predicting that compared to an algorithm that has access to much more data on worker outcomes and has been trained to recognize these patterns.”
Studies in other settings have shown that many things we typically think of as correlated with performance are in fact not; for example, schoolteachers with master’s degrees in education generally perform no better in their roles than those without them. “So it’s likely recruiting managers are simply placing too much weight on things that look good on paper, but don’t turn out to matter much in practice,” Li says.
Does this settle the debate between man and machine forever? Not necessarily.
“What it really shows is that whatever firms are doing right now, they could do better by eliminating discretion,” she says.
But that doesn’t mean there aren’t better ways companies could combine human judgment with testing. “For example, companies could tell managers they must hire greens before yellows, but within the greens they could hire anyone they wanted,” says Li.
She also stresses that these results come from hiring for a position with fairly routine tasks, and that human discretion might be more valuable for positions with more complex duties.
When it comes to the kind of hiring decisions the researchers looked at in the study, however, it may be time to concede there are some things machines simply do better than humans.
“It’s natural for people to think they are learning valuable information in interviews, and their judgment is valuable,” says Li. “But is it more valuable than quantitative information? Not always, and in this case, probably not.”