Tetlock
Article
Tetlock is a recurring person in the Astral Codex Ten archive, appearing 7 times across 7 issues between April 09, 2021 and February 20, 2025. The archive places him in contexts such as “he quotes Tetlock’s comment on Galen”; “Tetlock write, ‘yet Galen never conducted anything resembling a modern experiment.’”; and “Remember that quote that Tetlock used to dunk on Galen?” He most often appears alongside COVID, Manifold, and Metaculus.
Metadata
- Category: People
- Mention count: 7
- Issue count: 7
- First seen: April 09, 2021
- Last seen: February 20, 2025
Appears In
- Your Book Review: On The Natural Faculties
- 22
- Your Book Review: Public Choice Theory And The Illusion Of Grand Strategy
- 23
- 24
- Highlights From The Comments On The Lab Leak Debate
- Lives Of The Rationalist Saints
Related Pages
- COVID (3 shared issues)
- Manifold (3 shared issues)
- Metaculus (3 shared issues)
- Scott (3 shared issues)
- Twitter (3 shared issues)
- China (2 shared issues)
- CIA (2 shared issues)
- Congress (2 shared issues)
- effective altruism (2 shared issues)
- Elon Musk (2 shared issues)
- Europe (2 shared issues)
- France (2 shared issues)
External Links
Source Context
Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.
And so on until the present day. In Scott’s review of Superforecasting, he quotes Tetlock’s comment on Galen:
Inline links: Scott’s review of
The most extreme example comes from a debate with the disciples of Asclepiades about the function of the ureters, trying to convince this rival school that urine flows from the kidneys to the bladder through these channels. After exhausting his rhetorical options, Galen turns to empirical anatomy. First he shows them, in a dead animal, that the ureters connect the two structures. This isn’t enough. Next he shows them “in a still living animal, the urine plainly running out through the ureters into the bladder.” This doesn’t change their minds either. Next he takes a live animal, ligates the ureters, bandages the animal up, and lets it go. When he opens it up again later, he finds the ureters “quite full and distended”, and when he removes the ligature, everyone can see the urine flow into the bladder. You’d think the story would end there, but not so. Instead, says Galen, “tie a ligature round [the animal’s] penis and then … squeeze the bladder all over.” He points out that nothing goes back through the ureters to the kidneys, demonstrating that the conveyance is a special, one-way action. He goes on like this for a while. Let the animal urinate and tie a ligature around one ureter but not the other. Cut open both the ureters and see the urine “spurt out of it”. Bandage the animal up and open him up later to discover his insides full of urine and the bladder empty. “Now, if anyone will but test this for himself on an animal,” Galen concludes, “I think he will strongly condemn the rashness of Asclepiades.” Today we know that Galen was wrong, and that humorism isn’t a great way to think about medicine. But whatever Galen might have been lacking, it certainly was not the empirical bent. He was no armchair philosopher, and was more than happy to cut up lots of animals to make a point about the function of the ureters. This is funny because, again, this is the opposite of the story we’re told about Galen. 
He’s described as a pre-scientific or even unscientific thinker, believing that experimentation and investigation are a waste of time. Clearly this isn’t the case, and he made full use of all the resources available to him. We know that human dissection was prohibited in the empire, but Galen worked with gladiators, so we know that he had firsthand experience with human anatomy. He certainly was unafraid, even eager, to practice animal dissection and vivisection. Other doctors of the time didn’t seem to do either of these things, or at least didn’t do nearly as much, and so Galen starts looking more and more like a lone light of empiricism in the wilderness. (However extreme and disturbing his methods may be.) VII. In view of this, it’s extremely depressing to see Tetlock write, “yet Galen never conducted anything resembling a modern experiment.” Galen isn’t here to respond, but if he were, I imagine he would say: and yet Tetlock never conducted anything resembling a basic literature review! Galen definitely isn’t as charitable as we might want him to be. He calls some of the ideas he disagrees with “impossible, nay, perfectly nonsensical”, or “stupid—I might say insane”. His intellectual rivals “are like slaves” he says, “caught in the act of stealing … quite bewildered, and while the one says nothing, the other indulges in shameless lying.” But I’m pretty sympathetic to Galen’s position, because his contemporaries really do sound like idiots. Of course, all this is being filtered through Galen’s own account, but if he’s describing them with any accuracy, he is totally fair in saying that they have no idea what they are talking about. Some of the positions he argues against include: Urine passes into the bladder in the form of vapors, rather than being secreted by the kidneys and passed through the ureters to the bladder. 
Galen argues against this first by pointing out that the kidneys and bladder are connected by the ureters (which must have some purpose), and second by the extensive evidence from vivisection that I mentioned above.
Remember that quote that Tetlock used to dunk on Galen? “All who drink of this treatment recover in a short time, except those whom it does not help, who all die. It is obvious, therefore, that it fails only in incurable cases.”
Everyone always tells the story of how Tetlock’s superforecasters beat CIA experts. Is it true? Arb finds that it’s more complicated:
If I’m understanding this right, the average forecaster did worse than the average expert, but Tetlock had the bright idea to use clever aggregation methods for his superforecasters, and the CIA didn’t use clever aggregation methods for their experts. The CIA did try a prediction market, which in theory and under ideal conditions should work at least as well as any other aggregation method, but under real conditions (it was low-volume and poorly-designed) it did not.
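The point about aggregation can be illustrated with a minimal sketch. It assumes the binary Brier score (squared error between a probability and a 0/1 outcome) and uses a plain mean as the simplest possible aggregation rule. It is not the method Tetlock or the CIA actually used, and the forecasts below are invented for illustration.

```python
from statistics import mean

def brier(p, outcome):
    """Binary Brier score: squared error between probability p and a 0/1 outcome."""
    return (p - outcome) ** 2

def average_forecaster_score(forecasts, outcomes):
    """Mean Brier score of the individual forecasters (lower is better)."""
    return mean(
        mean(brier(p, o) for p, o in zip(row, outcomes))
        for row in forecasts
    )

def aggregated_score(forecasts, outcomes, combine=mean):
    """Brier score of one combined forecast per question."""
    per_question = zip(*forecasts)  # regroup: one tuple of probabilities per question
    return mean(brier(combine(ps), o) for ps, o in zip(per_question, outcomes))

# Hypothetical data: three forecasters (rows), three questions (columns).
forecasts = [
    [0.9, 0.4, 0.2],
    [0.6, 0.7, 0.1],
    [0.3, 0.8, 0.5],
]
outcomes = [1, 1, 0]

individual = average_forecaster_score(forecasts, outcomes)
pooled = aggregated_score(forecasts, outcomes)
print(f"average forecaster: {individual:.3f}, averaged forecast: {pooled:.3f}")
```

Because squared error is convex, the averaged forecast can never score worse than the average forecaster, so a comparison between an aggregate and unaggregated individuals favors whichever side gets aggregated.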
Fourth, it’s difficult to know who possesses genuine expertise, so foreign policy discourse is prone to capture by special interests. History runs only once — the cause and effect in foreign policy are hard to generalise into measurable forecasts; as demonstrated by Tetlock’s superforecasters, geopolitical experts are worse than informed laymen at predicting world events. Unlike those who have fought the tobacco companies that denied the harms of smoking, or oil companies that denied global warming, the opponents of interventionists may never be able to muster evidence clear enough to win against those in power with special interests backing.
The press should include Tetlock’s superforecasting/prediction markets when reporting the forecasts by the military and national security bureaucracy at public interviews, official reports, and congressional testimony
What happens if they don’t? The White House report says a “protracted” default (ie for more than three months) could sink the stock market by 45%. Is this an exaggeration? Given that this is about any default, and not just a “protracted” one, I think this backs up the White House claim that this would be pretty catastrophic.
EPJ Probes The Long Run
Superforecasters are pretty good at telling you who will win next month’s sports game, next month’s election, or next year’s geopolitical clash. What about the longer-term? Can they predict broader political trends? The distant future of AI? Until now, we didn’t know, for a simple reason: superforecasting was only a few decades old. Philip Tetlock did the original Expert Political Judgment experiments in the 80s and 90s. In a predictive success of his own, Tetlock realized this would be a problem early on. In 1998, he got experts to make predictions for the next 25 years. Specifically, he asked his forecasters to predict the course of nuclear proliferation and various border conflicts. Some were geopolitics scholars who were experts in these fields; others weren’t. It’s been 25 years since 1998, so we’re ready to open the time capsule and see how they did. Before answering: how do we judge the results? That is, the subjects made some guesses about the world in 2023. Let’s say a third of them were right. Is that good or bad? Does it mean people can predict the future 25 years out, or they can’t? Tetlock proposes several specific questions, of which I’ll focus on the three I find most interesting: Will forecasters do better than some hacked-together algorithmic guess based on base rates? For example, if we ask “will countries X and Y go to war in the next 25 years?”, will experts outperform just guessing the base rate of war between those two countries (or two similar countries) over a 25-year-period?
Inline links: https://polymarket.com/event/us-debt-ceiling-hike-by
I interpret this as: it’s tempting to treat this as Team Long-Range-Forecasting-Is-Possible Vs. Team No-It-Isn’t. But everyone agrees certain kinds of long-range forecasts are possible (I predict with high confidence that the US President in 2050 will not be a Zoroastrian) and others are impossible (I cannot begin to predict the name of the US President in 2050). People who consider themselves “believers” vs. “skeptics” about long-range forecasting should figure out the exact boundary of which cases they disagree on. And then Tetlock et al can test those cases and figure out who’s right.
Are these the data I’ve been trying to get for years - which forecasting platforms beat which others? I don’t think so - Metaculus’ good Brier score only means it performs well on Metaculus’ questions, which might be easier or harder than some other platform’s questions. Can we use the Halawi et al AI as a fixed comparison point, since it’s always the same skill level? I’m not sure - it trained on each of these markets for the style of question that’s in each market, so it might be biased. Still, these numbers are all about where I would expect them to be, except maybe Polymarket, which does better than I would have expected. But the crowd still beats the AI, right? Halawi et al object that humans can forecast only when they feel like it - you can bet on a prediction market question you feel confident on, and avoid one you don’t. When they let their AI forecast only on those questions where it’s most likely to do well (eg those with lots of relevant news articles), it very slightly outperforms the human crowd. As AI gets better, will it naturally beat humans in forecasting? Halawi et al say this won’t be trivial. They find a version of their system based off GPT-3.5 is only very slightly worse than the final version built off GPT-4. This suggests a forecasting AI built off GPT-5 or 6 might get only small improvements. The second team is Tetlock et al. They start from the same place as Halawi - out-of-the-box LLMs aren’t good at forecasting. They’re more scathing about this than Halawi was - they argue that out-of-the-box models do worse than predicting 50% for everything (this was close to true of human forecasters in the ACX tournament). Instead of increasing quality, Tetlock increases quantity. He wants to do wisdom of crowds, where the crowd is a bunch of different LLMs. 
So he gets twelve LLMs - including Bard, GPT, Claude, Mistral, PaLM, LLaMa, some Chinese models I’d never heard of, and a couple of variations on these bases - asks them to predict questions, and averages the results. Remember, you gotta prompt your model with “you are a smart person”, or else it won’t be smart! The results:
Next, we compare the LLM crowd performance to that of the human crowd for our second hypothesis, directly putting the two crowd-aggregation mechanisms head-to-head. To do this, we use the same LLM crowd average as before (taking the median LLM prediction on each question and averaging up the Brier scores across questions). We compare this to the average of median human predictions on the same questions. In our preregistered analysis, we fail to find statistically significant differences between the LLM crowd’s mean Brier score of M=0.20 (SD=0.12) and that of the human crowd, M=0.19 (SD=0.19), t(60) = 0.19, p = 0.850.
Their study was much smaller than Halawi’s (31 questions vs. 3,672), so I don’t think this result (nonsignificant small difference) should be considered different from Halawi’s (significant small difference). Still, it’s weird, isn’t it? Halawi used a really complicated tower of prompts and APIs and fine-tunings, and Tetlock just got more LLMs, and they both did about the same. I have two questions after reading these results: Did they actually do the same, or is this just a function of the small sample size in Tetlock and the non-head-to-head comparison?
Inline links: Tetlock et al, https://substackcdn.com/image/fetch/$s_!4SEc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdce72400-aa57-4f52-99cb-5f551bd4d79d_675x435.png
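The crowd-scoring procedure quoted above (median prediction per question, then Brier scores averaged across questions) can be sketched roughly as follows. This is an illustrative reading of that one sentence, not the paper's code; the function name, question labels, and probabilities are all hypothetical, and the Brier score is assumed to be the binary squared-error form.

```python
from statistics import mean, median

def crowd_brier(predictions, outcomes):
    """Take the median prediction per question, score it, average across questions."""
    scores = []
    for question, probs in predictions.items():
        crowd_forecast = median(probs)
        scores.append((crowd_forecast - outcomes[question]) ** 2)
    return mean(scores)

# Hypothetical probabilities from three models on three resolved questions.
predictions = {
    "q1": [0.7, 0.8, 0.6],
    "q2": [0.3, 0.2, 0.4],
    "q3": [0.5, 0.9, 0.6],
}
outcomes = {"q1": 1, "q2": 0, "q3": 0}

crowd = crowd_brier(predictions, outcomes)
# Baseline: answering 50% on every question always scores 0.25.
coin_flip = crowd_brier({q: [0.5] for q in outcomes}, outcomes)
print(f"crowd: {crowd:.2f}, always-50%: {coin_flip:.2f}")
```

On this scale the always-50% baseline is 0.25, which gives some context for the reported crowd means of 0.20 and 0.19: both are modestly better than guessing 50% on everything.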
Even if you don’t want to convince yourself, this is the correct next step. Again by analogy to Tetlock - if he had started with just one superforecaster, and his thesis was “this guy is really smart, but I refuse to prove it”, nothing would have changed. Instead, his theory of change goes through publishing in a bunch of papers, to identifying other superforecasters, to teaching general principles of superforecasting, to superforecasting as a service (either through specific superforecasters at GJO, or through projects that seek to emulate them like Metaculus, FutureSearch, etc). If Rootclaim doesn’t scale, it either dies with Saar, or at best Saar lives a long life and puts out a few more dozen Rootclaim analyses but nothing else comes of it. You’ve got to start training other people eventually, and part of that process involves demonstrating you did it right, and that’s going to involve inter-rater reliability.
St. Felix publicly declared that he believed with 79% probability that COVID had a natural origin. He was brought before the Emperor, who threatened him with execution unless he updated to 100%. When St. Felix refused, the Emperor was impressed with his integrity, and said he would release him if he merely updated to 90%. St. Felix refused again, and the Emperor, fearing revolt, promised to release him if he merely rounded up one percentage point to 80%. St. Felix cited Tetlock’s research showing that the last digit contained useful information, refused a third time, and was crucified.