Good Judgment Project

Article

Good Judgment Project is a recurring organization in the Astral Codex Ten archive, appearing 11 times across 11 issues between November 01, 2021 and May 13, 2024. The archive places it in contexts such as “My source says that the Good Judgment Project is looking into this, which makes sense”; “This paper is by Good Judgment Project who have just spent years identifying a population of superforecasters”; and “If I say ‘I’m a superforecaster in the Good Judgment Project’”. It most often appears alongside Metaculus, Polymarket, and Manifold.

Metadata

  • Category: Organizations
  • Mention count: 11
  • Issue count: 11
  • First seen: November 01, 2021
  • Last seen: May 13, 2024

Source Context

Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.

November 01, 2021 · Original source
Finally, the need to isolate everyone limits your options. You can’t do this in a prediction market; you would have to have a tournament. And you can’t do an open tournament, because then lots of stupid people would be in it and the challenge would be figuring out what stupid people would guess. My source says that the Good Judgment Project is looking into this, which makes sense - they’re the kind of closed tournament between savvy forecasters where this could actually work.
November 15, 2021 · Original source
They admit that you’ve got to be really careful with this. If there are a lot of low-quality forecasters in the tournament, then since high-quality forecasters will accurately predict that low-quality forecasters will give a low-quality answer, everyone will converge on the low-quality answer. This paper is by Good Judgment Project who have just spent years identifying a population of superforecasters, so their plan is to use these people, who are all great, who all know they’re all great, who all know they all know they’re all great, etc. Philip Tetlock wasn’t writing all those books and tweets to self-aggrandize, he was writing them to create common knowledge!
February 21, 2022 · Original source
More important, it lampshades an important quality of “reputational” systems: so far, none of them actually produce any kind of a reputation. By this I mean something like: if I claim “I have an IQ of 160” or “I can bench press 300 lbs”, people might be impressed by me. If I say “I’m a superforecaster in the Good Judgment Project”, the small number of people who know and care what that is will be impressed. I’ve heard people claim all of these things, but I have never heard anyone casually drop their Metaculus score in conversation, even in the weird heavily-selected circles where everyone knows about Metaculus and agrees it is good.
March 14, 2022 · Original source
A common misconception is that superforecasters outperformed intelligence analysts by 30%. Instead: Goldstein et al showed that [EDIT: the Good Judgment Project's best-performing aggregation method][2] outperformed the intelligence community, but this was partly due to the different aggregation technique used (the GJP weighting algorithm performs better than prediction markets, given the apparently low volumes of the ICPM market). The forecaster prediction market performed about as well as the intelligence analyst prediction market; and in general, prediction pools outperform prediction markets in the current market regime (e.g. low subsidies, low volume, perverse incentives, narrow demographics). [85% confidence]
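The distinction drawn above between a prediction pool and a prediction market can be illustrated with a minimal sketch of a pool: forecasters state probabilities directly and these are combined by a weighted average. The passage does not describe the GJP weighting algorithm, so the function name, the weights, and the sample forecasts below are hypothetical and only show the general shape of a linear opinion pool.

```python
from typing import Optional, Sequence


def pooled_probability(
    forecasts: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Linear opinion pool: weighted average of stated probabilities.

    This is a generic illustration, not the GJP algorithm.
    """
    if weights is None:
        weights = [1.0] * len(forecasts)  # equal weights by default
    if not forecasts or len(forecasts) != len(weights):
        raise ValueError("need one weight per forecast and at least one forecast")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(p * w for p, w in zip(forecasts, weights)) / total


# Hypothetical forecasts; the third forecaster gets double weight
# (for example, because of a better track record).
pooled = pooled_probability([0.6, 0.7, 0.9], weights=[1.0, 1.0, 2.0])
print(round(pooled, 3))
```

A pool like this needs no trading volume, which may be one reason it holds up better than a thin market in the low-volume regime the passage describes.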
July 12, 2022 · Original source
The Swift Centre will publish forecasts from a panel of highly experienced and accurate forecasters including Good Judgment Project Superforecasters and financial industry professionals, collated and explained to help you navigate the world.
December 20, 2022 · Original source
If we try this plan, then looking back on it ten years from now, will we agree it was a mistake? Prediction markets give us a way to get accurate and canonical answers to questions like these, and to short circuit the usual discussions about how biased different information sources are. See below for some clever, more exotic ways we can use prediction markets. 4. What are the most common objections to prediction markets? These are various objections, some wrongheaded, some true but nonfatal. There are many of them, making this section very long - you might want to skip over any objections you’re not worried about. 4.1: Would prediction markets be ruined by insider trading? That is, suppose there is a market on whether President Biden will resign before the end of his term. President Biden has special knowledge of this, so he could bet on the true outcome and make a lot of money unfairly. He could even change his behavior (eg resign at an unexpected time) just to make more money. Isn’t this unfair? One answer is that normal markets (eg the stock market) face these same problems, but manage them by making insider trading illegal. These laws don’t always work perfectly, but they work well enough that most people are happy to buy stocks. Another answer is that, while this is bad for other investors, it’s not bad for the accuracy of prediction markets, or their use in creating unbiased social consensuses. In fact, knowing that President Biden is insider-trading on a “Will President Biden resign?” prediction market should only increase your confidence in it getting the right answer! This is slightly too rosy, because if insider trading is bad enough for other investors, they might just not trade. 
This would be a partial effect: investors would be willing to overcome their fear for a big enough payday, meaning that concerns about insider trading probably would increase the likelihood of persistent small mispricings while still not allowing bigger ones (with the exact size depending on how frequent the insider trading was). It’s unclear whether this negative effect would be bigger or smaller than the positive effect from insiders having more information, so in different situations the market might end up either more or less accurate. Overall, economists are split on whether insider trading makes markets more or less accurate. Commodities markets don’t really have insider trading laws right now, and seem to be about as accurate as anything else. I hope prediction markets will experiment with different insider trading rules, and the ones that best satisfy all participants and create the most accurate results will win out. If for some reason this doesn’t work, I don’t expect it to make too much difference either way. 4.2: Would prediction markets encourage harmful or illegal activities? What about the risk of insider trading by committing harmful / illegal acts? That is, could President Biden’s doctor decide to poison him, then make money when he has to resign due to ill health? I think the strongest evidence against is that this basically never happens in stock markets. Tesla stock would plummet if Elon Musk died or resigned, but nobody realistically worries that Musk’s doctor will short Tesla and poison him. Lots of corporations’ stocks would sink to zero if you burned down their offices and factories, but nobody shorts them and then commits arson. Probably this is because there are laws against doing harmful and illegal things, and people have decided that stock market gains aren’t worth breaking the law and getting punished. 
Since prediction markets have only a tiny fraction of the amount of money that stock markets do, probably people won’t consider it worthwhile to commit harmful actions to manipulate them either. If you were going to murder someone to profit off a market, who would you rather kill: a US politician (the PredictIt market on the presidential election has a volume of about $600,000)? Or a Fortune 500 CEO (whose companies might have market caps in the hundreds of billions)? 4.2.1: What about prediction markets in very specific harmful or illegal activities? I guess if you created a market in “Will someone burn down the 7-11 on Main Street tomorrow at 3:32 AM?”, then bet a lot of money, then did it, that would be bad. I think realistically nobody would bet against you on that. But probably prediction markets should avoid hosting markets on these very specific bad things, just to make sure. 4.3: Would prediction markets give rich people more power? That is, suppose we used prediction markets to assess socially important questions like “will the climate change by such-and-such a number of degrees by 2030?” It would be bad if rich people could manipulate our social consensus on this. But you move prediction markets by buying shares, and rich people can afford more shares than poor people. So doesn’t this mean that rich people can manipulate how concerned we are by global warming? No. See 3.2 for the general reasons why it’s very difficult or impossible to successfully manipulate a prediction market. These reasons apply to rich people too. Suppose a rich person spent $100 million to buy NO shares in “will the climate be warmer in 2030 than today?”, pushing the market’s implicit chance of global warming down to 1%. That means if there is global warming, you could multiply your money by 100x by buying YES. I would immediately invest $10,000 in this market, so that I could get $1 million back in 2030 and retire rich. 
My $10,000 isn’t going to be enough to fully move this market all the way back - we already said the rich person spent $100 million manipulating it. But “you can get a free $1 million quickly with no downside at an evil rich person’s expense by correcting an obvious misconception about global warming” sounds like the sort of thing that could make it to the front page of Reddit (to put it lightly). I think more than enough people would learn about this to fully correct the mispricing. Is there any amount of money that could successfully manipulate a market? I think the answer is that you need to have more money than the sum total owned by everybody else in the world who wants to make $1 million quick. And at the limit, there’s always Goldman Sachs - who watch financial markets very closely, definitely want to make $1 million quick, and have a lot of money. So I think the most honest answer to this objection is: if you are an evil rich person reading this FAQ, then it will definitely work for you. Please sink $100 million into reducing a prediction market’s chance of global warming to 1%. And make sure you tell me first, so that I can fully marvel at your evil genius. This will work great for you and nothing will possibly go wrong. 4.3.1: But wouldn’t the subtle biases of rich people (which they might genuinely believe) still affect the market more, since they have more money? No. See 3.3 for the general reasons why we should expect prediction markets to be free from subtle biases which people genuinely believe. These reasons apply to rich people too. Suppose rich people have subtle biases which make them wrong more often than poor people. And suppose rich people (wrongly) believe global warming is 75% likely, but poor people (correctly) believe it’s 99% likely. This just reduces to the Nate Silver situation earlier, with poor people playing Nate Silver. The aggregated opinion of poor people is “an expert” which is right more often than the markets. 
It’s easy for someone to notice this and get rich quick (in expectation) by betting on what poor people think. Since lots of people can easily notice this and want to get rich quick, eventually they will correct the mispricing. Even if rich people have so much more money than poor people that no group of poor people, however large, can ever correct a rich person mispricing, eventually some smart rich person will hit upon this strategy themselves. If no individual rich person does it, Goldman Sachs will definitely do it. 4.3.1.1: What if both rich people and poor people have biases, and neither one is consistently more right than the other? Won’t the market still reflect rich people’s biases rather than poor people’s? Not if it’s possible for anybody to notice these biases and correct for them. Treating the aggregate opinion of poor people as an expert was just one example. If the winning strategy is something like “trust rich people on financial questions, poor people on environmental questions, and the point exactly halfway between them on social questions”, then whoever discovers that strategy can get rich quick. The more often people use prediction markets, the easier it should be to detect strategies like these. 4.4: Aren’t prediction markets worse than superforecasting? “Superforecasting” refers to a variety of forecasting methods similar to those pioneered by Philip Tetlock and the Good Judgment Project. Typically, they would do something like: Ask many smart people to give probabilistic answers to a very well-specified question
If you’re very interested in this, it might be worth contacting Metaculus or Good Judgment Project about a partnership where they walk you through ways to use superforecasting for your organization. I don’t know of an easy way to get exactly this same service with a real prediction market yet.
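The payoff arithmetic in objection 4.3 above (a market pushed down to 1%, and a $10,000 YES bet returning $1 million) can be checked with a short sketch. It assumes a simple binary market in which a YES share costs the implied probability and pays $1 if the event happens, with no fees and no price impact from the bet itself; the function name and these simplifications are illustrative assumptions, not details from the source.

```python
def yes_payout(stake: float, implied_prob: float) -> float:
    """Payout if YES resolves true, assuming each share costs
    `implied_prob` dollars and pays out 1 dollar (no fees, no slippage)."""
    if not 0 < implied_prob <= 1:
        raise ValueError("implied_prob must be in (0, 1]")
    shares = stake / implied_prob  # number of YES shares the stake buys
    return shares * 1.0            # each share pays $1 on a YES resolution


# Market manipulated down to a 1% implied chance; bettor stakes $10,000.
payout = yes_payout(10_000, 0.01)
print(f"payout: ${payout:,.0f}, multiple: {payout / 10_000:.0f}x")
```

In a real market the price would move as the bettor buys, so the realized multiple would be lower than this idealized 100x; that price movement is the mechanism by which the mispricing gets corrected.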
January 24, 2023 · Original source
3rd: Skerry. Skerry works in finance, with a background in economics. He’s participated in forecasting before, including the Good Judgment Project and a small amount of prediction market trading.
April 25, 2023 · Original source
This is the basic idea behind Zou et al (2022), Forecasting Future World Events With Neural Networks. They create a dataset, Autocast, with 6000 questions from forecasting tournaments Metaculus, Good Judgment Project, and CSET Foretell. Then they ask their AI (a variant of GPT-2) to predict them, given news articles up to some date before the event happened. Here’s their result:
July 19, 2023 · Original source
Why: Philip Tetlock, co-author of Superforecasting and co-founder of the Good Judgment Project and the Forecasting Research Institute, is in town and has kindly agreed to come to an ACX meetup.
March 28, 2024 · Original source
This was a decisive victory. There were two judges, who each gave separate verdicts (or were allowed to declare a draw). Both judges decided in favor of Peter. You can see the judges’ own summary of their reasoning here (Will, Eric) Manifold agreed with the judges. There was a prediction market on who would win. It started out 70-30 in favor of lab leak. As the videos came out, zoonosis started doing better and better. I don’t want to take the exact final numbers too seriously, since I think some of the later price increases involved hints from the participants’ behavior. But it’s clear which way viewers thought the wind was blowing. Around the same time, the Good Judgment Project - Philip Tetlock’s group studying superforecasters - put out a report on the lab leak hypothesis. After studying it in depth, his forecasters ended up 75-25 in favor of zoonosis. The Rootclaim debate was one of ten sources they said they found especially interesting. And also around the same time, and unrelated to any of this, the Global Catastrophic Risks Institute surveyed experts (“168 virologists, infectious disease epidemiologists, and other scientists from 47 countries”) and found the same thing (though see here for some potential problems with the survey): For what it’s worth, I was close to 50-50 before the debate, and now I’m 90-10 in favor of zoonosis. III. The Math And The Aftermath The third debate session was about “inference”, how to put evidence together. I put this part off until after disclosing the winner, because I wanted to talk about some of these issues at more length. The Math: Judges Both judges included a probabilistic analysis in their written decision. Here’s the same table as above, expanded to add the judges: I shoehorned the judges’ factors into the categories I already had; some of them were actually subtly different from Peter’s, Saar’s, and each other’s. The “priors” category is especially a mess here. 
We’ll go over these later, but I get the impression that they both thought of probabilistic analyses as an afterthought. For example, Judge Eric wrote 30,000 words about which considerations moved him, and only then includes the analysis, saying: I am not convinced that this Bayesian calculation is even an appropriate way to estimate the relative posterior probability of Z and LL; it just seemed fair that after criticizing Rootclaim’s calculations at length I should make an attempt at it myself. Judge Will’s decision ran to 10,000 words. He said he independently tried both reasoning it out intuitively, and running the Bayesian analysis, and was relieved when these two methods returned the same result. He said: I am skeptical that the Bayesian decision making/evaluation methods are any more "objective" than [intuitive reasoning]. I think they maximize legibility, not objectivity, and tend to hide the intuitive/heuristic portion in the data inclusion step and values, where it’s harder to see . . . I am not skilled in the Bayesian method, and I am sure I made significant mistakes. More time and practice would improve and refine my estimates. At the fundamental rules of the universe level, Bayesian analysis must be the best way to evaluate evidence. However, I am unsure that it’s a good strategy for a human given our cognitive limitations, and doubly unsure it’s truly being used (in the dispassionate sense) where the outcome is social desirability/fame/Twitter likes. I’m focusing on this because Saar’s opinion is that the debate went wrong (for his side) because he didn’t realize the judges were going to use Bayesian math, they did the math wrong (because Saar hadn’t done enough work explaining how to do it right), and so they got the wrong answer. I want to discuss the math errors he thinks the judges made, but this discussion would be incomplete without mentioning that the judges themselves say the numbers were only a supplement for their intuitive reasoning. 
That having been said, let’s look deeper into some of Saar’s concerns. The Math: Extreme Odds Saar complained that Peter’s odds were too extreme. For example, Peter said there was only a 1/10,000 chance that a lab leak pandemic would first show up at a wet market. Peter’s argument went something like: obviously a zoonotic pandemic would start at a site selling weird animals. But a lab leak pandemic - if it didn’t start at the lab - could show up anywhere. 1/10,000 Wuhan citizens work at the wet market. So if a lab leak was going to show up somewhere random, the wet market was a 1/10,000 chance. Saar had specific arguments against this, but he also had a more general argument: you should rarely see odds like 1/10,000 outside of well-understood domains. In his blog post, he gave this example: A prosecutor shows the court a statistical analysis of which DNA markers matched the defendant and their prevalence, arriving at a 1E-9 probability they would all match a random person, implying a Bayes factor near 1E9 for guilty. But if we try to estimate p(DNA|~guilty) by truly assuming innocence, it is immediately evident how ridiculous it is to claim only 1 out of a billion innocent suspects will have a DNA match to the crime scene. There are obviously far better explanations like a lab mistake, framing, an object of the suspect being brought by someone to the scene, etc. So the real p(wet market|lab leak) isn’t the 1/10,000 chance a pandemic arising in a random place hits the wet market, but the (higher?) probability that there’s something wrong with Peter’s argument. Then Saar tried to show specific things that might be wrong with Peter’s argument. I didn’t find his specific examples convincing. But maybe the question shouldn’t be whether I agreed with him. It should be whether I’m so confident he’s wrong that I would give it 10,000-to-1 odds. This makes total sense, it’s absolutely true, and I want to be really, really careful with it. 
If you take this kind of reasoning too far, you can convince yourself that the sun won’t rise tomorrow morning. All you have to do is propose 100 different reasons the sunrise might not happen. For example: The sun might go nova.
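Saar's general argument above, that a tiny nominal probability is bounded from below by the chance that the argument producing it is flawed, can be written as a two-branch mixture. The sketch below is one way to express that idea; the function name and every numeric input other than the 1E-9 and 1/10,000 figures quoted in the passage are hypothetical values chosen for illustration.

```python
def effective_likelihood(
    p_nominal: float,
    p_argument_flawed: float,
    p_if_flawed: float,
) -> float:
    """Probability of the evidence once we allow that the argument behind
    `p_nominal` may itself be wrong.

    p_nominal         : probability the argument claims (e.g. 1e-9)
    p_argument_flawed : chance the argument is flawed (hypothetical input)
    p_if_flawed       : probability of the evidence if it is flawed (hypothetical)
    """
    for p in (p_nominal, p_argument_flawed, p_if_flawed):
        if not 0 <= p <= 1:
            raise ValueError("all inputs must be probabilities in [0, 1]")
    return (1 - p_argument_flawed) * p_nominal + p_argument_flawed * p_if_flawed


# DNA example: nominal 1e-9, but suppose a 1-in-1,000 chance of a lab mistake,
# framing, etc., in which case a match is assumed certain.
dna = effective_likelihood(1e-9, 1e-3, 1.0)

# Wet market example: nominal 1/10,000, with a hypothetical 1% chance the
# "random location" argument is flawed and a 10% chance of the evidence if so.
wet_market = effective_likelihood(1e-4, 0.01, 0.1)

print(f"DNA: {dna:.3e}, wet market: {wet_market:.3e}")
```

With these made-up inputs the flaw term dominates in both cases, which is the point of the argument; the passage's caution still applies, since generous enough flaw probabilities can be used to undermine any confident conclusion.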
May 13, 2024 · Original source
(I understand most of the NO vote here is based on the theory that there will be legal intervention - maybe because the government is willing to tolerate sweepstakes casinos but not sweepstakes prediction markets). Manifold co-founder Austin Chen won’t be involved. He’s leaving the site - not explicitly because of the pivot, he just said it seems to be “trapped in local optima”. He plans to focus on other parts of the Manifold empire, especially Manifund, which tests impact markets, regranting, and other “experimental” charity models. Manifold will continue in the hands of the other two co-founders, James and Stephen Grugett. Superhindcasting I mentioned this in my lab leak post, but it deserves more attention here: Good Judgment Project’s report on Superforecasting The Origins Of The COVID-19 Pandemic. Good Judgment Project employs superforecasters who will predict things for clients. Some people interested in COVID origins asked them to judge whether lab leak was plausible. Their headline result was 74% zoonosis, 25% lab leak, 1% something else. Part of GJP’s method is getting their forecasters to share sources and talk to each other. Here’s the graph for how that went: People changed their minds a little over time, but not in a very consistent way that mattered much in the end. What was the “client feedback”? The report says: Client feedback was provided to the Superforecasters on December 21. The client posed questions to the Superforecasters about their assessments up to that date and asked for their reactions to several studies and articles. In the days following the client engagement, the Superforecasters lowered their confidence in the natural zoonosis hypothesis from 73% to 67%, although zoonosis remained the most likely potential cause in their assessment. But following an active engagement with recent genomic studies and historical base rates of zoonotic spillovers, those numbers began to return to earlier levels. 
January also saw increased attention to the geopolitical context and transparency issues, particularly related to research activities in Wuhan. Is this bad? I’m imagining a pro-lab-leak client saying “But what about [this list of pro-lab-leak arguments]?” and then the superforecasters read them and adjust. In one sense, it’s good that they got to see more arguments; on the other, it seems like a potential route by which clients could bias the results - probabilities never quite got back to where they were before the feedback, though they got pretty close. The last-minute spike for zoonosis might be the Rootclaim debate results, which were released on 2/18. So maybe the client feedback and the Rootclaim results both slightly affected the numbers, but mostly the superforecasters started out pro-zoonosis and stuck to their guns. Dan Schwarz and the FutureSearch team say that forecasting has a “rationale-shaped hole”. Despite the report making this sound like a pretty intense process, we don’t get much information about details: In their extensive discussions, Good Judgment’s Superforecasters assessed base rates and historical patterns, existing evidence and scientific analysis, geopolitical context and transparency concerns, trust in intelligence communities, and methodological constraints. 1. Base Rates and Historical Patterns: The Superforecasters frequently referenced base rates, i.e., the history of pandemics emerging from natural zoonosis versus the history of laboratory leaks, to anchor their probabilities. For the former, they discussed how the base rates are changing as the climate warms and as expanding human populations push farther into natural environments that previously saw little human presence. For the latter, they acknowledged that it has only been 12 years since the advent of CRISPR gene-editing tools, and the base rate of lab leaks in the short synthetic biology era is not yet well established. 2. 
New Evidence and Scientific Analysis: Throughout the period, the Superforecasters adapted their forecasts in light of new scientific evidence, including genomic analyses of SARS-CoV-2 and its relation to bat viruses, and the debate over potential laboratory manipulation. 3. Geopolitical Context and Transparency Concerns: The geopolitical implications of the virus’s origins, particularly in relation to China’s transparency and the involvement of international research institutions, played a significant role in the analysis. Concerns over data veracity, and over the political ramifications of determining that the pandemic’s origins were other than zoonosis, were extensively debated. 4. Trust in Intelligence: Commentary on trust in intelligence communities and discussions about the impact of geopolitical biases on the interpretation of evidence illustrated the complex interplay between science, politics, and human behavior in assessing the pandemic’s origins. 5. Methodological Critiques and the Evaluation of Evidence: The Superforecasters engaged in methodological critiques of the evidence base, including the scrutiny of laboratory practices and biocontainment levels [...] In the end, most Superforecasters were in rough agreement on issues like the base rates of zoonotic spillover. Where they most often disagreed was on the interpretation of actions by Chinese officials and whether their actions reflected how an authoritarian government would react in any crisis over which it did not have full control, or whether those actions were indicative of attempts to cover up a biomedical research-related accident that allowed the SARS-CoV-2 virus to enter circulation in China and, ultimately, the entire globe. Probably it would be too much to ask for to get a transcript of all their discussions - then they’d be nervous saying things that might make them look bad to an audience. What would be a good balance between getting more information and not imposing on their time? 
Forecasting is an unusually legible and easy-to-judge domain. One of the theories of change for forecasting was to use it to identify smart people with good reasoning, then turn them loose on less well-behaved problems. This is one of the first big attempts to do this at scale. How did it work? We can’t tell, because it’s inherently an illegible and hard-to-judge domain. Darn. I don’t know what I expected. Notes From A Local Optimum Austin’s concern - that forecasting has reached a local optimum - is widely shared. We have some good sites: Manifold, Metaculus, Polymarket, GJO, etc - all doing good work. We have good-ish probabilities for a few important questions. Every so often a news source cites them. Sometimes a decision-maker looks at them behind the scenes, maybe. Is this all there is? The FutureSearch team says the next step is to focus on “rationale”. We need to use forecasting not just to get a raw probability, but to explain what’s going on and why we think something. Then instead of just convincing policy-makers to trust forecasts, we can tell them why something is true, or inform their discussions even if they’re not willing to blindly trust a number. Is this a betrayal of the forecasting ethos? The original dream was that instead of a bunch of people giving arguments, we could just test who was right. Now we’re going back to the arguments? People have argued forever; what does forecasting add to that? Well, they add the knowledge that the arguments are from people who have been right a lot before and are incentivized to be right again. Still, it’s not a natural fit. Probably it’s relevant here that FutureSearch’s forecasting AI does a really good job of this by default, in a way humans can’t match. Nuno’s yearly forecasting roundup doesn’t have a single thesis, but the first part is a well-supported complaint that most forecasting sites aren’t good business. 
They either burn VC money, burn EA donations, or converge towards casinos to support themselves. He gives an honorable exception to Cultivate Labs, which sells prediction market software rather than the results themselves. Open Philanthropy (billionaire Dustin Moskovitz’s EA-aligned charitable foundation) has at least given forecasting a vote of confidence, recently choosing to promote it to one of their main donation areas. Still, they got a lot of pushback on the decision, for example SuperDuperForecasting here: This will be a total waste of time and money unless OpenPhil actually pushes the people it funds towards achieving real-world impact. The typical pattern in the past has been to launch yet another forecasting tournament to try to find better forecasts and forecasters. No one cares, we already know how to do this since at least 2012! The unsolved problem is translating the research into real-world impact. Does the Forecasting Research Institute have any actual commercial paying clients? What is Metaculus's revenue from actual clients rather than grants? Who are they working with and where is the evidence that they are helping high-stakes decision makers improve their thought processes? Incidentally, I note that forecasting is not actually successful even within EA at changing anything: superforecasters are generally far more relaxed about Xrisk than the median EA, but has this made any kind of difference to how EA spends its money? It seems very unlikely. And Marcus Abramovich here: I'm in the process of writing up my thoughts on forecasting in general and particularly EA's reverence for forecasting but I feel, similar to @Grayden that forecasting is a game that is nearly perfectly designed to distract EAs from useful things. It's a combination of winning, being right when others are wrong and seemingly useful, all wrapped into a fun game. 
I'd like to see tangible benefits to more broad funding of forecasting that seems to be done in the millions and tens of millions of dollars. I would also be the type of person you would think would be a greater fan of forecasting. I'm the number one forecaster on Manifold and I've made tens of thousands of dollars on Polymarket. But I think we should start to think of forecasting as more of a game that EAs like to play, something like Magic the Gathering that is fun and has some relations to useful things but isn't really useful by itself. Eli Lifland has a long and hard-to-summarize comment here, response from Ozzie Gooen here, podcast between them on “Is Forecasting A Promising EA Cause Area?” here. I’m split on this. My previous hope was that the field would gradually grow, without any qualitative changes or discontinuities, until it became big enough that journalists and policy-makers were aware of it and took it seriously (compare eg the growth of the Internet as a scholarly resource). I think the strongest argument against this is Manifold’s relatively flat user numbers. Is there a new hope? I think if nothing else, forecasting might be useful as a testing ground: First, to create forecasting AIs (like FutureSearch) which can then get consulted on a variety of questions, eg by policy-makers. The biggest holdup has always been the need to gather 20 or 50 or however many hard-to-find superforecasters for whatever question you’re asking, and then trust their advice even though they’re fallible fleshbag humans. If you can use the 20 to 50 superforecasters to inspire an AI, and then test the AI and prove it’s good, people might be more interested. This is especially true if the AI can branch out beyond traditional forecasting questions. Once we have a few of these, we can start comparing the next generation of AIs to the previous generation, and skip the superforecasters.