FutureSearch

Article

FutureSearch is a recurring organization in the Astral Codex Ten archive, appearing 5 times across 5 issues between February 20, 2024 and September 17, 2024. The archive places it in contexts such as “But FutureSearch says it does. They let me test their AI model”; “FutureSearch is run by a team formerly from Metaculus”; and “Like, FutureSearch (above) is pretty cool”. It most often appears alongside Manifold, Metaculus, and Polymarket.

Metadata

  • Category: Organizations
  • Mention count: 5
  • Issue count: 5
  • First seen: February 20, 2024
  • Last seen: September 17, 2024

Source Context

Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.

February 20, 2024 · Original source
The Nermit bot is based on FutureSearch.ai, a new company trying to build an AI-based forecaster. Based on their own internal calculations, they claim success (but see footnote 1). How is this possible? Some studies of superforecasters converge on the same technique: figure out a base rate for some event, then alter it based on the current situation. For example, if you wanted to know the chance of a cease-fire in Ukraine over the next year, you might start by plotting the distribution of war lengths over the past century, then check how many wars that had lasted at least two years had a cease-fire in the third. Then you might adjust a little bit down for factors like “there haven’t been any promising peace talks yet” and “the two sides seem equally balanced”. FutureSearch’s AI tries to do something similar. It prompts itself with questions like “What would be a good reference class for this question?”
“Putting it all together, what do you think the probability is?” This doesn’t sound to me like it would work. But FutureSearch says it does. They let me test their AI model. I tried four questions: Will Nikki Haley win the 2028 presidential election? Answer: 10%
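The base-rate-then-adjust technique described in the passage above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the duration values, and the size of the adjustments are hypothetical placeholders, and nothing here is claimed to be how FutureSearch's system actually works.

```python
import math


def conditional_base_rate(durations_years, survived_at_least, window):
    """Among cases that lasted at least `survived_at_least` years,
    return the share that ended within the following `window` years."""
    at_risk = [d for d in durations_years if d >= survived_at_least]
    if not at_risk:
        raise ValueError("reference class is empty")
    ended = [d for d in at_risk if d < survived_at_least + window]
    return len(ended) / len(at_risk)


def adjust_in_log_odds(probability, shifts):
    """Apply additive log-odds shifts (negative means less likely)."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    logit = math.log(probability / (1.0 - probability)) + sum(shifts)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical durations in years: placeholders, NOT real historical data.
durations = [0.5, 1.0, 1.5, 2.2, 2.5, 2.8, 3.5, 4.0, 6.0, 8.0]

# Step 1: base rate for "ended in year three, given it lasted two years".
base_rate = conditional_base_rate(durations, survived_at_least=2.0, window=1.0)

# Step 2: nudge downward for situation-specific factors (assumed magnitudes),
# e.g. "no promising peace talks yet" and "the two sides seem equally balanced".
adjusted = adjust_in_log_odds(base_rate, shifts=[-0.3, -0.2])

print(f"base rate {base_rate:.2f}, adjusted {adjusted:.2f}")
```

Working in log-odds is one common way to keep the adjusted value between 0 and 1; a forecaster could equally adjust the probability by hand.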
March 12, 2024 · Original source
Last month we talked about FutureSearch, a prediction startup that claims their AI is as good as experienced forecasters. This month, two academic teams claim to have gotten similar results with AIs of their own.
Halawi fine-tunes the out-of-the-box AI (in his case, a version of GPT-4) using some of the same tricks as FutureSearch. They attach it to “news APIs” (NewsCatcher, Google News) and teach it to search them effectively and reason about the contents.
Halawi and Tetlock’s AIs did between slightly-worse-than and equivalent-to the participant aggregate, so let’s say 90-95th percentile. FutureSearch claims to equal a 98th percentile forecaster, but they got this number through totally different and slightly suspicious methodology, so I don’t know if it’s actually any better. Still, we see that Samotsvety is capable of 98%ile performance (likely real and repeatable) and Metaculus of 99.5th. So there’s still a long way to go before we exhaust the limits of what’s possible to predict given the available amount of information!
Towards Rationality Engines
An interlude, before we get to other interesting prediction news. Forecasting AIs are pretty cool. I wouldn’t have expected them to work as well as they do. They are already superforecaster-level, and given the amount of low-hanging fruit that gets picked every day here, I can see them equalling or exceeding the top human forecasters in the next few years. But they can’t answer many of the questions we care about most - questions that aren’t about prediction. Do masks prevent COVID transmission? Was OJ guilty? Did global warming contribute to the California superdrought? What caused the opioid crisis? Is social media bad for children? I see two interesting challenges ahead here: Making an AI that can do this.
April 09, 2024 · Original source
Even if you don’t want to convince yourself, this is the correct next step. Again by analogy to Tetlock - if he had started with just one superforecaster, and his thesis was “this guy is really smart, but I refuse to prove it”, nothing would have changed. Instead, his theory of change goes through publishing in a bunch of papers, to identifying other superforecasters, to teaching general principles of superforecasting, to superforecasting as a service (either through specific superforecasters at GJO, or through projects that seek to emulate them like Metaculus, FutureSearch, etc). If Rootclaim doesn’t scale, it either dies with Saar, or at best Saar lives a long life and puts out a few more dozen Rootclaim analyses but nothing else comes of it. You’ve got to start training other people eventually, and part of that process involves demonstrating you did it right, and that’s going to involve inter-rater reliability.
May 13, 2024 · Original source
People changed their minds a little over time, but not in a very consistent way that mattered much in the end. What was the “client feedback”? The report says: Client feedback was provided to the Superforecasters on December 21. The client posed questions to the Superforecasters about their assessments up to that date and asked for their reactions to several studies and articles. In the days following the client engagement, the Superforecasters lowered their confidence in the natural zoonosis hypothesis from 73% to 67%, although zoonosis remained the most likely potential cause in their assessment. But following an active engagement with recent genomic studies and historical base rates of zoonotic spillovers, those numbers began to return to earlier levels. January also saw increased attention to the geopolitical context and transparency issues, particularly related to research activities in Wuhan. Is this bad? I’m imagining a pro-lab-leak client saying “But what about [this list of pro-lab-leak arguments]?” and then the superforecasters read them and adjust. In one sense, it’s good that they got to see more arguments; on the other, it seems like a potential route by which clients could bias the results - probabilities never quite got back to where they were before the feedback, though they got pretty close. The last-minute spike for zoonosis might be the Rootclaim debate results, which were released on 2/18. So maybe the client feedback and the Rootclaim results both slightly affected the numbers, but mostly the superforecasters started out pro-zoonosis and stuck to their guns. Dan Schwarz and the FutureSearch team say that forecasting has a “rationale-shaped hole”.
Despite the report making this sound like a pretty intense process, we don’t get much information about details: In their extensive discussions, Good Judgment’s Superforecasters assessed base rates and historical patterns, existing evidence and scientific analysis, geopolitical context and transparency concerns, trust in intelligence communities, and methodological constraints.
1. Base Rates and Historical Patterns: The Superforecasters frequently referenced base rates, i.e., the history of pandemics emerging from natural zoonosis versus the history of laboratory leaks, to anchor their probabilities. For the former, they discussed how the base rates are changing as the climate warms and as expanding human populations push farther into natural environments that previously saw little human presence. For the latter, they acknowledged that it has only been 12 years since the advent of CRISPR gene-editing tools, and the base rate of lab leaks in the short synthetic biology era is not yet well established.
2. New Evidence and Scientific Analysis: Throughout the period, the Superforecasters adapted their forecasts in light of new scientific evidence, including genomic analyses of SARS-CoV-2 and its relation to bat viruses, and the debate over potential laboratory manipulation.
3. Geopolitical Context and Transparency Concerns: The geopolitical implications of the virus’s origins, particularly in relation to China’s transparency and the involvement of international research institutions, played a significant role in the analysis. Concerns over data veracity, and over the political ramifications of determining that the pandemic’s origins were other than zoonosis, were extensively debated.
4. Trust in Intelligence: Commentary on trust in intelligence communities and discussions about the impact of geopolitical biases on the interpretation of evidence illustrated the complex interplay between science, politics, and human behavior in assessing the pandemic’s origins.
5. Methodological Critiques and the Evaluation of Evidence: The Superforecasters engaged in methodological critiques of the evidence base, including the scrutiny of laboratory practices and biocontainment levels [...]
In the end, most Superforecasters were in rough agreement on issues like the base rates of zoonotic spillover. Where they most often disagreed was on the interpretation of actions by Chinese officials and whether their actions reflected how an authoritarian government would react in any crisis over which it did not have full control, or whether those actions were indicative of attempts to cover up a biomedical research-related accident that allowed the SARS-CoV-2 virus to enter circulation in China and, ultimately, the entire globe.
Probably it would be too much to ask for to get a transcript of all their discussions - then they’d be nervous saying things that might make them look bad to an audience. What would be a good balance between getting more information and not imposing on their time? Forecasting is an unusually legible and easy-to-judge domain. One of the theories of change for forecasting was to use it to identify smart people with good reasoning, then turn them loose on less well-behaved problems. This is one of the first big attempts to do this at scale. How did it work? We can’t tell, because it’s inherently an illegible and hard-to-judge domain. Darn. I don’t know what I expected.
Notes From A Local Optimum
Austin’s concern - that forecasting has reached a local optimum - is widely shared. We have some good sites: Manifold, Metaculus, Polymarket, GJO, etc - all doing good work. We have good-ish probabilities for a few important questions. Every so often a news source cites them. Sometimes a decision-maker looks at them behind the scenes, maybe. Is this all there is? The FutureSearch team says the next step is to focus on “rationale”.
We need to use forecasting not just to get a raw probability, but to explain what’s going on and why we think something. Then instead of just convincing policy-makers to trust forecasts, we can tell them why something is true, or inform their discussions even if they’re not willing to blindly trust a number. Is this a betrayal of the forecasting ethos? The original dream was that instead of a bunch of people giving arguments, we could just test who was right. Now we’re going back to the arguments? People have argued forever; what does forecasting add to that? Well, they add the knowledge that the arguments are from people who have been right a lot before and are incentivized to be right again. Still, it’s not a natural fit. Probably it’s relevant here that FutureSearch’s forecasting AI does a really good job of this by default, in a way humans can’t match. Nuno’s yearly forecasting roundup doesn’t have a single thesis, but the first part is a well-supported complaint that most forecasting sites aren’t good business. They either burn VC money, burn EA donations, or converge towards casinos to support themselves. He gives an honorable exception to Cultivate Labs, which sells prediction market software rather than the results themselves. Open Philanthropy (billionaire Dustin Moskovitz’s EA-aligned charitable foundation) has at least given forecasting a vote of confidence, recently choosing to promote it to one of their main donation areas. Still, they got a lot of pushback on the decision, for example SuperDuperForecasting here: This will be a total waste of time and money unless OpenPhil actually pushes the people it funds towards achieving real-world impact. The typical pattern in the past has been to launch yet another forecasting tournament to try to find better forecasts and forecasters. No one cares, we already know how to do this since at least 2012! The unsolved problem is translating the research into real-world impact. 
Does the Forecasting Research Institute have any actual commercial paying clients? What is Metaculus's revenue from actual clients rather than grants? Who are they working with and where is the evidence that they are helping high-stakes decision makers improve their thought processes? Incidentally, I note that forecasting is not actually successful even within EA at changing anything: superforecasters are generally far more relaxed about Xrisk than the median EA, but has this made any kind of difference to how EA spends its money? It seems very unlikely. And Marcus Abramovich here: I'm in the process of writing up my thoughts on forecasting in general and particularly EA's reverence for forecasting but I feel, similar to @Grayden that forecasting is a game that is nearly perfectly designed to distract EAs from useful things. It's a combination of winning, being right when others are wrong and seemingly useful, all wrapped into a fun game. I'd like to see tangible benefits to more broad funding of forecasting that seems to be done in the millions and tens of millions of dollars. I would also be the type of person you would think would be a greater fan of forecasting. I'm the number one forecaster on Manifold and I've made tens of thousands of dollars on Polymarket. But I think we should start to think of forecasting as more of a game that EAs like to play, something like Magic the Gathering that is fun and has some relations to useful things but isn't really useful by itself. Eli Lifland has a long and hard-to-summarize comment here, response from Ozzie Gooen here, podcast between them on “Is Forecasting A Promising EA Cause Area?” here. I’m split on this. My previous hope was that the field would gradually grow, without any qualitative changes or discontinuities, until it became big enough that journalists and policy-makers were aware of it and took it seriously (compare eg the growth of the Internet as a scholarly resource).
I think the strongest argument against this is Manifold’s relatively flat user numbers. Is there a new hope? I think if nothing else, forecasting might be useful as a testing ground: First, to create forecasting AIs (like FutureSearch) which can then get consulted on a variety of questions, eg by policy-makers. The biggest holdup has always been the need to gather 20 or 50 or however many hard-to-find superforecasters for whatever question you’re asking, and then trust their advice even though they’re fallible fleshbag humans. If you can use the 20 to 50 superforecasters to inspire an AI, and then test the AI and prove it’s good, people might be more interested. This is especially true if the AI can branch out beyond traditional forecasting questions. Once we have a few of these, we can start comparing the next generation of AIs to the previous generation, and skip the superforecasters.
September 17, 2024 · Original source
The basic structure is the same as past forecasting AIs like FutureSearch. A heavily-modified copy of ChatGPT gathers relevant news articles, then prompts itself to think in superforecaster-like ways. The creators say the ChatGPT copy had a knowledge cutoff of October 2023, so they tested it on Metaculus questions from after that date. It got 87.7% accuracy, slightly above Metaculus forecasters’ 87.0%. Manifold is skeptical: The commenters, especially Neel Nanda, found that doing knowledge cutoffs properly is hard, and the ChatGPT base seems to know about news events after October 2023 - upon questioning, it seemed aware of an earthquake in November 2023. When presented with a different set of questions that were all after November 2023, FiveThirtyNine substantially underperformed the Metaculus average. But also, my attempts to play around with the bot haven’t been encouraging: I asked it to predict the chance that Prospera would have a population of at least 1,000 in 2027. Like FutureSearch on the same question, it cited many interesting news articles on Prospera’s chances but failed to do the basic step of figuring out its current population and growth rate. It eventually concluded 35% chance, which is reasonable enough. But when asked whether Prospera would have a population of 100,000 in 2028, it also said 35% chance, which is absurd.
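The Prospera example above, where a much harder threshold received the same 35% as an easier one, suggests a simple sanity check on a forecaster's outputs. The sketch below is a heuristic under stated assumptions: `ThresholdForecast`, `suspicious_pairs`, `min_ratio`, and `tolerance` are hypothetical names and settings, and because the two questions in the passage have different target years they are not strictly nested events, so a flag means "look closer" rather than "provably incoherent".

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ThresholdForecast:
    threshold: int      # question is "population of at least this many"
    year: int           # target year named in the question
    probability: float  # forecaster's answer


def suspicious_pairs(forecasts, min_ratio=10.0, tolerance=0.01):
    """Flag pairs where a much higher threshold got roughly the same
    (or a higher) probability than a much lower one."""
    flagged = []
    ordered = sorted(forecasts, key=lambda f: f.threshold)
    for easier, harder in combinations(ordered, 2):
        much_harder = harder.threshold >= min_ratio * easier.threshold
        not_lower = harder.probability >= easier.probability - tolerance
        if much_harder and not_lower:
            flagged.append((easier, harder))
    return flagged


# The two answers reported in the passage above.
answers = [
    ThresholdForecast(threshold=1_000, year=2027, probability=0.35),
    ThresholdForecast(threshold=100_000, year=2028, probability=0.35),
]

for easier, harder in suspicious_pairs(answers):
    print(
        f">= {harder.threshold:,} by {harder.year} got {harder.probability:.0%}, "
        f"no lower than >= {easier.threshold:,} by {easier.year} "
        f"at {easier.probability:.0%}"
    )
```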
A Twitter user pointed out (and I confirmed) that upon being asked “What is the probability that Joe Biden is still President in October 2025?”, it goes through a lot of reasoning about his age and dementia and finally concludes 55% because he’s not that demented. I originally thought this might be due to the knowledge cutoff (it doesn’t know Biden dropped out in favor of Harris), but if I ask the AI about October 2029, then it says that Joe Biden has dropped out in favor of Harris (even though in that question it doesn’t matter). So now I think it’s more like ChatGPT’s tendency to round anything that sounds vaguely like the surgeon riddle off to the surgeon riddle - in the same way, FiveThirtyNine rounds off anything that sounds vaguely like the popular question “is Biden too old and demented to stay president?” into that question, even though there are much stronger non-dementia-related reasons he can’t be president next year. The FutureSearch team wrote a LessWrong post generalizing these kinds of observations, Contra Papers Claiming Superhuman AI Forecasting. They examine four claims, including the one above, and find similar problems with all of them. Sometimes the teams involved missed potential data contamination (ie their LLM wasn’t forecasting, it just already knew the answers). Other times the LLM failed but - in the spirit of technologists everywhere - the researchers invented finicky definitions of “above human level” by which even mediocre AIs qualified. They conclude: Today's autonomous AI forecasting can be better than average, or even experienced, human forecasters…but it's very unlikely that any autonomous AI forecaster yet built is close to the accuracy of a top 2% Metaculus forecaster, or the crowd. Still, FiveThirtyNine is a big advance in at least one way: as far as I know, it’s the first high-quality AI forecaster which is free to the general public. Try it out! 
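The data-contamination concern described above (an LLM that already knows the answers is not forecasting) is usually handled by scoring only questions that opened after the model's knowledge cutoff, ideally with a safety margin because cutoffs are fuzzy. The sketch below assumes a hypothetical question format and made-up sample questions, and it treats "October 2023" as October 31, 2023; it is not the evaluation code used by any of the teams mentioned.

```python
from datetime import date, timedelta


def after_cutoff(questions, cutoff, margin_days=0):
    """Keep questions opened strictly after cutoff plus a safety margin."""
    earliest = cutoff + timedelta(days=margin_days)
    return [q for q in questions if q["opened"] > earliest]


def accuracy(questions):
    """Share of questions where the forecast landed on the correct side of 50%."""
    if not questions:
        raise ValueError("no questions to score")
    hits = sum((q["forecast"] >= 0.5) == q["resolved_yes"] for q in questions)
    return hits / len(questions)


# Assumed reading of "a knowledge cutoff of October 2023".
CUTOFF = date(2023, 10, 31)

# Made-up questions for illustration only.
questions = [
    {"opened": date(2023, 9, 15), "forecast": 0.9, "resolved_yes": True},
    {"opened": date(2023, 11, 10), "forecast": 0.8, "resolved_yes": True},
    {"opened": date(2024, 1, 5), "forecast": 0.3, "resolved_yes": False},
    {"opened": date(2024, 2, 1), "forecast": 0.7, "resolved_yes": False},
]

strict = after_cutoff(questions, CUTOFF)
padded = after_cutoff(questions, CUTOFF, margin_days=30)
print(f"no margin: {len(strict)} questions, accuracy {accuracy(strict):.2f}")
print(f"30-day margin: {len(padded)} questions, accuracy {accuracy(padded):.2f}")
```

The margin reflects the observation in the passage that a model may know about events somewhat later than its stated cutoff, so scores on questions just past the cutoff can be inflated.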
This means there’s still time to use this joke when they invent the actually good one! r/MarkMyWords This is a subreddit for people who want to record bold predictions. There’s nothing formal - nobody gives probabilities, and some of them don’t even have end dates. It’s just people going out on a limb to say they’re sure something will happen. …most of them are “mark my words, time will prove Democrats right about everything, and reveal Republicans to be disgusting criminal hypocrites”. …so much so that it kind of fails as a potentially interesting institution and becomes just another monument to how sad the Internet’s gotten. Still, it might be fun to keep going until you find an old post where the prediction has already “resolved”, and see what happens. Here are some of the highest-upvoted posts from at least a year ago (minus pop culture and dumb in-jokes): MMW: It will turn out the Notre Dame fire was actually arson, and not an “accident” as the Paris police initially claimed.