GPTs Can Predict the Future… So Why Can't ChatGPT Predict Yours?
With the arrival of reasoning models like o3, we are entering the era of artificial generic intelligence: systems that are “Great at Test Taking, But Terrible at Advising.” These models can now:
Solve novel math problems that would have once required PhD-level expertise.
Pass professional certification tests such as bar exams and medical board exams, even outperforming most human candidates in controlled simulations [1].
Despite these astonishing capabilities, a clear issue arises when these systems face real-world, personalized queries. Ask an AI how best to manage your chronic migraines given your specific health history, or how to expand your business in a niche market on a tight budget, and the advice you receive is often disappointingly generic, non-committal, or thinly reasoned. It may correctly list common treatment options or strategic approaches, but it struggles to deliver guidance that meaningfully predicts and improves real-world outcomes. There is no feedback loop.
This discrepancy points to a deeper truth: we are not yet building Artificial General Intelligence, but something closer to Artificial Generic Intelligence. These models excel at regurgitating plausible-sounding information drawn from a vast corpus of generalized knowledge. What they lack is the ability to forecast, personalize, and optimize for long-term consequences—the very attributes that matter most when making high-stakes personal or professional decisions.
2. The Rise of Artificial Generic Intelligence
AI systems like o3 are being interpreted by many as stepping stones toward Artificial General Intelligence (AGI)—the long-awaited capability for machines to match or exceed human cognitive performance across all tasks. However, what we currently have is not a general problem-solver but a very powerful pattern matcher trained on public text. As a result, what is emerging is not AGI, but Artificial Generic Intelligence. Consider a simple example: even if there is a weight-loss strategy that would work for me specifically, the model cannot recommend it, because it knows nothing about me.
These models are trained to perform well across a vast number of tasks, but they operate largely by recombining existing knowledge from their training data. They can answer questions, summarize documents, write software, and draft essays. However, when asked to reason through a user's unique circumstances, they often default to low-risk, one-size-fits-all suggestions. This happens for two fundamental reasons:
Lack of personalization: The model doesn't incorporate personal data unless explicitly given in the prompt, and even then, it lacks persistent memory and contextual understanding over time.
Lack of outcome evaluation: The model has no built-in mechanism to verify whether its advice worked or failed months later.
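To make both gaps concrete, here is a minimal sketch in Python. The call_llm function is a stand-in for any chat-completion API, not a real library call, and the context fields are invented for illustration: on one side, today's stateless request; on the other, the persistent context and outcome log an advisor-style system would need.

```python
# Minimal sketch (call_llm is a placeholder, not a real API).

def call_llm(prompt: str) -> str:
    """Placeholder for a single chat-completion request."""
    raise NotImplementedError

# 1. How advice is requested today: stateless, no memory, no outcome tracking.
def generic_advice(question: str) -> str:
    return call_llm(question)  # the model sees only this one message

# 2. What an outcome-aware advisor would need: persistent user context plus a
#    log of what was recommended and whether it eventually worked.
user_context = {
    "history": ["chronic migraines", "triptans ineffective"],
    "constraints": ["tight budget", "limited time for appointments"],
}
outcome_log = []  # records of past recommendations and their eventual results

def personalized_advice(question: str) -> str:
    prompt = (
        f"User context: {user_context}\n"
        f"Past outcomes: {outcome_log}\n"
        f"Question: {question}"
    )
    recommendation = call_llm(prompt)
    outcome_log.append({"recommendation": recommendation, "worked": None})  # scored later
    return recommendation
```

Nothing in today's deployed systems fills in that "worked" field, which is exactly the second gap above.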
As a result, users receive advice that sounds correct in the moment, but may have little to no predictive power about what will actually happen. In this way, the intelligence is broad, but not deep.
3. Why Existing Benchmarks Are Misleading
The rapid evolution of language models has been validated using standardized benchmark datasets such as:
MMLU (Massive Multitask Language Understanding): Evaluates performance across 57 tasks ranging from physics to law, often in multiple-choice formats.
GLUE/SuperGLUE: Benchmarks for general language understanding.
Code generation benchmarks: Measure ability to generate syntactically correct and executable code.
SQuAD and other QA tasks: Test span-extraction and factual recall from texts.
These benchmarks are useful for gauging immediate factual accuracy or task completion, but they fail to answer a more critical question:
Does the model's advice actually work in the real world?
For example:
A model can identify a likely treatment for migraines from medical literature. But will that treatment work for a specific patient, with specific comorbidities and stress triggers?
A model can suggest common marketing tactics for business growth. But will those tactics work for a startup targeting an underserved niche with minimal budget?
These benchmarks do not track delayed effectiveness, user context, or real-life constraints. They reinforce models that are correct on paper, not correct in reality.
4. The Limitations of RLHF and the Performative Trap
To make model outputs more coherent and “aligned” with human preferences, companies rely on Reinforcement Learning from Human Feedback (RLHF). Human raters score answers based on clarity, helpfulness, tone, and factuality. While this has led to more polished and polite interactions, it introduces new challenges:
4.1 Immediate Feedback Bias
Raters can only evaluate what they see immediately. They don't follow up in three weeks or three months to see if the AI's advice had a positive effect. This leads to models that are trained to “look good now,” not “be right later.”
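A toy sketch of why this bias is structural (this is not OpenAI's actual pipeline, and the 0.3/0.7 weights below are made up): the reward that RLHF optimizes is a function of the rater's immediate score alone, and there is no slot where a delayed outcome could enter.

```python
from typing import Optional

# Toy illustration, not an actual RLHF implementation: the reward that shapes
# the policy is computed entirely from an immediate human judgment of the text.

def immediate_reward(answer: str, rater_score: float) -> float:
    # rater_score reflects clarity, tone, and apparent helpfulness right now
    return rater_score

def outcome_aware_reward(answer: str, rater_score: float,
                         outcome_after_90_days: Optional[float]) -> float:
    # What outcome-oriented training would need: a delayed term. Today's
    # pipelines never observe it, so in practice the signal collapses back
    # to "look good now". The weighting here is purely illustrative.
    if outcome_after_90_days is None:
        return immediate_reward(answer, rater_score)
    return 0.3 * rater_score + 0.7 * outcome_after_90_days
```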
4.2 Sycophancy and Risk Aversion
To avoid disagreement or downvotes, models are trained to be overly agreeable, avoiding bold or unconventional advice. This reduces their capacity to offer novel or high-risk recommendations, even when those might be the best option for a user's unique situation.
4.3 Performative vs. Predictive
RLHF teaches the model to perform helpfulness—not to achieve helpful outcomes. This distinction is crucial. The model might sound like it understands your question deeply, but its advice lacks grounding in what will actually work over time.
5. Towards Outcome-Oriented AI: Personalized, Predictive, and Feedback-Driven
To fulfill the true promise of AI—not just as assistants, but as advisors—we must radically rethink how we train and evaluate these systems. The future of AI lies in systems that are:
5.1 Personalized
Advice must be tailored not just to a general demographic, but to the individual. This means integrating contextual data like:
Medical history
Location
Career trajectory
Past interactions
Behavioral patterns (from sensors, logs, etc.)
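As a rough sketch (the field names are illustrative, not a proposed schema), the signals listed above could be carried across sessions in a structured record along these lines:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UserContext:
    """Illustrative container for the signals listed above; field names are hypothetical."""
    medical_history: List[str] = field(default_factory=list)    # diagnoses, medications
    location: str = ""                                           # regional constraints, providers
    career_stage: str = ""                                       # trajectory and goals
    past_interactions: List[str] = field(default_factory=list)  # advice already given
    behavioral_signals: List[str] = field(default_factory=list) # sensor- and log-derived patterns
```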
5.2 Predictive
AI should forecast outcomes and assign probabilities:
“There is an 80% chance that switching to Treatment A will reduce your migraines within 6 weeks.”
“Based on your cash flow and market trends, delaying your product launch by 2 months increases your expected ROI by 15%.”
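One way to make statements like these checkable later, sketched here with an invented schema rather than any established format, is to emit each recommendation as a structured claim with a probability, a measurable metric, and a date on which it can be scored:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Forecast:
    """A recommendation expressed as a verifiable prediction (illustrative schema)."""
    action: str            # e.g. "switch to Treatment A"
    metric: str            # e.g. "migraine days per month"
    predicted_effect: str  # e.g. "reduced by at least 50%"
    probability: float     # the model's stated confidence, 0..1
    check_on: date         # when the claim can be scored

# The migraine example above, restated as a scoreable claim:
migraine_forecast = Forecast(
    action="switch to Treatment A",
    metric="migraine days per month",
    predicted_effect="reduced by at least 50%",
    probability=0.80,
    check_on=date.today() + timedelta(weeks=6),
)
```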
5.3 Grounded in Feedback Loops
This requires time-delayed feedback. Systems must be able to:
Follow up after a period (e.g., 30, 60, 90 days)
Evaluate whether the recommendation worked
Learn from that outcome, not just initial human judgment
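A skeleton of that loop, reusing the illustrative Forecast record above and assuming a hypothetical observe_outcome hook that reports what actually happened, might look like this:

```python
from datetime import date

def observe_outcome(forecast) -> bool:
    """Hypothetical hook: asks the user (or reads logs) whether the predicted effect occurred."""
    raise NotImplementedError

def run_feedback_loop(open_forecasts, training_examples):
    """Score forecasts whose follow-up date has arrived and turn them into training signal."""
    for forecast in list(open_forecasts):
        if date.today() >= forecast.check_on:             # the 30/60/90-day follow-up is due
            worked = observe_outcome(forecast)            # did the advice actually help?
            training_examples.append((forecast, worked))  # learn from the outcome itself,
            open_forecasts.remove(forecast)               # not just the initial human rating
```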
Emerging research efforts such as ForecastBench [2], AdvisorQA [3], and ForesightGPT [4] are paving the way for this shift:
ForecastBench challenges models to make predictions about future events and tracks outcomes.
AdvisorQA uses real advice-seeking questions and ranks answers based on long-term helpfulness.
ForesightGPT demonstrates how domain-specific data (e.g., patient health timelines) can be used for high-precision forecasting in healthcare.
These initiatives show that it is possible to evaluate and train models for outcome-driven performance, not just textual coherence.
Conclusion: From Assistants to Advisors
As powerful as today’s LLMs are, they remain limited by the nature of their training data and evaluation metrics. Without personalization, prediction, and feedback, they will continue to offer plausible but unproven suggestions—a kind of performative wisdom that falls apart under real-world conditions.
To unlock AI’s true potential, we must reimagine these systems not as all-knowing generalists, but as adaptive, outcome-aware advisors. They should guide us in decisions that truly matter—career changes, chronic health conditions, financial planning—not with recycled advice, but with insights grounded in data, tailored to the individual, and refined over time.
Until then, we remain in the era of Artificial Generic Intelligence—broad, powerful, and yet profoundly impersonal.
References
[1] Reddit. OpenAI’s o3 model solves novel math problems previously requiring PhD-level understanding. 2025.
[2] Karger et al. ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities. arXiv:2409.19839.
[3] Kim et al. AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering. NAACL 2025.
[4] Kraljevic et al. Foresight—Generative Pretrained Transformer for Modelling Patient Timelines Using EHRs. arXiv:2212.08072.
[5] OpenAI. Reinforcement Learning from Human Feedback (RLHF). Technical documentation.