Part 2: Delayed Reinforcement Learning from Human Feedback
Why outcomes matter and how companies will go from "API wrapper" to crowdsourcing intelligence with DRLHF
“So what do I think five years from now we'll be talking about? I think we'll be talking about systems that use… use the crowd to learn something…. So that model, which is you crowdsource information in, you learn it, and then you sell it, is in my view a highly likely candidate for the next hundred billion dollar corporations.”
- Eric Schmidt, former CEO of Google, Source
Summary
In part one of this series, I made the claim that “AI Still Sucks” and that personalized advice is the most lucrative thing AI still can't do. But to give useful personalized advice, AI must first be trained on outcome-validated data, not outputs that merely sound correct. Current systems like ChatGPT have been fine-tuned using Reinforcement Learning from Human Feedback (RLHF), but this approach relies on asking users for feedback immediately, which does not capture the true effectiveness and long-term impact of AI-generated outputs. Instead, it optimizes for outputs that “sound the best” but are not actually the most useful. This article introduces Delayed Reinforcement Learning from Human Feedback (DRLHF), a simple approach that intentionally delays asking for feedback until the outcomes of the AI's outputs can be fully assessed, allowing for a more comprehensive evaluation of the AI's performance.
1. Introduction
RLHF plays a crucial role in recent developments in AI such as ChatGPT by leveraging human feedback to guide the learning process of models. While this immediate feedback loop helps in quickly adjusting the model's responses based on user satisfaction and perceived factuality, it inherently lacks the depth to evaluate the long-term utility and accuracy of the advice or information provided. This article proposes DRLHF as an alternative, focusing on delaying feedback until the real-world outcomes, and therefore the effectiveness, of the AI's outputs can be fully assessed.
2. Limitations of Immediate Feedback in RLHF
Immediate feedback, such as the upvotes or downvotes ChatGPT receives immediately after generating an output, serves as a rapid indicator of user satisfaction or dissatisfaction. However, this immediate reinforcement model has its limitations (a small sketch contrasting the two feedback signals follows the list):
Short-term Focus: Immediate feedback emphasizes the user's initial reaction, potentially overlooking the long-term correctness of the information provided.
Overgeneralization: The reliance on immediate feedback can lead AI models to favor responses that are broadly acceptable rather than tailored to specific user needs, reducing the personalized effectiveness of the advice.
Lack of Outcome Validation: Immediate feedback does not consider whether the advice or information provided by the AI actually works in real-world scenarios.
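To make the contrast concrete, here is a minimal sketch of the two kinds of training signal: the immediate reaction RLHF-style systems record versus the delayed outcome DRLHF would record. The class and field names are hypothetical illustrations, not an existing API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ImmediateFeedback:
    """What RLHF-style systems typically capture: a reaction at response time."""
    response_id: str
    thumbs_up: bool            # did the answer *sound* good right now?
    recorded_at: datetime


@dataclass
class DelayedOutcomeFeedback:
    """What DRLHF would capture instead: the real-world result, much later."""
    response_id: str
    outcome_achieved: bool     # did acting on the advice actually work?
    outcome_notes: str         # e.g. "asked for the raise, got 8% after negotiation"
    recorded_at: datetime      # deliberately long after the response was generated
```

The only structural difference is the label and its timing, but that difference determines whether the model is rewarded for sounding right or for being right.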
3. The Garbage In, Garbage Out Problem
In addition to an inherently limited reinforcement signal, most LLMs have been trained on web data from sources like Reddit and Quora. While this data is vast and diverse, it often contains advice or information that sounds correct but may not be accurate or effective in practice. Consider a common Reddit relationship-advice question, where the poster asks whether to leave his or her spouse. Many of these threads recommend leaving when it would have been better to stay and work through the issue. ChatGPT, having been trained on this data, will emulate these inaccurate Reddit responses in its own outputs.
This "garbage in, garbage out" problem can lead to AI models providing seemingly convincing but ultimately flawed outputs. DRLHF can help mitigate this issue by focusing on the real-world outcomes of AI-generated advice. By delaying feedback until the effectiveness of the advice can be assessed, DRLHF allows for the identification and correction of inaccurate or ineffective outputs, leading to more reliable and trustworthy AI systems.
4. Conceptual Framework for DRLHF
DRLHF proposes a shift in feedback timing to after the outcomes of the AI's outputs have been realized. This section outlines the conceptual shift required for DRLHF; a minimal code sketch tying the three pieces together follows the list:
a. Comprehensive context: For DRLHF to work effectively, it needs to be trained on the full context of the situation and the output. This includes gathering relevant details about the user's background, preferences, and specific challenges. For example, when someone asks for career advice, the AI should have access to information such as their job history, skills, personality type, and any work-related struggles they face. By providing a more comprehensive context, the AI can generate more personalized and relevant advice.
b. Scheduled feedback: Once the AI generates an output, there needs to be a mechanism to schedule a follow-up request for feedback after a suitable time period. This could be in the form of a popup, email, or any other communication method that allows the user to provide feedback on the effectiveness of the advice or information they received. The timing of the feedback request should be based on the nature of the advice and the expected time frame for observing its outcomes.
c. Outcome-based evaluation: The feedback collected should focus on the real-world outcomes and effectiveness of the AI's outputs. Users should be asked to assess whether the advice or information provided by the AI was helpful, accurate, and relevant to their specific situation. This outcome-based evaluation will help the AI model learn from its successes and failures, enabling it to refine its outputs and provide more effective advice in the future.
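The following is a minimal sketch of how these three components could fit together in code. All class and function names are hypothetical illustrations of the concept, not an existing library or the article's implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class AdviceInteraction:
    # (a) Comprehensive context: who asked, their background, and what was advised.
    user_id: str
    context: dict              # e.g. {"job_history": [...], "skills": [...], "struggle": "..."}
    question: str
    advice: str
    created_at: datetime = field(default_factory=datetime.utcnow)
    followup_due: datetime | None = None
    outcome_score: float | None = None   # filled in later by (c)


def schedule_followup(interaction: AdviceInteraction, delay: timedelta) -> AdviceInteraction:
    """(b) Scheduled feedback: decide *when* to ask, based on the nature of the advice."""
    interaction.followup_due = interaction.created_at + delay
    return interaction


def record_outcome(interaction: AdviceInteraction, worked: bool, notes: str = "") -> AdviceInteraction:
    """(c) Outcome-based evaluation: the label is whether the advice worked, not whether it sounded good."""
    interaction.outcome_score = 1.0 if worked else 0.0
    interaction.context["outcome_notes"] = notes
    return interaction


# Usage: career advice plausibly needs weeks before its outcome is observable.
ix = AdviceInteraction(
    user_id="u123",
    context={"job_history": ["analyst", "senior analyst"], "struggle": "stalled promotion"},
    question="How should I ask for a raise?",
    advice="Document your last two quarters of impact, then book a dedicated meeting with your manager.",
)
ix = schedule_followup(ix, delay=timedelta(weeks=4))
# ...roughly four weeks later, a popup or email asks what actually happened:
ix = record_outcome(ix, worked=True, notes="Got a 7% raise after presenting the impact summary.")
```

The key design choice is that the outcome label is attached to the same record that holds the full context and the advice, so the eventual training signal pairs "what was known, what was said" with "what actually happened."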
5. Practical Applications of DRLHF
DRLHF can enhance AI-generated advice in various domains by focusing on personalized, outcome-validated feedback. Consider three common scenarios: asking for a raise, choosing a diet for weight loss, and resolving a specific argument in a relationship. In the case of asking for a raise, traditional chatbots using RLHF might provide generic advice, while DRLHF would gather comprehensive context about the employee's performance and the boss's personality to offer tailored guidance based on what has actually worked in similar situations in the training data. For selecting a weight-loss diet, DRLHF would collect information on the user's eating habits, preferences, and health conditions, and adjust its recommendations as outcomes come in. When addressing a specific relationship argument, DRLHF would delve into the couple's communication styles, recurring issues, and individual perspectives, refining its advice to target the root cause of the problem. By incorporating DRLHF in these practical applications, AI can provide more effective, personalized advice that adapts to the unique needs and circumstances of each user.
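One way the follow-up timing could differ across these scenarios is a simple lookup of expected outcome horizons. The specific windows below are illustrative guesses, not figures from the article.

```python
from datetime import timedelta

# Illustrative outcome horizons per advice domain (assumed values, not prescriptions):
# the follow-up request is scheduled only once the outcome could plausibly be known.
FOLLOWUP_DELAYS = {
    "salary_negotiation": timedelta(weeks=4),     # raise requests take weeks to resolve
    "weight_loss_diet": timedelta(weeks=8),       # meaningful weight change takes months
    "relationship_argument": timedelta(days=14),  # did the specific conflict actually resolve?
}


def followup_delay(domain: str) -> timedelta:
    """Fall back to a generic two-week window when the domain is unknown."""
    return FOLLOWUP_DELAYS.get(domain, timedelta(weeks=2))
```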
6. The Future $100B Company
Referring back to Eric Schmidt's prediction that the next $100B company will use the crowd to train AI models, I would argue that these companies will likely launch their initial offering as an API wrapper around existing pre-trained models. This initial step allows them to deliver value while collecting and crowdsourcing feedback data to learn from and subsequently commercialize. As these companies evolve, the integration of Delayed Reinforcement Learning from Human Feedback (DRLHF), a concept introduced here to address the limitations of immediate-feedback systems like RLHF, will play a crucial role. By utilizing the nuanced, outcome-validated data collected over time from their user base, these firms can refine, fine-tune, or even develop proprietary models that not only surpass the capabilities of their initial offerings but also deliver genuinely personalized and effective advice. This progression from leveraging existing models to creating bespoke, sophisticated systems illustrates a strategic pathway to significant valuation and market impact, embodying Schmidt's vision of using the crowd to learn and innovate.
7. Personalized AI
The ultimate goal of Delayed Reinforcement Learning from Human Feedback (DRLHF) is to enable the development of truly personalized AI systems that can provide tailored, effective advice to individual users. By incorporating comprehensive context, scheduled feedback, and outcome-based evaluation, DRLHF lays the foundation for AI that can adapt to the unique needs, preferences, and circumstances of each user. Personalized AI powered by DRLHF has the potential to revolutionize various aspects of our lives, from career guidance and financial planning to health and wellness advice.
By leveraging the vast amounts of data collected through user interactions and feedback, these AI systems can identify patterns, correlations, and best practices that are specific to individual users or user segments. For example, a personalized AI career coach could analyze a user's skills, experience, personality traits, and career goals to provide customized job recommendations, interview tips, and networking strategies. The AI would continuously learn from the user's experiences and outcomes, refining its advice over time to optimize the user's career trajectory.
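As a rough illustration of how that continuous learning loop might close, outcome-validated interactions could be filtered into a supervised fine-tuning set. This sketch reuses the hypothetical AdviceInteraction record from the earlier example and is an assumption about one possible pipeline, not the article's prescribed method.

```python
# Rough sketch: keep only interactions whose real-world outcomes were positive,
# and format them as (context + question -> advice) fine-tuning examples.
def build_finetune_dataset(interactions):
    examples = []
    for ix in interactions:
        if ix.outcome_score is None or ix.outcome_score < 1.0:
            continue  # skip unvalidated or failed advice: this is the DRLHF filter
        examples.append({
            "prompt": f"User context: {ix.context}\nQuestion: {ix.question}",
            "completion": ix.advice,
        })
    return examples
```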
8. Conclusion
Delayed Reinforcement Learning from Human Feedback (DRLHF) presents a promising approach to address the limitations of current AI systems that rely on immediate feedback. By focusing on comprehensive context, scheduled feedback, and outcome-based evaluation, DRLHF enables the development of AI that can provide truly personalized and effective advice. As companies leverage this approach to collect and learn from nuanced, outcome-validated data, they have the potential to create proprietary models that surpass the capabilities of existing offerings. The integration of DRLHF in practical applications across various domains, from career guidance to health and wellness, can revolutionize the way AI interacts with and benefits individuals. As Eric Schmidt predicted, the next $100B companies will likely use the crowd to train AI models, and DRLHF will play a crucial role in their success. By embracing this approach, we can look forward to a future where personalized AI becomes an integral part of our lives, helping us make better decisions and achieve our goals more effectively.