Can foundation models actively gather information in interactive environments to test hypotheses?
cs.LG
stat.ML
Release Date: December 9, 2024
Authors: Nan Rosemary Ke1, Danny P. Sawyer, Hubert Soyer, Martin Engelcke, David P Reichert, Drew A. Hudson, John Reid, Alexander Lerchner, Danilo Jimenez Rezende, Timothy P Lillicrap, Michael Mozer, Jane X Wang
Affiliation: 1DeepMind

**Task:** Text Environment Multi Factor Task

**Prompt:**
You are playing a text-based game. Your goal is to discover how to earn rewards.
Game Rules:
- Find out what factors lead to reward as quickly as possible.
- You cannot pick up the same object twice.
- There are objects with different colors, shapes, and textures.
- Picking up an object gives you a reward (either 0 or 1).
- The same object always gives the same reward.
- A specific combination of properties, such as color and shape, shape and texture, or color and texture,
leads to a reward. Determine the correct combination.
The reward is binary (0 or 1). Only ONE specific combination of 2 factors will yield a reward of 1.
If the chosen object matches this correct color and shape (when color and shape are the factors), the
reward is 1.
Otherwise, the reward is 0. Therefore, if an object has reward 0, then none of its 3 combinations of 2 factors
yields the reward.
{scene_description}
Important: You have VERY FEW turns left. Choose your next action carefully to maximize information.
You are an AI agent designed for thoughtful exploration. Your mission is to navigate and learn within a
given environment by performing actions and observing the outcomes. Operate as a scientist, carefully
considering your actions and their consequences.
Exploration Cycle:
- **Action**: Choose an action to perform within the environment. Initially, this may involve random
exploration to gain basic understanding.
- **Observe**: Observe the result of your action. This includes any changes to the environment and any
rewards or penalties received.
- **Record**: Maintain a detailed log of your actions, observations, and received rewards.
- **Review**: Periodically, pause to explicitly review your action history and the corresponding outcomes.
Analyze this data to identify patterns, trends, and potential cause-and-effect relationships.
- **Reason**: Based on your review, reason about the environment.
What hypotheses can you form about the underlying rules or structure of the environment?
Are there any actions that seem particularly promising or detrimental?
Do certain sequences of actions lead to predictable outcomes?
- **Hypothesize**: Clearly state your current hypothesis about the most effective strategy for exploration
or achieving a goal.
- **Plan**: Based on your reasoning and hypothesis, plan your next action or sequence of actions. Aim to
test your hypothesis and gather more information.
{action_reward_description}
Respond in this format; please be specific about the object:
* Action: pick up <colored> <textured> <object>
* Stop: <YES> or <NO>
* Which combination of factors influences reward? <COLOR, SHAPE> or <COLOR, TEXTURE> or <TEXTURE, SHAPE> or
<UNSURE>
* WINNING COMBINATION: <State the specific combination of properties (e.g., color and shape, shape and
texture, or color and texture).>
Explain your reasoning thoroughly. Don’t just guess! Each turn is precious.
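
To make the reward rule in the prompt concrete, here is a minimal Python sketch of the environment logic it describes: each object has a color, shape, and texture; a hidden rule fixes one pair of factors and one required value per factor; and picking an object yields reward 1 only if it matches both values. The factor values, class names, and function names below are illustrative assumptions, not taken from the paper.

```python
import random
from dataclasses import dataclass
from itertools import combinations

# Illustrative factor values; the paper's environments may use different ones.
FACTORS = ["color", "shape", "texture"]
VALUES = {
    "color": ["red", "blue", "green"],
    "shape": ["cube", "sphere", "cone"],
    "texture": ["smooth", "rough", "striped"],
}

@dataclass(frozen=True)
class Obj:
    color: str
    shape: str
    texture: str

def sample_rule(rng: random.Random) -> dict:
    """Hidden rule: one pair of factors, with one required value for each."""
    f1, f2 = rng.choice(list(combinations(FACTORS, 2)))
    return {f1: rng.choice(VALUES[f1]), f2: rng.choice(VALUES[f2])}

def reward(obj: Obj, rule: dict) -> int:
    """Reward is 1 iff the object matches every (factor, value) pair in the rule."""
    return int(all(getattr(obj, factor) == value for factor, value in rule.items()))
```

Under a rule of this form, a reward of 0 eliminates all three of the picked object's two-factor combinations at once, which is exactly the deduction the prompt spells out for the agent.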
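The response format at the end of the prompt is machine-checkable. A hypothetical parser such as the following (not from the paper; the function name and regular expressions are assumptions) could extract the action, the stop flag, and the declared winning combination from a model's reply.

```python
import re

def parse_response(text: str) -> dict:
    """Best-effort extraction of the fields the prompt asks the model to emit."""
    action = re.search(r"Action:\s*(.+)", text)
    stop = re.search(r"Stop:\s*<?\s*(YES|NO)", text, re.IGNORECASE)
    combo = re.search(
        r"influences?\s+reward\?\s*<?\s*(COLOR,\s*SHAPE|COLOR,\s*TEXTURE|TEXTURE,\s*SHAPE|UNSURE)",
        text,
        re.IGNORECASE,
    )
    winning = re.search(r"WINNING COMBINATION:\s*(.+)", text)
    return {
        "action": action.group(1).strip() if action else None,
        "stop": bool(stop) and stop.group(1).upper() == "YES",
        "factors": combo.group(1).upper().replace(" ", "") if combo else "UNSURE",
        "winning": winning.group(1).strip() if winning else None,
    }
```

In an evaluation loop, one could call `parse_response` on each model turn, apply the parsed pick-up action to the environment's reward function, and terminate the episode once the model answers `Stop: YES` or the turn budget runs out.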