One of the questions I’ve been wondering about recently is:
For World Models*, what should the model weights contain at the start of a rollout?
*I’m referring to world models in the robotics sense, rather than the pixel-generation game engine sense
This may feel obvious: the world model is, by definition, a model of the world our agent finds itself in. It’s reasonable, therefore, to say it should contain information on gravity; the definition of a kitchen; where to look for ketchup… But there is a broader question here, which in a sense mimics the nature/nurture debate (what instincts and knowledge are babies born with vs what they learn through observation and interaction), alongside a question of utility - nobody wants to buy a self-driving car that hasn’t yet learned how to drive!
The argument for including domain knowledge in weights is obvious: we’re shipping a capability, somewhat like shipping a bit of incredibly complex stochastic code. We can do robust evals and somewhat predict long-form behaviour of the system given reasonable inputs.
The argument against including domain knowledge in weights, and instead using weights for meta-learning (i.e. weights and architecture allow the agent to learn how/what to learn), is that it will hopefully produce systems that can efficiently learn new tasks and adapt to environmental changes from rollout experience alone.
Let’s illustrate the two with a familiar text-generation approach.
Knowledge In Weights
This is, broadly, the status-quo: I would like to train my model such that it can predict the next token given some input context. This means there has to be world knowledge encoded in the model, for example, what comes next here?
The most famous dictator of Rome was …
Most of us will know the answer to this, but only because of knowledge we have accrued in our experience. We weren’t “born” knowing it (Boltzmann Brain aside).
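To make “knowledge in weights” concrete, here is a deliberately tiny sketch: a trigram counter standing in for a language model. Everything in it is an assumption for illustration (the toy `CORPUS`, and the hypothetical `train_trigram` and `predict_next` helpers); real LLMs are nothing like this simple, but the point carries over: the model can only complete a prompt if the answer was baked into its parameters during training.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: the "world knowledge" we choose to bake in.
CORPUS = [
    "the capital of france is paris",
    "the capital of italy is rome",
]


def train_trigram(sentences):
    """Count which token follows each pair of tokens.

    The resulting counts play the role of the model's weights.
    """
    weights = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for first, second, nxt in zip(tokens, tokens[1:], tokens[2:]):
            weights[(first, second)][nxt] += 1
    return weights


def predict_next(weights, context):
    """Return the most frequent continuation of the last two context tokens.

    Returns None when the weights hold no knowledge about this context.
    """
    key = tuple(context.split()[-2:])
    if key not in weights:
        return None
    return weights[key].most_common(1)[0][0]


trained = train_trigram(CORPUS)
untrained = train_trigram([])

print(predict_next(trained, "the capital of france is"))
print(predict_next(untrained, "the capital of france is"))
```

The trained model answers from its counts, while the untrained one has nothing to say about the same prompt. Given a suitable corpus, the same mechanism is what would complete the Rome prompt above.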
This methodology has created some truly impressive systems - models capable of answering even the most complicated questions, but also capable of incredibly convincing hallucinations (perhaps a bit like some humans…). These models are also surprisingly brittle (e.g. the “how many ‘r’s in strawberry” meme), and there is some evidence that we’re hitting the limit of scaling laws.
The other aspect here is the changing focus of what we ask of models. Completing the sentence or question answering has given way to “agentic X”, where we expect models to be able to interact with complex, unknown external systems to produce reliable outputs, necessitating ever more custom post-training recipes.
What, then, about meta learning?
Learning ability in weights
Let’s assume that it is possible to train a model capable of efficiently “figuring out” tasks. Something that perhaps originally targeted the ARC-AGI 3 benchmark (which is designed to test human-like abstract reasoning and adaptation, but I’d bet other animals would be an interesting addition to their leaderboards, if possible), but has moved to a more general ability. Let’s call the weights (and architecture) of this model the “base”.
Prompting this base model with the same context as above (“The most famous dictator of Rome was …”) will produce nonsense. The base does not yet know the rules of the game sufficiently to understand what we are asking (or, presumably, what the context even represents). We would need to put it through a school-like curriculum from learning to spell, to learning history and finally onto Roman dictators, with appropriate intermediate rewards such that the agent (which is now the base with its accumulated experience) learns appropriately and efficiently. The agent would need to accumulate this knowledge in some lifetime state (e.g. an infinitely growing context, RNN hidden state, some novel process like Hope), and even the rules for updating that state may need to be learned through experience (with some intrinsic bias from the base).
It’s a bit brain melting.
However, once we have put our agent through this curriculum and we have the hidden state as of the final “test”, we can freeze this output to be shipped to multiple end users, who can then re-initialize the agent in their own environments, effectively creating “clones” with full “memories” of their previous rollout, but able to evolve from there (or be reset as desired).
This clone/reset is a bit like managing context windows in modern LLMs, but again with the learned knowledge (and the skills for acquiring/retaining new knowledge) living in the agent state, not the base (whatever that means).
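The split between a frozen base and a shippable agent state can be sketched in a few lines. This is only a toy under loud assumptions: the hypothetical `Base` holds a single hand-written update rule (a real base would have to meta-learn it), the “lifetime state” is a plain dictionary of running estimates, and the `kettle_boil_seconds` observations are invented. What it does show is the lifecycle: accumulate experience, freeze a snapshot, clone it to several users, let the clones diverge, and reset on demand.

```python
import copy
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Base:
    """Stand-in for frozen weights: it knows *how* to learn, not *what*."""

    learning_rate: float = 0.5  # illustrative value, not tuned

    def update(self, state, key, observed):
        """Return a new state with the estimate for `key` nudged toward `observed`."""
        old = state.get(key, 0.0)
        new_state = dict(state)
        new_state[key] = old + self.learning_rate * (observed - old)
        return new_state


@dataclass
class Agent:
    """The base plus everything it has accumulated during its rollout."""

    base: Base
    state: dict = field(default_factory=dict)

    def experience(self, key, observed):
        # Knowledge lands in the state; the base is never modified.
        self.state = self.base.update(self.state, key, observed)

    def snapshot(self):
        """Freeze the current state so it can be shipped."""
        return copy.deepcopy(self.state)

    def reset(self, snapshot):
        """Discard local experience and return to a shipped snapshot."""
        self.state = copy.deepcopy(snapshot)

    @classmethod
    def from_snapshot(cls, base, snapshot):
        """Create a clone that starts with the snapshot's 'memories'."""
        return cls(base=base, state=copy.deepcopy(snapshot))


base = Base()
teacher = Agent(base)
for _ in range(3):  # the "curriculum"
    teacher.experience("kettle_boil_seconds", 120.0)
shipped = teacher.snapshot()

clone_a = Agent.from_snapshot(base, shipped)
clone_b = Agent.from_snapshot(base, shipped)
clone_a.experience("kettle_boil_seconds", 90.0)  # only clone_a's world differs
print(clone_a.state, clone_b.state)

clone_a.reset(shipped)  # back to the shipped memories
print(clone_a.state)
```

The deep copies are the important design choice here: each clone owns its state, so experience in one deployment never leaks into another or into the shipped snapshot unless we deliberately pool it.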
Why is this exciting?
Perhaps the most compelling reason to be excited about meta-learning is that it’s a potential pathway to generally useful agents. In the same way that (some) dogs can be trained to help with many tasks (guarding, guiding, herding, rescuing, bomb identification, …), a base capable of efficiently learning through experience could, hopefully, be quickly “trained” into a useful agent. One could even have common “pretraining” (learning to digest and communicate in English/Mandarin/Ukrainian/Fusha), before branching agents off into specialisms.
Oh wait, we’re back at encoding knowledge in weights.
Except this is part of the point - we know we want to work towards a useful system like current LLMs, which will obviously require some domain knowledge to function adequately, and now we have an agent which can specialise on a task through ongoing experience. It’s also a system which hopefully (given appropriate feedback mechanisms) will continue to improve at its task over time.
There’ll be a load of really hard complications before we can reliably make useful systems. Some examples include avoiding just creating superstitious pigeons (and avoiding reward hacking in general), but I genuinely believe that there is a lot of interesting work to be done in this space!
To conclude: I think that in some ways LLMs have put the cart before the horse, in the sense of baking knowledge into foundational weights rather than focusing on the ability to learn through experience. Machine Intelligence even has the enviable ability to pool experience across all deployed nodes, not just in a “build a massive corpus of training examples” sense, but in the better-than-vicarious-learning sense of one agent being able to experience and grow from the actions and effects of all other compatible systems.
This is the change in focus being spearheaded (I think?) by people like David Silver at Ineffable Intelligence (which just raised an absolutely monstrous seed round on something like this premise), and also by next-generation benchmarks such as ARC-AGI 3 (again, not quite aligned to the above, but similar). I’m going to be watching arXiv (and a growing collection of research lab blogs) very carefully over the coming year for sure!