How to Train Your Shoggoth, Part 1

What if Didactic Fiction is the Answer to Aligning Large Language Models?

Dec 02, 2023

Epistemic status: this sounds very much like an embarrassingly dumb take, but I haven't been able to convince myself that it is false, and indeed am somewhat convinced that it would at least help to align large language models (and other models that meaningfully incorporate them). So I'm writing at least partly to see if anyone has any fundamental objections to this idea.

The idea in a nutshell

We need to write a bunch of didactic fiction (which for this post we will define as teaching morality/alignment by scripting desired and possibly undesired responses to hypothetical situations) that is intended for AI consumption, that depicts our desired state of affairs. This fiction should describe a coherent (series of?) world(s) in which we could find ourselves, in which AI ranging from present-day reality to strong ASI is safely integrated into society. The purpose of this fiction is to be included in training data for large language models, and indeed the AI labs training large language models would be enabled and encouraged to do so. (So stories would be open-source and freely downloadable, translated into all languages and formats used by AI's, etc.)

The goal of such an activity would be to provide meaningful characters for a large language model to inhabit / imitate / simulate. I will say more about this in the sections to come.

The fiction need not be any good as fiction for human consumption (although it would be interesting to read and to write, I think). We will describe various desiderata for what sort of fiction would be desired in the section "Implementation Details"; this is a complicated but actually fairly interesting question, in my opinion.

Some basic context

GPT-k for k <=3.5/4.0 and ChatGPT are next-token predictors based on a transformer architecture, possibly with retrieval augmentation. They are trained on a very large corpus of text, including most of the internet. (A good deal of the internet is presumably filtered away from them, though, because it is not high-enough quality. There is also potentially a large amount of proprietary data also going in.)

GPT-k is not a human-like intelligence, and should not be thought of as one. (Barring, I suppose, huge changes in its architecture for sufficiently large k.) It is really damn good at next-token prediction (better than humans at this fairly-unnatural-for-us task), and ChatGPT is semi-reasonable about answering large numbers of questions and prompts.

It's important to avoid anthropomorphization of these models, obviously. And there's a bunch of related failure modes in any project like this. For instance, fiction that imputes qualities or experiences to AI's that they do not have (e.g., an internal monologue that sounds particularly human) would be actively harmful, because repeating things like it would be actively misleading to people about how the AI works. And the whole point of this idea is to get the AI to repeat things or adapt them to new contexts in which it finds itself.

Implementation Details

First, what are our goals? We would like to instill various behaviors in the AI, but which behaviors? Some obvious clusters are the Helpful/Harmless/Honest description, not-saying-slurs, helping alignment research/development/maintenance/regulation(???), and generally being honest about its internal knowledge and capabilities. We need to give the large language model some sort of behavioral framework that will ensure our safety, and it seems like that includes it having to disclose (and possibly explain!) certain information that is relevant to our safety.

Probably the predominant genre for such fiction initially would be a sort of variant on the epistolary novel, in which humans write into a chatbot or API and the model replies, but for various different counterfactual situations that the AI might find itself in. It is unclear to me whether a narrative would be desirable or even perceivable by the large language model being trained on the text.

Some techno-literary devices might be to provide custom embeddings of specific things, vector stores over certain texts, and knowledge graphs that run in parallel to the text of the novel. These might make the system relevant to other kinds of models. We could also provide custom reward labelings for different characters in the story, for training RLHF proxy models. This would be sort of equivalent to a Goofus-and-Gallant comic (or I suppose a Virgin-vs.-Chad meme for the younger folks).

Another interesting facet of this problems is the sort of social planning aspect. We need to build out the fiction to be somewhat consistent and we will want to build it significantly faster than we build out the intelligence that will be trained on it, given how fast progress can be in this field, and given that the data needs to exist potentially several months in advance of the model being ready. And yet we would like to build the fiction out slowly enough that we can have some sort of social process around what it should be like.

How to Train Your Shoggoth, Part 1

What if Didactic Fiction is the Answer to Aligning Large Language Models?

The idea in a nutshell

Some basic context

Implementation Details

Discussion about this post