In this post, I publish (the beginnings of) an open-source implementation of my best guess at the notorious Q* algorithm that Twitter is talking about.
https://github.com/heretotalk-ai/loki-models/blob/main/src/loki/generation.py#L110
Why do I think this is a good guess at what Q* is? It’s in the name: Q* is almost certainly an A*-inspired search algorithm that uses a Q-function as its heuristic, letting it efficiently decode sequences that are both highly likely (the prefix score under the LLM’s likelihood function) and highly likely to be rewarded (the quantilizer’s predicted Q-score-to-go under the reward model).
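To make that concrete, here is the scoring rule I have in mind, written out as a tiny Python helper. The function name, the lam scaling factor, and the exact way the two terms are combined are my guesses, not anything OpenAI has confirmed:

```python
def combined_score(prefix_logprob: float, q_value: float, lam: float = 1.0) -> float:
    """A*-style score for a partial sequence (prefix).

    prefix_logprob -- sum of token log-probabilities under the LLM
                      (the "cost so far" in A* terms).
    q_value        -- the quantilizer's predicted reward-to-go for this prefix
                      (the heuristic).
    lam            -- guessed scaling factor trading likelihood off against reward.
    """
    return prefix_logprob + lam * q_value
```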
What is a quantilizer? It’s a function (powered by a pretrained Transformer encoder finetuned on reward prediction) that predicts the reward of an entire sequence given only access to a prefix. It thus gives the model a “gut feeling” for how well it is doing mid-generation.
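Here is a minimal sketch of what I mean, in the Hugging Face ecosystem: a pretrained encoder with a single-output regression head that you would finetune to predict the final reward of a sequence from its prefix. The roberta-base checkpoint and the q_score helper below are my own placeholders, not anything from OpenAI:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A pretrained encoder with a 1-output regression head. The head is randomly
# initialized here; it would need to be finetuned on (prefix, final reward) pairs.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
quantilizer = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1  # num_labels=1 => regression head
)

@torch.no_grad()
def q_score(prefix_text: str) -> float:
    """Predict the reward of the *full* sequence given only this prefix."""
    inputs = tokenizer(prefix_text, return_tensors="pt", truncation=True)
    return quantilizer(**inputs).logits.squeeze().item()
```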
Am I claiming to have invented Q* first? Not quite: I had never had the idea of adding a scaled version of the Q-function coming out of the quantilizer to the log-likelihood of the prefix and then throwing the result through beam search. Still, I would bet $100 against a nickel that this is the core insight in OpenAI’s new breakthrough. (Bet offer only valid to high-level LessWrong posters with >= 10K karma, and only the first 3 of them.)
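And here is a rough sketch of the decoding loop I am describing: vanilla beam search over a causal LM, except each candidate beam is ranked by its cumulative log-likelihood plus lam times the quantilizer’s Q-score. The gpt2 checkpoint, the q_score callable (e.g. the quantilizer sketch above), and every hyperparameter are stand-ins; this is my guess at the technique, not OpenAI’s code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

lm_tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def q_star_decode(prompt, q_score, max_new_tokens=20, num_beams=4, topk=8, lam=1.0):
    """Beam search where beams are ranked by log p(prefix) + lam * Q(prefix)."""
    input_ids = lm_tokenizer(prompt, return_tensors="pt").input_ids[0]
    beams = [(input_ids, 0.0)]  # (token ids, cumulative log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for ids, logp in beams:
            # Next-token log-probabilities under the LLM for this beam.
            next_logprobs = torch.log_softmax(lm(ids.unsqueeze(0)).logits[0, -1], dim=-1)
            top = torch.topk(next_logprobs, topk)
            for tok_logp, tok in zip(top.values, top.indices):
                candidates.append((torch.cat([ids, tok.unsqueeze(0)]), logp + tok_logp.item()))
        # Rank expansions by the combined A*-style score and keep the best beams.
        beams = sorted(
            candidates,
            key=lambda c: c[1] + lam * q_score(lm_tokenizer.decode(c[0])),
            reverse=True,
        )[:num_beams]
    return lm_tokenizer.decode(beams[0][0], skip_special_tokens=True)
```

Usage would look something like `q_star_decode("The proof proceeds by", q_score, lam=0.5)`, where tuning lam trades off fluency under the LLM against predicted reward from the quantilizer.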
I will post an official and clean writeup of the equations when I get off work today, but this is hopefully enough for any serious user of the Hugging Face ecosystem to use the new technique to their heart’s content.