Time-Reversal Provides Unsupervised Feedback to LLMs
Subjects: cs.CL, cs.AI
Release Date: December 3, 2024
Authors: Yerram Varun, Rahul Madhavan¹, Sravanti Addepalli², Arun Suggala², Karthikeyan Shanmugam², Prateek Jain²
Affiliations: ¹Indian Institute of Science; ²Google DeepMind
| Model | Description |
| --- | --- |
| TRLM-Ba | Pre-trained in the reverse token order for previous-token prediction (Alg. 1 in the supplement); the instruction-tuned variant is FLAN fine-tuned (Longpre et al., 2023) in reverse token order. Scores the reversed question given a reversed answer combined with suitable prompts, and generates questions in the reverse direction when conditioned on answers in the reverse direction. Scoring: P(Reverse(Scoring Prompt + Query) \| Reverse(Conditioning Prompt + Answer)) (Alg. 2 in the supplement). Generation: generates Reverse(Query) conditioned on Reverse(Conditioning Prompt + Answer). |
| TRLM-Fo | Pre-trained in the usual forward token order. Scores the question given the answer using a prompt, and generates from the distribution conditioned on the answer. Scoring: P(Query \| Answer + Conditioning Prompt) (Alg. 3 in the supplement). Generation: generates the query conditioned on the answer and a conditioning prompt. |
| TRLM-FoBa (Reverse) | Pre-trained in both forward and reverse token order (Alg. 4 in the supplement), so it understands text in both directions. Scoring: identical to TRLM-Ba. Generation: identical to TRLM-Ba. |
| TRLM-FoBa (Forward) | Pre-trained in both forward and reverse token order. Scoring: identical to TRLM-Fo. Generation: identical to TRLM-Fo. |
| Self Scoring | The model used to generate a given response also scores responses given queries, in the conventional forward scoring direction. Scoring: the model's own perplexity scores serve as feedback for selecting responses. |
| Forward Baseline | A conventional forward model trained for next-token prediction on the same training corpus and with the same model class as TRLM. Scoring: whereas self-scoring uses the perplexity of the generator model itself, this baseline uses the perplexity of a separate forward model. |
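
To make the two feedback directions concrete, the sketch below scores a query–answer pair both ways with a Hugging Face-style causal LM: forward scoring in the TRLM-Fo style, P(Query \| Answer + Conditioning Prompt), and reverse scoring in the TRLM-Ba style over token-reversed text. This is a minimal illustration under assumptions, not the paper's code: the checkpoint names, prompts, and the `conditional_logprob` and `reverse_tokens` helpers are placeholders, and a real TRLM-Ba would be a checkpoint actually pre-trained on reversed token order.

```python
# Minimal sketch (assumed setup, not the paper's code) of forward vs. reverse scoring
# with a Hugging Face-style causal LM. Checkpoints and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder checkpoint
forward_model = AutoModelForCausalLM.from_pretrained("gpt2")
reverse_model = forward_model  # stand-in: a real TRLM-Ba is pre-trained on reversed tokens

def conditional_logprob(model, context: str, target: str) -> float:
    """Sum of log P(target tokens | context tokens) under a causal LM."""
    ctx = tokenizer(context, return_tensors="pt").input_ids
    tgt = tokenizer(target, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([ctx, tgt], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    start = ctx.shape[1] - 1                                # first position predicting a target token
    return sum(log_probs[pos, ids[0, pos + 1]].item() for pos in range(start, ids.shape[1] - 1))

def reverse_tokens(text: str) -> str:
    # Word-level reversal as a stand-in for the paper's token-order reversal.
    return " ".join(reversed(text.split()))

query = "What is the capital of France?"
answer = "Paris is the capital of France."

# TRLM-Fo style: P(Query | Answer + Conditioning Prompt) under a forward model.
fo_score = conditional_logprob(
    forward_model,
    context=answer + "\nWhich question could this text answer?\n",  # illustrative prompt
    target=query,
)

# TRLM-Ba style: P(Reverse(Scoring Prompt + Query) | Reverse(Conditioning Prompt + Answer))
# under a model pre-trained on reversed token order.
ba_score = conditional_logprob(
    reverse_model,
    context=reverse_tokens("Answer: " + answer),
    target=reverse_tokens("Question: " + query),
)
print(f"forward (TRLM-Fo) score: {fo_score:.2f}, reverse (TRLM-Ba) score: {ba_score:.2f}")
```

In the paper's setup the reverse-direction score comes from a model actually trained on reversed token order (TRLM-Ba, or the reverse mode of TRLM-FoBa); the same forward checkpoint stands in for both directions here only to keep the sketch self-contained and runnable.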