Abstract: Large language models (LLMs) achieve strong performance on challenging reasoning benchmarks, yet they can also make basic reasoning mistakes. One hypothesis is that the increasingly high, nearly saturated performance on common reasoning benchmarks could be due to memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using a dynamically generated logical reasoning benchmark based on Knights and Knaves (K&K) puzzles.
A proposed memorization metric for reasoning tasks, and a new logical reasoning dataset.
To facilitate our memorization study, we propose a new logical reasoning benchmark that supports automatic problem perturbations.
Knights and Knaves (K&K) (Johnson-Laird & Byrne, 1990) is a type of logical puzzle where some characters tell the truth and others only lie. The goal is to infer each character's truthfulness. Based on the K&K puzzle, we design a dynamic benchmark that supports automatic problem generation and perturbation. You can download our data on Hugging Face 🔗.
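As a rough illustration (not the benchmark's actual generator; the statement templates and helper names below are hypothetical), a minimal K&K generator with local perturbations might look like this:

```python
import itertools
import random

# Hypothetical mini-generator: each character is a knight (truth-teller) or a
# knave (liar); statements are single claims, conjunctions, or disjunctions of
# claims of the form "person p is a knight/knave". The real benchmark uses
# richer statement templates and several perturbation types.

def random_statement(n_people):
    op = random.choice(["lit", "and", "or"])
    k = 1 if op == "lit" else 2
    claims = [(random.randrange(n_people), random.choice([True, False])) for _ in range(k)]
    return (op, claims)

def evaluate(statement, assignment):
    op, claims = statement
    values = [assignment[p] == is_knight for p, is_knight in claims]
    return all(values) if op == "and" else any(values) if op == "or" else values[0]

def consistent(statements, assignment):
    # A knight's statement must be true; a knave's statement must be false.
    return all(evaluate(s, assignment) == assignment[i] for i, s in enumerate(statements))

def solve(statements, n_people):
    # Brute-force all 2^n truth assignments.
    return [a for a in itertools.product([True, False], repeat=n_people)
            if consistent(statements, a)]

def generate_puzzle(n_people):
    # Re-sample statements (one per character) until exactly one solution exists.
    while True:
        statements = [random_statement(n_people) for _ in range(n_people)]
        sols = solve(statements, n_people)
        if len(sols) == 1:
            return statements, sols[0]

def perturb(statements, n_people):
    # Local perturbation: replace one character's statement while keeping the
    # puzzle uniquely solvable (the answer may change).
    while True:
        new = list(statements)
        i = random.randrange(n_people)
        new[i] = random_statement(n_people)
        sols = solve(new, n_people)
        if len(sols) == 1 and new[i] != statements[i]:
            return new, sols[0]

puzzle, answer = generate_puzzle(3)
perturbed, new_answer = perturb(puzzle, 3)
```

Because puzzles are sampled procedurally, arbitrarily many unseen problems and locally perturbed variants can be produced for any number of characters.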
From off-the-shelf models to fine-tuned models (fine-tuning with CoTs and fine-tuning with answers only).
FT setup: fine-tuning with detailed synthetic CoT steps and answers (CoT FT), and fine-tuning with answers only (Direct FT). We fine-tune the models separately for each N-people task (1000 training samples each).
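Concretely, the two formats differ only in the fine-tuning target; a rough sketch (the field layout and the `make_example` helper are hypothetical, not the exact training schema) might look like:

```python
# Hypothetical (question, synthetic CoT, answer) triples; the actual setup uses
# 1000 such training samples per N-people task.
puzzles = [
    ("A very special island is inhabited only by knights and knaves. ... "
     "Who is a knight and who is a knave?",
     "Assume Alice is a knight. Then her statement is true, so Bob is a knave. ...",
     "Alice is a knight, and Bob is a knave."),
]

def make_example(question, cot, answer, use_cot):
    # CoT FT: the target contains detailed synthetic reasoning steps plus the answer.
    # Direct FT: the target contains the answer only.
    target = f"{cot}\n{answer}" if use_cot else answer
    return {"messages": [{"role": "user", "content": question},
                         {"role": "assistant", "content": target}]}

cot_ft_data = [make_example(q, cot, ans, use_cot=True) for q, cot, ans in puzzles]
direct_ft_data = [make_example(q, cot, ans, use_cot=False) for q, cot, ans in puzzles]
```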
Compared to CoT FT, learning from answers only (Direct FT), without detailed reasoning steps, is intuitively more challenging, as the model has to come up with the reasoning procedure on its own. It turns out that models can learn to solve K&K puzzles well directly from observing only question-answer pairs.
Can brute-force fine-tuning on a large number of puzzles eventually solve K&K tasks?
Test accuracy improvement on N-people problems for LLMs fine-tuned on M-people problems, compared to the un-fine-tuned LLM.
(Figure panels: GPT4o-mini CoT FT, GPT4o-mini Direct FT, Llama3-8B Direct FT.)
We use probing techniques (Hewitt & Liang, 2019) to analyze the internal representations of Direct FT-ed models on K&K-related tasks, to see whether they develop an internal understanding of K&K reasoning skills when learning only from answers. The probing task: distinguish correct from incorrect statements about a given puzzle based on the model's intermediate outputs.
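A minimal sketch of such a linear probe is shown below, using hidden states from a Hugging Face checkpoint and a logistic regression classifier; the model name and the statement-level examples are placeholders, not the exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder checkpoint: in practice this would be the Direct FT-ed model.
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def layer_embedding(puzzle, statement, layer):
    # Mean-pool the hidden states of the chosen layer for "puzzle + statement".
    inputs = tokenizer(puzzle + "\n" + statement, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=1).squeeze(0).numpy()  # (hidden_dim,)

# `examples` is assumed to be a list of (puzzle_text, statement_text, is_correct)
# triples, where is_correct marks whether the statement is true in that puzzle.
def probe_accuracy(examples, layer):
    X = [layer_embedding(p, s, layer) for p, s, _ in examples]
    y = [int(label) for _, _, label in examples]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Higher probe accuracy => statement correctness is linearly decodable
    # from that layer's representations.
    return probe.score(X_te, y_te)
```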
Is there a simple indicator that determines whether a model would solve a given puzzle by reasoning or memorization?
We collect training samples correctly solved by the target LLM and assign each a binary label: "consistently solved" under local perturbation (i.e., solved by reasoning) or "not consistently solved" (i.e., solved by memorization). We then train a simple logistic regression model on this binary classification task.
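One way to assign these labels, sketched here with hypothetical helpers (`perturb_puzzle` and `model_solves` are placeholders for the perturbation and answer-checking steps), is to test whether a correctly solved training puzzle stays solved under local perturbations:

```python
def reasoning_label(puzzle, model, n_perturbations=5):
    # Label 1 ("consistently solved", i.e., solved by reasoning) if the model
    # also solves locally perturbed variants of the puzzle; label 0
    # ("not consistently solved", i.e., solved by memorization) otherwise.
    perturbed = [perturb_puzzle(puzzle) for _ in range(n_perturbations)]
    return int(all(model_solves(model, p) for p in perturbed))
```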
We consider text features including TF-IDF, Bag-of-Words, Word Length, and Character Length, computed over different text fields of the puzzles.
We feed each puzzle question to the fine-tuned/un-fine-tuned model and collect the average embedding at each layer as a model-based indicator.
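Putting the pieces together, here is a minimal sketch of the puzzle-based indicator, assuming `questions` holds the puzzle texts and `labels` the reasoning/memorization labels from above (both placeholders); the model-based indicator would instead use the per-layer mean embeddings as features.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def text_features(questions, kind="tfidf"):
    # Puzzle-based features over a chosen text field: TF-IDF, Bag-of-Words,
    # word length, or character length.
    if kind == "tfidf":
        return TfidfVectorizer().fit_transform(questions).toarray()
    if kind == "bow":
        return CountVectorizer().fit_transform(questions).toarray()
    if kind == "word_len":
        return np.array([[len(q.split())] for q in questions])
    return np.array([[len(q)] for q in questions])  # character length

def indicator_score(features, labels):
    # Cross-validated accuracy of a logistic regression classifier at
    # predicting reasoning vs. memorization from the features.
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, labels, cv=5).mean()

# e.g., indicator_score(text_features(questions, "tfidf"), labels)
```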