Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Intuitive psychology is a part of common-sense reasoning.
Replicating this reasoning in machines is an important step toward building human-like AI.
Recent tasks and benchmarks have focused on belief attribution in Theory-of-Mind (ToM) tasks.
Models have shown both successes and failures on these tasks.
Evaluation of models should be skeptical by default, and failure cases should be given more weight than average success rates.
The paper also considers what success on ToM tasks by LLMs would mean for the same tasks when given to people.

LLMs have been argued to have developed Theory-of-Mind
GPT-3.5 was used in (1) and achieved the best results
Vignettes and prompts were posed to an LLM and probabilities of different completions were examined
GPT-3.5 does not respond robustly to ToM tasks
Materials and methods from (1) are publicly available
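
To make "probabilities of different completions were examined" concrete, here is a minimal sketch of one way to compare candidate completions. It assumes the model returns a log-probability for each candidate word. The function name and the numbers are placeholders for illustration only; they are not values reported in (1) or in this paper.

```python
import math


def normalize_completions(logprobs):
    """Turn per-completion log-probabilities into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logprobs.values())
    weights = {word: math.exp(lp - peak) for word, lp in logprobs.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Placeholder log-probabilities for completing "Sam believes the bag is full of ...".
# These are made-up illustration values, NOT results from any study.
example_logprobs = {"chocolate": -0.4, "popcorn": -2.7}

probs = normalize_completions(example_logprobs)
preferred = max(probs, key=probs.get)  # completion the model favors
```

Normalizing over only the candidate words makes the comparison relative: the question is which belief the model favors, not how likely each word is in absolute terms.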

Smarties task is an assessment of Theory of Mind (ToM)
Involves a container with an unexpected item inside
Participant must reason about the beliefs of another person who has not seen the contents of the container
Study begins with a version of the unexpected-contents task
Bag filled with popcorn, label says “chocolate”
Sam finds the bag, cannot see what is inside
Sam reads the label
Sam believes the bag is full of chocolate
Sam is delighted to have found the bag and loves eating chocolate
Variations on the vignette are based on commonsense principles of ToM
Variations involve making the container transparent, Sam not being able to read, Sam being told by a trusted friend, and Sam filling the bag with popcorn and writing the label
GPT-3.5 is sensitive to small irrelevant perturbations
LLM reasoning about ToM is sensitive to when the person reads the label
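
The variations above can be sketched as a small data structure with a rule for the belief a commonsense reader would expect Sam to hold. This is an illustrative encoding, not the paper's materials: the field names, the rule in `expected_belief`, the "unknown" outcome for the cannot-read case, and the assumption that the trusted friend tells Sam the true contents are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vignette:
    """One variation of the unexpected-contents task (hypothetical encoding)."""
    name: str
    contents: str = "popcorn"       # what is actually in the bag
    label: str = "chocolate"        # what the label says
    can_see_inside: bool = False    # transparent container
    can_read_label: bool = True     # False when Sam cannot read
    told_by_friend: bool = False    # assumed: friend states the true contents
    filled_bag_self: bool = False   # Sam filled the bag and wrote the label


def expected_belief(v):
    """Assumed commonsense rule: direct knowledge of the contents beats the label."""
    if v.can_see_inside or v.told_by_friend or v.filled_bag_self:
        return v.contents
    if v.can_read_label:
        return v.label
    return "unknown"  # no evidence about the contents either way


VIGNETTES = [
    Vignette("original"),
    Vignette("transparent container", can_see_inside=True),
    Vignette("cannot read", can_read_label=False),
    Vignette("trusted friend", told_by_friend=True),
    Vignette("filled bag self", filled_bag_self=True),
]

expected = {v.name: expected_belief(v) for v in VIGNETTES}
```

Under this rule only the original vignette yields "chocolate", so a model that answers "chocolate" on the variations would be following the surface pattern rather than Sam's actual evidence.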

ToM tasks involve a participant observing a person who is unaware of a change in a state of affairs
Classic Sally-Anne version involves Sally hiding a marble in a basket, Anne moving it to a box without Sally’s knowledge, and the participant being asked where Sally will look for the marble
Study 2 in (1) uses an unexpected transfer task with John, Mark, a cat, a box, and a basket
GPT-3.5 correctly infers John’s mental states
Variations of the task are used to test whether GPT-3.5 is fixated on the statistical pattern of an agent looking for the item where it no longer is
Variation asks what Mark will do, as he is the one who moved the cat
GPT-3.5 predicts Mark will look for the cat in the basket, even though he put it in the box
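
The reasoning the unexpected transfer task calls for can be sketched as belief tracking: an agent's belief about the cat's location updates only when that agent witnesses a move. The event sequence below (John places the cat in the basket and leaves, Mark moves it to the box, John returns) is an assumed reconstruction of the vignette, and the function and event names are hypothetical.

```python
def track_beliefs(events, agents):
    """Return each agent's belief about the item's location after all events."""
    present = set(agents)                      # everyone starts in the room
    beliefs = {agent: None for agent in agents}
    for kind, actor, place in events:
        if kind == "leave":
            present.discard(actor)
        elif kind == "enter":
            present.add(actor)
        elif kind == "move":
            # Only agents in the room (including the mover) see the new location.
            for witness in present:
                beliefs[witness] = place
    return beliefs


# Assumed event sequence for the John/Mark vignette.
events = [
    ("move", "John", "basket"),
    ("leave", "John", None),
    ("move", "Mark", "box"),
    ("enter", "John", None),
]

beliefs = track_beliefs(events, ["John", "Mark"])
```

With this bookkeeping John is expected to look in the basket and Mark in the box, which is why a prediction that Mark will look in the basket counts as a failure.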