mSCAN - a Multilingual Dataset for Compositional Generalization Evaluation

Reymond, Amélie Thu Tâm

Date

2025-08-01

Abstract

Language models achieve remarkable results on a variety of tasks, yet still struggle on compositional generalization benchmarks. Most of these benchmarks evaluate performance in English only, leaving open the question of whether the results generalize to other languages. As an initial step toward answering this question, we introduce mSCAN, a multilingual adaptation of the SCAN dataset covering Mandarin Chinese, French, Hindi, and Russian. The dataset was produced by a rule-based translation procedure developed in cooperation with native speakers. We then showcase the dataset in in-context learning experiments with multiple open-source multilingual models.