AYA Dataset Review and Proposed Extensions
An exploratory analysis of AYA Dataset and Collection, with proposed extension recipes to improve task diversity for mid and low-popularity languages.
AYA Dataset and Collection
The AYA project was initiated and coordinated by Cohere and Cohere for AI. The release includes a multilingual dataset and a larger data collection, aimed at extending the progress of AI to a more global audience.
The overall distribution of entries per language in the AYA dataset is displayed in the plot below. We observe a skew towards Asian languages. Some languages with high or medium overall text availability online (such as French, German, Romanian, and Dutch, as per the tables in the AYA paper) are underrepresented or missing entirely from this first version of the dataset.
The collection covers more languages and entries than the dataset, but there remains room to further improve the language coverage and the diversity of data within each language. The rest of this blog focuses on analysis and observations about entries in several European languages, which can be extended to a wider set of languages in future iterations.
Two areas that warrant further quality analysis and reviews in the AYA dataset are language labelling and the presence of duplicates.
Language identification issues - An estimated 1-3% of entries are misattributed to another language, usually English. A simple way to identify candidate mislabelled entries is to run a language identification algorithm over the ‘inputs’ and ‘targets’ strings. For simplicity, we rely here on the implementation in the fasttext package. It should be noted that while fasttext’s language identification is generally accurate, it can return erroneous answers for strings shorter than 100 characters or heavy in punctuation and emojis.
The table below shows a few examples, where the ‘language’ column comes from the AYA dataset, and ‘lang_inputs’ and ‘lang_targets’ are populated by applying the fasttext algorithm to the respective columns. Such errors are more prominent among entries with a single human review (‘original-annotations’) than among entries reviewed multiple times by humans (‘re-annotations’).
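The flagging step described above can be sketched as follows. The helper below is decoupled from fasttext itself: any language-detection callable can be passed in (for fasttext, that would wrap `model.predict` on a model such as `lid.176.bin`). Note that this sketch assumes the dataset’s ‘language’ column uses the same language codes the detector emits; in practice a small mapping table between the two code sets is needed.

```python
def detect_lang_fasttext(model, text):
    """Return the top fastText language code (e.g. 'en') for a text snippet.
    Assumes `model` is a loaded fastText language-identification model."""
    labels, _probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0].replace("__label__", "")

def flag_mislabels(rows, detect, min_chars=100):
    """Flag rows whose detected input/target language disagrees with the
    dataset's 'language' column. Strings shorter than ~100 characters are
    skipped, since language identification is unreliable on short text."""
    flagged = []
    for row in rows:
        for col in ("inputs", "targets"):
            text = row[col]
            if len(text) < min_chars:
                continue
            if detect(text) != row["language"]:
                flagged.append((row, col))
                break  # one disagreement is enough to flag the row
    return flagged
```

With fasttext this would be driven by `model = fasttext.load_model("lid.176.bin")` and `detect = lambda t: detect_lang_fasttext(model, t)`; the flagged rows are candidates for human review, not automatic relabelling.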
Duplicates - We want to explore the diversity of instruction pairs in the selected languages. In general, a greater diversity of tasks available for training is expected to lead to better model performance.
A first empirical observation is that instruction diversity varies greatly between languages in the AYA dataset and collection. Some of the methods proposed in the sections below aim to mitigate the observed quality differences.
The dataset is annotated in multiple steps, as indicated by the ‘annotation_type’ column. The outcome of a first pass is marked as ‘original-annotations’, and subsequent reviews are labelled as ‘re-annotations’. With this in mind, it is expected that the same inputs occur multiple times with (slightly) different targets. Such an example is displayed in the table below.
Entries where the entire instruction pair (‘inputs’, ‘targets’) repeats can also be found. The value of keeping such duplicates in the data is unclear, and removing them before model training may improve out-of-sample performance. The table below shows examples of such duplicates. In both the AYA dataset and the AYA collection, the proportion of such duplicates is 0.1-1% and varies by language.
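Measuring and removing exact duplicates needs only the standard library. A minimal sketch, assuming entries are dictionaries with ‘inputs’ and ‘targets’ keys as in the dataset:

```python
from collections import Counter

def duplicate_fraction(rows):
    """Fraction of rows that are exact repeats of an earlier
    (inputs, targets) pair."""
    counts = Counter((r["inputs"], r["targets"]) for r in rows)
    extra_copies = sum(c - 1 for c in counts.values())  # copies beyond the first
    return extra_copies / len(rows)

def drop_exact_duplicates(rows):
    """Keep only the first occurrence of each (inputs, targets) pair,
    preserving the original row order."""
    seen, kept = set(), []
    for r in rows:
        key = (r["inputs"], r["targets"])
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept
```

The same logic is a one-liner with `pandas.DataFrame.drop_duplicates` on the two columns, but the stdlib version makes the counting explicit.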
It is possible to evaluate the dataset’s diversity beyond exact-match duplicates. After embedding the entries in the selected languages using a pre-trained multilingual Sentence Transformer (paraphrase-multilingual-mpnet-base-v2), we can compare their pairwise cosine similarity. At a threshold of 0.95, and after removing exact string matches, we find that 15-20% of ‘inputs’ and ‘targets’ within each language are near duplicates.
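The near-duplicate check reduces to a pairwise cosine-similarity matrix over the embeddings. A sketch with NumPy, assuming the embeddings come row-wise from a SentenceTransformer model (exact string matches having been removed first, as above):

```python
import numpy as np

def near_duplicate_pairs(embeddings, threshold=0.95):
    """Return index pairs (i, j), i < j, whose cosine similarity exceeds
    `threshold`. `embeddings` is an (n, d) array of row vectors, e.g. the
    output of SentenceTransformer.encode on the 'inputs' strings."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise rows
    sims = X @ X.T                                    # pairwise cosine similarity
    i, j = np.triu_indices(len(X), k=1)               # upper triangle, no diagonal
    mask = sims[i, j] > threshold
    return list(zip(i[mask].tolist(), j[mask].tolist()))
```

For the dataset sizes here the dense n×n matrix is affordable; for much larger collections an approximate nearest-neighbour index would replace the full matrix.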
Extension through Translation
One obvious direction to improve the dataset’s coverage and diversity of instruction tasks is by translating instructions from languages with good diversity such as English. The translation can be done using out-of-the-box open-source LLMs, without pre-training or fine-tuning.
The choice of Mixtral over Llama 2 is motivated by a multilingual performance analysis of these LLMs on the target languages of this blog:
One caveat to note is that translating from English into related European languages may yield much better results than translating between more distant language pairs. It remains to be seen how well this method generalises to a wider set of languages.
Empirically, the observed quality of the translation is very good. One shortcoming of using translation to enhance instruction datasets is that topics common in one country may hold little relevance for speakers of other languages. For example, we can correctly translate sentences about the Super Bowl from English to German, but a German language model that performs well on such topics has limited usefulness in practice.
Measuring translation quality
One question that remains to be addressed is how to measure translation quality in this setting. Two possible methods are:
Human review of all generated translation pairs, or an estimate based on a human-labelled golden dataset.
A quality score based on the cosine similarity (or dot product) between the embeddings of each input-target pair. Namely, we observe that for the input-target pairs in the dataset,
\(\sum_k \langle I_k, T_k \rangle \gg \frac{1}{|\Pi|} \sum_{\pi \in \Pi} \sum_k \langle I_k, T_{\pi(k)} \rangle\)
where \(I_k\) and \(T_k\) are the embeddings of the \(k\)-th input and target respectively, and \(\Pi\) is a set of random permutations of the input-target pairs.
This observation is valid for each considered language subset of the AYA dataset:
Below, each language subset is listed with (mean cosine similarity of matched instruction pairs, mean cosine similarity under random permutations of the pairs):

English: (0.69, 0.08)
French: (0.67, 0.12)
German: (0.63, 0.10)
overall: (0.68, 0.09)

If the results of the proposed translation approach are of good quality, we expect them to display the same property. This property can be turned into a formal quality measure based on the proposed score.
For the English-to-Romanian translations, the corresponding scores are (0.64, 0.11). Both methods of evaluating translation quality - human review and the similarity score - confirm that the majority of instruction pairs translated from English to Romanian using Mixtral are of good quality. Below is a plot showing the distribution of the pairwise cosine similarity between each English entry and its corresponding Romanian translation produced with Mixtral.
Human evaluation suggests 0.85 as a good threshold for filtering quality translation results. After this filtering, 2389 of the 3765 translated rows are kept.
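The filtering step itself is a simple threshold over the per-pair similarities. A sketch, assuming the translations and their source/translation cosine similarities are held in parallel lists:

```python
def filter_by_similarity(pairs, sims, threshold=0.85):
    """Keep translation pairs whose source/translation cosine similarity
    meets the threshold (0.85 here, chosen via human evaluation).
    `pairs` and `sims` are parallel sequences."""
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]
```

Pairs falling below the threshold need not be discarded outright; they can be routed to human annotators as lower-confidence candidates.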
The generated dataset can be found on Huggingface as aya_en2ro_mixtral. The approach can be reproduced on a wider set of languages and instruction datasets to generate examples for human labelling and increase the diversity of the AYA dataset.
Beyond translation
An alternative approach to increasing the pool of instruction pairs is to use generative methods. Open-source LLMs with good multilingual performance can be employed to generate instruction pairs grounded in the content of a document, for example Wikipedia articles. This can be seen as a variation on RAG.
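A grounded-generation step of this kind amounts to wrapping the source document in a prompt. The template below is purely illustrative (the wording is an assumption, not a prompt used in the AYA project):

```python
def grounded_instruction_prompt(document, language):
    """Build a prompt asking an LLM to write one instruction/answer pair
    grounded in `document`, in the given language. Hypothetical template
    for illustration only."""
    return (
        f"Read the following {language} text and write one question that "
        f"can be answered from it, followed by the answer.\n\n"
        f"Text:\n{document}\n\n"
        f"Question and answer in {language}:"
    )
```

The generated pairs should then be checked against the grounding document (and deduplicated against existing entries) before being queued for human review.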
The previously proposed RAG-like method does not update the model weights. Alternatively, open-source LLMs can be fine-tuned on existing multilingual instruction datasets (such as AYA) to produce instruction prompt-answer pairs in multiple languages. These pairs can then be put back in front of annotators for review and correction. This process is inspired by the method proposed in the Self-Instruct paper.

It is important to ensure the uniqueness of the generated instruction pairs relative to existing examples in the AYA dataset and collection. This consideration is key to avoiding redundancy and ensuring that the addition of new pairs genuinely enhances the dataset.
All proposed methods have limitations in terms of quality and usefulness of their final output. Using them to create inputs to be reviewed and annotated by humans in later stages of the AYA pipeline can still increase the quality of future versions of the AYA datasets and models.