Sahara Tasks

The Sahara benchmark evaluates model performance across 16 tasks, categorized into four primary clusters, to reflect Africa’s diverse linguistic landscape.

Multiple-Choice, Comprehensive and Reasoning (MCCR) Tasks

  • Context-based Question Answering (squad_qa): Evaluated using Macro F1, supporting 1 language.

  • General Knowledge (mmlu): Evaluated using Accuracy, supporting 16 languages.

  • Mathematical Word Problems (mgsm): Evaluated using Exact Match, supporting 16 languages.

  • Reading Comprehension (belebele): Evaluated using Accuracy, supporting 25 languages.
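The two simplest metrics in this cluster, Accuracy and Exact Match, can be sketched as follows. This is a minimal illustration, not the benchmark's own scoring code: the function names and the `normalize_answer` rule (trim, lowercase, drop thousands separators) are assumptions made for the example.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def normalize_answer(text):
    """Hypothetical normalization: trim, lowercase, drop thousands separators."""
    return text.strip().lower().replace(",", "")


def exact_match(predictions, references):
    """Accuracy computed on normalized answer strings."""
    return accuracy(
        [normalize_answer(p) for p in predictions],
        [normalize_answer(r) for r in references],
    )


if __name__ == "__main__":
    # Multiple-choice style scoring (e.g. mmlu, belebele).
    print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
    # Final-answer scoring for word problems (e.g. mgsm).
    print(exact_match([" 1,200 ", "7"], ["1200", "8"]))
```

How strictly answers are normalized before comparison can change Exact Match scores noticeably, so the rule used should be reported alongside the results.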

Text Classification Tasks

  • Cross-Lingual Natural Language Inference (xlni): Evaluated using Accuracy, supporting 16 languages.

  • Language Identification (lid): Evaluated using Macro F1, supporting 517 languages.

  • News Classification (news): Evaluated using Macro F1, supporting 4 languages.

  • Sentiment Analysis (sentiment): Evaluated using Macro F1, supporting 3 languages.

  • Topic Classification (topic): Evaluated using Macro F1, supporting 2 languages.
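Macro F1, used by most classification tasks above, averages the per-class F1 scores so that rare classes count as much as frequent ones. The sketch below is a plain-Python illustration under that standard definition; the function name and the choice to score every label seen in either list are assumptions, and the benchmark's actual implementation may differ.

```python
def macro_f1(predictions, references):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    labels = sorted(set(predictions) | set(references))
    if not labels:
        return 0.0
    pairs = list(zip(predictions, references))
    f1_scores = []
    for label in labels:
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    refs = ["pos", "pos", "neg", "neu"]
    preds = ["pos", "neg", "neg", "neu"]
    print(round(macro_f1(preds, refs), 4))
```

Because every class is weighted equally, a model that ignores low-resource languages in lid or minority labels in sentiment is penalized more heavily than it would be under plain Accuracy.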

Text Generation Tasks

  • Machine Translation - African to African (mt_xx2xx): Evaluated using spBleu-1K, supporting 29 languages.

  • Machine Translation - English to African (mt_eng2xx): Evaluated using spBleu-1K, supporting 29 languages.

  • Machine Translation - French to African (mt_fra2xx): Evaluated using spBleu-1K, supporting 29 languages.

  • Paraphrase (paraphrase): Evaluated using spBleu-1K, supporting 4 languages.

  • Summarization (summary): Evaluated using RougeL, supporting 10 languages.

  • Title Generation (title): Evaluated using spBleu-1K, supporting 10 languages.
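For the summarization metric, ROUGE-L is based on the longest common subsequence (LCS) between a candidate and a reference. The following is a simplified sketch that assumes whitespace tokenization and a balanced F-measure; the function names are illustrative, and real evaluations typically rely on an established ROUGE package with its own tokenization and stemming options. The spBleu-1K metric is not sketched here because it depends on a specific subword tokenizer.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Sentence-level ROUGE-L F1 on whitespace tokens (an assumed tokenization)."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
```

Whitespace tokenization is a poor fit for some scripts and for morphologically rich languages, so scores from this sketch are not comparable to published numbers.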

Token-Level Tasks

  • Named Entity Recognition (ner): Evaluated using Macro F1, supporting 27 languages.

  • Phrase Chunking (phrase): Evaluated using Macro F1, supporting 8 languages.

  • Part-of-Speech Tagging (pos): Evaluated using Macro F1, supporting 1 language.
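The task list above can be captured as a small lookup table, which is convenient when filtering tasks by cluster or metric in an evaluation script. The sketch below copies the task identifiers, metrics, and language counts from the list; the `Task` class, the short cluster keys (`mccr`, `classification`, `generation`, `token`), and the helper names are hypothetical and not part of the benchmark itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str        # identifier as listed above, e.g. "mmlu"
    cluster: str        # hypothetical short key for the cluster
    metric: str         # metric name as listed above
    num_languages: int  # number of supported languages


TASKS = [
    Task("squad_qa", "mccr", "Macro F1", 1),
    Task("mmlu", "mccr", "Accuracy", 16),
    Task("mgsm", "mccr", "Exact Match", 16),
    Task("belebele", "mccr", "Accuracy", 25),
    Task("xlni", "classification", "Accuracy", 16),
    Task("lid", "classification", "Macro F1", 517),
    Task("news", "classification", "Macro F1", 4),
    Task("sentiment", "classification", "Macro F1", 3),
    Task("topic", "classification", "Macro F1", 2),
    Task("mt_xx2xx", "generation", "spBleu-1K", 29),
    Task("mt_eng2xx", "generation", "spBleu-1K", 29),
    Task("mt_fra2xx", "generation", "spBleu-1K", 29),
    Task("paraphrase", "generation", "spBleu-1K", 4),
    Task("summary", "generation", "RougeL", 10),
    Task("title", "generation", "spBleu-1K", 10),
    Task("ner", "token", "Macro F1", 27),
    Task("phrase", "token", "Macro F1", 8),
    Task("pos", "token", "Macro F1", 1),
]


def tasks_in_cluster(cluster):
    """Return the task identifiers that belong to one cluster key."""
    return [t.task_id for t in TASKS if t.cluster == cluster]


def tasks_with_metric(metric):
    """Return the task identifiers that are scored with the given metric."""
    return [t.task_id for t in TASKS if t.metric == metric]


if __name__ == "__main__":
    print(tasks_in_cluster("token"))
    print(tasks_with_metric("Accuracy"))
```

Keeping the metric next to each task identifier makes it harder to score a task with the wrong metric, for example reporting Accuracy for a task that the benchmark evaluates with Macro F1.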