What 24,963 flashcards reveal about learning 81 languages

EverFlip · open data

The EverFlip flashcard corpus contains 24,963 cards across 81 languages, organised into 1,621 themed decks and exam ladders. Of these, 11,780 cards (47%) carry a separate pronunciation or reading field because they belong to one of the 55 non-Latin-script languages in the set. The whole corpus is released under CC-BY-4.0 with a citable DOI, so anyone, including teams building AI training pipelines, can use and verify it.

The corpus at a glance

Every figure on this page is computed directly from the open dataset by a reproducible script, so the analysis can be checked independently (see "Use the data" below for access and citation details). We report what the corpus contains, that is, its composition, not learner outcomes, which we do not measure.

Cards per language — the ten largest courses
Japanese: 2,678 cards · 134 decks
French: 608 cards · 30 decks
Spanish: 604 cards · 29 decks
Portuguese: 463 cards · 27 decks
Italian: 459 cards · 27 decks
German: 451 cards · 27 decks
Korean: 408 cards · 28 decks
Russian: 386 cards · 22 decks
Mandarin: 372 cards · 26 decks
Arabic: 370 cards · 21 decks
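
As a rough illustration of how per-language figures like these could be tallied, here is a minimal Python sketch. The record layout (a `language` and a `deck` key per card) and the sample rows are assumptions made for the example; they are not the published schema or real corpus rows.

```python
from collections import defaultdict

# Hypothetical records: the "language" and "deck" keys are assumed for
# illustration and may not match the field names in the released files.
sample_cards = [
    {"language": "Japanese", "deck": "Greetings"},
    {"language": "Japanese", "deck": "Greetings"},
    {"language": "Japanese", "deck": "Numbers"},
    {"language": "French", "deck": "Greetings"},
]


def summarise(cards):
    """Return {language: (card_count, distinct_deck_count)}."""
    card_counts = defaultdict(int)
    decks = defaultdict(set)
    for card in cards:
        card_counts[card["language"]] += 1
        decks[card["language"]].add(card["deck"])
    return {lang: (card_counts[lang], len(decks[lang])) for lang in card_counts}


def largest(summary, n=10):
    """Return the n languages with the most cards, largest first."""
    ranked = sorted(summary.items(), key=lambda item: item[1][0], reverse=True)
    return ranked[:n]


for language, (n_cards, n_decks) in largest(summarise(sample_cards)):
    print(f"{language}: {n_cards:,} cards · {n_decks} decks")
```

Counting decks through a set per language means a deck name shared by two languages is counted once for each, which matches how the list above reports decks per course.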

A long tail by design

The median language has 285 cards and the mean is 308.2, but the distribution is heavily skewed: Japanese alone holds 2,678 cards (its writing system and exam ladders demand the depth), while the smallest course, Klingon, is a focused 54-card starter set. That spread is deliberate: high-demand languages get full ladders, while a long tail of languages that most apps ignore still gets a real, if smaller, course.
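
The arithmetic behind the skew can be rechecked from the totals quoted on this page alone. The sketch below uses only those published numbers; the median is taken as reported, because it cannot be recomputed without the full per-language counts.

```python
# Totals quoted on this page.
TOTAL_CARDS = 24_963
LANGUAGES = 81
MEDIAN_CARDS = 285  # reported value; needs all 81 counts to recompute

# Card counts for the ten largest courses, as listed above.
top_ten = {
    "Japanese": 2678, "French": 608, "Spanish": 604, "Portuguese": 463,
    "Italian": 459, "German": 451, "Korean": 408, "Russian": 386,
    "Mandarin": 372, "Arabic": 370,
}

mean_cards = TOTAL_CARDS / LANGUAGES
japanese_share = top_ten["Japanese"] / TOTAL_CARDS
top_ten_share = sum(top_ten.values()) / TOTAL_CARDS

print(f"mean cards per language: {mean_cards:.1f}")
print(f"Japanese share of corpus: {japanese_share:.1%}")
print(f"top-ten share of corpus: {top_ten_share:.1%}")
# A mean above the median is consistent with a right-skewed distribution.
print(f"mean exceeds median: {mean_cards > MEDIAN_CARDS}")
```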

Script diversity is the defining feature

55 of the 81 languages use a non-Latin script, and 47% of all cards (11,780) therefore carry a romanization or reading alongside the native form and the English meaning — a three-field structure (script ↔ reading ↔ meaning) that a Latin-script flashcard does not need. For anyone building or training on multilingual learning data, that field structure, not raw card count, is what makes the corpus useful across writing systems.
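
One way to picture the three-field structure is a record whose reading is optional. The class and field names below (`Card`, `script`, `reading`, `meaning`) are hypothetical and chosen for this sketch; the two sample cards are illustrative and not taken from the corpus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Card:
    """A flashcard with an optional reading (field names are assumed)."""
    script: str                    # the word in its native script
    meaning: str                   # the English meaning
    reading: Optional[str] = None  # romanization or reading, if any


def has_reading(card: Card) -> bool:
    """True when the card carries a separate reading field."""
    return card.reading is not None


# Illustrative cards only.
cards = [
    Card(script="ありがとう", reading="arigatou", meaning="thank you"),
    Card(script="merci", meaning="thank you"),  # Latin script: no reading
]

share_with_reading = sum(has_reading(c) for c in cards) / len(cards)
print(f"sample cards with a reading: {share_with_reading:.0%}")

# The corpus-level share quoted above, from the published totals.
corpus_share = 11_780 / 24_963
print(f"corpus cards with a reading: {corpus_share:.0%}")
```

Keeping the reading as an optional field, rather than using a separate card type, lets Latin-script and non-Latin-script courses share one schema, which is the property the paragraph above highlights.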

Use the data

The full corpus is free to use and to train on under CC-BY-4.0, with attribution. It is mirrored on Hugging Face and Kaggle and archived on Zenodo with a permanent DOI, given in the citation below.

Cite this dataset

EverFlip. (2026). EverFlip Multilingual Flashcard Corpus (Version 1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.20703251
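
For reference managers that expect BibTeX, the citation above can be turned into an entry with a few lines of Python. This is a sketch only: the `@misc` entry type and the citation key `everflip2026` are assumptions, and the entry supplied with the dataset may be formatted differently.

```python
def to_bibtex(key, fields):
    """Format a dict of fields as a BibTeX @misc entry."""
    lines = [f"@misc{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Values copied from the citation above; the key is hypothetical.
entry = to_bibtex(
    "everflip2026",
    {
        # Extra braces keep the organisation name from being parsed
        # as a personal "Last, First" name.
        "author": "{EverFlip}",
        "title": "EverFlip Multilingual Flashcard Corpus",
        "year": "2026",
        "version": "1.0.0",
        "publisher": "Zenodo",
        "doi": "10.5281/zenodo.20703251",
        "url": "https://doi.org/10.5281/zenodo.20703251",
    },
)
print(entry)
```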
