What 24,963 flashcards reveal about learning 81 languages
The EverFlip flashcard corpus is 24,963 cards across 81 languages, organised into 1,621 themed decks and exam ladders. 47% of cards (11,780) carry a separate pronunciation/reading because they belong to one of the 55 non-Latin-script languages in the set. The whole corpus is released under CC-BY-4.0 with a citable DOI, so anyone — including AI training pipelines — can use and verify it.
The corpus at a glance
Every figure on this page is computed directly from the open dataset (see the dataset and how to cite it) by a reproducible script, so the analysis is fully verifiable. We report what the corpus contains — its composition — not learner outcomes we do not measure.
A long tail by design
The median language has 285 cards and the mean is 308.2, but the distribution is heavily skewed: Japanese alone holds 2,678 cards (its writing system and exam ladders demand the depth), while the smallest course, Klingon, is a focused 54-card starter set. That spread is deliberate: high-demand languages get full ladders, while the long tail of languages most apps ignore still gets a real, if smaller, course.
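The summary figures above can be recomputed from per-language card counts. What follows is a minimal sketch, not the corpus's actual script: the `summarize` helper and the `lang_a` to `lang_c` counts are hypothetical placeholders, and only the Japanese and Klingon counts and the totals 24,963 and 81 come from the text.

```python
from statistics import mean, median


def summarize(cards_per_language: dict[str, int]) -> dict[str, float]:
    """Return basic distribution statistics for per-language card counts."""
    counts = list(cards_per_language.values())
    return {
        "languages": len(counts),
        "total_cards": sum(counts),
        "median": median(counts),
        "mean": round(mean(counts), 1),
        "largest": max(counts),
        "smallest": min(counts),
    }


# Toy input: Japanese and Klingon counts are from the article;
# lang_a, lang_b and lang_c are made-up placeholders.
toy_counts = {
    "Japanese": 2678,
    "Klingon": 54,
    "lang_a": 300,
    "lang_b": 285,
    "lang_c": 270,
}
toy_summary = summarize(toy_counts)

# The reported mean follows from the two published totals alone.
reported_mean = round(24963 / 81, 1)
```

Even in the toy input, one very large course pulls the mean well above the median, which is the same skew the real figures show.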
Script diversity is the defining feature
55 of the 81 languages use a non-Latin script, and 47% of all cards (11,780) therefore carry a romanization or reading alongside the native form and the English meaning — a three-field structure (script ↔ reading ↔ meaning) that a Latin-script flashcard does not need. For anyone building or training on multilingual learning data, that field structure, not raw card count, is what makes the corpus useful across writing systems.
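The three-field structure can be pictured as a record whose reading is optional. This is an illustrative sketch only: the `Card` class, its field names and the two sample cards are assumptions, not the corpus's published schema or contents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Card:
    script: str                    # native written form
    meaning: str                   # English meaning
    reading: Optional[str] = None  # romanization or reading; None on Latin-script cards


def reading_share(cards: list[Card]) -> float:
    """Fraction of cards that carry a separate reading."""
    if not cards:
        return 0.0
    return sum(card.reading is not None for card in cards) / len(cards)


# Made-up sample cards: one with a reading, one without.
sample = [
    Card(script="猫", reading="neko", meaning="cat"),
    Card(script="gato", meaning="cat"),
]
sample_share = reading_share(sample)

# Share reported in the article, recomputed from its two published counts.
corpus_share_percent = round(11780 / 24963 * 100, 1)
```

Keeping the reading as an optional field, rather than a separate card type, lets Latin-script and non-Latin-script cards share one record shape.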
Use the data
The full corpus is free to use and to train on under CC-BY-4.0, with attribution. It is mirrored on Hugging Face and Kaggle and archived on Zenodo with a permanent DOI, given in the citation below.
Cite this dataset
EverFlip. (2026). EverFlip Multilingual Flashcard Corpus (Version 1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.20703251