CuneiML: A Cuneiform Dataset for Machine Learning

by: Danlu Chen, Aditi Agarwal, Taylor Berg-Kirkpatrick, Jacobo Myerston

Format: Article
Published: Ubiquity Press 2023-12-01

Description

The cuneiform writing system holds a vast reservoir of ancient literature, encompassing over 3000 years of history. Originating around the mid-fourth millennium BCE and enduring until the late first millennium BCE, cuneiform writing spans various genres, including administrative, legal, medical, and scientific documents. This article introduces a curated dataset, CuneiML, featuring 38,947 high-resolution 2D photos of Sumerian and Akkadian cuneiform tablets, accompanied by their cuneiform Unicode transcriptions, transliterations, line art, and metadata. The dataset aims to support the development of machine learning tools for processing and analyzing Sumerian and Akkadian cuneiform artifacts, for example by automatically classifying genre, provenance, or period from unannotated tablet images. To that end, CuneiML is designed with consistency of format as a primary concern. Specifically, CuneiML is the result of meticulously preprocessing, segmenting, filtering, and re-transliterating data available online in the Cuneiform Digital Library Initiative (CDLI) collection.