Poster

Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition

Mehreen Saeed · Adrian Chan · Anupam Mijar · Joseph Moukarzel · Gerges Habchi · Carlos Younes · Amin Elias · Chau-Wai Wong · Akram Khater

West Ballroom A-D #5110
Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

We present the Manuscripts of Handwritten Arabic (Muharaf) Dataset, a machine learning dataset of more than 1,600 historic handwritten page images punctiliously transcribed by experts in archival Arabic. Each document image is accompanied by the spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR) of not only Arabic manuscripts but also cursive text in general. The Muharaf Dataset covers diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondence. In this paper, we describe the data acquisition pipeline as well as the notable dataset features and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks on this data.
