Lead
A considerable amount of knowledge is preserved in documents of all kinds, from modern documents and database structures to historical writings that survive as physical manuscripts. With state-of-the-art scanning methods, millions of historical documents are being digitized, i.e., scanned or photographed and stored as electronic images. However, textual search in such facsimiles of historical documents is difficult, because the pixel information is not computer-readable text. For the large collections of libraries and archives in particular, this poses a challenge when striving towards complete digitization.

Lay summary

In HisDoc III we target historical document classification for large amounts of uncategorized facsimiles, with the intent to provide new capabilities for researchers in the Digital Humanities. In particular, we will categorize document images with respect to content, language, script, and layout. To do so, we will leverage the expertise gained from our previous projects, HisDoc and HisDoc 2.0. In HisDoc we showed that historical Document Image Analysis (DIA) can be effectively applied to extract layout structures and textual transcriptions. In HisDoc 2.0 we successfully retrieved additional paleographic information. The novel contributions of HisDoc III will be complemented by these methods to cope with large document collections.

The objective of HisDoc III is twofold: (i) fundamental research on combined text- and image-based classification methods and (ii) making the developed technology useful for libraries, archives, and researchers in the Humanities.

For the first task, i.e., the classification of documents, we will study novel deep learning methods for large amounts of unlabeled text and image data. These methods will be complemented by structural approaches based on document graphs. For the combination of these diverse approaches we will investigate Multiple Classifier Systems (MCS) on the one hand and integrated neural network architectures on the other.
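To illustrate what combining classifiers in an MCS can look like, the following is a minimal sketch of a weighted sum rule over per-class scores. It is only an assumed example of one common fusion scheme, not the method the project will use; the function names, the class labels, and the scores are all hypothetical.

```python
from typing import Dict, List


def fuse_scores(score_sets: List[Dict[str, float]],
                weights: List[float]) -> Dict[str, float]:
    """Weighted sum rule: average per-class scores from several classifiers."""
    if len(score_sets) != len(weights):
        raise ValueError("one weight per classifier is required")
    total = sum(weights)
    fused: Dict[str, float] = {}
    for scores, weight in zip(score_sets, weights):
        for label, score in scores.items():
            # Normalise weights so the fused scores stay on the same scale.
            fused[label] = fused.get(label, 0.0) + (weight / total) * score
    return fused


def predict(fused: Dict[str, float]) -> str:
    """Return the label with the highest fused score."""
    return max(fused, key=fused.get)


# Hypothetical outputs of a text-based and an image-based classifier
# for the same facsimile (illustrative values only).
text_scores = {"charter": 0.6, "letter": 0.3, "register": 0.1}
image_scores = {"charter": 0.2, "letter": 0.7, "register": 0.1}

fused = fuse_scores([text_scores, image_scores], weights=[1.0, 1.0])
label = predict(fused)
```

With equal weights the two classifiers disagree and the fused scores decide the outcome; unequal weights would let a more reliable classifier dominate.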

For the second task, we will combine three ideas for making methods useful for libraries: (i) novel means for reducing the amount of ground truth needed, using unsupervised machine learning or, alternatively, bootstrapping combined with active learning; (ii) intuitive computer-assisted presentation and annotation tools; and (iii) making our systems publicly available as Web services. To demonstrate the suitability of the HisDoc III research results, we will design novel computer-assisted workflows in collaboration with an advisory board composed of scholars, librarians, and archivists.
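The active learning mentioned under (i) can be illustrated with uncertainty sampling: the pages whose predicted class distribution is least certain are sent to a human annotator first. The sketch below assumes entropy as the uncertainty measure, which is one common choice and not necessarily the one the project adopts; the page identifiers and probabilities are hypothetical.

```python
import math
from typing import Dict, List


def entropy(probs: List[float]) -> float:
    """Shannon entropy of a class distribution (higher means less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions: Dict[str, List[float]],
                          budget: int) -> List[str]:
    """Pick the `budget` most uncertain pages for manual labelling."""
    ranked = sorted(predictions,
                    key=lambda page: entropy(predictions[page]),
                    reverse=True)
    return ranked[:budget]


# Hypothetical classifier outputs for three unlabeled pages.
predictions = {
    "page_001": [0.98, 0.01, 0.01],  # confident prediction
    "page_002": [0.40, 0.35, 0.25],  # very uncertain
    "page_003": [0.70, 0.20, 0.10],  # moderately uncertain
}

chosen = select_for_annotation(predictions, budget=2)
```

Labelling only the selected pages and retraining is what lets such a loop reduce the total amount of ground truth an expert has to produce.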