In HisDoc III we target historical document classification for large collections of uncategorized facsimiles, with the intent to provide new capabilities for researchers in the Digital Humanities. In particular, we aim to categorize document images with respect to content, language, script, and layout. To do so, we will leverage the expertise gained from our previous projects, HisDoc and HisDoc 2.0. In HisDoc we showed that historical Document Image Analysis (DIA) can be effectively applied to extract layout structures and textual transcriptions. In the HisDoc 2.0 project we successfully retrieved additional paleographic information. The novel contributions of HisDoc III will be complemented by these methods to cope with large document collections.
The objective of HisDoc III is twofold: (i) fundamental research on combined text- and image-based classification methods and (ii) making the developed technology useful for libraries, archives, and researchers in the Humanities.
For the first task, i.e., the classification of documents, we will study novel deep learning methods for large amounts of unlabeled text and image data. These methods will be complemented by structural approaches based on document graphs. For the combination of these diverse approaches we will investigate Multiple Classifier Systems (MCS) on the one hand and integrated neural network architectures on the other.
For the second task, we will combine three ideas for making the methods useful for libraries: (i) novel means of reducing the amount of ground truth needed, using unsupervised machine learning or, alternatively, bootstrapping combined with active learning; (ii) intuitive computer-assisted presentation and annotation tools; and (iii) making our systems publicly available as Web services. To demonstrate the suitability of the HisDoc III research results, we will design novel computer-assisted workflows in collaboration with an advisory board composed of scholars, librarians, and archivists.