Project

HisDoc III : Large-Scale Historical Document Classification

English title HisDoc III : Large-Scale Historical Document Classification
Applicant Ingold Rolf
Number 169618
Funding scheme Project funding
Research institution Département d'Informatique Université de Fribourg
Institution of higher education University of Fribourg - FR
Main discipline Information Technology
Start/End 01.01.2017 - 30.06.2021
Approved amount 439'744.00

Keywords (8)

structural learning; historical documents; big data; digital humanities; document classification; document image analysis; deep learning; cultural heritage preservation

Lay Summary (German)

Lead
A large part of our body of knowledge is preserved in documents of all kinds. This covers modern file and database formats as well as historical writings handed down in physical form. With modern scanning methods, millions of such documents are "digitized", i.e., photographed and stored electronically. While searching and exploring the text of modern electronic documents is very easy, making older documents accessible is more laborious, because the pixel images are not readily understandable to a computer. For large collections in particular, this poses a challenge for libraries and archives that aim to digitize their holdings completely.
Lay summary

The focus of HisDoc III is the classification and categorization of large collections of historical documents that have not yet been classified, with the aim of creating new research methods for the Digital Humanities. We design methods and algorithms for categorizing document images with respect to content, language, script, and layout. In doing so, we build on methods for the individual sub-aspects that we developed in HisDoc and HisDoc 2.0. In HisDoc we showed that document analysis methods are practically usable for determining layout and text elements. In HisDoc 2.0, methods for determining paleographic information were developed. In HisDoc III we develop additional methods that can work with large document collections. Furthermore, we make them more easily available to libraries, archives, and humanities researchers by providing Web services over the Internet (the cloud) that can be used in analysis tools without any installation effort.

Keywords:

Preservation of cultural heritage, computer-assisted paleography, document image analysis, document classification, fragment classification, electronic manuscript catalogs, large document collections

Last update: 11.12.2016

Lay Summary (English)

Lead
A considerable amount of knowledge is preserved in documents of all kinds. This includes modern document and database formats as well as historical writings that survive as physical manuscripts. With state-of-the-art scanning methods, millions of historical documents are digitized, i.e., scanned or photographed and stored as electronic images. However, textual search in such facsimiles of historical documents is difficult, because the pixel information is not directly machine-readable as text. For the large collections of libraries and archives in particular, this poses a challenge when striving for complete digitization.
Lay summary

In HisDoc III we target historical document classification for large amounts of uncategorized facsimiles, with the intent to provide new capabilities for researchers in the Digital Humanities. In particular, we will categorize document images with respect to content, language, script, and layout. To do so, we will leverage the expertise gained from our previous projects HisDoc and HisDoc 2.0. In HisDoc we have shown that historical Document Image Analysis (DIA) can be effectively applied to extract layout structures and textual transcriptions. In the HisDoc 2.0 project we successfully retrieved additional paleographic information. The novel contributions of HisDoc III will be complemented by these methods to cope with large document collections.

The objective of HisDoc III is twofold: (i) fundamental research on combined text- and image-based classification methods and (ii) making developed technology useful for libraries, archives, and researchers in the Humanities.

For the first task, i.e., the classification of documents, we will study novel deep learning methods for large amounts of unlabeled text and image data. These methods will be complemented by structural approaches based on document graphs. For the combination of these diverse approaches we will investigate Multiple Classifier Systems (MCS) on the one hand and integrated neural network architectures on the other.
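As an illustration only, one of the simplest Multiple Classifier System schemes is weighted soft voting, in which the class probabilities of several classifiers are averaged. The sketch below is not taken from the project: the function names, the three example classifiers (image-based, text-based, graph-based), the labels, and all probability values are hypothetical.

```python
from typing import Dict, List


def combine_soft_votes(predictions: List[Dict[str, float]],
                       weights: List[float]) -> Dict[str, float]:
    """Weighted average of per-class probabilities from several classifiers."""
    if not predictions or len(predictions) != len(weights):
        raise ValueError("need one weight per classifier output")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined: Dict[str, float] = {}
    for probs, weight in zip(predictions, weights):
        for label, p in probs.items():
            combined[label] = combined.get(label, 0.0) + weight * p / total
    return combined


def predict_label(combined: Dict[str, float]) -> str:
    """Return the label with the highest combined probability."""
    return max(combined, key=combined.get)


# Hypothetical outputs of three classifiers for one document image.
image_model = {"Latin": 0.60, "German": 0.40}
text_model = {"Latin": 0.30, "German": 0.70}
graph_model = {"Latin": 0.55, "German": 0.45}

combined = combine_soft_votes([image_model, text_model, graph_model],
                              weights=[1.0, 1.0, 1.0])
label = predict_label(combined)
```

With equal weights this reduces to a plain average; unequal weights could reflect how much each classifier is trusted on a validation set. An integrated neural architecture would instead learn the combination jointly with the feature extractors.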

For the second task, we will combine three ideas for making methods useful for libraries: (i) novel means for reducing the needed amount of ground truth by unsupervised machine learning and, alternatively, bootstrapping combined with active learning; (ii) intuitive computer-assisted presentation and annotation tools; and (iii) making our systems publicly available as Web services. To demonstrate the suitability of the HisDoc III research results, we will design novel computer-assisted workflows in collaboration with an advisory board composed of scholars, librarians, and archivists.
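To make the active learning idea in (i) concrete, a common strategy is uncertainty sampling: the pages on which the current classifier is least certain are sent to a human annotator first. The following is a minimal sketch under that assumption, not the project's method; the page identifiers, probabilities, and function names are hypothetical.

```python
import math
from typing import Dict, List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (natural log) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(pool: Dict[str, List[float]],
                          budget: int) -> List[str]:
    """Pick the `budget` unlabeled items with the most uncertain predictions."""
    ranked = sorted(pool, key=lambda key: entropy(pool[key]), reverse=True)
    return ranked[:budget]


# Hypothetical classifier outputs for four unlabeled pages (three classes).
pool = {
    "page_001": [0.98, 0.01, 0.01],
    "page_002": [0.40, 0.35, 0.25],
    "page_003": [0.70, 0.20, 0.10],
    "page_004": [0.34, 0.33, 0.33],
}

to_label = select_for_annotation(pool, budget=2)
```

In a bootstrapping loop, the newly annotated pages would be added to the training set, the classifier retrained, and the selection repeated until the annotation budget is exhausted.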

Last update: 11.12.2016

Publications

Publication
Handwritten historical document analysis, recognition, and retrieval - state of the art and future trends
Fischer Andreas, Liwicki Marcus, Ingold Rolf (2020), Handwritten historical document analysis, recognition, and retrieval - state of the art and future trends, World Scientific, New Jersey, London, Singapore, ....
A Comprehensive Study of ImageNet Pre-Training for Historical Document Image Analysis
Studer Linda, Alberti Michele, Pondenkandath Vinaychandran, Goktepe Pinar, Kolonko Thomas, Fischer Andreas, Liwicki Marcus, Ingold Rolf (2019), A Comprehensive Study of ImageNet Pre-Training for Historical Document Image Analysis, in 2019 International Conference on Document Analysis and Recognition (ICDAR), 720-725.
DNNViz: Training Evolution Visualization for Deep Neural Network
Clavien Gil, Alberti Michele, Pondenkandath Vinaychandran, Ingold Rolf, Liwicki Marcus (2019), DNNViz: Training Evolution Visualization for Deep Neural Network, in 2019 6th Swiss Conference on Data Science (SDS), 19-24.
Historical Document Synthesis with Generative Adversarial Networks
Pondenkandath Vinaychandran, Alberti Michele, Diatta Michaël, Ingold Rolf, Liwicki Marcus (2019), Historical Document Synthesis with Generative Adversarial Networks, in 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), 5, 146-151.
Improving Reproducible Deep Learning Workflows with DeepDIVA
Alberti Michele, Pondenkandath Vinaychandran, Vögtlin Lars, Würsch Marcel, Ingold Rolf, Liwicki Marcus (2019), Improving Reproducible Deep Learning Workflows with DeepDIVA, in 2019 6th Swiss Conference on Data Science (SDS), 13-18.
Labeling, cutting, grouping: an efficient text line segmentation method for medieval manuscripts
Alberti Michele, Voegtlin Lars, Pondenkandath Vinaychandran, Seuret Mathias, Ingold Rolf, Liwicki Marcus (2019), Labeling, cutting, grouping: an efficient text line segmentation method for medieval manuscripts, in 2019 15th IAPR international conference on document analysis and recognition (ICDAR), 1200-1206.
A Semi-automatized Modular Annotation Tool for Ancient Manuscript Annotation
Seuret Mathias, Bouillon Manuel, Simistira Fotini, Würsch Marcel, Liwicki Marcus, Ingold Rolf (2018), A Semi-automatized Modular Annotation Tool for Ancient Manuscript Annotation, in 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), 340-344, IEEE, Vienna.
Are You Tampering With My Data?
Alberti Michele, Pondenkandath Vinaychandran, Wursch Marcel, Bouillon Manuel, Seuret Mathias, Ingold Rolf, Liwicki Marcus (2018), Are You Tampering With My Data?, in 15th European Conference on Computer Vision (ECCV), Objectionable Content and Misinformation worksho, 296-312.
DeepDIVA: A Highly-Functional Python Framework for Reproducible Experiments
Alberti Michele, Pondenkandath Vinaychandran, Würsch Marcel, Ingold Rolf, Liwicki Marcus (2018), DeepDIVA: A Highly-Functional Python Framework for Reproducible Experiments, in 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), 423-428.
Identifying Cross-Depicted Historical Motifs
Pondenkandath Vinaychandran, Alberti Michele, Eichenberger Nicole, Ingold Rolf, Liwicki Marcus (2018), Identifying Cross-Depicted Historical Motifs, in 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), 333-338.
Leveraging Random Label Memorization for Unsupervised Pre-Training
Pondenkandath Vinaychandran, Alberti Michele, Puran Sammer, Ingold Rolf, Liwicki Marcus (2018), Leveraging Random Label Memorization for Unsupervised Pre-Training, in Workshop of Integration of Deep Learning Theories at Conference on Neur, 1-6.
Web Services in Document Image Analysis - Recent Developments on DIVAServices and the Importance of Building an Ecosystem
Würsch Marcel, Liwicki Marcus, Ingold Rolf (2018), Web Services in Document Image Analysis - Recent Developments on DIVAServices and the Importance of Building an Ecosystem, in 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), 334-339.
A Pitfall of Unsupervised Pre-Training
Alberti Michele, Seuret Mathias, Ingold Rolf, Liwicki Marcus (2017), A Pitfall of Unsupervised Pre-Training, in arXiv:1703.04332 [cs], NIPS, Long Beach.
Exploiting State-of-the-Art Deep Learning Methods for Document Image Analysis
Pondenkandath Vinaychandran, Seuret Mathias, Ingold Rolf, Afzal Muhammad Zeshan, Liwicki Marcus (2017), Exploiting State-of-the-Art Deep Learning Methods for Document Image Analysis, in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 05, 30-35.
ICDAR2017 competition on layout analysis for challenging medieval manuscripts
Simistira Fotini, Bouillon Manuel, Seuret Mathias, Würsch Marcel, Alberti Michele, Ingold Rolf, Liwicki Marcus (2017), ICDAR2017 competition on layout analysis for challenging medieval manuscripts, in 2017 14th IAPR international conference on document analysis and recognition (ICDAR), 01, 1361-1370.
Open Evaluation Tool for Layout Analysis of Document Images
Alberti Michele, Bouillon Manuel, Ingold Rolf, Liwicki Marcus (2017), Open Evaluation Tool for Layout Analysis of Document Images, in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 04, 43-47.
PCA-Initialized Deep Neural Networks Applied to Document Image Analysis
Seuret Mathias, Alberti Michele, Liwicki Marcus, Ingold Rolf (2017), PCA-Initialized Deep Neural Networks Applied to Document Image Analysis, in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 01, 877-882.
Turning Document Image Analysis Methods into Web Services - An Example Using OCRopus
Würsch Marcel, Simistira Foteini, Ingold Rolf, Liwicki Marcus (2017), Turning Document Image Analysis Methods into Web Services - An Example Using OCRopus, in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 04, 48-52.

Collaboration

Group / person Country
Types of collaboration
Sousse University Tunisia (Africa)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
- Exchange of personnel
University of Fribourg Switzerland (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
École pratique des hautes études France (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
Docuteam GmbH Switzerland (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
e-Codices Switzerland (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Publication
Technische Universität Kaiserslautern Germany (Europe)
- in-depth/constructive exchanges on approaches, methods or results
- Research Infrastructure
- Exchange of personnel

Associated projects

Number Title Start Funding scheme
125220 HisDoc: Historical Document Analysis, Recognition, and Retrieval 01.05.2009 Sinergia
150173 HisDoc 2.0 : Towards Computer-Assisted Paleography 01.01.2014 Project funding

Abstract

In HisDoc III we target historical document classification for large amounts of uncategorized facsimiles, with the intent to provide new capabilities for researchers in the Digital Humanities. In particular, we will address the task of categorizing document images with respect to content, language, script, and layout. To do so, we will leverage the expertise gained from our previous projects HisDoc and HisDoc 2.0. In HisDoc we have shown that historical Document Image Analysis (DIA) can be effectively applied to extract layout structures and textual transcriptions, and in the current HisDoc 2.0 project we successfully retrieved additional paleographic information. The novel contributions of HisDoc III will be complemented by these methods to cope with large document collections.

Existing methods are largely based on supervised learning and thus require an extensive amount of labeled training data. Therefore, they are not directly applicable to classifying collections of heterogeneous manuscripts with a large variety of layout structures, textual content, degradation traces, and other artifacts. While this problem is already relevant for homogeneously digitized books, it becomes even more crucial for isolated pages and the tremendous amount of as yet uncataloged and unexplored fragments distributed over many libraries around the world.

The objective of HisDoc III is twofold: (i) fundamental research on combined text- and image-based classification methods and (ii) making developed technology useful for libraries, archives, and researchers in the Humanities. Firstly, for the classification of documents we will study novel deep learning methods for large amounts of unlabeled text and image data. These methods will be complemented by structural approaches based on document graphs. For the combination of these diverse approaches we will investigate Multiple Classifier Systems on the one hand and integrated neural network architectures on the other. Secondly, we will combine three ideas for making methods useful for libraries: (i) novel means for reducing the needed amount of ground truth by unsupervised machine learning and, alternatively, bootstrapping combined with active learning; (ii) intuitive computer-assisted presentation and annotation tools; and (iii) making our systems publicly available as Web services.

To demonstrate the suitability of the HisDoc III research results, we will design novel computer-assisted workflows in collaboration with an advisory board composed of scholars, librarians, and archivists. A particular focus is speeding up the generation of catalog and database entries and devising ways to present methods and results in an understandable way.

In HisDoc III, we formulate novel research ideas, solve fundamental problems in DIA, and make innovative tools and services available for the research community. We expect this project to become a catalyst for the development of innovative solutions for the Digital Humanities.