Project

Bootstrapping Handwriting Recognition Systems for Historical Documents

Applicant Fischer Andreas
Number 141453
Funding scheme Fellowships for prospective researchers
Research institution CENPARMI, Faculty of Engineering and Computer Science, Concordia University
Institution of higher education Institution abroad - IACH
Main discipline Mathematics
Start/End 01.10.2012 - 30.09.2013

Keywords (5)

pattern recognition; historical document analysis; handwriting recognition; bootstrapping; confidence models

Lay Summary (English)

The objective of the project is to provide tools to support cultural heritage preservation of handwritten historical documents. Automatic handwriting recognition is needed to access the content of scanned documents and hence make the manuscripts amenable to browsing and searching in digital libraries. Once integrated into digital libraries, the world's cultural heritage would become readily available to researchers and the public.

With millions of manuscript pages to be transcribed into machine-readable text, only minimal human interaction can be afforded for bootstrapping recognition systems, i.e., for building recognizers from scratch for a given script and language. In this project, algorithmic solutions for effective and efficient bootstrapping are investigated that aim at providing access to the manuscript content with little human effort. Bootstrapping can be considered a key element on the road towards mass digitization of handwritten historical documents.

In the bootstrapping phase, only a small number of learning samples is initially available to train a recognition system. Hence, errors are to be expected and confidence models are needed to identify reliable results without human interaction. The project investigates such confidence models as a core component needed for diverse applications including semi-supervised learning, multiple classifier systems, and keyword spotting.
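
As a rough illustration of how a confidence model can support semi-supervised learning during bootstrapping, the sketch below runs a simple self-training loop: a recognizer trained on a few labeled samples labels the unlabeled ones, and only predictions above a confidence threshold are added to the training set. The nearest-centroid classifier, the softmax-style confidence score, the threshold of 0.8, and all function names are assumptions made for this example; they are not the methods investigated in the project.

```python
import math


def train_centroids(samples):
    """Fit a nearest-centroid classifier: one mean feature vector per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_with_confidence(centroids, features):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    scores = {
        label: math.exp(-math.dist(features, centroid))
        for label, centroid in centroids.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Grow the training set with confidently recognized samples only.

    Returns the final model and the samples that stayed below the threshold.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        centroids = train_centroids(labeled)
        accepted, rejected = [], []
        for features in pool:
            label, confidence = predict_with_confidence(centroids, features)
            target = accepted if confidence >= threshold else rejected
            target.append((features, label))
        if not accepted:
            break  # nothing reliable left to add
        labeled.extend(accepted)
        pool = [features for features, _ in rejected]
    return train_centroids(labeled), pool


# Toy usage: two labeled samples, three unlabeled ones (hypothetical data).
seed = [((0.0, 0.0), "a"), ((10.0, 10.0), "b")]
unknown = [(0.5, 0.2), (9.5, 9.8), (5.0, 5.0)]
model, still_unlabeled = self_train(seed, unknown)
```

In this toy setup, samples close to one of the seed classes are absorbed into the training set, while an ambiguous sample stays unlabeled and could be passed to a human transcriber instead.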

Last update: 21.02.2013

Publications

Publication
A binarization-free clustering approach to segment curved text lines in historical manuscripts
Garz Angelika, Fischer Andreas, Bunke Horst, Ingold Rolf (2013), A binarization-free clustering approach to segment curved text lines in historical manuscripts, in Proc. 12th Int. Conf. on Document Analysis and Recognition, Washington, DC, USA, IEEE Computer Society Press, Los Alamitos, CA.
A discriminative approach to on-line handwriting recognition using bi-character models
Prum Sophea, Visani Muriel, Fischer Andreas, Ogier Jean-Marc (2013), A discriminative approach to on-line handwriting recognition using bi-character models, in Proc. 12th Int. Conf. on Document Analysis and Recognition, Washington, DC, USA, IEEE Computer Society Press, Los Alamitos, CA.
A fast matching algorithm for graph-based handwriting recognition
Fischer Andreas, Suen Ching Y., Frinken Volkmar, Riesen Kaspar, Bunke Horst (2013), A fast matching algorithm for graph-based handwriting recognition, in Proc. 9th Int. Workshop on Graph-Based Representations in Pattern Recognition, Vienna, Austria, Springer, Berlin, Heidelberg.
Generation of learning samples for historical handwriting recognition using image degradation
Fischer Andreas, Kieu Van Cuong, Visani Muriel, Suen Ching Y. (2013), Generation of learning samples for historical handwriting recognition using image degradation, in Proc. 2nd Int. Workshop on Historical Document Imaging and Processing, Washington, DC, USA, ACM, New York, NY.
Handwriting recognition in historical documents using very large vocabularies
Frinken Volkmar, Fischer Andreas, Martinez-Hinarejos Carlos D. (2013), Handwriting recognition in historical documents using very large vocabularies, in Proc. 12th Int. Conf. on Document Analysis and Recognition, Washington, DC, USA, ACM, New York, NY.
Improving HMM-based keyword spotting with character language models
Fischer Andreas, Frinken Volkmar, Bunke Horst, Suen Ching Y. (2013), Improving HMM-based keyword spotting with character language models, in Proc. 12th Int. Conf. on Document Analysis and Recognition, Washington, DC, USA, IEEE Computer Society Press, Los Alamitos, CA.
Keyword spotting for self-training of BLSTM NN based handwriting recognition systems
Frinken Volkmar, Fischer Andreas, Baumgartner Markus, Bunke Horst, Keyword spotting for self-training of BLSTM NN based handwriting recognition systems, in Pattern Recognition.

Collaboration

Group / person Rolf Ingold, Document, Image and Voice Analysis (DIVA), University of Fribourg
Country Switzerland (Europe)
Types of collaboration
- in-depth/constructive exchanges on approaches, methods or results
- Publication

Group / person Kaspar Riesen, professor at the University of Applied Sciences and Arts Northwestern Switzerland
Country Switzerland (Europe)
Types of collaboration
- in-depth/constructive exchanges on approaches, methods or results
- Publication

Group / person Jean-Marc Ogier, Laboratoire Informatique, Image et Interaction (L3i), University of La Rochelle
Country France (Europe)
Types of collaboration
- in-depth/constructive exchanges on approaches, methods or results
- Publication

Group / person Josep Llados, Computer Vision Center (CVC), Autonomous University of Barcelona
Country Spain (Europe)
Types of collaboration
- in-depth/constructive exchanges on approaches, methods or results
- Publication

Scientific events

Active participation

Title 9th Int. Workshop on Graph-based Representations in Pattern Recognition
Type of contribution Talk given at a conference
Title of article or contribution A fast matching algorithm for graph-based handwriting recognition
Date 15.05.2013
Place Vienna, Austria
Persons involved Fischer Andreas


Awards

Title Concordia University Conference and Exposition Award
Year 2013

Associated projects

Number 151279
Title Automatic Handwriting Recognition and Writer Identification based on the Kinematic Theory
Start 01.04.2014
Funding scheme Advanced Postdoc.Mobility

Abstract

In order to preserve the cultural heritage conveyed in handwritten historical manuscripts, libraries all around the world digitize large numbers of books and letters. Storing scanned or photographed manuscript images saves the world's cultural heritage from being lost due to paper and parchment degradation. After digitization, however, libraries face a challenging and as yet unsolved problem: making the textual content of millions of document images readily accessible to researchers and the public alike. The goal is to create digital libraries with search engines, similar to those on the internet, that allow for searching and browsing the manuscripts based on their textual content. To that end, the historical scripts have to be transcribed into computer-readable text. Considering the large number of document images, manual transcription is not feasible within reasonable time. Instead, there is a growing interest in pattern recognition methods that allow for automatic transcription.

Nowadays, several state-of-the-art handwriting recognition systems exist that have a high potential to solve the task of automatic transcription. However, they require large amounts of learning samples for each and every historical script in order to model characters, words, and sentences accurately. Such learning samples typically consist of text line or word images together with their correct transcription. Their retrieval from manuscript images is difficult and involves time-consuming manual interaction. Hence, in the context of mass digitization, only few learning samples can be provided initially for each historical script within reasonable time. During this bootstrapping phase, only weak recognition systems can be realized, and these are prone to recognition errors.

In this project, the problem of bootstrapping is investigated. The main question addressed is how to employ weak recognition systems effectively during bootstrapping with little or no human interaction. Algorithmic solutions for bootstrapping can be considered a key element in making existing handwriting recognition systems from current research applicable to the real-world problem of mass digitization of historical manuscripts, which does not allow for extensive, time-consuming manual interaction.