Research
Learning from unlabelled or partly labelled data is a major challenge for machine learning in
general. This project will produce insights and methods for semi- and unsupervised learning that
can be applied both to other languages and completely different application domains. More
specifically, the project will contribute new knowledge to the following domains:
- Automatic model adaptation: The model will be adapted automatically to new types of
document, handwriting, or content by integrating a self-supervised adaptation process into
the recognition process. Unsupervised learning methods for document layout analysis, HTR,
and NER have been used with some success to enhance searchability and readability in larger
languages, such as English and French, and will here be applied for the first time to
historical handwritten Norwegian.
- Unsupervised quality estimation with confidence metrics: Unsupervised quality metrics will be
developed for document analysis, handwritten text recognition, and named-entity extraction.
Existing implementations have so far been only partly successful and have never been applied
to Norwegian documents.
- Modelling complex linguistic variations: The baseline recognition system proposed for the
HUGIN-MUNIN system is based on subwords, so that it is possible to recognize out-of-lexicon
words. This is particularly important for Norwegian, with two major written variants,
substantial changes over the last two centuries, and extensive use of compounding.
Out-of-lexicon word spotting and zero-shot word recognition will also be explored for the
first time in the context of Norwegian historical documents. This will help in retrieving and
indexing document images even on the basis of word classes that were not included in the
training set. This is another approach with the potential to further increase access to the
huge digital repositories of historical Norwegian documents.
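
To illustrate why a subword-based system can handle out-of-lexicon words, the following is a minimal sketch of greedy longest-match segmentation. It is not the HUGIN-MUNIN implementation: the function name `segment`, the toy inventory `SUBWORDS`, and the example compound are assumptions chosen for illustration only, and a real system would learn its subword units from data.

```python
def segment(word, vocab):
    """Split `word` into subword units from `vocab` by greedy longest match.

    Returns the list of pieces, or None if some position cannot be covered.
    Greedy matching is a simplification: it can miss a valid segmentation
    that a search with backtracking would find.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No subword in the inventory matches at position i.
            return None
    return pieces


# Hypothetical toy inventory; not taken from the project.
SUBWORDS = {"fiske", "båt", "eier", "hus"}

# The compound itself is not in the inventory, but its parts are.
print(segment("fiskebåteier", SUBWORDS))
```

Here the compound is absent from the inventory as a whole word, yet it can still be composed from known units, which is the property that makes subword modelling attractive for a language with extensive compounding.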