Research

Learning from unlabelled or partly labelled data remains a major challenge for machine learning in general. This project will produce insights and methods for semi-supervised and unsupervised learning that can be applied both to other languages and to entirely different application domains. More specifically, the project will contribute new knowledge in the following areas:

  • Automatic model adaptation: Automatic adaptation of the model to a new type of document, handwriting, or content will be developed by integrating a self-supervised adaptation process into the recognition process. Unsupervised learning methods for document layout analysis, HTR, and NER have been used with some success to enhance searchability and readability in larger languages, such as English and French, and will here be applied for the first time to handwritten historical Norwegian.
  • Unsupervised quality estimation with confidence metrics: Unsupervised quality metrics will be developed for document analysis, handwritten text recognition, and named-entity extraction. Existing implementations have so far been only partly successful, and none has been applied to Norwegian documents.
  • Modelling complex linguistic variations: The baseline recognition system proposed for HUGIN-MUNIN is based on subwords, which makes it possible to recognize out-of-lexicon words. This is particularly important for Norwegian, which has two major written variants, has changed substantially over the last two centuries, and makes extensive use of compounding. Out-of-lexicon word spotting and zero-shot word recognition will also be explored for the first time in the context of Norwegian historical documents. This will help in retrieving and indexing document images even for word classes that were not included in the training set, and so has the potential to further increase access to the large digital repositories of historical Norwegian documents.
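
To illustrate why a subword-based recognizer can cover words outside its lexicon, the sketch below splits Norwegian compounds into known units with a greedy longest-match rule. It is a toy illustration only and not the HUGIN-MUNIN method: the function name `segment`, the hand-picked `vocab`, and the greedy strategy are all assumptions made for the example, and a real system would more likely learn its subword units from data.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into units from `vocab`.

    Unknown stretches fall back to single characters, so every word
    receives some segmentation even if no vocabulary unit matches.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Hypothetical subword inventory, chosen by hand for this example.
vocab = {"kirke", "bok", "preste", "gård", "en"}

# Neither compound is in the inventory as a whole word,
# yet both are fully covered by known units.
print(segment("kirkebok", vocab))      # ['kirke', 'bok']
print(segment("prestegården", vocab))  # ['preste', 'gård', 'en']
```

Because the recognizer only needs to emit units it has seen, a compound that never occurred in training can still be transcribed as long as its parts did.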