Constructing a public meeting corpus

May 2020

Abstract

In this paper, we propose a method for constructing a large corpus covering about a century of public meetings reported in historical Australian newspapers, and we analyze the resulting corpus. The construction method is based on image processing and Optical Character Recognition (OCR). We digitize and transcribe texts on the specific topic of public meetings. Experiments show that the proposed method achieves an F-score of 71.5% with a high recall of 97.5% for corpus construction. The resulting corpus feeds a content search tool for temporal and semantic content analysis.

Type

Conference paper

Publication

Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020)

Intelligence and Sensing Lab.