Constructing a public meeting corpus


In this paper, we propose a method for constructing a large corpus covering about a century of public meetings reported in historical Australian newspapers, and we analyze the resulting corpus. The construction method is based on image processing and Optical Character Recognition (OCR): we digitize and transcribe newspaper texts on the specific topic of public meetings. Experiments show that the proposed method achieves an F-score of 71.5% with a high recall of 97.5% for corpus construction. The resulting corpus feeds a content search tool for temporal and semantic content analysis.
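To make the relation between the two reported figures concrete, the following is a minimal sketch that assumes the reported F-score is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state which variant is used. Under that assumption, the precision implied by the reported F-score and recall can be recovered by solving F1 = 2PR / (P + R) for P. The function names `f1_score` and `implied_precision` are illustrative only and are not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denominator


# Figures reported in the abstract (assumed here to be F1 and recall).
reported_f1 = 0.715
reported_recall = 0.975

precision = implied_precision(reported_f1, reported_recall)
print(f"implied precision: {precision:.3f}")  # roughly 0.56 under the F1 assumption
print(f"round-trip F1:     {f1_score(precision, reported_recall):.3f}")
```

If the assumption holds, the method favors recall: nearly all relevant articles are retrieved, at the cost of a noticeably lower precision.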

Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020)
Intelligence and Sensing Lab.