Constructing a public meeting corpus

Abstract

In this paper, we propose a method for constructing a large corpus covering about a century of public meetings reported in historical Australian newspapers, and we analyze the resulting corpus. The construction method is based on image processing and Optical Character Recognition (OCR): we digitize and transcribe newspaper texts on the specific topic of public meetings. Experiments show that the proposed method achieves an F-score of 71.5% with a high recall of 97.5% for corpus construction. The resulting corpus feeds a content search tool for temporal and semantic content analysis.
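The abstract reports an F-score and a recall but no precision. As a rough illustration only, the sketch below back-solves the precision those two figures would imply, assuming the F-score is the balanced F1 measure (the harmonic mean of precision and recall). The abstract does not state which F-measure was used, and the function name `implied_precision` is hypothetical.

```python
def implied_precision(f_score: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and R.

    Assumption: the reported F-score is the balanced F1 measure.
    """
    denominator = 2 * recall - f_score
    if denominator <= 0:
        raise ValueError("F-score and recall are inconsistent with an F1 measure")
    return f_score * recall / denominator


# Figures taken from the abstract: F-score 71.5%, recall 97.5%.
precision = implied_precision(0.715, 0.975)
print(f"implied precision: {precision:.3f}")
```

Under that assumption the implied precision is roughly 56%, which would mean the method favors recall: it retrieves nearly all public-meeting texts at the cost of admitting a fair number of false positives.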

Paper type
Published in
Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020)
Intelligence and Sensing Lab.

We conduct research in fields such as computer vision, computational photography, pattern recognition, natural language processing, and machine learning.