Extracting Structured Information From Academic Documents

Date

2022-04-11

ORCID

0000-0001-9061-8004

Type

Thesis

Degree Level

Masters

Abstract

Document information extraction is a broad and challenging topic. With a growing number of academic documents published online each year, a large amount of knowledge remains trapped in static files. Document information extraction can be applied to these information-rich documents to extract and structure that knowledge automatically. The process begins with document layout analysis (DLA), where document objects like titles and headings are identified. Specific text regions are then used as input for further information extraction, reducing the noise and complexity for downstream methods. In this thesis, a new DLA dataset is proposed. This Dense Article Dataset (DAD) provides annotations for 42 document objects, allowing deep learning models to be trained for precise and fine-grained layout analysis. A new bounding box regression method is proposed and used with several popular segmentation networks. The results show that the approach increases accuracy when labeling document objects. Furthermore, models trained on DAD can be used to accelerate the annotation of more data, paving the way for future expansion of DAD and more robust models. With the DLA task complete, the focus moves to textual analysis to better understand document contents. Specifically, a Descriptive Relation Dataset (DReD) is proposed, which is used to train models to describe the relationship between two noun phrases. Previous relation extraction works rely on classification, but narrow categories limit the amount of relevant information extracted. Several state-of-the-art sequence generation models are trained using DReD and demonstrate the ability to model relation descriptions. Existing datasets are also modified to include relation descriptions so that the models can be compared to related works. The models trained to predict relation descriptions achieve results competitive with typical classification networks, further demonstrating that describing relations rather than classifying them is a viable approach.

Keywords

computer vision, natural language processing, transformers, datasets, information extraction

Degree

Master of Science (M.Sc.)

Department

Electrical and Computer Engineering

Program

Electrical Engineering
