An unsupervised linguistic‑based model for automatic glossary term extraction from a single PDF textbook

Date

2023-05

Type

Article

Series Info

Education and Information Technologies

Abstract

Term extraction from textbooks is the cornerstone of many different intelligent natural language processing systems, especially those that support learners and educators in the education system. This paper proposes a novel unsupervised, domain-independent model that automatically extracts relevant, domain-related key terms from a single PDF textbook without relying on a statistical technique or an external knowledge base. It relies only on basic linguistic techniques of natural language processing: pattern recognition, sentence tokenization, part-of-speech tagging, and chunking. The model takes a PDF textbook as input and produces a list of key terms as output. Furthermore, the model proposes a novel classification of sentences, from which the concept of defining sentences is derived. The defining sentences are the main textual units that the model revolves around to identify the key terms. The architecture of the proposed work consists of 21 processes distributed across three phases. The first phase consists of five processes for extracting text from a PDF textbook and cleaning it for the next phases. The second phase consists of eight processes for identifying the defining sentences and extracting them from all the textbook’s sentences. The last phase consists of eight processes for identifying and extracting the key terms from every defining sentence. The proposed work was evaluated in two experiments using two PDF textbooks from different fields. The experimental evaluation showed that the results were promising.
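
To make the idea of defining sentences more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the text has already been extracted from the PDF, and it replaces the part-of-speech tagging and chunking steps with plain regular expressions. The cue phrases, the pronoun filter, and the names `DEFINITION_CUES`, `split_sentences`, and `extract_key_terms` are all hypothetical, since the abstract does not list the actual patterns or processes.

```python
import re

# Hypothetical cue phrases that mark a "defining sentence". The abstract
# does not list the real patterns, so these are illustrative assumptions.
DEFINITION_CUES = r"(?:is defined as|is called|is known as|refers to|is an|is a|is the)"

# Candidate term = the words between an optional leading article and the cue.
DEFINING_PATTERN = re.compile(
    rf"^(?:(?:A|An|The)\s+)?(?P<term>[A-Za-z][A-Za-z\- ]*?)\s+{DEFINITION_CUES}\s+"
)

# Subjects that match the pattern but are never glossary terms (assumption).
PRONOUNS = {"it", "this", "that", "these", "those", "there", "he", "she", "they"}


def split_sentences(text):
    """Naive sentence tokenizer: split after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_key_terms(text, max_words=4):
    """Return candidate key terms found at the start of defining sentences."""
    terms = []
    for sentence in split_sentences(text):
        match = DEFINING_PATTERN.match(sentence)
        if match is None:
            continue  # not a defining sentence under these cue phrases
        term = match.group("term").strip().lower()
        if term in PRONOUNS or len(term.split()) > max_words:
            continue  # skip pronoun subjects and overly long candidates
        if term not in terms:
            terms.append(term)
    return terms


sample = (
    "A stack is a data structure that stores items in last-in, first-out order. "
    "It was raining. "
    "Binary search is defined as a method for finding an item in a sorted list."
)
print(extract_key_terms(sample))  # ['stack', 'binary search']
```

A regex-only approach like this misses definitions whose term does not open the sentence, which is one reason a model such as the one described would rely on part-of-speech tagging and chunking to locate noun phrases.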

Keywords

Term extraction, Linguistic techniques, Natural language processing, Unsupervised machine learning, PDF textbooks, Intelligent Tutoring
