An unsupervised linguistic‑based model for automatic glossary term extraction from a single PDF textbook
Date
2023-05
Type
Article
Series Info
Education and Information Technologies
Abstract
Term extraction from textbooks is the cornerstone of many different intelligent natural
language processing systems, especially those that support learners and educators
in the education system. This paper proposes a novel unsupervised, domain-independent
model that automatically extracts relevant, domain-related key terms
from a single PDF textbook without relying on statistical techniques or an external
knowledge base. It relies only on basic linguistic techniques of natural language
processing: pattern recognition, sentence tokenization, part-of-speech tagging,
and chunking. The model takes a PDF textbook as input and produces a list
of key terms as output. Furthermore, the model proposes a novel classification of
sentences, from which the concept of defining sentences is derived. Defining
sentences are the main textual units the model uses to identify the
key terms. The architecture of the proposed work consists of 21 processes distributed
across three phases. The first phase consists of five processes for extracting text
from a PDF textbook and cleaning it for the next phases. The second phase consists
of eight processes for identifying the defining sentences and extracting them from
all the textbook’s sentences. The last phase consists of eight processes for identifying
and extracting the key terms from every defining sentence. The proposed work
was evaluated in two experiments using two PDF textbooks from different fields.
The experimental evaluation showed that the results were promising.
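The second and third phases described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not list the actual patterns or the 21 processes, so the cue phrases, the `NON_TERMS` stoplist, and the function names below are all hypothetical. A regular expression and a small stoplist stand in for part-of-speech tagging and chunking, and PDF text extraction (the first phase) is omitted, so the input is assumed to be clean text.

```python
import re

# Hypothetical cue phrases for "defining sentences"; the paper's real
# patterns are not given in the abstract.
DEFINING_PATTERN = re.compile(
    r"^(?:(?:An?|The)\s+)?"                  # optional leading article
    r"(?P<term>[A-Za-z][A-Za-z\- ]*?)\s+"    # candidate term (shortest match)
    r"(?:is defined as|is called|refers to|is an?)\s+",
    re.IGNORECASE,
)

# Crude stand-in for part-of-speech filtering: reject pronoun subjects.
NON_TERMS = {"it", "this", "that", "there", "he", "she", "they"}


def split_sentences(text):
    """Naive sentence tokenizer: split on whitespace after ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_terms(text):
    """Return candidate key terms found at the start of defining sentences."""
    terms = []
    for sentence in split_sentences(text):
        match = DEFINING_PATTERN.match(sentence)
        if not match:
            continue  # not a defining sentence under these assumed cues
        term = match.group("term").strip().lower()
        if term not in NON_TERMS and term not in terms:
            terms.append(term)
    return terms


if __name__ == "__main__":
    sample = (
        "A stack is a collection with last-in, first-out access. "
        "It is a common structure. "
        "Recursion refers to a function calling itself."
    )
    print(extract_terms(sample))  # prints ['stack', 'recursion']
```

A fuller version would presumably replace the regular expression with a real tagger and chunker so that multi-word noun phrases are delimited by grammar instead of by cue-phrase position.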
Keywords
Term extraction, Linguistic techniques, Natural language processing, Unsupervised machine learning, PDF textbooks, Intelligent Tutoring