An unsupervised linguistic‑based model for automatic glossary term extraction from a single PDF textbook

Soliman, Ashraf

doi:https://doi.org/10.1007/s10639-023-11818-1

Files

s10639-023-11818-1.pdf (1.59 MB)

Date

2023-05

Authors

Soliman, Ashraf

Series Info

Education and Information Technologies;

Doi

https://doi.org/10.1007/s10639-023-11818-1

Scientific Journal Rankings

https://www.scimagojr.com/journalsearch.php?q=144955&tip=sid&clean=0

Abstract

Term extraction from textbooks is the cornerstone of many different intelligent natural language processing systems, especially those that support learners and educators in the education system. This paper proposes a novel unsupervised domain-independent model that automatically extracts relevant and domain-related key terms from a single PDF textbook, without relying on a statistical technique or external knowledge base. It only relies on the basic linguistic techniques of the natural language processing: pattern recognition, sentence tokenization, part-of-speech tagging, and chunking. The model takes a PDF textbook as an input and produces a list of key terms as an output. Furthermore, the model proposes a novel classification of sentences from which the concept of defining sentences is proposed. The defining sentences are the main textual units that the model revolves around to identify the key terms. The architecture of the proposed work consists of 21 processes distributed across three phases. The first phase consists of five processes for extracting text from a PDF textbook and cleaning it for the next phases. The second phase consists of eight processes for identifying the defining sentences and extracting them from all the textbook’s sentences. The last phase consists of eight processes for identifying and extracting the key terms from every defining sentence. The proposed work was evaluated by two experiments in which two PDF textbooks from different fields are used. The experimental evaluation showed that the results were promising.
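As a rough illustration of the idea of a defining sentence, the following Python sketch splits already-extracted text into sentences, keeps those that open with a definitional cue phrase, and takes the phrase before the cue as a candidate key term. It is not the paper's 21-process model: the cue phrases, the pronoun filter, and the names `DEFINITION_CUE`, `split_sentences` and `extract_terms` are all assumptions made for this example, and a regular expression stands in for the part-of-speech tagging and chunking that the abstract mentions.

```python
import re

# Hypothetical cue phrases that mark a "defining sentence". The abstract does
# not list the paper's actual patterns, so these are illustrative assumptions.
DEFINITION_CUE = re.compile(
    r"^(?:(?:an?|the)\s+)?"                      # optional leading article
    r"(?P<term>[A-Za-z][A-Za-z\- ]{1,60}?)\s+"   # candidate term (shortest match)
    r"(?:is defined as|is called|refers to|is an?|is the)\s+",
    re.IGNORECASE,
)

# Sentence subjects that are never useful glossary terms (assumed filter).
PRONOUNS = {"it", "this", "that", "these", "those", "they", "he", "she"}


def split_sentences(text):
    """Naive sentence tokenizer: split after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_terms(text):
    """Return candidate key terms found at the start of defining sentences."""
    terms = []
    for sentence in split_sentences(text):
        match = DEFINITION_CUE.match(sentence)
        if not match:
            continue  # not a defining sentence under the assumed cues
        term = match.group("term").strip().lower()
        if term not in PRONOUNS and term not in terms:
            terms.append(term)
    return terms
```

Calling `extract_terms` on text taken from a PDF returns the candidate terms in order of first appearance. A pattern-only approach like this will miss definitions phrased in other ways and will accept some non-terms, which is the kind of gap that tagging and chunking are meant to close.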

Keywords

Term extraction · Linguistic techniques · Natural language processing · Unsupervised machine learning · PDF textbooks · Intelligent Tutoring

URI

http://repository.msa.edu.eg/xmlui/handle/123456789/5571

Collections

Faculty Of Management Sciences Research Paper
