Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection
Publisher
Springer International Publishing AG
Series Info
Lecture Notes on Data Engineering and Communications Technologies; Volume 293, Pages 170–179
Scientific Journal Rankings
Abstract
This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection, inspired by Condorcet’s Jury Theorem (CJT). Unlike conventional ensembles that often rely on homogeneous architectures, our approach combines five architecturally diverse small transformer-based large language models (sLLMs), namely BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while taking advantage of the unique biases of each model. A dual-weighted voting mechanism integrates both global credibility (validation F1-score) and local confidence (instance-level probability) to dynamically weight model contributions. Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1-score of 93.5%, surpassing state-of-the-art benchmarks and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, our ensemble of small LLMs proves more parameter-efficient and robust than models of up to 7B parameters, establishing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized natural language processing (NLP) tasks such as emotion detection.
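The dual-weighted voting idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact combination rule, so the code below assumes one plausible form: each model votes for its top class, and that vote is weighted by the product of the model's global credibility (its validation F1-score) and its local confidence (the probability it assigns to its top class for that instance). The function name `dual_weighted_vote`, the multiplicative rule, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def dual_weighted_vote(probs, credibility):
    """Combine per-model class probabilities into ensemble predictions.

    probs:       array of shape (n_models, n_samples, n_classes) holding each
                 model's softmax output.
    credibility: array of shape (n_models,) holding one global score per model
                 (assumed here to be its validation F1-score).

    Assumption: vote weight = credibility * top-class probability.
    """
    probs = np.asarray(probs, dtype=float)
    credibility = np.asarray(credibility, dtype=float)
    n_models, n_samples, n_classes = probs.shape

    votes = probs.argmax(axis=2)        # class chosen by each model, per sample
    confidence = probs.max(axis=2)      # probability of that chosen class
    weights = credibility[:, None] * confidence

    # Accumulate each model's weighted vote into a per-class score table.
    scores = np.zeros((n_samples, n_classes))
    rows = np.arange(n_samples)
    for m in range(n_models):
        scores[rows, votes[m]] += weights[m]
    return scores.argmax(axis=1)


# Toy data: 3 hypothetical models, 2 samples, 3 classes.
toy_probs = [
    [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25]],
    [[0.35, 0.40, 0.25], [0.10, 0.80, 0.10]],
    [[0.30, 0.40, 0.30], [0.30, 0.45, 0.25]],
]
toy_credibility = [0.93, 0.90, 0.88]
predictions = dual_weighted_vote(toy_probs, toy_credibility)
```

In this toy input, two of the three models pick class 1 for the first sample, but only weakly; the single confident vote for class 0 outweighs them, so the weighted ensemble returns class 0 where a plain majority vote would return class 1. For the second sample the two models voting for class 1 carry more combined weight, so class 1 is returned.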
Description
SJR 2025
0.119
Q4
H-Index
40
Subject Area and Category:
Computer Science
Computer Networks and Communications
Computer Science Applications
Information Systems
Engineering
Electrical and Electronic Engineering
Media Technology
Citation
Elgabry, M., & Hamdi, A. (2026). Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection. Lecture Notes on Data Engineering and Communications Technologies, 170–179. https://doi.org/10.1007/978-3-032-23035-5_16
