MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision-Language Models
| dc.Affiliation | October University for Modern Sciences and Arts (MSA) | |
| dc.contributor.author | Seif Ahmed | |
| dc.contributor.author | Mohamed T. Younes | |
| dc.contributor.author | Abdelrahman Moustafa | |
| dc.contributor.author | Abdulrahman Allam | |
| dc.contributor.author | Hamza Moustafa | |
| dc.date.accessioned | 2025-10-26T11:15:36Z | |
| dc.date.issued | 2025-09-09 | |
| dc.description | SJR 2024: 0.166; H-Index: 69 | |
| dc.description.abstract | We present a robust ensemble-based system for multilingual multimodal reasoning, designed for the ImageCLEF 2025 EXAMS-V challenge. Our approach integrates Gemini 2.5 Flash for visual description, Gemini 1.5 Pro for caption refinement and consistency checks, and Gemini 2.5 Pro as a reasoner which handles final answer selection, all coordinated through carefully engineered few-shot and zero-shot prompts. We conducted an extensive ablation study, training several large language models (Gemini 2.5 Flash, Phi-4, Gemma-3, Mistral) on an English dataset and its multilingual augmented version. Additionally, we evaluated Gemini 2.5 Flash in a zero-shot setting for comparison and found it to substantially outperform the trained models. Prompt design also proved critical: enforcing concise, language-normalized formats and prohibiting explanatory text boosted model accuracy on the English validation set from 55.9% to 61.7%. On the official leaderboard, our system (Team MSA) achieved first place overall in the multilingual track with 81.4% accuracy, and led 11 out of 13 individual language tracks, with top results such as 95.07% for Croatian and 92.12% for Italian. These findings highlight that lightweight OCR–VLM ensembles, when paired with precise prompt strategies and cross-lingual augmentation, can outperform heavier end-to-end models in high-stakes, multilingual educational settings. | |
| dc.description.uri | https://www.scimagojr.com/journalsearch.php?q=21100218356&tip=sid&clean=0 | |
| dc.identifier.issn | 16130073 | |
| dc.identifier.uri | https://repository.msa.edu.eg/handle/123456789/6572 | |
| dc.language.iso | en_US | |
| dc.publisher | CEUR-WS | |
| dc.relation.ispartofseries | 26th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2025; Conference city: Madrid; 9 September - 12 September, 2025; Volume 4038; Pages 2275-2283; Conference code 213051 | |
| dc.subject | EXAMS-V 2025 Challenge | |
| dc.subject | ImageCLEF 2025 | |
| dc.subject | Large Language Models | |
| dc.subject | Multilingual QA | |
| dc.subject | Multimodal Reasoning | |
| dc.subject | Vision-Language Models | |
| dc.title | MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision-Language Models | |
| dc.type | Article |
