MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision-Language Models

dc.Affiliation: October University for Modern Sciences and Arts (MSA)
dc.contributor.author: Seif Ahmed
dc.contributor.author: Mohamed T. Younes
dc.contributor.author: Abdelrahman Moustafa
dc.contributor.author: Abdulrahman Allam
dc.contributor.author: Hamza Moustafa
dc.date.accessioned: 2025-10-26T11:15:36Z
dc.date.issued: 2025-09-09
dc.description: SJR 2024: 0.166; H-Index: 69
dc.description.abstract: We present a robust ensemble-based system for multilingual multimodal reasoning, designed for the ImageCLEF 2025 EXAMS-V challenge. Our approach integrates Gemini 2.5 Flash for visual description, Gemini 1.5 Pro for caption refinement and consistency checks, and Gemini 2.5 Pro as the reasoner that handles final answer selection, all coordinated through carefully engineered few-shot and zero-shot prompts. We conducted an extensive ablation study, training several large language models (Gemini 2.5 Flash, Phi-4, Gemma-3, Mistral) on an English dataset and on its multilingual augmented version. For comparison, we also evaluated Gemini 2.5 Flash in a zero-shot setting and found that it substantially outperformed the trained models. Prompt design also proved critical: enforcing concise, language-normalized answer formats and prohibiting explanatory text raised accuracy on the English validation set from 55.9% to 61.7%. On the official leaderboard, our system (Team MSA) achieved first place overall in the multilingual track with 81.4% accuracy and led 11 of the 13 individual language tracks, with top results such as 95.07% for Croatian and 92.12% for Italian. These findings highlight that lightweight OCR–VLM ensembles, when paired with precise prompt strategies and cross-lingual augmentation, can outperform heavier end-to-end models in high-stakes, multilingual educational settings.
dc.description.uri: https://www.scimagojr.com/journalsearch.php?q=21100218356&tip=sid&clean=0
dc.identifier.issn: 16130073
dc.identifier.uri: https://repository.msa.edu.eg/handle/123456789/6572
dc.language.iso: en_US
dc.publisher: CEUR-WS
dc.relation.ispartofseries: 26th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2025; Conference city: Madrid; 9 September - 12 September, 2025; Volume 4038; Pages 2275 - 2283; Conference code 213051
dc.subject: EXAMS-V 2025 Challenge
dc.subject: ImageCLEF 2025
dc.subject: Large Language Models
dc.subject: Multilingual QA
dc.subject: Multimodal Reasoning
dc.subject: Vision-Language Models
dc.title: MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision-Language Models
dc.type: Article
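
The abstract describes a three-stage flow (visual description, caption refinement, final answer selection) and a strict answer format with no explanatory text. The following is a minimal sketch of that flow, not the authors' implementation. The function names, the prompt wording, the `image_ref` string placeholder, and the Cyrillic-to-Latin option mapping are all assumptions made for illustration. The three model stages are passed in as plain callables so the sketch runs without any API access.

```python
from typing import Callable, Optional

# Hypothetical type: any function that maps a prompt string to a model reply.
ModelFn = Callable[[str], str]

# Assumed positional mapping of Cyrillic option letters to Latin ones
# (illustrative only; the record does not specify the normalization rules).
CYRILLIC_TO_LATIN = {"А": "A", "Б": "B", "В": "C", "Г": "D", "Д": "E"}


def normalize_answer(reply: str, options: str = "ABCDE") -> Optional[str]:
    """Accept a reply only if it is a single option letter, else return None."""
    # Drop surrounding whitespace and common wrapping punctuation, e.g. "(b)."
    cleaned = reply.strip().strip("().:;,[] \t\n").upper()
    cleaned = CYRILLIC_TO_LATIN.get(cleaned, cleaned)
    return cleaned if len(cleaned) == 1 and cleaned in options else None


def answer_question(
    image_ref: str, describe: ModelFn, refine: ModelFn, reason: ModelFn
) -> Optional[str]:
    """Run the describe -> refine -> reason stages and normalize the result."""
    caption = describe(f"Describe the exam question in this image: {image_ref}")
    refined = refine(f"Check and refine this description for consistency:\n{caption}")
    reply = reason(
        "Answer with a single option letter and no explanation.\n" + refined
    )
    return normalize_answer(reply)


if __name__ == "__main__":
    # Stub stages standing in for the three models named in the abstract.
    describe = lambda prompt: "Question: 2 + 2 = ? Options: A) 3 B) 4"
    refine = lambda prompt: prompt.split("\n", 1)[1]
    reason = lambda prompt: " (b). "
    print(answer_question("q1.png", describe, refine, reason))  # prints B
```

Rejecting any reply that is not a bare option letter mirrors the "no explanatory text" constraint: a verbose reply returns `None` and can be retried or counted as unanswered, instead of being parsed by guesswork.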

Files

Original bundle

Name: 2507.11114v1.pdf
Size: 1.78 MB
Format: Adobe Portable Document Format
