Semantic Fusion of Text and Images: A Novel Multimodal-RAG Framework for Document Analysis
Conference: ICUMT 2024 - 16th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops
26-28 November 2024, Meloneras, Gran Canaria, Spain
Proceedings: ICUMT 2024
Pages:
Language: English
Type: PDF
Authors:
Nandi, Tuhina; Gupta, Sidharth; Kaushal, Abhishek; Burget, Radim; Jezek, Stepan; Dutta, Malay Kishore
Abstract:
The presented multimodal Retrieval-Augmented Generation (RAG) model combines FAISS with Gemini 1.5 Flash to rapidly retrieve and synthesize information from the text and image content of PDFs. The approach enables high-accuracy search within a unified vector space by maintaining distinct FAISS indices for text and images, while text-embedding-004 embeds both queries and stored data. When a user submits a query, FAISS retrieves the most relevant text chunks and images based on similarity metrics; these are then passed to the Gemini model, which produces a coherent and informative response. A recursive text-chunking technique handles token limitations, increasing processing performance and reducing redundancy. The framework excels at multimodal synthesis, extracting valuable insights from both textual and visual content while remaining computationally efficient, and it targets domains that require integrated analysis across several data sources, providing a comprehensive yet streamlined information-retrieval solution.
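The abstract does not include code; the following is a minimal Python sketch of the text-side pipeline it describes (recursive chunking, text-embedding-004 embeddings, a FAISS index, and answer synthesis with Gemini 1.5 Flash), assuming the google-generativeai SDK and faiss-cpu. The file name, chunk size, top-k value, prompt wording, and index type are illustrative assumptions, not the authors' settings; the image index would be built analogously from image embeddings.

```python
# Sketch only: chunk sizes, k, index type, and file names are assumptions,
# not the authors' configuration. Requires: google-generativeai, faiss-cpu, numpy.
import numpy as np
import faiss
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

def recursive_chunks(text, max_chars=2000, separators=("\n\n", "\n", ". ", " ")):
    """Recursively split text on progressively finer separators until chunks fit."""
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # No separators left: fall back to fixed-size slices.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    # Recurse into any chunk that is still too large.
    return [c2 for c in chunks for c2 in recursive_chunks(c, max_chars, rest)]

def embed(texts, task_type):
    """Embed a list of strings with text-embedding-004."""
    vecs = [genai.embed_content(model="models/text-embedding-004",
                                content=t, task_type=task_type)["embedding"]
            for t in texts]
    return np.asarray(vecs, dtype="float32")

# 1. Chunk the extracted PDF text and build a FAISS index over the chunk embeddings.
chunks = recursive_chunks(open("document.txt").read())
doc_vecs = embed(chunks, task_type="retrieval_document")
index = faiss.IndexFlatL2(doc_vecs.shape[1])
index.add(doc_vecs)

# 2. Embed the query and retrieve the most similar chunks.
query = "What does the document conclude about energy usage?"
_, ids = index.search(embed([query], task_type="retrieval_query"), 4)
context = "\n\n".join(chunks[i] for i in ids[0])

# 3. Hand the retrieved context (and, in the full system, retrieved images)
#    to Gemini 1.5 Flash to synthesize the answer.
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    f"Answer the question using only this context:\n{context}\n\nQuestion: {query}")
print(response.text)
```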