Key Information Extraction (KIE) is a fundamental process in document processing that involves identifying and extracting essential data points from unstructured or semi-structured documents. In a world where businesses and organizations handle vast amounts of documents—such as contracts, invoices, receipts, and forms—automating the extraction of key information saves time, reduces manual effort, and enhances accuracy. KIE helps convert documents into structured data formats, making it easier to analyze, search, and utilize the information. Whether extracting names, dates, amounts, or other critical fields, KIE plays a crucial role in streamlining workflows, boosting productivity, and supporting decision-making in sectors like finance, healthcare, legal, and more.
The importance of KIE in document processing is growing as companies strive to optimize their data management practices, ensuring that valuable insights are quickly accessible without the need for extensive manual data entry or review. This blog will focus specifically on the applications of multimodal LLMs in Key Information Extraction, illustrating their transformative potential in enhancing document processing workflows.

(Image source: OCR-free Document Understanding Transformer)
Leveraging Multimodal LLMs for KIE without Fine-Tuning
Frontier LLMs with Multimodal capabilities, such as GPT-4, Claude, or Gemini, have revolutionized key information extraction by leveraging their robust document understanding and generation capabilities. Unlike traditional models, which often require explicit fine-tuning on specific datasets, these multimodal LLMs can process both text and images, enabling them to understand documents in a holistic way. This allows them to extract key information with minimal or no additional training, making them incredibly versatile for various document types, including scanned PDFs, forms, and structured or semi-structured text.
The strength of multimodal LLMs is their ability to interpret context, semantics, and relationships within the document content. For instance, when dealing with invoices, contracts, or medical records, these models can understand the document layout, correlate related information, and accurately identify key entities like names, dates, monetary values, etc. Their pre-trained knowledge, and the ability to generate human-like language, allow them to infer the necessary details without requiring task-specific adjustments. This makes multimodal LLMs highly effective for KIE tasks in document processing, offering a flexible and scalable solution across different industries.
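To make this concrete, below is a minimal sketch of a zero-shot extraction call using the OpenAI Python SDK. The model name, field names, and file path are placeholders, and other providers expose similar multimodal chat interfaces; this is an illustrative example, not the exact code used in our experiments.

```python
import base64
from openai import OpenAI  # assumes the OpenAI Python SDK (v1.x) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_fields(image_path: str, fields: list[str]) -> str:
    """Ask a multimodal model to pull the requested fields from a document image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "Extract the following fields from the document and return them as JSON "
        f"with exactly these keys: {', '.join(fields)}. "
        "Use null for any field that is not present."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        temperature=0,
    )
    return response.choices[0].message.content

# Example (hypothetical invoice fields):
# extract_fields("invoice.png", ["vendor_name", "invoice_date", "total_amount"])
```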
Experimenting with KIE
In our exploration of Key Information Extraction (KIE) using multimodal LLMs, we tested three different approaches to evaluate how well these methods can extract key information from documents. Below are the details of each approach, including the prompts used and the evaluation metrics.
| | Approach 1 - KIE as Visual Question Answering (VQA) | Approach 2 - Asking for Specific Keys in a Single Request | Approach 3 - Extracting All Information in Key-Value Pairs |
|---|---|---|---|
| Definition | We treated KIE as a Visual Question Answering (VQA) task, asking the model questions about the entities we wanted to extract from the document. For every entity to be extracted, a separate question was asked, within a single LLM call. | We consolidated the requests by asking the model to extract multiple key-value pairs in one go and return them in JSON format. To ensure the generated JSON response was valid, we used an additional LLM layer for "JSON correction," which validated and corrected the format if needed. | We asked the model to extract all available information from the document in a key-value format, allowing for a more comprehensive extraction of data. As in Approach 2, an additional LLM layer validated and corrected the generated JSON response to ensure it was in the correct format. |
| Sample Prompts Used | | | |
| Evaluation Metric | F1 Score, ANLS | F1 Score | F1 Score |
Each of these approaches provides different insights into the capabilities of multimodal LLMs in handling KIE tasks. By evaluating their performance across various metrics, we can better understand which method is more effective for different document processing scenarios.
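To make the three approaches concrete, the sketch below shows one plausible prompt style per approach. These are illustrative reconstructions rather than the exact prompts used in our experiments, and `ask_model` is a placeholder for a single multimodal LLM call (such as the one sketched earlier); the per-key questions in Approach 1 can be issued one per call or batched together.

```python
def approach_1_vqa(image, keys, ask_model):
    """Approach 1: one question per entity, framed as visual QA."""
    answers = {}
    for key in keys:
        question = f"What is the {key} in the given document?"
        answers[key] = ask_model(image, question)
    return answers

def approach_2_specific_keys(image, keys, ask_model):
    """Approach 2: request all target keys at once as JSON."""
    prompt = (
        "Extract the following keys from the document and return a JSON object "
        f"with exactly these keys: {', '.join(keys)}."
    )
    return ask_model(image, prompt)

def approach_3_extract_all(image, ask_model):
    """Approach 3: extract every key-value pair the model can find."""
    prompt = (
        "Extract all key information present in this document and return it "
        "as a flat JSON object of key-value pairs."
    )
    return ask_model(image, prompt)
```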
Multimodal LLMs Considered for KIE
In our experiments, we explored both open-source and API-based multimodal LLMs to assess their performance in Key Information Extraction (KIE) tasks.
Open-Source Models:
- Doc-Owl1.5-Chat
- Idefics2
- LLava-Next
- MiniCPM-2.5
API Models:
- GPT-4 Vision (GPT-4V)
- GPT-4o (GPT-4 Omni)
- Gemini 1.5 Pro
Datasets Used for KIE Experiments
To evaluate the performance of different multimodal LLMs in Key Information Extraction (KIE), we conducted experiments on three well-known datasets that vary in complexity and document types.
- CORD (48 samples): The CORD (Consolidated Receipt Dataset) is a comprehensive dataset specifically designed for post-OCR parsing tasks, containing annotated Indonesian receipts. It supports a variety of entities crucial for understanding receipt data, including merchant names, dates, receipt numbers, item descriptions, and total prices. The dataset is structured with 30 different entities, allowing for detailed entity recognition.
- FUNSD (47 samples): The FUNSD (Form Understanding in Noisy Scanned Documents) dataset includes scanned forms that contain both structured and unstructured information. It challenges models with complex layouts, including form fields, handwritten text, and irregular structures, testing their ability to extract key-value pairs in noisy environments.
- SROIE (347 samples): The SROIE (Scanned Receipts OCR and Information Extraction) dataset is larger, with a focus on extracting key information from scanned receipts. It involves various fields such as company names, dates, and totals, similar to CORD but with more diverse samples, making it ideal for evaluating scalability and generalization in KIE models.
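Before turning to the results, a brief note on scoring: Approach 1 is evaluated with ANLS (Average Normalized Levenshtein Similarity) in addition to F1. Below is a minimal sketch of the standard ANLS computation, following the common DocVQA convention of a 0.5 threshold on normalized edit distance; it is a reference implementation, not necessarily the exact scoring code used in our experiments.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def anls(prediction: str, ground_truth: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity; answers farther than the threshold score 0."""
    pred, gt = prediction.strip().lower(), ground_truth.strip().lower()
    if not pred and not gt:
        return 1.0
    nld = levenshtein(pred, gt) / max(len(pred), len(gt))
    return 1.0 - nld if nld < threshold else 0.0

# Dataset-level ANLS is the mean of per-question scores:
# mean(anls(p, g) for p, g in zip(predictions, ground_truths))
```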
Results and Observations
| Model | CORD A1 | CORD A2 | CORD A3 | FUNSD A1 | FUNSD A2 | FUNSD A3 | SROIE A1 | SROIE A2 | SROIE A3 |
|---|---|---|---|---|---|---|---|---|---|
| Doc-Owl1.5-Chat | 0.4 | 0.339 | 0.1643 | 0.4004 | 0.109 | 0.054 | 0.3976 | 0.374 | 0.1829 |
| Idefics2 | 0.3483 | 0.2392 | 0.1388 | 0.282 | 0.2662 | 0.1102 | 0.4034 | 0.391 | 0.2262 |
| LLava-Next | 0.471 | 0.4425 | 0.2711 | 0.0728 | 0.1054 | 0.0198 | 0.263 | 0.24465 | 0.134 |
| MiniCPM-2.5 | 0.4064 | 0.1555 | 0.167 | 0.35546 | 0.1456 | 0.095 | 0.4639 | 0.365 | 0.2123 |
| GPT-4V | 0.5419 | 0.678 | 0.45104 | 0.4368 | 0.4768 | 0.2157 | 0.5626 | 0.63623 | 0.554 |
| GPT-4o | 0.6258 | 0.7246 | 0.4476 | 0.4453 | 0.494 | 0.2473 | 0.5597 | 0.6786 | 0.536 |
| Gemini 1.5 Pro | 0.4967 | 0.5856 | 0.51123 | 0.3463 | 0.42038 | 0.2857 | 0.6014 | 0.61754 | 0.3292 |

Table 1: Performance comparison across all 3 approaches (A1-A3 = Approach 1-3)
| Model | CORD A1 | CORD A2 | CORD A3 | FUNSD A1 | FUNSD A2 | FUNSD A3 | SROIE A1 | SROIE A2 | SROIE A3 |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4V | 0.0103 | 0.01186 | 0.01305 | 0.00859 | 0.01483 | 0.0207 | 0.0131 | 0.01534 | 0.02661 |
| GPT-4o | 0.005164 | 0.00594 | 0.00679 | 0.00428 | 0.00741 | 0.0097 | 0.00655 | 0.00765 | 0.0134 |
| Gemini 1.5 Pro | 0.00116 | 0.00169 | 0.002 | 0.00124 | 0.00326 | 0.003482 | 0.0012 | 0.00206 | 0.00357 |

Table 2: Cost comparison across all 3 approaches (A1-A3 = Approach 1-3)
| Model | CORD A1 | CORD A2 | CORD A3 | FUNSD A1 | FUNSD A2 | FUNSD A3 | SROIE A1 | SROIE A2 | SROIE A3 |
|---|---|---|---|---|---|---|---|---|---|
| Doc-Owl1.5-Chat | 0.7328 | 5.7431 | 8.2301 | 0.8106 | 13.2359 | 18.545 | 0.9243 | 4.1646 | 24.8998 |
| Idefics2 | 2.26 | 3.6056 | 12.6807 | 3.15 | 8.0691 | 24.4568 | 2.26 | 3.951 | 35.7991 |
| LLava-Next | 2.28 | 5.03 | 16.1 | 17.2869 | 20.94 | 57.23 | 8.6 | 9.34 | 49.17 |
| MiniCPM-2.5 | 0.3844 | 1.23 | 4.2 | 0.4225 | 2.67 | 9.98 | 0.7002 | 1.76 | 7.54 |
| GPT-4V | 5.56 | 7.7034 | 9.5744 | 2.49 | 9.1676 | 21.043 | 4.35 | 6.9511 | 16.2363 |
| GPT-4o | 5.44 | 5.0627 | 5.184 | 2.56 | 5.0982 | 5.9068 | 4.06 | 4.212 | 8.6216 |
| Gemini 1.5 Pro | 1.9455 | 2.7042 | 3.7555 | 1.42395 | 4.43572 | 5.90058 | 2.06077 | 3.30928 | 6.78923 |

Table 3: Latency comparison across all 3 approaches (A1-A3 = Approach 1-3)
Note: The evaluation results above are based on experiments conducted during Aug-Sept 2024. Since then, updated model releases have been made, likely bringing further improvements.
Insights from the above tables:
- Top Performer: GPT-4o delivered the best F1 scores overall, while Gemini 1.5 Pro offered the best balance of cost and speed.
- Instruction Sensitivity: Closed-source models were less sensitive to specific instructions, performing well even with varied prompt structures, whereas open-source models showed more variability in accuracy.
- Experiment Impact: Approach 2 generally provided the best balance when extracting multiple keys at once, especially for closed-source models, with the JSON correction layer improving results across all datasets.
- Consistency Among Closed-source Models: The closed-source models consistently outperformed their open-source counterparts in F1 score while showing comparable or better speed across the approaches. This suggests that where accuracy, speed, and reliability matter most, closed-source models are often the more effective choice for Key Information Extraction (KIE) tasks.
- Significant Performance Drop on FUNSD: All models show dramatic declines in F1 score from Approach 1 to Approach 3 on FUNSD. For instance, Doc-Owl1.5-Chat drops from 0.4004 in Approach 1 to 0.0540 in Approach 3, indicating severe challenges with this dataset.
Curiosity Corner
- Which models demonstrate the best overall performance under format restrictions?
Most models, particularly Gemini 1.5 Pro, GPT-4V, and GPT-4o, performed well under format restrictions, reliably generating valid structured JSON responses. These models not only returned key-value pairs proficiently but also improved further after the LLM correction layer was applied. While open-source models such as Doc-Owl1.5-Chat and Idefics2 were capable of producing structured JSON outputs, the closed-source models significantly outperformed them in both accuracy and format validity.
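For reference, the JSON-correction layer used in Approaches 2 and 3 can be as simple as the sketch below: try to parse the raw response first, and only fall back to a second LLM call if parsing fails. This is a minimal illustration under our own assumptions; `ask_model` is a placeholder for any text-only LLM call, and the exact repair prompt is hypothetical.

```python
import json
import re

def parse_or_repair_json(raw_response: str, ask_model) -> dict:
    """Return a dict from the model's response, repairing invalid JSON with a second LLM call."""
    # First attempt: strip common wrappers (e.g. ```json fences) and parse directly.
    cleaned = re.sub(r"^```(?:json)?|```$", "", raw_response.strip(), flags=re.MULTILINE).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # Fallback: ask the model to fix the formatting without changing the content.
    repair_prompt = (
        "The following text was meant to be a single valid JSON object but is "
        "malformed. Return only the corrected JSON, keeping the same keys and "
        f"values:\n\n{raw_response}"
    )
    repaired = ask_model(repair_prompt)
    return json.loads(repaired)  # raises if the correction layer also fails
```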
- To what extent do the models exhibit sensitivity to the precise wording of prompts or instructions, especially in Approach 1 (KIE as VQA)?
We conducted an ablation study using two distinct prompting strategies derived from the LayoutLLM and DocLLM papers:
- LayoutLLM Prompt: "What is the <key> in the given document?" This prompt directs the model to identify specific information based on the document's layout.
- DocLLM Prompt: "What is the value for the <key>?" This alternative phrasing targets the same key information but focuses on retrieving the associated value.
Key Insights:
The results indicate that open-source models exhibit significant performance variability based on prompt wording, even when semantic meaning remains consistent. Doc-Owl1.5-Chat scores 0.4000 with the LayoutLLM prompt but drops to 0.2709 with the DocLLM prompt. Idefics2 declines from 0.3419 to 0.2000, highlighting their sensitivity to phrasing changes.
In contrast, API-based models like GPT-4o and Gemini 1.5 Pro show consistent performance across both prompts: GPT-4o scores slightly higher with DocLLM (0.6322) than with LayoutLLM (0.6258), indicating minimal impact from prompt variations, and Gemini 1.5 Pro likewise maintains stable performance, suggesting greater robustness to changes in prompt structure.
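This kind of wording ablation is straightforward to script. The sketch below reuses the `anls` helper from the earlier metric sketch and the same `ask_model` placeholder; the template names and the `samples` structure are illustrative assumptions, not our exact harness.

```python
# Hypothetical ablation loop: score Approach 1 under two prompt templates.
TEMPLATES = {
    "LayoutLLM": "What is the {key} in the given document?",
    "DocLLM":    "What is the value for the {key}?",
}

def run_prompt_ablation(samples, keys, ask_model):
    """Average ANLS per template; `samples` is a list of (image, ground_truth_dict) pairs."""
    results = {}
    for name, template in TEMPLATES.items():
        scores = []
        for image, truth in samples:
            for key in keys:
                prediction = ask_model(image, template.format(key=key))
                scores.append(anls(prediction, truth[key]))  # anls() defined earlier
        results[name] = sum(scores) / len(scores)
    return results
```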
- Does extracting everything (Approach 3) really help in KIE?
Extracting everything (Approach 3) helps to some extent but has limitations. While it provides comprehensive data extraction, it also increases the processing load and introduces potential noise from irrelevant information. Closed-source models like Gemini 1.5 Pro and GPT-4o handled this task well, especially after LLM correction, but the open-source models struggled to generate accurate key-value pairs when everything was extracted. Additionally, this approach can be inefficient when only specific keys are required.
- Does the security/safety filter have any adverse effect on the extraction results?
For closed-source models like the GPT-4 variants, no noticeable adverse effects related to security or safety filters were observed. Gemini 1.5 Pro, however, showed some interesting behavior: in Approach 1 on the SROIE dataset it did not respond to 442 out of 1388 queries (for fields like company and address), and in Approach 2 it failed to respond to 196 out of 347 queries.
In Approach 3, by contrast, where keys were not explicitly mentioned, Gemini responded to 344 out of 347 queries. This suggests that when specific keys are requested, Gemini applies stricter safety filters and may withhold responses where explicit key extraction could trigger security or privacy concerns.
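When Gemini withholds a response, the SDK typically surfaces this through the prompt feedback or the candidate's finish reason rather than through an exception. The sketch below shows one rough way to count such blocked queries, assuming the google-generativeai Python package; attribute names and enum values can differ slightly across SDK versions, and the `queries` list is a placeholder.

```python
import google.generativeai as genai  # assumes the google-generativeai package

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-pro")

def is_blocked(response) -> bool:
    """Heuristically detect a response withheld by the safety filters."""
    # The whole prompt can be blocked before any candidate is generated...
    if response.prompt_feedback.block_reason:
        return True
    # ...or a candidate can be stopped with a SAFETY finish reason.
    return any(c.finish_reason.name == "SAFETY" for c in response.candidates)

queries = []  # fill with (image, prompt) pairs for your documents (placeholder)
blocked = 0
for image, prompt in queries:
    response = model.generate_content([prompt, image])
    if is_blocked(response):
        blocked += 1

print(f"{blocked} of {len(queries)} queries were withheld by the safety filter")
```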
Conclusion
In our exploration of Key Information Extraction (KIE) using multimodal language models, we compared a range of open-source and closed-source models across three different approaches. Our findings highlight distinct trade-offs in accuracy, cost, and speed, offering insights into model suitability for various document processing scenarios.
Based on the results from the approaches above, closed-source models, especially GPT-4o and Gemini 1.5 Pro, emerged as the top choices for accurate, structured JSON extraction. GPT-4o consistently achieved the highest F1 scores across all datasets, proving robust in complex KIE tasks with minimal dependence on exact prompt structure. Gemini 1.5 Pro stood out for its balance of performance, cost-efficiency, and speed, making it well suited to cost-sensitive applications requiring scalable solutions.
In contrast, open-source models like Doc-Owl1.5-Chat, Idefics2, and LLava-Next offered moderate accuracy but excelled in scenarios where speed and low cost are priorities. These models performed best when handling simpler instructions, though they required more precise prompts and struggled with complex JSON structuring tasks.
Among the three approaches, Approach 2—extracting multiple keys in a single request—proved the most effective for maintaining accuracy and efficiency. This approach minimized the processing load while leveraging the JSON correction layer to enhance structured output across both open-source and closed-source models.
In summary, based on the results from the experiments mentioned above, closed-source models are preferable for high-stakes, accuracy-demanding KIE applications, while open-source models offer viable solutions for more budget-conscious projects with simpler requirements. However, it is important to note that these conclusions are drawn from a limited sample size, and further validation is necessary through testing on larger datasets. This comparison highlights the flexibility of multimodal LLMs in adapting to KIE tasks, enabling users to select models tailored to their specific needs in terms of performance, cost, and operational scale.