Key Information Extraction (KIE) is a fundamental process in document processing that involves identifying and extracting essential data points from unstructured or semi-structured documents. In a world where businesses and organizations handle vast amounts of documents—such as contracts, invoices, receipts, and forms—automating the extraction of key information saves time, reduces manual effort, and enhances accuracy. KIE helps convert documents into structured data formats, making it easier to analyze, search, and utilize the information. Whether extracting names, dates, amounts, or other critical fields, KIE plays a crucial role in streamlining workflows, boosting productivity, and supporting decision-making in sectors like finance, healthcare, legal, and more.
The importance of KIE in document processing is growing as companies strive to optimize their data management practices, ensuring that valuable insights are quickly accessible without the need for extensive manual data entry or review. This blog will focus specifically on the applications of multimodal LLMs in Key Information Extraction, illustrating their transformative potential in enhancing document processing workflows.

(Image source: OCR-free Document Understanding Transformer)
Leveraging Multimodal LLMs for KIE without Fine-Tuning
Frontier LLMs with Multimodal capabilities, such as GPT-4, Claude, or Gemini, have revolutionized key information extraction by leveraging their robust document understanding and generation capabilities. Unlike traditional models, which often require explicit fine-tuning on specific datasets, these multimodal LLMs can process both text and images, enabling them to understand documents in a holistic way. This allows them to extract key information with minimal or no additional training, making them incredibly versatile for various document types, including scanned PDFs, forms, and structured or semi-structured text.
The strength of multimodal LLMs is their ability to interpret context, semantics, and relationships within the document content. For instance, when dealing with invoices, contracts, or medical records, these models can understand the document layout, correlate related information, and accurately identify key entities like names, dates, monetary values, etc. Their pre-trained knowledge, and the ability to generate human-like language, allow them to infer the necessary details without requiring task-specific adjustments. This makes multimodal LLMs highly effective for KIE tasks in document processing, offering a flexible and scalable solution across different industries.
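To make this concrete, below is a minimal sketch of a zero-shot extraction call using the OpenAI Python SDK. The model name, field names, and file path are placeholders, and other providers expose similar multimodal chat interfaces; this is an illustrative example, not the exact code used in our experiments.

```python
import base64
from openai import OpenAI  # assumes the OpenAI Python SDK (v1.x) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_fields(image_path: str, fields: list[str]) -> str:
    """Ask a multimodal model to pull the requested fields from a document image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "Extract the following fields from the document and return them as JSON "
        f"with exactly these keys: {', '.join(fields)}. "
        "Use null for any field that is not present."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        temperature=0,
    )
    return response.choices[0].message.content

# Example (hypothetical invoice fields):
# extract_fields("invoice.png", ["vendor_name", "invoice_date", "total_amount"])
```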
Experimenting with KIE
In our exploration of Key Information Extraction (KIE) using multimodal LLMs, we tested three different approaches to evaluate how well these methods can extract key information from documents. Below are the details of each approach, including the prompts used and the evaluation metrics.
| | Approach 1 - KIE as Visual Question Answering (VQA) | Approach 2 - Asking for Specific Keys in a Single Request | Approach 3 - Extracting All Information in Key-Value Pairs |
|---|---|---|---|
| Definition | We treated KIE as a Visual Question Answering (VQA) task, asking the model questions about the entities we wanted to extract from the document. For every entity to be extracted, a separate question was asked, within a single LLM call. | We consolidated the requests by asking the model to extract multiple key-value pairs in one go and return them in JSON format. To ensure the generated JSON response was valid, we used an additional LLM layer for "JSON correction," which validated and corrected the format if needed. | We asked the model to extract all available information from the document in a key-value format, allowing for a more comprehensive extraction of data. As in Approach 2, an additional LLM layer validated and corrected the generated JSON response to ensure it was in the correct format. |
| Sample Prompts Used | | | |
| Evaluation Metric | F1 Score, ANLS | F1 Score | F1 Score |
Each of these approaches provides different insights into the capabilities of multimodal LLMs in handling KIE tasks. By evaluating their performance across various metrics, we can better understand which method is more effective for different document processing scenarios.
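To make the three approaches concrete, the sketch below shows one plausible prompt style per approach. These are illustrative reconstructions rather than the exact prompts used in our experiments, and `ask_model` is a placeholder for a single multimodal LLM call (such as the one sketched earlier); the per-key questions in Approach 1 can be issued one per call or batched together.

```python
def approach_1_vqa(image, keys, ask_model):
    """Approach 1: one question per entity, framed as visual QA."""
    answers = {}
    for key in keys:
        question = f"What is the {key} in the given document?"
        answers[key] = ask_model(image, question)
    return answers

def approach_2_specific_keys(image, keys, ask_model):
    """Approach 2: request all target keys at once as JSON."""
    prompt = (
        "Extract the following keys from the document and return a JSON object "
        f"with exactly these keys: {', '.join(keys)}."
    )
    return ask_model(image, prompt)

def approach_3_extract_all(image, ask_model):
    """Approach 3: extract every key-value pair the model can find."""
    prompt = (
        "Extract all key information present in this document and return it "
        "as a flat JSON object of key-value pairs."
    )
    return ask_model(image, prompt)
```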
Multimodal LLMs Considered for KIE
In our experiments, we explored both open-source and API-based multimodal LLMs to assess their performance in Key Information Extraction (KIE) tasks.
Open-Source Models:
- Doc-Owl1.5-Chat
- Idefics2
- LLava-Next
- MiniCPM-2.5
API Models:
- GPT-4 Vision (GPT-4V)
- GPT-4o (GPT-4 Omni)
- Gemini 1.5 Pro
Datasets Used for KIE Experiments
To evaluate the performance of different multimodal LLMs in Key Information Extraction (KIE), we conducted experiments on three well-known datasets that vary in complexity and document types.
- CORD (48 samples): The CORD (Consolidated Receipt Dataset) is a comprehensive dataset specifically designed for post-OCR parsing tasks, containing annotated Indonesian receipts. It supports a variety of entities crucial for understanding receipt data, including merchant names, dates, receipt numbers, item descriptions, and total prices. The dataset is structured with 30 different entities, allowing for detailed entity recognition.
- FUNSD (47 samples): The FUNSD (Form Understanding in Noisy Scanned Documents) dataset includes scanned forms that contain both structured and unstructured information. It challenges models with complex layouts, including form fields, handwritten text, and irregular structures, testing their ability to extract key-value pairs in noisy environments.
- SROIE (347 samples): The SROIE (Scanned Receipts OCR and Information Extraction) dataset is larger, with a focus on extracting key information from scanned receipts. It involves various fields such as company names, dates, and totals, similar to CORD but with more diverse samples, making it ideal for evaluating scalability and generalization in KIE models.
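Before turning to the results, a brief note on scoring: Approach 1 is evaluated with ANLS (Average Normalized Levenshtein Similarity) in addition to F1. Below is a minimal sketch of the standard ANLS computation, following the common DocVQA convention of a 0.5 threshold on normalized edit distance; it is a reference implementation, not necessarily the exact scoring code used in our experiments.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def anls(prediction: str, ground_truth: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity; answers farther than the threshold score 0."""
    pred, gt = prediction.strip().lower(), ground_truth.strip().lower()
    if not pred and not gt:
        return 1.0
    nld = levenshtein(pred, gt) / max(len(pred), len(gt))
    return 1.0 - nld if nld < threshold else 0.0

# Dataset-level ANLS is the mean of per-question scores:
# mean(anls(p, g) for p, g in zip(predictions, ground_truths))
```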
Results and Observations
| Model | CORD A1 | CORD A2 | CORD A3 | FUNSD A1 | FUNSD A2 | FUNSD A3 | SROIE A1 | SROIE A2 | SROIE A3 |
|---|---|---|---|---|---|---|---|---|---|
| Doc-Owl1.5-Chat | 0.4 | 0.339 | 0.1643 | 0.4004 | 0.109 | 0.054 | 0.3976 | 0.374 | 0.1829 |
| Idefics2 | 0.3483 | 0.2392 | 0.1388 | 0.282 | 0.2662 | 0.1102 | 0.4034 | 0.391 | 0.2262 |
| LLava-Next | 0.471 | 0.4425 | 0.2711 | 0.0728 | 0.1054 | 0.0198 | 0.263 | 0.24465 | 0.134 |
| MiniCPM-2.5 | 0.4064 | 0.1555 | 0.167 | 0.35546 | 0.1456 | 0.095 | 0.4639 | 0.365 | 0.2123 |
| GPT-4V | 0.5419 | 0.678 | 0.45104 | 0.4368 | 0.4768 | 0.2157 | 0.5626 | 0.63623 | 0.554 |
| GPT-4o | 0.6258 | 0.7246 | 0.4476 | 0.4453 | 0.494 | 0.2473 | 0.5597 | 0.6786 | 0.536 |
| Gemini 1.5 Pro | 0.4967 | 0.5856 | 0.51123 | 0.3463 | 0.42038 | 0.2857 | 0.6014 | 0.61754 | 0.3292 |

Table 1: Performance comparison across all 3 approaches (A1-A3 = Approach 1-3)
| Model | CORD A1 | CORD A2 | CORD A3 | FUNSD A1 | FUNSD A2 | FUNSD A3 | SROIE A1 | SROIE A2 | SROIE A3 |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4V | 0.0103 | 0.01186 | 0.01305 | 0.00859 | 0.01483 | 0.0207 | 0.0131 | 0.01534 | 0.02661 |
| GPT-4o | 0.005164 | 0.00594 | 0.00679 | 0.00428 | 0.00741 | 0.0097 | 0.00655 | 0.00765 | 0.0134 |
| Gemini 1.5 Pro | 0.00116 | 0.00169 | 0.002 | 0.00124 | 0.00326 | 0.003482 | 0.0012 | 0.00206 | 0.00357 |

Table 2: Cost comparison across all 3 approaches (A1-A3 = Approach 1-3)
| Model | CORD A1 | CORD A2 | CORD A3 | FUNSD A1 | FUNSD A2 | FUNSD A3 | SROIE A1 | SROIE A2 | SROIE A3 |
|---|---|---|---|---|---|---|---|---|---|
| Doc-Owl1.5-Chat | 0.7328 | 5.7431 | 8.2301 | 0.8106 | 13.2359 | 18.545 | 0.9243 | 4.1646 | 24.8998 |
| Idefics2 | 2.26 | 3.6056 | 12.6807 | 3.15 | 8.0691 | 24.4568 | 2.26 | 3.951 | 35.7991 |
| LLava-Next | 2.28 | 5.03 | 16.1 | 17.2869 | 20.94 | 57.23 | 8.6 | 9.34 | 49.17 |
| MiniCPM-2.5 | 0.3844 | 1.23 | 4.2 | 0.4225 | 2.67 | 9.98 | 0.7002 | 1.76 | 7.54 |
| GPT-4V | 5.56 | 7.7034 | 9.5744 | 2.49 | 9.1676 | 21.043 | 4.35 | 6.9511 | 16.2363 |
| GPT-4o | 5.44 | 5.0627 | 5.184 | 2.56 | 5.0982 | 5.9068 | 4.06 | 4.212 | 8.6216 |
| Gemini 1.5 Pro | 1.9455 | 2.7042 | 3.7555 | 1.42395 | 4.43572 | 5.90058 | 2.06077 | 3.30928 | 6.78923 |

Table 3: Latency comparison across all 3 approaches (A1-A3 = Approach 1-3)
Note: The evaluation results above are based on experiments conducted during Aug-Sept 2024. Since then, updated model releases have been made, likely bringing further improvements.
Insights from the above tables:
- Top Performer: GPT-4o delivered the best F1 scores overall, while Gemini 1.5 Pro offered the best balance of cost and speed.
- Instruction Sensitivity: Closed-source models were less sensitive to specific instructions, performing well even with varied prompt structures, whereas open-source models showed more variability in accuracy.
- Experiment Impact: Approach 2 generally provided the best balance when extracting multiple keys at once, especially for closed-source models, with the JSON correction layer improving results across all datasets.
- Consistency Among Closed-source Models: The closed-source models consistently outperformed their open-source counterparts in F1 score while showing comparable or better speed across the approaches. This suggests that where accuracy, speed, and reliability matter most, closed-source models are often the more effective choice for Key Information Extraction (KIE) tasks.
- Significant Performance Drop on FUNSD: All models show dramatic declines in F1 score from Approach 1 to Approach 3 on FUNSD. For instance, Doc-Owl1.5-Chat drops from 0.4004 in Approach 1 to 0.0540 in Approach 3, indicating severe challenges with this dataset.
Curiosity Corner
- Which models demonstrate the best overall performance under format restrictions?
Most models, particularly Gemini 1.5 Pro, GPT-4V, and GPT-4o, performed well under format restrictions, reliably generating valid structured JSON responses. These models not only returned key-value pairs proficiently but also improved further after the LLM correction layer was applied. While open-source models such as Doc-Owl1.5-Chat and Idefics2 were capable of producing structured JSON outputs, the closed-source models significantly outperformed them in both accuracy and format validity.
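For reference, the JSON-correction layer used in Approaches 2 and 3 can be as simple as the sketch below: try to parse the raw response first, and only fall back to a second LLM call if parsing fails. This is a minimal illustration under our own assumptions; `ask_model` is a placeholder for any text-only LLM call, and the exact repair prompt is hypothetical.

```python
import json
import re

def parse_or_repair_json(raw_response: str, ask_model) -> dict:
    """Return a dict from the model's response, repairing invalid JSON with a second LLM call."""
    # First attempt: strip common wrappers (e.g. ```json fences) and parse directly.
    cleaned = re.sub(r"^```(?:json)?|```$", "", raw_response.strip(), flags=re.MULTILINE).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # Fallback: ask the model to fix the formatting without changing the content.
    repair_prompt = (
        "The following text was meant to be a single valid JSON object but is "
        "malformed. Return only the corrected JSON, keeping the same keys and "
        f"values:\n\n{raw_response}"
    )
    repaired = ask_model(repair_prompt)
    return json.loads(repaired)  # raises if the correction layer also fails
```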
- To what extent do the models exhibit sensitivity to the precise wording of prompts or instructions, especially in Approach 1 (KIE as VQA)?
We conducted an ablation study using two distinct prompting strategies derived from the LayoutLLM and DocLLM papers:
- LayoutLLM Prompt: "What is the <key> in the given document?" This prompt directs the model to identify specific information based on the document's layout.
- DocLLM Prompt: "What is the value for the <key>?" This alternative phrasing targets the same key information but focuses on retrieving the associated value.
Key Insights:
The results indicate that open-source models exhibit significant performance variability based on prompt wording, even when semantic meaning remains consistent. Doc-Owl1.5-Chat scores 0.4000 with the LayoutLLM prompt but drops to 0.2709 with the DocLLM prompt. Idefics2 declines from 0.3419 to 0.2000, highlighting their sensitivity to phrasing changes.
In contrast, API-based models like GPT-4o and Gemini 1.5 Pro show consistent performance across both prompts: GPT-4o scores slightly higher with DocLLM (0.6322) than with LayoutLLM (0.6258), indicating minimal impact from prompt variations, and Gemini 1.5 Pro likewise maintains stable performance, suggesting greater robustness to changes in prompt structure.
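This kind of wording ablation is straightforward to script. The sketch below reuses the `anls` helper from the earlier metric sketch and the same `ask_model` placeholder; the template names and the `samples` structure are illustrative assumptions, not our exact harness.

```python
# Hypothetical ablation loop: score Approach 1 under two prompt templates.
TEMPLATES = {
    "LayoutLLM": "What is the {key} in the given document?",
    "DocLLM":    "What is the value for the {key}?",
}

def run_prompt_ablation(samples, keys, ask_model):
    """Average ANLS per template; `samples` is a list of (image, ground_truth_dict) pairs."""
    results = {}
    for name, template in TEMPLATES.items():
        scores = []
        for image, truth in samples:
            for key in keys:
                prediction = ask_model(image, template.format(key=key))
                scores.append(anls(prediction, truth[key]))  # anls() defined earlier
        results[name] = sum(scores) / len(scores)
    return results
```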
- Does extracting everything (Approach 3) really help in KIE?
Extracting everything (Approach 3) helps to some extent but has limitations. While it provides comprehensive data extraction, it also increases the processing load and introduces potential noise from irrelevant information. Closed-source models like Gemini 1.5 Pro and GPT-4o handled this task well, especially after LLM correction, but the open-source models struggled to generate accurate key-value pairs when everything was extracted. Additionally, this approach can be inefficient when only specific keys are required.
- Does the security/safety filter have any adverse effect on the extraction results?
For closed-source models like the GPT-4 variants, no noticeable adverse effects related to security or safety filters were observed. Gemini 1.5 Pro, however, showed some interesting behavior: in Approach 1 on the SROIE dataset it did not respond to 442 out of 1388 queries (for fields like company and address), and in Approach 2 it failed to respond to 196 out of 347 queries.
In Approach 3, by contrast, where keys were not explicitly mentioned, Gemini responded to 344 out of 347 queries. This suggests that when specific keys are requested, Gemini applies stricter safety filters and may withhold responses where explicit key extraction could trigger security or privacy concerns.
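When Gemini withholds a response, the SDK typically surfaces this through the prompt feedback or the candidate's finish reason rather than through an exception. The sketch below shows one rough way to count such blocked queries, assuming the google-generativeai Python package; attribute names and enum values can differ slightly across SDK versions, and the `queries` list is a placeholder.

```python
import google.generativeai as genai  # assumes the google-generativeai package

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-pro")

def is_blocked(response) -> bool:
    """Heuristically detect a response withheld by the safety filters."""
    # The whole prompt can be blocked before any candidate is generated...
    if response.prompt_feedback.block_reason:
        return True
    # ...or a candidate can be stopped with a SAFETY finish reason.
    return any(c.finish_reason.name == "SAFETY" for c in response.candidates)

queries = []  # fill with (image, prompt) pairs for your documents (placeholder)
blocked = 0
for image, prompt in queries:
    response = model.generate_content([prompt, image])
    if is_blocked(response):
        blocked += 1

print(f"{blocked} of {len(queries)} queries were withheld by the safety filter")
```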
Conclusion
In our exploration of Key Information Extraction (KIE) using multimodal language models, we compared a range of open-source and closed-source models across three different approaches. Our findings highlight distinct trade-offs in accuracy, cost, and speed, offering insights into model suitability for various document processing scenarios.
Based on the results from the approaches above, closed-source models, especially GPT-4o and Gemini 1.5 Pro, emerged as the top choices for accurate, structured JSON extraction. GPT-4o consistently achieved the highest F1 scores across all datasets, proving robust in complex KIE tasks with minimal dependence on exact prompt structure. Gemini 1.5 Pro stood out for its balance of performance, cost-efficiency, and speed, making it well suited to cost-sensitive applications requiring scalable solutions.
In contrast, open-source models like Doc-Owl1.5-Chat, Idefics2, and LLava-Next offered moderate accuracy but excelled in scenarios where speed and low cost are priorities. These models performed best when handling simpler instructions, though they required more precise prompts and struggled with complex JSON structuring tasks.
Among the three approaches, Approach 2—extracting multiple keys in a single request—proved the most effective for maintaining accuracy and efficiency. This approach minimized the processing load while leveraging the JSON correction layer to enhance structured output across both open-source and closed-source models.
In summary, based on the results from the experiments mentioned above, closed-source models are preferable for high-stakes, accuracy-demanding KIE applications, while open-source models offer viable solutions for more budget-conscious projects with simpler requirements. However, it is important to note that these conclusions are drawn from a limited sample size, and further validation is necessary through testing on larger datasets. This comparison highlights the flexibility of multimodal LLMs in adapting to KIE tasks, enabling users to select models tailored to their specific needs in terms of performance, cost, and operational scale.