In today’s digital age, businesses generate and store large numbers of documents. Many of these, especially visually complex ones such as invoices and contracts, are difficult to interpret automatically and to mine for useful information.

To address this challenge, Google researchers have developed the Visually Rich Document Understanding (VRDU) dataset. The dataset aims to help researchers track progress on document understanding tasks more reliably. Its authors set out five criteria for an effective benchmark and designed VRDU to satisfy all of them, which they report existing datasets do not. Google has made the VRDU dataset and its evaluation code available to the public under a Creative Commons license.

As a research area, VRDU focuses on finding ways to understand visually rich documents automatically. VRDU models can extract structured information such as names, addresses, dates, and monetary amounts. These models have practical applications in invoice processing, customer relationship management (CRM), and fraud detection.

VRDU is challenging because document types vary widely and contain intricate layout patterns. Models must also cope with imperfect inputs, such as typos and missing data.
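As a small illustration of what coping with imperfect inputs can mean in practice, the sketch below maps a noisy field label (for example, one containing a typo or OCR error) onto a known field name, and returns `None` when the label is missing or unrecognizable. The function `normalize_field_key`, the `KNOWN_KEYS` list, and the 0.8 similarity cutoff are all assumptions made for this example; real VRDU models learn this robustness rather than relying on string matching.

```python
import difflib
from typing import Optional, Sequence

# Hypothetical list of field names an extraction schema might expect.
KNOWN_KEYS = ["invoice date", "total amount", "billing address"]


def normalize_field_key(
    raw: str, known_keys: Sequence[str], cutoff: float = 0.8
) -> Optional[str]:
    """Map a possibly misspelled label to the closest known key, or None."""
    # Lowercase, drop colons, and collapse repeated whitespace.
    cleaned = " ".join(raw.lower().replace(":", " ").split())
    if not cleaned:
        return None  # missing data: nothing to match
    # Keep only the single best candidate above the similarity cutoff.
    matches = difflib.get_close_matches(cleaned, known_keys, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for label in ["Invoce Date:", "T0tal Amount", "Phone number", ""]:
        print(repr(label), "->", normalize_field_key(label, KNOWN_KEYS))
```

A stricter cutoff reduces false matches at the cost of rejecting more badly damaged labels; the right value depends on the documents involved.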

Despite these obstacles, VRDU is a rapidly developing field with significant potential. Deploying VRDU models can help businesses reduce costs, increase efficiency, and improve the accuracy of their operations.

In recent years, automated systems have been developed to process complex business documents and extract structured data. These systems reduce the need for manual data entry, which saves time and improves efficiency. Newer models, including those built on the Transformer architecture and on PaLM 2, have shown improved accuracy. However, the datasets used in academic publications do not accurately reflect the challenges faced in real-world scenarios.

To address this discrepancy, the researchers compared how state-of-the-art models perform on academic benchmarks with how they perform on real-world use cases. From this comparison they identified five criteria that a dataset should meet to reflect the complexity of real-world applications.

The VRDU dataset includes documents with diverse layout elements, complex structures, and varying templates. It also ensures high-quality Optical Character Recognition (OCR) results for all documents. Additionally, the dataset includes ground-truth annotations at the token level, facilitating precise training data preparation.
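To show why token-level ground truth simplifies training data preparation, the sketch below turns entity spans defined over OCR tokens into one label per token using the common BIO tagging convention. The `Token` and `EntitySpan` classes, the bounding-box layout, and the `total_amount` label are hypothetical and chosen for illustration; they are not the actual VRDU file format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    text: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1); assumed pixel coordinates


@dataclass
class EntitySpan:
    label: str  # entity name, e.g. "total_amount" (hypothetical)
    start: int  # index of the first token in the span (inclusive)
    end: int    # index one past the last token in the span (exclusive)


def bio_labels(tokens: List[Token], spans: List[EntitySpan]) -> List[str]:
    """Return one BIO label per token; tokens outside any span get 'O'."""
    labels = ["O"] * len(tokens)
    for span in spans:
        for i in range(span.start, span.end):
            prefix = "B" if i == span.start else "I"
            labels[i] = f"{prefix}-{span.label}"
    return labels


tokens = [
    Token("Total", (40, 700, 90, 715)),
    Token("due:", (95, 700, 130, 715)),
    Token("$1,250.00", (140, 700, 210, 715)),
]
spans = [EntitySpan("total_amount", 2, 3)]
labels = bio_labels(tokens, spans)

if __name__ == "__main__":
    for token, label in zip(tokens, labels):
        print(f"{token.text:12s} {label}")
```

Because the annotation already points at exact tokens, no fuzzy alignment between a free-text answer and the OCR output is needed, which is where labeling noise often creeps in.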

The VRDU collection consists of two public datasets: Registration Forms and Ad-Buy Forms. Both contain documents that meet all five benchmark criteria and are relevant to real-world scenarios. Registration Forms covers foreign agent registrations, and Ad-Buy Forms covers political advertisements.

In recent years, VRDU research has benefited from advances such as large language models (LLMs) and few-shot learning techniques. LLMs trained on large datasets of text and code can be used to represent the text and layout of visually rich documents. Few-shot learning techniques enable models to learn from a small number of labeled examples, improving their ability to handle new document types.
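A few-shot experiment is often set up by holding back all but a handful of labeled documents for training. The sketch below shows one such setup, assuming documents are identified by string IDs; the function name `few_shot_split`, the ID format, and the choice of `k` are illustrative and not taken from the VRDU evaluation code.

```python
import random
from typing import List, Sequence, Tuple


def few_shot_split(
    doc_ids: Sequence[str], k: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Pick k documents for training and keep the rest for evaluation."""
    if not 0 <= k <= len(doc_ids):
        raise ValueError("k must be between 0 and the number of documents")
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    return shuffled[:k], shuffled[k:]


# Hypothetical document IDs for illustration.
doc_ids = [f"doc_{i:03d}" for i in range(100)]
train_ids, eval_ids = few_shot_split(doc_ids, k=10)

if __name__ == "__main__":
    print(len(train_ids), "training documents,", len(eval_ids), "held out")
```

Repeating the split with several seeds and averaging the results gives a more stable picture, since performance with so few examples can vary noticeably from one sample to the next.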

The introduction of the VRDU dataset and the ongoing developments in VRDU research mark an important step toward better understanding of, and information extraction from, visually complex documents.