Financial Planning for Youth

Record your spending more simply: take a picture of a receipt and ask it any question with LayoutLM Question Answering

Prim Wong
20 min read · Dec 10, 2023

Empowering youth in financial matters is crucial for their long-term financial well-being and success. Our approach is an AI financial planning chatbot for borrowing money and tracking expenses, built to empower young people in their financial lives.

User research

Improving financial well-being improves our lives. We are committed to improving youth financial literacy and equipping young people for the real world.

A shrinking economy creates a need for financial literacy.

Financial Education

Offer comprehensive financial education programs in schools, community centers, or through online platforms. Cover topics such as budgeting, saving, investing, debt management, taxes, and basic economics.

Create engaging workshops, seminars, or courses specifically designed for youth to build their financial literacy from an early age.

Interactive AI Learning Tools and Resources

Develop interactive tools like apps, games, or simulations that make learning about finances fun and engaging. These tools can help reinforce financial concepts and practical skills. Provide easy access to reliable financial resources, and to borrowing options for youth, through digital platforms, social media, chatbots, or mobile apps.

Encourage smart financial habits by emphasizing the importance of saving regularly, setting financial goals, and creating budgets. Encourage youth to start early and contribute regularly to savings accounts or investment vehicles.

AI Framework

We offer a unique value proposition: record your data by taking a picture, then ask any question about the receipt. This helps youth access financial information more easily, supports borrowing money, and encourages a healthier financial lifestyle.

We have successfully trained numerous AI models, including LayoutLMv3. Its self-supervised pre-training involves several key stages, each contributing to the model's development and performance:

Dataset and Data Collection

The process begins with the collection of a diverse and representative dataset. This dataset includes a wide range of documents with text and image components to ensure that the model can learn from various real-world examples.

Datasets

  1. FUNSD (Form Understanding in Noisy Scanned Documents): FUNSD is a dataset specifically designed for layout analysis and information extraction from forms in noisy scanned documents. It contains annotations for various elements like text, tables, checkboxes, and other form components.

  2. SQuADv2 (Stanford Question Answering Dataset v2): SQuADv2 is a reading comprehension dataset that contains questions posed on a set of Wikipedia articles. Models are required to provide precise answers based on the given context from the articles.

  3. DocVQA (Document Visual Question Answering): DocVQA is a dataset designed for the task of visual question answering specifically on documents. It involves asking questions about document images, and the models need to answer these questions based on the content within the documents.

  • Pre-processing: Data is preprocessed to clean and format it for training. This step includes text and image extraction, layout analysis, and data augmentation to create suitable training input (a minimal encoding sketch follows this list).
  • Word-Patch Alignment Objective: LayoutLMv3 incorporates a word-patch alignment objective. During this stage, the model learns to predict whether an image patch corresponds to a masked word in the text. This helps the model understand the alignment between text and image elements.
  • Fine-tuning: After pre-training, the model is fine-tuned on specific Document AI tasks. Fine-tuning adapts the model to perform well on tasks like form understanding, receipt analysis, and document classification.
  • Evaluation: The model’s performance is rigorously evaluated using various metrics and benchmarks to ensure it meets the desired standards and objectives.
  • Iterative Refinement: The model goes through iterative refinement cycles to improve its performance, fix issues, and adapt to emerging challenges.
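
As a concrete illustration of the pre-processing stage, here is a minimal sketch of encoding a FUNSD-style example for LayoutLMv3 with the Hugging Face processor. The dataset name nielsr/funsd-layoutlmv3 is one public mirror of FUNSD on the Hub and is an assumption, not necessarily the exact data we used.

```python
# Minimal pre-processing sketch (assumption: the public FUNSD mirror
# "nielsr/funsd-layoutlmv3" on the Hugging Face Hub, not our exact data).
from datasets import load_dataset
from transformers import AutoProcessor

dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")

# apply_ocr=False because the dataset already provides words and boxes
processor = AutoProcessor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False
)

example = dataset[0]
encoding = processor(
    example["image"],                  # document image
    example["tokens"],                 # OCR words
    boxes=example["bboxes"],           # word bounding boxes
    word_labels=example["ner_tags"],   # entity labels for form understanding
    truncation=True,
    return_tensors="pt",
)
print(encoding.keys())  # input_ids, attention_mask, bbox, pixel_values, labels
```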

AI Models

Our research involves AI models and datasets on improving OCR detection accuracy, enhancing document layout understanding, and advancing text extraction from images.

We aim to make our AI models more robust and effective in addressing challenges of handling noisy or complex document layouts, extracting structured information from documents, improving reading comprehension from textual content, and enabling machines to comprehend and answer questions based on document images.

Models

We have done comprehensive research on our models, including the state-of-the-art LayoutLMv3 for document understanding and layout analysis.

Baseline

Our baseline runs inference with EasyOCR and Pytesseract, which reach only about 70% accuracy on our dataset (a minimal baseline sketch follows the two tools below).

  1. EasyOCR

EasyOCR is an open-source library designed for optical character recognition. It provides an easy-to-use interface for detecting and recognizing text within images using deep learning-based methods. EasyOCR is simple and easy to use, even though it doesn’t have the highest accuracy.

https://github.com/JaidedAI/EasyOCR

  2. Pytesseract

Pytesseract is a widely used OCR engine based on Google’s Tesseract. It’s known for its ease of integration and is frequently used for basic text extraction from images.

https://pypi.org/project/pytesseract/
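
For reference, this is roughly how the baseline inference looks. The image path receipt.jpg is a placeholder, and Tesseract plus the EasyOCR English model are assumed to be installed.

```python
# Baseline OCR sketch: extract raw text from a receipt image with
# EasyOCR and Pytesseract. "receipt.jpg" is a placeholder path.
import easyocr
import pytesseract
from PIL import Image

image_path = "receipt.jpg"  # hypothetical sample receipt

# EasyOCR: deep-learning-based text detector and recognizer
reader = easyocr.Reader(["en"])  # add "th" for Thai receipts
easyocr_lines = reader.readtext(image_path, detail=0)
print("EasyOCR:", easyocr_lines)

# Pytesseract: thin wrapper around Google's Tesseract engine
tesseract_text = pytesseract.image_to_string(Image.open(image_path))
print("Pytesseract:", tesseract_text)
```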

LayoutLM
Pre-training of Text and Layout for Document Image Understanding

LayoutLM is the first model in which text and layout are jointly learned in a single framework for document-level pre-training, achieving state-of-the-art accuracy on receipt understanding (from 94.02 to 95.24).

Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2019, December 31). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. arXiv.org. https://doi.org/10.1145/3394486.3403172

LayoutLM, the initial model in the series, was designed to perform text extraction while considering the layout information of documents. It combines visual and textual information to understand and extract text in the context of the document’s layout.

It employs a multi-modal architecture that incorporates information from both text and layout tokens, enabling the model to understand the spatial relationship of text elements within a document.

LayoutLM is used for various document understanding tasks, including optical character recognition (OCR), information extraction, and document classification.

LayoutLMv2
Multi-modal Pre-training for Visually-Rich Document Understanding

LayoutLMv2 is an improved version of the original LayoutLM model, with increased accuracy on numerous datasets: FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). It aims to enhance the accuracy and efficiency of text extraction from documents by refining the model architecture and training methodologies.

LayoutLMv2 introduces enhancements in the model architecture, such as better handling of multi-page documents, improved tokenization strategies, and fine-tuning techniques to achieve higher accuracy.

It serves similar purposes as LayoutLM but with improved performance in handling complex document layouts and achieving better text extraction results.

Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., Lu, Y., Florencio, D., Zhang, C., Che, W., Zhang, M., & Zhou, L. (2020, December 29). LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding. arXiv.org. https://arxiv.org/abs/2012.14740v4

LayoutLMv3
Pre-training for Document AI with Unified Text and Image Masking

LayoutLMv3 represents further advancements in document understanding by focusing on improved layout analysis and text extraction accuracy. It aims to address limitations and challenges encountered in previous versions.

LayoutLMv3 incorporates more sophisticated techniques in layout analysis and leverages advancements in pre-training strategies to enhance the model’s ability to understand complex document layouts and extract text accurately. It is specifically tailored for tasks demanding high precision in text extraction, such as extracting structured information from forms, tables, or documents with intricate layouts.

Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F. (2022, April 18). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv.org. https://arxiv.org/abs/2204.08387v3

LayoutLM QA (Question Answering)

The Document Visual Question Answering task processes an image input and responds to a question asked by the user. The model's input is the image and the text question; it uses NLP (BERT) to understand the context, together with Detectron and OCR to extract information from the image. The model then incorporates the different modalities of text, word positions, bounding boxes, and the image.

Our LayoutLM QA model fine-tunes LayoutLMv2 on the DocVQA dataset (a minimal inference sketch appears after the links below).

There are 50,000 questions in DocVQA, based on 12,767 images. To create training, validation, and test splits, the data is randomly divided in an 80-10-10 ratio. The train split contains 39,463 questions and 10,194 images; the validation split contains 5,349 questions and 1,286 images; and the test split contains 5,188 questions and 1,287 images.

Mathew, M., Karatzas, D., & Jawahar, C. V. (2020, July 1). DocVQA: A Dataset for VQA on Document Images. arXiv.org. https://arxiv.org/abs/2007.00398v3

This model combines layout understanding and language comprehension to answer questions related to the content present in document images, thus bridging the gap between visual and textual information.

LayoutLM QA is particularly useful in scenarios where question answering based on document images is required, such as information retrieval from scanned documents or images containing textual content.

Document Question Answering. (n.d.). Document Question Answering. https://huggingface.co/docs/transformers/v4.29.0/tasks/document_question_answering

P. (n.d.). LayoutLMQA/layout_LLM_document_question_answering.ipynb at main · PrimWong/LayoutLMQA. GitHub. https://github.com/PrimWong/LayoutLMQA/blob/main/layout_LLM_document_question_answering.ipynb
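
For readers who want to try this, here is a minimal sketch using the Transformers document-question-answering pipeline. It loads the public impira/layoutlm-document-qa checkpoint as a stand-in; our own fine-tuned weights can be loaded the same way, and receipt.jpg is a placeholder image. Tesseract must be installed because the pipeline runs OCR internally.

```python
# Document QA sketch: ask a question about a receipt image.
# "impira/layoutlm-document-qa" is a public checkpoint used here as a
# stand-in; a fine-tuned LayoutLMv2 model can be loaded the same way.
from transformers import pipeline

doc_qa = pipeline(
    "document-question-answering",
    model="impira/layoutlm-document-qa",
)

answers = doc_qa(
    image="receipt.jpg",                     # hypothetical receipt photo
    question="What is the total amount?",
)
print(answers)  # list of {"score", "answer", "start", "end"} candidates
```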

Unified Multimodal Learning: LayoutLMv3 seeks to bridge the gap between text and image modalities by unifying the pre-training process. It ensures that the model can understand and represent both types of data effectively.

Cross-Modal Alignment: One of the main goals is to enable the model to learn cross-modal alignment, enhancing its ability to relate text and image components in a document, making it invaluable for various Document AI tasks.

Versatility: LayoutLMv3 strives to be a general-purpose model, capable of excelling in diverse tasks, including text-centric responsibilities like form understanding and document visual question answering, as well as image-centric roles such as document image classification and document layout analysis.

State-of-the-Art Performance: The ultimate goal is for LayoutLMv3 to set new standards in performance across a wide range of Document AI applications, demonstrating its exceptional capabilities and robustness.

Evaluation plan

The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall.
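
Formally, with TP, FP, and FN denoting true positives, false positives, and false negatives:

```latex
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
```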

Confusion Matrix

Sasaki, Y. (2007). The truth of the F-measure.

LayoutLM Experiments (Hyperparameter Tuning)

We fine-tuned the pre-trained LayoutLMv2 model on the DocVQA dataset, aiming to enhance the model's ability to understand document images' layout and textual content concurrently. The fine-tuning involved multiple experimental runs, each tailored to specific configurations and parameters, to optimize the model's performance on document visual question answering tasks.

This link refers to the result of each experiment.

https://wandb.ai/fyn-fintech/huggingface?workspace=user-rwongkr

The behavior of the loss function during hyperparameter tuning for LayoutLM offers intriguing insights into the model's convergence and performance dynamics. The figure illustrates three distinct runs: one conducted locally (blue line), one that uploads the model to the Hugging Face Model Hub with identical hyperparameters (red line), and one that uploads the model to the Hugging Face Model Hub with fine-tuned hyperparameters (green line).

The blue and red runs exhibit similar loss curves, reaching their minimum around step 30, after which the loss incrementally increases without substantial deviation between the two runs. This consistency in loss trajectories suggests stable behavior but lacks notable improvements in convergence or performance.

In contrast, the green line, representing the latest hyperparameter-tuning experiment, reveals a different trend. This run shows enhanced convergence and better performance compared to the previous ones. Notably, the loss curve fluctuates less, implying a more stable convergence towards the optimal solution.

The graph’s implications suggest that fine-tuning certain hyperparameters resulted in a more efficient optimization process, leading to superior convergence and ultimately enhancing the model’s accuracy. This observation underlines the critical role of hyperparameter optimization in refining model performance, emphasizing the potential for better convergence and overall efficacy in tasks related to Layout LM.

System Hardware

CPU count: 10

GPU count: 1

GPU type: NVIDIA GeForce RTX 4090

Experiment 1 Evaluation Loss: 4.4423

In the conducted experiments, the blue line corresponds to the initial run, which used default hyperparameters and a batch size of 4. The model files were saved locally on a computer equipped with an NVIDIA GeForce RTX 4090 graphics card with substantial memory capacity. The blue line traces this run's loss over training, showing the model's behavior and predictive capability under the defined conditions. This foundational run, stored locally, serves as the baseline for comparison against subsequent experiments. The hyperparameters were as follows (a sketch of the corresponding training configuration appears after the list):

  • learning_rate: 5e-05
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 20
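
Roughly, these settings map onto Hugging Face TrainingArguments as sketched below; the output directory name is a placeholder, and the Adam betas/epsilon and linear scheduler listed above are the Trainer defaults, so they need not be set explicitly.

```python
# Sketch: Experiment 1 hyperparameters as Hugging Face TrainingArguments.
# The output directory is a placeholder; the default AdamW optimizer already
# uses betas=(0.9, 0.999), epsilon=1e-08, and a linear schedule.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="layoutlmv2-docvqa-exp1",  # hypothetical local folder
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    seed=42,
    num_train_epochs=20,
    lr_scheduler_type="linear",
    evaluation_strategy="steps",          # log evaluation loss during training
    report_to="wandb",                    # stream metrics to Weights & Biases
)
```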

This figure shows good convergence of the training loss (blue line), which decreases exponentially. However, the evaluation loss does not converge; it increases after 30 iterations.

This figure shows the training result, both evaluation and training for this experiment.

Weights & Biases

https://wandb.ai/fyn-fintech/huggingface/runs/sziz6bng?workspace=user-rwongkr

Model

https://drive.google.com/drive/folders/1ojEKpN1Is-RetiQRDScrhUWrr24uoXRw?usp=sharing

Experiment 2 Evaluation Loss: 4.7055

In Experiment 2, the evaluation loss is 4.7055. The red line represents the second run, conducted with default parameters and with the model file saved to Hugging Face. This run trains faster than the previous iteration.

Despite the faster training, the evaluation loss of 4.7055 is higher than in the first experiment, signifying steady but not improved predictive accuracy. This suggests that while the model trained faster with default parameters and Hugging Face integration, there was no significant improvement in evaluation loss, emphasizing the need for further investigation or parameter adjustments to enhance the model's performance.

  • learning_rate: 5e-05
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 20

Weights & Biases

Hugging Face Repository

Results show that the training loss converges as it approaches 0.

This figure shows a result similar to the first run, but the training loss (blue line) ends at 0.057, which is higher than in the previous run. However, the evaluation loss still does not converge; it increases after 30 iterations.

This figure shows the training result, both evaluation and training for this experiment.

Experiment 3 Evaluation Loss: 3.2672

Tuning the hyperparameters, specifically changing the batch size from 4 to 5 and the learning rate from 5e-05 to 5e-06, produced noteworthy changes in training. The modification had a marked impact on convergence behavior and performance metrics, most evident in the evaluation loss. The model's convergence approaches zero, although at a slower pace, attributable to the revised batch size and learning rate (the changed settings are sketched after the list below).

  • learning_rate: 5e-06
  • train_batch_size: 5
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 50
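
As referenced above, the changed settings in the same TrainingArguments sketch (the output directory is again a placeholder):

```python
# Sketch of Experiment 3 settings: lower learning rate, slightly larger
# batch size, and more epochs than Experiments 1 and 2.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="layoutlmv2-docvqa-exp3",  # hypothetical local folder
    learning_rate=5e-6,                   # was 5e-5
    per_device_train_batch_size=5,        # was 4
    per_device_eval_batch_size=8,
    seed=42,
    num_train_epochs=50,                  # was 20
    lr_scheduler_type="linear",
    report_to="wandb",
)
```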

Conclusion

The three hyperparameter-tuning experiments for LayoutLM revealed distinct trends in convergence and performance. The first and second runs, representing the local run and the Hugging Face Model Hub upload respectively, showed similar loss curves with stable behavior but limited performance improvement. In contrast, the third run, with fine-tuned hyperparameters, displayed enhanced convergence and superior performance, suggesting that tuning the learning rate and batch size with more epochs led to more efficient optimization and better convergence towards an optimal solution. These findings underscore the significance of hyperparameter adjustments in improving LayoutLM's efficiency and overall effectiveness.

System GPU

We leverage GPU resources for training purposes, employing their parallel processing capabilities to expedite computational tasks, particularly in training machine learning models. The system’s GPU usage remains consistently similar during these training sessions. This uniformity in GPU usage indicates a stable and efficient allocation of resources, ensuring that the GPU’s computational power is consistently and optimally utilized throughout the training process. This consistent GPU usage is crucial for maintaining steady performance and maximizing computational efficiency during training sessions.
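
In practice we read these figures from the Weights & Biases system metrics panel; as an illustrative alternative, a training script can spot-check GPU usage directly with PyTorch:

```python
# Illustrative GPU spot-check with PyTorch; in our runs the same numbers
# are monitored through the Weights & Biases system metrics panel.
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    print("GPU:", torch.cuda.get_device_name(device))
    print(f"Allocated memory: {torch.cuda.memory_allocated(device) / 1e9:.2f} GB")
    print(f"Reserved memory:  {torch.cuda.memory_reserved(device) / 1e9:.2f} GB")
```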

Future Plan

Our future plans involve further experiments and rigorous hyperparameter tuning to enhance the performance of our models. Specifically, we aim to lower the learning rate, train with a larger batch size, and explore additional optimizers beyond our current setup of Adam with betas=(0.9, 0.999), epsilon=1e-08, and a linear lr_scheduler_type. This tuning is intended to optimize our models for improved accuracy and efficiency on DocVQA tasks.

Moreover, we aim to explore recent advancements in the field, including the DocVQA framework detailed in a recent paper. Notably, there is a model available on the Hugging Face platform, specifically a LayoutLMv3 trained on DocVQA. This model offers multi-page functionality, an essential feature for comprehensive document understanding and question answering tasks. We aim to investigate and integrate this model into our research, leveraging its capabilities to advance our approach to OCR-free multi-page document VQA. https://github.com/rubenpt91/MP-DocVQA-Framework

Additionally, our objective is to strive for excellence by aiming to achieve top-tier performance in multi-page document VQA, specifically targeting the leaderboard position mentioned as the top OCR-free model. This pursuit signifies our commitment to pushing the boundaries of performance and innovation in this domain, aiming to make substantial contributions to the advancement of multi-page document understanding and question answering frameworks.

The (OCR-free) Multi-Page DocVQA method performs at the top of the Multi-Page DocVQA leaderboard. https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=4

Results

Fine-tuning LayoutLMv2 on the DocVQA dataset

Fine-tuning LayoutLMv2 on the DocVQA dataset involves adapting the pre-trained LayoutLMv2 model specifically to excel at the task of answering questions related to document images. This process involves training the model on the DocVQA dataset, enabling it to understand document layouts and answer questions accurately based on the content within those documents.

Impressive Accuracy (97.09%)

Achieving an accuracy of 97.09% on the DocVQA dataset showcases the effectiveness of the fine-tuned LayoutLMv2 model.

These models can play a crucial role in automated document understanding, helping youth collect data and manage their finances, and aiding in tasks such as information retrieval, question answering, and data extraction from scanned financial documents or images.

Demo Video

Our AI Model Journey

LayoutLM Bounding Box Detection with Key Regions of Interest

LayoutLM Bounding Box Detection with Preprocessed Input

The input image is cropped and aligned horizontally before entering the LayoutLMv3 model.

Training Results

This is the training result of fine-tuning the LayoutLMv2 model on the FUNSD dataset for document understanding.

The model performs exceptionally well on an English dataset. To perform better on Thai receipts, further training on a Thai dataset was still needed. We used the LayoutLMv3 model and our tax invoice dataset for inference. Please find the link below to the Google Colab notebook we used for fine-tuning and training.

Inference

Model Weight

https://drive.google.com/file/d/1cexYkZA9a73NexUZyL8TmAD80umegZOE/view?usp=drive_link

Further Details on running with Thai Receipts

GitHub Repository

These GitHub repositories contain all the code we used for our AI model, LayoutLM, and the datasets.

Conclusions

Summary of Accomplishments

We have accomplished significant milestones, including research, building an AI model, gathering datasets for training, fine-tuning the model, and delivering a working question answering model for receipts.

Furthermore, we introduce the LayoutLM question answering model so that youth can easily manage their financial behavior by simply choosing a photo and asking questions.

The LayoutLMv3 model redesigns the LayoutLM architecture and pre-training objectives to pre-train a multimodal Transformer for Document AI. In contrast to existing multimodal models in Document AI, LayoutLMv3 extracts visual features without a pre-trained CNN or Faster R-CNN backbone, saving a substantial number of parameters and doing away with region annotations.
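
As a quick illustration of that footprint, the public microsoft/layoutlmv3-base checkpoint can be loaded and its parameters counted; this is a sketch for orientation, not a benchmark.

```python
# Sketch: load LayoutLMv3-base and count parameters. Visual features come
# from flat patch embeddings rather than a CNN or Faster R-CNN backbone.
from transformers import LayoutLMv3Model

model = LayoutLMv3Model.from_pretrained("microsoft/layoutlmv3-base")
num_params = sum(p.numel() for p in model.parameters())
print(f"LayoutLMv3-base parameters: {num_params / 1e6:.1f}M")
```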

We have gathered diverse datasets for training and evaluation, ensuring relevance to the chatbot's natural language understanding and the QA model's document comprehension. We preprocessed and cleaned the collected data, including text normalization, annotation, and structuring, to facilitate model training.

We developed and implemented the chatbot architecture with capabilities for natural language understanding, dialogue management, and response generation. We trained the QA model to retrieve accurate answers from documents or knowledge sources based on user queries, leveraging pre-trained embeddings and fine-tuning techniques.

We achieved an impressive accuracy of 97.09% by fine-tuning the state-of-the-art LayoutLMv2 model on DocVQA. This model inspires a purpose-driven approach to document image analysis while improving the financial lives of youth with a streamlined data collection system.

Issues and Obstacles

The key challenge in training the AI model is poor-quality and incomplete data. Garbage in, garbage out. Our model is fully functional on the English dataset; however, with insufficient Thai data, it has poor accuracy on Thai-language text due to differences in language structure, syntax, and vocabulary compared to English.

The model’s vocabulary might not accurately capture nuances, character encodings, or uncommon words in Thai, leading to incorrect word recognition or omissions.

In order to enhance recognition accuracy for Thai texts and documents, we will expand the model’s vocabulary and language understanding by including domain-specific Thai language corpora or dictionaries, as well as by increasing the number of Thai datasets with domain-specific language support.

Future Directions

Moving forward, we aim to enhance precision, the youth user experience, and the financial planning mechanism. Beyond the languages already supported by the chatbot and QA model, we will add multilingual capabilities to serve a larger range of languages.

To enhance the chatbot's communication skills, we plan to add multi-modal features that allow users to input images, videos, or voice commands, enabling richer and more thorough interactions between the chatbot and its users.

We can provide more personalized and contextually appropriate responses, while preserving privacy, using sophisticated knowledge retrieval and personalization techniques. These mechanisms incorporate user preferences, history, and conversational context.

We also plan to implement continuous learning and adaptability: with mechanisms such as reinforcement learning or active learning, the chatbot can adjust and enhance its performance based on ongoing user interactions and feedback.

We will also create an app for users.

Lessons Learned

For better financial behavior, the AI youth financial planning chatbot employs LayoutLM models fine-tuned on DocVQA to assist users with simple expenditure queries.

We have trained against numerous strong benchmarks and studied the state-of-the-art LayoutLMv3 model in depth. We have also fine-tuned on DocVQA.

We have learned about the LayoutLM framework's pipeline and layout modeling. Additionally, we learned about precise data extraction, which was critical to improving the model's ability to handle different document formats, fonts, and financial documents, making it more flexible and resilient.

Fine-tuning on DocVQA and other datasets highlights the importance of carefully selected and annotated data for efficient model training. The model's performance relies heavily on the accuracy of the data.

Artificial intelligence is a dynamic area. The importance of adopting a growth mindset cannot be overstated. For the chatbot to remain relevant and effective over time, it needs to be refined iteratively using user feedback, performance measurement, and new AI techniques. We have great teamwork and working with teammates is a fantastic way to learn and grow.

Ultimately, it has been an incredible and enlightening experience to build an AI chatbot that assists young people with financial planning using state-of-the-art AI models, namely LayoutLM fine-tuned on the DocVQA dataset. Building a model that users love while promoting better financial behavior among youth required balancing technical quality, user needs, ethical considerations, and constant progress.

References

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018, October 11). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv.org. https://arxiv.org/abs/1810.04805v2

Document Question Answering. (n.d.). Document Question Answering. https://huggingface.co/docs/transformers/v4.29.0/tasks/document_question_answering

Financial education and youth — OECD. (n.d.). Financial Education and Youth — OECD. https://www.oecd.org/finance/financial-education-and-youth.htm

FUNSD. (n.d.). FUNSD. https://guillaumejaume.github.io/FUNSD/

Ghoshal, D., Kittisilpa, J., & Sriring, O. (2023, May 9). Thai household debt in election focus as millions in “endless struggle.” Reuters. https://www.reuters.com/world/asia-pacific/thai-household-debt-election-focus-millions-endless-struggle-2023-05-09/

Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F. (2022, April 18). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv.org. https://arxiv.org/abs/2204.08387v3

Mathew, M., Karatzas, D., & Jawahar, C. V. (2020, July 1). DocVQA: A Dataset for VQA on Document Images. arXiv.org. https://arxiv.org/abs/2007.00398v3

N. (2023, May 15). Only 2/5 young adults are financially literate — MyBnk. MyBnk. https://www.mybnk.org/report-on-financial-education-in-schools/

Wongkrasaemongkol, R. LayoutLMQA/layout_LLM_document_question_answering.ipynb at main · PrimWong/LayoutLMQA. GitHub. https://github.com/PrimWong/LayoutLMQA/blob/main/layout_LLM_document_question_answering.ipynb

Thailand: population by age group 2023 | Statista. (n.d.). Statista. https://www.statista.com/statistics/1283627/thailand-resident-population-by-age-group/

The Stanford Question Answering Dataset. (2021, June 4). The Stanford Question Answering Dataset. https://rajpurkar.github.io/SQuAD-explorer/

Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2019, December 31). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. arXiv.org. https://doi.org/10.1145/3394486.3403172

Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., Lu, Y., Florencio, D., Zhang, C., Che, W., Zhang, M., & Zhou, L. (2020, December 29). LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding. arXiv.org. https://arxiv.org/abs/2012.14740v4
