UniDoc: Unified Pretraining Framework for Document Understanding

#document understanding #natural language processing

Hierarchical Document Textual Representation

By stacking two transformers sequentially, the system first models paragraph representations from tokens, then document representations on top of those paragraph representations (see the sketch after this list). It:

  • Trades extra parameters and computation for longer sequences (\(512^2\) tokens per document), even though the authors claim it reduces computational complexity.
  • Allows interactions among paragraphs/semantic regions (or, as the paper calls them early on, sentences) at a higher level than normal token-to-token attention.
  • Allows interactions between paragraphs and paragraph bounding boxes.
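How I picture the two-level stack, as a minimal PyTorch sketch. The module names, the first-token pooling, the way bounding boxes are injected, and all dimensions are my own assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class HierarchicalDocEncoder(nn.Module):
    """A token-level transformer pools each paragraph into one vector; a second
    transformer then lets paragraphs attend to each other."""

    def __init__(self, vocab_size=30522, d_model=768, n_heads=12, n_layers=6):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.token_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.para_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.bbox_proj = nn.Linear(4, d_model)  # paragraph bounding box -> embedding space

    def forward(self, token_ids, bboxes):
        # token_ids: (num_paragraphs, tokens_per_paragraph); bboxes: (num_paragraphs, 4)
        tokens = self.token_encoder(self.tok_emb(token_ids))  # token-to-token, per paragraph
        para = tokens[:, 0] + self.bbox_proj(bboxes)  # assume the first token pools a paragraph
        # Sequence length at this level is num_paragraphs, not total tokens, which is
        # how 512 paragraphs x 512 tokens = 512^2 tokens stays tractable.
        return self.para_encoder(para.unsqueeze(0))  # (1, num_paragraphs, d_model)

# usage: 8 paragraphs of 16 tokens each
out = HierarchicalDocEncoder()(torch.randint(0, 30522, (8, 16)), torch.rand(8, 4))
```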

It is unclear what exactly a semantic region is by their definition. Based on the paper, it looks like it is whatever JaidedAI/EasyOCR outputs, and ideally it maps to paragraphs.
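For reference, this is roughly how those regions come out of EasyOCR. `readtext(..., paragraph=True)` merges nearby text boxes, and my guess is that these merged blocks are what the paper treats as semantic regions:

```python
import easyocr

reader = easyocr.Reader(["en"])
# with paragraph=True, EasyOCR merges nearby boxes and returns (bbox, text) pairs
regions = reader.readtext("page.png", paragraph=True)

for bbox, text in regions:
    # bbox is four corner points; whether the merged block matches a human's
    # notion of "paragraph" depends entirely on the layout
    print(bbox, text[:60])
```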

Personally, I don't think OCR blocks/paragraphs are great in general, because documents don't necessarily follow the strict rules of English prose. OCR, unless it specializes in document images (and even then the document domain can vary a lot), does not share a human's understanding of a document. For example:

  • Not everything can be classified as sentences or paragraphs;
    • You cannot just call a table a sentence or a paragraph. It is even trickier to pass the table content into the model, since it can be laid out by columns, by rows, or a mix of both, while your sequence of tokens is still strictly sequential (see the flattening example after this list).
    • Corner cases like TOCs, graphs or charts, code blocks, page breaks, and formulas make it even harder;
  • Additionally, a bounding box can be a loose visual representation if a page has multiple columns.
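To make the table point concrete, a toy example (the table itself is made up): flattening the same cells row-major versus column-major yields two different token sequences, and nothing in a flat sequence tells the model which order preserves the table's structure:

```python
# A 2x3 table: header row plus one record.
table = [
    ["Name", "Qty", "Price"],
    ["Pen",  "2",   "$1.50"],
]

row_major = " ".join(cell for row in table for cell in row)
col_major = " ".join(table[r][c] for c in range(3) for r in range(2))

print(row_major)  # Name Qty Price Pen 2 $1.50   -> header first, then the record
print(col_major)  # Name Pen Qty 2 Price $1.50   -> header and values interleaved
```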

Multi-modal Objectives

  • Masked Sentence/Paragraph Prediction
    • To be more precise, it does not predict the sentence itself but the sentence embedding produced by the first transformer;
  • Masked Region of Interest (RoI) Prediction
    • Because all RoIs are quantized into a finite vocabulary, this is equivalent to Masked Token Prediction, except the prediction probabilities come from cosine similarities instead of an MLP head;
  • Multi-modal Similarity Alignment
    • Even though the textual and visual information live in two embedding spaces, their element-wise similarities should be aligned (all three objectives are sketched below).
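Here is how I read the three objectives as losses, sketched in PyTorch. The smooth-L1 choice, the temperature, and all shapes are my assumptions; only the cosine-similarity classification for RoIs and the similarity-alignment idea come from the paper:

```python
import torch
import torch.nn.functional as F

def masked_sentence_loss(pred_emb, target_emb):
    # Regress the masked paragraph's embedding from the first transformer;
    # smooth-L1 is my guess at a regression loss, not necessarily the paper's.
    return F.smooth_l1_loss(pred_emb, target_emb)

def masked_roi_loss(pred_emb, roi_vocab, target_ids, temperature=0.07):
    # RoIs are quantized into a finite vocabulary, so this is masked-token-style
    # classification where the logits are cosine similarities to vocabulary entries.
    logits = F.normalize(pred_emb, dim=-1) @ F.normalize(roi_vocab, dim=-1).T
    return F.cross_entropy(logits / temperature, target_ids)  # temperature is assumed

def similarity_alignment_loss(text_emb, vis_emb):
    # Text and vision live in different spaces, but the pairwise similarity
    # structure within each modality should match element-wise.
    sim_text = F.normalize(text_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    sim_vis = F.normalize(vis_emb, dim=-1) @ F.normalize(vis_emb, dim=-1).T
    return F.mse_loss(sim_text, sim_vis)

# usage with made-up shapes: 4 paragraphs, 768-d embeddings, 1024 RoI vocabulary entries
loss = (masked_sentence_loss(torch.randn(4, 768), torch.randn(4, 768))
        + masked_roi_loss(torch.randn(4, 768), torch.randn(1024, 768),
                          torch.randint(0, 1024, (4,)))
        + similarity_alignment_loss(torch.randn(4, 768), torch.randn(4, 768)))
```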