Hierarchical Document Textual Representation
By stacking two transformers sequentially, the system first builds paragraph representations from tokens and then document representations on top of those paragraph representations. It:
- Trades parameters and computation for longer sequences (\(512^2\) tokens per document), even though the authors claim it reduces computational complexity.
- Allows interactions within paragraphs/semantic regions (which the paper initially calls sentences) at a higher level than ordinary token-to-token attention.
- Allows interactions between paragraphs and paragraph bounding boxes.
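
A minimal sketch of this two-stage stacking, assuming mean-pooled paragraph embeddings and a simple linear projection of the bounding boxes (both my assumptions, not necessarily the paper's choices):

```python
import torch
import torch.nn as nn

class HierarchicalDocEncoder(nn.Module):
    """Sketch of two stacked transformers: tokens -> paragraphs -> document."""
    def __init__(self, vocab_size=30522, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Transformer 1: tokens -> paragraph representation. Attention runs per
        # paragraph, so its cost is quadratic in tokens-per-paragraph, not in the
        # total number of tokens in the document.
        self.paragraph_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Transformer 2: paragraphs -> document representation.
        self.document_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Paragraph bounding boxes (x0, y0, x1, y1) are projected and added so
        # layout can interact at the paragraph level.
        self.bbox_proj = nn.Linear(4, d_model)

    def forward(self, token_ids, bboxes):
        # token_ids: (num_paragraphs, tokens_per_paragraph); bboxes: (num_paragraphs, 4)
        tok = self.tok_emb(token_ids)                   # (P, T, D)
        tok = self.paragraph_encoder(tok)               # token-to-token attention within a paragraph
        para = tok.mean(dim=1)                          # pool tokens into one vector per paragraph
        para = para + self.bbox_proj(bboxes)            # inject layout signal
        doc = self.document_encoder(para.unsqueeze(0))  # paragraph-to-paragraph attention
        return doc.squeeze(0)                           # (P, D) contextualized paragraph representations

# Usage: 8 paragraphs of 128 tokens each, with one bounding box per paragraph.
model = HierarchicalDocEncoder()
out = model(torch.randint(0, 30522, (8, 128)), torch.rand(8, 4))
```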
It is unclear what exactly a semantic region is by their definition. Based on the paper, it looks like it is whatever JaidedAI/EasyOCR outputs, and ideally that maps to paragraphs.
Personally, I find OCR blocks/paragraphs not great in general, because documents do not necessarily follow the strict rules of English prose. OCR does not share a human's understanding of a document unless it specializes in document images, and even then the document domain can vary a lot. For example:
- Not everything can be classified as sentences or paragraphs;
- You cannot just call a table a sentence or a paragraph. It is even trickier to pass the table content into the model, since the table can be column-oriented, row-oriented, or a mix, while your sequence of tokens is still sequential;
- Corner cases like tables of contents, graphs or charts, code blocks, page breaks, and formulas make it even harder;
- Additionally, a bounding box can be a loose visual representation if a page has multiple columns.
Multi-modal Objectives
- Masked Sentence/Paragraph Prediction
  - To be more precise, it is not exactly predicting the sentence itself but the sentence embedding from the first transformer (see the first sketch after this list);
- Masked Region of Interest (RoI) Prediction
  - Because all RoIs are quantized into a finite vocabulary, this is equivalent to Masked Token Prediction, and the prediction probability comes from cosine similarities instead of an MLP output (second sketch below);
- Multi-modal Similarity Alignment
  - Even though the textual and visual information live in two different embedding spaces, their element-wise similarities should be aligned (third sketch below);
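
A minimal sketch of the masked sentence/paragraph objective as read above: the target is the paragraph embedding produced by the first transformer, not the tokens themselves. The smooth-L1 regression loss and the detached target are my assumptions, not necessarily the paper's formulation.

```python
import torch
import torch.nn.functional as F

def masked_paragraph_loss(doc_outputs, first_stage_embs, mask):
    # doc_outputs:      (P, D) outputs of the document-level transformer
    # first_stage_embs: (P, D) paragraph embeddings from the token-level transformer
    # mask:             (P,) bool, True where the paragraph embedding was masked out
    pred = doc_outputs[mask]
    target = first_stage_embs[mask].detach()   # do not backprop into the target
    return F.smooth_l1_loss(pred, target)
```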
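A minimal sketch of masked RoI prediction under this reading: RoI features are quantized into a finite visual vocabulary, so the task reduces to masked token prediction whose logits are cosine similarities to the codebook rather than an MLP head. The codebook lookup, temperature, and cross-entropy formulation are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def masked_roi_loss(roi_outputs, codebook, target_ids, mask, temperature=0.07):
    # roi_outputs: (R, D) contextualized RoI representations
    # codebook:    (V, D) quantized visual vocabulary embeddings
    # target_ids:  (R,) index of the ground-truth codebook entry for each RoI
    # mask:        (R,) bool, True for masked RoIs
    pred = F.normalize(roi_outputs[mask], dim=-1)
    book = F.normalize(codebook, dim=-1)
    logits = pred @ book.t() / temperature     # cosine similarities used as logits
    return F.cross_entropy(logits, target_ids[mask])
```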
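A minimal sketch of the similarity alignment idea: since the textual and visual embeddings live in different spaces, the vectors are not matched directly; instead their pairwise similarity structures are matched. Using an MSE between the two similarity matrices is my assumed formulation, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def similarity_alignment_loss(text_embs, visual_embs):
    # text_embs:   (N, D_t) paragraph embeddings for N matched paragraph/region pairs
    # visual_embs: (N, D_v) region embeddings; the two dims need not agree
    t = F.normalize(text_embs, dim=-1)
    v = F.normalize(visual_embs, dim=-1)
    sim_t = t @ t.t()                          # (N, N) text-text similarities
    sim_v = v @ v.t()                          # (N, N) visual-visual similarities
    return F.mse_loss(sim_t, sim_v)            # align the two similarity structures
```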
- Highlights: Gu et al_2021_UniDoc
- Citation: @gu_2021b