Hierarchical Document Textual Representation
By stacking two transformers sequentially, the system first builds paragraph representations from tokens and then document representations on top of those paragraph representations. It:
- Trades parameters and computation for longer sequences (\(512^2\) tokens per document), even though the authors claim it reduces computational complexity.
- Allows interactions within paragraphs/semantic regions (which the paper initially calls sentences) at a higher level than ordinary token-to-token attention.
- Allows interactions between paragraphs and paragraph bounding boxes.
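
A minimal sketch of this two-stage stacking, assuming mean-pooled paragraph embeddings and a simple linear projection of the bounding boxes (both my assumptions, not necessarily the paper's choices):

```python
import torch
import torch.nn as nn

class HierarchicalDocEncoder(nn.Module):
    """Sketch of two stacked transformers: tokens -> paragraphs -> document."""
    def __init__(self, vocab_size=30522, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Transformer 1: tokens -> paragraph representation. Attention runs per
        # paragraph, so its cost is quadratic in tokens-per-paragraph, not in the
        # total number of tokens in the document.
        self.paragraph_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Transformer 2: paragraphs -> document representation.
        self.document_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Paragraph bounding boxes (x0, y0, x1, y1) are projected and added so
        # layout can interact at the paragraph level.
        self.bbox_proj = nn.Linear(4, d_model)

    def forward(self, token_ids, bboxes):
        # token_ids: (num_paragraphs, tokens_per_paragraph); bboxes: (num_paragraphs, 4)
        tok = self.tok_emb(token_ids)                   # (P, T, D)
        tok = self.paragraph_encoder(tok)               # token-to-token attention within a paragraph
        para = tok.mean(dim=1)                          # pool tokens into one vector per paragraph
        para = para + self.bbox_proj(bboxes)            # inject layout signal
        doc = self.document_encoder(para.unsqueeze(0))  # paragraph-to-paragraph attention
        return doc.squeeze(0)                           # (P, D) contextualized paragraph representations

# Usage: 8 paragraphs of 128 tokens each, with one bounding box per paragraph.
model = HierarchicalDocEncoder()
out = model(torch.randint(0, 30522, (8, 128)), torch.rand(8, 4))
```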
It is unclear what exactly a semantic region is by their definition. Based on the paper, it looks like it is whatever JaidedAI/EasyOCR outputs, and ideally that maps to paragraphs.
Personally, I find OCR blocks/paragraphs not great in general, because documents do not necessarily follow the strict rules of English prose. OCR does not share a human's understanding of a document unless it specializes in document images, and even then the document domain can vary a lot. For example:
- Not everything can be classified as sentences or paragraphs;
- You cannot just call a table a sentence or a paragraph. It is even trickier to pass the table content into the model, since the table can be column-oriented, row-oriented, or a mix, while your sequence of tokens is still sequential;
- Corner cases like tables of contents, graphs or charts, code blocks, page breaks, and formulas make it even harder;
- Additionally, a bounding box can be a loose visual representation if a page has multiple columns.
Multi-modal Objectives
- Masked Sentence/Paragraph Prediction
  - To be more precise, it is not exactly predicting the sentence itself but the sentence embedding from the first transformer (see the first sketch after this list);
- Masked Region of Interest (RoI) Prediction
  - Because all RoIs are quantized into a finite vocabulary, this is equivalent to Masked Token Prediction, and the prediction probability comes from cosine similarities instead of an MLP output (second sketch below);
- Multi-modal Similarity Alignment
  - Even though the textual and visual information live in two different embedding spaces, their element-wise similarities should be aligned (third sketch below);
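
A minimal sketch of the masked sentence/paragraph objective as read above: the target is the paragraph embedding produced by the first transformer, not the tokens themselves. The smooth-L1 regression loss and the detached target are my assumptions, not necessarily the paper's formulation.

```python
import torch
import torch.nn.functional as F

def masked_paragraph_loss(doc_outputs, first_stage_embs, mask):
    # doc_outputs:      (P, D) outputs of the document-level transformer
    # first_stage_embs: (P, D) paragraph embeddings from the token-level transformer
    # mask:             (P,) bool, True where the paragraph embedding was masked out
    pred = doc_outputs[mask]
    target = first_stage_embs[mask].detach()   # do not backprop into the target
    return F.smooth_l1_loss(pred, target)
```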
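A minimal sketch of masked RoI prediction under this reading: RoI features are quantized into a finite visual vocabulary, so the task reduces to masked token prediction whose logits are cosine similarities to the codebook rather than an MLP head. The codebook lookup, temperature, and cross-entropy formulation are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def masked_roi_loss(roi_outputs, codebook, target_ids, mask, temperature=0.07):
    # roi_outputs: (R, D) contextualized RoI representations
    # codebook:    (V, D) quantized visual vocabulary embeddings
    # target_ids:  (R,) index of the ground-truth codebook entry for each RoI
    # mask:        (R,) bool, True for masked RoIs
    pred = F.normalize(roi_outputs[mask], dim=-1)
    book = F.normalize(codebook, dim=-1)
    logits = pred @ book.t() / temperature     # cosine similarities used as logits
    return F.cross_entropy(logits, target_ids[mask])
```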
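A minimal sketch of the similarity alignment idea: since the textual and visual embeddings live in different spaces, the vectors are not matched directly; instead their pairwise similarity structures are matched. Using an MSE between the two similarity matrices is my assumed formulation, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def similarity_alignment_loss(text_embs, visual_embs):
    # text_embs:   (N, D_t) paragraph embeddings for N matched paragraph/region pairs
    # visual_embs: (N, D_v) region embeddings; the two dims need not agree
    t = F.normalize(text_embs, dim=-1)
    v = F.normalize(visual_embs, dim=-1)
    sim_t = t @ t.t()                          # (N, N) text-text similarities
    sim_v = v @ v.t()                          # (N, N) visual-visual similarities
    return F.mse_loss(sim_t, sim_v)            # align the two similarity structures
```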
- Highlights: Gu et al_2021_UniDoc
- Citation: @gu_2021b