HyperAI

Abstract

Image Transformers have recently achieved significant progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks. Self-supervision is essential here because no supervised counterparts exist, owing to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks: document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), table detection (94.23 → 96.55), and text detection for OCR (93.07 → 94.29). The code and pre-trained models are publicly available at https://aka.ms/msdit.
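To put the four before/after pairs quoted in the abstract on a common footing, the short Python sketch below computes the absolute gain for each task. It also computes a relative error reduction, treating 100 minus the score as the remaining error; that convention is an assumption made here for illustration and is not a metric reported by the authors. The names `RESULTS`, `gains`, and `summarize` are hypothetical helpers, not part of the DiT code base.

```python
# Scores quoted in the abstract as (previous best, DiT), on a 0-100 scale.
RESULTS = {
    "document image classification": (91.11, 92.69),
    "document layout analysis": (91.0, 94.9),
    "table detection": (94.23, 96.55),
    "text detection for OCR": (93.07, 94.29),
}


def gains(prev, new):
    """Return (absolute gain in points, relative error reduction in percent)."""
    absolute = new - prev
    # Assumption for illustration: (100 - score) is the remaining error.
    error_reduction = 100.0 * absolute / (100.0 - prev)
    return round(absolute, 2), round(error_reduction, 1)


def summarize(results=RESULTS):
    """Map each task name to its (absolute gain, error reduction) pair."""
    return {task: gains(prev, new) for task, (prev, new) in results.items()}


if __name__ == "__main__":
    for task, (absolute, reduction) in summarize().items():
        print(f"{task}: +{absolute} points, {reduction}% error reduction")
```

Under this convention the largest absolute gain is on document layout analysis (3.9 points), and the smallest is on text detection for OCR (1.22 points). Note that the four tasks use different metrics, so the gains are indicative rather than directly comparable.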

Benchmarks

Benchmark	Methodology	Metrics
document-image-classification-on-rvl-cdip	DiT-B	Accuracy: 92.11% Parameters: 87M
document-image-classification-on-rvl-cdip	DiT-L	Accuracy: 92.69% Parameters: 304M
document-layout-analysis-on-publaynet-val	DiT-L	Figure: 0.972 List: 0.960 Overall: 0.949 Table: 0.978 Text: 0.944 Title: 0.893
table-detection-on-ctdar	DiT-B (Cascade)	Weighted Average F1-score: 96.14
table-detection-on-ctdar	DiT-L (Cascade)	Weighted Average F1-score: 96.55


DiT: Self-supervised Pre-training for Document Image Transformer

Junlong Li Yiheng Xu Tengchao Lv Lei Cui Cha Zhang Furu Wei

