In this project, we aim to design an OCR system that converts handwritten mathematical equations into LaTeX. There are already existing models that reliably convert images of rendered LaTeX equations back into LaTeX, but converting handwritten equations reliably appears to be the harder task.
The idea, then, is to train a Vision Transformer model on rendered LaTeX images and then fine-tune it on handwritten equations. As there is far less handwritten data than LaTeX data, we opt to pre-train the model on LaTeX equations.
| Data | Dataset Size | Notes | Task |
|---|---|---|---|
| im2latex-100k | ~60k | HuggingFace dataset. LaTeX images with LaTeX ground truth | Pretraining |
| im2latex-230k | ~230k | Kaggle dataset. LaTeX images (including matrices) with LaTeX ground truth. Not used due to low data quality and similarity to the 100k set | Unused |
| Wikipedia Dataset | ~100k | Scraped this myself | Pretraining |
| Handwritten Math Equations | ~1.1k | Kaggle dataset. Data is in InkML, composed of pen strokes with a MathML ground truth | Downstream |
| Aida Calculus Math Handwriting Recognition Dataset | ~100k | Kaggle dataset. Handwritten expression images (calculus only) with character bounding boxes and LaTeX ground truth | Downstream |
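For reference, a minimal sketch of loading one of these datasets with the HuggingFace `datasets` library; the dataset id and field names here are assumptions and depend on which mirror of im2latex-100k is used:

```python
from datasets import load_dataset

# Hypothetical dataset id; substitute the actual im2latex-100k mirror used.
ds = load_dataset("yuntian-deng/im2latex-100k", split="train")

example = ds[0]
print(example["formula"])            # LaTeX ground truth (field name may differ)
example["image"].save("sample.png")  # rendered equation image
```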
Wikipedia remains a good source of data due to:
- Diversity: the data contains much more diverse representations
- Representation: Wikipedia LaTeX is written much more naturally
- Consistency: Wikipedia has consistent syntax
- Verification: Wikipedia LaTeX is almost always rendered correctly, as it is checked by humans
Thus, a system to scrape Wikipedia was developed; it can be found in latex-scraper.ipynb.
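The notebook contains the full pipeline; as a rough illustration, here is a minimal sketch of the core idea, assuming formulas are read from the `<annotation encoding="application/x-tex">` elements that Wikipedia's math renderer embeds in its MathML output (the exact URL handling and parsing in the notebook may differ):

```python
import requests
from bs4 import BeautifulSoup

def scrape_formulas(title: str) -> list[str]:
    """Collect the LaTeX source of every rendered formula on one article."""
    html = requests.get(f"https://en.wikipedia.org/wiki/{title}", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Wikipedia's math renderer keeps the original LaTeX in MathML annotations.
    return [tag.get_text(strip=True)
            for tag in soup.select('annotation[encoding="application/x-tex"]')]

print(scrape_formulas("Fourier_transform")[:5])
```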
We use a byte-level BPE tokenizer for encoding. Word-level tokenizers were tested but yielded significantly worse performance. The reason is not yet known, but it is possibly the whitespace pre-tokenizer, which may lead the model to handle whitespace characters incorrectly during training.
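A minimal sketch of training such a tokenizer with the HuggingFace `tokenizers` library; the corpus path and vocabulary size are assumptions:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE on the LaTeX corpus (one formula per line).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["formulas.txt"],  # hypothetical corpus file
    vocab_size=8000,         # assumed size; tune to the corpus
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt
```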
For the decoder, we train a custom RoBERTa model with masked language modelling (MLM), using the training-data formulas as the corpus.
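A minimal sketch of this MLM pre-training with `transformers`, in the spirit of the "train a new language model from scratch" resource listed below; the model size and training arguments are assumptions:

```python
from transformers import (
    DataCollatorForLanguageModeling, LineByLineTextDataset, RobertaConfig,
    RobertaForMaskedLM, RobertaTokenizerFast, Trainer, TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer")  # the BPE above
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))

dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="formulas.txt", block_size=256
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-latex", num_train_epochs=5),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```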
The data was reorganized after MLM training; further investigation is needed to determine whether this caused data leakage.
Both the image processor and the encoder layers of the transformer were extracted from Google's ViT implementation.
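A minimal sketch of assembling the encoder-decoder with HuggingFace's `VisionEncoderDecoderModel`; the checkpoint names and paths are assumptions:

```python
from transformers import (
    RobertaTokenizerFast, ViTImageProcessor, VisionEncoderDecoderModel,
)

# Pair Google's pre-trained ViT encoder with the MLM-pretrained RoBERTa decoder.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed encoder checkpoint
    "roberta-latex",                      # decoder from the MLM step above
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer")

# Token ids the combined model needs for generation.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```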
The model was then trained partly on the original data and partly on augmented data, using the following augmentations (via PyTorch transforms):
- Random sharpness adjustments
- Random rotation
- Random perspective
- Elastic deformation
- Color jitter
- Random color inversion
These augmentations simulate conditions in which the paper is not white and the text may be warped; a sketch of the pipeline follows.
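A minimal sketch of that pipeline with torchvision; all parameter values are assumptions to be tuned:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomAdjustSharpness(sharpness_factor=2.0, p=0.3),
    T.RandomRotation(degrees=5),
    T.RandomPerspective(distortion_scale=0.2, p=0.3),
    T.ElasticTransform(alpha=25.0, sigma=5.0),
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2, hue=0.1),
    T.RandomInvert(p=0.2),
])

# augmented = augment(pil_image)  # applied on the fly during training
```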
Resources
- Image Captioning Using Hugging Face Vision Encoder Decoder
- How to train a new language model from scratch using Transformers and Tokenizers
- Create a Tokenizer and Train a Huggingface RoBERTa Model from Scratch
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
Testing the model after pre-training reveals the following:
The model's performance is acceptable
- The model rarely makes breaking mistakes and recognizes most examples correctly.
The model generalizes reasonably well to augmented images
- Performance on images with noticeable but not severe warping is equal to or better than performance on normal images.
The model cannot understand images that have significant padding (FIXED)
- Fixed by adding randomized padding to preprocessing (see the sketch below)
- If the model needs to extract LaTeX from larger images, a region proposal network (RPN, as in Faster R-CNN) may be required.
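A minimal sketch of the randomized-padding fix; the padding range and fill value are assumptions:

```python
import random
import torchvision.transforms.functional as F

def random_pad(img, max_pad=64, fill=255):
    """Pad each side by an independent random amount so the model
    becomes insensitive to margins around the equation."""
    left, top, right, bottom = (random.randint(0, max_pad) for _ in range(4))
    return F.pad(img, [left, top, right, bottom], fill=fill)
```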
The model generalizes poorly to images with long equations
- The model performs badly on images beyond roughly 400px in length, but can correctly identify individual segments
- Likely a processor resolution limitation; a sliding-window approach is a possible solution (sketched below)
- Suspicions about training-image quality should also be ruled out
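A minimal sketch of the sliding-window idea; the window size, stride, and the naive stitching of decoded segments are assumptions, and reconciling overlapping predictions remains an open problem:

```python
def sliding_windows(img, window=400, stride=300):
    """Yield overlapping horizontal crops of a wide PIL equation image."""
    width, height = img.size
    for left in range(0, max(width - window, 0) + 1, stride):
        yield img.crop((left, 0, min(left + window, width), height))

# Naive usage: decode each window independently, then stitch the strings.
# pieces = [decode_crop(crop) for crop in sliding_windows(wide_image)]
# latex = " ".join(pieces)  # hypothetical decode_crop(); overlaps must be merged
```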
The model cannot handle short sequences
- Definitely an overfitting issue
The model cannot handle stacked fractions
- Model requires better spatial representation
The model perceives unnecessary spaces in large images
- More data augmentation required
The model does not correctly identify strings of text
- More data required
The model has a hard time recognizing non-greyscale text
- More data augmentation required
Attention Rollout
- I want to see what the model sees
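A minimal sketch of attention rollout (Abnar & Zuidema, 2020) over the encoder's attention maps, assuming they are collected with `output_attentions=True`:

```python
import torch

def attention_rollout(attentions):
    """attentions: per-layer tensors of shape [batch, heads, tokens, tokens]."""
    result = None
    for layer_attn in attentions:
        attn = layer_attn.mean(dim=1)                 # average over heads
        eye = torch.eye(attn.size(-1), device=attn.device)
        attn = 0.5 * attn + 0.5 * eye                 # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows
        result = attn if result is None else attn @ result
    return result  # row 0 maps the CLS token onto the image patches

# encoder_out = model.encoder(pixel_values, output_attentions=True)
# rollout = attention_rollout(encoder_out.attentions)
```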
Resolving Limited Resolution Issue
- Conditional Positional Encodings for Vision Transformers
- CvT: Introducing Convolutions to Vision Transformers: CvT's convolutional projection layers lower the number of tokens and raise the token feature size, making it unfit for typical encoder-decoder architectures
Convolutional Network Backbone
- Seems like a common implementation worth trying; could be good for feature extraction
Correction Model
- A Transformer-based Math Language Model for Handwritten Math Expression Recognition
- Some type of attention-based model where the model makes error proposals, then checks the original image to resolve them
Two-stage model
- Character detection model for spatial representations and preliminary classification
List of papers to go over
- Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer
- Mathematical expression recognition using a new deep neural model
- ICDAR 2023 CROHME: Competition on Recognition of Handwritten Mathematical Expressions
- ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases