In this project, we aim to design an OCR system that converts handwritten mathematical equations into LaTeX. There are already existing models that reliably convert images of rendered LaTeX equations back into LaTeX, but converting handwritten equations reliably appears to be the harder task.
The idea, then, is to train a Vision Transformer model on rendered LaTeX images and then fine-tune it on handwritten equations. As there is far less handwritten data than LaTeX data, we opt to pre-train the model on LaTeX equations.
| Data | Dataset Size | Notes | Task |
|---|---|---|---|
| im2latex-100k | ~60k | HuggingFace dataset. LaTeX images with LaTeX ground truth | Pretraining |
| im2latex-230k | ~230k | Kaggle dataset. LaTeX images (including matrices) with LaTeX ground truth. Not used due to low data quality and similarity to the 100k set | Unused |
| Wikipedia Dataset | ~100k | Scraped this myself | Pretraining |
| Handwritten Math Equations | ~1.1k | Kaggle dataset. Data is in InkML, composed of pen strokes with a MathML ground truth | Downstream |
| Aida Calculus Math Handwriting Recognition Dataset | ~100k | Kaggle dataset. Handwritten expression images (calculus only) with character bounding boxes and LaTeX ground truth | Downstream |
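For reference, a minimal sketch of loading one of these datasets with the HuggingFace `datasets` library; the dataset id and field names here are assumptions and depend on which mirror of im2latex-100k is used:

```python
from datasets import load_dataset

# Hypothetical dataset id; substitute the actual im2latex-100k mirror used.
ds = load_dataset("yuntian-deng/im2latex-100k", split="train")

example = ds[0]
print(example["formula"])            # LaTeX ground truth (field name may differ)
example["image"].save("sample.png")  # rendered equation image
```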
Wikipedia remains a good source of data due to:
- Diversity: the data contains much more diverse representations
- Representation: Wikipedia LaTeX is written much more naturally
- Consistency: Wikipedia has consistent syntax
- Verification: Wikipedia LaTeX is almost always rendered correctly, as it is checked by humans
Thus, a system to scrape Wikipedia was developed; it can be found in latex-scraper.ipynb.
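The notebook contains the full pipeline; as a rough illustration, here is a minimal sketch of the core idea, assuming formulas are read from the `<annotation encoding="application/x-tex">` elements that Wikipedia's math renderer embeds in its MathML output (the exact URL handling and parsing in the notebook may differ):

```python
import requests
from bs4 import BeautifulSoup

def scrape_formulas(title: str) -> list[str]:
    """Collect the LaTeX source of every rendered formula on one article."""
    html = requests.get(f"https://en.wikipedia.org/wiki/{title}", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Wikipedia's math renderer keeps the original LaTeX in MathML annotations.
    return [tag.get_text(strip=True)
            for tag in soup.select('annotation[encoding="application/x-tex"]')]

print(scrape_formulas("Fourier_transform")[:5])
```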
We use a byte-level BPE tokenizer for encoding. Word-level tokenizers were tested but yielded significantly worse performance. The reason is not yet known, but it is possibly the whitespace pre-tokenizer, which may lead the model to handle whitespace characters incorrectly during training.
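A minimal sketch of training such a tokenizer with the HuggingFace `tokenizers` library; the corpus path and vocabulary size are assumptions:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE on the LaTeX corpus (one formula per line).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["formulas.txt"],  # hypothetical corpus file
    vocab_size=8000,         # assumed size; tune to the corpus
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt
```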
For the decoder, we train a custom RoBERTa model with masked language modelling (MLM), using the training-data formulas as the corpus.
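A minimal sketch of this MLM pre-training with `transformers`, in the spirit of the "train a new language model from scratch" resource listed below; the model size and training arguments are assumptions:

```python
from transformers import (
    DataCollatorForLanguageModeling, LineByLineTextDataset, RobertaConfig,
    RobertaForMaskedLM, RobertaTokenizerFast, Trainer, TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer")  # the BPE above
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))

dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="formulas.txt", block_size=256
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-latex", num_train_epochs=5),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```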
The data was reorganized after MLM training; further investigation is needed to determine whether this caused data leakage.
Both the image processor and the encoder layers of the transformer were extracted from Google's ViT implementation.
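A minimal sketch of assembling the encoder-decoder with HuggingFace's `VisionEncoderDecoderModel`; the checkpoint names and paths are assumptions:

```python
from transformers import (
    RobertaTokenizerFast, ViTImageProcessor, VisionEncoderDecoderModel,
)

# Pair Google's pre-trained ViT encoder with the MLM-pretrained RoBERTa decoder.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed encoder checkpoint
    "roberta-latex",                      # decoder from the MLM step above
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer")

# Token ids the combined model needs for generation.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```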
The model was then trained partly on the original data and partly on augmented data, using the following augmentations (via PyTorch transforms):
- Random sharpness adjustments
- Random rotation
- Random perspective
- Elastic deformation
- Color jitter
- Random color inversion
These augmentations simulate conditions in which the paper is not white and the text may be warped; a sketch of the pipeline follows.
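A minimal sketch of that pipeline with torchvision; all parameter values are assumptions to be tuned:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomAdjustSharpness(sharpness_factor=2.0, p=0.3),
    T.RandomRotation(degrees=5),
    T.RandomPerspective(distortion_scale=0.2, p=0.3),
    T.ElasticTransform(alpha=25.0, sigma=5.0),
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2, hue=0.1),
    T.RandomInvert(p=0.2),
])

# augmented = augment(pil_image)  # applied on the fly during training
```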
Resources
- Image Captioning Using Hugging Face Vision Encoder Decoder
- How to train a new language model from scratch using Transformers and Tokenizers
- Create a Tokenizer and Train a Huggingface RoBERTa Model from Scratch
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
Testing the model after pre-training reveals the following:
The model's performance is acceptable
- The model rarely makes breaking mistakes and recognizes most examples correctly.
The model generalizes reasonably well to augmented images
- Performance on images with noticeable but not severe warping is equal to or better than performance on normal images.
The model cannot understand images that have significant padding (FIXED)
- Fixed by adding randomized padding to preprocessing (see the sketch below)
- If the model needs to extract LaTeX from larger images, a region proposal network (RPN, as in Faster R-CNN) may be required.
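A minimal sketch of the randomized-padding fix; the padding range and fill value are assumptions:

```python
import random
import torchvision.transforms.functional as F

def random_pad(img, max_pad=64, fill=255):
    """Pad each side by an independent random amount so the model
    becomes insensitive to margins around the equation."""
    left, top, right, bottom = (random.randint(0, max_pad) for _ in range(4))
    return F.pad(img, [left, top, right, bottom], fill=fill)
```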
The model generalizes poorly to images with long equations
- The model performs badly on images beyond roughly 400px in length, but can correctly identify individual segments
- Likely a processor resolution limitation; a sliding-window approach is a possible solution (sketched below)
- Suspicions about training-image quality should also be ruled out
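A minimal sketch of the sliding-window idea; the window size, stride, and the naive stitching of decoded segments are assumptions, and reconciling overlapping predictions remains an open problem:

```python
def sliding_windows(img, window=400, stride=300):
    """Yield overlapping horizontal crops of a wide PIL equation image."""
    width, height = img.size
    for left in range(0, max(width - window, 0) + 1, stride):
        yield img.crop((left, 0, min(left + window, width), height))

# Naive usage: decode each window independently, then stitch the strings.
# pieces = [decode_crop(crop) for crop in sliding_windows(wide_image)]
# latex = " ".join(pieces)  # hypothetical decode_crop(); overlaps must be merged
```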
The model cannot handle short sequences
- Definitely an overfitting issue
The model cannot handle stacked fractions
- Model requires better spatial representation
The model perceives unnecessary spaces in large images
- More data augmentation required
The model does not correctly identify strings of text
- More data required
The model has a hard time recognizing non-greyscale text
- More data augmentation required
Attention Rollout
- I want to see what the model sees
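A minimal sketch of attention rollout (Abnar & Zuidema, 2020) over the encoder's attention maps, assuming they are collected with `output_attentions=True`:

```python
import torch

def attention_rollout(attentions):
    """attentions: per-layer tensors of shape [batch, heads, tokens, tokens]."""
    result = None
    for layer_attn in attentions:
        attn = layer_attn.mean(dim=1)                 # average over heads
        eye = torch.eye(attn.size(-1), device=attn.device)
        attn = 0.5 * attn + 0.5 * eye                 # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows
        result = attn if result is None else attn @ result
    return result  # row 0 maps the CLS token onto the image patches

# encoder_out = model.encoder(pixel_values, output_attentions=True)
# rollout = attention_rollout(encoder_out.attentions)
```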
Resolving Limited Resolution Issue
- Conditional Positional Encodings for Vision Transformers
- CvT: Introducing Convolutions to Vision Transformers: CvT's convolutional projection layers lower the number of tokens and raise the token feature size, making it unfit for typical encoder-decoder architectures
Convolutional Network Backbone
- Seems like a common implementation worth trying; could be good for feature extraction
Correction Model
- A Transformer-based Math Language Model for Handwritten Math Expression Recognition
- Some type of attention-based model where the model makes error proposals, then checks the original image to resolve them
Two-stage model
- Character detection model for spatial representations and preliminary classification
List of papers to go over
- Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer
- Mathematical expression recognition using a new deep neural model
- ICDAR 2023 CROHME: Competition on Recognition of Handwritten Mathematical Expressions
- ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases