leo-cf-tian/image-2-latex

In this project, we aim to design an OCR system that converts handwritten mathematical equations into LaTeX. Existing models can already reliably convert rendered images of LaTeX equations back into LaTeX source, but converting handwritten equations reliably appears to be the harder task.

The idea, then, is to pre-train a Vision Transformer model on rendered LaTeX images, then fine-tune it on handwritten equations.

Data & Preprocessing

Sources

As there is much less handwritten data than LaTeX data, we opt to pre-train the model using LaTeX equations.

| Data | Dataset Size | Notes | Task |
| --- | --- | --- | --- |
| im2latex-100k | ~60k | HuggingFace Dataset. LaTeX images with LaTeX ground truth. | Pretraining |
| im2latex-230k | ~230k | Kaggle Dataset. LaTeX images (including matrix symbols) with LaTeX ground truth. Not used due to low data quality and similarity to the 100k set. | Unused |
| Wikipedia Dataset | 100k | Scraped this myself (see below). | Pretraining |
| Handwritten Math Equations | ~1.1k | Kaggle Dataset. Data is in InkML, composed of pen strokes with a MathML ground truth. | Downstream |
| Aida Calculus Math Handwriting Recognition Dataset | ~100k | Kaggle Dataset. Handwritten expression images (calculus only) with character bounding boxes and LaTeX ground truth. | Downstream |
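Since the InkML data consists of pen strokes rather than images, a rasterization step is needed before it can feed an image model. Below is a minimal sketch of how that might look, assuming the standard InkML namespace and comma-separated trace points; the file paths, canvas size, and function names are illustrative, not this project's actual code.

```python
# Rasterize an InkML file's pen strokes into a greyscale PIL image.
import xml.etree.ElementTree as ET
from PIL import Image, ImageDraw

INKML_NS = "{http://www.w3.org/2003/InkML}"

def inkml_to_image(path, size=(512, 128), line_width=2):
    root = ET.parse(path).getroot()

    # Each <trace> is a comma-separated list of "x y [extra]" points.
    strokes = []
    for trace in root.iter(INKML_NS + "trace"):
        points = []
        for point in trace.text.strip().split(","):
            x, y, *_ = (float(v) for v in point.split())
            points.append((x, y))
        strokes.append(points)

    # Normalize all coordinates into the target canvas with a small margin.
    xs = [x for s in strokes for x, _ in s]
    ys = [y for s in strokes for _, y in s]
    min_x, min_y = min(xs), min(ys)
    scale = min((size[0] - 10) / max(max(xs) - min_x, 1e-6),
                (size[1] - 10) / max(max(ys) - min_y, 1e-6))

    img = Image.new("L", size, color=255)  # white background
    draw = ImageDraw.Draw(img)
    for stroke in strokes:
        pts = [((x - min_x) * scale + 5, (y - min_y) * scale + 5) for x, y in stroke]
        if len(pts) > 1:
            draw.line(pts, fill=0, width=line_width)
    return img
```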

Wikipedia Scraper

Wikipedia remains a good source of data because of its:

  • Diversity: the data contains much more diverse mathematical expressions
  • Representation: Wikipedia LaTeX is written much more naturally
  • Consistency: Wikipedia has consistent syntax
  • Verification: Wikipedia LaTeX is almost always rendered correctly, as it is checked by humans

Thus, a system to scrape Wikipedia was developed; it can be found in latex-scraper.ipynb.
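As a rough illustration of the scraping idea (not the notebook's actual code): LaTeX sources can be pulled from the alt text of Wikipedia's math fallback images. The CSS class name and alt-attribute behaviour here are assumptions about Wikipedia's current math rendering.

```python
import requests
from bs4 import BeautifulSoup

def scrape_formulas(title):
    """Return the LaTeX sources of all rendered formulas on a Wikipedia page."""
    url = f"https://en.wikipedia.org/wiki/{title}"
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    formulas = []
    # Wikipedia renders formulas as <img> fallbacks whose alt text
    # contains the original LaTeX markup (assumed class name).
    for img in soup.find_all("img", class_="mwe-math-fallback-image-inline"):
        latex = img.get("alt", "").strip()
        if latex:
            formulas.append(latex)
    return formulas

print(scrape_formulas("Fourier_transform")[:5])
```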

Vision Transformer Pre-Training

Tokenization

We use a byte-level BPE tokenizer for encoding. Word-level tokenizers were tested but yielded significantly worse performance. The reason is not yet known, but it is possibly due to the whitespace pre-tokenizer, which leads the model to handle whitespace characters incorrectly during training.
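A minimal sketch of this setup with the Hugging Face tokenizers library; the corpus file, vocabulary size, and special-token order (RoBERTa-style, to match the decoder below) are illustrative assumptions.

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["formulas.txt"],  # assumed: one LaTeX formula per line
    vocab_size=8000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt
```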

Pretraining Custom Decoder

For the decoder, we pre-train a custom RoBERTa model with masked language modelling (MLM), using the training-data formulas as the corpus.

The data was reorganized after MLM training; further investigation is needed to determine whether this caused data leakage (i.e. whether formulas from the evaluation split were seen during MLM pretraining).
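A minimal sketch of this MLM pretraining with Hugging Face transformers, assuming the tokenizer directory from the previous sketch; the model size and hyperparameters are illustrative, not the project's actual configuration.

```python
from datasets import load_dataset
from transformers import (RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer")  # the BPE above

# Small illustrative decoder configuration.
config = RobertaConfig(vocab_size=tokenizer.vocab_size, hidden_size=256,
                       num_hidden_layers=4, num_attention_heads=4)
model = RobertaForMaskedLM(config)

# Tokenize the formula corpus (assumed: one formula per line).
dataset = load_dataset("text", data_files={"train": "formulas.txt"})["train"]
dataset = dataset.map(lambda e: tokenizer(e["text"], truncation=True, max_length=256),
                      batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments("mlm-out", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("mlm-out")
```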

Vision Encoder-Decoder Model

Both the image processor and the encoder layers of the transformer were taken from Google's ViT implementation.
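A minimal sketch of how this combination might look with transformers' VisionEncoderDecoderModel; the ViT checkpoint name and the decoder path (from the MLM sketch above) are illustrative assumptions.

```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # ViT encoder weights + config
    "mlm-out",                            # the MLM-pretrained RoBERTa decoder
)

# Wire up generation-related token ids (tokenizer = the formula BPE above).
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```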

The model was then trained partially on the original data and partially on augmented data, with the following augmentations (using PyTorch transforms):

  • Random sharpness adjustments
  • Random rotation
  • Random perspective
  • Elastic deformation
  • Color jitter
  • Random color inversion

These augmentations simulate deployment conditions in which the paper is not white and the text may be warped; a sketch of such a pipeline follows below.
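A rough sketch of this pipeline with torchvision transforms; the parameter values are illustrative, not the project's actual settings, and ColorJitter's saturation/hue arguments assume RGB input.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomAdjustSharpness(sharpness_factor=2.0, p=0.5),
    transforms.RandomRotation(degrees=5, fill=255),          # white fill
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5, fill=255),
    transforms.ElasticTransform(alpha=25.0, fill=255),        # local warping
    transforms.ColorJitter(brightness=0.3, contrast=0.3,
                           saturation=0.3, hue=0.05),         # non-white paper
    transforms.RandomInvert(p=0.2),
])
```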

Resources

Pre-Training Result Analysis

Testing the model after pre-training reveals the following:

The model performance is acceptable

  • The model generally does not make breaking mistakes, and generally recognizes examples correctly.

The model generalizes reasonably well on augmented images

  • Performance on images with noticeable but not severe warping is equal to or better than performance on unaugmented images.

The model cannot understand images that have significant padding (FIXED)

  • Fixed by adding randomized padding during preprocessing (a sketch follows below)
  • If the model needs to extract LaTeX from larger images, a region proposal network (RPN, as used in Faster R-CNN) may be required
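A minimal sketch of the randomized-padding fix; the padding range and fill value are illustrative assumptions.

```python
import random
import torchvision.transforms.functional as F

class RandomPad:
    """Pad each side by a random amount so formulas are not always centred."""

    def __init__(self, max_pad=64, fill=255):
        self.max_pad, self.fill = max_pad, fill

    def __call__(self, img):
        left, top, right, bottom = (random.randint(0, self.max_pad)
                                    for _ in range(4))
        return F.pad(img, [left, top, right, bottom], fill=self.fill)
```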

The model generalizes poorly to images with long equations

  • The model performs badly on images past roughly 400 px in length, but can correctly identify segments
    • Likely a processor resolution limitation; a sliding-window approach is a possible solution (see the sketch after this list)
  • Rule out training-image quality as a cause
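A rough sketch of the sliding-window idea, illustrating only the cropping; stitching the per-window decodings back together would still need to be solved, and the window/stride values are illustrative.

```python
def sliding_windows(img, window=400, stride=300):
    """Split a PIL image into overlapping horizontal crops of `window` px."""
    width, height = img.size
    return [
        img.crop((left, 0, min(left + window, width), height))
        for left in range(0, max(width - window, 0) + 1, stride)
    ]
```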

The model cannot handle short sequences

  • Definitely an overfitting issue

The model cannot handle stacked fractions

  • Model requires better spatial representation

The model perceives unnecessary spaces in large images

  • More data augmentation required

The model does not correctly identify strings of text

  • More data required

The model has a hard time recognizing non-greyscale text

  • More data augmentation required

Things to Test / Implement

Possible Leads to Research

List of papers to go over
