A deep learning project that combines a Vision Transformer (ViT) image encoder with a GPT-2 text decoder for image-to-text generation, designed for OCR (Optical Character Recognition) tasks. The vision-language model takes an image as input and generates the corresponding text as output.
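To make the image side of the architecture concrete, here is a minimal pure-Python sketch of how a ViT-style encoder turns an image into a sequence of patches that a text decoder such as GPT-2 can then attend to. The function names `patchify` and `encoder_sequence_length` are hypothetical, and the 224x224 image size with 16x16 patches is an assumed common ViT configuration, not necessarily the one this project uses.

```python
def patchify(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block in row-major order.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def encoder_sequence_length(height, width, patch_size, add_cls_token=True):
    """Number of tokens the encoder sees: one per patch, plus an optional class token."""
    num_patches = (height // patch_size) * (width // patch_size)
    return num_patches + (1 if add_cls_token else 0)


# Toy 4x4 grayscale "image" split into 2x2 patches (illustrative values only).
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_patches = patchify(toy_image, patch_size=2)
print(len(toy_patches))   # 4 patches
print(toy_patches[0])     # [0, 1, 4, 5]

# Assumed configuration: 224x224 input, 16x16 patches, one class token.
print(encoder_sequence_length(224, 224, 16))  # 197
```

In a real model each flattened patch would be linearly projected to an embedding and passed through the transformer encoder. The decoder then generates text tokens one at a time while attending to those encoder outputs.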