Font mismatches cause word-level bounding boxes to vary in height within a sentence. Google Vision may then assign a taller word to the line above the one it actually belongs to, and vice versa. This repo aims to solve that problem.
Connect any two bounding boxes whose centroids differ by less than the average word length. This can be achieved using either of two methods.
The first method is a DSU-based (disjoint set union) approach: treat each word's centroid as a node, connect nodes that are close enough, and find the connected components. Each component is one line of text.
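
A minimal sketch of the DSU idea follows. It is not the repo's actual code. It makes several assumptions: words arrive as `(text, (x_min, y_min, x_max, y_max))` tuples, "differ" means the vertical distance between centroids, and "average word length" means the mean box width unless a `threshold` is passed explicitly. The names `DSU` and `group_lines_dsu` are hypothetical.

```python
from statistics import mean


class DSU:
    """Disjoint set union with path compression."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def group_lines_dsu(words, threshold=None):
    """Group words into lines; words is a list of (text, (x0, y0, x1, y1))."""
    if not words:
        return []
    boxes = [box for _, box in words]
    cxs = [(b[0] + b[2]) / 2 for b in boxes]  # centroid x
    cys = [(b[1] + b[3]) / 2 for b in boxes]  # centroid y
    if threshold is None:
        # Assumption: "average word length" read as the mean box width.
        threshold = mean(b[2] - b[0] for b in boxes)

    dsu = DSU(len(words))
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            # Assumption: only the vertical centroid distance is compared.
            if abs(cys[i] - cys[j]) < threshold:
                dsu.union(i, j)

    # Collect the connected components; each one is a line.
    groups = {}
    for i in range(len(words)):
        groups.setdefault(dsu.find(i), []).append(i)

    # Order lines top to bottom, and words left to right within a line.
    lines = sorted(groups.values(), key=lambda idxs: mean(cys[i] for i in idxs))
    return [[words[i][0] for i in sorted(idxs, key=lambda i: cxs[i])]
            for idxs in lines]
```

The pairwise loop is quadratic in the number of words, which is usually acceptable for a single page. If the mean box width turns out to be larger than the line spacing, passing a smaller `threshold` (for example the mean box height) may separate lines better.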
The second method uses an unsupervised learning algorithm to cluster words based on the y-coordinates of their centroids.
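
The text does not say which clustering algorithm is used, so the sketch below swaps in a simple one: gap-based 1-D clustering, which is what single-linkage clustering with a distance cutoff reduces to in one dimension. The function name `cluster_lines_by_y`, the `max_gap` parameter, and the `(text, (x_min, y_min, x_max, y_max))` input format are all assumptions made for illustration.

```python
def cluster_lines_by_y(words, max_gap):
    """Cluster words into lines by centroid y; words is a list of
    (text, (x0, y0, x1, y1)). A new cluster starts whenever the gap
    between consecutive sorted centroid y-values exceeds max_gap."""

    def centroid_y(word):
        _, (_, y0, _, y1) = word
        return (y0 + y1) / 2

    lines = []
    prev_y = None
    for word in sorted(words, key=centroid_y):
        cy = centroid_y(word)
        if prev_y is None or cy - prev_y > max_gap:
            lines.append([])  # gap too large: start a new line
        text, box = word
        lines[-1].append((box[0], text))  # keep x_min for ordering
        prev_y = cy

    # Within each line, order words left to right by x_min.
    return [[text for _, text in sorted(line)] for line in lines]
```

For example, `cluster_lines_by_y(words, max_gap=20)` would treat any two words whose centroids sit within 20 pixels vertically of a neighbour as part of the same line. Other unsupervised methods, such as k-means when the number of lines is known, could be substituted.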