
Use LaTeX to generate a layout dataset

· One min read
Andrey Ganyushkin

“Extracting Scientific Figures with Distantly Supervised Neural Networks”

The interesting idea here is to patch LaTeX sources to generate a document layout dataset.

For example:

If we have the LaTeX sources, we can inject LaTeX commands that set a distinct color for the title or another part of the document. Once the document is compiled, we can render the PDF pages to images and process them in OpenCV to find the bounding box of the title text. It is simple, but this approach lets us generate a large dataset.
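A minimal sketch of the two ends of that pipeline, under a few assumptions that are mine and not the paper's: the title is written as a single `\title{...}` without nested braces, pure red is used as the marker color, and the compiled page has already been rendered to an RGB array by some other tool. The function names `colorize_title` and `marker_bbox` are hypothetical. The mask is built with plain NumPy here to keep the example small; in OpenCV the same step could be done with `cv2.inRange` followed by `cv2.boundingRect`.

```python
import re

import numpy as np

# Assumption: pure red does not appear anywhere else on the page.
MARKER_RGB = (255, 0, 0)


def colorize_title(latex_source: str) -> str:
    """Wrap the \\title{...} argument in \\textcolor{red}{...} and load xcolor."""
    # Simplification: only handles a title without nested braces.
    patched = re.sub(
        r"\\title\{([^{}]*)\}",
        lambda m: r"\title{\textcolor{red}{" + m.group(1) + "}}",
        latex_source,
        count=1,
    )
    # \textcolor needs xcolor, so load it at the end of the preamble.
    return patched.replace(
        r"\begin{document}", "\\usepackage{xcolor}\n\\begin{document}", 1
    )


def marker_bbox(page_rgb: np.ndarray, tol: int = 60):
    """Return (x, y, width, height) of pixels close to MARKER_RGB, or None."""
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(page_rgb.astype(np.int16) - np.array(MARKER_RGB, dtype=np.int16))
    # A pixel matches when every channel is within `tol` of the marker color.
    mask = (diff <= tol).all(axis=-1)
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    x0, y0 = xs.min(), ys.min()
    return int(x0), int(y0), int(xs.max() - x0 + 1), int(ys.max() - y0 + 1)
```

Compiling the patched source and rendering the PDF to images is left out on purpose, since it depends on the local toolchain. The tolerance `tol` is there because anti-aliased glyph edges are not pure red in the rendered image; the value 60 is a guess and would need tuning on real pages.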

paper link