Cleaning up Documents with VLMs

by kaeru â€” published 2025/12/02 15:13:00 GMT+8, last modified 2025-12-27T12:00:30+08:00

Cleaning up documents, this time with VLMs and AI generated one off scripts

Previously I explored using docling and LLMs to clean up and extract documents as structured text. Especially for legal documents that have page line numberings.

The current state of Visual Language Models (VLMs) as of Dec, 2025 is now very good at understanding document structure and charts. So they can accurately follow instructions to extract and format text you want, and exclude headers and page lines.

Screenshot From 2025-12-02 13-55-54.png — Department of Justice 1MDB Filing with page lines

Current VLM Models like Qwen3-VL (or Gemini 3), can easily follow clear instructions, to extract what you want from an image.

A prompt like this, for document image above will get you clean structured text:

Convert this document to markdown without code block.
Remove the repeating page line numbers on left column [1,2,3,4,5 .... 28] and page header [Case 2:17-cv-04438].
This is a legal document, preserve paragraph numbers, formatting and don't remove any text.

For 251 page document like this though, we will need to break it apart and process it page by page, and this where unix CLI tools from Cleaning Scanned Documents with Open Source Tools GuideÂ is still relevant.

pdftoppm -png -r 150 ~/Notes/assets/06.15.17_viceroy_complaint_0_1647156206929_0.pdf 1mdb

Next step is to then have a python or whatever script language you're familiar with to process each page to local LLM/VLM model server. Code generated by AI these days are pretty good, and you can expect a well documented working script, that you can tweak easily if you know how to code and have basic understanding of the OpenAI or Ollama API.

Generate a python command line script that:

1. Takes command line parameter of image filename.
2. Sends this image for processing to an OpenAI compatible API that accepts images
3. Has option for custom prompt
4. Make API key optional, we we are using llama.cpp as local server
5. Outputs the results from the API to stdout
6. Export the output as input filename but with .md extension

I got this script (vlm-to.md.py) which I tweaked a bit to have longer context (tokens) and my local lllama.cpp server setup.

Since I already have a directory full of images, I just need a simple bash script to convert each one and then join them all up at the end.

for file in *.png; do     
    if [ -f "$file" ]; then echo "processing: $file";
     python3 vlm-to-md.py "$file" --prompt "Convert this document to markdown without code block. Remove the repeating page line numbers on left column [1,2,3,4,5 .... 28] and page header [Case 2:17-cv-04438]. This is a legal document, preserve paragraph numbers, formatting and don't remove any text.";      
    fi;
done

Then merge it all together:

for f in *.md; do cat "$f"; echo; done > 1mdb.md

It's not perfect, but pretty good. I just spent a bit of time to quickly browse and manually clean up some minor formatting mistakes, since I will probably not spend any more time on this document again.

Here is the markdown file for Case 2:17-cv-04438 United States v. Certain Rights To and Interests In The Viceroy Hotel Group (case connected to 1MDB fraud and embezzlement), for convenience of others.

Need for more sophisticated pipelines

This is a really simple pipeline. A better one, would include ability to detect and describe charts as well, given current VLM capabilities. And further down the line, detect faces and places as well in photos. This is something the document processing tools like docling should eventually do well including dealing with edge cases like legal documents.

But for now we can do quick hacks like these as workarounds.

A VLM can now describe this properly. For example, "explain this chart" gives us the following, which I manually copied and pasted back into the document with a note that it's an AI added addition for a chart:

What this chart suggests:

The chart depicts a â€œmoney launderingâ€ or â€œlayeringâ€ scheme designed to obscure the origin of the funds and make them appear as if they were legitimate business transactions.
The $458M from Deutsche Bank was initially funneled into the Aabar-Seychelles account.
The $378M was then moved to the Fiduciary Fund.
The $375M was transferred through a chain (Fiduciary Fund â†’ Bridge Global â†’ Brazen Sky) before being returned to the 1MDB Global Account.
A direct transfer of $356M also occurred from the Aabar-Seychelles account to 1MDB Global.
The $375M from the Brazen Sky Account to the 1MDB Global Account is likely the same $375M that was transferred through the Bridge Global Account, creating a loop to make the transfer appear more complex and less traceable.

Document Actions

RSS feed