Cleaning up Documents with VLMs
Previously I explored using docling and LLMs to clean up and extract documents as structured text. Especially for legal documents that have page line numberings.
The current state of Visual Language Models (VLMs) as of Dec, 2025 is now very good at understanding document structure and charts. So they can accurately follow instructions to extract and format text you want, and exclude headers and page lines.
Current VLM Models like Qwen3-VL (or Gemini 3), can easily follow clear instructions, to extract what you want from an image.
A prompt like this, for document image above will get you clean structured text:
Convert this document to markdown without code block.
Remove the repeating page line numbers on left column [1,2,3,4,5 .... 28] and page header [Case 2:17-cv-04438].
This is a legal document, preserve paragraph numbers, formatting and don't remove any text.
For 251 page document like this though, we will need to break it apart and process it page by page, and this where unix CLI tools from Cleaning Scanned Documents with Open Source Tools Guide is still relevant.
pdftoppm -png -r 150 ~/Notes/assets/06.15.17_viceroy_complaint_0_1647156206929_0.pdf 1mdb
Next step is to then have a python or whatever script language you're familiar with to process each page to local LLM/VLM model server. Code generated by AI these days are pretty good, and you can expect a well documented working script, that you can tweak easily if you know how to code and have basic understanding of the OpenAI or Ollama API.
Generate a python command line script that: 1. Takes command line parameter of image filename. 2. Sends this image for processing to an OpenAI compatible API that accepts images 3. Has option for custom prompt
4. Make API key optional, we we are using llama.cpp as local server 5. Outputs the results from the API to stdout
6. Export the output as input filename but with .md extension
I got this script (vlm-to.md.py) which I tweaked a bit to have longer context (tokens) and my local lllama.cpp server setup.
Since I already have a directory full of images, I just need a simple bash script to convert each one and then join them all up at the end.
for file in *.png; do
if [ -f "$file" ]; then echo "processing: $file";
python3 vlm-to-md.py "$file" --prompt "Convert this document to markdown without code block. Remove the repeating page line numbers on left column [1,2,3,4,5 .... 28] and page header [Case 2:17-cv-04438]. This is a legal document, preserve paragraph numbers, formatting and don't remove any text.";
fi;
done
Then merge it all together:
for f in *.md; do cat "$f"; echo; done > 1mdb.md
It's not perfect, but pretty good. I just spent a bit of time to quickly browse and manually clean up some minor formatting mistakes, since I will probably not spend any more time on this document again.
Here is the markdown file for Case 2:17-cv-04438 United States v. Certain Rights To and Interests In The Viceroy Hotel Group (case connected to 1MDB fraud and embezzlement), for convenience of others.
Need for more sophisticated pipelines
This is a really simple pipeline. A better one, would include ability to detect and describe charts as well, given current VLM capabilities. And further down the line, detect faces and places as well in photos. But that's for another day.
A VLM can now describe this properly. For example, "explain this chart" gives us the following, which I manually copied and pasted back into the document with a note that it's an AI added addition for a chart:
What this chart suggests:
- The chart depicts a “money laundering” or “layering” scheme designed to obscure the origin of the funds and make them appear as if they were legitimate business transactions.
- The $458M from Deutsche Bank was initially funneled into the Aabar-Seychelles account.
- The $378M was then moved to the Fiduciary Fund.
- The $375M was transferred through a chain (Fiduciary Fund → Bridge Global → Brazen Sky) before being returned to the 1MDB Global Account.
- A direct transfer of $356M also occurred from the Aabar-Seychelles account to 1MDB Global.
- The $375M from the Brazen Sky Account to the 1MDB Global Account is likely the same $375M that was transferred through the Bridge Global Account, creating a loop to make the transfer appear more complex and less traceable.
Document Actions