kaeru


"Stuff I'm working on ..."

Querying Federation of Malaya Budget from 1959

by kaeru published 2025/12/05 22:01:00 GMT+8, last modified 2025-12-05T22:09:12+08:00
Current state of VLMs allows us to query accurately printed tables from the late 1950's

Some time back REFSA and the Japan Society for the Promotion of Science donated scanned budgets and financial statements, some from all the way back to 1957 to Sinar Project.

You can browse or search for them at the Malaysian Government Documents Archive, maintained by yours truly.

I did some some cleanups for some of the documents, including OCR about 10 years ago.

But other than searching for text, it would have been quite tedious to extract the tables manually with technology back then.

Let's try see how well we can query this document when fed into a VLM.

statement-of-assets-liabilities-1959.png

A lot of tables in old documents were formatted for human visual clarity and not for machines. So the structure of the tables, often have merged cells or labels that don't convert cleanly into flat machine readable formats like CSVs.

The trick with VLMs is to try extract the information in a nested data format like JSON.

Extract the figures of this Malaysian government statement of assets and 
liabilities document as JSON

You'll get something like this, and from here you can already get an idea on how well the VLM is parsing the tables.

{
  "consolidated_revenue_account": {
    "balance_31_12_58": 82087297.59,
    "add_appropriations_written_back": 17150000.00,
    "surplus_1959": {
      "revenue_for_year": 890972163.39,
      "expenditure_for_year": 838092333.35,
      "appreciation_of_investments": 3699676.09
    },
    "balance_31_12_59": 155816803.72
  },
  "consolidated_loan_account": {
    "premium_bonds": 1139450.00,
    "treasury_deposits": 104400000.00,
    "treasury_bills": 93395804.21,
    "total_liabilities": 198935254.21
  },
... <snip> ...

Let's ask some questions to see how accurately it answers it:

How much funds left in the development fund account at the end of 1959?

Based on the provided "STATEMENT OF ASSETS AND LIABILITIES", the balance of the Consolidated Revenue Account at the end of 1959 (31-12-59) is $155,816,803.72.

This figure is explicitly stated in the document under the "CONSOLIDATED REVENUE ACCOUNT" section.

Screenshot From 2025-12-05 21-44-53.png

Checks out.

The next one is more impressive. The Consolidated Trust Account table is not so straightforward, because there is no header for the last column for the subtotals. And some of the row labels are on two lines, so there is a blank row above it.

Screenshot From 2025-12-05 21-50-35.png

Under the Consolidated Trusts Account, how much funds are left for Advances, Government Officers Accounts?

Based on the "STATEMENT OF ASSETS AND LIABILITIES", under the CONSOLIDATED TRUST ACCOUNT section, the final balance for the Advances, Government Officers Accounts at the end of 1959 is $2,536,147.50.

This is found in the "STATUTORY FUNDS" subsection, in the final column labeled "final_balance" for that specific account. The calculation is as follows:

  • Starting Balance (31-12-58): $489,141.75
  • Appropriations (1959): $2,600,000.00
  • Disbursements (Net 1959): $552,994.25

The final balance is calculated as: $489,141.75 + $2,600,000.00 - $552,994.25 = $2,536,147.50

This matches the value provided in the table.

Got it right. The full JSON output was also correct.

This is on just consumer level hardware with Qwen3-VL 8B model, and it takes about 1.5 seconds for each answer.

With small modern VLM models, we can now extract and query structured data from a lot of old documents easily, and it's quite accessible.

My setup:

 

Document Actions