"Stuff I'm working on ..."

Using Politikus for Name Matching PEPs

by kaeru published 2021/12/04 21:41:00 GMT+8, last modified 2021-12-07T23:41:38+08:00
politikus relationships on neo4j and ipython3/pandas

As more and more data on beneficial ownership is published or leaked, there is an ever-increasing amount of data available on names, company addresses and so on.

But it is only useful if you know what you're looking for, and usually what we're most interested in are the business dealings of people in power and their associates.

In finance, such high-risk individuals are called Politically Exposed Persons (PEPs). Each country has its own more specific definition; Malaysia's central bank, Bank Negara Malaysia (BNM), defines PEPs as follows:


(a) foreign PEPs – individuals who are or who have been entrusted with prominent public functions by a foreign country.

For example, Heads of State or Government, senior politicians, senior government, judicial or military officials, senior executives of state-owned corporations and important political party officials;

(b) domestic PEPs – individuals who are or have been entrusted domestically with prominent public functions. For example, Heads of State or Government, senior politicians, senior government (includes federal, state and local government), judicial or military officials, senior executives of state-owned corporations and important political party officials; or

(c) persons who are or have been entrusted with a prominent function by an international organisation which refers to members of senior management.
For example, directors, deputy directors and members of the Board or equivalent functions.

The definition of PEPs is not intended to cover middle ranking or more junior individuals in the foregoing categories.

When you're looking for possible abuse of power, you also need to look at family members and close associates. BNM defines these as:

Family Members

Refers to individuals who are related to a PEP either directly (consanguinity) or through marriage. A family member in this context, includes:
(a) parent;
(b) sibling;
(c) spouse;
(d) child; or
(e) spouse's parent,
for both biological or non-biological relationships.

Close Associates

Refers to any individual closely connected to a politically exposed person (PEP), either socially or professionally.

A close associate in this context includes:

(a) extended family members, such as relatives (biological and non-biological relationship);
(b) financially dependent individuals (e.g. persons salaried by the PEP such as drivers, bodyguards, secretaries);
(c) business partners or associates of the PEP;
(d) prominent members of the same organisation as the PEP;
(e) individuals working closely with the PEP (e.g. work colleagues); or
(f) close friends.

Getting Politikus Data

Sinar Project's Politikus project continuously adds and updates data on PEPs, their family members and associates, as well as their relationships, and publishes it as open data via an API.

A simple example to download a CSV of all the latest PEPs:

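import pandas
import requests
from tqdm import tqdm

HEADERS = {'Accept': 'application/json'}
# NOTE: the rest of this query string is cut off in the original listing.
# The fields read below probably need full objects in the response
# (plone.restapi's fullobjects parameter); that is an assumption.
BASE_URL = "https://politikus.sinarproject.org/@search?" \
           "portal_type=Person"

# default batch is 25 items
BATCH = 25
# Assumption: this line is truncated in the original; plone.restapi-style
# search responses report the total number of hits as 'items_total'.
total_items = requests.get(BASE_URL, headers=HEADERS).json()['items_total']

peps = []

for i in tqdm(range(0, total_items, BATCH), desc="Download Progress"):
    b_start = i  # b_start is a zero-based offset, so the first batch starts at 0
    URL = BASE_URL + "&b_start=" + str(b_start)
    r = requests.get(URL, headers=HEADERS)

    for person in r.json()['items']:
        pep = {}
        pep['name'] = person['name']
        pep['summary'] = person['summary']
        pep['pepStatusDetails'] = person['pepStatusDetails']
        pep['birth_date'] = person['birth_date']
        if person['gender']:
            pep['gender'] = person['gender']['title']
        else:
            pep['gender'] = None

        # tax residencies
        taxResidencies = []
        if person['taxResidencies']:
            for tax_country in person['taxResidencies']:
                # Assumption: the loop body is missing in the original; each
                # entry is taken to be a mapping with a 'title', like gender.
                taxResidencies.append(tax_country['title'])
        if taxResidencies:
            pep['taxResidencies'] = ",".join(taxResidencies)
        else:
            pep['taxResidencies'] = None

        # nationalities
        nationalities = []
        if person['nationalities']:
            for nationality in person['nationalities']:
                # Same assumption as for tax residencies above.
                nationalities.append(nationality['title'])
        if nationalities:
            pep['nationalities'] = ",".join(nationalities)
        else:
            pep['nationalities'] = None

        pep['UID'] = person['UID']
        pep['rdf_id'] = person['@id']

        peps.append(pep)

df = pandas.DataFrame(peps)
print("Exporting politikus-persons.csv in current directory")
df.to_csv("politikus-persons.csv", index=False)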

On GitHub

This gives us enough contextual information to use after matching names, such as a basic description of the person.

      name                             summary                                            rdf_id
0     Adly Kamarudin                   Datuk Malaysian businessman charged with abet...   https://politikus.sinarproject.org/persons/adl...
1     Nasrudin Hassan                  Nasrudin Hassan is a politician                    https://politikus.sinarproject.org/persons/nas...
2     Shahrir Abdul Samad              Shahrir bin Abdul Samad is a Malaysian politic...  https://politikus.sinarproject.org/persons/sha...
3     Syed Mohd Fazmi                  Syed Mohd Fazmi Bin Sayid Mohammad is the aide...  https://politikus.sinarproject.org/persons/sye...
4     Hafizuddin Ariff                 Mohd Hafizuddin Bin Mohd Ariff is the aide to ...  https://politikus.sinarproject.org/persons/haf...
...   ...                              ...                                                ...
1549  Elizabeth Kay Nyigor             Elizabeth Kay Nyigor is the aide to a politician   https://politikus.sinarproject.org/persons/eli...
1550  Farahana Hussain                 Farahana binti Hussain is the aide to a politi...  https://politikus.sinarproject.org/persons/far...
1551  Abdul Khaliq Ruzain Abdul Jalil  Abdul Khaliq Ruzain Abdul Jalil is the aide to...  https://politikus.sinarproject.org/persons/abd...
1552  Mohd Fadil Muskon                Mohd Fadil Muskon is a Malaysian politician an...  https://politikus.sinarproject.org/persons/moh...
1553  Noraini Roslan                   YDH. TPr. Hajah Noraini binti Haji Roslan is t...  https://politikus.sinarproject.org/persons/nor...

ICIJ Offshore Leaks

Now that we have some names, we can match them against beneficial ownership data. ICIJ Offshore Leaks is a good one to start with, because each time a new PEP or associate is added to Politikus, there is a possibility of a match that was missed before.

You can download the database of past leaks from the ICIJ website.

It's not difficult to quickly import all of it into pandas:

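import pandas as pd

# The list of ICIJ CSV file names is cut off in the original listing;
# add the node files from your own download here.
files = [
]

icij_dfs = []
for f in files:
    icij_dfs.append(pd.read_csv('/storage/datasets/offshore/' + f, usecols=['name','node_id']))

df_icij = pd.concat(icij_dfs)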

This should be enough to get the names (1,531,731 of them) and a unique ID for each, which we can use later to look up other details in the ICIJ database.

A simple matching exercise

fuzzymatcher is useful for quick name matching and for joining two pandas DataFrames.

To improve matching, you usually want to clean up the data a bit first, for example by making everything uppercase and trimming whitespace.

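# df_politikus is the Politikus DataFrame built earlier (named df in the download script)
df_icij['name'] = df_icij['name'].str.upper().str.strip()
df_politikus['name'] = df_politikus['name'].str.upper().str.strip()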
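
The clean-up can go a little further than upper-casing. Here is a minimal sketch, assuming that punctuation and repeated spaces are just noise for matching purposes. `normalize_names` is a hypothetical helper written for this note, not part of Politikus or fuzzymatcher:

```python
import pandas as pd


def normalize_names(names: pd.Series) -> pd.Series:
    """Upper-case, drop punctuation and collapse whitespace in a Series of names."""
    return (
        names.str.upper()
        .str.replace(r"[^\w\s]", " ", regex=True)  # punctuation -> space
        .str.replace(r"\s+", " ", regex=True)      # collapse runs of whitespace
        .str.strip()
    )


# Small sample; missing values pass through as missing.
sample = pd.Series(["  Raja Nong  Chik ", "Noraini   Roslan.", None])
cleaned = normalize_names(sample)
print(cleaned.tolist())
```

Applying the same function to both DataFrames keeps the two name columns comparable. How aggressive to be (for example, whether to strip honorifics) is a judgment call that depends on the data.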

Then match politikus names against ICIJ:

import fuzzymatcher

left_on = ['name']
right_on = ['name']

df_match = fuzzymatcher.fuzzy_left_join(df_politikus, df_icij, left_on, right_on)

Update: Note that this keeps only one ICIJ match for each Politikus PEP or associate. You probably want it the other way around, with ICIJ on the left and Politikus on the right, which tries to score every ICIJ name against the Politikus names. This needs an additional filtering step with a threshold, so that you export only, say, anything over 0.65, keeping a thousand or so rows of the best results and throwing away the 1.4 million or so rows of obviously different names. I'll cover this in an upcoming note on the latest data from the Pandora Papers.
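
The threshold filtering mentioned in the update could look roughly like the sketch below. It assumes the joined result has a score column called `best_match_score` (the name I believe fuzzymatcher uses; check your own output), and that the name columns end up as `name_left` and `name_right`. The rows and scores are placeholders for illustration, not real match results:

```python
import pandas as pd

# Toy stand-in for the output of a fuzzy join. Column names and scores
# are illustrative assumptions, not real match results.
df_match = pd.DataFrame(
    {
        "best_match_score": [0.91, 0.66, 0.40, -0.12, None],
        "name_left": ["PERSON A", "PERSON B", "PERSON C", "PERSON D", "PERSON E"],
        "name_right": ["PERSON A", "PERSON B.", "PERSEN SEE", "SOMEONE ELSE", None],
    }
)

THRESHOLD = 0.65  # cut-off mentioned in the text; tune it for your data

# Keep only rows scoring above the threshold, best matches first.
# Rows with a missing score drop out because NaN > x is False.
df_best = (
    df_match[df_match["best_match_score"] > THRESHOLD]
    .sort_values("best_match_score", ascending=False)
    .reset_index(drop=True)
)

print(df_best)
```

The filtered frame is small enough to export with `to_csv` and review by hand, which is the point of throwing away the obviously different names first.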

... get some coffee

Once done, you can do the analysis in pandas or export to CSV, maybe for others to reuse in whatever tools they're familiar with. Here I loaded it into a spreadsheet and sorted it by the best matching scores. You can download this CSV file here.

Top name matches in ICIJ databases

With some leads, you can now start searching and exploring the highest-scoring matches on ICIJ, in this case Raja Nong Chik. Politikus data already provides you with some basic info and a URL for additional information.

ICIJ Online Network browser for Raja Nong Chik Bin Raja Zainal Abidin

These exercises help me build snippets of code, and explore methods. Instead of putting it all into one single DataFrame, I could explore doing it concurrently for each separate list of names using all CPU threads, refactor some parts into reusable scripts and so on. Politikus also has other data points such as addresses, identification numbers, faces and so on that could also be used as additional signals for scoring and matching.

Politikus also stores relationships, which open up other interesting methods of uncovering possible abuses of power, such as using a network graph to identify cliques.
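
As a rough illustration of the clique idea, the sketch below lists the maximal cliques in a small undirected relationship graph using a plain Bron-Kerbosch recursion and only the standard library. The people and edges are placeholders, not Politikus data, and `maximal_cliques` is a hypothetical helper. For real graphs a library such as networkx (`networkx.find_cliques`) would likely be the more practical choice:

```python
from collections import defaultdict


def maximal_cliques(edges):
    """Return all maximal cliques of an undirected graph given as (a, b) pairs."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    cliques = []

    def expand(clique, candidates, excluded):
        # Basic Bron-Kerbosch without pivoting: fine for small graphs.
        if not candidates and not excluded:
            cliques.append(sorted(clique))
            return
        for node in list(candidates):
            expand(
                clique | {node},
                candidates & neighbours[node],
                excluded & neighbours[node],
            )
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(neighbours), set())
    return sorted(cliques)


# Placeholder relationships (hypothetical people, not real data).
relationships = [
    ("PEP A", "SPOUSE A"),
    ("PEP A", "PARTNER B"),
    ("SPOUSE A", "PARTNER B"),
    ("PARTNER B", "AIDE C"),
]

for group in maximal_cliques(relationships):
    print(group)
```

Each printed group is a set of people who are all directly connected to one another, which is the kind of tight cluster worth a closer look.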