Conditional Join

Posted on Tue 20 March 2018

In [1]:
import pandas as pd
import numpy as np

Given an example of two tables

In [2]:
A = pd.DataFrame(data={'name': ['John', 'John', 'Alice', 'Alice', 'Bob'], 
                       'surname': ['Smith', 'Johnson', 'Martin', 'Williams', 'Brown'],
                       'year_birth': [1974, 1956, 1983, 1995, 2002]})
A
Out[2]:
name surname year_birth
0 John Smith 1974
1 John Johnson 1956
2 Alice Martin 1983
3 Alice Williams 1995
4 Bob Brown 2002
In [3]:
B = pd.DataFrame(data={'name': ['Alice', 'John', 'John', 'Michael'],
                       'surname': ['Williams', 'Johnson', 'Smith', 'Brown'],
                       'year_marriage': [2015, 1980, 2002, 2000]})
B
Out[3]:
name surname year_marriage
0 Alice Williams 2015
1 John Johnson 1980
2 John Smith 2002
3 Michael Brown 2000

Lets recreate the following SQL statement:

SELECT * from A, B INNER JOIN ON A.name = B.name AND A.surname = B.surname WHERE B.year_marriage - A.year_birth > 20 and A.surname = "Smith"

In [4]:
df = pd.merge(A, B, on=['name', 'surname'])
In [5]:
df[(df.year_marriage - df.year_birth >20) & (df.surname == 'Smith')]
Out[5]:
name surname year_birth year_marriage
0 John Smith 1974 2002