Compare string entries of columns in different pandas dataframes

I have two dataframes, df1 and df2, both with different number of rows.

df1 has a column 'NAME', a short string; and df2 has a column 'LOCAL_NAME', a much longer string that may contain the exact contents of df1.NAME.

I want to compare every entry of df1.NAME with every entry of df2.LOCAL_NAME. If df1.NAME appears in a particular entry of df2.LOCAL_NAME, I want to add an entry in a new column df2.NAME_MAP equal to df1.NAME; if it doesn't appear in the long string df2.LOCAL_NAME, the corresponding entry in df2.NAME_MAP will be df2.LOCAL_NAME.

For now, efficiency is not an issue. Here are sample datasets.

import pandas as pd

df1 = pd.DataFrame({
    'NAME': ['222', '111', '444', '333'],
    'OTHER_COLUMNS': [3, 6, 7, 34]
})

df2 = pd.DataFrame({
    'LOCAL_NAME': ['aac111asd', 'dfse222vdsf', 'adasd689as', 'asdv444grew', 'adsg243df', 'dsfh948dfd']
})

df1:

NAME OTHER_COLUMNS
'222' 3
'111' 6
'444' 7
'333' 34

df2:

LOCAL_NAME
'aac111asd'
'dfse222vdsf'
'adasd689as'
'asdv444grew'
'adsg243df'
'dsfh948dfd'

The goal is to create another column in df2 called NAME_MAP, which has the value of df1.NAME if that string is contained exactly in the larger df2.LOCAL_NAME string. df2 would now look like this:

LOCAL_NAME NAME_MAP
'aac111asd' '111'
'dfse222vdsf' '222'
'adasd689as' 'adasd689as'
'asdv444grew' '444'
'adsg243df' 'adsg243df'
'dsfh948dfd' 'dsfh948dfd'

Then I can join the two dataframes on NAME_MAP:

LOCAL_NAME NAME_MAP NAME (from df1) OTHER_COLUMNS (from df1)
'aac111asd' '111' '111' 6
'dfse222vdsf' '222' '222' 3
'adasd689as' 'adasd689as' NaN NaN
'asdv444grew' '444' '444' 7
'adsg243df' 'adsg243df' NaN NaN
'dsfh948dfd' 'dsfh948dfd' NaN NaN

How do I go about trying to do this string comparison in two datasets of different sizes?


This can be achieved by looping over the rows with apply and using str.contains to check whether the LOCAL_NAME column contains the value from the NAME column:

(
    df1
    # check which LOCAL_NAME values are related
    .assign(matches = lambda x: x["NAME"].apply(lambda y: df2.loc[df2["LOCAL_NAME"].str.contains(y), "LOCAL_NAME"].tolist()))
    # make sure each row only has one key in the case that there are multiple matches found (is this possible?)
    .explode("matches")
    # join with the second dataframe to get the NAME_MAP column
    .merge(df2, how="right", left_on="matches", right_on="LOCAL_NAME")
)

This will give the following output dataframe:

NAME OTHER_COLUMNS matches LOCAL_NAME
111 6 aac111asd aac111asd
222 3 dfse222vdsf dfse222vdsf
nan nan nan adasd689as
444 7 asdv444grew asdv444grew
nan nan nan adsg243df
nan nan nan dsfh948dfd
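The question also asks for the unmatched rows to fall back to LOCAL_NAME in a NAME_MAP column. One way to finish the pipeline above is to store the merged result in a variable (here called merged, a name not in the original answer) and fill the missing NAME values with fillna:

```python
import pandas as pd

df1 = pd.DataFrame({
    'NAME': ['222', '111', '444', '333'],
    'OTHER_COLUMNS': [3, 6, 7, 34]
})
df2 = pd.DataFrame({
    'LOCAL_NAME': ['aac111asd', 'dfse222vdsf', 'adasd689as', 'asdv444grew', 'adsg243df', 'dsfh948dfd']
})

merged = (
    df1
    # for each NAME, collect the LOCAL_NAME values that contain it
    .assign(matches=lambda x: x["NAME"].apply(
        lambda y: df2.loc[df2["LOCAL_NAME"].str.contains(y), "LOCAL_NAME"].tolist()))
    # one row per match (names with no match explode to a NaN row)
    .explode("matches")
    # right merge keeps every df2 row, in df2's order
    .merge(df2, how="right", left_on="matches", right_on="LOCAL_NAME")
)

# NAME_MAP is the matched NAME, falling back to LOCAL_NAME where nothing matched
merged["NAME_MAP"] = merged["NAME"].fillna(merged["LOCAL_NAME"])
```

After this, merged contains the NAME_MAP column exactly as described in the question, with NaN in NAME and OTHER_COLUMNS for the unmatched rows.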

Here's a way to solve it.

First, create a df with the cartesian product of both dataframes, as shown here: https://stackoverflow.com/questions/53907526/merge-dataframes-with-the-all-combinations-of-pks

cp = df2.assign(key=0).merge(df1.assign(key=0), how='left')

Keep only the rows where NAME is contained in LOCAL_NAME (print cp after this step to see what's happening):

cp['key'] = [1 if x in y else 0 for x,y in zip(cp['NAME'],cp['LOCAL_NAME'])]
cp = cp[cp['key'] == 1].drop(['key'], axis=1)

Merge, and fill the rows without a match with the local name:

df2 = df2.merge(cp, how='left', on='LOCAL_NAME')
df2['NAME'] = df2['NAME'].fillna('')
df2['NAME'] = [y if x == '' else x for x,y in zip(df2['NAME'],df2['LOCAL_NAME'])]

Result :

    LOCAL_NAME  NAME        OTHER_COLUMNS
0   aac111asd   111         6.0
1   dfse222vdsf 222         3.0
2   adasd689as  adasd689as  NaN
3   asdv444grew 444         7.0
4   adsg243df   adsg243df   NaN
5   dsfh948dfd  dsfh948dfd  NaN
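As a side note, the same result can be sketched without an explicit loop or cartesian product by building a single regex alternation from df1.NAME and using str.extract; this is not from either answer above, just a compact alternative under the same assumptions (re.escape guards against names containing regex metacharacters):

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'NAME': ['222', '111', '444', '333'],
    'OTHER_COLUMNS': [3, 6, 7, 34]
})
df2 = pd.DataFrame({
    'LOCAL_NAME': ['aac111asd', 'dfse222vdsf', 'adasd689as', 'asdv444grew', 'adsg243df', 'dsfh948dfd']
})

# one capturing group that matches any of the NAME values
pattern = "(" + "|".join(re.escape(n) for n in df1["NAME"]) + ")"

# extract the matching NAME (NaN where nothing matches), then fall back to LOCAL_NAME
df2["NAME_MAP"] = df2["LOCAL_NAME"].str.extract(pattern)[0].fillna(df2["LOCAL_NAME"])

# left join to pull in df1's columns; unmatched rows get NaN
result = df2.merge(df1, how="left", left_on="NAME_MAP", right_on="NAME")
```

One caveat: regex alternation matches the first alternative that succeeds at a position, so if one NAME is a prefix of another, sort the names by descending length before joining them into the pattern.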
