Compare string entries of columns in different pandas dataframes

I have two dataframes, df1 and df2, both with different number of rows.

df1 has a column 'NAME', a short string; and df2 has a column 'LOCAL_NAME', a much longer string that may contain the exact contents of df1.NAME.

I want to compare every entry of df1.NAME with every entry of df2.LOCAL_NAME. If df1.NAME appears in a particular entry of df2.LOCAL_NAME, I want to add an entry in a new column df2.NAME_MAP equal to df1.NAME; if it doesn't appear in the long string df2.LOCAL_NAME, the corresponding entry in df2.NAME_MAP will be df2.LOCAL_NAME.

For now, efficiency is not an issue. Here are sample datasets.

import pandas as pd

df1 = pd.DataFrame({
    'NAME': ['222', '111', '444', '333'],
    'OTHER_COLUMNS': [3, 6, 7, 34]
})

df2 = pd.DataFrame({
    'LOCAL_NAME': ['aac111asd', 'dfse222vdsf', 'adasd689as', 'asdv444grew', 'adsg243df', 'dsfh948dfd']
})

df1:

NAME OTHER_COLUMNS
'222' 3
'111' 6
'444' 7
'333' 34

df2:

LOCAL_NAME
'aac111asd'
'dfse222vdsf'
'adasd689as'
'asdv444grew'
'adsg243df'
'dsfh948dfd'

The goal is to create another column in df2 called NAME_MAP, which has the value of df1.NAME if that string is contained exactly in the larger df2.LOCAL_NAME string. df2 would now look like this:

LOCAL_NAME NAME_MAP
'aac111asd' '111'
'dfse222vdsf' '222'
'adasd689as' 'adasd689as'
'asdv444grew' '444'
'adsg243df' 'adsg243df'
'dsfh948dfd' 'dsfh948dfd'

Then I can join the two dataframes on NAME_MAP:

LOCAL_NAME NAME_MAP NAME (from df1) OTHER_COLUMNS (from df1)
'aac111asd' '111' '111' 6
'dfse222vdsf' '222' '222' 3
'adasd689as' 'adasd689as' NaN NaN
'asdv444grew' '444' '444' 7
'adsg243df' 'adsg243df' NaN NaN
'dsfh948dfd' 'dsfh948dfd' NaN NaN

How do I go about trying to do this string comparison in two datasets of different sizes?


This can be achieved by looping over the rows with apply and using str.contains to check whether the LOCAL_NAME column contains the value from the NAME column:

(
    df1
    # check which LOCAL_NAME values are related
    .assign(matches = lambda x: x["NAME"].apply(lambda y: df2.loc[df2["LOCAL_NAME"].str.contains(y), "LOCAL_NAME"].tolist()))
    # make sure each row only has one key in the case that there are multiple matches found (is this possible?)
    .explode("matches")
    # join with the second dataframe to get the NAME_MAP column
    .merge(df2, how="right", left_on="matches", right_on="LOCAL_NAME")
)

This will give the following output dataframe:

NAME OTHER_COLUMNS matches LOCAL_NAME
111 6 aac111asd aac111asd
222 3 dfse222vdsf dfse222vdsf
nan nan nan adasd689as
444 7 asdv444grew asdv444grew
nan nan nan adsg243df
nan nan nan dsfh948dfd
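The question also asks for the unmatched rows to fall back to LOCAL_NAME in a NAME_MAP column. One way to finish the pipeline above is to store the merged result in a variable (here called merged, a name not in the original answer) and fill the missing NAME values with fillna:

```python
import pandas as pd

df1 = pd.DataFrame({
    'NAME': ['222', '111', '444', '333'],
    'OTHER_COLUMNS': [3, 6, 7, 34]
})
df2 = pd.DataFrame({
    'LOCAL_NAME': ['aac111asd', 'dfse222vdsf', 'adasd689as', 'asdv444grew', 'adsg243df', 'dsfh948dfd']
})

merged = (
    df1
    # for each NAME, collect the LOCAL_NAME values that contain it
    .assign(matches=lambda x: x["NAME"].apply(
        lambda y: df2.loc[df2["LOCAL_NAME"].str.contains(y), "LOCAL_NAME"].tolist()))
    # one row per match (names with no match explode to a NaN row)
    .explode("matches")
    # right merge keeps every df2 row, in df2's order
    .merge(df2, how="right", left_on="matches", right_on="LOCAL_NAME")
)

# NAME_MAP is the matched NAME, falling back to LOCAL_NAME where nothing matched
merged["NAME_MAP"] = merged["NAME"].fillna(merged["LOCAL_NAME"])
```

After this, merged contains the NAME_MAP column exactly as described in the question, with NaN in NAME and OTHER_COLUMNS for the unmatched rows.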

Here's a way to solve it.

First, create a df with the cartesian product of both dataframes, as shown here: https://stackoverflow.com/questions/53907526/merge-dataframes-with-the-all-combinations-of-pks

cp = df2.assign(key=0).merge(df1.assign(key=0), how='left')

Keep only the rows where NAME is contained in LOCAL_NAME (print cp after this step to see what's happening):

cp['key'] = [1 if x in y else 0 for x,y in zip(cp['NAME'],cp['LOCAL_NAME'])]
cp = cp[cp['key'] == 1].drop(['key'], axis=1)

Merge, and fill the rows without a match with the local name:

df2 = df2.merge(cp, how='left', on='LOCAL_NAME')
df2['NAME'] = df2['NAME'].fillna('')
df2['NAME'] = [y if x == '' else x for x,y in zip(df2['NAME'],df2['LOCAL_NAME'])]

Result :

    LOCAL_NAME  NAME        OTHER_COLUMNS
0   aac111asd   111         6.0
1   dfse222vdsf 222         3.0
2   adasd689as  adasd689as  NaN
3   asdv444grew 444         7.0
4   adsg243df   adsg243df   NaN
5   dsfh948dfd  dsfh948dfd  NaN
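As a side note, the same result can be sketched without an explicit loop or cartesian product by building a single regex alternation from df1.NAME and using str.extract; this is not from either answer above, just a compact alternative under the same assumptions (re.escape guards against names containing regex metacharacters):

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'NAME': ['222', '111', '444', '333'],
    'OTHER_COLUMNS': [3, 6, 7, 34]
})
df2 = pd.DataFrame({
    'LOCAL_NAME': ['aac111asd', 'dfse222vdsf', 'adasd689as', 'asdv444grew', 'adsg243df', 'dsfh948dfd']
})

# one capturing group that matches any of the NAME values
pattern = "(" + "|".join(re.escape(n) for n in df1["NAME"]) + ")"

# extract the matching NAME (NaN where nothing matches), then fall back to LOCAL_NAME
df2["NAME_MAP"] = df2["LOCAL_NAME"].str.extract(pattern)[0].fillna(df2["LOCAL_NAME"])

# left join to pull in df1's columns; unmatched rows get NaN
result = df2.merge(df1, how="left", left_on="NAME_MAP", right_on="NAME")
```

One caveat: regex alternation matches the first alternative that succeeds at a position, so if one NAME is a prefix of another, sort the names by descending length before joining them into the pattern.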
