TL;DR: Here is a public GitHub gist showing how to use OpenAI’s API to do fuzzy string matching in R.
Update: You can now use Melissa Dell’s LinkTransformer Python package to do this for free, no OpenAI API key required. See the Python script in the public gist.
A few years ago, I collected some data that required a fuzzy merge with administrative data on schools. I’d never done a fuzzy merge before, and tried many different approaches, but ultimately couldn’t get over the hump.
You can even track this from my Twitter feed.
But now it seems LLMs can make this process much, much easier.
LLM embeddings explained
An LLM embedding is a numerical representation of text data that captures the meaning and context of the text. Essentially, it is an easy way to simultaneously reduce the dimensionality of a text string and convert it to a numeric format. For example, OpenAI’s text-embedding-ada-002 model maps any input string to a vector of 1,536 numbers, and strings with similar meanings map to nearby vectors.
As OpenAI writes, this has many useful applications:
OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are commonly used for:
Search (where results are ranked by relevance to a query string)
Clustering (where text strings are grouped by similarity)
Recommendations (where items with related text strings are recommended)
Anomaly detection (where outliers with little relatedness are identified)
Diversity measurement (where similarity distributions are analyzed)
Classification (where text strings are classified by their most similar label)
Utilizing OpenAI’s text embeddings
We can use OpenAI’s text embeddings in R or Python by signing up for an API key and using a package to communicate with the API. Calls to the API are extremely cheap; a collaborator of mine recently did about 45,000 calls for about 5¢. Our API calls involved very short snippets of text, so the total cost will be higher for longer text.
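As a back-of-envelope check, assuming roughly 10 tokens per call (my assumption; the strings were short names): 45,000 calls × 10 tokens ≈ 450,000 tokens, and at the $0.0001 per 1,000 tokens rate quoted in Step 4 below, that is 450 × $0.0001 ≈ $0.045, or about 5¢.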
Example R script
I’ll walk through an R script below to illustrate how this works. In this example, I have two datasets that each have the name of a high school in the United States. However, the two datasets may have different naming styles for the schools (e.g., “Valley View HS” in one and “Valley View High School” in the other). This makes exact merging impossible. But OpenAI’s text embeddings are astonishingly good at noting the similarity in exactly the way that a human would intuit.
Step 1: Load packages and set up the API key
library(tidyverse)
library(openai)

#------------------------------------------------------------------------------
# Step 1: OpenAI API key
#------------------------------------------------------------------------------
# Your OpenAI API key should be an environment variable you set in ~/.Renviron
# (the openai package reads the OPENAI_API_KEY environment variable automatically)
# ... you should never put your API key directly in your code!
There’s not much to say here other than that you should always store your API key in a way that keeps it invisible to anyone reading the script.
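For example, a single line like the following in ~/.Renviron (restart R after editing) is all the openai package needs, since it reads the OPENAI_API_KEY environment variable automatically:

```
# In ~/.Renviron; "sk-..." is a placeholder, not a real key
OPENAI_API_KEY=sk-...
```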
Step 2: Cosine similarity
OpenAI suggests using cosine similarity to compare the embeddings of two strings. Here I provide two functions:
Compute the cosine similarity between any two embeddings
Find the maximum of the cosine similarity between one row of the primary dataset and every row of the secondary dataset. This is how we find the “best match.”
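For reference, the cosine similarity between two embedding vectors $x$ and $y$ is the standard quantity

$$\text{cos\_sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}},$$

which is 1 when the vectors point in the same direction and 0 when they are orthogonal.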
#------------------------------------------------------------------------------
# Step 2: Define functions for later use:
#------------------------------------------------------------------------------
# Function 1: calculate the cosine similarity between two vectors
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Function 2: compute the maximum cosine similarity and its index between a vector and a list of vectors
max_cosine_similarity_index <- function(x, y_list) {
  # Use map_dbl() to compute the cosine similarity with each vector in y_list
  cos_sim <- map_dbl(y_list, cosine_similarity, x = x)
  # Return the maximum cosine similarity and the index where it occurs
  list(max_cos_sim = max(cos_sim), max_cos_sim_index = which.max(cos_sim))
}
The second function is best accomplished with the purrr package.
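As a quick sanity check, here is what the two functions return on toy vectors (made-up values, not real embeddings; assumes the definitions above with purrr loaded):

```r
cosine_similarity(c(1, 0), c(1, 0))  # 1: vectors point the same way
cosine_similarity(c(1, 0), c(0, 1))  # 0: orthogonal vectors

# The second vector in the list is an exact match, so it wins
max_cosine_similarity_index(c(1, 1), list(c(1, 0), c(1, 1), c(0, 1)))
# $max_cos_sim = 1, $max_cos_sim_index = 2
```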
Step 3: Read in the data
Nothing too fancy here:
#------------------------------------------------------------------------------
# Step 3: Read in the data
#
# Assume you have two data frames called df1 and df2
# Each has a field with a school name
# For example: df1 has columns "id" and "HS"
# df2 has columns "schid" and "name"
#------------------------------------------------------------------------------
df1 <- read_csv("test1.csv")
print(df1)
df2 <- read_csv("test2.csv")
print(df2, n=100)
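For concreteness, here is a hypothetical preview of what the two data frames might contain (illustrative rows only; the school names echo the real example discussed in Step 6):

```r
# Hypothetical preview, for illustration only:
# > print(df1)
#   id  HS
#   1   TIMPANOGOS HIGH SCHOOL
#   2   VALLEY VIEW HS
# > print(df2)
#   schid  name
#   101    Timpanogos HS
#   102    Valley View High School
```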
Step 4: Obtain embeddings from OpenAI API
Here is where we get the embeddings using the create_embedding() function from the openai package. OpenAI recommends using the model “text-embedding-ada-002.” This costs 1/100¢ per 1,000 tokens. (As a reference, 1 token is about 4 characters in English.)
It does take quite some time to communicate with the API and get back the embeddings. For this simple example, it took about 5 minutes for the names of about 150 schools.
This code then stores an embedding vector in each row of the dataframe. Each embedding vector from text-embedding-ada-002 has 1,536 entries.
#------------------------------------------------------------------------------
# Step 4: Get the embeddings for the key in each dataset
#
# This may take some time (took about 5 minutes for about 130 API calls)
# Cost is $0.0001 / 1K tokens
#------------------------------------------------------------------------------
# Get the embeddings for the school names in df1 using the model text-embedding-ada-002
df1$embedding <- map(df1$HS, function(x) create_embedding(input = x, model = "text-embedding-ada-002")$data$embedding[[1]])
# Get the embeddings for the school names in df2 using the same model
df2$embedding <- map(df2$name, function(x) create_embedding(input = x, model = "text-embedding-ada-002")$data$embedding[[1]])
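If the one-call-per-row pattern above is too slow, note that the underlying embeddings endpoint accepts a batch of strings in a single request. Here is a minimal sketch, assuming create_embedding() forwards a character vector to the API unchanged (worth verifying with your version of the openai package):

```r
# Sketch: embed all of df1's school names in one batched request
# (assumes create_embedding() accepts a character vector, as the API does)
res <- create_embedding(input = df1$HS, model = "text-embedding-ada-002")
df1$embedding <- res$data$embedding  # one embedding vector per input string
```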
Step 5: Do the fuzzy match
Now that we have the embeddings, we compute the cosine similarity between each row of the primary key in our dataframe of interest (df1 in this case) and each row in the dataframe we want to link with (df2 in this case). Then we match based on the element that has the greatest similarity.1
#------------------------------------------------------------------------------
# Step 5: For each primary key, find the maximally similar foreign key in the
# other dataset
#------------------------------------------------------------------------------
# Get the school name with the most similar embedding
# (compute the similarity list once, then extract both pieces)
sims <- map(df1$embedding, max_cosine_similarity_index, y_list = df2$embedding)
df1$similarity <- map_dbl(sims, "max_cos_sim")
df1$argmax_similarity <- map_int(sims, "max_cos_sim_index")
df1$matched_name <- df2$name[df1$argmax_similarity]
Step 6: Inspect the matches
It’s always good practice to manually inspect a random set of matches to ensure that things are working properly. Here’s what the matches look like in my sample data:
#------------------------------------------------------------------------------
# Step 6: Inspect match performance
#------------------------------------------------------------------------------
df1 %>% select(HS, matched_name, similarity) %>% print()
The embeddings for “Timpanogos High School” and “Timpanogos HS” were highly similar. Moreover, the embeddings worked well even though the name was in all caps in one dataset and proper capitalization in the other.
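One way to make this inspection more systematic is to flag the weakest matches for manual review. A minimal sketch (the 0.9 cutoff is an arbitrary illustrative choice, not a recommendation):

```r
# Surface the lowest-similarity matches for a human to check
df1 %>%
  filter(similarity < 0.9) %>%   # arbitrary illustrative cutoff
  arrange(similarity) %>%
  select(HS, matched_name, similarity)
```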
Note that it would be easy to modify the max_cosine_similarity_index() function to return the 2nd most similar, 3rd most similar, … Nth most similar elements. This would be useful in more complicated matching situations.
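A minimal sketch of that modification, reusing cosine_similarity() from Step 2 (the function name top_n_cosine_similarity() is my own, hypothetical):

```r
# Return the n best matches and their indices, best first
top_n_cosine_similarity <- function(x, y_list, n = 3) {
  cos_sim <- map_dbl(y_list, cosine_similarity, x = x)
  idx <- order(cos_sim, decreasing = TRUE)[seq_len(min(n, length(cos_sim)))]
  list(cos_sim = cos_sim[idx], index = idx)
}
```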