Currently working on linking schools for the Record Linking Lab. This is very helpful, thank you!
I'm glad to hear this! Thanks for letting me know.
This is super useful, thanks! I am thinking of an application to clean up an ID column in a single dataframe; in this case, would you merge the dataframe with...itself?
```python
import linktransformer as ltf
import pandas as pd

# load df
df = pd.read_csv(...)
ltf.merge(df, df, ...)
```
Glad to hear you found it helpful! Yes, the operation you are interested in is called "de-duplication" and is a common one.
Here is how Claude 3 Opus responded to your query (after I provided it with the LinkTransformer whitepaper from arXiv):
"""
Yes, you are correct. To clean up an ID column in a single dataframe using LinkTransformer, you would essentially merge the dataframe with itself.
Here's how you could approach it:
```python
import linktransformer as lt
import pandas as pd
# Load your dataframe
df = pd.read_csv("your_dataframe.csv")
# Perform deduplication using LinkTransformer
df_dedup = lt.dedup_rows(
    df,
    on="ID_column",
    model="sentence-transformers/all-MiniLM-L6-v2",
    cluster_params={"threshold": 0.7},
)
```
In this code:
1. You load your dataframe using `pd.read_csv()`.
2. You use the `lt.dedup_rows()` function from LinkTransformer to perform deduplication on the dataframe.
- The `on` parameter specifies the ID column that you want to clean up.
- The `model` parameter specifies the pre-trained language model to use for generating embeddings. In this example, it uses the "sentence-transformers/all-MiniLM-L6-v2" model.
- The `cluster_params` parameter allows you to specify additional clustering parameters. Here, it sets the similarity threshold to 0.7, meaning that rows with a similarity score above 0.7 will be considered duplicates.
3. The deduplication process works as follows:
- LinkTransformer generates embeddings for each row in the specified ID column using the chosen language model.
- It then clusters the embeddings based on their similarity, using the specified clustering parameters.
- Rows within the same cluster are considered duplicates and are merged or deduplicated.
4. The resulting deduplicated dataframe is stored in `df_dedup`.
By using LinkTransformer's deduplication functionality, you can effectively clean up the ID column in your dataframe by identifying and merging duplicate rows based on their semantic similarity.
This approach leverages the power of transformer language models to understand the semantic meaning of the IDs and perform deduplication intelligently, even if the IDs have slight variations or noise.
"""