Fuzzy Matching in SQL Server 2025

Fuzzy Matching in SQL Server 2025

Data rarely arrives in perfect condition. Typos, regional spelling differences, and inconsistent formats make exact matching unreliable in real-world scenarios. That’s where fuzzy matching comes in; and SQL Server 2025 introduces powerful built-in functions to handle it directly in T-SQL. In this post, we’ll walk through a practical example using UK vs. US spelling variations and explore how to use:

  • EDIT_DISTANCE
  • EDIT_DISTANCE_SIMILARITY
  • JARO_WINKLER_DISTANCE
  • JARO_WINKLER_SIMILARITY

Why Fuzzy Matching Matters

Exact string comparisons (= or JOIN) fail in cases like:

  • “Colour” vs “Color”
  • “Organise” vs “Organize”
  • “Centre” vs “Center”

In data migration, cleansing, or AI-driven workloads, these differences matter. Instead of missing matches, we can assign similarity scores and determine whether two values are “close enough.”

Check Database Compatibility Level

First, let’s check the compatibility level of our database and make sure Preview Features are enabled.

-- Switch to your database context
USE AdventureWorks2025;
GO

-- Check the SQL Server version to ensure it supports the required functions
SELECT @@VERSION;
GO

--If needed, set the compatibility level to 170 to access the latest features
ALTER DATABASE AdventureWorks2025 SET COMPATIBILITY_LEVEL = 170;

-- Enable the preview features if necessary (requires appropriate permissions)
ALTER DATABASE SCOPED CONFIGURATION
SET PREVIEW_FEATURES = ON;
GO

Create a Sample Dataset

We’ll build a simple table to compare UK and US spellings:

 -- Step 1: Create the table
CREATE TABLE WordPairs (
    WordID INT IDENTITY(1,1) PRIMARY KEY, -- Auto-incrementing ID
    WordUK NVARCHAR(50), -- UK English word
    WordUS NVARCHAR(50) -- US English word);

Insert Sample Data


-- Step 2: Insert the data
INSERT INTO WordPairs (WordUK, WordUS) VALUES
('Colour', 'Color'),
('Flavour', 'Flavor'),
('Centre', 'Center'),
('Theatre', 'Theater'),
('Organise', 'Organize'),
('Analyse', 'Analyze'),
('Catalogue', 'Catalog'),
('Programme', 'Program'),
('Metre', 'Meter'),
('Honour', 'Honor'),
('Neighbour', 'Neighbor'),
('Travelling', 'Traveling'),
('Grey', 'Gray'),
('Defence', 'Defense'),
('Practise', 'Practice'), -- Verb form in UK
('Practice', 'Practice'), -- Noun form in both
('Aluminium', 'Aluminum'),
('Cheque', 'Check'); -- Bank cheque vs. check

What is the Levenshtein Distance?

Levenshtein (more precisely, Levenshtein distance) is a concept from computer science used to measure how different two strings (words or sequences of characters) are. It is the minimum number of single-character edits needed to turn one string into another.

Allowed edits are:

  • Insertion (add a character)
  • Deletion (remove a character)
  • Substitution (replace one character with another)

Example

  • Compare: "kitten" to "sitting"
    • Insert g
    • Replace e with i
    • Replace k with s

Total edits = 3, so the Levenshtein distance is 3.

Using the Edit_Distance (Levenshtein) Function

This measures how many single-character changes are required to transform one string into another.

-- EDIT DISTANCE
SELECT WordUK, WordUS, EDIT_DISTANCE(WordUK, WordUS) AS Distance
FROM WordPairs
WHERE EDIT_DISTANCE(WordUK, WordUS) <= 2
ORDER BY Distance ASC;
Results of the Edit_Distance Function

How to interpret:

  • 0 = exact match
  • 1–2 = very similar
  • Higher values = less similar

Use this when:

  • You want strict similarity thresholds
  • You’re cleaning data or validating near matches

Using the Edit_Distance_Similarity Function

Instead of raw distance, this gives a percentage-based similarity score.


--EDIT_DISTANCE_SIMILARITY
SELECT WordUK, WordUS, EDIT_DISTANCE_SIMILARITY(WordUK, WordUS) AS Similarity
FROM WordPairs
WHERE EDIT_DISTANCE_SIMILARITY(WordUK, WordUS) >=75
ORDER BY Similarity DESC;
Using the Edit_Distance_Similarity Function

Why this is useful:

  • Easier to explain to stakeholders
  • Works well for scoring models and ranking matches

Use this when:

  • You want human-readable scoring (0–100)
  • You’re ranking possible matches

What is Jaro‑Winkler Distance?

Jaro‑Winkler distance (or similarity) is another string‑matching technique, similar to Levenshtein, but optimized for comparing short strings like names.

Simple definition

Jaro‑Winkler measures how similar two strings are by:

  • Comparing matching characters
  • Accounting for transpositions (letters out of order)
  • Boosting matches at the beginning of the string (prefix)

That last part (prefix boost) is what makes it different from Levenshtein.

How the scoring works

Typically expressed as a similarity score from 0 to 1: 0 means no similarity and 1.0 equals identical strings. Sometimes you’ll see it expressed as distance = 1 − similarity.

Using the Jaro-Winkler Distance Function

Jaro-Winkler is optimized for short strings and name matching.

--JARO_WINKLER_DISTANCE
SELECT WordUK, WordUS, JARO_WINKLER_DISTANCE(WordUK, WordUS) AS Distance
FROM WordPairs
WHERE JARO_WINKLER_DISTANCE(WordUK, WordUS) <= .05
ORDER BY Distance ASC;
Results of the Jaro_Winkler_Distance Function

Key characteristics:

  • Focuses on matching characters and position
  • Penalizes transpositions less harshly than edit distance

Using the Jaro_Winkler_Similarity Function

This version returns a similarity score between 0 and 1:

-- JARO_WINKLER_SIMILARITY
SELECT WordUK, WordUS, JARO_WINKLER_SIMILARITY(WordUK, WordUS) AS Similarity
FROM WordPairs
WHERE JARO_WINKLER_SIMILARITY(WordUK, WordUS) > .90
ORDER BY Similarity DESC;
Using the Jaro_Winkler_Similarity Function

Why it matters:

  • Provides a normalized similarity score
  • Excellent for ranking and threshold-based filtering

Choosing the Right Algorithm

ScenarioRecommended Function
Data cleansing with strict thresholdsEDIT_DISTANCE
Scoring and ranking matchesEDIT_DISTANCE_SIMILARITY
Matching names or short textJARO_WINKLER_DISTANCE
High-confidence similarity filteringJARO_WINKLER_SIMILARITY

Real-World Use Cases

These functions unlock several practical scenarios:

  • Data Migration Validation:
    • Identify mismatches between source and target systems.
  • Customer Data Deduplication:
    • Match “Jon Smith” vs “John Smith” with confidence scores.
  • AI and Semantic Workloads:
    • Pre-process text before sending to embedding or vector search pipelines.
  • Global Application Support:
    • Handle regional spelling differences automatically.

Key Takeaways

  • SQL Server 2025 brings native fuzzy matching to T-SQL.
  • You no longer need CLR functions or external libraries.
  • Different algorithms serve different use cases, so choose wisely.
  • Similarity scores enable smarter, more flexible matching logic.

Fuzzy matching isn’t just a “nice-to-have”; it’s essential for modern data platforms. With these new built-in functions, SQL Server is closing the gap between traditional relational processing and intelligent data matching scenarios. And the best part? You can implement it using pure T-SQL in just a few lines.

Share and Enjoy !

Shares

Be the first to comment on "Fuzzy Matching in SQL Server 2025"

Leave a comment

Your email address will not be published.


*


This site uses Akismet to reduce spam. Learn how your comment data is processed.