
Implementation of Damerau-Levenshtein Distance Algorithm #460

Merged


@Kalkwst Kalkwst commented Aug 1, 2024

Summary

This PR introduces an implementation of the Damerau-Levenshtein distance algorithm. The Damerau-Levenshtein distance is a string metric that measures the difference between two sequences as the minimum number of operations needed to transform one into the other. The allowed operations are insertion, deletion, and substitution of a single character, and transposition of two adjacent characters.

Algorithm Overview

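The algorithm counts the single-character edits (insertions, deletions, substitutions) and adjacent transpositions required to change one word into another. The implementation uses dynamic programming: it fills a (len1+1) x (len2+1) matrix in which entry [i][j] holds the distance between the first i characters of string1 and the first j characters of string2, so it runs in O(len1 * len2) time and space. The pseudocode below never edits a transposed pair again, which is the restricted variant also known as optimal string alignment distance.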

Pseudocode

function DamerauLevenshteinDistance(string1, string2):
    len1 = length of string1
    len2 = length of string2

    if len1 == 0:
        return len2
    if len2 == 0:
        return len1

    // Initialize the distance matrix
    matrix = 2D array of size (len1+1) x (len2+1)

    // Initialize the first row and column
    for i from 0 to len1:
        matrix[i][0] = i
    for j from 0 to len2:
        matrix[0][j] = j

    // Calculate distances
    for i from 1 to len1:
        for j from 1 to len2:
            if string1[i-1] == string2[j-1]:
                cost = 0
            else:
                cost = 1

            matrix[i][j] = minimum(
                matrix[i-1][j] + 1,    // deletion
                matrix[i][j-1] + 1,    // insertion
                matrix[i-1][j-1] + cost  // substitution
            )

            if i > 1 and j > 1 and string1[i-1] == string2[j-2] and string1[i-2] == string2[j-1]:
                matrix[i][j] = minimum(
                    matrix[i][j],
                    matrix[i-2][j-2] + cost  // transposition
                )

    return matrix[len1][len2]
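
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration that assumes plain Python strings as input; the function name `damerau_levenshtein_distance` is hypothetical and is not taken from the code in this PR.

```python
def damerau_levenshtein_distance(string1: str, string2: str) -> int:
    """Restricted Damerau-Levenshtein distance, following the pseudocode.

    Hypothetical helper for illustration; not the implementation from the PR.
    """
    len1, len2 = len(string1), len(string2)
    if len1 == 0:
        return len2
    if len2 == 0:
        return len1

    # matrix[i][j] is the distance between string1[:i] and string2[:j].
    matrix = [[0] * (len2 + 1) for _ in range(len1 + 1)]
    for i in range(len1 + 1):
        matrix[i][0] = i  # delete all i characters
    for j in range(len2 + 1):
        matrix[0][j] = j  # insert all j characters

    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            cost = 0 if string1[i - 1] == string2[j - 1] else 1
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,         # deletion
                matrix[i][j - 1] + 1,         # insertion
                matrix[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent characters swapped: string1 ends in "xy", string2 in "yx".
            if (
                i > 1
                and j > 1
                and string1[i - 1] == string2[j - 2]
                and string1[i - 2] == string2[j - 1]
            ):
                matrix[i][j] = min(
                    matrix[i][j],
                    matrix[i - 2][j - 2] + cost,  # transposition
                )

    return matrix[len1][len2]


if __name__ == "__main__":
    print(damerau_levenshtein_distance("ab", "ba"))           # 1
    print(damerau_levenshtein_distance("kitten", "sitting"))  # 3
    print(damerau_levenshtein_distance("ca", "abc"))          # 3
```

The last call shows the effect of the restriction: this recurrence gives 3 for "ca" and "abc", whereas the unrestricted Damerau-Levenshtein distance for that pair is 2.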

Applications of Damerau-Levenshtein Distance

The Damerau-Levenshtein distance has various practical applications in different fields, including:

  1. Spell Checking and Correction:
    • Identifying words that are close to a misspelled word and suggesting corrections.
  2. DNA Sequencing:
    • Comparing DNA sequences to identify mutations or similarities.
  3. Natural Language Processing (NLP):
    • Measuring the similarity between sentences or phrases for tasks like text summarization, machine translation, and sentiment analysis.
  4. Information Retrieval:
    • Enhancing search engines by finding documents or queries that are similar but not identical.
  5. Data Deduplication:
    • Identifying and removing duplicate records in databases.
  6. Plagiarism Detection:
    • Comparing documents to identify copied content.


  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Comments in areas I changed are up to date
  • I have added comments to hard-to-understand areas of my code
  • I have made corresponding changes to the README.md

@Kalkwst Kalkwst requested a review from siriak as a code owner August 1, 2024 18:59

codecov bot commented Aug 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.96%. Comparing base (9eb2196) to head (406e5ad).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #460      +/-   ##
==========================================
+ Coverage   94.95%   94.96%   +0.01%     
==========================================
  Files         236      237       +1     
  Lines       10022    10058      +36     
  Branches     1416     1422       +6     
==========================================
+ Hits         9516     9552      +36     
  Misses        389      389              
  Partials      117      117              



@siriak siriak left a comment


Looks good, thanks!

@siriak siriak merged commit 351b95b into TheAlgorithms:master Aug 1, 2024
4 checks passed
@Kalkwst Kalkwst deleted the feature/Damerau-Levenshtein-Distance branch August 1, 2024 20:21