% Generated by roxygen2 (4.0.1): do not edit by hand
\docType{package}
\name{stringdist-package}
\alias{stringdist-package}
\title{A package for string distance calculation}
\description{
A package for string distance calculation
}
\section{Supported distances}{


The \bold{Hamming distance} (\code{hamming}) counts the number of
character substitutions that turn \code{b} into \code{a}. If \code{a}
and \code{b} have a different number of characters or if \code{maxDist} is
exceeded, \code{Inf} is returned.
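
As an illustration of this definition only (not the package's compiled
implementation), the Hamming distance can be sketched in base \code{R}. The
helper name \code{hamming_sketch} is hypothetical, and \code{maxDist} is
ignored for brevity.

\preformatted{
# Hypothetical helper, for illustration only:
# count the positions at which a and b differ.
hamming_sketch <- function(a, b) {
  # strings of unequal length are infinitely far apart
  if (nchar(a) != nchar(b)) return(Inf)
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  sum(x != y)
}
hamming_sketch("karolin", "kathrin")  # 3 substitutions
hamming_sketch("leia", "leela")       # Inf, lengths differ
}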

The \bold{Levenshtein distance} (\code{lv}) counts the number of
deletions, insertions and substitutions necessary to turn \code{b} into
\code{a}. This method is equivalent to \code{R}'s native \code{\link[utils]{adist}}
function. If \code{maxDist} is exceeded, \code{Inf} is returned.
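
Because of this equivalence, \code{adist} from the \code{utils} package that
ships with \code{R} is a convenient cross-check on small inputs. A minimal
sketch, using default settings only:

\preformatted{
# adist() returns a matrix of pairwise distances; here it is 1 x 1.
d <- utils::adist("kitten", "sitting")
# k -> s, e -> i, and one inserted g
d[1, 1]  # 3
}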

The \bold{Optimal String Alignment distance} (\code{osa}) is like the Levenshtein
distance but also allows transposition of adjacent characters. Here, each
substring may be edited only once (for example, a character cannot be transposed twice
to move it forward in the string). If \code{maxDist} is exceeded, \code{Inf} is returned.

The \bold{full Damerau-Levenshtein distance} (\code{dl}) also allows transposition
of adjacent characters but, unlike \code{osa}, allows multiple
edits on a substring. If \code{maxDist} is exceeded, \code{Inf} is returned.
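
A small worked example may help to separate the two transposition-aware
distances. Take the strings \code{ca} and \code{abc} with unit weights. With
multiple edits on a substring allowed, \code{ca} becomes \code{ac} by one
transposition and then \code{abc} by inserting \code{b} between the transposed
characters, for two edits. When each substring may be edited only once, that
insertion is not allowed, and the shortest route uses three edits (for
instance, delete \code{c}, then insert \code{b} and \code{c}):

\deqn{d_{osa}(ca, abc) = 3, \qquad d_{dl}(ca, abc) = 2.}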

The \bold{longest common substring} (more commonly called the longest common
subsequence) is defined as the longest string that can be
obtained by pairing characters from \code{a} and \code{b} while keeping the order
of characters intact. The \bold{lcs-distance} is defined as the number of unpaired characters.
The distance is equivalent to the edit distance allowing only deletions and insertions,
each with weight one. If \code{maxDist} is exceeded, \code{Inf} is returned.
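
As a worked example of the definition, take \code{hello} and \code{help}. The
longest string obtainable by pairing characters in order is \code{hel}, of
length 3, which leaves two unpaired characters in \code{hello} and one in
\code{help}. Writing \eqn{s} for that longest paired string (a symbol
introduced here for illustration only), the number of unpaired characters is

\deqn{|a| + |b| - 2|s| = 5 + 4 - 2 \cdot 3 = 3.}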

A \bold{\eqn{q}-gram} is a subsequence of \eqn{q} \emph{consecutive}
characters of a string. If \eqn{x} (\eqn{y}) is the vector of counts
of \eqn{q}-gram occurrences in \code{a} (\code{b}), the \bold{\eqn{q}-gram distance}
is given by the sum over the absolute differences \eqn{|x_i-y_i|}.
The computation is aborted when \code{q} is larger than the length of
any of the strings. In that case \code{Inf} is returned.
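
The definition can be sketched in base \code{R} as follows. This is an
illustration of the formula, not the package's implementation; the helper
names \code{qgrams} and \code{qgram_sketch} are hypothetical, and \code{q} is
assumed to be a positive integer.

\preformatted{
# Hypothetical helpers, for illustration only.
# All substrings of q consecutive characters of s.
qgrams <- function(s, q) {
  starts <- seq_len(nchar(s) - q + 1)
  substring(s, starts, starts + q - 1)
}
# Sum of absolute differences between the q-gram counts of a and b.
qgram_sketch <- function(a, b, q = 1) {
  # q longer than either string: no q-grams, return Inf
  if (q > min(nchar(a), nchar(b))) return(Inf)
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  lev <- union(ga, gb)                  # all q-grams seen in a or b
  x <- table(factor(ga, levels = lev))  # counts in a
  y <- table(factor(gb, levels = lev))  # counts in b
  sum(abs(x - y))
}
# 2-grams: le, ei, ia versus le, ee, el, la
qgram_sketch("leia", "leela", q = 2)  # 5
}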

The \bold{cosine distance} is computed as \eqn{1-x\cdot y/(\|x\|\|y\|)}, where \eqn{x} and
\eqn{y} are the \eqn{q}-gram count vectors defined above.

Let \eqn{X} be the set of unique \eqn{q}-grams in \code{a} and \eqn{Y} the set of unique
\eqn{q}-grams in \code{b}. The \bold{Jaccard distance} is given by \eqn{1-|X\cap Y|/|X\cup Y|}.
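
As a worked example of the cosine and Jaccard formulas, take \code{a = leia},
\code{b = leela} and \eqn{q=2}. The 2-grams of \code{a} are \code{le},
\code{ei}, \code{ia}; those of \code{b} are \code{le}, \code{ee}, \code{el},
\code{la}. Ordering the six distinct 2-grams as \code{le}, \code{ei},
\code{ia}, \code{ee}, \code{el}, \code{la} (an arbitrary choice) gives
\eqn{x = (1,1,1,0,0,0)} and \eqn{y = (1,0,0,1,1,1)}, so that

\deqn{1 - x\cdot y/(\|x\|\|y\|) = 1 - 1/(\sqrt{3}\cdot 2) \approx 0.711,}

\deqn{1 - |X\cap Y|/|X\cup Y| = 1 - 1/6 \approx 0.833.}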

The \bold{Jaro distance} (\code{method='jw'}, \code{p=0}) is a number
between 0 (exact match) and 1 (completely dissimilar) measuring
dissimilarity between strings. It is defined to be 0 when both strings have
length 0, and 1 when there are no character matches between \code{a} and
\code{b}. Otherwise, the Jaro distance is defined as
\eqn{1-(1/3)(w_1m/|a| + w_2m/|b| + w_3(m-t)/m)}.
Here, \eqn{|a|} indicates the number of characters in \code{a}, \eqn{m} is
the number of character matches and \eqn{t} the number of transpositions of
matching characters. The \eqn{w_i} are weights associated with the characters
in \code{a}, characters in \code{b} and with transpositions. A character
\eqn{c} of \code{a} \emph{matches} a character from \code{b} when \eqn{c}
occurs in \code{b}, and the index of \eqn{c} in \code{a} differs by less than
\eqn{\max(|a|,|b|)/2 -1} (where we use integer division) from the index of
\eqn{c} in \code{b}. Two matching characters are transposed when they
occur in a different order in \code{a} and \code{b}.
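
As a worked example of the formula, take \code{a = martha} and
\code{b = marhta} with all weights \eqn{w_i = 1}. Every character finds a
match at an index that differs by 0 or 1, so \eqn{m = 6}. The matched
characters \code{t} and \code{h} occur in swapped order; counting that swapped
pair as a single transposition (an assumption made here, following the common
convention in the literature) gives \eqn{t = 1}, and

\deqn{1 - (1/3)(6/6 + 6/6 + (6-1)/6) = 1 - 17/18 = 1/18 \approx 0.056.}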

The \bold{Jaro-Winkler distance} (\code{method='jw'}, \code{0<p<=0.25}) adds a
correction term to the Jaro distance. It is defined as \eqn{d - l*p*d}, where
\eqn{d} is the Jaro distance. Here, \eqn{l} is the number of characters,
counted from the start of the input strings, that precede the first
character mismatch between the two strings, with a maximum of four. The
factor \eqn{p} is a penalty factor, which in the work of Winkler is often
chosen as \eqn{0.1}.
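
As a worked example of the correction, take \code{martha} and \code{marhta}
and assume their Jaro distance is \eqn{d = 1/18} (unit weights, with the
swapped pair \code{th}/\code{ht} counted as one transposition). The strings
agree on their first three characters, so \eqn{l = 3}, and with \eqn{p = 0.1}

\deqn{d - l*p*d = (1/18)(1 - 3 \cdot 0.1) = 7/180 \approx 0.039.}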

For the \bold{soundex} method, strings are translated to a soundex code
(see \code{\link{phonetic}} for a specification). The
distance between strings is 0 when they have the same soundex code,
otherwise 1. Note that soundex recoding is only meaningful for characters
in the ranges a-z and A-Z. A warning is emitted when non-printable or non-ASCII
characters are encountered. Also see \code{\link{printable_ascii}}.
}
\references{
\itemize{

\item{
 Mark P.J. van der Loo (2014) Approximate text matching with the stringdist package. The R Journal
 6(1) pp 111-122.
}
\item{
 An extensive overview of offline string matching algorithms is given by L. Boytsov (2011). Indexing
 methods for approximate dictionary searching: comparative analyses. ACM Journal of Experimental
 Algorithmics 16, 1-88.
}
\item{
 An extensive overview of (online) string matching algorithms is given by G. Navarro (2001).
 A guided tour to approximate string matching. ACM Computing Surveys 33, 31-88.
}
\item{
Many algorithms are available in pseudocode from Wikipedia: \url{http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance}.
}

\item{
A good reference for \eqn{q}-gram distances is E. Ukkonen (1992), Approximate string matching with q-grams and maximal matches.
Theoretical Computer Science, 92, 191-211.
}

\item{\href{http://en.wikipedia.org/wiki/Jaro\%E2\%80\%93Winkler_distance}{Wikipedia} describes the Jaro-Winkler
distance used in this package. Unfortunately, there seems to be no single
 definition for the Jaro distance in the literature. For example, Cohen, Ravikumar and Fienberg (Proceedings of IIWEB03, Vol 47, 2003)
 report a different matching window for characters in strings \code{a} and \code{b}.
}

}
}

