% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/vs_cluster_subseq.R
\name{vs_cluster_subseq}
\alias{vs_cluster_subseq}
\alias{vs_cluster_length}
\title{Cluster FASTA sequences}
\usage{
vs_cluster_subseq(
  fasta_input,
  centroids = NULL,
  strand = "plus",
  sizein = TRUE,
  fasta_width = 0,
  log_file = NULL,
  threads = 1,
  vsearch_options = NULL,
  tmpdir = NULL
)
}
\arguments{
\item{fasta_input}{(Required). A FASTA file path or a FASTA object containing
reads to cluster. See \emph{Details}.}

\item{centroids}{(Optional). A character string specifying the name of the
FASTA output file for the cluster centroid sequences. If \code{NULL}
(default), no output is written to a file and the centroid sequences are
returned as a FASTA object. See \emph{Details}.}

\item{strand}{(Optional). Specifies which strand to consider when comparing
sequences. Can be either \code{"plus"} (default) or \code{"both"}.}

\item{sizein}{(Optional). If \code{TRUE} (default), abundance annotations
present in sequence headers are taken into account.}

\item{fasta_width}{(Optional). Number of characters per line in the output
FASTA file. Defaults to \code{0}, which eliminates wrapping.}

\item{log_file}{(Optional). Name of the log file to capture messages from
\code{VSEARCH}. If \code{NULL} (default), no log file is created.}

\item{threads}{(Optional). Number of computational threads to be used by
\code{VSEARCH}. Defaults to \code{1}.}

\item{vsearch_options}{(Optional). Additional arguments to pass to
\code{VSEARCH}. Defaults to \code{NULL}. See \emph{Details}.}

\item{tmpdir}{(Optional). Path to the directory where temporary files should
be written when tables are used as input or output. Defaults to
\code{NULL}, which resolves to the session-specific temporary directory
(\code{tempdir()}).}
}
\value{
A tibble or \code{NULL}.

If \code{centroids} is specified, the centroid sequences are written to the
specified file, and no tibble is returned.

If \code{centroids} is not specified, a FASTA object is returned. This is a
\code{tibble} with the columns \code{Header} and \code{Sequence}, plus the
additional column \code{members} and, if \code{sizein = TRUE}, the column
\code{size}.
}
\description{
\code{vs_cluster_subseq} clusters FASTA sequences from a given
file or object using \code{VSEARCH}'s \code{cluster_fast} method at 100\%
identity. The function automatically sorts sequences by decreasing length
before clustering.
}
\details{
After merging/dereplication, some sequences may be sub-sequences of longer
sequences. This function clusters such sequences at 100\% identity
(terminal gaps ignored) and keeps the longest sequence in each cluster as
the centroid.

\code{fasta_input} can either be a file path to a FASTA file or a FASTA
object. FASTA objects are tibbles that contain the columns \code{Header} and
\code{Sequence}, see \code{\link[microseq]{readFasta}}.

If \code{sizein = TRUE} (default), the FASTA headers must contain text
matching the regular expression \code{"size=[0-9]+"}, indicating the copy
number (size) of each input sequence. These sizes are summed for each
cluster and added to the output. This text is typically added by
de-replication, see \code{\link{vs_fastx_uniques}}.

The number of distinct sequences in each cluster is output as \code{members}.

\code{vsearch_options} allows users to pass additional command-line arguments
to \code{VSEARCH} that are not directly supported by this function. Refer to
the \code{VSEARCH} manual for more details.
}
\examples{
\dontrun{
# Define arguments
fasta_input <- file.path(path.package("Rsearch"), "extdata", "small.fasta")

# De-replicating
derep.tbl <- vs_fastx_uniques(fasta_input, output_format = "fasta")

# Clustering subsequences
cluster.tbl <- vs_cluster_subseq(fasta_input = derep.tbl)

# Cluster sequences and write centroids to a file
vs_cluster_subseq(fasta_input = derep.tbl,
                  centroids = "distinct.fa")
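
# Further usage, as a sketch only: compare both strands and use two threads.
# Both arguments are documented above; the values chosen here are merely
# illustrative, and the object name cluster_both.tbl is hypothetical.
cluster_both.tbl <- vs_cluster_subseq(fasta_input = derep.tbl,
                                      strand = "both",
                                      threads = 2)

# With the default sizein = TRUE, the returned FASTA object should contain
# the columns 'members' and 'size' described under Value
cluster_both.tbl[, c("Header", "members", "size")]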
}
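
# Illustration only, using base R and a hypothetical header: how a size
# annotation matching the regular expression "size=[0-9]+" (see Details) can
# be located and converted to an integer. This is not necessarily how the
# function itself parses headers.
hdr <- "seq1;size=12"
size_txt <- regmatches(hdr, regexpr("size=[0-9]+", hdr))
copy_number <- as.integer(sub("size=", "", size_txt, fixed = TRUE))
copy_number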

}
\references{
\url{https://github.com/torognes/vsearch}
}
