% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/external.bold.analyze.align.R
\name{bold.analyze.align}
\alias{bold.analyze.align}
\title{Transform and align the sequence data retrieved from BOLD}
\usage{
bold.analyze.align(
  bold_df,
  marker = NULL,
  align_method = c("ClustalOmega", "Muscle"),
  cols_for_seq_names = NULL,
  ...
)
}
\arguments{
\item{bold_df}{A data frame obtained from \code{\link[=bold.fetch]{bold.fetch()}}.}

\item{marker}{A single character value specifying the gene marker for which the output is generated. Default is \code{NULL} (all data is used).}

\item{align_method}{A character value specifying the multiple sequence alignment algorithm to use (\code{"ClustalOmega"} or \code{"Muscle"}).}

\item{cols_for_seq_names}{A character vector of one or more column headers used to name each sequence in the FASTA file. Default is \code{NULL}, in which case only the \code{processid} is used as the name.}

\item{...}{Additional arguments passed to the \code{msa::msa()} function.}
}
\value{
\itemize{
\item bold_df.mod = A modified BCDM data frame with two additional columns (\code{aligned_seq} and \code{msa.seq.name}).
}
}
\description{
Transforms and aligns the sequence data retrieved by \code{\link[=bold.fetch]{bold.fetch()}}.
}
\details{
\code{bold.analyze.align} takes the sequence data obtained using \code{\link[=bold.fetch]{bold.fetch()}} and performs a multiple sequence alignment. It calls \code{msa::msa()} with default settings; additional arguments to \code{msa::msa()} can be passed through the \code{...} argument. The alignment method is specified using the \code{align_method} argument, with the options \code{Muscle} and \code{ClustalOmega} (available via the \code{msa} package). The provided marker name must match the standard marker names (e.g. COI-5P) available on the BOLD webpage (Ratnasingham et al. 2024; p. 404). The name of each sequence in the output can be customized using the \code{cols_for_seq_names} argument. If multiple fields are specified, the sequence name follows the order of the fields given in the vector.

Performing a multiple sequence alignment on a large number of sequences might slow down (or crash) the system. Additionally, users are responsible for verifying sequence quality and integrity, as the function does not automatically check for issues such as STOP codons and indels in the data.

\emph{Note:} Users must install (using \code{BiocManager}) and load the \code{Biostrings}, \code{msa} and \code{muscle} packages before running this function.
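
A minimal setup sketch, assuming the packages are installed from Bioconductor via \code{BiocManager} (adapt it to your own environment):

\preformatted{
# One-time installation (assumes an internet connection)
install.packages("BiocManager")
BiocManager::install(c("Biostrings", "msa", "muscle"))

# Load the packages in each session before calling bold.analyze.align()
library(Biostrings)
library(msa)
library(muscle)
}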
}
\examples{
\dontrun{
# Search for ids
seq.data.ids <- bold.public.search(taxonomy = list("Oreochromis tanganicae",
                                                "Oreochromis karongae"))
# Fetch the data using the ids.
# 1. An API key must be obtained from BOLD support before using `bold.fetch()`.
# 2. Use the `bold.apikey()` function to set the API key in the global environment.

bold.apikey('apikey')

seq.data <- bold.fetch(get_by = "processid",
                       identifiers = seq.data.ids$processid)

# R packages `msa` and `Biostrings` are required for this function to run.
# For `align_method` = "Muscle", package `muscle` is required as well.

# All of these packages are installed using `BiocManager`.

# Align the data (using bin_uri as the name for each sequence)
seq.align <- bold.analyze.align(seq.data,
                                cols_for_seq_names = c("bin_uri"),
                                align_method = "ClustalOmega")

# Data frame of the aligned sequences with their corresponding names
head(seq.align[, c("aligned_seq", "msa.seq.name")])
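
# A further sketch (not run): restrict the alignment to a single marker and
# build each sequence name from more than one column, using Muscle.
# Assumptions: the fetched data contain "COI-5P" records, and `species` and
# `bin_uri` are columns of `seq.data`; adjust these names to match your data.
seq.align.muscle <- bold.analyze.align(seq.data,
                                       marker = "COI-5P",
                                       cols_for_seq_names = c("species", "bin_uri"),
                                       align_method = "Muscle")

# Sequence names follow the order of the fields given in `cols_for_seq_names`
head(seq.align.muscle$msa.seq.name)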
 }

}
\references{
Ratnasingham S, Wei C, Chan D, Agda J, Agda J, Ballesteros-Mejia L, Ait Boutou H, El Bastami ZM, Ma E, Manjunath R, Rea D, Ho C, Telfer A, McKeowan J, Rahulan M, Steinke C, Dorsheimer J, Milton M, Hebert PDN. "BOLD v4: A Centralized Bioinformatics Platform for DNA-Based Biodiversity Data." In DNA Barcoding: Methods and Protocols, pp. 403-441. Chapter 26. New York, NY: Springer US, 2024.
}
