% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/taxonSortPBDBocc.R
\name{taxonSortPBDBocc}
\alias{taxonSortPBDBocc}
\title{Sorting Unique Taxa of a Given Rank from Paleobiology Database Occurrence Data}
\usage{
taxonSortPBDBocc(
  data,
  rank,
  onlyFormal = TRUE,
  cleanUncertain = TRUE,
  cleanResoValues = c(NA, "\\"", "", "n. sp.", "n. gen.", " ", "  ")
)
}
\arguments{
\item{data}{A table of occurrence data collected from the Paleobiology Database.}

\item{rank}{The selected taxon rank; must be
one of 'species', 'genus', 'family', 'order',
'class' or 'phylum'.}

\item{onlyFormal}{If TRUE (the default) only taxa formally accepted by the Paleobiology
Database are returned. If FALSE, then the identified name fields are searched for any
additional 'informal' taxa of the sought taxonomic rank. If their taxon name happens to match
any formal taxa, their occurrences are merged onto the formal taxa. This argument generally
only has an appreciable effect when \code{rank = "species"}.}

\item{cleanUncertain}{If TRUE (the default) any
occurrences with an entry in the respective
'resolution' field that is \emph{not} found in the
argument \code{cleanResoValues} will be removed from
the dataset. These are assumed to be values
indicating taxonomic uncertainty, e.g. 'cf.' or '?'.}

\item{cleanResoValues}{The set of values that can
be found in a 'resolution' field that do not
cause a taxon to be removed, as they do not
seem to indicate taxonomic uncertainty.}
}
\value{
Returns a list where each element is a unique taxon obtained by the sorting function,
named with that taxon's name. Each element is composed of a table containing all the same
occurrence data fields as the input (potentially with some fields renamed and some field
contents changed, due to vocabulary translation).
}
\description{
Functions for sorting out unique taxa from Paleobiology Database occurrence downloads,
which should accept several different formats resulting from different versions of the
PBDB API and different vocabularies available from the API.
}
\details{
Data input for \code{taxonSortPBDBocc} are expected to be from version 1.2 of the PBDB API
with the 'pbdb' vocabulary. However, datasets are
passed to the internal function \code{translatePBDBocc},
which attempts to correct any necessary field names and field contents used by
\code{taxonSortPBDBocc}.

This function can pull either \emph{just} the 'formally' identified
and synonymized taxa in a given table of occurrence
data or pull \emph{in addition} occurrences listed under informal
taxa of the sought taxonomic rank. Only formal taxa
are sorted by default; this is controlled by argument \code{onlyFormal}.
Pulling the informally-listed taxonomic
occurrences is often necessary in some groups that have received
little focused taxonomic effort, such that many
species are linked to their generic taxon ID and never received
a species-level taxonomic ID in the PBDB.
Pulling both formal and informally listed taxonomic occurrences
is a hierarchical process and performed in
stages: formal taxa are identified first, informal taxa are
identified from the occurrences that are
'leftover', and informal occurrences with name labels
that match a previously sorted formally listed
taxon are concatenated to the 'formal' occurrences for that same taxon,
rather than being listed under separate elements
of the list as if they were separate taxa.
This function is simpler than the similar functions that inspired it,
as it uses the input \code{rank} to both filter occurrences and directly
reference a taxon's accepted taxonomic placement, rather than a
series of specific \code{if()} checks. Unlike some similar functions
in other packages, such as version 0.3 \code{paleobioDB}'s
\code{pbdb_temp_range}, \code{taxonSortPBDBocc} does not check
if sorted taxa have a single 'taxon_no' ID number. This makes the blanket
assumption that if a taxon's listed name in relevant fields is identical,
the taxon is identical, with the important caveat that occurrences with
accepted formal synonymies are sorted first based on their accepted names, followed by
taxa without formal taxon IDs. This should avoid
linking the same occurrences to multiple taxa by mistake, or assigning
occurrences listed under separate formal taxa to the same taxon
based on their 'identified' taxon name, as long as all
formal taxa have unique names (note: this is an untested assumption).
In some cases, this procedure is helpful, such as when
taxa with identical generic and species names are listed under
separate taxon ID numbers because of a difference in the
listed subgenus for some occurrences (for example,
'Pseudoclimacograptus (Metaclimacograptus) hughesi' and
'Pseudoclimacograptus hughesi' in the PBDB as of 03/01/2015).
Presumably, the amount of data affected by differences
in this procedure is very minor.

Occurrences with taxonomic uncertainty indicators in
the listed identified taxon name are removed
by default, as controlled by argument \code{cleanUncertain}.
This is done by removing any occurrences that
have an entry in \code{primary_reso} (was
"\code{genus_reso}" in v1.1 API) when \code{rank} is a
supraspecific level, and \code{species_reso} when \code{rank = "species"},
if that entry is not found in
\code{cleanResoValues}. In some rare cases, when
\code{onlyFormal = FALSE}, supraspecific taxon names may be
returned in the output that have various 'cruft' attached, like 'n.sp.'.

Empty values in the input data table ("") are converted
to NAs, as they may be due to issues
with using \code{read.csv} to convert API-downloaded data.
}
\examples{
\donttest{

# getting occurrence data for a genus, sorting it
# Dicellograptus
dicelloData <- getPBDBocc("Dicellograptus")
dicelloOcc2 <- taxonSortPBDBocc(
   data = dicelloData, 
   rank = "species",
   onlyFormal = FALSE
   )
names(dicelloOcc2)

# try a PBDB API download with lots of synonymization
	#this should have only 1 species
# *old* way, using v1.1 of PBDB API:
# acoData <- read.csv(paste0(
#	"http://paleobiodb.org/data1.1/occs/list.txt?",
#	"base_name=Acosarina\%20minuta&show=ident,phylo"))
#
# *new* method - with getPBDBocc, using v1.2 of PBDB API:
acoData <- getPBDBocc("Acosarina minuta")
acoOcc <- taxonSortPBDBocc(
   data = acoData, 
   rank = "species", 
   onlyFormal = FALSE
   )
names(acoOcc)

}

#load example graptolite PBDB occ dataset
data(graptPBDB)

#get formal genera
occGenus <- taxonSortPBDBocc(
   data = graptOccPBDB,
   rank = "genus"
   )
length(occGenus)

#get formal species
occSpeciesFormal <- taxonSortPBDBocc(
   data = graptOccPBDB,
   rank = "species")
length(occSpeciesFormal)

#yes, there are fewer 'formal'
   # graptolite species in the PBDB than genera

#get formal and informal species
occSpeciesInformal <- taxonSortPBDBocc(
   data = graptOccPBDB, 
   rank = "species",
   onlyFormal = FALSE
   )
length(occSpeciesInformal)

#way more graptolite species are 'informal' in the PBDB
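
# A minimal sketch (NOT the internal code of taxonSortPBDBocc) of the
	# staged sorting described under Details, using a small made-up table
# 'toyOcc' and all of its contents are hypothetical; the field names are
	# assumed to follow the 'pbdb' vocabulary
toyOcc <- data.frame(
   occurrence_no = 1:4,
   accepted_name = c("Aus bus", "Aus", "Aus", "Aus"),
   accepted_rank = c("species", "genus", "genus", "genus"),
   identified_name = c("Aus bus", "Aus bus", "Aus cus", "Aus sp."),
   identified_rank = c("species", "species", "species", "genus"),
   stringsAsFactors = FALSE
   )
# stage 1: formal taxa, sorted by their accepted name
isFormal <- toyOcc$accepted_rank == "species"
formalList <- split(toyOcc[isFormal, ], toyOcc$accepted_name[isFormal])
# stage 2: informal taxa, identified from the 'leftover' occurrences
isInformal <- !isFormal & toyOcc$identified_rank == "species"
informalList <- split(
   toyOcc[isInformal, ], 
   toyOcc$identified_name[isInformal]
   )
# stage 3: informal occurrences whose name matches a formal taxon are
	# appended to that taxon; the rest become new list elements
sortedList <- formalList
for(taxon in names(informalList)){
   sortedList[[taxon]] <- rbind(sortedList[[taxon]], informalList[[taxon]])
   }
# number of occurrences sorted under each taxon
sapply(sortedList, nrow)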

#get formal and informal species 
	#including from occurrences with uncertain taxonomy
	#basically everything and the kitchen sink
occSpeciesEverything <- taxonSortPBDBocc(
   data = graptOccPBDB, 
   rank = "species",
   onlyFormal = FALSE, 
   cleanUncertain = FALSE)
length(occSpeciesEverything)
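
# A minimal sketch (NOT the internal code of taxonSortPBDBocc) of the
	# filtering that cleanUncertain = TRUE is described as performing
# 'toyReso' and its contents are hypothetical, and 'someCleanValues'
	# is only a subset of the default for cleanResoValues
toyReso <- data.frame(
   accepted_name = c("Aus bus", "Aus bus", "Aus cus", "Aus dus"),
   species_reso = c(NA, "cf.", "n. sp.", "?"),
   stringsAsFactors = FALSE
   )
someCleanValues <- c(NA, "", "n. sp.", "n. gen.")
# keep only occurrences whose 'resolution' entry is among the clean values
	# (is.element treats an NA entry as matching the NA in the set)
isClean <- is.element(toyReso$species_reso, someCleanValues)
toyReso[isClean, ]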



}
\references{
Peters, S. E., and M. McClennen. 2015. The Paleobiology Database
application programming interface. \emph{Paleobiology} 42(1):1-7.
}
\seealso{
Occurrence data as commonly used with \code{paleotree} functions can
be obtained with \code{\link{getPBDBocc}}. Occurrence data sorted by
this function might be used with functions \code{\link{occData2timeList}}
and \code{\link{plotOccData}}. Also, see the example graptolite dataset
at \code{\link{graptPBDB}}.
}
\author{
David W. Bapst, but partly inspired by Matthew Clapham's \code{cleanTaxon} (found at
\href{https://github.com/mclapham/PBDB-R-scripts/blob/master/taxonClean.R}{this location}
on GitHub) and R package \code{paleobioDB}'s \code{pbdb_temp_range} function (found at
\href{https://github.com/ropensci/paleobioDB/blob/master/R/pbdb_temporal_functions.R#L64-178}{this location}
on GitHub).
}
