% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/crm_pdf.R
\name{crm_pdf}
\alias{crm_pdf}
\title{Get full text PDFs}
\usage{
crm_pdf(url, overwrite = TRUE, read = TRUE, cache = FALSE,
  overwrite_unspecified = FALSE, ...)
}
\arguments{
\item{url}{A URL (character) or an object of class \code{tdmurl} from a call
to \code{\link[=crm_links]{crm_links()}}. If you'll be getting text from publishers that use
Crossref TDM (which requires authentication), we strongly recommend
using \code{\link[=crm_links]{crm_links()}} first and passing its output here, as \code{\link[=crm_links]{crm_links()}}
grabs the publisher's Crossref member ID, which we use for authentication
and for other publisher-specific fixes to URLs.}

\item{overwrite}{(logical) Overwrite file if it exists already?
Default: \code{TRUE}}

\item{read}{(logical) If reading a pdf, this toggles whether we extract
text from the pdf or simply download. If \code{TRUE}, you get the text from
the pdf back. If \code{FALSE}, you only get back the metadata.
Default: \code{TRUE}}

\item{cache}{(logical) Use cached files or not. Files are always written to
your machine locally, regardless of this setting. It only controls whether
an already cached version is used so that you don't have to download the
file again. The steps of extracting and reading into R still have to be
performed when \code{cache=TRUE}. Default: \code{FALSE}}

\item{overwrite_unspecified}{(logical) Sometimes the Crossref API returns
mime type 'unspecified' for the full text links (for some Wiley DOIs
for example). This parameter overrides the mime type to be \code{type}.
Default: \code{FALSE}}

\item{...}{Named curl parameters passed on to \code{\link[crul:HttpClient]{crul::HttpClient()}}, see
\code{\link[curl:curl_options]{curl::curl_options()}} for available curl options}
}
\description{
Get full text PDFs
}
\details{
Note that this function is not vectorized. To do many requests
use a for/while loop or lapply family calls, or similar.

Note that some links returned will not in fact lead you to full text
content as you would understandably expect. That is, if you
use the \code{filter} parameter with e.g., \code{\link[rcrossref:cr_works]{rcrossref::cr_works()}}
and filter to only full text content, some links may actually give back
only metadata for an article. Elsevier is perhaps the worst offender:
they have a lot of entries in Crossref TDM, but most of the links that
appear to be full text are in fact only metadata.

Check out \link{auth} for details on authentication.
}
\section{Caching}{

By default we use
\code{paste0(rappdirs::user_cache_dir(), "/crminer")}, but you can
set this directory to something different. Ignored unless getting
pdf. See \link{crm_cache} for caching details.
}

\examples{
\dontrun{
# set a temp dir. cache path
crm_cache$cache_path_set(path = "crminer", type = "tempdir")

## peerj
x <- crm_pdf("https://peerj.com/articles/2356.pdf")
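
# download only, without extracting text from the pdf
# (a minimal sketch using the `read` argument documented above; what the
# returned object contains is not inspected here)
y <- crm_pdf("https://peerj.com/articles/2356.pdf", read = FALSE)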

## pensoft
data(dois_pensoft)
(links <- crm_links(dois_pensoft[1], "all"))
### pdf
crm_text(url=links, type="pdf", read = FALSE)
crm_text(links, "pdf")

## hindawi
data(dois_pensoft)
(links <- crm_links(dois_pensoft[1], "all"))
### pdf
crm_text(links, "pdf", read=FALSE)
crm_text(links, "pdf")

### Caching, for PDFs
# out <- cr_members(2258, filter=c(has_full_text = TRUE), works = TRUE)
# (links <- crm_links(out$data$DOI[10], "all"))
# crm_text(links, type = "pdf", cache=FALSE)
# system.time( cacheyes <- crm_text(links, type = "pdf", cache=TRUE) )
### second time should be faster
# system.time( cacheyes <- crm_text(links, type = "pdf", cache=TRUE) )
# system.time( cacheno <- crm_text(links, type = "pdf", cache=FALSE) )
# identical(cacheyes, cacheno)
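
### Many URLs
# crm_pdf() is not vectorized, so one way to fetch several PDFs is lapply().
# A minimal sketch: `urls` is an illustrative vector (extend it with your own
# links), and the tryCatch() wrapper is an assumption about how you may want
# to handle failures, not something crm_pdf() does for you.
urls <- c("https://peerj.com/articles/2356.pdf")
res <- lapply(urls, function(u) {
  tryCatch(crm_pdf(u), error = function(e) e)
})
# keep only the requests that did not error
ok <- Filter(function(z) !inherits(z, "error"), res)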
}
}
