Concepts for disk.frame

A few concepts and terms are useful to understand in order to use disk.frame effectively.

What is a disk.frame and what are chunks?

A disk.frame is nothing more than a folder, and in that folder there should be fst files named “1.fst”, “2.fst”, “3.fst” etc. Each “.fst” file is called a chunk.
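The folder layout described above can be inspected directly. The following is a minimal sketch, assuming the disk.frame package is installed; the output folder name `mtcars.df` and the choice of 4 chunks are arbitrary values picked for illustration.

```r
library(disk.frame)

# Write the built-in mtcars data to a temporary disk.frame folder with 4 chunks
# (the folder name and chunk count are illustrative assumptions)
out_dir <- file.path(tempdir(), "mtcars.df")
df <- as.disk.frame(mtcars, outdir = out_dir, nchunks = 4, overwrite = TRUE)

# The chunks are ordinary fst files stored inside the folder
chunk_files <- list.files(out_dir, pattern = "\\.fst$")
print(chunk_files)

# nchunks() reports how many chunks the disk.frame has
n <- nchunks(df)
print(n)
```

Listing the folder with `list.files()` is only a way to look at the layout; in normal use the chunks are read and written through disk.frame functions rather than by hand.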

Workers and parallelism

Parallelism in disk.frame is achieved using the future package. When performing many tasks, disk.frame uses multiple workers, where each worker is an R session, to perform the tasks in parallel. For example, suppose we wish to compute the number of rows in each chunk. Because the chunks are independent of one another, the counts can be computed in parallel. The code to do that is

# reading only one column is fastest
df[,.N, keep = "first_col"]

Say df has n chunks and there are m workers. Then the first m chunks will run chunk[,.N] simultaneously.
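The same idea can be sketched outside disk.frame with the future and future.apply packages. This is an illustration of per-chunk work being spread over workers, not disk.frame's internal implementation: the in-memory `chunks` list stands in for the fst files, and n = 4 and m = 2 are arbitrary choices.

```r
library(future)
library(future.apply)

# Two background R sessions act as the workers (m = 2, chosen for illustration)
plan(multisession, workers = 2)

# Stand-in for a disk.frame: mtcars split into n = 4 in-memory chunks
chunk_id <- rep(1:4, length.out = nrow(mtcars))
chunks <- split(mtcars, chunk_id)

# Count the rows of each chunk; the chunks are distributed across the workers
rows_per_chunk <- future_lapply(chunks, nrow)

# Combine the per-chunk counts into a total
total_rows <- sum(unlist(rows_per_chunk))

# Return to sequential processing, which shuts the workers down
plan(sequential)
```

Each worker returns only a small count, so combining the results in the main session is cheap compared with reading the chunks.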

To see how many workers are available, use

# see how many workers are available for work
future::nbrOfWorkers()
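The number of workers is usually set before any work is done. A minimal sketch, assuming the disk.frame package is installed and using its `setup_disk.frame()` helper; the value 2 is an arbitrary choice for illustration.

```r
library(disk.frame)

# Start 2 background R sessions as workers (2 is an illustrative value)
setup_disk.frame(workers = 2)

# Confirm how many workers the current future plan provides
n_workers <- future::nbrOfWorkers()
print(n_workers)

# Return to sequential processing when finished
future::plan(future::sequential)
```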