New Book: Data Management in R
25 November 2020
My new book “Data Management in R: A Guide for Social Scientists” is now in print and is announced by SAGE to appear on 26 December 2020, right on time for Boxing Day!
Supporting material can be found here.
The book has the following contents:
Introduction
This chapter discusses basic concepts and provides an overview of the book.
Basic data structures
This chapter discusses the basic data structures (numeric vectors, character vectors, logical vectors, lists, and factors) as well as basic arithmetic and logical operators. It then covers further basic techniques of data manipulation, such as extracting elements and sequences of elements from vectors or creating samples from their elements. It also briefly discusses generic functions and their methods.
Anyone who has some experience in working with R will already be familiar with these topics. However, a book about data management would not be self-contained without discussing them.
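To give a flavour of the operations meant here, the following is a minimal base-R sketch. The objects and values are made up for illustration and are not taken from the book.

```r
# Illustrative values only; none of these objects come from the book.
x <- c(10, 20, 30, 40, 50)                    # numeric vector
party <- factor(c("left", "right", "left"))   # factor with two levels

first_three <- x[1:3]      # extract a sequence of elements by position
large <- x[x > 25]         # extract elements with a logical condition

set.seed(1)                        # make the random draw reproducible
drawn <- sample(x, size = 3)       # sample three elements without replacement

party_levels <- levels(party)      # levels are sorted alphabetically by default
```

Indexing with a logical vector, as in `x[x > 25]`, combines the logical operators and the element extraction mentioned above in a single step.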
Data frames and their management
Data frames are the kind of data structure in R that corresponds to what social scientists know as data sets - a rectangular array of data, where rows correspond to cases or observations and columns correspond to variables, i.e. properties of cases or observations. The chapter discusses the construction of data frames, accessing and modifying variables within them, as well as merging, combining and reshaping them.
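As a rough sketch of what constructing, modifying, merging, and reshaping data frames can look like in base R (the variable names and values below are hypothetical and not taken from the book):

```r
# Hypothetical example data; names and values are for illustration only.
respondents <- data.frame(id = 1:3, age = c(34, 51, 27))
votes <- data.frame(id = c(1, 2, 4), vote = c("A", "B", "A"))

# Add a new variable derived from an existing one
respondents$age_group <- ifelse(respondents$age >= 40, "40+", "under 40")

# Merge on the shared key; by default only ids present in both are kept
merged <- merge(respondents, votes, by = "id")

# Reshape from wide format (one column per wave) to long format
wide <- data.frame(id = 1:2, score.1 = c(5, 6), score.2 = c(7, 8))
long <- reshape(wide, direction = "long",
                varying = c("score.1", "score.2"), idvar = "id")
```

Here `merge()` performs an inner join, so the respondent with id 3 and the vote with id 4 are dropped. Passing `all = TRUE` would keep unmatched rows and fill the gaps with `NA`.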
Data tables and the Tidyverse
This chapter deals with two widely discussed extensions of data frames and data frame management: data tables and the Tidyverse collection of packages. Both are also critically evaluated in terms of their usefulness for dealing with social science data.
Handling data from social science surveys
This chapter discusses the typical structure of data sets coming from social science surveys, which usually contain metadata such as variable labels, value labels and missing value declarations. It shows how such data can be handled in R with the help of the memisc package.
Managing data from complex samples
For the management of data from stratified samples, cluster samples, and multi-stage samples there is a specific package named survey, which is discussed in this chapter. The chapter also shows how to use population-level information to improve inferences using post-stratification, raking, and calibration.
Dates, times, and time series
This chapter discusses special data types for dates, times and time differences, as well as computing on dates (including the automatic handling of leap years etc.). Further, the construction of univariate and multivariate, regular and irregular time series is discussed.
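A small base-R sketch of date arithmetic and a regular time series may make this more concrete. It is only an illustration and not an excerpt from the book. The two dates in the time difference are the ones mentioned in this post, while the series values are invented.

```r
# Date arithmetic takes leap years into account automatically
leap <- as.Date("2020-02-28") + 2       # 2020 is a leap year: "2020-03-01"
non_leap <- as.Date("2021-02-28") + 2   # 2021 is not: "2021-03-02"

# Time difference between the date of this post and the publication date
gap <- difftime(as.Date("2020-12-26"), as.Date("2020-11-25"), units = "days")

# A regular monthly time series starting in January 2020 (invented values)
monthly <- ts(c(3.1, 2.9, 3.4, 3.8), start = c(2020, 1), frequency = 12)
```

Irregular time series are not covered by `ts()`. They usually require an add-on package, which this sketch leaves out.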
Spatial/Geographical data
This chapter discusses various types of geographical data, i.e. points, lines, polygons and their combinations defined in geographical coordinate systems (i.e. latitudes and longitudes). It also discusses and illustrates spatial relations - such as overlapping, inclusion, etc. - and spatial operations - forming unions and intersections of geographical areas, etc.
Text as data
This chapter looks at the handling of character strings and text as data. It discusses operations specific to character strings, such as searching and replacing sub-strings and character string patterns defined by regular expressions. Further, it discusses two major packages that help manage corpora of text documents, the packages tm and quanteda.
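The string operations mentioned here can be sketched with base R alone. The example below is a minimal illustration using the book title as input. It is not taken from the book and does not use tm or quanteda.

```r
# The book title and subtitle serve as example strings
txt <- c("Data Management in R", "A Guide for Social Scientists")

# Search: which strings contain "data", ignoring case?
has_data <- grepl("data", txt, ignore.case = TRUE)

# Replace: turn every run of whitespace into an underscore
underscored <- gsub("\\s+", "_", txt)

# Extract: pull out all words matched by a regular expression
words <- regmatches(txt, gregexpr("[[:alpha:]]+", txt))
word_counts <- lengths(words)

# Sub-strings by position
prefix <- substr(txt[1], 1, 4)
```

`gregexpr()` finds every match of the pattern in each string and `regmatches()` extracts them, so `words` is a list with one character vector per input string.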
The book is supported by R script files that correspond to the code examples included in the book. These R scripts can be found here.