Data Frames and their Management¶
This chapter describes how a typical data set used in multivariate analysis is composed - i.e. as a rectangular arrangement of variables and observations. The chapter further describes ways to manipulate data within data frames and how data frames can be restricted, combined, and reshaped. The chapter also discusses how data in various formats can be imported. Such data formats include CSV, TAB-delimited, and fixed-column files.
Below is the supporting material for the various sections of the chapter.
The Structure of Data Frames¶
-
Script file:
structure-of-data-frames.R## Data frame construction ########################################################## # First create a few vectors from which we construct the data frame: population <- c(55619400,1885400,5424800,3125000) area.sq.m <- c(50301,5460,30090,8023) GVA.cap <- c(28096,20000,24800,19900) # then we use 'data.frame' to construct the data frame: UK <- data.frame(population,area.sq.m,GVA.cap) UK names(UK) names(UK) <- c("Population","Area","GVA") UK row.names(UK) row.names(UK) <- c("England", "Northern Ireland", "Scotland", "Wales") UK # It is also possible to set the names and row names in the data frame explicitly, when this # appears more convenient: UK <- data.frame( Population = c(55619400,1885400,5424800,3125000), Area = c(50301,5460,30090,8023), GVA = c(28096,20000,24800,19900), row.names = c("England", "Northern Ireland", "Scotland", "Wales")) UK nrow(UK) ncol(UK) dim(UK) # In what follows we treat the data frame 'UK' as a list: # Here we get the variable 'Population': UK$Population # Analoguously, one can use the double bracket-operator ('[[]]') # to get the variable 'Population': UK[["Population"]] # Also the single bracket-operator works as with lists. # We get a data frame of the first two variables in # the data frame UK[1:2] # Now we get a data frame with the variables named 'Population' and # 'GVA' UK[c("Population","GVA")] # In the next few lines show the selection of rows and columns of a data frame # We select the first two rows of the # data frame 'UK' by just using their numbers: UK[1:2,] # By referring to row names, we select Scotland and Wales: UK[c("Scotland","Wales"),] # As in a previous example, we select the first two columns ... UK[,1:2] # and the variables named 'Population' and 'GVA' UK[,c("Population","GVA")]
Accessing and Changing Variables in Data Frames¶
-
Script file:
accessing-and-changing-variables.Rload("bes2010feelings-prepost.RData") c( Brown = mean(bes2010flngs_pre$flng.brown,na.rm=TRUE), Cameron = mean(bes2010flngs_pre$flng.cameron,na.rm=TRUE), Clegg = mean(bes2010flngs_pre$flng.clegg,na.rm=TRUE), Salmond = mean(bes2010flngs_pre$flng.salmond,na.rm=TRUE), Jones = mean(bes2010flngs_pre$flng.jones,na.rm=TRUE) ) ## Use of 'attach' Mean <- function(x,...) mean(x,na.rm=TRUE,...) attach(bes2010flngs_pre) c( Brown = Mean(flng.brown), Cameron = Mean(flng.cameron), Clegg = Mean(flng.clegg), Salmond = Mean(flng.salmond), Jones = Mean(flng.jones) ) attach(bes2010flngs_post) c( Brown = Mean(flng.brown), Cameron = Mean(flng.cameron), Clegg = Mean(flng.clegg), Salmond = Mean(flng.salmond), Jones = Mean(flng.jones) ) detach(bes2010flngs_post) c( Brown = Mean(flng.brown), Cameron = Mean(flng.cameron), Clegg = Mean(flng.clegg), Salmond = Mean(flng.salmond), Jones = Mean(flng.jones) ) detach(bes2010flngs_pre) # 'with()' is a better alternative, because it is clear where the data in the varialbes come from: with(bes2010flngs_pre,c( Brown = Mean(flng.brown), Cameron = Mean(flng.cameron), Clegg = Mean(flng.clegg), Salmond = Mean(flng.salmond), Jones = Mean(flng.jones) )) with(bes2010flngs_post,c( Brown = Mean(flng.brown), Cameron = Mean(flng.cameron), Clegg = Mean(flng.clegg), Salmond = Mean(flng.salmond), Jones = Mean(flng.jones) )) ## Changing variables within a data frame ################################################# bes2010flngs_pre <- within(bes2010flngs_pre,{ ave_flng <- (flng.brown + flng.cameron + flng.clegg)/3 rel_flng.brown <- flng.brown - ave_flng rel_flng.cameron <- flng.cameron - ave_flng rel_flng.clegg <- flng.clegg - ave_flng }) # It is also possible without 'within()' but this is terribly tedious: bes2010flngs_pre$ave_flng <- (bes2010flngs_pre$flng.brown + bes2010flngs_pre$flng.cameron + bes2010flngs_pre$flng.clegg)/3 bes2010flngs_pre$rel_flng.brown <- (bes2010flngs_pre$flng.brown - bes2010flngs_pre$ave_flng) bes2010flngs_pre$rel_flng.cameron <- (bes2010flngs_pre$flng.cameron - bes2010flngs_pre$ave_flng) bes2010flngs_pre$rel_flng.clegg <- (bes2010flngs_pre$flng.clegg - bes2010flngs_pre$ave_flng)
Required data file:
bes2010feelings-prepost.RData(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/by removing identifying information and scrambling the data)
Manipulating Data Frames¶
Subsetting¶
-
Script file:
subsetting.R# First we load an R data file that contains data from the 2010 # British election study. load("bes2010feelings-prepost.RData") # We then create a subset with only observations from Scotland # and with parties and party leaders that run in Scotland: bes2010flngs_pre_scotland <- subset(bes2010flngs_pre, region=="Scotland", select=c( flng.brown, flng.cameron, flng.clegg, flng.salmond, flng.labour, flng.cons, flng.libdem, flng.snp, flng.green)) # We can now compare the avarage feeling about Gordon Brown # in the whole sample and in the subsample from Scotland: # First the whole UK: with(bes2010flngs_pre,mean(flng.brown,na.rm=TRUE)) # then the Scotland subsample: with(bes2010flngs_pre_scotland,mean(flng.brown,na.rm=TRUE)) # It is also possible to create a subset of cases and variables with the # bracket operator, but this is pretty tedious: bes2010flngs_pre_scotland <- bes2010flngs_pre[ bes2010flngs_pre$region=="Scotland",c( "flng.labour", "flng.cons", "flng.libdem", "flng.snp", "flng.green", "flng.brown", "flng.cameron", "flng.clegg", "flng.salmond")]
Required data file:
bes2010feelings-prepost.RData(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/by removing identifying information and scrambling the data)
Merging¶
-
Merging with data from the British Election Study
-
Script file:
merging-BES.R# Here we merge data from the British Election Study load("bes2010feelings-prepost-for-merge.RData") # A peek into a data frame about respondents' feelings about # parties: head(bes2010flngs_parties_pre) # And anotehr peek into a data frame about respondents' feelings about # party leaders: head(bes2010flngs_leaders_pre) # The variable that identifies individual respondents in both # data frames is 'id', so we use this variable to match the # rows in both data frames: bes2010flngs_pre_merged <- merge( bes2010flngs_parties_pre, bes2010flngs_leaders_pre, by="id" ) # 'merge' also allows for identifier variables that may have # different names in the two data frame. In such cases # one can use the named arguments 'by.x=' and 'by.y=': bes2010flngs_pre_merged <- merge( bes2010flngs_parties_pre, bes2010flngs_leaders_pre, by.x="id", by.y="id" ) # It is not absolutely necessary to provide a 'by=' argument, if # the merged data frames share a variable (with the same name in both) # that idenfies cases or observations. Therefore, we can call 'merge' here # without any 'by=', 'by.x=', or 'by.y=' arguments: bes2010flngs_pre_merged <- merge( bes2010flngs_parties_pre, bes2010flngs_leaders_pre ) head(bes2010flngs_pre_merged) # The data frame 'constwin' contains data about relectoral districts, that is # which party won the respective district seat in 2005 and 2010. The # variable that identifies the electoral district is both in the individual-level # data frame and the district-level data frame is named 'refno', so we use this # as a matching variable. bes2010pre_merged <- merge( bes2010flngs_pre_merged, constwin, by = "refno" # Not necessary in the present case, because ) # it is the same in both data frames. # As can be glimpsed from the output of 'str', the result of 'merge' is # sorted by the matching variable, i.e. "refno" head(bes2010pre_merged)
Required data file:
bes2010feelings-prepost-for-merge.RData(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/by removing identifying information and scrambling the data)
-
Merging with artificial data
-
Script file:
merging-artificial.Rdf1 <- data.frame( x = c(1,3,2,4,6,5), y = c(1,1,2,2,2,4) ) df2 <- data.frame( a = c(51,42,22), b = c(1,2,3) ) df1 df2 # In this first attempt at merging, the data frames do not share any variables, # hence there is no way of determining which of the rows of the two data frames # "belong together". In such a case each row of the first data frame is matched # with each of the second data frame. Hence the number of rows of the result # equals the products of the numbers of rows of the two data frames. df12 <- merge(df1,df2) df12 nrow(df1) nrow(df2) nrow(df12) # By explicitly specifying the variables used for matching, the result # is different: It contains only rows for which matches can be found in # both data frames merge(df1,df2,by.x="y",by.y="b") # With the optional argument 'all.x=TRUE' the result has a row for each # row from the first data frame, whether or not a match is find for it: # Missing information (from non-existing rows of the second data frame) # is filled up with NA merge(df1,df2,by.x="y",by.y="b", all.x=TRUE) # With 'all.y=TRUE' the result contains all rows from the second data # frame: merge(df1,df2,by.x="y",by.y="b", all.y=TRUE) # The argument setting 'all=TRUE' is equivalent with # 'all.x=TRUE' and 'all.y=TRUE' merge(df1,df2,by.x="y",by.y="b", all=TRUE)
Appending¶
-
Script file:
appending.R# First we load some example data from the British Election Study 2010 load("bes2010feelings-for-append.RData") # We now have two BES data frames, one from the pre-election wave and another # from the post-election wave. They contain the same variables, but in a different # order: str(bes2010flngs_pre) str(bes2010flngs_post) # If the variables in the two data frames differ trying to use 'rbind' to append # the data frames fails. bes2010flngs_prepost <- rbind(bes2010flngs_pre[-1], bes2010flngs_post[-1]) # If the variables in the two data frame are the same but differ in their order, # 'rbind' succeeds and the variables are sorted all into the same order before the # data frames are combined into a single one: bes2010flngs_prepost <- rbind(bes2010flngs_pre, bes2010flngs_post) # We compare the tail-ends of the data resulting data frame # 'bes2010flngs_prepost' and the data frame given as second argument to # 'rbind'. The tail-ends are identical except for the order of the # variables. tail(bes2010flngs_prepost) tail(bes2010flngs_post)
Required data file:
bes2010feelings-for-append.RData(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/by removing identifying information and scrambling the data)
Reshaping¶
-
Reshaping artificial data
-
Script file:
reshaping-artificial.R# Here we construct the data frame that contains the first two rows # of the data in wide format in the previous illustration. example.data.wide <- data.frame( id = 1:2, v = c(35,42), x1 = c(1.1,2.1), x2 = c(1.2,2.2), x3 = c(1.3,2.3), y1 = c(2.5,3.5), y2 = c(2.7,3.7), y3 = c(2.9,3.9)) example.data.wide # We now call 'reshape()' to cast the data into long format example.data.long <- reshape(data=example.data.wide, varying=list( # The first group of variables # in wide format c("x1","x2","x3"), # The second group of variables # in wide format c("y1","y2","y3") ), v.names=c("x","y"), timevar="t", times=1:3, direction="long") example.data.long # In order to change the data from long into wide format, we can # use almost the same function call, the only difference being # the 'direction=' argument. example.data.wide.a <- reshape(data=example.data.long, varying=list( # The first group of variables # in wide format c("x1","x2","x3"), # The second group of variables # in wide format c("y1","y2","y3") ), v.names=c("x","y"), timevar="t", times=1:3, direction="wide") # The second call of 'reshape' does not completely revert the first call, # because the order of the variables now is different: example.data.wide.a
-
Reshaping data from the British Election Study
-
Script file:
reshaping-BES.R# First we load an R data file that contains data from the 2010 # British election study. load("bes2010feelings-prepost.RData") names(bes2010flngs_pre) # A sensible way to bring these data into long format would be to have the # feelings towards the parties and their leaders as multiple measurements. # Therefore we reshape the data in the appropriate long format: bes2010flngs_pre_long <- reshape( within(bes2010flngs_pre, na <- NA), varying=list( # Parties c("flng.cons","flng.labour","flng.libdem", "flng.snp","flng.pcym", "flng.green","flng.ukip","flng.bnp"), # Party leaders c("flng.cameron","flng.brown","flng.clegg", "flng.salmond","flng.jones", "na","na","na") ), v.names=c("flng.parties", "flng.leaders"), times=c("Conservative","Labour","LibDem", "SNP","Plaid Cymru", "Green","UKIP","BNP"), timevar="party", direction="long") head(bes2010flngs_pre_long,n=14) library(memisc) # With the 'Reshape' function the syntax is a bit # simpler than with 'reshape' from the "stats" package: bes2010flngs_pre_long <- Reshape(bes2010flngs_pre, # Note that "empty" places designate measurement # occastions that are to be filled with NAs. # In the present case these are measurement # feelings about party leaders that were not # asked in the BES 2010 questionnaires. flng.leaders=c(flng.cameron,flng.brown, flng.clegg,flng.salmond, flng.jones,,,), flng.parties=c(flng.cons,flng.labour, flng.libdem,flng.snp, flng.pcym,flng.green, flng.ukip,flng.bnp), party=c("Conservative","Labour","LibDem", "SNP","Plaid Cymru", "Green","UKIP","BNP"), direction="long") # In long format the observations are sorted such that the variable that # distinguishes measurement occasions (the party variable) # changes faster than the variable that distinguishes individuals: head(bes2010flngs_pre_long) # Like with 'reshape', reshaping back from long into wide format takes (almost) the # same syntax as reshaping from wide into long format: bes2010flngs_pre_wide <- Reshape(bes2010flngs_pre_long, # Note that "empty" places designate measurement # occastions that are to be filled with NAs. # In the present case these are measurement # feelings about party leaders that were not # asked in the BES 2010 questionnaires. flng.leaders=c(flng.cameron,flng.brown, flng.clegg,flng.salmond, flng.jones,,,), flng.parties=c(flng.cons,flng.labour, flng.libdem,flng.snp, flng.pcym,flng.green, flng.ukip,flng.bnp), party=c("Conservative","Labour","LibDem", "SNP","Plaid Cymru", "Green","UKIP","BNP"), direction="wide") # After reshaping into wide format, the variables that correspond to multiple # measures of the same variable are grouped together: head(bes2010flngs_pre_wide) save(bes2010flngs_pre_long,file="bes2010flngs-pre-long.RData")
Required data file:
bes2010feelings-prepost.RData(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/by removing identifying information and scrambling the data)The script makes use of the memisc package, which is available from
https://cran.r-project.org/package=memisc
Sorting¶
-
Script file:
sorting.Rload("bes2010feelings-pre-long.RData") # Here we use 'order()' ii <- with(bes2010flngs_pre_long,order(id,party)) bes2010flngs_pre_long_sorted <- bes2010flngs_pre_long[ii,] head(bes2010flngs_pre_long_sorted[c("party","id", "flng.leaders","flng.parties")], n=15) # Some more convenient altarnatives: # Using a 'Sort()' function: Sort <- function(data,...){ ii <- eval(substitute(order(...)), envir=data, enclos=parent.frame()) data[ii,] } bes2010flngs_pre_long_sorted <- Sort(bes2010flngs_pre_long, id,party) # Using the 'sort()' method function from the 'memisc' package: library(memisc) bes2010flngs_pre_long_sorted <- sort(bes2010flngs_pre_long, by=~party+id) head(bes2010flngs_pre_long_sorted[c("party","id", "flng.leaders","flng.parties")], n=15)
Required data file:
bes2010feelings-pre-long.RData(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/by removing identifying information and scrambling the data)The script makes use of the memisc package, which is available from
https://cran.r-project.org/package=memisc
Aggregating Data Frames¶
-
Script file:
aggregating.R# In the following we aggregate data from the British Election Study 2010 load("bes2010feelings.RData") # Here we obtain the average affects towards the major three parties, using an # 'old-style' call of the function 'aggregate()' Mean <- function(x,...)mean(x,...,na.rm=TRUE) aggregate(bes2010feelings[c("flng.brown","flng.cameron", "flng.clegg","flng.salmond")], with(bes2010feelings, list(Region=region,Wave=wave)), Mean) # More recent versions of R also provide a slightly more convenient way of # calling 'aggregate()' using a formula argument: aggregate(cbind(flng.brown, flng.cameron, flng.clegg, flng.salmond )~region+wave, data=bes2010feelings, Mean) # The 'memisc' package has a somewhat more flexible variant of 'aggregate()' # It is possbile to reproduce the results of 'aggregate' library(memisc) Aggregate(c(Brown=Mean(flng.brown), Cameron=Mean(flng.cameron), Clegg=Mean(flng.clegg), Salmond=Mean(flng.salmond))~region+wave, data=bes2010feelings) # However it also allows to used different summary functions. Var <- function(x,...) var(x,...,na.rm=TRUE) Aggregate(c(Mean(flng.brown),Var(flng.brown))~region+wave, data=bes2010feelings)
Required data file:
bes2010feelings.RData(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/by removing identifying information and scrambling the data)The script makes use of the memisc package, which is available from
https://cran.r-project.org/package=memisc
Groupwise computations within Data Frames¶
-
Script file:
groupwise-computations.Rload("bes2010feelings-pre-long.RData") ## Groupwise computations using 'split()' ############################################## bes2010flngs_pre_long.splt <- split(bes2010flngs_pre_long, bes2010flngs_pre_long$id) str(bes2010flngs_pre_long.splt[[1]]) Mean <- function(x,...) mean(x,...,na.rm=TRUE) bes2010flngs_pre_long.splt <- lapply( bes2010flngs_pre_long.splt, within,expr={ rel.flng.parties <- flng.parties - Mean(flng.parties) rel.flng.leaders <- flng.leaders - Mean(flng.leaders) }) str(bes2010flngs_pre_long.splt[[1]]) bes2010flngs_pre_long <- unsplit(bes2010flngs_pre_long.splt, bes2010flngs_pre_long$id) str(bes2010flngs_pre_long) ## Groupwise computations using 'withinGroups()' ####################################### library(memisc) Mean <- function(x,...) mean(x,...,na.rm=TRUE) bes2010flngs_pre_long <- withinGroups(bes2010flngs_pre_long, ~id,{ rel.flng.parties <- flng.parties - Mean(flng.parties) rel.flng.leaders <- flng.leaders - Mean(flng.leaders) }) # We use 'head' to look at the first 14 elements of the re-combined data frame head(bes2010flngs_pre_long[-(1:2)],n=14)
Required data file:
bes2010feelings-pre-long.RData(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/by removing identifying information and scrambling the data)The script makes use of the memisc package, which is available from
https://cran.r-project.org/package=memisc
Importing Data into Data Frames¶
-
Script file:
importing-data.R## Importing data from text files ##################################################### # Importing CSV data: # We inspect the text file using 'readLines()' readLines("ConstituencyResults2010.csv",n=5) # For the actual import we use 'read.csv()' ConstRes2010 <- read.csv("ConstituencyResults2010.csv") ConstRes2010[1:5,] # A CSV file without a variable name header readLines("ConstituencyResults2010-nohdr.csv",n=5) ConstRes2010 <- read.csv("ConstituencyResults2010-nohdr.csv", header=FALSE) ConstRes2010[1:5,] # Importing tab-delimited data: readLines("ConstituencyResults2010.tsv",n=5) ConstRes2010 <- read.delim("ConstituencyResults2010.tsv") ConstRes2010[1:5,] ## Importing fixed-width data: readLines("ConstituencyResults2010-fwf.txt",n=5) ConstRes2010 <- read.fwf("ConstituencyResults2010-fwf.txt", widths=c(3,4,4,4,4,4,4,4,4)) ## Importing data from other statistics packages #################################### # Importing data using the 'foreign' package library(foreign) # An SPSS 'system' file ConstRes2010 <- read.spss("ConstituencyResults2010.sav", to.data.frame=TRUE) ConstRes2010[1:5,] # An SPSS 'portable' file ConstRes2010 <- read.spss("ConstituencyResults2010.por", to.data.frame=TRUE) ConstRes2010[1:5,] # A Stata file ConstRes2010 <- read.dta("ConstituencyResults2010.dta") ConstRes2010[1:5,] # The following does not work - newer Stata format is not supported ConstRes2010 <- read.dta("ConstResults2010-stata-new.dta")
Required data files:
These Data files (The original CSV data used to be available from
http://www.hks.harvard.edu/fs/pnorris/Data/Data.htm. Updated data are now available fromhttps://www.pippanorris.com/data.):The script makes use of the memisc package, which is available from
https://cran.r-project.org/package=memisc