Organising Data in R
A tutorial about data analysis using R (Website Version)
How to Read this Tutorial
This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.
Below is an example of a chunk of R code:
# This is a chunk of R code. All text after a # symbol is a comment
# Set working directory using setwd() function
setwd('Enter the path to my working directory')
# Clear all variables in R's memory
rm(list=ls()) # Standard code to clear R's memory
Sometimes the output from running this R code will be displayed after the chunk of code.
Here is a chunk of code followed by the R output
2 + 4 # Use R to add two numbers
[1] 6
Objectives
The objectives of this tutorial are:
- Introduce the concept of a data frame
- Demonstrate how data frames can be manipulated
- Demonstrate how to reformat data and code for missing data
- Explain data subsetting in R
- Save imported data to a compact binary file
Introduction
This tutorial will show you how to view, subset and manipulate data frames within R. This assumes that the data have been successfully imported into R (if you are unsuccessful at importing data into R then you need to read the data importing worksheet).
The data we’ll be using have been imported from these files:
- WOLF.CSV: This file is a text file of comma separated variables.
- INSECT.TXT: This file is a text file of TAB delimited variables.
These data sets are described at http://www.ucd.ie/ecomodel/Resources/datasets_WebVersion.html
Viewing a data frame
Finding variable names
Use the ls() function to print a list of variables in R’s memory
ls() # Display the variables in R's memory
[1] "insect" "wolf"
A poor way to view data
Typing the name of a variable will display all the data contained in the variable.
insect # Display the entire insect data frame
Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F X X.1
1 10 11 0 3 3 11 NA NA
2 7 17 1 5 5 9 NA NA
3 20 21 7 12 3 15 NA NA
4 14 11 2 6 5 22 NA NA
5 14 16 3 4 3 15 NA NA
6 12 14 1 3 6 16 NA NA
7 10 17 2 5 1 13 NA NA
8 23 17 1 5 1 10 NA NA
9 17 19 3 5 3 26 NA NA
10 20 21 0 5 2 26 NA NA
11 14 7 1 2 6 24 NA NA
12 13 13 4 4 4 13 NA NA
BEWARE: Printing out the entire data set is rarely useful, because data sets are often too large to fit on a computer screen (for example, the wolf data frame has 178 rows of data, making it hard to read in one go). There are often better ways to view a data frame than to just print out the entire variable.
Good ways to view data
Here are some options for viewing data frames:
head(wolf) # Display the first 6 lines of the wolf data frame
tail(wolf, n=10) # Display the last 10 lines of the wolf data frame
summary(wolf) # Display an overview of the wolf data frame
str(wolf) # Display the structure of the wolf data frame
The summary() function is particularly useful. It displays summary statistics for each variable in a data frame. Later we will see how the summary() function has many uses, such as displaying summary results from a data analysis.
The summary output for a data frame depends upon a variable’s data type.
- For quantitative data (num and int) the summary shows the minimum, first quartile (25% quantile), the mean, the median (50% quantile or second quartile), the third quartile (75% quantile), the maximum and the number of missing values (missing values are represented as NA in R). Examples of numerical data in the wolf data frame are Cpgmg, Tpgmg and Ppgmg.
- For qualitative data (factor, logi) the summary shows the most frequent categories of a qualitative variable (up to six by default) and the number of data points in each category. Any remaining categories are lumped together as (Other). The number of missing values is also shown. Examples of qualitative data in the wolf data frame are Sex and Colour.
- For plain text data that isn’t qualitative the summary displays the type of data (Class : character).
The data type of a variable (e.g. quantitative, qualitative, character) is displayed in the output from the str() function.
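The link between data type and summary output can be tried out on a small example. The sketch below uses a made-up data frame called toy (hypothetical values, not the wolf data) with one variable of each common type.

```r
# Toy data frame with made-up values (NOT the wolf data), one column per data type
toy = data.frame(
  weight = c(2.5, 3.1, NA, 4.0),           # numeric (num) with one missing value
  count  = c(1L, 4L, 2L, 7L),              # integer (int)
  group  = factor(c("a", "b", "a", "b")),  # factor (qualitative)
  label  = c("w", "x", "y", "z"),          # character (chr, plain text)
  stringsAsFactors = FALSE                 # keep 'label' as plain text
)

str(toy)      # One line per variable, showing its data type
summary(toy)  # The summary shown for each variable depends on its data type

# The data type of every column can also be listed directly
col.types = sapply(toy, class)
print(col.types)
```

Comparing the str() and summary() output for each column shows which kind of summary goes with which data type.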
Viewing part of a data frame
Referring to a single column in a data frame using $
A single variable (column) in a data frame can be specified by giving the name of the data frame, followed by a $ followed by the name of the variable.
Here is an example that specifies just the cortisol data in the wolf data frame
wolf$Cpgmg # Display just the cortisol data
The names of the variables can be seen at the top of each column of data (for example, using the head() function)
# Variable names appear above each column of data
head(wolf) # Display first 6 rows of data.
Individual Sex Population Colour Cpgmg Tpgmg Ppgmg
1 1 M 2 W 15.86 5.32 NA
2 2 F 1 D 20.02 3.71 14.37622
3 3 F 2 W 9.95 5.30 21.65902
4 4 F 1 D 25.22 3.71 13.42507
5 5 M 2 D 21.13 5.34 NA
6 6 M 2 W 12.48 4.60 NA
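As a small illustration of the $ notation, the sketch below uses a made-up data frame called toy (hypothetical values and column names). It also shows that double square brackets with the variable name in quotes give the same result, which can be handy when the variable name is stored as text.

```r
# Toy data frame with made-up values (NOT the wolf data)
toy = data.frame(id = 1:3, cortisol = c(15.9, 20.0, 10.0))

names(toy)                        # List the variable names in the data frame

col.dollar   = toy$cortisol       # Select a column using $
col.brackets = toy[["cortisol"]]  # Select the same column using [[ ]]

# Both approaches return the same vector of values
same.result = identical(col.dollar, col.brackets)
print(same.result)
```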
Adding a variable into a data frame
We can add a variable to a data frame using the $ operator.
Here is an example where we add the variable Replicate (1-12) which codes for each replicate of an experimental treatment
insect$Replicate = c(1:12) # Add a variable called Replicate to the data frame
head(insect) # Display the first 6 rows of the updated data frame
Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F X X.1 Replicate
1 10 11 0 3 3 11 NA NA 1
2 7 17 1 5 5 9 NA NA 2
3 20 21 7 12 3 15 NA NA 3
4 14 11 2 6 5 22 NA NA 4
5 14 16 3 4 3 15 NA NA 5
6 12 14 1 3 6 16 NA NA 6
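The same idea can be written so that it does not depend on knowing the number of rows in advance. This is a minimal sketch using a made-up data frame called toy (hypothetical values and column names); seq_len(nrow(toy)) generates 1, 2, ... up to the number of rows, and a new variable can also be calculated from existing ones.

```r
# Toy data frame with made-up values (NOT the insect data)
toy = data.frame(spray.a = c(10, 7, 20), spray.b = c(11, 17, 21))

# Add a replicate number that adapts to the number of rows
toy$replicate = seq_len(nrow(toy))

# Add a variable calculated from two existing variables
toy$total = toy$spray.a + toy$spray.b

toy  # Display the updated data frame
```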
Changing a variable’s data type
Data in statistical analyses are often one of two basic data types: quantitative or qualitative data.
- R calls a continuous quantitative variable numeric (abbreviated to num)
- R calls a discrete quantitative variable integer (abbreviated to int)
- R calls a qualitative variable a factor
A qualitative variable is a set of labels (e.g. large, medium and small). Each label is called a level of the factor.
R also has other data types. Some examples are:
- character data type = plain text (abbreviated to chr)
- logical data type = a variable that is TRUE or FALSE (abbreviated to logi)
In the wolf data frame the variables Population, Individual, Sex and Colour are qualitative (the labels from each of these variables identify a data point to a population, an individual, a sex and a coat colour, respectively).
The data types that R has assigned each variable can be seen by looking at the structure of the wolf data frame
str(wolf) # Display the structure of the data frame
'data.frame': 178 obs. of 7 variables:
$ Individual: int 1 2 3 4 5 6 7 8 9 10 ...
$ Sex : chr "M" "F" "F" "F" ...
$ Population: int 2 1 2 1 2 2 1 1 1 2 ...
$ Colour : chr "W" "D" "W" "D" ...
$ Cpgmg : num 15.86 20.02 9.95 25.22 21.13 ...
$ Tpgmg : num 5.32 3.71 5.3 3.71 5.34 4.6 4.58 9.27 4.81 5.07 ...
$ Ppgmg : num NA 14.4 21.7 13.4 NA ...
You can see some issues here:
- The variables Population and Individual have not been assigned as qualitative variables (R has identified them as numerical integers, int, because the WOLF.CSV file used whole numbers as labels for these two variables).
- The variables Sex and Colour have been identified as containing text (chr type), but we want these to be recognised as qualitative nominal data types (R calls this data type a factor). The variable Sex has the levels ‘M’ and ‘F’, plus ‘U’ for wolves of undetermined sex. The variable Colour has two levels, ‘D’ and ‘W’; blank entries should be explicitly recognised as missing data.
We want to redefine the variables Population, Sex and Colour so that R recognizes them as factors (unordered factors). We will also redefine the variable Individual to be plain text (i.e. a character) to demonstrate the as.character() function.
# Convert Population variable from numeric to a factor (a qualitative variable)
wolf$Population = as.factor(wolf$Population)
# Convert Sex variable from character to a factor (a qualitative variable)
wolf$Sex = as.factor(wolf$Sex)
# Convert Colour variable from character to a factor (a qualitative variable)
wolf$Colour = as.factor(wolf$Colour)
# Convert Individual variable from numeric to plain text
wolf$Individual = as.character(wolf$Individual)
# Display an overview of the data frame
summary(wolf)
Individual Sex Population Colour Cpgmg Tpgmg
Length:178 F:72 1: 45 : 30 Min. : 4.75 Min. : 3.140
Class :character M:76 2:103 D: 37 1st Qu.:12.16 1st Qu.: 4.372
Mode :character U:30 3: 30 W:111 Median :15.61 Median : 5.070
Mean :17.74 Mean : 6.148
3rd Qu.:20.35 3rd Qu.: 6.317
Max. :73.19 Max. :61.790
Ppgmg
Min. :12.76
1st Qu.:19.50
Median :25.00
Mean :25.89
3rd Qu.:30.01
Max. :53.28
NA's :109
Notice how the summaries of the variables Population, Sex, Colour and Individual have changed now that their data types have been converted. Also note that missing values, NA’s, are explicitly taken into account when summarizing the data (e.g. the variable Ppgmg).
There is a set of related functions for coercing variables into other data types. Here are some examples
as.factor(...) # Coerces a variable to be a factor (qualitative, nominal)
as.numeric(...) # Coerces a variable to be numeric (quantitative, continuous)
as.character(...) # Coerces a variable to be a character (plain text)
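One thing to watch out for when coercing: applying as.numeric() directly to a factor returns R's internal level codes, not the numbers shown in the labels. The sketch below uses a made-up factor called codes (hypothetical values) to show the difference, and the usual workaround of converting to character first.

```r
# A made-up factor whose labels look like numbers
codes = factor(c("10", "20", "10", "30"))

# Direct coercion returns the internal level codes: 1 2 1 3
wrong = as.numeric(codes)

# Converting to character first recovers the labelled values: 10 20 10 30
right = as.numeric(as.character(codes))

print(wrong)
print(right)
```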
Removing a variable from a data frame
Sometimes we want to remove a variable from a data frame.
The insect data frame has two variables that should not be part of the data set (X and X.1). This is quite common when importing data. In this case the reason is two additional TABs at the end of each line in the text file. These TABs are hard to see, but R recognized them, created two additional variables and named them with default labels.
The columns can be removed by first finding out how many rows and columns the data frame has and then removing the two unwanted columns by their position. Here is the code
ncol(insect) # Number of columns in data frame
nrow(insect) # Number of rows in data frame
dim(insect) # Display number of rows and columns
insect = insect[ ,-c(7,8)] # Remove columns 7 and 8 (X and X.1)
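Removing columns by position relies on knowing exactly where the unwanted columns sit. An alternative is to remove them by name, so the code still works if the column order changes. This is a minimal sketch on a made-up data frame called toy (hypothetical values); the names X and X.1 are taken from the example above.

```r
# Toy data frame with made-up values and two empty columns (NOT the insect data)
toy = data.frame(spray.a = c(10, 7), spray.b = c(11, 17), X = NA, X.1 = NA)

dim(toy)  # Number of rows and columns before tidying

# Keep every column whose name is NOT in the list of unwanted names
unwanted = c("X", "X.1")
toy.tidy = toy[ , !(names(toy) %in% unwanted)]

names(toy.tidy)  # Only the spray columns remain
```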
Set missing data to NA
Always use NA to represent missing data
Data on coat colour is missing for population 3. R explicitly represents missing data as NA, but the WOLF.CSV data file uses a blank space to represent missing data.
The code below sets these blank spaces to NA
# Create a logical variable that is TRUE if an observation is from population 3
bool.index = wolf$Population==3
# Set coat colour variable to be NA for observations from population 3
wolf$Colour[bool.index] = NA
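Blank entries can also be found directly, without relying on knowing which population they belong to. The sketch below uses a made-up data frame called toy (hypothetical values) and assumes the blanks were imported as empty text; it then drops the now unused blank level from the factor.

```r
# Toy data frame with made-up values (NOT the wolf data); blanks mark missing colours
toy = data.frame(population = factor(c(1, 1, 3, 3)),
                 colour     = factor(c("D", "W", "", "")))

# Set every blank entry to NA
toy$colour[toy$colour == ""] = NA

# Remove the blank level, which no longer contains any data
toy$colour = droplevels(toy$colour)

levels(toy$colour)        # Remaining levels of the factor
n.missing = sum(is.na(toy$colour))
print(n.missing)          # Number of missing values
```

It may also be possible to deal with blanks at import time: read.csv() has an na.strings argument that lists the text values to be treated as NA.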
Subset of a data frame
Selecting observations (rows) from a data frame
To select only particular rows from a data frame using a criterion you can use the subset function.
For example, to make a subset of the data in wolf that contains only females,
wolf.F = subset(wolf, Sex=='F') # Create a subset with data on female wolves
Another way is to subset the data frame using a logical index:
# Create a logical variable which is TRUE if an observation is for a female
bool.index = wolf$Sex=='F'
# Create a subset containing only data on female wolves
wolf.F2 = wolf[bool.index, ]
Make a subset using several variables
# Create a subset containing only data on female wolves in Population 1
# method 1:
wolf.F3 = subset(wolf, Sex=='F' & Population==1)
# Create a subset containing only data on female wolves in Population 1
# method 2:
bool.index = wolf$Sex=='F' & wolf$Population==1
wolf.F4 = wolf[bool.index,]
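The two methods give the same result only when the variable used in the criterion has no missing values. The sketch below, on a made-up data frame called toy (hypothetical values), shows the difference: subset() drops rows where the criterion is NA, whereas a logical index returns an extra row of NAs for each of them. Wrapping the criterion in which() is one way to avoid this.

```r
# Toy data frame with made-up values (NOT the wolf data); one sex is missing
toy = data.frame(sex      = c("F", "M", NA, "F"),
                 cortisol = c(12.1, 15.3, 9.8, 20.4))

by.subset = subset(toy, sex == "F")       # Rows where the criterion is NA are dropped
by.index  = toy[toy$sex == "F", ]         # An NA in the index gives a row of NAs
by.which  = toy[which(toy$sex == "F"), ]  # which() keeps only the TRUE positions

nrow(by.subset)  # 2 rows
nrow(by.index)   # 3 rows (one of them is all NA)
nrow(by.which)   # 2 rows
```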
Another example using a logical OR (|)
# Create a subset containing only data on wolves in Population 1 OR Population 2
wolf.F5 = subset(wolf, Population==1 | Population==2)
summary(wolf.F5)
Individual Sex Population Colour Cpgmg Tpgmg
Length:148 F:72 1: 45 : 0 Min. : 4.75 Min. : 3.250
Class :character M:76 2:103 D: 37 1st Qu.:12.16 1st Qu.: 4.378
Mode :character U: 0 3: 0 W:111 Median :15.38 Median : 5.030
Mean :16.61 Mean : 5.617
3rd Qu.:19.98 3rd Qu.: 6.067
Max. :40.43 Max. :15.130
Ppgmg
Min. :12.76
1st Qu.:19.50
Median :25.00
Mean :25.89
3rd Qu.:30.01
Max. :53.28
NA's :79
Dropping unused levels of a factor
The subset wolf.F5 contains no data from population 3, but the factor Population still has 3 levels. To remove unused levels from a factor use the function droplevels()
Using the droplevels() function on the data frame wolf.F5 will remove the level for population 3, as well as any other levels that contain no data (e.g. wolves with an undetermined sex, level U of variable Sex)
wolf.F5 = droplevels(wolf.F5) # Update the levels of factors in wolf.F5
summary(wolf.F5) # The factor Population now has 2 levels
Individual Sex Population Colour Cpgmg Tpgmg
Length:148 F:72 1: 45 D: 37 Min. : 4.75 Min. : 3.250
Class :character M:76 2:103 W:111 1st Qu.:12.16 1st Qu.: 4.378
Mode :character Median :15.38 Median : 5.030
Mean :16.61 Mean : 5.617
3rd Qu.:19.98 3rd Qu.: 6.067
Max. :40.43 Max. :15.130
Ppgmg
Min. :12.76
1st Qu.:19.50
Median :25.00
Mean :25.89
3rd Qu.:30.01
Max. :53.28
NA's :79
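The effect of droplevels() can be checked by counting the levels of a factor before and after. This is a minimal sketch on a made-up data frame called toy (hypothetical values); it also uses %in% as a compact alternative to chaining several == comparisons with |.

```r
# Toy data frame with made-up values (NOT the wolf data)
toy = data.frame(population = factor(c(1, 2, 3, 1)),
                 value      = c(5.2, 6.1, 4.8, 7.0))

# Keep populations 1 and 2 (same as population==1 | population==2)
toy.sub = subset(toy, population %in% c("1", "2"))

n.before = nlevels(toy.sub$population)  # Unused level 3 is still present
toy.sub  = droplevels(toy.sub)          # Remove levels that contain no data
n.after  = nlevels(toy.sub$population)

print(c(before = n.before, after = n.after))
```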
Selecting variables (columns) from a data frame
The subset command can be used to extract one or more variables from a data frame. For example, to select only the cortisol (Cpgmg) and Population variables from the wolf data frame (these are the third and fifth columns in the data frame)
# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
wolf.subset1 = subset(wolf, select=c('Population','Cpgmg'))
Other ways to select variables from a data frame
# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
wolf.subset2 = wolf[,c('Population','Cpgmg')]
# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
# (columns 3 and 5 in the wolf data frame)
wolf.subset3 = wolf[,c(3,5)]
# Create a subset of the data containing the variable 'Population'
# using the variable name
wolf$Population
Variables (columns) and observations (rows) can be selected at the same time. Here is an example selecting data on population identity and cortisol for just female wolves
# Create a subset of the data containing only female wolves and the
# variables 'Population' and 'Cpgmg'
wolf.subset4 = subset(wolf, Sex=='F', select=c('Population','Cpgmg'))
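Rows and columns can also be selected together using square brackets. The sketch below, on a made-up data frame called toy (hypothetical values and column names), compares this with subset(). It also shows that selecting a single column with square brackets returns a plain vector unless drop=FALSE is given.

```r
# Toy data frame with made-up values (NOT the wolf data)
toy = data.frame(sex        = c("F", "M", "F"),
                 population = c(1, 2, 2),
                 cortisol   = c(12.1, 15.3, 20.4))

# Rows for females, columns 'population' and 'cortisol', using subset()
sub.a = subset(toy, sex == "F", select = c("population", "cortisol"))

# The same selection using square brackets: [rows, columns]
sub.b = toy[toy$sex == "F", c("population", "cortisol")]

same.result = isTRUE(all.equal(sub.a, sub.b))
print(same.result)

# Selecting ONE column: a vector by default, a data frame with drop=FALSE
one.vector = toy[ , "cortisol"]
one.frame  = toy[ , "cortisol", drop = FALSE]
```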
Saving data
Large data sets can be time consuming to import into R. Once a file has been imported it is a good idea to save the data in R’s native binary format. Data in this format is quick to import and takes up less space on the hard drive. By convention, files containing data in R’s binary format have the suffix .Rdata.
To save the variables wolf and insect to a file use the save() command
# Save wolf and insect to a file called 'sheet2_data.Rdata'
save(wolf, insect, file='sheet2_data.Rdata')
We can verify that the data have been correctly saved by clearing R’s memory and re-importing them using the load() command. Try running the following commands to see if you can reload the data saved in file sheet2_data.Rdata.
rm(list=ls()) # Clear variables from memory
ls() # Display the variables in R's memory
load(file='sheet2_data.Rdata') # Import R binary data from a file
ls() # Display the variables in R's memory
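The save and reload steps can be rehearsed without touching your own files. This is a minimal sketch using a made-up data frame called toy (hypothetical values) and a temporary file; the file name toy_data.Rdata is an assumption chosen for the illustration.

```r
# Toy data frame with made-up values (NOT the tutorial data)
toy      = data.frame(id = 1:3, value = c(2.5, 3.1, 4.0))
toy.copy = toy  # Keep a copy to compare against after reloading

# Save to a temporary file so that no existing file is overwritten
data.file = file.path(tempdir(), "toy_data.Rdata")
save(toy, file = data.file)

rm(toy)  # Remove the original from R's memory

# load() restores the saved variables and returns their names
loaded.names = load(data.file)
print(loaded.names)

# Check that the reloaded data frame matches the copy
round.trip.ok = isTRUE(all.equal(toy, toy.copy))
print(round.trip.ok)

unlink(data.file)  # Delete the temporary file
```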
Summary of the topics covered
- Displaying contents of a data frame
- Manipulating data in a data frame
- Creating subsets of data
- Saving a data frame to a file using R’s binary data file format
- Reading data from an R binary data file
Further Reading
All these books can be found in UCD’s library
- Andrew P. Beckerman and Owen L. Petchey, 2012. Getting Started with R: An introduction for biologists (Oxford University Press, Oxford) [Chapter 3]
- Michael J. Crawley, 2015. Statistics: an introduction using R (John Wiley & Sons, Chichester) [Chapter 2]
- Tenko Raykov and George A. Marcoulides, 2013. Basic statistics: an introduction with R (Rowman and Littlefield, Plymouth)