Organising Data in R

A tutorial about data analysis using R (Website Version)

AUTHOR

Jon Yearsley

AFFILIATION

School of Biology and Environmental Science, UCD

PUBLISHED

January 1, 2024

How to Read this Tutorial

This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.

Below is an example of a chunk of R code:

# This is a chunk of R code. All text after a # symbol is a comment
# Set working directory using setwd() function
setwd('Enter the path to my working directory')

# Clear all variables in R's memory
rm(list=ls())    # Standard code to clear R's memory

Sometimes the output from running this R code will be displayed after the chunk of code.

Here is a chunk of code followed by the R output

2 + 4            # Use R to add two numbers

[1] 6

Objectives

The objectives of this tutorial are:

Introduce the concept of a data frame
Demonstrate how data frames can be manipulated
Demonstrate how to reformat data and code for missing data
Explain data subsetting in R
Save imported data to a compact binary file

Introduction

This tutorial will show you how to view, subset and manipulate data frames within R. This assumes that the data have been successfully imported into R (if you are unsuccessful at importing data into R then you need to read the data importing worksheet).

The data we’ll be using have been imported from these files:

WOLF.CSV: This file is a text file of comma separated variables.
INSECT.TXT: This file is a text file of TAB delimited variables.

These data sets are described at http://www.ucd.ie/ecomodel/Resources/datasets_WebVersion.html

Viewing a data frame


Finding variable names

Use the ls() function to print a list of variables in R’s memory

ls()                    # Display the variables in R's memory

[1] "insect" "wolf"

A poor way to view data

Typing the name of a variable will display all the data contained in the variable.

insect                    # Display the entire insect data frame

   Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F  X X.1
1       10      11       0       3       3      11 NA  NA
2        7      17       1       5       5       9 NA  NA
3       20      21       7      12       3      15 NA  NA
4       14      11       2       6       5      22 NA  NA
5       14      16       3       4       3      15 NA  NA
6       12      14       1       3       6      16 NA  NA
7       10      17       2       5       1      13 NA  NA
8       23      17       1       5       1      10 NA  NA
9       17      19       3       5       3      26 NA  NA
10      20      21       0       5       2      26 NA  NA
11      14       7       1       2       6      24 NA  NA
12      13      13       4       4       4      13 NA  NA

BEWARE: Printing out the entire data set is rarely useful, because data sets are often too large to fit on a computer screen (for example, the wolf data frame has 178 rows of data, making it hard to read in one go). There are often better ways to view a data frame than to just print out the entire variable.

Good ways to view data

Here are some options for viewing data frames:

head(wolf)              # Display the first 6 lines of the wolf data frame
tail(wolf, n=10)        # Display the last 10 lines of the wolf data frame
summary(wolf)           # Display an overview of the wolf data frame
str(wolf)               # Display the structure of the wolf data frame

The summary() function is particularly useful. It displays summary statistics for each variable in a data frame. Later we will see how the summary() function has many uses, such as displaying summary results from a data analysis.

The summary output for a data frame depends upon a variable’s data type.

For quantitative data (num and int) the summary shows the minimum, the first quartile (25% quantile), the median (50% quantile or second quartile), the mean, the third quartile (75% quantile), the maximum and the number of missing values (missing values are represented as NA in R). Examples of quantitative data in the wolf data frame are Cpgmg, Tpgmg and Ppgmg.
For qualitative data (factor, logi) the summary shows the most common categories of a qualitative variable and the number of data points in each category. Any remaining categories are lumped together as (Other). The number of missing values is also shown. Examples of qualitative data in the wolf data frame are Sex and Colour.
For plain text data that isn’t qualitative the summary displays the type of data (Class : character).

The data type of a variable (e.g. quantitative, qualitative, character) is displayed in the output from the str() function.
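To see how the data types are reported, here is a minimal sketch using a small made-up data frame. The data frame toy and its variables are hypothetical and are not part of the tutorial's data sets.

```r
# A small made-up data frame (hypothetical example data)
toy = data.frame(mass  = c(2.5, 3.1, 4.8),               # numeric (num)
                 count = c(1L, 4L, 2L),                  # integer (int)
                 site  = c('north', 'south', 'north'),   # character (chr)
                 adult = c(TRUE, FALSE, TRUE),           # logical (logi)
                 stringsAsFactors = FALSE)               # Keep text as plain text

str(toy)               # Display the data type of every variable
sapply(toy, class)     # Display just the data types, one per variable
```

The sapply() line is a compact alternative to str() when only the data types are of interest.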

Viewing part of a data frame

Referring to a single column in a data frame using `$`

A single variable (column) in a data frame can be specified by giving the name of the data frame, followed by a $ followed by the name of the variable.

Here is an example that specifies just the cortisol data in the wolf data frame

wolf$Cpgmg     # Display just the cortisol data

The names of the variables can be seen at the top of each column of data (for example, using the head() function)

# Variable names appear above each column of data
head(wolf)     # Display first 6 rows of data.

  Individual Sex Population Colour Cpgmg Tpgmg    Ppgmg
1          1   M          2      W 15.86  5.32       NA
2          2   F          1      D 20.02  3.71 14.37622
3          3   F          2      W  9.95  5.30 21.65902
4          4   F          1      D 25.22  3.71 13.42507
5          5   M          2      D 21.13  5.34       NA
6          6   M          2      W 12.48  4.60       NA
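The variable names can also be listed on their own with the names() function. Below is a minimal sketch on a made-up data frame (toy is hypothetical and only borrows two variable names from the wolf data).

```r
# A small made-up data frame (hypothetical example data)
toy = data.frame(Sex = c('M', 'F', 'F'), Cpgmg = c(15.86, 20.02, 9.95))

names(toy)             # Display just the variable names
toy$Cpgmg              # Refer to a single variable using $
mean(toy$Cpgmg)        # A single variable can be passed to other functions
```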

Adding a variable into a data frame

We can add a variable to a data frame using the $ operator.
Here is an example where we add the variable Replicate (1-12) which codes for each replicate of an experimental treatment

insect$Replicate = 1:12      # Add a variable called Replicate to the data frame

head(insect)                 # Display the first 6 rows of the updated data frame

  Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F  X X.1 Replicate
1      10      11       0       3       3      11 NA  NA         1
2       7      17       1       5       5       9 NA  NA         2
3      20      21       7      12       3      15 NA  NA         3
4      14      11       2       6       5      22 NA  NA         4
5      14      16       3       4       3      15 NA  NA         5
6      12      14       1       3       6      16 NA  NA         6

Changing a variable’s data type

Data in statistical analyses are often one of two basic data types: quantitative or qualitative data.

R calls a continuous quantitative variable numeric (abbreviated to num)
R calls a discrete quantitative variable integer (abbreviated to int)
R calls a qualitative variable a factor

A qualitative variable is a set of labels (e.g. large, medium and small). Each label is called a level of the factor.
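The idea of levels can be illustrated with a minimal sketch. The vector size below is made-up example data, not part of the tutorial's data sets.

```r
# A made-up vector of size labels stored as plain text (hypothetical example data)
size = c('large', 'small', 'medium', 'small', 'large')

size.f = as.factor(size)   # Convert the plain text to a factor
levels(size.f)             # Display the levels of the factor
nlevels(size.f)            # Display the number of levels
table(size.f)              # Count the number of data points in each level
```

By default as.factor() sorts the levels alphabetically, so the order of the levels need not match the order in which the labels first appear in the data.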

R also has other data types. Some examples are:

character data type = plain text (abbreviated to chr)
logical data type = a variable that is TRUE or FALSE (abbreviated to logi)

In the wolf data frame the variables Population, Individual, Sex and Colour are qualitative (the labels from each of these variables identify a data point to a population, an individual, a sex and a coat colour, respectively).

The data types that R has assigned each variable can be seen by looking at the structure of the wolf data frame

str(wolf)                    # Display the structure of the data frame

'data.frame':   178 obs. of  7 variables:
 $ Individual: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Sex       : chr  "M" "F" "F" "F" ...
 $ Population: int  2 1 2 1 2 2 1 1 1 2 ...
 $ Colour    : chr  "W" "D" "W" "D" ...
 $ Cpgmg     : num  15.86 20.02 9.95 25.22 21.13 ...
 $ Tpgmg     : num  5.32 3.71 5.3 3.71 5.34 4.6 4.58 9.27 4.81 5.07 ...
 $ Ppgmg     : num  NA 14.4 21.7 13.4 NA ...

You can see some issues here:

The variables Population and Individual have not been assigned as qualitative variables (R has identified them as integers, int, because the wolf.csv file used whole numbers as labels for these two variables).
The variables Sex and Colour have been identified as containing text (chr type), but we want these to be recognised as qualitative nominal data types (R calls this data type a factor). The variable Sex has the levels ‘M’ and ‘F’, plus ‘U’ for wolves of undetermined sex. The variable Colour has two levels, ‘D’ and ‘W’, and its blank entries should be explicitly recognised as missing data.

We want to redefine the variables Population, Sex and Colour so that R recognizes them as factors (unordered factors). We will also redefine the variable Individual to be plain text (i.e. a character) to demonstrate the as.character() function.

# Convert Population variable from numeric to a factor (a qualitative variable)
wolf$Population = as.factor(wolf$Population)

# Convert Sex variable from character to a factor (a qualitative variable)
wolf$Sex = as.factor(wolf$Sex)

# Convert Colour variable from character to a factor (a qualitative variable)
wolf$Colour = as.factor(wolf$Colour)

# Convert Individual variable from numeric to plain text
wolf$Individual = as.character(wolf$Individual) 

# Display an overview of the data frame
summary(wolf)

  Individual        Sex    Population Colour      Cpgmg           Tpgmg       
 Length:178         F:72   1: 45       : 30   Min.   : 4.75   Min.   : 3.140  
 Class :character   M:76   2:103      D: 37   1st Qu.:12.16   1st Qu.: 4.372  
 Mode  :character   U:30   3: 30      W:111   Median :15.61   Median : 5.070  
                                              Mean   :17.74   Mean   : 6.148  
                                              3rd Qu.:20.35   3rd Qu.: 6.317  
                                              Max.   :73.19   Max.   :61.790  
                                                                              
     Ppgmg      
 Min.   :12.76  
 1st Qu.:19.50  
 Median :25.00  
 Mean   :25.89  
 3rd Qu.:30.01  
 Max.   :53.28  
 NA's   :109

Notice how the summaries of the variables Population, Sex, Colour and Individual have changed now that their data types have been changed. Also note that missing values, NA’s, are explicitly taken into account when summarizing the data (e.g. the variable Ppgmg).

There is a set of related functions for coercing variables into other data types. Here are some examples

as.factor(...)    # Coerces a variable to be a factor (qualitative, nominal)
as.numeric(...)   # Coerces a variable to be numeric (quantitative, continuous)
as.character(...) # Coerces a variable to be a character (plain text)
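One thing to watch for when coercing: applying as.numeric() directly to a factor returns the internal level codes, not the numbers written in the labels. Here is a minimal sketch, using a made-up factor called dose (hypothetical example data).

```r
# A made-up factor whose labels happen to be numbers (hypothetical example data)
dose = as.factor(c('10', '50', '10', '100'))

as.numeric(dose)                 # Gives the internal level codes, not the labels
as.numeric(as.character(dose))   # Gives the numbers written in the labels
```

Converting to character first, and then to numeric, is a common way to recover the original numbers from a factor.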

Removing a variable from a data frame

Sometimes we want to remove a variable from a data frame.

The insect data frame has two variables that should not be part of the data set (X and X.1). This is quite common when importing data. In this case the reason is two additional TABs at the end of each line in the text file. These TABs are hard to see, but R recognized them, created two additional variables and named them with default labels.

The columns can be removed by first finding out how many rows and columns the data frame has and then removing the two unwanted columns by their position (columns 7 and 8). Here is the code

ncol(insect)                # Number of columns in data frame
nrow(insect)                # Number of rows in data frame
dim(insect)                 # Display number of rows and columns

insect = insect[ ,-c(7,8)]  # Remove columns 7 and 8 (X and X.1)
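Removing columns by position depends on the columns being in the expected order. An alternative is to remove them by name. Below is a minimal sketch on a made-up data frame; toy, toy.tidy and unwanted are hypothetical names used only for illustration.

```r
# A made-up data frame with two empty columns (hypothetical example data)
toy = data.frame(Spray.A = c(10, 7), Spray.B = c(11, 17), X = NA, X.1 = NA)

unwanted = c('X', 'X.1')                         # Names of the columns to remove
toy.tidy = toy[ , !(names(toy) %in% unwanted)]   # Keep all the other columns
names(toy.tidy)                                  # Display the remaining names
```

This approach gives the same result even if other variables have been added to the data frame in the meantime.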

Set missing data to NA

Always use NA to represent missing data

Data on coat colour is missing for population 3. R explicitly represents missing data as NA, but the WOLF.CSV data file uses a blank space to represent missing data.

The code below sets these blank spaces to NA

# Create a logical variable that is TRUE if an observation is from population 3
bool.index = wolf$Population==3 

# Set coat colour variable to be NA for observations from population 3 
wolf$Colour[bool.index] = NA
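The code above relies on knowing that all the blank entries belong to population 3. Another option is to test for the blank entries directly. The sketch below uses a made-up vector called colour and assumes that a blank is stored as an empty string (''); in a real data set it is worth checking how blanks are stored, for example with unique(), before relying on this.

```r
# A made-up vector of coat colours with blanks for missing data
# (hypothetical example data; blanks assumed to be empty strings)
colour = c('W', 'D', '', 'W', '')

colour[colour == ''] = NA     # Replace every blank entry with NA
colour = as.factor(colour)    # Convert to a factor (NA is not counted as a level)

levels(colour)                # Display the levels of the factor
sum(is.na(colour))            # Count the number of missing values
```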

Subset of a data frame

Selecting observations (rows) from a data frame

To select only particular rows from a data frame using a criterion you can use the subset function.

For example, to make a subset of the data in wolf that contains only females,

wolf.F = subset(wolf, Sex=='F') # Create a subset with data on female wolves

Another way to subset the data frame is to use a logical index:

# Create a logical variable which is TRUE if an observation is for a female
bool.index = wolf$Sex=='F'  

# Create a subset containing only data on female wolves
wolf.F2 = wolf[bool.index, ]
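The two methods differ when the variable used in the criterion contains missing values. The minimal sketch below uses a made-up data frame (toy is hypothetical, not the wolf data) with one missing value of Sex.

```r
# A made-up data frame with one missing value of Sex (hypothetical example data)
toy = data.frame(Sex = c('F', 'M', NA, 'F'), Cpgmg = c(15.9, 20.0, 9.9, 25.2))

subset(toy, Sex == 'F')         # Rows with a missing Sex are dropped
toy[toy$Sex == 'F', ]           # Rows with a missing Sex appear as rows of NA
toy[which(toy$Sex == 'F'), ]    # which() keeps only rows where the test is TRUE
```

If a logical index might contain NA, wrapping it in which() is one way to make it behave like subset().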

Make a subset using several variables

# Create a subset containing only data on female wolves in Population 1
# method 1: 
wolf.F3 = subset(wolf, Sex=='F' & Population==1)

# Create a subset containing only data on female wolves in Population 1
# method 2:
bool.index = wolf$Sex=='F' & wolf$Population==1
wolf.F4 = wolf[bool.index,]

Another example using a logical OR (|)

# Create a subset containing only data on wolves in Population 1 OR Population 2
wolf.F5 = subset(wolf, Population==1 | Population==2)

summary(wolf.F5)

  Individual        Sex    Population Colour      Cpgmg           Tpgmg       
 Length:148         F:72   1: 45       :  0   Min.   : 4.75   Min.   : 3.250  
 Class :character   M:76   2:103      D: 37   1st Qu.:12.16   1st Qu.: 4.378  
 Mode  :character   U: 0   3:  0      W:111   Median :15.38   Median : 5.030  
                                              Mean   :16.61   Mean   : 5.617  
                                              3rd Qu.:19.98   3rd Qu.: 6.067  
                                              Max.   :40.43   Max.   :15.130  
                                                                              
     Ppgmg      
 Min.   :12.76  
 1st Qu.:19.50  
 Median :25.00  
 Mean   :25.89  
 3rd Qu.:30.01  
 Max.   :53.28  
 NA's   :79

Dropping unused levels of a factor

The subset wolf.F5 contains no data from population 3, but the factor Population still has 3 levels. To remove unused levels from a factor use the function droplevels()

Using the droplevels() function on the data frame wolf.F5 will remove the level for population 3, as well as any other levels that contain no data (e.g. wolves with an undetermined sex, level U of variable Sex)

wolf.F5 = droplevels(wolf.F5) # Update the levels of factors in wolf.F5
summary(wolf.F5)              # The factor Population now has 2 levels

  Individual        Sex    Population Colour      Cpgmg           Tpgmg       
 Length:148         F:72   1: 45      D: 37   Min.   : 4.75   Min.   : 3.250  
 Class :character   M:76   2:103      W:111   1st Qu.:12.16   1st Qu.: 4.378  
 Mode  :character                             Median :15.38   Median : 5.030  
                                              Mean   :16.61   Mean   : 5.617  
                                              3rd Qu.:19.98   3rd Qu.: 6.067  
                                              Max.   :40.43   Max.   :15.130  
                                                                              
     Ppgmg      
 Min.   :12.76  
 1st Qu.:19.50  
 Median :25.00  
 Mean   :25.89  
 3rd Qu.:30.01  
 Max.   :53.28  
 NA's   :79
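The droplevels() function also works on a single factor. Here is a minimal sketch using a made-up factor (pop and pop.sub are hypothetical names), with levels() used to check the result.

```r
# A made-up factor with three levels (hypothetical example data)
pop = as.factor(c(1, 2, 3, 1, 2))

pop.sub = pop[pop != 3]          # Keep only the data from populations 1 and 2
levels(pop.sub)                  # Level 3 is still listed, but contains no data

pop.sub = droplevels(pop.sub)    # Remove the levels that contain no data
levels(pop.sub)                  # Display the levels that remain
```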

Selecting variables (columns) from a data frame

The subset command can be used to extract one or more variables from a data frame. For example, to select only the Population and cortisol (Cpgmg) variables from the wolf data frame (these are the third and fifth columns in the data frame)

# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
wolf.subset1 = subset(wolf, select=c('Population','Cpgmg'))

Other ways to select variables from a data frame

# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
wolf.subset2 = wolf[,c('Population','Cpgmg')]


# Create a subset of the data containing the variables 'Population' and 'Cpgmg' 
# (columns 3 and 5 in the wolf data frame)
wolf.subset3 = wolf[,c(3,5)]


# Extract the single variable 'Population' (as a vector, not a data frame)
# using the variable name
wolf$Population

Variables (columns) and observations (rows) can be selected at the same time. Here is an example selecting data on population identity and cortisol for just female wolves

# Create a subset of the data containing only female wolves and the 
# variables 'Population' and 'Cpgmg'
wolf.subset4 = subset(wolf, Sex=='F', select=c('Population','Cpgmg'))
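Rows and columns can also be selected together using square brackets. The sketch below compares the two methods on a made-up data frame (toy, toy.sub1 and toy.sub2 are hypothetical names, and the values are invented).

```r
# A made-up data frame (hypothetical example data, not the wolf data set)
toy = data.frame(Sex        = c('F', 'M', 'F'),
                 Population = c(1, 2, 2),
                 Cpgmg      = c(15.9, 20.0, 9.9))

# Select the rows for females and the variables 'Population' and 'Cpgmg'
toy.sub1 = subset(toy, Sex=='F', select=c('Population','Cpgmg'))  # method 1
toy.sub2 = toy[toy$Sex=='F', c('Population','Cpgmg')]             # method 2

all.equal(toy.sub1, toy.sub2)    # Check that both methods give the same result
```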

Saving data

Large data sets can be time consuming to import into R. Once a file has been imported it is a good idea to save the data in R’s native binary format. Data in this format is quick to import and takes up less space on the hard drive. By convention, files containing data in R’s binary format have the suffix .Rdata.

To save the variables wolf and insect to a file use the save() command

# Save wolf and insect to a file called 'sheet2_data.Rdata'
save(wolf, insect, file='sheet2_data.Rdata')

We can verify that the data have been correctly saved by clearing R’s memory and re-importing them using the load() command. Try running the following commands to see if you can reload the data saved in file sheet2_data.Rdata.

rm(list=ls())                           # Clear variables from memory
ls()                                    # Display the variables in R's memory
load(file='sheet2_data.Rdata')          # Import R binary data from a file
ls()                                    # Display the variables in R's memory
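The same check can be made without clearing all of R's memory. The sketch below saves a made-up data frame to a temporary file and compares the reloaded data with a copy of the original. The names toy, toy.original, tmp.file and loaded are hypothetical and used only for illustration.

```r
# A made-up data frame (hypothetical example data)
toy = data.frame(id = 1:3, mass = c(2.5, 3.1, 4.8))
toy.original = toy                       # Keep a copy for comparison

tmp.file = tempfile(fileext='.Rdata')    # A temporary file name for this example
save(toy, file=tmp.file)                 # Save toy in R's binary format
rm(toy)                                  # Remove toy from R's memory

loaded = load(file=tmp.file)             # load() returns the names it restored
loaded                                   # Display the names of restored variables
identical(toy, toy.original)             # Check that the data are unchanged
```

Note that load() restores variables under the names they had when they were saved, so it will overwrite any variable in memory that has the same name.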

Summary of the topics covered

Displaying contents of a data frame
Manipulating data in a data frame
Creating subsets of data
Saving a data frame to a file using R’s binary data file format
Reading data from an R binary data file

