# This is a chunk of R code. All text after a # symbol is a comment
# Set working directory using setwd() function
setwd('Enter the path to my working directory')
# Clear all variables in R's memory
rm(list=ls()) # Standard code to clear R's memory
Manipulating data in R
Using arrays and data frames
How to Read this Tutorial
This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.
Below is an example of a chunk of R code:
Sometimes the output from running this R code will be displayed after the chunk of code.
Here is a chunk of code followed by the R output
2 + 4 # Use R to add two numbers
[1] 6
Introduction
This worksheet covers the different ways to manipulate arrays of numbers and data frames in R.
Creating an array of numbers
One-dimensional arrays
array() function
Use thearray()
function to create an array of numbers.
For example, to create an array of 13 zeros we would type
array(0, dim=13) # Create an array with 13 elements, all set to zero
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0
This is called aone dimensional arraywith 13 elements.
Concatenate function, c()
We can also create a 1D array of numbers using the concatenate functionc()
c(3, 100, 30, 23, -9, 6.3) # Concatenate six numbers
[1] 3.0 100.0 30.0 23.0 -9.0 6.3
Two-dimensional arrays
A matrix is a two dimensional array. A matrix can also be created using thearray()
function, but now thedim=
argument has two numbers: the first being the number of rows in the matrix and the second the number of columns in the matrix.
# Create a 2D array of zeros with 3 rows and 5 columns
array(0, dim=c(3,5))
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
Using [ ] to extract data from an array
Example 1
Using[ ]
to specify a single element
# Create a 1D array called a1 with 10 elements all set to be zero
a1 = array(0, dim=10)
If we want to set the 3rd element of the arraya1to be the number 8 we use square brackets[]
as follows
# Set the 3rd elements of the array a1 to be the number 10
a1[3] = 8
a1 # Display the modified array a1
[1] 0 0 8 0 0 0 0 0 0 0
Example 2
Using[ ]
to specify a several elements
We can use[ ]
to specify several elements of thea1array. For example, to display just the 3rd, 5th and 8th elements
# Display the 3rd, 5th and 8th elements of a1
a1[ c(3, 5, 8) ]
[1] 8 0 0
And we can use this approach to set the values of several elements at the same time
# Set the 3rd, 5th and 8th elements of a1 to be 2.5, -76 and 109, respectively
a1[ c(3, 5, 8) ] = c( 2.5, -76, 109)
a1 # Display the modified array a1
[1] 0.0 0.0 2.5 0.0 -76.0 0.0 0.0 109.0 0.0 0.0
Example 3
Using[ ]
to specify an element of a 2D array
First create a 2D array with 3 rows and 5 columns (i.e 15 elements in total) that contain random numbers draw from a standard normal distribution (mean=0, standard deviation=1)
# Create a 3 by 5 array containing random data from a normal distribution
a2 = array( rnorm(15), dim=c(3,5))
a2 # Display the array a2
[,1] [,2] [,3] [,4] [,5]
[1,] -0.7533743 1.1237166 1.0655912 1.4276775 -0.931480
[2,] 0.8206040 0.4790957 0.6671623 -0.1306744 -1.771047
[3,] -0.9179084 0.4961658 0.2490151 0.7301902 -1.867561
The row and column indices of an element can be specified inside square brackets, separated by a comma
# Display the element in the 2nd row, 4th column of a2
a2[ 2, 4]
[1] -0.1306744
We could reassign specific elements of a 2D array
# Reassign the element in the 2nd row, 4th column of a2 to be zero
a2[ 2, 4] = 0
a2 # Display the array a2
[,1] [,2] [,3] [,4] [,5] [1,] -0.7533743 1.1237166 1.0655912 1.4276775 -0.931480 [2,] 0.8206040 0.4790957 0.6671623 0.0000000 -1.771047 [3,] -0.9179084 0.4961658 0.2490151 0.7301902 -1.867561
Example 4
Using[ ]
to specify an entire row/column of a 2D array
An entire row can be specified by leaving the index blank
# Display all the elements in the 2nd row of a2
a2[ 2, ]
[1] 0.8206040 0.4790957 0.6671623 0.0000000 -1.7710466
and similarly for an entire column
# Display all the elements in the 4th column of a2
a2[ , 4]
[1] 1.4276775 0.0000000 0.7301902
Creating a data frame
Method 1
Importing data
Data frames are produced when you import data into R using one of the data import functions (e.g. read.csv()
,read.table()
).
Here is the R code to import the data in file WOLF.CSV. Thedata frameis calledwolfand contains the imported data.
# Import WOLF.CSV file using read.table function
wolf = read.table('WOLF.CSV', header=TRUE, sep=',')
Looking at the first 6 rows of thewolfdata frame shows that it contains 7 variables (i.e. 7 columns). The columns are called: Individual, Sex, Population, Colour, Cpgmg, Tpgmg, Ppgmg.
head(wolf)
Individual Sex Population Colour Cpgmg Tpgmg Ppgmg
1 1 M 2 W 15.86 5.32 NA
2 2 F 1 D 20.02 3.71 14.37622
3 3 F 2 W 9.95 5.30 21.65902
4 4 F 1 D 25.22 3.71 13.42507
5 5 M 2 D 21.13 5.34 NA
6 6 M 2 W 12.48 4.60 NA
Method 2
Converting a 2D array into a data frame
Use theas.data.frame()
function to convert a 2D array into a data frame. Each column of the array will become a different variable. By default the columns will be labelled V1, V2, V3, etc
# Convert array a2 into a data frame
d2 = as.data.frame(a2)
str(d2) # Display the structure of the new data frame
'data.frame': 3 obs. of 5 variables: $ V1: num -0.753 0.821 -0.918 $ V2: num 1.124 0.479 0.496 $ V3: num 1.066 0.667 0.249 $ V4: num 1.43 0 0.73 $ V5: num -0.931 -1.771 -1.868
Method 3
Creating a data frame from scratch
Thedata.frame()
function can be used to create a data frame from scratch.
# Create data frame containing information from a food diary
d3 = data.frame(day=c('Mon','Tues','Wed','Thurs','Mon'),
week=c(1,1,1,2,3),
meal=c('Breakfast','Lunch','Lunch','Dinner','Dinner'),
portions = c(2,1,0,0,1))
Above is an example where a data frame with 4 variables (from a food diary) is created (variable names are day, week, meal, portions). Each variable has 5 observations (rows).
Looking at the structure of the data frame shows that the text has been taken as data typecharacter(abbreviated tochr
) and numbers are data typenumeric
str(d3) # Display the structure of data frame d3
'data.frame': 5 obs. of 4 variables:
$ day : chr "Mon" "Tues" "Wed" "Thurs" ...
$ week : num 1 1 1 2 3
$ meal : chr "Breakfast" "Lunch" "Lunch" "Dinner" ...
$ portions: num 2 1 0 0 1
If we wantday
andmeal
to be factors we could use theas.factor()
function.
d3$day = as.factor(d3$day) # Convert day to be a factor
d3$meal = as.factor(d3$meal) # Convert meal to be a factor
or we could use the argumentstringsAsFactors=TRUE
in the originaldata.frame()
function.
Specifying a column
A column of a data frame corresponds to a single variable.
The$
symbol
The dollar symbol ($
) allows you to specify individual variables in a data frame. The dollar is preceded by the name of the data frame and followed by the name of the variable:<name of data frame>$<name of variable>
To view themealandweekvariables in the data framed3(above) type
d3$week # Display variable week in the d3 data frame
[1] 1 1 1 2 3
d3$meal # Display variable meal in the d3 data frame
[1] Breakfast Lunch Lunch Dinner Dinner
Levels: Breakfast Dinner Lunch
Note that for factors (e.g. themealvariable) R displays a list of the levels in the factor (i.e. the unique values of the qualitative variable).
Square brackets[ ]
Just as for arrays, square brackets can be used to specify a column from a data frame. To specify the 3rd column in thed3data frame type
d3[ ,3] # Display the 3rd column in data frame d3
[1] Breakfast Lunch Lunch Dinner Dinner
Levels: Breakfast Dinner Lunch
Note that the row index is blank in order to specify all rows
Specifying a row
An observation (a row in a data frame) can be specified using[ ]
To specify the 2nd observation in thewolfdata frame type
wolf[2, ]
Individual Sex Population Colour Cpgmg Tpgmg Ppgmg
2 2 F 1 D 20.02 3.71 14.37622
Note that the column index is left blank in order to specify all columns
We can use the same notation to specify a single value. To specify the 2nd observation of the 5th column in thewolfdata frame type
wolf[2, 5]
[1] 20.02
Logical indexing
Logical indexing uses a logical variable to specify elements in an array or data frame. A logical variable is eitherTRUE
orFALSE
.
A logical index will extract values from a data frame at positions where the logical index isTRUE
.
Data frame example
Here is an example using thed3data frame. We want to extract all observations from week 1.
d3 # Display data in the d3 data frame
day week meal portions
1 Mon 1 Breakfast 2
2 Tues 1 Lunch 1
3 Wed 1 Lunch 0
4 Thurs 2 Dinner 0
5 Mon 3 Dinner 1
# Create a logical array that is TRUE if an observation is from week 1
logInd = d3$week==1
str(logInd) # Display the structure of the logical array
logi [1:5] TRUE TRUE TRUE FALSE FALSE
# Use the logical array inside square brackets to specify rows corresponding to week 1
d3[logInd, ]
day week meal portions
1 Mon 1 Breakfast 2
2 Tues 1 Lunch 1
3 Wed 1 Lunch 0
Rows corresponding the wherelogIndisTRUE
are extracted.
The same result could be obtained using thesubset()
function
# Use the subset function to specify rows corresponding to week 1
subset(d3, week==1)
day week meal portions
1 Mon 1 Breakfast 2
2 Tues 1 Lunch 1
3 Wed 1 Lunch 0
Array example
Here is an example where we want to extract all negative numbers from the 2D arraya2
a2 # Display the 2D array a2
[,1] [,2] [,3] [,4] [,5]
[1,] -0.7533743 1.1237166 1.0655912 1.4276775 -0.931480
[2,] 0.8206040 0.4790957 0.6671623 0.0000000 -1.771047
[3,] -0.9179084 0.4961658 0.2490151 0.7301902 -1.867561
# Create a logical array that is TRUE when an element of a2 is negative
logInd = a2<0
# Extract elements of a2 that are negative
a2[logInd]
[1] -0.7533743 -0.9179084 -0.9314800 -1.7710466 -1.8675614
Reassigning values in a data frame using [ ]
Above we used logical indexing to extract all observations from week 1 in the data framed3
. We can use this approach to reassign the values in a data frame. This can be useful when correcting errors in a data frame.
Here is how you reassign the values of portions for week 1 in data frame d3 to be equal to 0
# Create a logical array that is TRUE if an observation is from week 1
loInd = d3$week == 1
# Reassign portions to be 0 in week 1
d3$portions[logInd] = 0
Reorganising a data frame
This section shows how to pivot data. This means combining several columns into one column (i.e. putting data in a longer format) and splitting data from one column across several columns (i.e. putting data into a wider format). We will be using thepivot_longer()
andpivot_wider()
functions from the packagetidyr
.
We will use three more data sets (INSECT.TXT, MALIN_HEAD.TXT and BEEKEEPER.TXT). These data sets are explained athttp://www.ucd.ie/ecomodel/Resources/datasets_WebVersion.html
First import the data
# TAB delimited variables
insect = read.table('INSECT.TXT', header=T, skip=3, sep='\t')
insect = insect[,-c(7,8)] # Remove the last two empty columns
rainfall = read.table('MALIN_HEAD.TXT', header=T, sep='\t')
# White space delimited variables
bees = read.table('BEEKEEPER.TXT', header=T, skip=6, sep='')
Several columns into one column (usingpivot_longer()
)
Look at the insect data frame
insect
Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F
1 10 11 0 3 3 11
2 7 17 1 5 5 9
3 20 21 7 12 3 15
4 14 11 2 6 5 22
5 14 16 3 4 3 15
6 12 14 1 3 6 16
7 10 17 2 5 1 13
8 23 17 1 5 1 10
9 17 19 3 5 3 26
10 20 21 0 5 2 26
11 14 7 1 2 6 24
12 13 13 4 4 4 13
Theinsect
data frame is nottidy: it is not organised so that each row is an observation and each column is a variable.
It has two variables:spray name(A-F) andinsect count, so ideally we want a data frame with two columns (one column would contain the type of spray used on a plot and the second column would contain the insect count from the plot). As it stands thespray nameappears as the names of the columns, and theinsect countis spread across six columns.Ideally we need to reorganize the data into a long-format, so that there are just two columns, one for each variable.
To reorganize the data frame so that each column corresponds to a single variable we use thepivot_longer()
function in thetidyr
package. This function will:
- Specify which columns to combine into one column with the
cols=
argument. - Create a new variable containing the names of the columns from the insect data frame. This column’s name can be set with the
names_to=
argument of thepivot_longer()
function. - Create a second new variable containing the numerical data in the columns. This column’s name can be set with the
values_to=
argument of thepivot_longer()
function.
library('tidyr') # Load tidyr package
# Tidy the insect data frame
# Gather all the count data into one column
insect.long = pivot_longer(data=insect,
cols=c(1:6),
values_to="Count",
names_to="Spray")
head(insect.long) # Display the reshaped data frame
# A tibble: 6 × 2
Spray Count
<chr> <int>
1 Spray.A 10
2 Spray.B 11
3 Spray.C 0
4 Spray.D 3
5 Spray.E 3
6 Spray.F 11
This reorganized data frame is inlong-formatand is sometimes called amolten data frame.
Example: Reorganising climate data
A common example of using thepivot_longer()
function to combine multiple columns of data into a single column occurs with monthly climate data (e.g. Met Eireann’s monthly data;(opens in a new window)https://www.met.ie/climate/available-data). It is common for raw data to be displayed with columns to represent individual months. To analyse these data we will want all rainfall data in one column, and a second column added giving the month.
rainfall # Display the data
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 2016 166.3 130.7 72.7 52.9 65.6 72.4 109.9 63.4 NA NA NA NA
2 2015 176.0 85.8 123.1 64.7 137.0 56.1 132.7 111.0 29.7 71.9 222.9 272.9
3 2014 162.2 189.9 71.6 33.4 86.8 48.6 86.0 95.3 23.0 131.4 134.4 150.5
4 2013 140.9 74.1 61.7 61.6 102.5 85.5 56.5 92.6 69.7 103.8 116.0 178.6
With these data we will combine the monthly rainfall data into two columns: one with the name of the month and one with the rainfall for the month.
However we do not want to change theYearcolumn, so we instruct R to not use the year column, using the argumentcols=-Year
(see below).
# Organise the rainfall data
rainfall.long = pivot_longer(data=rainfall,
cols=-Year,
names_to="Month",
values_to="Rainfall")
# Display the first 6 lines of the organised data frame
head(rainfall.long)
# A tibble: 6 × 3
Year Month Rainfall
<int> <chr> <dbl>
1 2016 Jan 166.
2 2016 Feb 131.
3 2016 Mar 72.7
4 2016 Apr 52.9
5 2016 May 65.6
6 2016 Jun 72.4
One column across several columns (usingpivot_wider()
)
Sometimes you want to take well formatted data and spread one variable across several columns. One option to do this you can use thepivot_wider()
function intidyr
.
Thebees
data frame is well formatted (one variable in one column and each row an observation). In this well formatted data set we have a column calledFlowers(density of flowers), and a column calledField(the field where the flowers density was measured). We will spread theFlowersdata out so that the density of flowers is in two columns: one columns for field 1 and one column for field 2.
We will use thepivot_wider()
function. This function will:
-
Create column names from the values in variableField(categorical variable). This is done with the
names_from=
andnames_prefix=
arguments of thepivot_wider()
function. -
Indicate that theFlowersvariable will be used to fill the newly created columns. This is done with the
values_from=
argument of thepivot_wider()
function.
# Spread the Field across multiple columns
# (see below for an explanation of names_prefix="Field_")
bees.wide = pivot_wider(data=bees,
values_from="Flowers",
names_from="Field",
names_prefix="Field_")
head(bees.wide) # Display the spread out data frame
# A tibble: 6 × 4
Treatment Plot Field_1 Field_2
<int> <int> <int> <int>
1 1 1 23 31
2 1 2 26 32
3 1 3 20 35
4 1 4 21 29
5 2 1 20 30
6 2 2 24 32
Thenames_prefix="Field_"
argument is used to create the new column names, by addingField_as a prefix to the field number.
Example: Reorganising climate data
We could use thepivot_wider()
function to return the monthly rainfall data (see above) from its tidy format (data framerainfall.long
) to a wide format where each month’s rainfall has its own column.
# Reorganise the rainfall data back into a wide format
rainfall.wide = pivot_wider(data=rainfall.long,
values_from="Rainfall",
names_from="Month")
rainfall.wide # Display the wide-format rainfall data
# A tibble: 4 × 13
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2016 166. 131. 72.7 52.9 65.6 72.4 110. 63.4 NA NA NA NA
2 2015 176 85.8 123. 64.7 137 56.1 133. 111 29.7 71.9 223. 273.
3 2014 162. 190. 71.6 33.4 86.8 48.6 86 95.3 23 131. 134. 150.
4 2013 141. 74.1 61.7 61.6 102. 85.5 56.5 92.6 69.7 104. 116 179.
Other data organisation packages
There are several other data manipulation tools in R. Two common packages for data manipulation are:
reshape2
dplyr
See also(opens in a new window)http://www.r-bloggers.com/reshape-and-aggregate-data-with-the-r-package-reshape2/and(opens in a new window)http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
Summarising a variable by group
Thesummary()
function produces a summary of an entire data frame. However, it doesn’t allow you to calculate summaries for groups within the data frame.
If you want to calculate a mean across different grouping (e.g. for each value of a qualitative factor) theaggregate()
function is very useful.
To calculate the mean rainfall across years for each month:
# Mean rainfall across years 2013-2016 for each month
aggregate(Rainfall~Month, # Define aggregation using a formula
data=rainfall.long, # Specify data frame to be aggregated
FUN=mean, # Specify function to use for aggregation
na.rm=TRUE, # Ignore missing values
na.action=na.pass) # Pass missing data to the aggregation function
Month Rainfall
1 Apr 53.1500
2 Aug 90.5750
3 Dec 200.6667
4 Feb 120.1250
5 Jan 161.3500
6 Jul 96.2750
7 Jun 65.6500
8 Mar 82.2750
9 May 97.9750
10 Nov 157.7667
11 Oct 102.3667
12 Sep 40.8000
The order of the months is alphabetical, making the summary harder to read. This could be improved by specifying theMonth
variable as a factor with the levels in a specified order. Below is the code to do this.
# Define Month as a factor with levels in a specified order
rainfall.long$Month = factor(rainfall.long$Month,
levels = c('Jan','Feb','Mar','Apr','May','Jun',
'Jul','Aug','Sep','Oct','Nov','Dec'))
Theaggregate
function can be used to produce a summary for each month
# Summary of rainfall across years 2013-2016 for each month
aggregate(Rainfall~Month,
data=rainfall.long,
FUN=summary)
Month Rainfall.Min. Rainfall.1st Qu. Rainfall.Median Rainfall.Mean
1 Jan 140.9000 156.8750 164.2500 161.3500
2 Feb 74.1000 82.8750 108.2500 120.1250
3 Mar 61.7000 69.1250 72.1500 82.2750
4 Apr 33.4000 48.0250 57.2500 53.1500
5 May 65.6000 81.5000 94.6500 97.9750
6 Jun 48.6000 54.2250 64.2500 65.6500
7 Jul 56.5000 78.6250 97.9500 96.2750
8 Aug 63.4000 85.3000 93.9500 90.5750
9 Sep 23.0000 26.3500 29.7000 40.8000
10 Oct 71.9000 87.8500 103.8000 102.3667
11 Nov 116.0000 125.2000 134.4000 157.7667
12 Dec 150.5000 164.5500 178.6000 200.6667
Rainfall.3rd Qu. Rainfall.Max.
1 168.7250 176.0000
2 145.5000 189.9000
3 85.3000 123.1000
4 62.3750 64.7000
5 111.1250 137.0000
6 75.6750 85.5000
7 115.6000 132.7000
8 99.2250 111.0000
9 49.7000 69.7000
10 117.6000 131.4000
11 178.6500 222.9000
12 225.7500 272.9000
Summary of topics
- Creating an array (1D and 2D)
array()
functionc()
function
- Creating a data frame
- Importing data into a data frame
as.data.frame()
functiondata.frame()
function
- Specifying and modifying elements in arrays and data frames
- Square brackets
[]
- Dollar symbol
$
- Logical indices
- Square brackets
- Reorganising a data frame
- Gathering data into one column
- Spreading data across several columns
- Groupwise summary a data frame
Further Reading
All these books can be found in UCD’s library
- Andrew P. Beckerman and Owen L. Petchey, 2012Getting Started with R: An introduction for biologists(Oxford University Press, Oxford)
- Mark Gardner, 2012Statistics for Ecologists Using R and Excel(Pelagic, Exeter)
- Tenko Raykov and George A Marcoulides, 2013Basic statistics: an introduction with R(Rowman and Littlefield, Plymouth)
- John Verzani, 2005Using R for introductory statistics(Chapman and Hall, London)
- Andy P. Field, Jeremy Miles, Zoë Field, 2013Discovering statistics using R(Sage, London)