R Data Types Detailed Walkthrough
R implements a number of useful data types to support complex analytics and calculations. This article focuses on String, Vector, List, Matrix, Array, Factor and Data Frame. It also shows how to expand data frames, for example by adding or dropping columns and by adding rows.
String
String is one of the most frequently used data types in most programming languages. In R, a string value can be enclosed in single quotes ('') or double quotes (""). To escape a quote character inside a string, use a backslash (\).
Converting between strings and raw (binary) values is very common in programming. In R, we can use the charToRaw and rawToChar functions.
Some of the other commonly used String functions include:
- format: formats values of other data types as strings.
- paste: concatenates strings.
- nchar: returns the number of characters in a string.
- toupper and tolower: convert a string to upper or lower case.
- substring (or substr): extracts a substring.
- strsplit: splits a string on a separator.
The following script (R08.String.R) includes examples of string transformations and conversions:
> v <- '"i am a character"'
> print(v)
[1] "\"i am a character\""
> print(class(v))
[1] "character"
> v <- '"i am a \'character"'
> print(v)
[1] "\"i am a 'character\""
> print(class(v))
[1] "character"
>
> # raw (binary)
> v <- charToRaw('"i am a character"')
> print(v)
 [1] 22 69 20 61 6d 20 61 20 63 68 61 72 61 63 74 65 72 22
> print(class(v))
[1] "raw"
> # encoding
> v <- rawToChar(v)
> print(v)
[1] "\"i am a character\""
> print(class(v))
[1] "character"
>
> toupper(v)
[1] "\"I AM A CHARACTER\""
>
> tolower("R PROGRAMMING")
[1] "r programming"
>
> nchar(v)
[1] 18
>
> s<-"R PROGRAMMING"
> substr(s,3,nchar(s))
[1] "PROGRAMMING"
>
> strsplit("R:PROGRAMMING", split=":")
[[1]]
[1] "R"           "PROGRAMMING"

>
> format(13.7, nsmall = 3)
[1] "13.700"
> format(c(6.0, 13.1), digits = 2)
[1] " 6" "13"
> format(2^31 - 1, scientific = TRUE)
[1] "2.147484e+09"
>
> format("R PROGRAMMING",width=20, justify = "right")
[1] "       R PROGRAMMING"
> format("R PROGRAMMING",width=20, justify = "centre")
[1] "   R PROGRAMMING    "
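The function list above mentions paste and substring, which the script does not exercise. The following is a minimal sketch of both; the variable names and values are illustrative only and not part of the original script.

```r
# Illustrative values, not taken from the original script
first  <- "R"
second <- "PROGRAMMING"

# paste() joins its arguments with a space by default; sep changes the separator
joined <- paste(first, second)
dashed <- paste(first, second, sep = "-")

# paste0() is shorthand for paste(..., sep = "")
glued <- paste0(first, second)

# substring() extracts by character position; the end position
# defaults to the end of the string
tail_part <- substring(joined, 3)
head_part <- substring(joined, 1, 1)

print(c(joined, dashed, glued, tail_part, head_part))
```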
Vector
In R Programming Basics, I mentioned that a vector is a sequence of data elements of the same basic data type. Function length() returns the element count of a vector. Function c() combines values into a vector or list. Arithmetic operations can be applied to vectors. If the operands have different lengths, the shorter one is recycled to match the longer one; R issues a warning when the longer object's length is not a multiple of the shorter object's length. To sort elements, function sort() can be used.
The following are some examples (from script file R09.Vectors.R).
Create sequences
Note that the colon operator always steps by 1 from the start value, so the end value is dropped if it does not fall on the sequence.
> # Creating a sequence from 1 to 10.
> v <- 1L:10L
> print(v)
 [1]  1  2  3  4  5  6  7  8  9 10
> print(class(v))
[1] "integer"
>
> # Creating a sequence from 1.1 to 11.1.
> v <- 1.1:11.1
> print(v)
 [1]  1.1  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1 10.1 11.1
> print(class(v))
[1] "numeric"
> print(length(v))
[1] 11
>
> # If the final element specified does not belong to the sequence then it is discarded.
> v <- 2.3:10.5
> print(v)
[1]  2.3  3.3  4.3  5.3  6.3  7.3  8.3  9.3 10.3
> print(class(v))
[1] "numeric"
> print(length(v))
[1] 9
Use function c()
If the input elements are not all of the same atomic data type, they are all coerced to the most general type present (logical, then integer, then numeric, then character), regardless of the order in which they appear.
> # use c to combine values into a Vector or List
> v <- c(1:5)
> print(v)
[1] 1 2 3 4 5
> print(class(v))
[1] "integer"
> print(length(v))
[1] 5
>
> v <- c("Apple", "Pear")
> print(v)
[1] "Apple" "Pear"
> print(class(v))
[1] "character"
> print(length(v))
[1] 2
>
> v <- c('Apple',1,TRUE)
> print(v)
[1] "Apple" "1"     "TRUE"
> print(class(v))
[1] "character"
> print(length(v))
[1] 3
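To make the coercion behaviour of c() concrete, here is a small sketch with illustrative values that are not part of the original script. It shows that the result type depends on which types are present, not on which element comes first.

```r
# Coercion follows logical < integer < double < character,
# whatever the order of the arguments
print(class(c(TRUE, 1L)))       # logical combined with integer
print(class(c(1L, 2.5)))        # integer combined with double
print(class(c(2.5, "Apple")))   # double combined with character

# The character value comes last in v1 and first in v2,
# yet both vectors end up as character
v1 <- c(1, TRUE, "Apple")
v2 <- c("Apple", 1, TRUE)
print(class(v1))
print(class(v2))
```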
Use sort() function
The following example sorts the elements alphabetically in decreasing order.
> # sorting
> c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec") -> v
> sort(v, decreasing = TRUE)
 [1] "Sep" "Oct" "Nov" "May" "Mar" "Jun" "Jul" "Jan" "Feb" "Dec" "Aug" "Apr"
Element recycling
The following are two examples of element recycling in arithmetic operations. The second one still returns a result but produces a warning, because the length of x (10) is not a multiple of the length of y (4).
> # Recycling
> x<- 1:10
> y<- 2:6
> y+x
 [1]  3  5  7  9 11  8 10 12 14 16
> x+y
 [1]  3  5  7  9 11  8 10 12 14 16
>
> # Recycling with a warning
> x<- 1:10
> y<- 2:5
> x+y
 [1]  3  5  7  9  7  9 11 13 11 13
Warning message:
In x + y :
  longer object length is not a multiple of shorter object length
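Recycling also explains why a vector can be combined with a single value. The sketch below uses illustrative values (not from the original script) and shows one way to check the lengths up front when the warning should be avoided.

```r
x <- 1:6

# A single value is recycled across every element
doubled <- x * 2

# A length-2 vector is recycled three times: 10, 20, 10, 20, 10, 20
shifted <- x + c(10, 20)

# Checking the lengths up front tells us whether R would warn
y <- 1:4
is_multiple <- length(x) %% length(y) == 0

print(doubled)
print(shifted)
print(is_multiple)
```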
Access vector elements
Once a vector is created, its elements can be accessed using indexing with square brackets [ ]. Unlike C-based languages, R indexes start from 1. Negative indexes exclude elements, and a logical vector (TRUE or FALSE values) can be used to select elements; a logical index shorter than the vector is recycled.
The following are some examples of accessing elements (script R10.AccessingVectorElement.R).
> # Accessing vectors
> v <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
> print(v)
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> print(class(v))
[1] "character"
> print(length(v))
[1] 12
>
> # Accessing index
> v[1]
[1] "Jan"
>
> v[1:5]
[1] "Jan" "Feb" "Mar" "Apr" "May"
>
> v[10:12]
[1] "Oct" "Nov" "Dec"
>
> v[c(1,3,5)]
[1] "Jan" "Mar" "May"
>
> # Exclude January and March
> v[c(-3,-1)]
 [1] "Feb" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
>
> # Exclude using FALSE
> v[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE)]
[1] "Jan" "Jun" "Sep"
>
> # The logical vector is recycled.
> v[c(TRUE,FALSE,FALSE)]
[1] "Jan" "Apr" "Jul" "Oct"
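In practice, a logical index is usually produced by a comparison rather than typed by hand. The following sketch uses made-up balances (an assumption, not data from the original script) to show condition-based and name-based access.

```r
# Illustrative data, not part of the original script
balances <- c(600.11, 1278.10, 3000)

# A comparison yields a logical vector that can be used as an index
large <- balances[balances > 1000]

# which() returns the positions of the TRUE values
positions <- which(balances > 1000)

# Elements can also be named and then accessed by name
names(balances) <- c("John", "Tom", "Lily")
tom <- balances[["Tom"]]

print(large)
print(positions)
print(tom)
```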
List
All elements of a vector must be of the same data type. If your data elements have different types, a list is the option. A list is an R object that can contain elements of different types, including other lists. A list is created using the list() function. A list can be flattened into a vector using the unlist() function. Lists can be merged using function c().
The following are some examples about R list (script R11.Lists.R).
> # create list
> myList<- list(1:10,"Test", "b", TRUE, list(1,3:100,"string"))
> print(myList)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
[1] "Test"

[[3]]
[1] "b"

[[4]]
[1] TRUE

[[5]]
[[5]][[1]]
[1] 1

[[5]][[2]]
 [1]   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23
[22]  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44
[43]  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65
[64]  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
[85]  87  88  89  90  91  92  93  94  95  96  97  98  99 100

[[5]][[3]]
[1] "string"


>
> listA <- list(1:5,10)
> listB <- list(10:26, 100)
> c(listA, listB)
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] 10

[[3]]
 [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

[[4]]
[1] 100

>
> unlist(listA) + unlist(listB)
 [1]  11  13  15  17  19  25  17  19  21  23  25  31  23  25  27  29  31 110
>
> listA[1]
[[1]]
[1] 1 2 3 4 5
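The last call above, listA[1], returns a list of length one rather than the integer vector itself. The sketch below illustrates the difference between single brackets, double brackets and $; the list and its element names are hypothetical and chosen only for illustration.

```r
# Illustrative list; the element names are assumptions
person <- list(name = "John", scores = c(90, 85), active = TRUE)

sub_list <- person["scores"]     # single brackets keep the list wrapper
scores   <- person[["scores"]]   # double brackets return the element itself
name     <- person$name          # $ is shorthand for [[ ]] with a name

print(class(sub_list))
print(class(scores))
print(scores[2])
```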
Matrix
A matrix is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns. Matrices containing numeric elements can be used in mathematical calculations.
The following diagram is cited from Wikipedia about Matrix:
Create and access matrix
A matrix can be created using the matrix() function:
matrix(data, nrow, ncol, byrow, dimnames)
For example, the following script (R12.Matrices.R) creates three matrices using different approaches:
> # Matrices
> MA <- matrix(1:10,nrow=2,ncol=5, byrow = TRUE)
> class(MA)
[1] "matrix"
> MB <- matrix(11:20,nrow=2,ncol=5)
>
> MC <- matrix(31:40, nrow=5, ncol=2, dimnames = list(c('row1','row2','row3','row4','row5'),c('col1','col2')))
> View(MC)
Accessing matrix
Matrix elements can be accessed via row and column indexes. Leaving one index empty selects a whole row or column, which is returned as a vector.
> # Accessing
> MC[1,2]
[1] 36
> MC[1,]
col1 col2
  31   36
> # class is integer vector
> class(MC[1,])
[1] "integer"
> MC[,2]
row1 row2 row3 row4 row5
  36   37   38   39   40
Matrix arithmetic operations
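Matrices support the arithmetic operators +, -, * and /. These operators work element by element, so the operands must have the same dimensions and the result is a matrix with the same number of rows and columns.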
> # matrix operations
> MA+MB
     [,1] [,2] [,3] [,4] [,5]
[1,]   12   15   18   21   24
[2,]   18   21   24   27   30
> MA-MB
     [,1] [,2] [,3] [,4] [,5]
[1,]  -10  -11  -12  -13  -14
[2,]   -6   -7   -8   -9  -10
> MA*MB
     [,1] [,2] [,3] [,4] [,5]
[1,]   11   26   45   68   95
[2,]   72   98  128  162  200
> MA/MB
           [,1]      [,2] [,3]      [,4]      [,5]
[1,] 0.09090909 0.1538462  0.2 0.2352941 0.2631579
[2,] 0.50000000 0.5000000  0.5 0.5000000 0.5000000
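Note that * in the output above multiplies element by element. The true matrix product uses the %*% operator, which is not covered by the original script. The sketch below reuses the MA and MB definitions and transposes MB with t() so that the dimensions conform.

```r
MA <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)
MB <- matrix(11:20, nrow = 2, ncol = 5)

# Element-wise product: both operands are 2 x 5, and so is the result
elementwise <- MA * MB

# Matrix product: (2 x 5) %*% (5 x 2) gives a 2 x 2 matrix
P <- MA %*% t(MB)

print(dim(elementwise))
print(P)
```

The design point is that %*% requires the column count of the left operand to equal the row count of the right operand, whereas the element-wise operators require identical shapes.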
Array
A matrix has only two dimensions. To store data in more than two dimensions, the array data type can be used.
The following is a diagram from Wikipedia about array of matrix:
Create array
An array can be created using function array(). If an array is created with dimensions (m, n, k), the total element count is m*n*k.
The signature of the function is:
array(data = NA, dim = length(data), dimnames = NULL)
> # Array
>
> vector<- c(1:30)
>
> column.names=c("col1","col2","col3")
> row.names=c("row1","row2")
> matrix.names=c("Matrix1","Matrix2","Matrix3","Matrix4","Matrix5")
>
> myArray <- array(vector,dim=c(2,3,5), dimnames = list(row.names,column.names, matrix.names))
>
> print(myArray)
, , Matrix1

     col1 col2 col3
row1    1    3    5
row2    2    4    6

, , Matrix2

     col1 col2 col3
row1    7    9   11
row2    8   10   12

, , Matrix3

     col1 col2 col3
row1   13   15   17
row2   14   16   18

, , Matrix4

     col1 col2 col3
row1   19   21   23
row2   20   22   24

, , Matrix5

     col1 col2 col3
row1   25   27   29
row2   26   28   30
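The printed array shows how the 30 values are laid out: the first index varies fastest, then the second, then the third. The following sketch (same shape as above, dimnames omitted for brevity) checks the element count and that fill order; the helper variables i, j, k, m, n and pos are illustrative.

```r
# Same shape as the array above, without dimnames
a <- array(1:30, dim = c(2, 3, 5))

# Total element count equals the product of the dimensions
n_elements <- length(a)

# Element (i, j, k) of an m x n x k array sits at linear position
# i + (j - 1) * m + (k - 1) * m * n
i <- 1; j <- 2; k <- 3
m <- dim(a)[1]; n <- dim(a)[2]
pos <- i + (j - 1) * m + (k - 1) * m * n

print(n_elements == prod(dim(a)))
print(a[i, j, k] == a[pos])
```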
Accessing array
We can use indexes to access array data. Leaving indexes empty extracts a slice, so a matrix can be created from an array.
> # Access array
>
> myArray[1,2,3]
[1] 15
>
> myArray[,,3]
     col1 col2 col3
row1   13   15   17
row2   14   16   18
> class(myArray[,,3])
[1] "matrix"
>
> myArray[1,,3]
col1 col2 col3
  13   15   17
>
> # Create matrices from array
> m1 <- myArray[,,3]
> m2 <- myArray[,,4]
> m1+m2
     col1 col2 col3
row1   32   36   40
row2   34   38   42
Apply arithmetic calculations
Function apply() applies a function over one or more dimensions (margins) of an array.
Here are some examples:
> # Apply
>
> # Use apply to calculate the sum of the rows across all the matrices.
> apply(myArray, c(1), sum)
row1 row2
 225  240
> # Use apply to calculate the sum of the columns across all the matrices.
> apply(myArray, c(2), sum)
col1 col2 col3
 135  155  175
> # Use apply to calculate the sum of the rows and columns across all the matrices.
> apply(myArray, c(1,2), sum)
     col1 col2 col3
row1   65   75   85
row2   70   80   90
The second parameter (MARGIN) controls which dimensions are kept: 1 sums each row, 2 sums each column, and c(1,2) sums each row and column position across all the matrices.
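The same idea extends to the third dimension, which the examples above do not show. The following sketch rebuilds the array without dimnames and sums over each matrix; the variable names per_matrix and row_by_matrix are illustrative.

```r
myArray <- array(1:30, dim = c(2, 3, 5))

# MARGIN = 3 keeps the third dimension: one sum per matrix
per_matrix <- apply(myArray, 3, sum)

# MARGIN = c(1, 3) keeps rows and matrices: row sums within each matrix
row_by_matrix <- apply(myArray, c(1, 3), sum)

print(per_matrix)
print(dim(row_by_matrix))
```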
Factor
A factor is used to categorize data; it stores the distinct categories as levels. A factor can be created via function factor(). The following are some examples of categorical values suited to factors:
- Male, Female
- West, East, North, South
- …
Factors are commonly used in data frames (covered in the following section). Some of the commonly used functions for factors are factor(), levels() and is.factor().
The following are some examples about factor (script R14.Factors.R):
> # Factors
>
> d1<-c("Male","Female","Male","Male")
> d1f<-factor(d1)
> levels(d1f)
[1] "Female" "Male"
> length(d1f)
[1] 4
>
> # Change order
> d2f<-factor(d1, levels=c("Male","Female"))
> levels(d2f)
[1] "Male"   "Female"
>
> # Change label (levels must match the data; labels rename them)
> d3f<-factor(d1, levels=c("Female","Male"), labels=c("F","M"))
> levels(d3f)
[1] "F" "M"
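The difference between the levels and labels arguments of factor() is easy to get wrong: levels must match the values present in the data, while labels renames them. The sketch below illustrates this under that reading; the variable names f, g and counts are illustrative.

```r
d1 <- c("Male", "Female", "Male", "Male")

# levels names the values found in the data; labels renames them in order
f <- factor(d1, levels = c("Female", "Male"), labels = c("F", "M"))

# Values that do not appear in levels become NA
g <- factor(d1, levels = c("F", "M"))

# table() counts the observations per level
counts <- table(f)

print(as.character(f))
print(as.integer(f))   # the underlying integer codes
print(sum(is.na(g)))
print(is.factor(f))
```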
Data Frame
The data frame is the most important data type for most people who work with data in R. A data frame is a table, or a two-dimensional array-like structure. Each column contains the values of one variable, which can be numeric, factor or character. Each row contains one value from each column.
The following table shows a sample data frame:
 | CustomerID | CustomerName | DateOfBirth | Balance |
1 | 10001 | John | 1990-01-01 | 600.11 |
2 | 10002 | Tom | 1991-01-01 | 1278.10 |
3 | 10003 | Lily | 2000-07-26 | 3000 |
Create data frame
A data frame can be created using function data.frame(). In R versions before 4.0.0, character variables are converted to factors by default; from R 4.0.0 onward the default is stringsAsFactors = FALSE. The structure of the data frame can be viewed via function str(). A statistical summary can be obtained via function summary().
> # Create data frames
>
> customer.data <- data.frame(
+   CustomerID = 10001:10003,
+   CustomerName = c('John','Tom','Lily'),
+   DateOfBirth = as.Date(c("1990-01-01","1991-01-01","2000-07-26")),
+   Balance = c(600.11,1278.10,3000),
+   stringsAsFactors = FALSE
+ )
>
> # Vector of date
> class(customer.data$DateOfBirth)
[1] "Date"
>
> # display the structure of data frame
> str(customer.data)
'data.frame':	3 obs. of  4 variables:
 $ CustomerID  : int  10001 10002 10003
 $ CustomerName: chr  "John" "Tom" "Lily"
 $ DateOfBirth : Date, format: "1990-01-01" "1991-01-01" ...
 $ Balance     : num  600 1278 3000
>
> # statistic summary
> summary(customer.data)
   CustomerID    CustomerName        DateOfBirth            Balance
 Min.   :10001   Length:3           Min.   :1990-01-01   Min.   : 600.1
 1st Qu.:10002   Class :character   1st Qu.:1990-07-02   1st Qu.: 939.1
 Median :10002   Mode  :character   Median :1991-01-01   Median :1278.1
 Mean   :10002                      Mean   :1993-11-09   Mean   :1626.1
 3rd Qu.:10002                      3rd Qu.:1995-10-14   3rd Qu.:2139.1
 Max.   :10003                      Max.   :2000-07-26   Max.   :3000.0
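The effect of the stringsAsFactors argument used above can be seen by creating the same column both ways. This is a minimal sketch with a made-up single-column data frame; passing the argument explicitly makes the result independent of the R version's default.

```r
# Illustrative single-column data frames
df_chr <- data.frame(name = c("John", "Tom"), stringsAsFactors = FALSE)
df_fac <- data.frame(name = c("John", "Tom"), stringsAsFactors = TRUE)

print(class(df_chr$name))    # stays a character vector
print(class(df_fac$name))    # converted to a factor
print(levels(df_fac$name))   # levels are sorted alphabetically
```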
Accessing and expanding data frame
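Data can be extracted from a data frame by column name or by index; selected columns can also be recombined with function data.frame(). A column is added by assigning to a new name (dataFrame$variableName <- values) and dropped by assigning an empty value (dataFrame$variableName <- c(), which is equivalent to assigning NULL). Rows can be added using function rbind(). Columns can be bound together using cbind(); note that cbind() returns a matrix rather than a data frame when none of its arguments is a data frame.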
The following are some examples (script R16.AccessingAndExpandDataFrame.R):
1) Extract only a few columns:
> # Extract only a few columns
> data.frame(customer.data$CustomerID, customer.data$Balance)
customer.data.CustomerID customer.data.Balance
1 10001 600.11
2 10002 1278.10
3 10003 3000.00
2) Extract only a few rows:
> # Extract only a few rows
> # extract the first row
> customer.data[1,]
  CustomerID CustomerName DateOfBirth Balance
1      10001         John  1990-01-01  600.11
> # extract the second and the third rows
> customer.data[2:3,]
  CustomerID CustomerName DateOfBirth Balance
2      10002          Tom  1991-01-01  1278.1
3      10003         Lily  2000-07-26  3000.0
3) Extract via column name:
> # extract via column name
> customer.data[2:3,c("CustomerID","Balance")]
  CustomerID Balance
2      10002  1278.1
3      10003  3000.0
> customer.data[c("CustomerID","Balance")]
  CustomerID Balance
1      10001  600.11
2      10002 1278.10
3      10003 3000.00
4) Extract via column index:
> # extract via column index
> customer.data[2:3, c(1,2,4)]
  CustomerID CustomerName Balance
2      10002          Tom  1278.1
3      10003         Lily  3000.0
5) Add new column:
> # Add new column
> customer.data$IsClosed <- c(FALSE,TRUE,FALSE)
> View(customer.data)
The data frame looks like the following screenshot:
6) Drop column:
> # Drop column
> customer.data$IsClosed <- c()
7) Add rows:
> # Add rows
> customer.data2 <- data.frame(
+   CustomerID = 10004:10006,
+   CustomerName = c('Lucy','Jack','Rose'),
+   DateOfBirth = as.Date(c("1997-01-01","1998-02-07","2001-07-12")),
+   Balance = c(700.11,27,1937.22),
+   stringsAsFactors = FALSE
+ )
> View(customer.data2)
>
> # Row bind
> customer.data.all <- rbind(customer.data, customer.data2)
> summary(customer.data.all)
   CustomerID    CustomerName        DateOfBirth            Balance
 Min.   :10001   Length:6           Min.   :1990-01-01   Min.   :  27.0
 1st Qu.:10002   Class :character   1st Qu.:1992-07-02   1st Qu.: 625.1
 Median :10004   Mode  :character   Median :1997-07-21   Median : 989.1
 Mean   :10004                      Mean   :1996-05-14   Mean   :1257.1
 3rd Qu.:10005                      3rd Qu.:1999-12-14   3rd Qu.:1772.4
 Max.   :10006                      Max.   :2001-07-12   Max.   :3000.0
Data frame customer.data2 looks like the following:
Function rbind() was used to combine these two data frames into one.
8) Column bind:
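The following example binds vectors column-wise using the cbind function. Because all the inputs are plain vectors, the result is a character matrix rather than a data frame, and the dates appear as their underlying day counts: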
> # Column bind
> CustomerID = 10001:10003
> CustomerName = c('John','Tom','Lily')
> DateOfBirth = as.Date(c("1990-01-01","1991-01-01","2000-07-26"))
> Balance = c(600.11,1278.10,3000)
>
> dataFrame <- cbind(CustomerID, CustomerName, DateOfBirth, Balance)
> print(dataFrame)
     CustomerID CustomerName DateOfBirth Balance
[1,] "10001"    "John"       "7305"      "600.11"
[2,] "10002"    "Tom"        "7670"      "1278.1"
[3,] "10003"    "Lily"       "11164"     "3000"
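As the quoted values in the output show, cbind() on plain vectors converts everything to character. One way to keep each column's type is to build the data frame with data.frame() and only then use cbind() to append columns. The sketch below reuses the same vectors; the names m, df and df2 are illustrative.

```r
CustomerID   <- 10001:10003
CustomerName <- c("John", "Tom", "Lily")
DateOfBirth  <- as.Date(c("1990-01-01", "1991-01-01", "2000-07-26"))
Balance      <- c(600.11, 1278.10, 3000)

# cbind() on atomic vectors returns a matrix with a single type (character here)
m <- cbind(CustomerID, CustomerName, DateOfBirth, Balance)

# data.frame() keeps one type per column
df <- data.frame(CustomerID, CustomerName, DateOfBirth, Balance,
                 stringsAsFactors = FALSE)

# When an argument is a data frame, cbind() returns a data frame
df2 <- cbind(df, IsClosed = c(FALSE, TRUE, FALSE))

print(typeof(m))
print(sapply(df, function(col) class(col)[1]))
print(ncol(df2))
```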
Now you should be familiar with these frequently used data types in R. This series will continue with more uses of these data types.