R Data Types Detailed Walkthrough

Raymond Raymond, 2020-09-23

R implements a number of useful data types to support complex analytics and calculations. This article focuses on String, Vector, List, Matrix, Array, Factor and Data Frame. It also shows how to expand data frames, for example by adding or dropping columns and adding rows.

String

String is one of the most frequently used data types in most programming languages. In R, a string value can be enclosed in single quotes ('') or double quotes (""). To include a quote character inside a string delimited by the same kind of quote, escape it with a backslash (\).

Converting between strings and raw (binary) values is common in programming. In R, use the functions charToRaw and rawToChar.

Other commonly used string functions include:

  • format: formats other data types as strings.
  • paste: concatenates strings.
  • nchar: returns the number of characters in a string.
  • toupper and tolower: convert the case of a string.
  • substr and substring: extract a substring.
  • strsplit: splits a string.

Examples

The following script, R08.String.R, shows string-related transformations and conversions:

> v <- '"i am a character"'
> print(v)
[1] "\"i am a character\""
> print(class(v))
[1] "character"
> v <- '"i am a \'character"'
> print(v)
[1] "\"i am a 'character\""
> print(class(v))
[1] "character"
> 
> # raw (binary)
> v <- charToRaw('"i am a character"')
> print(v)
 [1] 22 69 20 61 6d 20 61 20 63 68 61 72 61 63 74 65 72 22
> print(class(v))
[1] "raw"
> # convert raw back to character
> v <- rawToChar(v)
> print(v)
[1] "\"i am a character\""
> print(class(v))
[1] "character"
> 
> toupper(v)
[1] "\"I AM A CHARACTER\""
> 
> tolower("R PROGRAMMING")
[1] "r programming"
> 
> nchar(v)
[1] 18
> 
> s<-"R PROGRAMMING"
> substr(s,3,nchar(s))
[1] "PROGRAMMING"
> 
> strsplit("R:PROGRAMMING", split=":")
[[1]]
[1] "R"           "PROGRAMMING"

> 
> format(13.7, nsmall = 3)
[1] "13.700"
> format(c(6.0, 13.1), digits = 2)
[1] " 6" "13"
> format(2^31 - 1, scientific = TRUE)
[1] "2.147484e+09"
> 
> format("R PROGRAMMING",width=20, justify = "right")
[1] "       R PROGRAMMING"
> format("R PROGRAMMING",width=20, justify = "centre")
[1] "   R PROGRAMMING    "
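
The transcript above does not exercise paste, although it is listed among the commonly used functions. The following is a minimal sketch of string concatenation with paste() and paste0(); the variable names are illustrative only and do not come from the original script.

```r
# Concatenate strings with paste(); sep controls the separator.
full   <- paste("R", "PROGRAMMING")              # default sep = " "
joined <- paste("R", "PROGRAMMING", sep = ":")   # custom separator
tight  <- paste0("R", "08")                      # paste0() uses sep = ""

# collapse merges the elements of one character vector into a single string.
csv <- paste(c("a", "b", "c"), collapse = ",")

print(full)    # "R PROGRAMMING"
print(joined)  # "R:PROGRAMMING"
print(tight)   # "R08"
print(csv)     # "a,b,c"
```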

Vector

As mentioned in R Programming Basics, a vector is a sequence of data elements of the same basic data type. Function length() returns the number of elements in a vector, and function c() combines values into a vector or list. Arithmetic operations are applied to vectors element by element. If the operands have different lengths, the shorter one is recycled to match the longer one; R issues a warning when the longer length is not a multiple of the shorter length. To sort elements, use function sort().

The following are some examples (from script file R09.Vectors.R).

Create sequences

One thing to pay attention to: the final value specified is left out if it does not belong to the sequence.

> # Creating a sequence from 1 to 10.
> v <- 1L:10L
> print(v)
 [1]  1  2  3  4  5  6  7  8  9 10
> print(class(v))
[1] "integer"
> 
> # Creating a sequence from 1.1 to 11.1.
> v <- 1.1:11.1
> print(v)
 [1]  1.1  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1 10.1 11.1
> print(class(v))
[1] "numeric"
> print(length(v))
[1] 11
> 
> # If the final element specified does not belong to the sequence then it is discarded.
> v <- 2.3:10.5
> print(v)
[1]  2.3  3.3  4.3  5.3  6.3  7.3  8.3  9.3 10.3
> print(class(v))
[1] "numeric"
> print(length(v))
[1] 9
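
The colon operator always steps by 1. When a different step or a fixed number of elements is needed, base R's seq() function is one option. The sketch below is an addition to the original script and uses made-up variable names.

```r
# seq() with an explicit step: same values as 2.3:10.5.
s1 <- seq(2.3, 10.5, by = 1)

# seq() with a fixed number of evenly spaced elements.
s2 <- seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

# A negative step counts down.
s3 <- seq(10, 1, by = -3)          # 10 7 4 1

print(length(s1))  # 9
print(s2)
print(s3)
```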

Use function c()

If the input elements are not of the same atomic data type, all of them are coerced to the most general type present, regardless of their order: logical, then integer, then numeric, then character.

> # use c to combine values into a Vector or List
> v <- c(1:5)
> print(v)
[1] 1 2 3 4 5
> print(class(v))
[1] "integer"
> print(length(v))
[1] 5
> 
> v <- c("Apple", "Pear")
> print(v)
[1] "Apple" "Pear" 
> print(class(v))
[1] "character"
> print(length(v))
[1] 2
> 
> v <- c('Apple',1,TRUE)
> print(v)
[1] "Apple" "1"     "TRUE" 
> print(class(v))
[1] "character"
> print(length(v))
[1] 3
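
To make the coercion rule concrete, the following sketch checks the resulting class for a few mixed inputs. It is an illustration added to the original examples; the variable names are arbitrary.

```r
# Coercion in c() goes to the most general type among the inputs.
a <- c(TRUE, 2L)          # logical + integer   -> integer
b <- c(2L, 3.5)           # integer + numeric   -> numeric
d <- c(3.5, "Apple")      # numeric + character -> character

# The order of the elements does not change the resulting type.
e <- c(TRUE, 1, "Apple")

print(class(a))  # "integer"
print(class(b))  # "numeric"
print(class(d))  # "character"
print(e)         # "TRUE" "1" "Apple"
```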

Use sort() function

The following example sorts the elements alphabetically in decreasing order.

> # sorting
> c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec") -> v
> sort(v, decreasing = TRUE)
 [1] "Sep" "Oct" "Nov" "May" "Mar" "Jun" "Jul" "Jan" "Feb" "Dec" "Aug" "Apr"
> 

Element recycling

The following are two examples of element recycling in arithmetic operations. The second one raises a warning (not an error) because the length of x is 10 while the length of y is 4, and 10 is not a multiple of 4.

> # Recycling
> x<- 1:10
> y<- 2:6
> y+x
 [1]  3  5  7  9 11  8 10 12 14 16
> x+y
 [1]  3  5  7  9 11  8 10 12 14 16
> 
> # Recycling warning
> x<- 1:10
> y<- 2:5
> x+y
 [1]  3  5  7  9  7  9 11 13 11 13
Warning message:
In x + y : longer object length is not a multiple of shorter object length
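
Because recycling with mismatched lengths only produces a warning, the result is still returned and the problem is easy to miss. One possible way to guard against this, sketched here under the assumption that a silent partial recycle is unwanted, is to check the lengths up front or to intercept the warning with tryCatch(). The variable names are hypothetical.

```r
x <- 1:10
y <- 2:5

# Check whether the longer length is a multiple of the shorter length.
lengths_compatible <- max(length(x), length(y)) %% min(length(x), length(y)) == 0

# tryCatch() returns the handler's value as soon as a warning is signalled.
status_bad <- tryCatch({ x + y; "no warning" },
                       warning = function(w) "warning raised")
status_ok  <- tryCatch({ x + 2:6; "no warning" },
                       warning = function(w) "warning raised")

print(lengths_compatible)  # FALSE
print(status_bad)          # "warning raised"
print(status_ok)           # "no warning"
```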

Access vector elements

Once a vector is created, its elements can be accessed by indexing with square brackets [ ]. Unlike C-based languages, R indexes start from 1. Negative indexes exclude elements, and a logical vector (TRUE or FALSE values) selects the elements at the TRUE positions.

The following are some examples of accessing elements (script R10.AccessingVectorElement.R).

> # Accessing vectors
> v <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
> print(v)
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> print(class(v))
[1] "character"
> print(length(v))
[1] 12
> 
> # Accessing index 
> v[1]
[1] "Jan"
> 
> v[1:5]
[1] "Jan" "Feb" "Mar" "Apr" "May"
> 
> v[10:12]
[1] "Oct" "Nov" "Dec"
> 
> v[c(1,3,5)]
[1] "Jan" "Mar" "May"
> 
> # Exclude January and March
> v[c(-3,-1)]
 [1] "Feb" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> 
> # Select using a logical vector (recycled to the length of v)
> v[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE)]
[1] "Jan" "Jun" "Sep"
> 
> # The logical vector is recycled.
> v[c(TRUE,FALSE,FALSE)]
[1] "Jan" "Apr" "Jul" "Oct"
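
Logical indexing is most useful when the logical vector comes from a condition. The sketch below, which is not part of the original script, also shows indexing by name on a named vector; the sales figures are made up for illustration.

```r
# A named numeric vector (hypothetical values).
sales <- c(Jan = 120, Feb = 95, Mar = 143)

by_name   <- sales["Feb"]          # access by element name
above_100 <- sales[sales > 100]    # logical condition as the index
positions <- which(sales > 100)    # positions of the TRUE values

print(by_name)           # Feb 95
print(names(above_100))  # "Jan" "Mar"
print(positions)         # Jan 1, Mar 3
```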

List

All elements of a vector must be of the same data type. If your data elements are of different types, use a list. A list is an R object whose elements can be any kind of R object. A list is created using function list(), can be converted to a vector using function unlist(), and can be merged with another list using function c().

The following are some examples about R list (script R11.Lists.R).

> # create list
> myList<- list(1:10,"Test", "b", TRUE, list(1,3:100,"string"))
> print(myList)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
[1] "Test"

[[3]]
[1] "b"

[[4]]
[1] TRUE

[[5]]
[[5]][[1]]
[1] 1

[[5]][[2]]
 [1]   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23
[22]  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44
[43]  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65
[64]  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
[85]  87  88  89  90  91  92  93  94  95  96  97  98  99 100

[[5]][[3]]
[1] "string"


> 
> listA <- list(1:5,10)
> listB <- list(10:26, 100)
> c(listA, listB)
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] 10

[[3]]
 [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

[[4]]
[1] 100

> 
> unlist(listA) + unlist(listB)
 [1]  11  13  15  17  19  25  17  19  21  23  25  31  23  25  27  29  31 110
> 
> listA[1]
[[1]]
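[1] 1 2 3 4 5

As the above example shows, a member of a list can be of any data type, including a list itself. Note that single brackets, as in listA[1], return a list containing the selected element rather than the element itself.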
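
The difference between single and double brackets is a common source of confusion, so here is a small sketch using a named list. The list contents and names are hypothetical and not taken from R11.Lists.R.

```r
# A named list with elements of different types (hypothetical data).
person <- list(name = "Tom", scores = c(90, 85), active = TRUE)

sub <- person["scores"]     # single brackets: a list of length 1
val <- person[["scores"]]   # double brackets: the element itself
nm  <- person$name          # $ is shorthand for [[ ]] with a name

print(class(sub))  # "list"
print(class(val))  # "numeric"
print(mean(val))   # 87.5
print(nm)          # "Tom"
```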

Matrix

A matrix is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns. Matrices containing numeric elements can be used in mathematical calculations.

The following diagram about matrices is cited from Wikipedia:

[Image: matrix diagram from Wikipedia]

Create and access matrix

A matrix can be created using the function:

matrix(data, nrow, ncol, byrow, dimnames)

For example, the following script (R12.Matrices.R) creates three matrices using different approaches. The output below shows class(MA) as "matrix"; in R 4.0.0 and later, it returns "matrix" "array".

> # Matrices
> MA <- matrix(1:10,nrow=2,ncol=5, byrow = TRUE)
> class(MA)
[1] "matrix"
> MB <- matrix(11:20,nrow=2,ncol=5)
> 
> MC <- matrix(31:40, nrow=5, ncol=2, dimnames = list(c('row1','row2','row3','row4','row5'),c('col1','col2')))
> View(MC)

Matrix MC is created with dimension names:

[Screenshot: matrix MC in the data viewer, with row names row1 to row5 and column names col1 and col2]

Accessing matrix

Matrix elements can be accessed via row and column indexes. Leaving the row or column index empty selects the whole column or row.

> # Accessing
> MC[1,2]
[1] 36
> MC[1,]
col1 col2 
  31   36 
> # class is integer vector
> class(MC[1,])
[1] "integer"
> MC[,2]
row1 row2 row3 row4 row5 
  36   37   38   39   40 
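
As the transcript shows, extracting a single row yields a plain vector. The following sketch illustrates two related options that are not in the original script: indexing by dimension names, and passing drop = FALSE to keep the matrix structure. MC is rebuilt here so the block runs on its own.

```r
# Rebuild MC as in the article so this block is self-contained.
MC <- matrix(31:40, nrow = 5, ncol = 2,
             dimnames = list(paste0("row", 1:5), c("col1", "col2")))

by_name  <- MC["row1", "col2"]      # index by dimension names
as_vec   <- MC[1, ]                 # dimensions are dropped: a named vector
as_mat   <- MC[1, , drop = FALSE]   # drop = FALSE keeps a 1 x 2 matrix

print(by_name)            # 36
print(is.matrix(as_vec))  # FALSE
print(dim(as_mat))        # 1 2
```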

Matrix arithmetic operations

Matrices support arithmetic operators such as +, -, * and /. These operate element by element, and the result is a matrix with the same number of rows and columns.

> # matrix operations
> MA+MB
     [,1] [,2] [,3] [,4] [,5]
[1,]   12   15   18   21   24
[2,]   18   21   24   27   30
> MA-MB
     [,1] [,2] [,3] [,4] [,5]
[1,]  -10  -11  -12  -13  -14
[2,]   -6   -7   -8   -9  -10
> MA*MB
     [,1] [,2] [,3] [,4] [,5]
[1,]   11   26   45   68   95
[2,]   72   98  128  162  200
> MA/MB
           [,1]      [,2] [,3]      [,4]      [,5]
[1,] 0.09090909 0.1538462  0.2 0.2352941 0.2631579
[2,] 0.50000000 0.5000000  0.5 0.5000000 0.5000000
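
Note that * multiplies element by element. For the matrix product from linear algebra, R provides the %*% operator, and t() transposes a matrix so that the dimensions conform. This sketch is an addition to the original script and rebuilds MA and MB so it is self-contained.

```r
# Same matrices as in the article.
MA <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)
MB <- matrix(11:20, nrow = 2, ncol = 5)

# (2 x 5) %*% (5 x 2) gives a 2 x 2 matrix.
P <- MA %*% t(MB)

print(dim(t(MB)))  # 5 2
print(dim(P))      # 2 2
print(P[1, 1])     # 245, the sum of the first row of MA * MB
```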

Array

A matrix has only two dimensions. To store data in more than two dimensions, the array data type can be used.

The following diagram of an array of matrices is from Wikipedia:

[Image: diagram of an array of matrices from Wikipedia]

Create array

An array can be created using function array(). If an array is created with dimensions (m, n, k), the total element count is m*n*k.

The signature of the function is:

array(data = NA, dim = length(data), dimnames = NULL)

The following are some examples of creating arrays (script R13.Arrays.R):

> # Array
> 
> vector<- c(1:30)
> 
> column.names=c("col1","col2","col3")
> row.names=c("row1","row2")
> matrix.names=c("Matrix1","Matrix2","Matrix3","Matrix4","Matrix5")
> 
> myArray <- array(vector,dim=c(2,3,5), dimnames = list(row.names,column.names, matrix.names))
> 
> print(myArray)
, , Matrix1

     col1 col2 col3
row1    1    3    5
row2    2    4    6

, , Matrix2

     col1 col2 col3
row1    7    9   11
row2    8   10   12

, , Matrix3

     col1 col2 col3
row1   13   15   17
row2   14   16   18

, , Matrix4

     col1 col2 col3
row1   19   21   23
row2   20   22   24

, , Matrix5

     col1 col2 col3
row1   25   27   29
row2   26   28   30

Accessing array

We can use indexes to access array data, and a matrix can be extracted from an array by fixing one of its dimensions.

> # Access array
> 
> myArray[1,2,3]
[1] 15
> 
> myArray[,,3]
     col1 col2 col3
row1   13   15   17
row2   14   16   18
> class(myArray[,,3])
[1] "matrix"
> 
> myArray[1,,3]
col1 col2 col3 
  13   15   17 
> 
> # Create matrices from array
> m1 <- myArray[,,3]
> m2 <- myArray[,,4]
> m1+m2
     col1 col2 col3
row1   32   36   40
row2   34   38   42

Apply arithmetic calculations

Function apply() can be used to apply a calculation over one or more dimensions (margins) of an array.

Here are some examples:

> # Apply
> 
> # Use apply to calculate the sum of the rows across all the matrices.
> apply(myArray, c(1), sum)
row1 row2 
 225  240 
> # Use apply to calculate the sum of the columns across all the matrices.
> apply(myArray, c(2), sum)
col1 col2 col3 
 135  155  175 
> # Use apply to calculate the sum of the rows and columns across all the matrices.
> apply(myArray, c(1,2), sum)
     col1 col2 col3
row1   65   75   85
row2   70   80   90

The second parameter (MARGIN) controls whether the sum is calculated per row, per column, or per row and column combination across all the matrices.
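
The margin can also be the third dimension, which gives one result per matrix in the array. The following sketch is an addition to the examples above; it rebuilds the array without dimension names to keep the block short.

```r
# Same data and shape as myArray in the article, without dimnames.
myArray <- array(1:30, dim = c(2, 3, 5))

# MARGIN = 3: one sum per matrix (slice along the third dimension).
per_matrix <- apply(myArray, 3, sum)

# Any function can be applied, for example the mean of each cell across matrices.
cell_means <- apply(myArray, c(1, 2), mean)

print(per_matrix)        # 21 57 93 129 165
print(cell_means[1, 1])  # 13
```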

Factor

A factor is used to categorize data and stores the distinct categories as levels. A factor can be created via function factor(). The following are some examples of categories suited to factors:

  • Male, Female
  • West, East, North, South

Factors are commonly used in data frames (covered in the following section). Commonly used factor functions include factor(), levels() and is.factor().

The following are some examples about factor (script R14.Factors.R):

> # Factors
> 
> d1<-c("Male","Female","Male","Male")
> d1f<-factor(d1)
> levels(d1f)
[1] "Female" "Male"  
> length(d1f)
[1] 4
> 
> # Change order
> d2f<-factor(d1, levels=c("Male","Female"))
> levels(d2f)
[1] "Male"   "Female"
> 
> # Change labels (levels must match the data; labels rename them)
> d3f<-factor(d1, levels=c("Female","Male"), labels=c("F","M"))
> levels(d3f)
[1] "F" "M"
> 
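
The following sketch shows a few more things that can be done with a factor: counting values per level, inspecting the underlying integer codes, and what happens when the supplied levels do not match the data. It is an illustrative addition, and the variable names are not from R14.Factors.R.

```r
d1 <- c("Male", "Female", "Male", "Male")
f  <- factor(d1, levels = c("Male", "Female"))

counts <- table(f)        # frequency of each level
codes  <- as.integer(f)   # internal integer codes, following the level order

# labels renames the levels while keeping the mapping to the data.
relabelled <- factor(d1, levels = c("Female", "Male"), labels = c("F", "M"))

# Levels that do not occur in the data turn every value into NA.
unmatched <- factor(d1, levels = c("F", "M"))

print(counts)                    # Male 3, Female 1
print(codes)                     # 1 2 1 1
print(as.character(relabelled))  # "M" "F" "M" "M"
print(all(is.na(unmatched)))     # TRUE
```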

Data Frame

The data frame is the most important data type for most people who work with data in R. A data frame is a table, or two-dimensional array-like structure. Each column contains the values of one variable, which can be numeric, factor, character or another type such as date. Each row contains one value from each column.

The following table illustrates a sample data frame:

  CustomerID CustomerName DateOfBirth Balance
1      10001         John  1990-01-01  600.11
2      10002          Tom  1991-01-01 1278.10
3      10003         Lily  2000-07-06    3000

Create data frame

A data frame can be created using function data.frame(). In R versions before 4.0.0, character variables are converted to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE. The structure of a data frame can be viewed via function str(). A statistical summary can be obtained via function summary().

The following example creates a data frame and then displays its structure and statistical summary (script R15.CreatingDataFrame.R):

> # Create data frames
> 
> customer.data <- data.frame(
+   CustomerID = 10001:10003,
+   CustomerName = c('John','Tom','Lily'),
+   DateOfBirth = as.Date(c("1990-01-01","1991-01-01","2000-07-26")),
+   Balance = c(600.11,1278.10,3000),
+   stringsAsFactors = FALSE
+ )
> 
> # Vector of date
> class(customer.data$DateOfBirth)
[1] "Date"
> 
> # display the structure of data frame
> str(customer.data)
'data.frame':	3 obs. of  4 variables:
 $ CustomerID  : int  10001 10002 10003
 $ CustomerName: chr  "John" "Tom" "Lily"
 $ DateOfBirth : Date, format: "1990-01-01" "1991-01-01" ...
 $ Balance     : num  600 1278 3000
> 
> # statistic summary
> summary(customer.data)
   CustomerID    CustomerName        DateOfBirth            Balance      
 Min.   :10001   Length:3           Min.   :1990-01-01   Min.   : 600.1  
 1st Qu.:10002   Class :character   1st Qu.:1990-07-02   1st Qu.: 939.1  
 Median :10002   Mode  :character   Median :1991-01-01   Median :1278.1  
 Mean   :10002                      Mean   :1993-11-09   Mean   :1626.1  
 3rd Qu.:10002                      3rd Qu.:1995-10-14   3rd Qu.:2139.1  
 Max.   :10003                      Max.   :2000-07-26   Max.   :3000.0  

Accessing and expanding data frame

Data can be extracted from a data frame by column name or by row and column index; selected columns can also be combined into a new data frame with function data.frame(). Columns can be added dynamically to an existing data frame (dataFrame$variableName <- values) and dropped again (dataFrame$variableName <- c(), which is equivalent to assigning NULL). Rows can be added using function rbind(), and columns can be combined using cbind().

The following are some examples (script R16.AccessingAndExpandDataFrame.R): 

1) Extract only a few columns:

> # Extract only a few columns
> data.frame(customer.data$CustomerID, customer.data$Balance)
  customer.data.CustomerID customer.data.Balance
1                    10001                600.11
2                    10002               1278.10
3                    10003               3000.00

2) Extract only a few rows:

> # Extract only a few rows
> # extract the first row
> customer.data[1,]
  CustomerID CustomerName DateOfBirth Balance
1      10001         John  1990-01-01  600.11
> # extract the second and the third rows
> customer.data[2:3,]
  CustomerID CustomerName DateOfBirth Balance
2      10002          Tom  1991-01-01  1278.1
3      10003         Lily  2000-07-26  3000.0

3) Extract via column name:

> # extract via column name
> customer.data[2:3,c("CustomerID","Balance")]
  CustomerID Balance
2      10002  1278.1
3      10003  3000.0
> customer.data[c("CustomerID","Balance")]
  CustomerID Balance
1      10001  600.11
2      10002 1278.10
3      10003 3000.00

4) Extract via column index:

> # extract via column index
> customer.data[2:3, c(1,2,4)]
  CustomerID CustomerName Balance
2      10002          Tom  1278.1
3      10003         Lily  3000.0
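
Besides fixed indexes, rows are often selected by a condition on a column. The sketch below rebuilds customer.data from the article and filters it in two ways; the Balance threshold of 1000 is an arbitrary value chosen for illustration.

```r
# Rebuild the data frame from the article so this block is self-contained.
customer.data <- data.frame(
  CustomerID = 10001:10003,
  CustomerName = c("John", "Tom", "Lily"),
  DateOfBirth = as.Date(c("1990-01-01", "1991-01-01", "2000-07-26")),
  Balance = c(600.11, 1278.10, 3000),
  stringsAsFactors = FALSE
)

# A logical vector in the row position keeps the rows where it is TRUE.
high <- customer.data[customer.data$Balance > 1000, ]

# subset() expresses the same filter and can select columns at the same time.
high2 <- subset(customer.data, Balance > 1000, select = c(CustomerID, Balance))

print(high$CustomerName)  # "Tom" "Lily"
print(dim(high2))         # 2 2
```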

5) Add new column:

> # Add new column
> customer.data$IsClosed <- c(FALSE,TRUE,FALSE)
> View(customer.data)

The data frame now includes the additional IsClosed column:

[Screenshot: customer.data in the data viewer with the new IsClosed column]

6) Drop column:

> # Drop column
> customer.data$IsClosed <- c()
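
Assigning c() works because c() with no arguments evaluates to NULL. A sketch of the more explicit NULL form, plus one way to drop columns by name, follows. The small data frame and its column names are hypothetical.

```r
# A small hypothetical data frame.
df <- data.frame(id = 1:3, name = c("a", "b", "c"),
                 flag = c(TRUE, FALSE, TRUE), stringsAsFactors = FALSE)

# Assigning NULL removes a column; c() evaluates to NULL as well.
df$flag <- NULL

# Drop columns by name while keeping the data frame structure.
keep <- setdiff(names(df), "name")
df2  <- df[, keep, drop = FALSE]

print(is.null(c()))  # TRUE
print(names(df))     # "id" "name"
print(names(df2))    # "id"
```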

7) Add rows:

> # Add rows
> customer.data2 <- data.frame(
+   CustomerID = 10004:10006,
+   CustomerName = c('Lucy','Jack','Rose'),
+   DateOfBirth = as.Date(c("1997-01-01","1998-02-07","2001-07-12")),
+   Balance = c(700.11,27,1937.22),
+   stringsAsFactors = FALSE
+ )
> View(customer.data2)
> 
> # Row bind
> customer.data.all <- rbind(customer.data, customer.data2)
> summary(customer.data.all)
   CustomerID    CustomerName        DateOfBirth            Balance      
 Min.   :10001   Length:6           Min.   :1990-01-01   Min.   :  27.0  
 1st Qu.:10002   Class :character   1st Qu.:1992-07-02   1st Qu.: 625.1  
 Median :10004   Mode  :character   Median :1997-07-21   Median : 989.1  
 Mean   :10004                      Mean   :1996-05-14   Mean   :1257.1  
 3rd Qu.:10005                      3rd Qu.:1999-12-14   3rd Qu.:1772.4  
 Max.   :10006                      Max.   :2001-07-12   Max.   :3000.0 

Data frame customer.data2 looks like the following:

[Screenshot: customer.data2 in the data viewer, showing customers 10004 to 10006]

Function rbind() was used to combine the two data frames into one.

8) Column bind:

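The following example combines vectors using the cbind function. Note that cbind() applied to plain vectors returns a matrix rather than a data frame, so all values are coerced to a common type (character here) and the dates appear as their underlying day counts: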

> # Column bind
> CustomerID = 10001:10003
> CustomerName = c('John','Tom','Lily')
> DateOfBirth = as.Date(c("1990-01-01","1991-01-01","2000-07-26"))
> Balance = c(600.11,1278.10,3000)
> 
> dataFrame <- cbind(CustomerID, CustomerName, DateOfBirth, Balance)
> print(dataFrame)
     CustomerID CustomerName DateOfBirth Balance 
[1,] "10001"    "John"       "7305"      "600.11"
[2,] "10002"    "Tom"        "7670"      "1278.1"
[3,] "10003"    "Lily"       "11164"     "3000"  
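
To keep each column's own type, one option is to build the result with data.frame(), or to call cbind() with a data frame as the first argument. The following sketch contrasts the two outcomes; the IsClosed values are reused from the earlier step, and the variable names m, df and df2 are illustrative.

```r
CustomerID   <- 10001:10003
CustomerName <- c("John", "Tom", "Lily")
DateOfBirth  <- as.Date(c("1990-01-01", "1991-01-01", "2000-07-26"))
Balance      <- c(600.11, 1278.10, 3000)

# cbind() on plain vectors: a character matrix.
m <- cbind(CustomerID, CustomerName, DateOfBirth, Balance)

# data.frame() keeps one type per column.
df <- data.frame(CustomerID, CustomerName, DateOfBirth, Balance,
                 stringsAsFactors = FALSE)

# cbind() with a data frame as first argument returns a data frame.
df2 <- cbind(df, IsClosed = c(FALSE, TRUE, FALSE))

print(class(m[, "Balance"]))   # "character"
print(class(df$DateOfBirth))   # "Date"
print(class(df$Balance))       # "numeric"
print(ncol(df2))               # 5
```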

Summary

Now you should be familiar with these frequently used data types in R. This series will continue with more usage examples for these data types.
