Generating and Transforming Data with R
In many scenarios, we need to generate data directly in memory. This article provides examples of generating regular and random sequences with R. It also shows how to reshape or restructure data.
Generate regular sequence
In the preceding articles, we already used quite a few functions to generate regular sequence data in R. The following are the commonly used ones:
- n:m
- seq(from, to, by, ...)
- scan()
- sequence()
- rep()
- gl()
- expand.grid()
- sample()
The following are some code examples using the above functions (script R21.GenerateSequenceData.R).
n:m
> # n:m
> v <- 1:100
> print(v)
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 [21]  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40
 [41]  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
 [61]  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80
 [81]  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
seq
> # seq
> v <- seq(from=10,to=20.3,by=0.5)
> print(v)
 [1] 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5
[17] 18.0 18.5 19.0 19.5 20.0
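When the number of points matters more than the step size, seq() also accepts length.out, and seq_len() and seq_along() are common choices for building index vectors. The following is a minimal sketch; the variable names are illustrative and not part of the article's script.

```r
# Five evenly spaced points between 0 and 1; R computes the step
grid <- seq(from = 0, to = 1, length.out = 5)
print(grid)

# seq_len(n) gives 1..n and returns an empty vector when n is 0,
# whereas 1:n counts down from 1 to 0
n <- 0
idx_safe <- seq_len(n)
idx_risky <- 1:n

# seq_along(x) gives the index positions of x
x <- c("a", "b", "c")
pos <- seq_along(x)
print(pos)
```

Using seq_len() or seq_along() in loops avoids the surprise of iterating over c(1, 0) when the input is empty.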
scan
> v <- scan()
1: 12
2: 12
3: 11
4: 10
5: 
Read 4 items
> print(v)
[1] 12 12 11 10
sequence
> # sequence specifies the end point of each element
> sequence(3:6)
 [1] 1 2 3 1 2 3 4 1 2 3 4 5 1 2 3 4 5 6
> sequence(c(3,5,6))
 [1] 1 2 3 1 2 3 4 5 1 2 3 4 5 6
rep
> # rep
> rep(1:4, 2)
[1] 1 2 3 4 1 2 3 4
> rep(1:4, each = 2) # not the same.
[1] 1 1 2 2 3 3 4 4
> rep(1:4, c(2,2,2,2)) # same as second.
[1] 1 1 2 2 3 3 4 4
> rep(1:4, c(2,1,2,1))
[1] 1 1 2 3 3 4
> rep(1:4, each = 2, len = 4) # first 4 only.
[1] 1 1 2 2
> rep(1:4, each = 2, len = 10) # 8 integers plus two recycled 1's.
 [1] 1 1 2 2 3 3 4 4 1 1
> rep(1:4, each = 2, times = 3) # length 24, 3 complete replications
 [1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4
>
> rep(1, 40*(1-.8)) # length 7 on most platforms
[1] 1 1 1 1 1 1 1
> rep(1, 40*(1-.8)+1e-7) # better
[1] 1 1 1 1 1 1 1 1
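The last two calls above behave differently because 40*(1-.8) is computed in floating point and typically lands just below 8, and rep() truncates a fractional count. As an alternative to adding a small epsilon, the count can be rounded explicitly. This is a small sketch with illustrative variable names:

```r
# The computed count is usually not exactly 8 because of floating-point rounding
n_raw <- 40 * (1 - 0.8)
print(n_raw == 8)

# Rounding first makes the intended number of copies explicit
n_rounded <- round(n_raw)
v <- rep(1, n_rounded)
print(length(v))
```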
We can also replicate a list:
> ## replicate a list
> test <- list(l1 = 1:10, name = "test")
> rep(test, 5)
$l1
 [1]  1  2  3  4  5  6  7  8  9 10

$name
[1] "test"

$l1
 [1]  1  2  3  4  5  6  7  8  9 10

$name
[1] "test"

$l1
 [1]  1  2  3  4  5  6  7  8  9 10

$name
[1] "test"

$l1
 [1]  1  2  3  4  5  6  7  8  9 10

$name
[1] "test"

$l1
 [1]  1  2  3  4  5  6  7  8  9 10

$name
[1] "test"
gl
> ### gl
> gl(2, 8, labels = c("Apple", "Pear"))
 [1] Apple Apple Apple Apple Apple Apple Apple Apple Pear  Pear  Pear  Pear  Pear  Pear
[15] Pear  Pear
Levels: Apple Pear
expand.grid
> ### expand.grid to create a data frame from all combinations of the supplied vectors or factors.
> expand.grid(age=c(20,30), is_active=c(TRUE, FALSE), sex=c("Male", "Female"))
  age is_active    sex
1  20      TRUE   Male
2  30      TRUE   Male
3  20     FALSE   Male
4  30     FALSE   Male
5  20      TRUE Female
6  30      TRUE Female
7  20     FALSE Female
8  30     FALSE Female
sample
> #Sample
> sample(1:100, 10, replace=TRUE)
 [1] 30 98 11 87 87 36 72 53 88  8
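Because sample() draws at random, the numbers above will differ from run to run. A minimal sketch of making the draw repeatable with set.seed() follows; the seed value 123 is arbitrary and the variable names are illustrative.

```r
# Fix the random number generator state so the draw can be repeated
set.seed(123)
first <- sample(1:100, 10, replace = TRUE)

# Resetting the same seed reproduces the same draw
set.seed(123)
second <- sample(1:100, 10, replace = TRUE)
print(identical(first, second))

# With no size argument and no replacement, sample() returns a random permutation
perm <- sample(1:10)
print(perm)
```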
Generate random data
R has numerous functions for generating random data. For each distribution they follow a common naming pattern:
- rfunc(n, p1, p2, ...): random values
  func: name of the probability distribution (for example norm, exp, beta)
  n: number of values to generate
  p1, p2, ...: parameters of the distribution
- dfunc(x, ...): probability density (probability mass for discrete distributions)
- pfunc(x, ...): cumulative distribution function
- qfunc(p, ...), with 0 < p < 1: quantile for probability p
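As an illustration of the four prefixes, the sketch below applies them to the normal distribution (func = norm). The value 1.96 and the seed are arbitrary choices for the example.

```r
x <- 1.96

# d: height of the standard normal density at x
dens <- dnorm(x)

# p: cumulative probability P(X <= x)
prob <- pnorm(x)

# q: inverse of p, so it maps the probability back to x
back <- qnorm(prob)

# r: random draws from the same distribution
set.seed(1)
draws <- rnorm(5)

print(c(density = dens, probability = prob, quantile = back))
print(draws)
```

The p and q functions are inverses of each other, which is why qnorm(pnorm(x)) returns x up to floating-point error.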
Functions summary
Law | R Function |
Gaussian | rnorm(n, mean=0, sd=1) |
exponential | rexp(n, rate=1) |
gamma | rgamma(n, shape, scale=1) |
Poisson | rpois(n, lambda) |
Weibull | rweibull(n, shape, scale=1) |
Cauchy | rcauchy(n, location=0, scale=1) |
beta | rbeta(n, shape1, shape2) |
‘Student’ (t) | rt(n, df) |
Fisher–Snedecor (F) | rf(n, df1, df2) |
Pearson (χ²) | rchisq(n, df) |
geometric | rgeom(n, prob) |
binomial | rbinom(n, size, prob) |
negative binomial | rnbinom(n, size, prob) |
multinomial | rmultinom(n, size, prob) |
hypergeometric | rhyper(nn, m, n, k) |
logistic | rlogis(n, location=0, scale=1) |
lognormal | rlnorm(n, meanlog=0, sdlog=1) |
uniform | runif(n, min=0, max=1) |
Wilcoxon’s statistics | rwilcox(nn, m, n), rsignrank(nn, n) |
Examples
The following example generates data using the functions rnorm and rbeta (script R22.GenerateRandomData.R):
> # Generate random data
>
> rnorm(10, mean =10, sd=1)
 [1] 11.433660 10.327914  9.514077 11.336433 10.504968  9.110282  9.279272 10.580554
 [9] 10.820820 10.233926
> rbeta(10,5,1)
 [1] 0.6665865 0.9663545 0.7827179 0.9677068 0.7099742 0.3413335 0.6474135 0.9994693
 [9] 0.6543541 0.8248454
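One way to sanity-check generated data is to compare a large sample against the known mean of the distribution. The sketch below does this for the two calls above; the sample size and seed are arbitrary, and the theoretical mean of a beta distribution is shape1 / (shape1 + shape2).

```r
set.seed(42)      # arbitrary seed for repeatability
n <- 100000       # a large sample so the averages settle near the theory

x_norm <- rnorm(n, mean = 10, sd = 1)
x_beta <- rbeta(n, shape1 = 5, shape2 = 1)

# Sample means next to the theoretical means
print(c(sample = mean(x_norm), theory = 10))
print(c(sample = mean(x_beta), theory = 5 / (5 + 1)))
```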
Data reshape or restructure
In the data wrangling process, data frequently needs to be reshaped or restructured into a format that downstream processes can use easily. The following list includes some commonly used transformations and the corresponding R functions:
- Merge data frames (join): merge()
- Melting (unpivot): melt(), from the reshape package
- Casting (with or without aggregation): cast(), from the reshape package
- Transpose (reversing rows and columns): t()
- Aggregate: aggregate(x, by, FUN)
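- Convert data frame to matrix: data.matrix()
Of the functions in the above list, I use merge the most. It supports all four join types: inner join (the default), left outer join (all.x = TRUE), right outer join (all.y = TRUE) and full outer join (all = TRUE).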
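The four join types correspond to the all, all.x and all.y arguments of merge(). A minimal sketch on two small, made-up data frames (the names left, right, l_val and r_val are hypothetical):

```r
left  <- data.frame(id = c(1, 2, 3), l_val = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), r_val = c("x", "y", "z"))

# Inner join (default): only ids present in both tables
inner <- merge(left, right, by = "id")

# Left outer join: every row of 'left', NA where 'right' has no match
left_outer <- merge(left, right, by = "id", all.x = TRUE)

# Right outer join: every row of 'right', NA where 'left' has no match
right_outer <- merge(left, right, by = "id", all.y = TRUE)

# Full outer join: every id from either table
full_outer <- merge(left, right, by = "id", all = TRUE)

print(full_outer)
```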
The following sections provide examples about these transformations (script R23.DataReshaping.R).
Merge examples
Note in the following example that whether NA values are matched depends on the 'incomparables' parameter.
> ###### merge
> authors <- data.frame(
+ surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+ nationality = c("US", "Australia", "US", "UK", "Australia"),
+ deceased = c("yes", rep("no", 4)))
> books <- data.frame(
+ name = I(c("Tukey", "Venables", "Tierney",
+ "Ripley", "Ripley", "McNeil", "R Core")),
+ title = c("Exploratory Data Analysis",
+ "Modern Applied Statistics ...",
+ "LISP-STAT",
+ "Spatial Statistics", "Stochastic Simulation",
+ "Interactive Data Analysis",
+ "An Introduction to R"),
+ other.author = c(NA, "Ripley", NA, NA, NA, NA,
+ "Venables & Smith"))
>
> (m1 <- merge(authors, books, by.x = "surname", by.y = "name"))
   surname nationality deceased                         title other.author
1   McNeil   Australia       no     Interactive Data Analysis         <NA>
2   Ripley          UK       no            Spatial Statistics         <NA>
3   Ripley          UK       no         Stochastic Simulation         <NA>
4  Tierney          US       no                     LISP-STAT         <NA>
5    Tukey          US      yes     Exploratory Data Analysis         <NA>
6 Venables   Australia       no Modern Applied Statistics ...       Ripley
> (m2 <- merge(books, authors, by.x = "name", by.y = "surname"))
      name                         title other.author nationality deceased
1   McNeil     Interactive Data Analysis         <NA>   Australia       no
2   Ripley            Spatial Statistics         <NA>          UK       no
3   Ripley         Stochastic Simulation         <NA>          UK       no
4  Tierney                     LISP-STAT         <NA>          US       no
5    Tukey     Exploratory Data Analysis         <NA>          US      yes
6 Venables Modern Applied Statistics ...       Ripley   Australia       no
>
>
> ## "R core" is missing from authors and appears only here :
> merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)
   surname nationality deceased                         title     other.author
1   McNeil   Australia       no     Interactive Data Analysis             <NA>
2   R Core        <NA>     <NA>          An Introduction to R Venables & Smith
3   Ripley          UK       no            Spatial Statistics             <NA>
4   Ripley          UK       no         Stochastic Simulation             <NA>
5  Tierney          US       no                     LISP-STAT             <NA>
6    Tukey          US      yes     Exploratory Data Analysis             <NA>
7 Venables   Australia       no Modern Applied Statistics ...           Ripley
>
> ## example of using 'incomparables'
> x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)
> y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)
> merge(x, y, by = c("k1","k2")) # NA's match
  k1 k2 data.x data.y
1  4  4      4      4
2  5  5      5      5
3 NA NA      2      1
> merge(x, y, by = "k1") # NA's match, so 6 rows
  k1 k2.x data.x k2.y data.y
1  4    4      4    4      4
2  5    5      5    5      5
3 NA    1      1   NA      1
4 NA    1      1    3      3
5 NA   NA      2   NA      1
6 NA   NA      2    3      3
> merge(x, y, by = "k2", incomparables = NA) # 2 rows
  k2 k1.x data.x k1.y data.y
1  4    4      4    4      4
2  5    5      5    5      5
Melt examples
This example unpivots a wide table into a row-based (long) table.
> ###### melt ######
>
> (
+ studentMarks <- data.frame
+ (StudentID = c(1000:1005),
+ StudentName=c("Tom","Lily","Lucy","Li Lei","Han Meimei","Mike"),
+ English = rnorm(6,90,5),
+ Math = rnorm(6,80,10),
+ Chemistry = seq(from=60,to=90,length.out = 6)
+ )
+ )
  StudentID StudentName English Math Chemistry
1      1000         Tom    88.6 91.7        60
2      1001        Lily    94.7 79.9        66
3      1002        Lucy    92.9 69.5        72
4      1003      Li Lei    88.9 76.1        78
5      1004  Han Meimei    91.1 88.5        84
6      1005        Mike    83.0 75.4        90
>
> install.packages("reshape")
Installing package into ‘E:/Documents/R/win-library/3.4’
> # load package
> require(reshape)
Loading required package: reshape
>
> (
+ meltMarks <- melt(studentMarks, id=c("StudentID","StudentName"))
+ )
   StudentID StudentName  variable value
1       1000         Tom   English  88.6
2       1001        Lily   English  94.7
3       1002        Lucy   English  92.9
4       1003      Li Lei   English  88.9
5       1004  Han Meimei   English  91.1
6       1005        Mike   English  83.0
7       1000         Tom      Math  91.7
8       1001        Lily      Math  79.9
9       1002        Lucy      Math  69.5
10      1003      Li Lei      Math  76.1
11      1004  Han Meimei      Math  88.5
12      1005        Mike      Math  75.4
13      1000         Tom Chemistry  60.0
14      1001        Lily Chemistry  66.0
15      1002        Lucy Chemistry  72.0
16      1003      Li Lei Chemistry  78.0
17      1004  Han Meimei Chemistry  84.0
18      1005        Mike Chemistry  90.0
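If installing a package is not an option, base R's reshape() can produce a similar long layout. This is a sketch on a small made-up data frame rather than the article's studentMarks data; the column names variable and value are chosen here only to mirror the melt() output.

```r
# Hypothetical wide table with one column per subject
marks <- data.frame(StudentID = 1:3,
                    English = c(90, 85, 88),
                    Math = c(70, 95, 80))

# Stack the subject columns into (variable, value) pairs
long <- reshape(marks,
                direction = "long",
                varying = c("English", "Math"),   # columns to unpivot
                v.names = "value",                # name of the stacked value column
                timevar = "variable",             # column recording the source column
                times = c("English", "Math"),     # labels stored in 'variable'
                idvar = "StudentID")
print(long)
```

The result has one row per student and subject, with row names built from the id and the subject label.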
Cast examples
> # Cast molten data
> (
+ avgMarks <-cast(meltMarks, StudentName + StudentID ~ variable, max)
+ )
  StudentName StudentID English Math Chemistry
1  Han Meimei      1004    91.1 88.5        84
2      Li Lei      1003    88.9 76.1        78
3        Lily      1001    94.7 79.9        66
4        Lucy      1002    92.9 69.5        72
5        Mike      1005    83.0 75.4        90
6         Tom      1000    88.6 91.7        60
>
> class(avgMarks)
[1] "cast_df"    "data.frame"
>
> # order by student id
> (
+ avgMarks[with(avgMarks, order(StudentID)),]
+ )
  StudentName StudentID English Math Chemistry
6         Tom      1000    88.6 91.7        60
3        Lily      1001    94.7 79.9        66
4        Lucy      1002    92.9 69.5        72
2      Li Lei      1003    88.9 76.1        78
1  Han Meimei      1004    91.1 88.5        84
5        Mike      1005    83.0 75.4        90
>
>
> # Average values
> (
+ avgMarks <-cast(meltMarks, . ~ variable, mean)
+ )
  value English Math Chemistry
1 (all)    89.9 80.2        75
>
>
> # All other variables
> (
+ avgMarks <-cast(meltMarks, ... ~ variable, mean)
+ )
  StudentID StudentName English Math Chemistry
1      1000         Tom    88.6 91.7        60
2      1001        Lily    94.7 79.9        66
3      1002        Lucy    92.9 69.5        72
4      1003      Li Lei    88.9 76.1        78
5      1004  Han Meimei    91.1 88.5        84
6      1005        Mike    83.0 75.4        90
Transpose examples
> ### Transpose
> require(datasets)
> (cars <- mtcars[1:5,1:4])
                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
> t(cars)
     Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
mpg         21            21       22.8           21.4              18.7
cyl          6             6        4.0            6.0               8.0
disp       160           160      108.0          258.0             360.0
hp         110           110       93.0          110.0             175.0
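Note that t() applied to a data frame returns a matrix, not a data frame. The sketch below shows this on a small made-up data frame, together with data.matrix() from the list of transformations earlier in this article; all names here are illustrative.

```r
df <- data.frame(x = c(1.5, 2.5), y = c(10L, 20L), row.names = c("a", "b"))

# Transposing a data frame yields a matrix with the old column names as row names
m <- t(df)
print(is.matrix(m))
print(rownames(m))

# Convert back if a data frame is needed downstream
df_t <- as.data.frame(m)

# data.matrix() converts a data frame to a numeric matrix;
# factors are replaced by their internal integer codes
mixed <- data.frame(g = factor(c("lo", "hi", "lo")), v = c(1, 2, 3))
dm <- data.matrix(mixed)
print(dm)
```

Since factor levels are sorted alphabetically by default, "hi" is coded as 1 and "lo" as 2 in this example.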
Aggregating examples
> ### Aggregating
> options(digits=3)
> # attach to access by column name directly
> attach(studentMarks)
> (aggdata <-aggregate(studentMarks[c(1,3,4)], by=list(StudentID), FUN=mean, na.rm=TRUE))
  Group.1 StudentID English Math
1    1000      1000    88.6 91.7
2    1001      1001    94.7 79.9
3    1002      1002    92.9 69.5
4    1003      1003    88.9 76.1
5    1004      1004    91.1 88.5
6    1005      1005    83.0 75.4
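aggregate() also has a formula interface that takes a data argument, so the columns can be referenced without calling attach(). A minimal sketch on made-up data (the data frame and column names are hypothetical):

```r
scores <- data.frame(subject = c("English", "English", "Math", "Math"),
                     score = c(90, 80, 70, 60))

# Mean score per subject; the left side is the value, the right side the grouping
avg <- aggregate(score ~ subject, data = scores, FUN = mean)
print(avg)
```

Here each subject has several rows, so the mean actually summarizes a group rather than returning each row unchanged.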