Raymond Raymond

Generating and Transforming Data with R

2020-09-23

In many scenarios, we need to generate data directly in memory. This article provides examples of generating regular and random sequences with R. It also shows how to reshape or restructure data.

Generate regular sequence

In the preceding articles, we already used quite a few functions to generate regular sequence data in R. The following are the commonly used ones:

  • n:m
  • seq(from, to, by, ...)
  • scan()
  • sequence()
  • rep()
  • gl()
  • expand.grid()
  • sample()

The following are some code examples using the above functions (script R21.GenerateSequenceData.R).

n:m

> # n:m
> v <- 1:100
> print(v)
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 [21]  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40
 [41]  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
 [61]  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80
 [81]  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

seq

> # seq
> v <- seq(from=10,to=20.3,by=0.5)
> print(v)
 [1] 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5
[17] 18.0 18.5 19.0 19.5 20.0
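Besides a fixed step (by), seq() can also be asked for a fixed number of points, and the helpers seq_len() and seq_along() avoid a common pitfall of n:m when n can be zero. The following is a minimal sketch; the variable names are only for illustration.

```r
# Ask for a fixed number of points instead of a fixed step
v1 <- seq(from = 10, to = 20, length.out = 5)   # 10.0 12.5 15.0 17.5 20.0

# seq_len() handles a length of zero correctly, unlike 1:n
n <- 0
v2 <- seq_len(n)   # integer(0): a loop over it runs zero times
v3 <- 1:n          # c(1, 0): a loop over it runs twice

# seq_along() generates the index positions of an existing vector
x <- c("a", "b", "c")
v4 <- seq_along(x) # 1 2 3
```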

scan

> v <- scan()
1: 12
2: 12
3: 11
4: 10
5: 
Read 4 items
> print(v)
[1] 12 12 11 10

sequence

> # sequence(nvec): for each element n of nvec, generate 1:n and concatenate the results
> sequence(3:6)
 [1] 1 2 3 1 2 3 4 1 2 3 4 5 1 2 3 4 5 6
> sequence(c(3,5,6))
 [1] 1 2 3 1 2 3 4 5 1 2 3 4 5 6
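One way to read the output above is that sequence() behaves like calling seq_len() on each element and gluing the pieces together. The sketch below checks that equivalence for the default arguments; nvec is just an illustrative name.

```r
nvec <- c(3, 5, 6)

a <- sequence(nvec)
b <- unlist(lapply(nvec, seq_len))   # 1:3, 1:5 and 1:6 concatenated

# Compare as integers so the check does not depend on the storage type
identical(as.integer(a), as.integer(b))   # TRUE
```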

rep

> # rep
> rep(1:4, 2)
[1] 1 2 3 4 1 2 3 4
> rep(1:4, each = 2)       # not the same.
[1] 1 1 2 2 3 3 4 4
> rep(1:4, c(2,2,2,2))     # same as second.
[1] 1 1 2 2 3 3 4 4
> rep(1:4, c(2,1,2,1))
[1] 1 1 2 3 3 4
> rep(1:4, each = 2, len = 4)    # first 4 only.
[1] 1 1 2 2
> rep(1:4, each = 2, len = 10)   # 8 integers plus two recycled 1's.
 [1] 1 1 2 2 3 3 4 4 1 1
> rep(1:4, each = 2, times = 3)  # length 24, 3 complete replications
 [1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4
> 
> rep(1, 40*(1-.8)) # length 7 on most platforms
[1] 1 1 1 1 1 1 1
> rep(1, 40*(1-.8)+1e-7) # better
[1] 1 1 1 1 1 1 1 1
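The "length 7" result above comes from floating-point arithmetic: 40*(1-.8) is stored as a value slightly below 8, and rep() truncates a fractional count. Adding a small epsilon works, but rounding the count explicitly is arguably easier to read. A minimal sketch (n_raw is an illustrative name):

```r
n_raw <- 40 * (1 - 0.8)

# Show the stored value with full precision; on most platforms it is
# slightly below 8 rather than exactly 8
print(n_raw, digits = 17)

# Round the count explicitly before using it as 'times'
n <- round(n_raw)
length(rep(1, n))   # 8
```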

We can also replicate a list:

> ## replicate a list
> test <- list(l1 = 1:10, name = "test")
> rep(test, 5)
$l1
 [1]  1  2  3  4  5  6  7  8  9 10

$name
[1] "test"

$l1
 [1]  1  2  3  4  5  6  7  8  9 10

$name
[1] "test"

$l1
 [1]  1  2  3  4  5  6  7  8  9 10

$name
[1] "test"

$l1
 [1]  1  2  3  4  5  6  7  8  9 10

$name
[1] "test"

$l1
 [1]  1  2  3  4  5  6  7  8  9 10

$name
[1] "test"

gl

> ### gl
> gl(2, 8, labels = c("Apple", "Pear"))
 [1] Apple Apple Apple Apple Apple Apple Apple Apple Pear  Pear  Pear  Pear  Pear  Pear 
[15] Pear  Pear 
Levels: Apple Pear
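gl(n, k) generates a factor with n levels, each repeated k times. The sketch below builds what should be the same values with factor() and rep() to make that pattern explicit, and uses the length argument to get an alternating pattern instead.

```r
f1 <- gl(2, 8, labels = c("Apple", "Pear"))

# The same values spelled out with rep() and factor()
f2 <- factor(rep(c("Apple", "Pear"), each = 8), levels = c("Apple", "Pear"))
all(as.character(f1) == as.character(f2))   # TRUE

# k = 1 with a longer 'length' recycles the levels: Apple Pear Apple Pear ...
alt <- gl(2, 1, length = 6, labels = c("Apple", "Pear"))
```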

expand.grid

> ### expand.grid to create a data frame from all combinations of the supplied vectors or factors.
> expand.grid(age=c(20,30), is_active=c(TRUE, FALSE), sex=c("Male", "Female"))
  age is_active    sex
1  20      TRUE   Male
2  30      TRUE   Male
3  20     FALSE   Male
4  30     FALSE   Male
5  20      TRUE Female
6  30      TRUE Female
7  20     FALSE Female
8  30     FALSE Female
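Two properties of expand.grid() are visible in the output above: the number of rows is the product of the lengths of the inputs, and the first argument varies fastest. The sketch below checks both; stringsAsFactors = FALSE is an optional choice made here to keep the text column as character.

```r
grid <- expand.grid(age = c(20, 30),
                    is_active = c(TRUE, FALSE),
                    sex = c("Male", "Female"),
                    stringsAsFactors = FALSE)

nrow(grid) == 2 * 2 * 2   # TRUE: one row per combination
grid$age[1:4]             # 20 30 20 30: the first argument varies fastest
class(grid$sex)           # "character" because stringsAsFactors = FALSE
```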

sample

> #Sample
> sample(1:100, 10, replace=TRUE)
 [1] 30 98 11 87 87 36 72 53 88  8
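Unlike the other functions in this section, sample() draws values at random, so the output changes on every run. If reproducible output is needed, the random seed can be fixed first. A minimal sketch, with 42 as an arbitrary seed chosen for illustration:

```r
set.seed(42)
s1 <- sample(1:100, 10, replace = TRUE)

set.seed(42)
s2 <- sample(1:100, 10, replace = TRUE)

identical(s1, s2)        # TRUE: same seed, same draws

# Without replacement (the default) every drawn value is distinct
s3 <- sample(1:100, 10)
anyDuplicated(s3) == 0   # TRUE
```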

Generate random data

There are numerous R functions that can be used to generate random data. In general, they are in the following forms:

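  • rfunc(n, p1, p2, ...): random values drawn from the distribution

func: name of the probability distribution (for example norm)

n: number of values to generate

p1, p2, …: parameters of the distribution

  • dfunc(x, ...): probability density (probability mass for discrete distributions)
  • pfunc(x, ...): cumulative probability (the distribution function)
  • qfunc(p, ...), with 0 < p < 1: quantile corresponding to probability p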
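As an illustration of the four prefixes, the sketch below uses the normal distribution (func = norm) with its default parameters mean = 0 and sd = 1. The approximate values in the comments are rounded.

```r
dnorm(0)       # density at 0, about 0.399
pnorm(1.96)    # P(X <= 1.96), about 0.975
qnorm(0.975)   # quantile for probability 0.975, about 1.96

# qnorm() and pnorm() are inverses of each other
p <- 0.3
all.equal(pnorm(qnorm(p)), p)   # TRUE

# rnorm() draws random values; the seed is arbitrary
set.seed(1)
x <- rnorm(5)
```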

Functions summary

Law                      R function
Gaussian                 rnorm(n, mean=0, sd=1)
exponential              rexp(n, rate=1)
gamma                    rgamma(n, shape, scale=1)
Poisson                  rpois(n, lambda)
Weibull                  rweibull(n, shape, scale=1)
Cauchy                   rcauchy(n, location=0, scale=1)
beta                     rbeta(n, shape1, shape2)
‘Student’ (t)            rt(n, df)
Fisher–Snedecor (F)      rf(n, df1, df2)
Pearson (χ2)             rchisq(n, df)
geometric                rgeom(n, prob)
binomial                 rbinom(n, size, prob)
negative binomial        rnbinom(n, size, prob)
multinomial              rmultinom(n, size, prob)
hypergeometric           rhyper(nn, m, n, k)
logistic                 rlogis(n, location=0, scale=1)
lognormal                rlnorm(n, meanlog=0, sdlog=1)
uniform                  runif(n, min=0, max=1)
Wilcoxon’s statistics    rwilcox(nn, m, n), rsignrank(nn, n)

Examples

The following example generates data using the functions rnorm and rbeta (script R22.GenerateRandomData.R):

> # Generate random data
> 
> rnorm(10, mean =10, sd=1)
 [1] 11.433660 10.327914  9.514077 11.336433 10.504968  9.110282  9.279272 10.580554
 [9] 10.820820 10.233926
> rbeta(10,5,1)
 [1] 0.6665865 0.9663545 0.7827179 0.9677068 0.7099742 0.3413335 0.6474135 0.9994693
 [9] 0.6543541 0.8248454
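With only 10 values it is hard to see whether the draws follow the requested distribution. A quick sanity check is to draw a larger sample and compare the sample statistics with the theoretical ones. The sketch below assumes an arbitrary seed and a sample size of 10000; the Beta(5, 1) mean is 5 / (5 + 1).

```r
set.seed(123)   # arbitrary seed so the run is reproducible

x <- rnorm(10000, mean = 10, sd = 1)
mean(x)         # close to 10
sd(x)           # close to 1

b <- rbeta(10000, 5, 1)
mean(b)         # close to 5 / 6
range(b)        # beta values always lie between 0 and 1
```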

Data reshape or restructure

In the data wrangling process, data frequently needs to be reshaped or restructured into a format that downstream processes can use easily. The following list includes some commonly used transformations and the corresponding R functions:

  • Merge data frames (join): merge()
  • Melting (unpivot): melt(), from the reshape package
  • Casting (with or without aggregation): cast(), from the reshape package
  • Transpose (swapping rows and columns): t()
  • Aggregate: aggregate(x, by, FUN)
  • Convert data frame to matrix: data.matrix()

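Of the functions in the above list, merge() is the one I use most. It supports all four join types: inner join (the default), left outer join (all.x = TRUE), right outer join (all.y = TRUE) and full outer join (all = TRUE).

[Figure: the four join types]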
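merge() selects the join type through its all, all.x and all.y arguments. The following is a minimal sketch on two small, made-up data frames (left and right are hypothetical names, not part of the article's scripts):

```r
left  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

inner_j <- merge(left, right, by = "id")                 # ids 2, 3
left_j  <- merge(left, right, by = "id", all.x = TRUE)   # ids 1, 2, 3
right_j <- merge(left, right, by = "id", all.y = TRUE)   # ids 2, 3, 4
full_j  <- merge(left, right, by = "id", all = TRUE)     # ids 1, 2, 3, 4

# Rows without a match get NA in the columns of the other table
left_j$y[left_j$id == 1]   # NA
```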

The following sections provide examples of these transformations (script R23.DataReshaping.R).

Merge examples

One thing to notice in the following example is that NA values in the key columns match each other by default; the parameter 'incomparables' controls which values are excluded from matching.

> ###### merge
> authors <- data.frame(
+   surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+   nationality = c("US", "Australia", "US", "UK", "Australia"),
+   deceased = c("yes", rep("no", 4)))
> books <- data.frame(
+   name = I(c("Tukey", "Venables", "Tierney",
+              "Ripley", "Ripley", "McNeil", "R Core")),
+   title = c("Exploratory Data Analysis",
+             "Modern Applied Statistics ...",
+             "LISP-STAT",
+             "Spatial Statistics", "Stochastic Simulation",
+             "Interactive Data Analysis",
+             "An Introduction to R"),
+   other.author = c(NA, "Ripley", NA, NA, NA, NA,
+                    "Venables & Smith"))
> 
> (m1 <- merge(authors, books, by.x = "surname", by.y = "name"))
   surname nationality deceased                         title other.author
1   McNeil   Australia       no     Interactive Data Analysis         <NA>
2   Ripley          UK       no            Spatial Statistics         <NA>
3   Ripley          UK       no         Stochastic Simulation         <NA>
4  Tierney          US       no                     LISP-STAT         <NA>
5    Tukey          US      yes     Exploratory Data Analysis         <NA>
6 Venables   Australia       no Modern Applied Statistics ...       Ripley
> (m2 <- merge(books, authors, by.x = "name", by.y = "surname"))
      name                         title other.author nationality deceased
1   McNeil     Interactive Data Analysis         <NA>   Australia       no
2   Ripley            Spatial Statistics         <NA>          UK       no
3   Ripley         Stochastic Simulation         <NA>          UK       no
4  Tierney                     LISP-STAT         <NA>          US       no
5    Tukey     Exploratory Data Analysis         <NA>          US      yes
6 Venables Modern Applied Statistics ...       Ripley   Australia       no
> 
> 
> ## "R core" is missing from authors and appears only here :
> merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)
   surname nationality deceased                         title     other.author
1   McNeil   Australia       no     Interactive Data Analysis             <NA>
2   R Core        <NA>     <NA>          An Introduction to R Venables & Smith
3   Ripley          UK       no            Spatial Statistics             <NA>
4   Ripley          UK       no         Stochastic Simulation             <NA>
5  Tierney          US       no                     LISP-STAT             <NA>
6    Tukey          US      yes     Exploratory Data Analysis             <NA>
7 Venables   Australia       no Modern Applied Statistics ...           Ripley
> 
> ## example of using 'incomparables'
> x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)
> y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)
> merge(x, y, by = c("k1","k2")) # NA's match
  k1 k2 data.x data.y
1  4  4      4      4
2  5  5      5      5
3 NA NA      2      1
> merge(x, y, by = "k1") # NA's match, so 6 rows
  k1 k2.x data.x k2.y data.y
1  4    4      4    4      4
2  5    5      5    5      5
3 NA    1      1   NA      1
4 NA    1      1    3      3
5 NA   NA      2   NA      1
6 NA   NA      2    3      3
> merge(x, y, by = "k2", incomparables = NA) # 2 rows
  k2 k1.x data.x k1.y data.y
1  4    4      4    4      4
2  5    5      5    5      5

Melt examples

This example unpivots a wide table into a row-based (long) table.

> ###### melt ######
> 
> (
+   studentMarks <- data.frame
+   (StudentID = c(1000:1005), 
+    StudentName=c("Tom","Lily","Lucy","Li Lei","Han Meimei","Mike"),
+    English = rnorm(6,90,5),
+    Math = rnorm(6,80,10),
+    Chemistry = seq(from=60,to=90,length.out = 6)
+   )
+ )
  StudentID StudentName English Math Chemistry
1      1000         Tom    88.6 91.7        60
2      1001        Lily    94.7 79.9        66
3      1002        Lucy    92.9 69.5        72
4      1003      Li Lei    88.9 76.1        78
5      1004  Han Meimei    91.1 88.5        84
6      1005        Mike    83.0 75.4        90
> 
> install.packages("reshape")
Installing package into ‘E:/Documents/R/win-library/3.4’
> # load package
> require(reshape)
Loading required package: reshape
> 
> (
+   meltMarks <- melt(studentMarks, id=c("StudentID","StudentName"))
+ )
   StudentID StudentName  variable value
1       1000         Tom   English  88.6
2       1001        Lily   English  94.7
3       1002        Lucy   English  92.9
4       1003      Li Lei   English  88.9
5       1004  Han Meimei   English  91.1
6       1005        Mike   English  83.0
7       1000         Tom      Math  91.7
8       1001        Lily      Math  79.9
9       1002        Lucy      Math  69.5
10      1003      Li Lei      Math  76.1
11      1004  Han Meimei      Math  88.5
12      1005        Mike      Math  75.4
13      1000         Tom Chemistry  60.0
14      1001        Lily Chemistry  66.0
15      1002        Lucy Chemistry  72.0
16      1003      Li Lei Chemistry  78.0
17      1004  Han Meimei Chemistry  84.0
18      1005        Mike Chemistry  90.0
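If installing the reshape package is not an option, a similar unpivot can be sketched with reshape() from base R. This is an alternative to the melt() call above, not what the article's script uses; the small wide data frame and its values are made up for illustration.

```r
wide <- data.frame(StudentID = 1000:1002,
                   English   = c(88, 94, 92),
                   Math      = c(91, 79, 69))

# One row per (StudentID, subject) pair
long <- reshape(wide,
                direction = "long",
                varying   = list(c("English", "Math")),  # columns to stack
                v.names   = "value",                     # name of the stacked column
                timevar   = "variable",                  # column holding the subject
                times     = c("English", "Math"),
                idvar     = "StudentID")

nrow(long)   # 6: three students times two subjects
```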

Cast examples

> # Cast molten data
> (
+   avgMarks <-cast(meltMarks, StudentName + StudentID ~ variable, max)
+ )
  StudentName StudentID English Math Chemistry
1  Han Meimei      1004    91.1 88.5        84
2      Li Lei      1003    88.9 76.1        78
3        Lily      1001    94.7 79.9        66
4        Lucy      1002    92.9 69.5        72
5        Mike      1005    83.0 75.4        90
6         Tom      1000    88.6 91.7        60
> 
> class(avgMarks)
[1] "cast_df"    "data.frame"
> 
> # order by student id
> (
+   avgMarks[with(avgMarks, order(StudentID)),]
+ )
  StudentName StudentID English Math Chemistry
6         Tom      1000    88.6 91.7        60
3        Lily      1001    94.7 79.9        66
4        Lucy      1002    92.9 69.5        72
2      Li Lei      1003    88.9 76.1        78
1  Han Meimei      1004    91.1 88.5        84
5        Mike      1005    83.0 75.4        90
> 
> 
> # Average values
> (
+   avgMarks <-cast(meltMarks, . ~ variable, mean)
+ )
  value English Math Chemistry
1 (all)    89.9 80.2        75
> 
> 
> # All other variables
> (
+   avgMarks <-cast(meltMarks, ... ~ variable, mean)
+ )
  StudentID StudentName English Math Chemistry
1      1000         Tom    88.6 91.7        60
2      1001        Lily    94.7 79.9        66
3      1002        Lucy    92.9 69.5        72
4      1003      Li Lei    88.9 76.1        78
5      1004  Han Meimei    91.1 88.5        84
6      1005        Mike    83.0 75.4        90
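For comparison, roughly the same results can be sketched in base R: tapply() for the per-variable average (similar to the `. ~ variable` formula) and reshape() with direction = "wide" to pivot back. The long data frame below is made up for illustration, and the base R column names differ from those produced by cast().

```r
long <- data.frame(
  StudentID = rep(1000:1002, times = 2),
  variable  = rep(c("English", "Math"), each = 3),
  value     = c(88, 94, 92, 91, 79, 69)
)

# Average per subject
avg <- tapply(long$value, long$variable, mean)

# Pivot back to one row per student
wide <- reshape(long, direction = "wide",
                idvar = "StudentID", timevar = "variable")
names(wide)   # "StudentID" "value.English" "value.Math"
```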

Transpose examples

> ### Transpose
> require(datasets)
> (cars <- mtcars[1:5,1:4])
                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
> t(cars)
     Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
mpg         21            21       22.8           21.4              18.7
cyl          6             6        4.0            6.0               8.0
disp       160           160      108.0          258.0             360.0
hp         110           110       93.0          110.0             175.0
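Note that t() applied to a data frame returns a matrix, not a data frame, which matters if later code expects data frame columns. The sketch below checks this on the same subset of mtcars and also shows data.matrix(), which converts a data frame to a numeric matrix without transposing it.

```r
cars <- mtcars[1:5, 1:4]

tc <- t(cars)
is.data.frame(cars)   # TRUE
is.matrix(tc)         # TRUE: t() converts the data frame to a matrix
dim(tc)               # 4 5

# Convert back to a data frame if one is needed
tc_df <- as.data.frame(tc)

# data.matrix() keeps the orientation and returns a numeric matrix
m <- data.matrix(cars)
dim(m)                # 5 4
```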

Aggregating examples

> ### Aggregating
> options(digits=3)
> # attach to access by column name directly
> attach(studentMarks)
> (aggdata <-aggregate(studentMarks[c(1,3,4)], by=list(StudentID), FUN=mean, na.rm=TRUE))
  Group.1 StudentID English Math
1    1000      1000    88.6 91.7
2    1001      1001    94.7 79.9
3    1002      1002    92.9 69.5
4    1003      1003    88.9 76.1
5    1004      1004    91.1 88.5
6    1005      1005    83.0 75.4
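The example above relies on attach() and groups by StudentID, which is unique per row, so each group contains a single student. The sketch below shows the formula interface of aggregate(), which needs no attach(), on a made-up data frame with a hypothetical Class column so that the grouping actually combines rows.

```r
marks <- data.frame(
  Class   = c("A", "A", "B", "B"),   # hypothetical grouping column
  English = c(88, 94, 92, 89),
  Math    = c(91, 79, 69, 76)
)

# Mean of English and Math per class
agg <- aggregate(cbind(English, Math) ~ Class, data = marks, FUN = mean)
agg
```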