Working with Databases and Files in R

event 2020-09-23 visibility 220 comment 0 insights
more_vert
insights Stats
Raymond Raymond R Programming

Programming with R language - tutorials about R. 

R provides rich APIs to interact with source data such as databases and files (CSV, XML, JSON, etc.) With SparklyR, R can also be used to interact with big data platforms like Hadoop. This articles shows examples about using R to load data from relational databases and text files.  

Data interface summary

The following table is a summary of R packages for accessing data in different sources.

Data SourceLibrary/PackageFunctions
CSVutilsread.table, read.csv, read.csv2, read.delim, read.delim2 write.table, write.csv, write.csv2
Excelxlsxread.xlsx
BinarybasewriteBin(object, con) readBin(con, what, n )
XMLXMLxmlParse, xmlRoot, xmlSize, xmlToDataFrame …
JSONrjsonfromJSON, as.data.frame
Web resourcesRcurl, stringrgetHTMLLinks, links, l_ply
DatabaseRMySQL, RSQLServer, ROracle, RSQLite…dbConnect, dbListTables, dbSendQuery, fetch

Load CSV file

The following example performs these actions in sequence:

  • Load CSV file
  • Perform calculations/aggregations
  • Save the result into CSV file

The input CSV file 'SalesData.csv' has the following content:

Month,Product,Quantities,Amount
2017-07,Vitamin A,100,1700
2017-07,Vitamin E,50,900
2017-08,Vitamin A,100,1700
2017-09,Vitamin A,90,1500
2017-09,Vitamin B,500,3123

Code snippet

The following code is the implementation (script R24.WorkWithCSV.R):

# Load CSV
getwd()

require(utils)

(
  salesData <- read.csv("R24/SalesData.csv", header = TRUE, sep=",")
)

require(reshape)

# Melt data first
(
  moltenSalesData <- melt(salesData, id=c("Month"), measure=c("Quantities","Amount"))
)

# aggregation by month
(
aggSalesData  <- cast(moltenSalesData,  Month ~ variable, sum, )
)

# Set factor labels for display
aggSalesData$Month <- factor(aggSalesData$Month, labels = c('2017-JUL','2017-AUG','2017-SEP'))
aggSalesData

# Save data
write.csv(aggSalesData,file="R24/AggSalesData.csv", quote=FALSE, row.names = FALSE)

Sample output

After the above script is executed, a new file named 'AggSalesData.csv' is generated with the following content:

2020092394443-image.png'

Connect to SQL Server database via ODBC

There are many packages available in R to connect to databases. The section provides one example of interacting with SQL Server database via 'odbc' package. 

For interacting with databases, these DBI functions are provided:

  • dbConnect
  • dbGetInfo
  • dbListTables
  • dbReadTable
  • dbSendStatement
  • dbHasCompleted
  • dbGetRowsAffected
  • dbClearResult
  • dbSendQuery
  • dbFetch
  • dbDisconnect
  • ...

Code snippet

The following code snippet (script R25.Database.R) install odbc package and then use it to establish a database connection; once the database connection is established, it is then used to read data from table 'SalesData'. Insert statements are also executed to insert data into the database. The script also shows different ways to loading data from database as data frame. 

if("odbc" %in% installed.packages())
{
  print("odbc is installed.")
}else
{
  print("odbc is not installed.")
  install.packages("odbc")
}

require(odbc)

# establish connection
con <- dbConnect(odbc::odbc(), 
                 .connection_string = "Driver={SQL Server};Server=.;Database=RTutorials;Trusted_Connection=True;")

# get database info
dbGetInfo(con)

# list tables
dbListTables(con)

# read table
dbReadTable(con, "SalesData")

# DML statements
rs <- dbSendStatement(con, "INSERT [dbo].[SalesData] ([Month], [Product], [Quantitites], [Amount]) VALUES (N'2017-09', N'Vitamin C', 500, CAST(3123.00 AS Numeric(20, 2)))")
dbHasCompleted(rs)
dbGetRowsAffected(rs)
dbClearResult(rs)

# Select statements with dbFetch()
rs <- dbSendQuery(con, "SELECT * FROM SalesData;")
dbFetch(rs)
dbClearResult(rs)

# disconnect
dbDisconnect(con)

SQL scripts to create the database

USE [RTutorials]
GO
/****** Object:  Table [dbo].[SalesData]    Script Date: 3/12/2017 5:10:17 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[SalesData]') AND type in (N'U'))
BEGIN
CREATE TABLE [dbo].[SalesData](
	[Month] [varchar](10) NOT NULL,
	[Product] [varchar](50) NOT NULL,
	[Quantitites] [int] NOT NULL,
	[Amount] [numeric](20, 2) NOT NULL
) ON [PRIMARY]
END
GO
INSERT [dbo].[SalesData] ([Month], [Product], [Quantitites], [Amount]) VALUES (N'2017-07', N'Vitamin A', 100, CAST(1700.00 AS Numeric(20, 2)))
GO
INSERT [dbo].[SalesData] ([Month], [Product], [Quantitites], [Amount]) VALUES (N'2017-07', N'Vitamin E', 50, CAST(900.00 AS Numeric(20, 2)))
GO
INSERT [dbo].[SalesData] ([Month], [Product], [Quantitites], [Amount]) VALUES (N'2017-08', N'Vitamin A', 100, CAST(1700.00 AS Numeric(20, 2)))
GO
INSERT [dbo].[SalesData] ([Month], [Product], [Quantitites], [Amount]) VALUES (N'2017-09', N'Vitamin A', 90, CAST(1500.00 AS Numeric(20, 2)))
GO
INSERT [dbo].[SalesData] ([Month], [Product], [Quantitites], [Amount]) VALUES (N'2017-09', N'Vitamin B', 500, CAST(3123.00 AS Numeric(20, 2)))
GO


select * from dbo.SalesData

2020092394912-image.png

Summary

It's critical to understand how to interact with different data source formats as they are the input for most of our R processes. Let me know if you have any questions or encounter any issues using these data APIs. 

More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts