Hive - Create External Table for Multiline CSV Files


In the article PySpark Read Multiline (Multiple Lines) from CSV File, we showed how to create a Spark DataFrame by reading CSV files with embedded newlines in field values. In Hive, an external table can be created against the location of your CSV files, whether they live in HDFS, S3, Azure Blob Storage or GCS.
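For reference, a minimal PySpark sketch of that multiline read (the S3 path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multiline-csv").getOrCreate()

# multiLine=true lets Spark keep newlines inside quoted fields
df = (spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .csv("s3://bucket/multiline/"))
df.show(truncate=False)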

Multiline CSV file sample

Assume a CSV file named multiline-csv.csv is stored in S3 under s3://bucket/multiline/:

ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,"Hello 
Kontext!"
4,Record 4,Hello!

Create the external table

Hive has a built-in SerDe, OpenCSVSerde, for dealing with CSV data.

CREATE EXTERNAL TABLE test (
    ID string,
    Text1 string,
    Text2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
LOCATION 's3://bucket/multiline/'
TBLPROPERTIES ("skip.header.line.count"="1");

We skip the first line as it is the header.

When querying this external table in Athena on AWS, or in Hive directly, you will get 5 records instead of the expected 4:

+---------+--------+-------------+
|       ID|   Text1|        Text2|
+---------+--------+-------------+
|        1|Record 1| Hello World!|
|        2|Record 2|Hello Hadoop!|
|        3|Record 3|        Hello|
| Kontext!|        |             |
|        4|Record 4|       Hello!|
+---------+--------+-------------+

The problem

The above occurs because OpenCSVSerde does not support embedded newline characters in fields: each physical line in the file is treated as one record.

The following sections provide some common resolutions to address this issue. They all involve pre-processing the data.

Approach 1 - Escape embedded newlines

One approach is to pre-process the CSV file and escape the newlines inside your fields, for example by replacing each one with the literal sequence \n. The input file then becomes:

ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,"Hello \nKontext!"
4,Record 4,Hello!

This is now a standard one-record-per-line CSV file and can be read properly using the previous table definition.
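A minimal Python sketch of this pre-processing step; the input and output file names are placeholders:

import csv

# The csv module parses quoted embedded newlines correctly,
# so we read the raw file and write each field back out with
# newlines escaped as the literal sequence \n.
with open("multiline-csv.csv", newline="") as src, \
     open("escaped.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        writer.writerow([field.replace("\n", "\\n") for field in row])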

Approach 2 - Replace line termination character

You can also pre-process the CSV file to use a different row-termination character, for example ~. The table can then be defined using the following statement:

CREATE EXTERNAL TABLE test (
    ID STRING,
    Text1 STRING,
    Text2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LINES TERMINATED BY '~'
NULL DEFINED AS '__NULL__'
STORED AS TEXTFILE
LOCATION 's3://bucket/multiline/'
TBLPROPERTIES ("skip.header.line.count"="1");
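A minimal Python sketch of this conversion, again with placeholder file names. It assumes ~ never appears in the data, and escapes commas inside fields with a backslash to match the ESCAPED BY '\\' clause above; embedded newlines can stay as-is because rows now end at ~:

import csv

# Rewrite the file so each record ends with '~' instead of '\n'.
with open("multiline-csv.csv", newline="") as src, \
     open("tilde-rows.csv", "w") as dst:
    for row in csv.reader(src):
        fields = [f.replace(",", "\\,") for f in row]
        dst.write(",".join(fields) + "~")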

Approach 3 - Convert the file from CSV to Parquet

The other way is to write a pre-processing program that reads the CSV data properly, for example with PySpark as described in the article mentioned earlier, and saves it as Parquet or another schema-aware format. The external table can then be created on top of the converted output, as shown below.
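A minimal PySpark sketch of the conversion; the output path s3://bucket/multiline-parquet/ is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the multiline CSV correctly, then persist it as Parquet so
# downstream tools no longer need to parse embedded newlines.
df = (spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .csv("s3://bucket/multiline/"))
df.write.mode("overwrite").parquet("s3://bucket/multiline-parquet/")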

CREATE EXTERNAL TABLE test (
    ID string,
    Text1 string,
    Text2 string)
STORED AS PARQUET
LOCATION 's3://bucket/multiline-parquet/';

Note that LOCATION now points to the directory holding the converted Parquet output, not the original CSV files.

References

OpenCSVSerDe for processing CSV - Amazon Athena

Differences between Hive External and Internal (Managed) Tables
