Hive - Create External Table for Multiline CSV Files
The article PySpark Read Multiline (Multiple Lines) from CSV File shows how to create a Spark DataFrame by reading CSV files with embedded newlines in values. In Hive, an external table can be created pointing at the location of your CSV files, whether they are stored in HDFS, S3, Azure Blob Storage or GCS.
Multiline CSV file sample
Assume the CSV file named multiline-csv.csv is stored in S3 (s3://bucket/multiline/):
ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,"Hello
Kontext!"
4,Record 4,Hello!
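You can verify the mismatch between physical lines and logical records with the standard Python csv module, which honors newlines inside quoted fields. A minimal sketch, assuming a local copy of the file:

import csv

# Physical lines: 6 (the quoted value in record 3 spans two lines)
with open("multiline-csv.csv", newline="") as f:
    print(sum(1 for _ in f))

# Logical CSV rows: 5 (the header plus 4 records)
with open("multiline-csv.csv", newline="") as f:
    print(sum(1 for _ in csv.reader(f)))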
Create the external table
Hive has a built-in SerDe (OpenCSVSerde) to deal with CSV data.
CREATE EXTERNAL TABLE test (
ID string,
Text1 string,
Text2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
LOCATION 's3://bucket/multiline/'
TBLPROPERTIES ("skip.header.line.count"="1");
We skip the first line as it is the header.
When querying this external table in Athena on AWS or directly in Hive, you will get 5 records instead of the expected 4:
+---------+--------+-------------+
|       ID|   Text1|        Text2|
+---------+--------+-------------+
|        1|Record 1| Hello World!|
|        2|Record 2|Hello Hadoop!|
|        3|Record 3|        Hello|
|Kontext!"|    NULL|         NULL|
|        4|Record 4|       Hello!|
+---------+--------+-------------+
The problem
The above occurs because OpenCSVSerde doesn't support embedded newline characters in field values: each physical line in the file is parsed as a separate record, so the quoted value that spans two lines is split across two rows.
The following sections provide some common resolutions to address this issue. They all involve pre-processing the data.
Approach 1 - Escape embedded newlines
One approach is to pre-process your CSV file and escape the newlines in your field values, for example by replacing each one with the literal characters \n. The input file then becomes:
ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,"Hello\nKontext!"
4,Record 4,Hello!
This is now a standard one-record-per-line CSV file and can be read properly using the previous table definition.
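If you need to generate such a file, here is a minimal pre-processing sketch, assuming Python 3; the input and output file names are placeholders:

import csv

with open("multiline-csv.csv", newline="") as src, \
        open("multiline-csv-escaped.csv", "w", newline="") as dst:
    reader = csv.reader(src)  # the csv module parses newlines inside quoted fields
    writer = csv.writer(dst)
    for row in reader:
        # Replace each real newline with the two literal characters \n
        writer.writerow([field.replace("\n", "\\n") for field in row])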
Approach 2 - Replace line termination character
You can also pre-process your CSV file to use a different row termination character, for example ~. The table can then be defined using the following statement:
CREATE EXTERNAL TABLE test(
ID STRING,
Text1 STRING,
Text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LINES TERMINATED BY '~'
NULL DEFINED AS '__NULL__'
STORED AS TEXTFILE
LOCATION 's3://bucket/multiline/'
TBLPROPERTIES ("skip.header.line.count"="1");
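A minimal pre-processing sketch for this approach, assuming Python 3 and placeholder file names; it backslash-escapes the characters the table definition treats specially (the field delimiter, the row terminator and the escape character itself):

import csv

def escape(value):
    # Escape the backslash first so the escapes added below are not doubled
    for ch in ("\\", ",", "~"):
        value = value.replace(ch, "\\" + ch)
    return value

with open("multiline-csv.csv", newline="") as src, \
        open("multiline-tilde.csv", "w", newline="") as dst:
    for row in csv.reader(src):
        dst.write(",".join(escape(field) for field in row) + "~")

Embedded newlines can stay in the values unchanged, since rows are now delimited by ~ rather than by newline characters.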
Approach 3 - Change the file format from CSV to Parquet
The other way is to write a pre-processing program that reads the CSV data properly, for example using PySpark as mentioned in the other article, and saves it as Parquet or another schema-aware format.
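A PySpark sketch of the conversion, using the multiLine option covered in the referenced article; the S3 paths are placeholders, and the Parquet output goes to a separate prefix so it does not mix with the raw CSV files:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# multiLine tells Spark to parse newlines inside quoted fields correctly
df = (spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .csv("s3://bucket/multiline/"))

# Write Parquet to the prefix the external table will point at
df.write.mode("overwrite").parquet("s3://bucket/multiline-parquet/")

The external table can then be created on the Parquet output: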
CREATE EXTERNAL TABLE test (
ID string,
Text1 string,
Text2 string)
STORED AS PARQUET
LOCATION 's3://bucket/multiline-parquet/';
References
OpenCSVSerDe for processing CSV - Amazon Athena
Differences between Hive External and Internal (Managed) Tables