Hive - Create External Table for Multiline CSV Files
The article PySpark Read Multiline (Multiple Lines) from CSV File shows how to create a Spark DataFrame by reading CSV files with embedded newlines in values. In Hive, an external table can be created pointing at the location of your CSV files, whether they are stored in HDFS, S3, Azure Blob Storage or GCS.
Multiline CSV file sample
Assume the CSV file named multiline-csv.csv is stored in S3 (s3://bucket/multiline/):
ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,"Hello
Kontext!"
4,Record 4,Hello!
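You can verify the mismatch between physical lines and logical records with the standard Python csv module, which honors newlines inside quoted fields. A minimal sketch, assuming a local copy of the file:

import csv

# Physical lines: 6 (the quoted value in record 3 spans two lines)
with open("multiline-csv.csv", newline="") as f:
    print(sum(1 for _ in f))

# Logical CSV rows: 5 (the header plus 4 records)
with open("multiline-csv.csv", newline="") as f:
    print(sum(1 for _ in csv.reader(f)))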
Create the external table
Hive has a built-in SerDe (OpenCSVSerde) to deal with CSV data.
CREATE EXTERNAL TABLE test (
ID string,
Text1 string,
Text2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
LOCATION 's3://bucket/multiline/'
TBLPROPERTIES ("skip.header.line.count"="1");
We skip the first line as it is the header.
When querying this external table in Athena on AWS or directly in Hive, you will get 5 records instead of the expected 4:
+---------+--------+-------------+
|       ID|   Text1|        Text2|
+---------+--------+-------------+
|        1|Record 1| Hello World!|
|        2|Record 2|Hello Hadoop!|
|        3|Record 3|        Hello|
|Kontext!"|    NULL|         NULL|
|        4|Record 4|       Hello!|
+---------+--------+-------------+
The problem
The above occurs because OpenCSVSerde doesn't support embedded newline characters in field values: each physical line in the file is parsed as a separate record, so the quoted value that spans two lines is split across two rows.
The following sections provide some common resolutions to address this issue. They all involve pre-processing the data.
Approach 1 - Escape embedded newlines
One approach is to pre-process your CSV file and escape the newlines in your field values, for example by replacing each one with the literal characters \n. The input file then becomes:
ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,"Hello\nKontext!"
4,Record 4,Hello!
This is now a standard one-record-per-line CSV file and can be read properly using the previous table definition.
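If you need to generate such a file, here is a minimal pre-processing sketch, assuming Python 3; the input and output file names are placeholders:

import csv

with open("multiline-csv.csv", newline="") as src, \
        open("multiline-csv-escaped.csv", "w", newline="") as dst:
    reader = csv.reader(src)  # the csv module parses newlines inside quoted fields
    writer = csv.writer(dst)
    for row in reader:
        # Replace each real newline with the two literal characters \n
        writer.writerow([field.replace("\n", "\\n") for field in row])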
Approach 2 - Replace line termination character
You can also pre-process your CSV file to use a different row termination character, for example ~. The table can then be defined using the following statement:
CREATE EXTERNAL TABLE test(
ID STRING,
Text1 STRING,
Text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LINES TERMINATED BY '~'
NULL DEFINED AS '__NULL__'
STORED AS TEXTFILE
LOCATION 's3://bucket/multiline/'
TBLPROPERTIES ("skip.header.line.count"="1");
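A minimal pre-processing sketch for this approach, assuming Python 3 and placeholder file names; it backslash-escapes the characters the table definition treats specially (the field delimiter, the row terminator and the escape character itself):

import csv

def escape(value):
    # Escape the backslash first so the escapes added below are not doubled
    for ch in ("\\", ",", "~"):
        value = value.replace(ch, "\\" + ch)
    return value

with open("multiline-csv.csv", newline="") as src, \
        open("multiline-tilde.csv", "w", newline="") as dst:
    for row in csv.reader(src):
        dst.write(",".join(escape(field) for field in row) + "~")

Embedded newlines can stay in the values unchanged, since rows are now delimited by ~ rather than by newline characters.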
Approach 3 - Change the file format from CSV to Parquet
The other way is to write a pre-processing program that reads the CSV data properly, for example using PySpark as mentioned in the other article, and saves it as Parquet or another schema-aware format.
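A PySpark sketch of the conversion, using the multiLine option covered in the referenced article; the S3 paths are placeholders, and the Parquet output goes to a separate prefix so it does not mix with the raw CSV files:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# multiLine tells Spark to parse newlines inside quoted fields correctly
df = (spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .csv("s3://bucket/multiline/"))

# Write Parquet to the prefix the external table will point at
df.write.mode("overwrite").parquet("s3://bucket/multiline-parquet/")

The external table can then be created on the Parquet output: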
CREATE EXTERNAL TABLE test (
ID string,
Text1 string,
Text2 string)
STORED AS PARQUET
LOCATION 's3://bucket/multiline-parquet/';
References
OpenCSVSerDe for processing CSV - Amazon Athena
Differences between Hive External and Internal (Managed) Tables