Python: Load / Read Multiline CSV File
CSV is a common data format used in many applications. It's also a common task for data workers to read and parse CSV and then save it into another storage such as RDBMS (Teradata, SQL Server, MySQL). In my previous article PySpark Read Multiple Lines Records from CSV I demonstrated how to use PySpark to read CSV as a data frame. This article will show you several approaches to read CSV files directly using Python (without Spark APIs).
CSV data file
The CSV file I'm going to load is the same as the one in the previous example. The file is named as data.csv with the following content:
ID,Text1,Text2 1,Record 1,Hello World! 2,Record 2,Hello Hadoop! 3,Record 3,"Hello Kontext!" 4,Record 4,Hello!
There are 4 records and three columns. One record's content is across multiple line.
Environment
All the following code snippets runs on a Windows 10 machine with Python 3.8.2 64bit. It should work on other platforms but I have not tested it. Please bear this in mind.
Use built-in csv module
csv module can be used to read CSV files directly. It can be used to both read and write CSV files.
Refer to official docs about this module.
Sample code
import csv file_path = 'data.csv' with open(file_path, newline='', encoding='utf-8') as f: reader = csv.reader(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL) for row in reader: print(row)
The above code snippet reads CSV with all default options and it can handle multi-line CSV automatically.
The output looks like this:
['ID', 'Text1', 'Text2'] ['1', 'Record 1', 'Hello World!'] ['2', 'Record 2', 'Hello Hadoop!'] ['3', 'Record 3', 'Hello \r\nKontext!'] ['4', 'Record 4', 'Hello!']
Use Pandas
Pandas has API to read CSV file as a data frame directly.
Read this document for all the parameters: pandas.read_csv.
Sample code
import pandas as pd file_path = 'data.csv' pdf = pd.read_csv(file_path) print(pdf)
For the sample CSV files, by default it can handle it properly. If your CSV structure/content is different, you can customize the API call.
The output looks like the following:
ID Text1 Text2 0 1 Record 1 Hello World! 1 2 Record 2 Hello Hadoop! 2 3 Record 3 Hello \r\nKontext! 3 4 Record 4 Hello!
For Pandas dataframe, you can also write the results into a database directly via to_sql function.