Python: Load / Read Multiline CSV File

CSV is a common data format used in many applications. Data workers often need to read and parse CSV files and then save the data to another store such as a relational database (Teradata, SQL Server, MySQL). In my previous article, PySpark Read Multiple Lines Records from CSV, I demonstrated how to use PySpark to read CSV as a data frame. This article shows two approaches to reading CSV files directly in Python, without the Spark APIs.

CSV data file

The CSV file I'm going to load is the same as the one in the previous example. It is named data.csv and has the following content:

ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,"Hello 
Kontext!"
4,Record 4,Hello!

There are four records and three columns. The content of one record (ID 3) spans multiple lines because its Text2 value contains a line break inside double quotes.

Environment 

All the following code snippets were run on a Windows 10 machine with 64-bit Python 3.8.2. They should work on other platforms, but I have not tested them.

Use built-in csv module

The built-in csv module can both read and write CSV files.

Refer to the official documentation for details about this module.

Sample code

import csv

file_path = 'data.csv'

with open(file_path, newline='', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        print(row)

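The above code snippet passes the delimiter, quote character and quoting mode explicitly, but these are the module's default values, so csv.reader(f) behaves the same way. Opening the file with newline='' lets the reader handle the line break embedded in the quoted field, so the multi-line record is parsed as a single row.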

The output looks like this:

['ID', 'Text1', 'Text2']
['1', 'Record 1', 'Hello World!']
['2', 'Record 2', 'Hello Hadoop!']
['3', 'Record 3', 'Hello \r\nKontext!']
['4', 'Record 4', 'Hello!']
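
If you prefer to access fields by column name, the same module also offers csv.DictReader. The following is a minimal sketch rather than part of the original test run: it assumes the sample content is held in an in-memory string (the variable names sample and records are only for illustration) so that it runs without data.csv on disk.

```python
import csv
import io

# Same content as data.csv, held in memory so the sketch is self-contained.
# With a real file, use open('data.csv', newline='', encoding='utf-8') instead.
sample = (
    'ID,Text1,Text2\r\n'
    '1,Record 1,Hello World!\r\n'
    '2,Record 2,Hello Hadoop!\r\n'
    '3,Record 3,"Hello \r\nKontext!"\r\n'
    '4,Record 4,Hello!\r\n'
)

# newline='' leaves the line break inside the quoted field untouched.
with io.StringIO(sample, newline='') as f:
    # DictReader uses the first row as the field names.
    reader = csv.DictReader(f)
    records = list(reader)

for record in records:
    # repr() makes the embedded line break visible.
    print(record['ID'], repr(record['Text2']))
```

Each row is returned as a dictionary keyed by the header row, and the multi-line value of record 3 is still kept in a single field.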

Use Pandas

Pandas provides an API to read a CSV file directly into a data frame.

Read this document for all the parameters: pandas.read_csv.

Sample code

import pandas as pd
file_path = 'data.csv'
pdf = pd.read_csv(file_path)
print(pdf)

With the default options, Pandas handles the sample CSV file properly. If your CSV structure or content is different, you can customize the API call through the parameters of read_csv.
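
As an illustration of such customization, here is a small sketch that assumes a hypothetical variant of the file using semicolons as the delimiter, and that keeps the ID column as text. The sample string and the choice of parameters are assumptions for demonstration only; sep, quotechar and dtype are standard read_csv parameters.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited variant of the sample data (an assumption,
# not the article's data.csv), held in memory to keep the sketch self-contained.
sample = (
    'ID;Text1;Text2\n'
    '1;Record 1;Hello World!\n'
    '3;Record 3;"Hello \nKontext!"\n'
)

pdf = pd.read_csv(
    io.StringIO(sample),
    sep=';',            # field delimiter used by this variant
    quotechar='"',      # quoted fields may contain line breaks
    dtype={'ID': str},  # keep IDs as text instead of integers
)
print(pdf)
```

The quoted field still ends up in one cell, so the multi-line record is read as a single row.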

The output looks like the following:

   ID     Text1               Text2
0   1  Record 1        Hello World!
1   2  Record 2       Hello Hadoop!
2   3  Record 3  Hello \r\nKontext!
3   4  Record 4              Hello!

With a Pandas data frame, you can also write the results directly into a database via the to_sql method.
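
A minimal sketch of that step is shown below. It assumes an in-memory SQLite database from Python's built-in sqlite3 module instead of Teradata, SQL Server or MySQL, and the table name records is hypothetical. For other databases you would typically pass a SQLAlchemy connection or engine rather than a sqlite3 connection.

```python
import io
import sqlite3

import pandas as pd

# Sample content held in memory; with a real file, pass 'data.csv' to read_csv.
sample = (
    'ID,Text1,Text2\n'
    '1,Record 1,Hello World!\n'
    '2,Record 2,Hello Hadoop!\n'
    '3,Record 3,"Hello \nKontext!"\n'
    '4,Record 4,Hello!\n'
)
pdf = pd.read_csv(io.StringIO(sample))

# In-memory SQLite database, used here only for demonstration.
conn = sqlite3.connect(':memory:')
try:
    # 'records' is a hypothetical table name; index=False skips the row index.
    pdf.to_sql('records', conn, index=False, if_exists='replace')
    rows = conn.execute(
        'SELECT ID, Text1, Text2 FROM records ORDER BY ID'
    ).fetchall()
finally:
    conn.close()

for row in rows:
    print(row)
```

The multi-line text is stored as a single column value, line break included.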
