PySpark DataFrame - Select Columns using select Function

In PySpark, we can use the select function to select a subset of columns, or all columns, from a DataFrame.

Syntax

DataFrame.select(*cols)

This function projects the given column names or expressions and returns a new DataFrame object; the original DataFrame is left unchanged.

The final select call in the code snippet below prints the following output:

+---+----------------+-------+---+
| id|customer_profile|   name|age|
+---+----------------+-------+---+
|  1|    {Kontext, 3}|Kontext|  3|
|  2|      {Tech, 10}|   Tech| 10|
+---+----------------+-------+---+

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

appName = "PySpark Example - select"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

data = [{"id": 1, "customer_profile": {"name": "Kontext", "age": 3}},
        {"id": 2, "customer_profile": {"name": "Tech", "age": 10}}]

customer_schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
])
df_schema = StructType([StructField("id", IntegerType(), True), StructField(
    "customer_profile", customer_schema, False)])
df = spark.createDataFrame(data, df_schema)
print(df.schema)
df.show()

# Select all columns plus the nested name and age fields of customer_profile
df.select('*', "customer_profile.name", "customer_profile.age").show()