Code description
In PySpark, we can use the select function to project a subset of columns, or all columns, from a DataFrame.
Syntax
DataFrame.select(*cols)
This function returns a new DataFrame built from the projection expression list; each expression can be a column name, a Column object, or a string expression.
The final select call in this code snippet prints out the following output:
+---+----------------+-------+---+
| id|customer_profile| name|age|
+---+----------------+-------+---+
| 1| {Kontext, 3}|Kontext| 3|
| 2| {Tech, 10}| Tech| 10|
+---+----------------+-------+---+
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
appName = "PySpark Example - select"
master = "local"
# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
data = [{"id": 1, "customer_profile": {"name": "Kontext", "age": 3}},
{"id": 2, "customer_profile": {"name": "Tech", "age": 10}}]
customer_schema = StructType([
StructField('name', StringType(), True),
StructField('age', IntegerType(), True),
])
df_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("customer_profile", customer_schema, False),
])
df = spark.createDataFrame(data, df_schema)
print(df.schema)
df.show()
# Select all existing columns plus individual fields from the nested struct
df.select('*', "customer_profile.name", "customer_profile.age").show()