PySpark - variance() Function


variance() is an aggregate function that returns the sample variance of the given column in a PySpark DataFrame. It is an alias for var_samp(), which is why the result column is labeled var_samp(marks); to get the population variance instead, use var_pop().

The variance() function must be imported from pyspark.sql.functions.

Syntax:

dataframe.select(variance("column_name"))

Example:

  • Get the variance of the marks column of the PySpark DataFrame.
# import the required modules
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import variance

# create a SparkSession
spark = SparkSession.builder.appName('kontext').getOrCreate()

# create a list of rows
values = [{'rollno': 1, 'student name': 'Gottumukkala Sravan kumar', 'marks': 98},
          {'rollno': 2, 'student name': 'Gottumukkala Bobby', 'marks': 89},
          {'rollno': 3, 'student name': 'Lavu Ojaswi', 'marks': 90},
          {'rollno': 4, 'student name': 'Lavu Gnanesh', 'marks': 78},
          {'rollno': 5, 'student name': 'Chennupati Rohith', 'marks': 100}]

# create the DataFrame from the values
data = spark.createDataFrame(values)

# display the variance of the marks column
print(data.select(variance("marks")).collect())

Output:

[Row(var_samp(marks)=76.00000000000001)]
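As a sanity check (a sketch using only Python's standard library, not part of the original example), the same result can be computed by hand: the mean of [98, 89, 90, 78, 100] is 91, the squared deviations sum to 304, and dividing by n - 1 = 4 gives the sample variance 76.0, matching Spark's var_samp output.

```python
# Cross-check Spark's result with Python's statistics module.
from statistics import variance as sample_variance, pvariance

marks = [98, 89, 90, 78, 100]  # the marks column from the example

# sample variance (divides by n - 1): what Spark's variance()/var_samp() computes
print(sample_variance(marks))

# population variance (divides by n): what Spark's var_pop() would compute
print(pvariance(marks))
```

The gap between the two (76.0 vs. 60.8 here) is only the n - 1 vs. n divisor, which is worth keeping in mind when comparing results across tools.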
