python
97 items tagged with "python"
Articles
FastAPI Streaming Response: Error: Did not receive done or success response in stream
Geoanalytics with geopandas and ABS Public Data in Python
GenAI: Generate SQL from Text via Ollama Python Library on Local
Danger: Bypass SSL Interception Proxy Server Certificate Validation in .NET or Python HTTP Requests
Pandas DataFrame aggregate a list column to a set
This code snippet shows you how to group a pandas DataFrame and then aggregate a column of list or array type into a set (with duplicates removed) or a list. To implement it, we can first explode the column (list type), then use groupby to create a grouped DataFrame, and finally aggregate using set, list, or a combination of both.

Input:

```
  category      users
0        A     [1, 2]
1        B     [3, 4]
2        C  [5, 6, 7]
3        A  [1, 8, 1]
4        B  [1, 6, 9]
```

Output:

```
  category         usersset        userslist
0        A        {8, 1, 2}        [8, 1, 2]
1        B  {1, 3, 4, 6, 9}  [1, 3, 4, 6, 9]
2        C        {5, 6, 7}        [5, 6, 7]
```
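A minimal sketch of this approach, assuming pandas 0.25+ (which introduced DataFrame.explode); the data mirrors the input above:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'C', 'A', 'B'],
    'users': [[1, 2], [3, 4], [5, 6, 7], [1, 8, 1], [1, 6, 9]],
})

# Explode the list column into one row per element, then aggregate
# each group back into a set (deduplicated) and a deduplicated list.
exploded = df.explode('users')
result = exploded.groupby('category')['users'].agg(
    usersset=lambda s: set(s),
    userslist=lambda s: list(set(s)),
).reset_index()
print(result)
```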
Union two pandas DataFrame
This code snippet shows you how to union two pandas DataFrames in Python using the concat method in the pandas namespace. If the schemas are different, pandas will automatically merge them.

Output:

```
  category  value  user
0        A      0   2.0
1        B      1   3.0
2        C      2   2.0
3        D      3   1.0
4        E      4   1.0
0        A      0   NaN
1        B      1   NaN
2        C      2   NaN
3        D      3   NaN
4        E      4   NaN
```

Column user doesn't exist in the second DataFrame, so pandas uses NaN to mark it.
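A minimal sketch of the union; the column names mirror the output above:

```python
import pandas as pd

df1 = pd.DataFrame({'category': list('ABCDE'),
                    'value': range(5),
                    'user': [2.0, 3.0, 2.0, 1.0, 1.0]})
df2 = pd.DataFrame({'category': list('ABCDE'),
                    'value': range(5)})

# concat stacks the rows; the 'user' column missing from df2 becomes NaN.
unioned = pd.concat([df1, df2])
print(unioned)
```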
Pandas DataFrame groupBy and then COUNT DISTINCT
This code snippet shows you how to group a pandas DataFrame by a certain column and then do a distinct count of unique values in another column. It is similar to the COUNT DISTINCT aggregation function in SQL. It also sorts the values by the group by column.

Example output:

```
   category  value  user  group-count
0         A      0     5            7
80        A     80     4            7
70        A     70     7            7
60        A     60    10            7
50        A     50     9            7
..      ...    ...   ...          ...
29        J     29     9            7
19        J     19     9            7
9         J      9     9            7
89        J     89     8            7
99        J     99     7            7

[100 rows x 4 columns]
```
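The idea can be sketched as follows (the column names here are illustrative, not necessarily the ones used in the snippet):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B'],
    'user': [1, 1, 2, 3, 3],
})

# nunique per group is the pandas equivalent of SQL COUNT(DISTINCT user).
distinct_counts = (df.groupby('category')['user']
                     .nunique()
                     .reset_index(name='distinct-users')
                     .sort_values('category'))
print(distinct_counts)
```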
PySpark split and explode example
This code snippet shows you how to define a function to split a string column into an array of strings using the Python built-in split function. It then explodes the array elements into rows using the PySpark built-in explode function.

Sample output:

```
+----------+-----------------+--------------------+-----+
|  category|            users|         users_array| user|
+----------+-----------------+--------------------+-----+
|Category A|user1,user2,user3|[user1, user2, us...|user1|
|Category A|user1,user2,user3|[user1, user2, us...|user2|
|Category A|user1,user2,user3|[user1, user2, us...|user3|
|Category B|      user3,user4|      [user3, user4]|user3|
|Category B|      user3,user4|      [user3, user4]|user4|
+----------+-----------------+--------------------+-----+
```
Azure App Service IIS Log Analytics using Pandas
Flatten Pandas DataFrame after Aggregate
The code snippet Pandas DataFrame Group by one Column and Aggregate using MAX, MIN, MEAN and MEDIAN shows how to do aggregations on a pandas DataFrame. This code snippet shows you how to flatten the DataFrame (MultiIndex columns) after aggregations.

Sample output:

```
  category  value_max  value_min  value_mean  value_median
0        A         90          0          45            45
1        B         91          1          46            46
2        C         92          2          47            47
3        D         93          3          48            48
4        E         94          4          49            49
5        F         95          5          50            50
6        G         96          6          51            51
7        H         97          7          52            52
8        I         98          8          53            53
9        J         99          9          54            54
```
Pandas DataFrame Group by one Column and Aggregate using MAX, MIN, MEAN and MEDIAN
This code snippet provides one example of grouping a pandas DataFrame by one column and then aggregating on multiple columns using different functions including max, min, mean and median. We pass in a dictionary for each column that needs to be aggregated: the key is the column name, and the value is a list of aggregation functions supported by pandas DataFrame. The result DataFrame will have multiple column levels, as the following output shows.

Sample output:

```
         value
           max min mean median
category
A           90   0   45     45
B           91   1   46     46
C           92   2   47     47
D           93   3   48     48
E           94   4   49     49
F           95   5   50     50
G           96   6   51     51
H           97   7   52     52
I           98   8   53     53
J           99   9   54     54
```
Get First Top N Rows in a Pandas DataFrame
Method pandas.DataFrame.head can be used to retrieve the top N records from a DataFrame object. It has one optional parameter for the number of rows to return; the default value is 5 if not specified.

Syntax:

```
DataFrame.head(n=5)
```

Sample output:

```
          A          B
0  0.000000   0.000000
1  1.010101  10.101010
2  2.020202  20.202020
3  3.030303  30.303030
4  4.040404  40.404040
5  5.050505  50.505051
6  6.060606  60.606061
7  7.070707  70.707071
8  8.080808  80.808081
9  9.090909  90.909091
```
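A small illustration (the sample data approximates the output above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.linspace(0, 9, 10), 'B': np.linspace(0, 90, 10)})

print(df.head())    # first 5 rows (the default)
print(df.head(3))   # first 3 rows
```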
Check Column Data Types in a Pandas DataFrame
This code snippet provides examples of checking column data types in a DataFrame using dtype and dtypes.

Sample output:

```
>>> print(df.dtypes)
a     int64
b    object
dtype: object
>>> print(df.a.dtype)
int64
```
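A minimal sketch; the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# dtypes lists the data type of every column;
# dtype inspects a single column (Series).
print(df.dtypes)
print(df['a'].dtype)
```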
Rename Columns in Pandas DataFrame
This code snippet provides a simple approach to rename column names in a pandas DataFrame. It directly sets the DataFrame.columns property.
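A minimal sketch with illustrative column names; DataFrame.rename is shown as an alternative when only some columns need renaming:

```python
import pandas as pd

df = pd.DataFrame({'col0': [1, 2], 'col1': [3, 4]})

# Assigning to DataFrame.columns replaces all column names at once.
df.columns = ['id', 'value']

# DataFrame.rename renames selected columns via a mapping.
df = df.rename(columns={'value': 'amount'})
print(df.columns.tolist())
```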
Python - Get a List of Files in a Directory
This code snippet provides an example of listing files in a directory in the file system using Python. The os module is used. There are multiple approaches to implement this:

- os.listdir - returns both files and directories in a directory. We can check whether the returned items are files using the os.path.isfile method.
- os.walk - generates two lists for each directory it traverses: files and directories. It can be used to retrieve all the subfolders recursively.

Sample output:

```
['F:\\Projects\\kontext-logs\\RawLogs-0\\c716eb-202209020059.log',
 'F:\\Projects\\kontext-logs\\RawLogs-0\\c716eb-202209020429.log',
 'F:\\Projects\\kontext-logs\\RawLogs-0\\c716eb-202209020710.log',
 'F:\\Projects\\kontext-logs\\RawLogs-0\\c716eb-202209020921.log',
 'F:\\Projects\\kontext-logs\\RawLogs-0\\c716eb-202209021147.log',
 'F:\\Projects\\kontext-logs\\RawLogs-0\\c716eb-202209021357.log',
 'F:\\Projects\\kontext-logs\\RawLogs-0\\c716eb-202209021637.log',
 'F:\\Projects\\kontext-logs\\RawLogs-0\\c716eb-202209022007.log',
 'F:\\Projects\\kontext-logs\\RawLogs-0\\c716eb-202209030218.log',
 'F:\\Projects\\kontext-logs\\RawLogs-0\\c716eb-202209030818.log']
```

Reference: os — Miscellaneous operating system interfaces
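A sketch of both approaches; the directory path is illustrative:

```python
import os

directory = '.'  # illustrative path; replace with your own

# Approach 1: os.listdir returns names of both files and subdirectories,
# so filter with os.path.isfile to keep files only.
files = [os.path.join(directory, name)
         for name in os.listdir(directory)
         if os.path.isfile(os.path.join(directory, name))]

# Approach 2: os.walk traverses subdirectories recursively.
all_files = []
for root, dirs, names in os.walk(directory):
    all_files.extend(os.path.join(root, name) for name in names)

print(files)
```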
Python Convert Relative to Absolute File Path
This code snippet shows you how to convert a relative path to an absolute path using the os module in Python, and vice versa. We utilize the os.path.abspath and os.path.relpath functions. The latter accepts a second parameter as the base path, which will be used to calculate the relative path.

Output:

```
C:\Users\test.csv
..\test.csv
```

*The current directory when running this code is C:\Users\userid.
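A minimal sketch (file and directory names are illustrative):

```python
import os

# Relative -> absolute: resolved against the current working directory.
abs_path = os.path.abspath('test.csv')

# Absolute -> relative: the second argument is the base path
# against which the relative path is calculated.
rel_path = os.path.relpath(abs_path, os.path.join(os.getcwd(), 'subdir'))

print(abs_path)
print(rel_path)
```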
Use sort() and orderBy() with PySpark DataFrame
In Spark DataFrame, two APIs are provided to sort the rows in a DataFrame based on the provided column or columns: sort and orderBy. orderBy is just an alias for the sort API.

Syntax:

```
DataFrame.sort(*cols, **kwargs)
```

For *cols, we can use it to specify a column name, a Column object (pyspark.sql.Column), or a list of column names or Column objects. For **kwargs, we can use it to specify additional arguments. For PySpark, we can specify a parameter named ascending; by default its value is True. It can also be a list of boolean values, one for each column used to sort the records. The code snippet provides examples of sorting a DataFrame.

Sample outputs:

```
+---+----+
| id|col1|
+---+----+
|  2|   E|
|  4|   E|
|  6|   E|
|  8|   E|
|  1|   O|
|  3|   O|
|  5|   O|
|  7|   O|
|  9|   O|
+---+----+

+---+----+
| id|col1|
+---+----+
|  2|   E|
|  4|   E|
|  6|   E|
|  8|   E|
|  1|   O|
|  3|   O|
|  5|   O|
|  7|   O|
|  9|   O|
+---+----+

+---+----+
| id|col1|
+---+----+
|  8|   E|
|  6|   E|
|  4|   E|
|  2|   E|
|  9|   O|
|  7|   O|
|  5|   O|
|  3|   O|
|  1|   O|
+---+----+
```
PySpark DataFrame - Add or Subtract Milliseconds from Timestamp Column
This code snippet shows you how to add or subtract milliseconds (or microseconds) and seconds from a timestamp column in a Spark DataFrame. It first creates a DataFrame in memory and then adds and subtracts milliseconds/seconds from the timestamp column ts using Spark SQL.

Output:

```
+---+--------------------------+--------------------------+--------------------------+--------------------------+
|id |ts                        |ts1                       |ts2                       |ts3                       |
+---+--------------------------+--------------------------+--------------------------+--------------------------+
|1  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|2  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|3  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
|4  |2022-09-01 12:05:37.227916|2022-09-01 12:05:37.226916|2022-09-01 12:05:37.228916|2022-09-01 12:05:38.227916|
+---+--------------------------+--------------------------+--------------------------+--------------------------+
```

*Note: the code assumes a SparkSession object already exists as a variable named spark.
Subscribe to Multiple Topics in Kafka Consumer
When fetching records from Kafka bootstrap servers, we can specify one consumer to subscribe to more than one topic. This can be done in the init function by passing a list of topics:

```
topics = ['kontext-events', 'other-events']
bootstrap_servers = 'localhost:9092'
consumer = KafkaConsumer(
    *topics,
    client_id='local-test',
    bootstrap_servers=bootstrap_servers,
    auto_offset_reset='earliest'
)
```

Alternatively, we can use the subscribe() function to subscribe to multiple topics. We can also use regular expressions to match topics via patterns:

```
subscribe(topics=(), pattern=None, listener=None)
```
Kafka Consumer - Fetch Records from Beginning
This code snippet reads data from a topic's beginning offset by utilizing the seek_to_beginning() API. This sets each partition subscribed by the consumer to fetch records from the earliest available offset.

Example output:

```
polling...
Kontext message 1
Kontext message 2
Kontext message 3
Kontext message 4
Kontext message 5
New kontext events~!!
New events!
polling...
polling...
```
Retrieve Kafka Messages (Records) via Timestamp
Convert Timestamp to Milliseconds since Epoch in Python
This code snippet provides different approaches to convert a Python timestamp object to an integer that denotes the total milliseconds since Epoch (midnight 1st Jan 1970 UTC, UNIX time).

Output:

```
1661169251815.902
1661133251815.902
```

*Your result may differ depending on the time when you run the code. The result is the same no matter whether we use utcnow() or now(), as they both represent the same time. However, if your input timestamp is a string, you need to be careful about time zone information when converting it to a timestamp, i.e. specify the right time zone accordingly.
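One possible approach, sketched with a timezone-aware datetime to sidestep the utcnow()/now() ambiguity mentioned above:

```python
from datetime import datetime, timezone

# A timezone-aware timestamp avoids ambiguity between local time and UTC.
ts = datetime.now(timezone.utc)

# datetime.timestamp() returns seconds since Epoch as a float,
# so multiplying by 1000 yields milliseconds since Epoch.
millis = ts.timestamp() * 1000
print(millis)
```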
Python - Read Messages in Kafka Topic
This code snippet provides an example of reading messages from Kafka topics. In common practice, one Kafka consumer is used to read from one partition of a topic; if there are multiple partitions for the same topic, each consumer can run on a different server or container. Those consumers for different partitions are also configured within one consumer group. In this example, all partitions are assigned to the same consumer.

This code snippet utilizes the Python package kafka-python. It can be installed via the following command if you have pip installed:

```
pip install kafka-python
```

Or we can use the built-in pip in Python 3:

```
python -m pip install kafka-python
```

References

Apache Kafka Python Producer and Consumer Clients Introduction
Install and Run Kafka 3.2.0 On WSL
Install and Run Kafka 2.6.0 On Windows 10
Remove Special Characters from Column in PySpark DataFrame
The Spark SQL function regexp_replace can be used to remove special characters from a string column in a Spark DataFrame. Depending on the definition of special characters, the regular expressions can vary. For instance, [^0-9a-zA-Z\-]+ can be used to match characters that are not alphanumeric or hyphens (-); the regular expression '[@\+\#\$\%\^\!]+' can match these defined special characters. This code snippet replaces special characters with an empty string.

Output:

```
+---+--------------------------+
|id |str                       |
+---+--------------------------+
|1  |ABCDEDF!@#$%%^123456qwerty|
|2  |ABCDE!!!                  |
+---+--------------------------+

+---+-------------------+
| id|       replaced_str|
+---+-------------------+
|  1|ABCDEDF123456qwerty|
|  2|              ABCDE|
+---+-------------------+
```
PySpark DataFrame - Select Columns using select Function
In PySpark, we can use the select function to select a subset of columns or all columns from a DataFrame.

Syntax:

```
DataFrame.select(*cols)
```

This function returns a new DataFrame object based on the projection expression list. This code snippet prints out the following output:

```
+---+----------------+-------+---+
| id|customer_profile|   name|age|
+---+----------------+-------+---+
|  1|    {Kontext, 3}|Kontext|  3|
|  2|      {Tech, 10}|   Tech| 10|
+---+----------------+-------+---+
```
Extract Value from XML Column in PySpark DataFrame
Python Format with Dictionary Object
Access the Index in Python 'for' Loops
Python: Read Data from MS Access Database via ODBC
PySpark - Read and Parse Apache Access Log Text Files
Pandas DataFrame - Get Row and Column Count
Install Miniconda and Anaconda on WSL 2 or Linux
PySpark - Read Data from MariaDB Database
PySpark - Read Data from Oracle Database
Pandas DataFrame - Iterate over Rows
Python: Read Data from Oracle Database
Connect to HBase in Python via HappyBase
Install Python 3.9.1 on WSL
Install Airflow on Windows via Windows Subsystem for Linux (WSL)
Torchvision error: Could not find module image.pyd
Create Environments with Different Python Versions in Anaconda
Call SQL Server Stored Procedure in Python
Read Email from Microsoft 365 via Microsoft Graph API
In the article Python: Send Email via Microsoft Graph API, I provided detailed steps to send email through the msal package. In this article, I am going to show you how to read emails from Microsoft 365 via the Microsoft Graph API.
Spark - Save DataFrame to a Hive Table
Spark (PySpark) - Read Data from a SQL Server Database
PySpark: Convert a JSON String Column in a DataFrame to an Array
PySpark - Convert a Python Array or List to a Spark DataFrame
Python: Send Email via Microsoft Graph API
Set Spark Python Versions via PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
Load XML File into BigQuery
Spark - Read from BigQuery Table
Python: Read Data from BigQuery
Pandas - Save DataFrame to BigQuery
Load JSON File into BigQuery
Connect to PostgreSQL in Spark (PySpark)
Python: Load Data from MySQL
Python: Load Data from Hive
Load Microsoft 365 SharePoint List Data in Python
Read and Write XML Files with Python
Kafka Topic Partitions Walkthrough via Python
Apache Kafka Python Producer and Consumer Clients Introduction
Python: Save Pandas DataFrame to Teradata
Python: Load / Read Multiline CSV File
Create, Insert, Delete, Update Operations on Teradata via JDBC in Python
Python: Read Data from SQLite via JDBC
Data Operations with SQLite Database via Python
AttributeError: type object 'java.sql.Types' has no attribute '__javaclass__'
Pandas DataFrame Plot - Scatter and Hexbin Chart
Pandas DataFrame Plot - Area Chart
Pandas DataFrame Plot - Pie Chart
Convert String to Date using Python
Pandas DataFrame Plot - Line Chart
Pandas DataFrame Plot - Bar Chart
PySpark Read Multiline (Multiple Lines) from CSV File
Connect to SQL Server via JayDeBeApi in Python
The Python JayDeBeApi module allows you to connect from Python to databases using Java JDBC drivers.
PySpark: Convert Python Dictionary List to Spark DataFrame
Create and Read Pickle Files in Python
Convert Python Dictionary List to PySpark DataFrame
Check installed packages in Python
Different programming languages have different package management tools.
Read and parse JSON in Python
Read JSON file as Spark DataFrame in Python / Spark
Spark provides fluent APIs that can be used to read data from a JSON file as a DataFrame object.
Write and read parquet files in Python / Spark
Parquet is a columnar storage format published by Apache. It's commonly used in the Hadoop ecosystem. Many programming language APIs have been implemented to support writing and reading parquet files.
Convert string to date in Python / Spark
This code snippet shows how to convert string to date.
Convert List to Spark Data Frame in Python / Spark
In Spark, the SparkContext.parallelize function can be used to convert a list of objects to an RDD, and then the RDD can be converted to a DataFrame object through SparkSession.
Convert PySpark Row List to Pandas Data Frame
PySpark: Convert Python Array/List to Spark Data Frame
Read Hadoop Credential in PySpark
Data Partition in Spark (PySpark) In-depth Walkthrough
PySpark - Fix PermissionError: [WinError 5] Access is denied
Spark - Save DataFrame to Hive Table
Connect to SQL Server in Spark (PySpark)
Debug PySpark Code in Visual Studio Code
Implement SCD Type 2 Full Merge via Spark Data Frames
PySpark: Convert JSON String Column to Array of Object (StructType) in Data Frame
Connect to Teradata database through Python
Diagrams
Diagram - Connect to HBase in Python
This diagram illustrates connecting to HBase using Python HappyBase package.
Python Libraries to Connect to SQL Server
There are several different ways to connect to SQL Server in Python:

- JDBC (py4j), for example, JayDeBeApi
- ODBC (pyodbc)
- FreeTDS (pymssql)