Azure App Service IIS Log Analytics using Pandas
There are many commercial software packages and frameworks that can be used to analyze logs; this article walks through analyzing Azure App Service IIS logs yourself using pandas and a few lines of Python.
About logs in ASP.NET Core
There are two types of logs we can usually use for an ASP.NET Core application hosted on Azure: application logs and web server logs. Application logs are the ones you write through loggers, for example, logs about an error within a function or an API controller action. Web server logs in this case are the IIS logs, if the application is hosted on a Windows machine.
The following image shows an example of an application log:
Web server (IIS) logs
The following text is a sample of the server logs:
#Software: Microsoft Internet Information Services 8.0
#Fields: date time s-sitename cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
2022-09-02 00:58:36 KONTEXT GET / X-ARR-LOG-ID=2944fc92-6216-4866-92a2-a4b2****a994 443 - 66.249.68.42 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - kontext.tech 200 0 0 17825 778 96
For each log file, the first two lines are metadata about the log file:
- The first line indicates the server or software that generated the logs. In this case, it is IIS 8.0.
- The second line lists the fields available for each log record. They include information about the request and the client, for example, the time when the request was initiated, the status, the result, the user agent, the IP address, the response time, etc. We will use this line to derive the column names later.
The rest of the file consists of space (" ") delimited log records. For example, the third line in the sample file is a log record about the Google crawler. It crawled the index page of Kontext (/) and it took 96 milliseconds to respond. The IP address is 66.249.68.42.
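To see which value belongs to which field, a minimal sketch (using the sample record above) is to zip the names from the #Fields line with a space-split record:

# Map the #Fields names onto one space-split log record from the sample above.
fields = ("date time s-sitename cs-method cs-uri-stem cs-uri-query s-port "
          "cs-username c-ip cs(User-Agent) cs(Cookie) cs(Referer) cs-host "
          "sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken").split()
record = ("2022-09-02 00:58:36 KONTEXT GET / "
          "X-ARR-LOG-ID=2944fc92-6216-4866-92a2-a4b2****a994 443 - 66.249.68.42 "
          "Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) "
          "- - kontext.tech 200 0 0 17825 778 96").split()
for name, value in zip(fields, record):
    print(f"{name}: {value}")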
We will discuss how to parse these logs in the following sections.
Usage of the logs
We can use the logs to generate many insights. They can answer questions such as:
- Which URLs are being frequently requested?
- Which pages are running very slowly?
- Which IP addresses might be sources of malicious attacks?
- What are the common browser types or OS of the users?
- ...
All this information can help you make the right decisions about your web applications. Again, it can also be extracted using commercial analytics tools.
Where are the logs stored?
Depending on your App Service configuration, the web server logs can be stored directly on your web hosting server or in an Azure Blob Storage container. Logging might also be turned off if you chose that option:
You can use FTP tools to connect to your web server to download the logs if they are stored there; alternatively, you can download them from your blob storage container. Azure Storage Explorer is a good GUI tool for downloading them. You can also download them through the SCM (Kudu) site.
The following sections assume you have already downloaded the logs into a local folder so that we can use Python to analyze them. A similar approach can be applied to your blob storage directly using other big data tools like Spark (or Azure HDInsight).
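If your logs are written to a blob storage container and you want to script the download step as well, a rough sketch using the azure-storage-blob package could look like the following (the environment variable name and the container name "web-server-logs" are assumptions for illustration, not values from this article):

# Download all IIS log blobs from a container into a local folder.
import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("web-server-logs")

os.makedirs("RawLogs-1", exist_ok=True)
for blob in container.list_blobs():
    local_path = os.path.join("RawLogs-1", os.path.basename(blob.name))
    print(f"Downloading {blob.name} to {local_path}...")
    with open(local_path, "wb") as f:
        f.write(container.download_blob(blob.name).readall())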
Analyze web server logs
Now let's start to write the application to analyze the logs. There are three major steps for the example application I am going to show:
- Read the log files and parse them as pandas DataFrames.
- Consolidate (union) the log DataFrames into a single one for analysis. Please make sure your machine has sufficient memory to keep all of the data in memory.
- Analyze the data using pandas.
Prerequisites
Make sure you have pandas installed. You can install via pip:
python -m pip install pandas
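The later examples also export results with DataFrame.to_excel, which requires an Excel engine such as openpyxl. If you want to run those parts, install it as well (an extra step not mentioned above):

python -m pip install openpyxl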
I am using Python 3.8.2 for this tutorial, but the code should also run on later Python versions.
Create the Python script
Create a Python script with the following content:
import pandas as pd
import os
from os.path import isfile, join

if __name__ == "__main__":
    # Column headers derived from the #Fields metadata line of the IIS logs
    header = "date time s-sitename cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken"
    lines_to_skip = 2
    sep = " "
    columns = header.split(" ")
    log_folder = os.path.abspath("../../RawLogs-1")
    print(f'Parsing log files in directory: {log_folder}')
    files = [join(log_folder, item) for item in os.listdir(log_folder)
             if isfile(join(log_folder, item))]
    print(files)
    frames = []
    for file in files:
        print(f'Loading file {file}...')
        # Skip the two metadata lines and supply the column names explicitly
        # so the first log record is not consumed as a header row.
        df = pd.read_csv(file, sep=sep, skiprows=lines_to_skip,
                         header=None, names=columns)
        frames.append(df)
    df_logs = pd.concat(frames)
    df_logs = df_logs.drop_duplicates()
    print(df_logs.dtypes)
    print(df_logs)
The script reads all the log files from a relative folder named '../../RawLogs-1', which I downloaded from the IIS logs. It then reads each file using the pandas.read_csv API before concatenating them into one single DataFrame. There are a few minor details to pay attention to:
- The code skips the first two metadata lines in each file (skiprows=lines_to_skip, which is 2). As mentioned previously, they are metadata lines.
- The code uses space (' ') as the delimiter.
- It supplies the column names explicitly (header=None, names=columns) so the first log record is not treated as a header row.
- It also uses drop_duplicates to remove duplicated records.
The schema output looks like the following:
date               object
time               object
s-sitename         object
cs-method          object
cs-uri-stem        object
cs-uri-query       object
s-port              int64
cs-username        object
c-ip               object
cs(User-Agent)     object
cs(Cookie)         object
cs(Referer)        object
cs-host            object
sc-status           int64
sc-substatus        int64
sc-win32-status     int64
sc-bytes            int64
cs-bytes            int64
time-taken          int64
dtype: object
Pandas implicitly infers the schema (data types) based on the content of the data.
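Note that the date and time columns are read as plain strings (object). If you later want to filter or resample by time, an optional extra step (not part of the original script) is to combine them into a real timestamp column:

# Optional: combine the date and time string columns into a single datetime column.
df_logs['timestamp'] = pd.to_datetime(df_logs['date'] + ' ' + df_logs['time'])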
The printed DataFrame looks like the following screenshot:
Now let's look into some examples using this consolidated pandas DataFrame.
Find out the slowest pages
The first thing I looked into is the slowest pages (URLs, column cs-uri-stem). We can directly look at the time-taken column.
# Print out the top 1000 URLs that take the longest
df_slowest = df_logs.sort_values(
    by=['time-taken'], ascending=[False])[['date', 'cs-uri-stem']]
print(df_slowest.head(1000))
df_slowest.head(1000).to_excel("slowest-url-top-1000.xlsx")
The above code snippet simply orders the DataFrame by the time-taken column and then prints out the date and URL. The output looks like the following:
The result is also stored in an Excel file.
With this information, I will be able to know which pages or URLs to look into to improve performance.
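Note that the selection above keeps only the date and cs-uri-stem columns, so the time-taken column itself does not appear in the sorted output. If you also want to see the measured durations, a small variation on the snippet keeps the column in the selection:

# Keep the time-taken column so the measured durations appear in the output.
df_slowest_with_time = df_logs.sort_values(
    by=['time-taken'], ascending=False)[['date', 'cs-uri-stem', 'time-taken']]
print(df_slowest_with_time.head(100))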
Find out slow URLs by MAX, MIN, MEAN and MEDIAN
A more balanced approach is to look at the time-taken column using aggregations, since a single request might not be representative. With a pandas DataFrame, we can easily aggregate on columns.
# Print out the slowest URLs that take the longest time.
df_agg = df_logs[['date', 'cs-uri-stem', 'time-taken']].groupby(
    by=["date", "cs-uri-stem"]).aggregate({"time-taken": ['max', 'min', 'mean', 'median']})
df_agg.columns = ["_".join(x) for x in df_agg.columns.ravel()]
print(df_agg.sort_values(by='time-taken_mean', ascending=False).head(20))
df_agg.sort_values(by='time-taken_mean', ascending=False).to_excel("time-taken-by-url.xlsx")
The above code snippet aggregates using the max, min, mean and median functions. It then flattens the column index and sorts the values by the mean before saving the result to an Excel file.
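Because a URL that was requested only once can still dominate the mean, one possible refinement (a sketch, not part of the original snippet) is to also aggregate a request count and ignore URL/date combinations with too few requests:

# Aggregate a request count alongside the timing statistics.
df_agg2 = df_logs[['date', 'cs-uri-stem', 'time-taken']].groupby(
    by=["date", "cs-uri-stem"]).aggregate(
        {"time-taken": ['max', 'min', 'mean', 'median', 'count']})
df_agg2.columns = ["_".join(col) for col in df_agg2.columns]
# Keep only combinations with at least 10 requests (the threshold is arbitrary).
df_popular = df_agg2[df_agg2['time-taken_count'] >= 10]
print(df_popular.sort_values(by='time-taken_mean', ascending=False).head(20))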
The output Excel looks like the following screenshot:
This information can be used to identify poorly performing pages.
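If you prefer a quick visual instead of an Excel file, a sketch (assuming matplotlib is installed, which is not covered above) is to chart the slowest URL/date combinations by mean response time:

import matplotlib.pyplot as plt

# Horizontal bar chart of the 20 slowest (date, URL) combinations by mean time-taken.
slowest = df_agg.sort_values(by='time-taken_mean', ascending=False).head(20)
slowest['time-taken_mean'].plot(kind='barh')
plt.xlabel('Mean time-taken (ms)')
plt.tight_layout()
plt.savefig('slowest-urls.png')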
Other use cases
You can use the DataFrame to do many other types of analysis based on your requirements, for example, understanding user demographics by joining with an IP lookup dataset. You can also visualize the data easily using the plotting APIs: Plotting with Pandas DataFrame.
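For instance, a few quick one-liners on the same DataFrame (simple sketches that map to the questions listed earlier in this article) can show which URLs, clients and browsers generate the most traffic:

# Most frequently requested URLs
print(df_logs['cs-uri-stem'].value_counts().head(20))
# Client IP addresses with the most requests (candidates to review for abuse)
print(df_logs['c-ip'].value_counts().head(20))
# Distribution of HTTP status codes
print(df_logs['sc-status'].value_counts())
# Most common user agents
print(df_logs['cs(User-Agent)'].value_counts().head(20))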
Summary
This article simply provides some approaches to analyze logs directly using open-source libraries. If you use Apache servers, refer to this article for another example of analyzing logs using PySpark: PySpark - Read and Parse Apache Access Log Text Files.
Feel free to post a comment if you have any questions.