Azure App Service IIS Log Analytics using Pandas

Azure App Service is a product available on Azure that can be used to host .NET, Java, PHP, Node.js and other framework-based web applications. Azure App Service provides features like Application Insights which can be used to analyze your application's performance. This article provides an alternative approach: using Python and pandas to analyze the IIS logs generated by your Windows-based container or non-container application.

There are also many commercial software products and frameworks that can be used to analyze logs.

About logs in ASP.NET Core

There are two types of logs we can usually use for an ASP.NET Core application hosted on Azure: application logs and web server logs. Application logs are the ones you add through loggers, for example, logs about an error within a function or an API controller action. Web server logs in this case are the IIS logs, since the application is hosted on a Windows machine.

The following image shows an example of an application log:


Web server (IIS) logs

The following text is a sample of the web server logs:

#Software: Microsoft Internet Information Services 8.0
#Fields: date time s-sitename cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
2022-09-02 00:58:36 KONTEXT GET / X-ARR-LOG-ID=2944fc92-6216-4866-92a2-a4b2****a994 443 - 66.249.68.42 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - kontext.tech 200 0 0 17825 778 96

In each log file, the first two lines are metadata about the log file:

  • The first line indicates the server software that generated the logs. In this case, it is IIS 8.0.
  • The second line lists the fields available for each log record. It includes information about the client and the request, for example, the time when the request was initiated, the status, the result, the user agent, the IP address, the response time, etc. We will use it to derive the column names later.

The rest of the file consists of space (" ") delimited log records. For example, the third line in the sample above is a log record about the Google crawler. It crawled the index page of Kontext (/) and it took 96 milliseconds to respond. The client IP address is 66.249.68.42.

We will discuss how to parse these logs in the following sections.

Usage of the logs

We can use the logs to generate many insights. They can answer questions like the following:

  • Which URLs are being frequently requested?
  • Which pages are running very slow?
  • Which IP addresses may indicate malicious activity?
  • What are the common browser types or OS of the users?
  • ...

All this information can be useful to help you make the right decisions about your web applications. Again, this information can also be extracted using commercial analytics tools.

Where are the logs stored?

Depending on your configuration of the App Service, the web server logs can be stored directly on your web hosting server or in an Azure Blob Storage container. Logging might also be turned off if you choose that option:

[Image: 2022090602644-image.png]

You can use FTP tools to connect to your web server to download the logs if they are stored there; alternatively, you can download them from your storage blob container. Azure Storage Explorer is a good GUI tool for downloading. You can also use the SCM (Kudu) site to download them.
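
If your logs are written to a blob container, you can also download them programmatically with the azure-storage-blob package. The following is a minimal sketch; the connection string, container name and local folder are placeholders that you need to replace with your own values:

import os
from azure.storage.blob import ContainerClient

# Placeholders - replace with your own storage connection string and container name
conn_str = "<your-storage-connection-string>"
container = ContainerClient.from_connection_string(conn_str, container_name="app-service-logs")

os.makedirs("RawLogs-1", exist_ok=True)
for blob in container.list_blobs():
    # Download each log blob into the local folder, keeping only the file name
    local_path = os.path.join("RawLogs-1", os.path.basename(blob.name))
    with open(local_path, "wb") as f:
        f.write(container.download_blob(blob).readall())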

The following sections assume you have already downloaded the logs into a local folder so that we can use Python to analyze them. A similar approach can be applied to your blob storage directly using other big data tools like Spark (or Azure HDInsight).

Analyze web server logs

Now let's start to write the application to analyze the logs. There are three major steps in the example application I am going to show:

  • Read the log files and parse them as pandas DataFrames.
  • Consolidate (union) the log DataFrames into a single one for analysis. Please make sure your machine has sufficient memory to keep all the data in memory.
  • Analyze the data using pandas.

Prerequisites 

Make sure you have pandas installed. You can install it via pip:

python -m pip install pandas
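
The later snippets also export results with DataFrame.to_excel, which requires an Excel writer engine such as openpyxl, so you may want to install that too:

python -m pip install openpyxl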

I am using Python 3.8.2 for this tutorial, but the code should also run on later Python versions.

Create the Python script

Create a Python script with the following content:

import pandas as pd
import os
from os.path import isfile, join

if __name__ == "__main__":
    # Column names taken from the '#Fields:' metadata line of the IIS logs
    header = "date time s-sitename cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken"
    lines_to_skip = 2
    sep = " "
    columns = header.split(" ")

    log_folder = os.path.abspath("../../RawLogs-1")
    print(f'Parsing log files in directory: {log_folder}')

    files = [join(log_folder, item) for item in os.listdir(log_folder)
             if isfile(join(log_folder, item))]
    print(files)

    frames = []
    for file in files:
        print(f'Loading file {file}...')
        # Skip the two metadata lines and supply the column names explicitly;
        # header=None stops pandas from treating the first log record as a header row
        df = pd.read_csv(file, sep=sep, skiprows=lines_to_skip,
                         header=None, names=columns)
        frames.append(df)
    # Union all files into a single DataFrame and remove duplicated records
    df_logs = pd.concat(frames)
    df_logs = df_logs.drop_duplicates()
    
    print(df_logs.dtypes)
    print(df_logs)

The script reads all the log files from a relative folder named '../../RawLogs-1', which contains the IIS logs I downloaded. It then reads each file using the pandas.read_csv API before concatenating them into one single DataFrame. There are a few minor details to pay attention to:

  • The code skips the first two lines in each file (skiprows=lines_to_skip). As mentioned previously, they are metadata lines. It also passes header=None and names=columns so that the first log record in each file is not mistaken for a header row.
  • The code uses space (' ') as the delimiter.
  • It also uses drop_duplicates to remove duplicated records.
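
Instead of hardcoding the header string, you could also derive the column names from the '#Fields:' metadata line of one of the log files. A minimal sketch of such a helper (assuming the file layout shown earlier):

def read_columns(log_file):
    """Derive column names from the '#Fields:' metadata line of an IIS log file."""
    with open(log_file) as f:
        for line in f:
            if line.startswith('#Fields:'):
                # Drop the '#Fields:' prefix and split the remainder on spaces
                return line.replace('#Fields:', '').split()
    raise ValueError(f'No #Fields line found in {log_file}')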

The schema output looks like the following:

date               object
time               object
s-sitename         object
cs-method          object
cs-uri-stem        object
cs-uri-query       object
s-port              int64
cs-username        object
c-ip               object
cs(User-Agent)     object
cs(Cookie)         object
cs(Referer)        object
cs-host            object
sc-status           int64
sc-substatus        int64
sc-win32-status     int64
sc-bytes            int64
cs-bytes            int64
time-taken          int64
dtype: object

Pandas implicitly infers the schema (data types) based on the content of the data.

The DataFrame print looks like the following screenshot:

[Image: 2022090604545-image.png]
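
Since the date and time columns are inferred as plain strings (object), you may also want to combine them into a proper datetime column before doing any time-based analysis. A small sketch, using the consolidated df_logs DataFrame from above:

    # Combine the 'date' and 'time' string columns into a single datetime column
    df_logs['timestamp'] = pd.to_datetime(df_logs['date'] + ' ' + df_logs['time'])
    # Example: count requests per day
    print(df_logs.groupby(df_logs['timestamp'].dt.date).size())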

Now let's look into some examples using this consolidated pandas DataFrame.

Find out the slowest pages

The first thing I looked into is the slowest pages (URLs, column cs-uri-stem). We can directly look at the time-taken column.

    # Print out the top 100 URLs that take the longest time
    df_slowest = df_logs.sort_values(
        by=['time-taken'], ascending=[False])[['date', 'cs-uri-stem', 'time-taken']]
    print(df_slowest.head(100))
    df_slowest.head(100).to_excel("slowest-url-top-100.xlsx")

The above code snippet simply orders the DataFrame by the time-taken column and then prints out the date, URL and time taken. The output looks like the following:

[Image: 2022090604854-image.png]

The result is also saved to an Excel file.

With this information, I will be able to know which pages or URLs to look into to improve performance. 

Info - The sorted DataFrame keeps its original row index; the time-taken column is included in the selection so that the response time of each request is visible in the output.
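
Slow responses can also be caused by errors or redirects rather than genuinely slow pages, so it may help to filter the DataFrame first. A small sketch, assuming we only care about successful GET requests:

    # Keep only successful GET requests so errors and redirects don't skew the results
    df_ok = df_logs[(df_logs['cs-method'] == 'GET') & (df_logs['sc-status'] == 200)]
    df_slowest_ok = df_ok.sort_values(by='time-taken', ascending=False)[
        ['date', 'cs-uri-stem', 'time-taken']]
    print(df_slowest_ok.head(100))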

Find out slow URLs by MAX, MIN, MEAN and MEDIAN

A more balanced approach is to look at the time-taken column using aggregations, since a single request might not be representative enough. With a pandas DataFrame, we can easily aggregate on columns.

    # Print out the slowest URLs that take the longest time.
    df_agg = df_logs[['date', 'cs-uri-stem', 'time-taken']].groupby(
        by=["date", "cs-uri-stem"]).aggregate({"time-taken":
                                               ['max', 'min', 'mean', 'median']})
    df_agg.columns = ["_".join(x) for x in df_agg.columns.ravel()]
    print(df_agg.sort_values(by='time-taken_mean', ascending=False).head(20))
    df_agg.sort_values(by='time-taken_mean', ascending=False).to_excel("time-taken-by-url.xlsx")

The above code snippet aggregates using the max, min, mean and median functions. It then flattens the multi-level column names and sorts the values by the mean before saving the result to an Excel file.

The output Excel looks like the following screenshot:

[Image: 2022090605533-image.png]

This information can be used to identify poorly performing pages.
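
To avoid chasing URLs that were only requested once or twice, you can also include a request count in the aggregation so that slow pages can be weighted by traffic. A minimal sketch based on the same grouping (the threshold of 10 requests is arbitrary):

    # Add a request count so slow URLs can be weighted by how often they are hit
    df_agg2 = df_logs[['date', 'cs-uri-stem', 'time-taken']].groupby(
        by=['date', 'cs-uri-stem']).aggregate({'time-taken': ['count', 'mean', 'median']})
    df_agg2.columns = ['_'.join(x) for x in df_agg2.columns]
    # Only consider URLs requested at least 10 times on a given date
    print(df_agg2[df_agg2['time-taken_count'] >= 10]
          .sort_values(by='time-taken_mean', ascending=False).head(20))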

Other use cases

You can use the DataFrame to do many other types of analysis based on your requirements, for example, understanding user demographics by joining to an IP lookup dataset. You can also visualize the data easily using plotting APIs: Plotting with Pandas DataFrame.
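
The questions raised earlier about frequently requested URLs, common user agents and potentially suspicious IP addresses can also be answered with simple value counts. A small sketch:

    # Most frequently requested URLs
    print(df_logs['cs-uri-stem'].value_counts().head(20))
    # Most common user agents (browsers, bots, etc.)
    print(df_logs['cs(User-Agent)'].value_counts().head(20))
    # Client IP addresses generating the most requests - unusually high counts may be worth investigating
    print(df_logs['c-ip'].value_counts().head(20))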

Summary

This article provides some simple approaches to analyze logs directly using open source libraries. If you use Apache web servers, refer to this article for another example of analyzing logs using PySpark: PySpark - Read and Parse Apache Access Log Text Files.

Feel free to post a comment if you have any questions. 
