My first post about statsmodels - so be nice! :)

statsmodels
Seaborn
Pandas
Python vs. R
Formula API
RDatasets in Python
Patsy in Python
Author

Bernardo Freire

Published

August 1, 2023

Introduction: Play with GLMs

During my studies at HSLU I learned to use mostly R for statistics and for some machine learning courses as well. At work, however, most companies I have seen so far use Python as their main programming language; my previous two employers used both Python and R. This article is not going to discuss which language is better, but rather focuses on how to use Python and the statsmodels library, which produces output very similar to what you would get in R.

One important and cool thing about the statsmodels library is that it comes with a lot of data sets already included. You can find them here. The data set I use is an R data set from Rdatasets; in particular, the dataset car_prices from the package modeldata.

## Data imports
import statsmodels.api as sm  # aliased as sm here and in the code below

df = (
    sm
    .datasets
    .get_rdataset("car_prices", package='modeldata')
    .data
)

Another important thing about this blog post is the use of Quarto. Quarto allowed me to write it directly from a Jupyter notebook without any fancy or complicated transformations. For more information, please visit the quarto website.

Now, let's start with the analysis. The first thing we do is import the necessary libraries and the data set.

::: {#code .cell code-fold='false' execution_count=1}

## General imports
import warnings
warnings.filterwarnings('ignore')

## Data manipulation imports
import pandas as pd
import numpy as np

## Display imports
from IPython.display import display, Markdown

## statsmodels imports
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.genmod.families.family as fam
from patsy import dmatrices

## Plot imports
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (5,5/2.5)
import seaborn as sns
sns.set_theme(style='whitegrid')
sns.set_context('paper', font_scale=1.25)


## Data imports
df = (
    sm
    .datasets
    .get_rdataset(
        "car_prices", 
        package='modeldata',
    )
    .data
)

:::

Exploratory Data Analysis

The data exploration is a crucial step before fitting a model. It allows us to understand the data and to identify potential problems. In this notebook, we explore the data only briefly, as the focus of this article is understanding and applying statsmodels.

The data set contains 804 observations (rows) and 18 variables (columns): the response Price plus 17 predictors. It contains information about car prices, with the following variables: Price; Mileage; Cylinder; Doors; Cruise; Sound; Leather; Buick; Cadillac; Chevy; Pontiac; Saab; Saturn; convertible; coupe; hatchback; sedan; and wagon.
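A quick sanity check of these dimensions and names (a minimal sketch, using the df loaded above):

## Check the shape and column names of the data set
print(df.shape)             # (804, 18)
print(df.columns.tolist())  # Price, Mileage, Cylinder, Doors, ...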

First, the two approximately continuous variables Price and Mileage are investigated by means of graphical analysis. The following histograms show their distributions.

height = plt.rcParams['figure.figsize'][0]
aspect = plt.rcParams['figure.figsize'][0]/plt.rcParams['figure.figsize'][1] / 2

g1 = sns.displot(
    data = df,
    x = 'Price',
    kde = True,
    height = height,
    aspect = aspect,
)
plt.title('Distribution of car price')
plt.xlabel('Price')
plt.ylabel('Count of occurrences')
plt.show(g1)

g2 = sns.displot(
    data = df,
    x = 'Mileage',
    kde = True,
    height = height,
    aspect = aspect,
)
plt.title('Distribution of Mileage')
plt.xlabel('Mileage')
plt.ylabel('Count of occurrences')
plt.show(g2)
(a) Distribution of the variable Price
(b) Distribution of the variable Mileage
Figure 1: Distribution of price and mileage variables.

Figure 1 (a) shows a right-skewed distribution with a peak around 15k dollars and price values ranging from 8k dollars to 70k dollars. On the other hand, Figure 1 (b) shows a more balanced distribution with a peak around 20k miles and mileage values ranging from 266 miles up to 50k miles.
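These ranges can be verified numerically; a minimal sketch using pandas' summary statistics:

## Numeric check of the ranges read off the plots
print(df[['Price', 'Mileage']].describe().round(1))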

Proceeding to the next two variables, Cylinder and Doors, one can see fewer possible values: Cylinder ranges from 4 to 8, while Doors takes only the values 2 and 4.

height = plt.rcParams['figure.figsize'][0]
aspect = plt.rcParams['figure.figsize'][0]/plt.rcParams['figure.figsize'][1] / 2

g = sns.displot(
    data = df,
    x = 'Cylinder',
    discrete = True,
    height = height,
    aspect = aspect,
)
plt.title('Distribution of Cylinder')
plt.xlabel('Cylinder')
plt.ylabel('Count of occurrences')
plt.show(g)

g = sns.displot(
    data = df,
    x = 'Doors',
    discrete = True,
    shrink = 0.5,
    height = height,
    aspect = aspect,
)
plt.title('Distribution of Doors')
plt.xlabel('Doors')
plt.ylabel('Count of occurrences')
plt.show(g)
(a) Distribution of the variable cylinder
(b) Distribution of the variable doors
Figure 2: Distribution of cylinder and doors variables.

Figure 2 (a) surprised me quite a bit. I had anticipated some cars to feature more than 8 cylinders, given that this dataset pertains to American cars. Instead, the cylinder count spans from 4 to 8, and, as one would expect, the number of cylinders is always even.

Figure 2 (b) shows the number of doors per car. The values are either 2 or 4, with the latter being more common. This again surprised me, as I would have expected the number of doors to be higher for American cars (SUVs).
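The discrete counts behind Figure 2 can also be tabulated directly; a minimal sketch:

## Tabulate the discrete values shown in Figure 2
print(df['Cylinder'].value_counts().sort_index())
print(df['Doors'].value_counts().sort_index())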

Next, we check the distribution of the makes of the cars in the dataset. For this analysis, we first reshape the dataframe from wide to long format using melt() and then calculate the sum per brand by means of the groupby() and sum() methods as follows:

brands = (
    df
    [['Buick', 'Cadillac', 'Chevy', 'Pontiac', 'Saab']]
    .melt()
    .groupby('variable')
    .sum()
    .reset_index()
)

brands.head()
Table 1: Cars by brand after melting, grouping, and summing

   variable  value
0  Buick        80
1  Cadillac     80
2  Chevy       320
3  Pontiac     150
4  Saab        114

After aggregation, the visualization of the data is as follows:

sns.catplot(
    data    = brands.sort_values('value', ascending=False),
    x       = 'variable',
    y       = 'value',
    kind    = 'bar',
    height  = 5,
    aspect  = 2,
)
plt.title('Number of cars by brand')
plt.xlabel('Brand')
plt.ylabel('Number of cars')
plt.show()
Figure 3: Number of cars by make

In descending order, the most common make is Chevy, followed by Pontiac, Saab, Buick, and Cadillac.

At this point it should be mentioned that the data set is not balanced: the distribution of the features is not uniform across the different makes and car features. In a real project this would be a point to consider and address; for the purpose of this post, however, it is not necessary.

Modeling

Modeling with statsmodels becomes straightforward once the formula API and provided documentation are well understood. I encountered a minor challenge in grasping the formula API, but once I comprehended it, the usage turned out to be quite intuitive.

Let's delve into the formula API (statsmodels.formula.api). This feature enables us to employ R-style formulas for specifying models. To illustrate, a linear model can be specified and fitted as follows:

import statsmodels.formula.api as smf

model = smf.ols(
    formula='y ~ x1 + x2 + x3',
    data=df,
)
results = model.fit()

The formula API leverages the patsy package (Patsy Package). The same syntax also scales to more intricate models. Additionally, it's important to mention that the formula API automatically incorporates an intercept into the formula if one isn't explicitly specified. For cases where the intercept is undesired, a -1 can be used within the formula.
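To see what the formula API builds under the hood, one can call patsy directly. Below is a minimal sketch with a hypothetical toy dataframe, showing how -1 drops the intercept column:

import pandas as pd
from patsy import dmatrices

## Hypothetical toy data, for illustration only
toy = pd.DataFrame({'y': [1.0, 2.0, 3.0], 'x1': [0.5, 1.5, 2.5]})

## With the default intercept
_, X = dmatrices('y ~ x1', data=toy)
print(X.design_info.column_names)        # ['Intercept', 'x1']

## With the intercept removed via -1
_, X_noint = dmatrices('y ~ x1 - 1', data=toy)
print(X_noint.design_info.column_names)  # ['x1']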

With the glm class, a vast array of models becomes accessible by importing the family definitions as follows:

import statsmodels.genmod.families.family as fam

The fam import is necessary for specifying the family of the model. The families available are:

Table 2: GLM Families

Name                                         Description
Family(link, variance[, check_link])         The parent class for one-parameter exponential families.
Binomial([link, check_link])                 Binomial exponential family distribution.
Gamma([link, check_link])                    Gamma exponential family distribution.
Gaussian([link, check_link])                 Gaussian exponential family distribution.
InverseGaussian([link, check_link])          InverseGaussian exponential family.
NegativeBinomial([link, alpha, check_link])  Negative Binomial exponential family (corresponds to NB2).
Poisson([link, check_link])                  Poisson exponential family.
Tweedie([link, var_power, eql, check_link])  Tweedie family.
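As a short sketch of how these classes are used with the sm and fam aliases imported above: each family is instantiated, optionally with a non-default link function from statsmodels.genmod.families.links (the capitalized link class names assume a recent statsmodels version):

## Instantiating family objects, optionally overriding the default link
gaussian = fam.Gaussian()                            # identity link by default
poisson = fam.Poisson()                              # log link by default
gamma_log = fam.Gamma(link=sm.families.links.Log())  # override the default inverse-power link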

Fitting the model and analyzing the results follow the same pattern as in R: first define the model, then fit it, then analyze the results. The details of the fit can be accessed using the summary() method.
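The summary below comes from a Gaussian GLM on Price. As a hedged sketch, a call of the following form could produce this kind of output; the exact formula is my assumption (the -1 matches the absence of an intercept row in the results):

## Hypothetical reconstruction of the fit, using the aliases imported above
formula = (
    'Price ~ Mileage + Cylinder + Doors + Cruise + Sound + Leather'
    ' + Buick + Cadillac + Chevy + Pontiac + Saab + Saturn'
    ' + convertible + coupe + hatchback + sedan + wagon - 1'
)
model = smf.glm(formula=formula, data=df, family=fam.Gaussian())
result = model.fit()
print(result.summary())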

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  Price   No. Observations:                  804
Model:                            GLM   Df Residuals:                      789
Model Family:                Gaussian   Df Model:                           14
Link Function:               Identity   Scale:                      8.4460e+06
Method:                          IRLS   Log-Likelihood:                -7544.8
Date:                Sun, 03 Sep 2023   Deviance:                   6.6639e+09
Time:                        21:02:49   Pearson chi2:                 6.66e+09
No. Iterations:                     3   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Mileage        -0.1842      0.013    -14.664      0.000      -0.209      -0.160
Cylinder     3659.4543    113.345     32.286      0.000    3437.303    3881.606
Doors        1654.6326    174.525      9.481      0.000    1312.570    1996.695
Cruise        340.8695    295.962      1.152      0.249    -239.205     920.944
Sound         440.9169    234.484      1.880      0.060     -18.664     900.497
Leather       790.8220    249.745      3.167      0.002     301.331    1280.313
Buick       -1911.3752    336.292     -5.684      0.000   -2570.495   -1252.256
Cadillac      1.05e+04    409.274     25.663      0.000    9701.132    1.13e+04
Chevy       -3408.2863    213.274    -15.981      0.000   -3826.295   -2990.278
Pontiac     -4258.9628    256.358    -16.613      0.000   -4761.416   -3756.509
Saab         9419.1227    331.211     28.438      0.000    8769.960    1.01e+04
Saturn      -2859.0803    358.709     -7.970      0.000   -3562.137   -2156.023
convertible  1.258e+04    525.984     23.922      0.000    1.16e+04    1.36e+04
coupe        1559.7620    395.946      3.939      0.000     783.723    2335.801
hatchback   -4977.3196    339.046    -14.680      0.000   -5641.837   -4312.803
sedan       -3064.7176    215.007    -14.254      0.000   -3486.123   -2643.312
wagon        1384.6400    364.920      3.794      0.000     669.410    2099.870
===============================================================================

After fitting a GLM using the glm class in statsmodels, you can obtain a summary of the model’s results using the .summary() method on the fitted model object. Here’s a general overview of the information typically included in the summary output:

  1. Model Information:
    • The name of the model.
    • The method used for estimation (e.g., maximum likelihood).
    • The distribution family (e.g., Gaussian, binomial, Poisson).
    • The link function used in the model (e.g., logit, identity, etc.).
    • The number of observations used in the model.
  2. Model Fit Statistics:
    • Log-likelihood value.
    • AIC (Akaike Information Criterion) and/or BIC (Bayesian Information Criterion).
    • Deviance and Pearson chi-square statistics.
    • Dispersion parameter (if applicable).
  3. Coefficients:
    • Estimated coefficients for each predictor variable.
    • Standard errors of the coefficients.
    • z-scores and p-values for testing the significance of the coefficients.
  4. Confidence Intervals:
    • Confidence intervals for each coefficient, often at a default level like 95%.
  5. Hypothesis Testing:
    • Hypothesis tests for the coefficients, typically with the null hypothesis being that the coefficient is zero.
  6. Goodness of Fit:
    • Likelihood-ratio test for overall model fit.
    • Tests for assessing the contribution of individual predictors to the model.
  7. Diagnostic Information:
    • Information about model assumptions and diagnostics, depending on the type of GLM and the method used.
  8. Residuals:
    • Information about residuals, which can include measures like deviance residuals, Pearson residuals, etc.
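Most of these quantities are also exposed programmatically on the fitted results object; a minimal sketch using attribute names from the GLMResults API:

## Accessing pieces of the summary programmatically
print(result.params)      # estimated coefficients
print(result.bse)         # standard errors
print(result.pvalues)     # p-values
print(result.conf_int())  # confidence intervals (95% by default)
print(result.aic, result.bic, result.deviance)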

Refer to the official documentation for the most accurate and up-to-date information about the summary output for your specific use case.

Conclusion

Statsmodels is a powerful Python library for statistical modeling and hypothesis testing, making it an excellent bridge for R users moving to Python, or for Python users who prefer R's approach to certain problems. It offers familiar syntax and functionality for regression, ANOVA, and more. Its integration with Python's data analysis ecosystem, in particular pandas, allows seamless data manipulation. With support for a wide range of statistical methods and a comprehensive summary output, statsmodels makes migrating from R nearly effortless, enabling R users to harness its capabilities in a Python environment.