Pandas Method Chaining

Pandas

Readability

Method Chaining

Author

Bernardo Freire

Published

August 13, 2023

Introduction

Method chaining is a way to combine multiple pandas operations together into a single line or statement. It is not necessary to use method chaining, but it can make your code more concise and easier to read. In this post, we will look at how to use method chaining to clean and transform a dataset.

Honestly, since I started using method chaining, I have found it difficult to go back to the old way of writing pandas code. I hope that by the end of this post, you will feel the same way. :)

Let’s get started! First, let’s see what method chaining looks like when importing a dataset:

## Data imports
import statsmodels.api as sm

df = (
    # use sm 
    sm                          
    # get datasets attribute
    .datasets                   
    # use get_rdataset method
    .get_rdataset(              
        "car_prices", 
        package='modeldata'
    )
    # get data attribute
    .data
)

Chaining in Python relies on wrapping the whole expression in parentheses (), which lets a single statement span multiple lines, as shown in the code above. One big advantage of chaining is that it lets you write more readable code (one command per line) and debug it more easily (comment out one line at a time). Written on a single line, the same code would look like this:

df = sm.datasets.get_rdataset("car_prices", package='modeldata').data

Using chaining makes the code more readable, right?

What I like most about chaining is the debugging workflow: because each step sits on its own line, you can comment out one line at a time and inspect the intermediate result. This is especially useful when working with large datasets.
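As a small illustration of that debugging workflow, here is a sketch on a made-up DataFrame. The `toy` frame and its column names are hypothetical and not part of the car dataset. Commenting out one step leaves the rest of the chain valid:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the workflow
toy = pd.DataFrame({"price": [300, 100, 200], "doors": [4, 2, 4]})

# Full chain: filter, sort, then reset the index
full = (
    toy
    .query("doors == 4")
    .sort_values(by="price")
    .reset_index(drop=True)
)

# Same chain with the filter commented out to inspect the unfiltered result
debug = (
    toy
    # .query("doors == 4")
    .sort_values(by="price")
    .reset_index(drop=True)
)
```

Because comments are allowed inside a parenthesized expression, the second chain still runs; it simply skips the filtering step.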

Before continuing with the next section, let’s have a look at the data:

(
    df
    .head()
)

Table 1: Data Preview

	Price	Mileage	Cylinder	Doors	Cruise	Sound	Leather	Buick	Chevy	Saab	convertible	coupe	sedan
0	22661.05	20105	6	4	1	0	0	1	0	0	0	0	1
1	21725.01	13457	6	2	1	1	0	0	1	0	0	1	0
2	29142.71	31655	4	2	1	1	1	0	0	1	1	0	0
3	30731.94	22479	4	2	1	0	0	0	0	1	1	0	0
4	33358.77	17590	4	2	1	1	1	0	0	1	1	0	0

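This dataset describes the price of sold cars by make, mileage and other features. Let us assume we would like to convert the data into a tidy format. We start with the car features: the melt method keeps the columns listed in id_vars as they are and stacks all remaining feature columns into Feature/Value pairs.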

(
    df
    .melt(
        id_vars     = [
            'Price','Mileage', 'Buick', 'Cadillac', 
            'Chevy', 'Pontiac', 'Saab', 'Saturn',
        ],
        var_name    = 'Feature',
        value_name  = 'Value',
    )
    .sort_values(
        by = 'Price',
    )
    .head()
)

Table 2: Tidy format of car features

	Price	Mileage	Chevy	Feature	Value
3783	8638.93	25216	1	Leather	0
5391	8638.93	25216	1	coupe	0
567	8638.93	25216	1	Cylinder	4
2175	8638.93	25216	1	Cruise	0
6999	8638.93	25216	1	sedan	1
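To make the reshaping step easier to follow, here is a minimal sketch of melt on a tiny, made-up table. The `wide` frame below is hypothetical; only the `id_vars`, `var_name` and `value_name` parameters are taken from the chain above:

```python
import pandas as pd

# Hypothetical wide table: one row per car, one column per feature
wide = pd.DataFrame({
    "Price": [100.0, 200.0],
    "Doors": [2, 4],
    "Cruise": [1, 0],
})

tidy = (
    wide
    .melt(
        id_vars="Price",       # column kept as an identifier
        var_name="Feature",    # former column names end up here
        value_name="Value",    # former cell values end up here
    )
)
```

Each of the two non-identifier columns contributes one row per car, so the two-row wide table becomes a four-row long table with the columns Price, Feature and Value.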

Let’s now continue and see how we can use the assign method to create new columns in our dataframe. We will use the assign method to create a new column called Make, which will be a vector representation of the car make.

# Define a function to create the Make vector
def make_vector(row):
    makes = [
        'Buick', 'Cadillac', 'Chevy', 
        'Pontiac', 'Saab', 'Saturn',
    ]
    return [int(row[make]) for make in makes]

(
    df
    .melt(
        id_vars     = [
            'Price','Mileage', 'Buick', 'Cadillac', 
            'Chevy', 'Pontiac', 'Saab', 'Saturn',
        ],
        var_name    = 'Feature',
        value_name  = 'Value',
    )
    .assign(
        Make = lambda df: df.apply(make_vector, axis = 1),
    )
    .drop(
        columns = [
            'Buick', 'Cadillac', 'Chevy', 
            'Pontiac', 'Saab', 'Saturn',
        ],
    )
    .sort_values(
        by = 'Price',
    )
    .head()
)

Table 3: Continued from the previous chain I

	Price	Mileage	Feature	Value	Make
3783	8638.93	25216	Leather	0	[0, 0, 1, 0, 0, 0]
5391	8638.93	25216	coupe	0	[0, 0, 1, 0, 0, 0]
567	8638.93	25216	Cylinder	4	[0, 0, 1, 0, 0, 0]
2175	8638.93	25216	Cruise	0	[0, 0, 1, 0, 0, 0]
6999	8638.93	25216	sedan	1	[0, 0, 1, 0, 0, 0]
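The key detail in this step is that a callable passed to assign receives the DataFrame as it exists at that point in the chain, not the original `df`. A reduced sketch with only two makes and made-up values (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical one-hot data with two makes only
toy = pd.DataFrame({"Buick": [1, 0], "Saab": [0, 1], "Price": [10.0, 20.0]})

result = (
    toy
    .assign(
        # The lambda receives the intermediate DataFrame from the previous step
        Make=lambda d: d.apply(
            lambda row: [int(row["Buick"]), int(row["Saab"])],
            axis=1,
        ),
    )
    # The one-hot columns are no longer needed once Make exists
    .drop(columns=["Buick", "Saab"])
)
```

Naming the lambda argument `d` rather than `df` is a stylistic choice here; it avoids shadowing the outer variable, but the behavior is the same.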

Let’s continue and use groupby to group the rows by the Feature column, then aggregate the Value column with the agg method. This gives us the mean value of each feature in the dataset.

# Define a function to create the Make vector
def make_vector(row):
    makes = [
        'Buick', 'Cadillac', 'Chevy', 
        'Pontiac', 'Saab', 'Saturn',
    ]
    return [int(row[make]) for make in makes]

(
    df
    .melt(
        id_vars     = [
            'Price','Mileage', 'Buick', 'Cadillac', 
            'Chevy', 'Pontiac', 'Saab', 'Saturn',
        ],
        var_name    = 'Feature',
        value_name  = 'Value',
    )
    .assign(
        Make = lambda df: df.apply(make_vector, axis = 1),
    )
    .drop(
        columns = [
            'Buick', 'Cadillac', 'Chevy', 
            'Pontiac', 'Saab', 'Saturn',
        ],
    )
    .sort_values(
        by = 'Price',
    )
    .groupby(
        by = 'Feature',
    )
    .agg(
        {
            'Value': ['mean'],
        }
    )
)

Table 4: Continued from the previous chain II

	Value
	mean
Feature
Cruise	0.752488
Cylinder	5.268657
Doors	3.527363
Leather	0.723881
Sound	0.679104
convertible	0.062189
coupe	0.174129
hatchback	0.074627
sedan	0.609453
wagon	0.079602
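Passing a dictionary with a list of functions to agg produces the two-level column header (Value / mean) seen in Table 4. If a flat column name is preferred, named aggregation is one alternative. The sketch below uses made-up data, and the output name `mean_value` is a hypothetical choice:

```python
import pandas as pd

# Hypothetical tidy data in the same Feature/Value layout
tidy = pd.DataFrame({
    "Feature": ["Doors", "Doors", "Cruise", "Cruise"],
    "Value": [2, 4, 1, 0],
})

summary = (
    tidy
    .groupby(by="Feature")
    # Named aggregation: output_column=(input_column, function)
    .agg(mean_value=("Value", "mean"))
)
```

The result is indexed by Feature and has a single, flat column called `mean_value`.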

To summarize, we could go on forever adding more steps to our method chain, but I think you get the point. We can do a lot with method chaining, and it is a great way to write clean and readable code.

Conclusion

Method chaining in Pandas is a powerful technique that offers several advantages for data manipulation and analysis workflows. It involves combining multiple operations on a DataFrame or Series into a single, concise chain of method calls. This approach enhances code readability and maintainability.

Firstly, method chaining reduces the need for intermediate variables, streamlining code and making it more readable. By stringing together operations, such as filtering, transforming, and aggregating, the code becomes a clear and sequential representation of the data transformation process.

Secondly, it encourages the use of functionally composed operations, leading to more modular and reusable code. This modular nature facilitates changes and updates, as adjustments can be made within the chain without affecting other parts of the code.
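One way to get this kind of modularity inside a chain is the pipe method, which passes the intermediate DataFrame to an ordinary function. The post itself does not use pipe, so treat the following as a sketch; the function `add_price_per_door` and the toy data are hypothetical:

```python
import pandas as pd

def add_price_per_door(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical reusable step: return a new frame with a derived column."""
    return frame.assign(PricePerDoor=frame["Price"] / frame["Doors"])

# Made-up data for illustration only
toy = pd.DataFrame({"Price": [100.0, 300.0], "Doors": [2, 4]})

result = (
    toy
    .pipe(add_price_per_door)   # plug the custom step into the chain
    .sort_values(by="PricePerDoor", ascending=False)
)
```

Because the step is a plain function, it can be tested on its own and reused in other chains, and the original `toy` frame is left unchanged.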

Furthermore, method chaining can help with memory usage, although not because pandas optimizes the chain: pandas evaluates every method eagerly, and most methods return a new DataFrame. The benefit is that intermediate results are never bound to a variable name, so Python can free them as soon as the next step has consumed them, instead of keeping them alive for the rest of the script.

Lastly, method chaining aligns well with the “tidy data” philosophy, as it emphasizes a more structured, organized approach to data manipulation. This promotes consistency and clarity in the analysis process, aiding in collaboration and code maintenance.

I hope you enjoyed this post and learned something new. If you have any questions, feel free to contact me. Thanks for reading!