Pandas Method Chaining
Introduction
Method chaining is a way to combine multiple pandas operations together into a single line or statement. It is not necessary to use method chaining, but it can make your code more concise and easier to read. In this post, we will look at how to use method chaining to clean and transform a dataset.
Honestly, since I started using method chaining, I have found it difficult to go back to the old way of writing pandas code. I hope that by the end of this post, you will feel the same way. :)
Let’s get started! First I show how to use method chaining to import a dataset as follows:
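A minimal sketch of what such a chained import could look like. The inline CSV text and the cleaning steps are made up for illustration and are not the actual import code for the car dataset:

```python
import io

import pandas as pd

# Made-up stand-in for the real file, so the example is self-contained.
csv_text = """Price,Mileage,Cylinder,Doors
22661.05,20105,6,4
21725.01,13457,6,2
29142.71,31655,4,2
"""

df = (
    pd.read_csv(io.StringIO(csv_text))
    .rename(columns=str.strip)                       # remove stray whitespace in column names
    .astype({"Cylinder": "int8", "Doors": "int8"})   # store small integers compactly
    .sort_values(by="Price")
    .reset_index(drop=True)
)
```

Each line is one step, and the result of the whole chain is bound to `df` only once.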
Wrapping the whole expression in parentheses () lets the chain span several lines, with one method call per line, as shown in the code above. One big advantage of chaining is that it allows the user to write more readable code (one command per line) and debug it more easily (comment out one line at a time). Not using chaining would require the user to write the code as follows:
Using chaining makes the code more readable, right?
What I like most about chaining is how easy it makes debugging: any step can be switched off by commenting out a single line. This is especially useful when working with large datasets.
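A sketch of that step-by-step style, on a small made-up table (the data and the steps are illustrative assumptions), with the same steps written as a chain for comparison:

```python
import io

import pandas as pd

# Made-up data, for illustration only.
csv_text = """Price,Mileage,Doors
22661.05,20105,4
21725.01,13457,2
29142.71,31655,2
"""

# Step-by-step style: the variable is rebound after every operation.
df = pd.read_csv(io.StringIO(csv_text))
df = df.astype({"Doors": "int8"})
df = df.sort_values(by="Price")
df = df.reset_index(drop=True)

# The same steps as a single chain.
df_chained = (
    pd.read_csv(io.StringIO(csv_text))
    .astype({"Doors": "int8"})
    .sort_values(by="Price")
    .reset_index(drop=True)
)
```

Both versions produce the same DataFrame. The difference is only in how the steps are written down.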
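To make the debugging point concrete, here is a small sketch on made-up data in which one step of a chain is temporarily disabled. The column `PricePerMile` is a hypothetical name introduced for this example:

```python
import pandas as pd

# Made-up data, for illustration only.
cars = pd.DataFrame(
    {
        "Price": [22661.05, 21725.01, 29142.71],
        "Mileage": [20105, 13457, 31655],
        "Doors": [4, 2, 2],
    }
)

result = (
    cars
    .query("Doors == 2")
    # .sort_values(by="Price", ascending=False)  # disabled while debugging
    .assign(PricePerMile=lambda d: d["Price"] / d["Mileage"])
)
```

Because every step sits on its own line, the rest of the chain still runs after a line is commented out.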
Before continuing with the next section, let’s have a look at the data:
| | Price | Mileage | Cylinder | Doors | Cruise | Sound | Leather | Buick | Cadillac | Chevy | Pontiac | Saab | Saturn | convertible | coupe | hatchback | sedan | wagon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22661.05 | 20105 | 6 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 21725.01 | 13457 | 6 | 2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 29142.71 | 31655 | 4 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 30731.94 | 22479 | 4 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 33358.77 | 17590 | 4 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
This dataset describes the price of sold cars by make, mileage and other features. Let us assume we would like to convert the data into a tidy format. We start with the car features.
    (
        df
        .melt(
            id_vars=[
                'Price', 'Mileage', 'Buick', 'Cadillac',
                'Chevy', 'Pontiac', 'Saab', 'Saturn',
            ],
            var_name='Feature',
            value_name='Value',
        )
        .sort_values(
            by='Price',
        )
        .head()
    )

| | Price | Mileage | Buick | Cadillac | Chevy | Pontiac | Saab | Saturn | Feature | Value |
|---|---|---|---|---|---|---|---|---|---|---|
| 3783 | 8638.93 | 25216 | 0 | 0 | 1 | 0 | 0 | 0 | Leather | 0 |
| 5391 | 8638.93 | 25216 | 0 | 0 | 1 | 0 | 0 | 0 | coupe | 0 |
| 567 | 8638.93 | 25216 | 0 | 0 | 1 | 0 | 0 | 0 | Cylinder | 4 |
| 2175 | 8638.93 | 25216 | 0 | 0 | 1 | 0 | 0 | 0 | Cruise | 0 |
| 6999 | 8638.93 | 25216 | 0 | 0 | 1 | 0 | 0 | 0 | sedan | 1 |
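To see what `melt` does in isolation, here is a minimal sketch on a made-up two-row table (the values are illustrative and not taken from the car dataset). Columns listed in `id_vars` are repeated, and every other column becomes a `Feature`/`Value` pair:

```python
import pandas as pd

# Made-up wide table: one row per car, one column per feature.
wide = pd.DataFrame(
    {
        "Price": [100.0, 200.0],
        "Leather": [1, 0],
        "Doors": [4, 2],
    }
)

# Long table: one row per (car, feature) combination.
long = wide.melt(id_vars=["Price"], var_name="Feature", value_name="Value")
```

Each car now appears once per feature, which would explain why the five rows in the output above all share the same price and mileage.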
Let’s now continue and see how we can use the assign method to create new columns in our dataframe. We will use it to create a new column called Make, which will be a vector representation of the car make.
    # Define a function to create the Make vector
    def make_vector(row):
        makes = [
            'Buick', 'Cadillac', 'Chevy',
            'Pontiac', 'Saab', 'Saturn',
        ]
        return [int(row[make]) for make in makes]

    (
        df
        .melt(
            id_vars=[
                'Price', 'Mileage', 'Buick', 'Cadillac',
                'Chevy', 'Pontiac', 'Saab', 'Saturn',
            ],
            var_name='Feature',
            value_name='Value',
        )
        .assign(
            Make=lambda df: df.apply(make_vector, axis=1),
        )
        .drop(
            columns=[
                'Buick', 'Cadillac', 'Chevy',
                'Pontiac', 'Saab', 'Saturn',
            ],
        )
        .sort_values(
            by='Price',
        )
        .head()
    )

| | Price | Mileage | Feature | Value | Make |
|---|---|---|---|---|---|
| 3783 | 8638.93 | 25216 | Leather | 0 | [0, 0, 1, 0, 0, 0] |
| 5391 | 8638.93 | 25216 | coupe | 0 | [0, 0, 1, 0, 0, 0] |
| 567 | 8638.93 | 25216 | Cylinder | 4 | [0, 0, 1, 0, 0, 0] |
| 2175 | 8638.93 | 25216 | Cruise | 0 | [0, 0, 1, 0, 0, 0] |
| 6999 | 8638.93 | 25216 | sedan | 1 | [0, 0, 1, 0, 0, 0] |
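The lambda passed to `assign` receives the DataFrame as it exists at that point in the chain, so it can use columns created by earlier steps. Row-wise `apply` can be slow on large tables. As a sketch of one possible alternative, and not the approach used in this post, the same vectors can be built from the underlying NumPy array. The shortened `makes` list and the data below are made up for illustration:

```python
import pandas as pd

makes = ["Buick", "Chevy"]  # shortened list, for illustration only

# Made-up data with one-hot encoded makes.
cars = pd.DataFrame(
    {
        "Price": [100.0, 200.0],
        "Buick": [1, 0],
        "Chevy": [0, 1],
    }
)


def make_vector(row):
    """Collect the one-hot make columns of a single row into a list."""
    return [int(row[make]) for make in makes]


# Row-wise version, as in the post.
with_apply = (
    cars
    .assign(Make=lambda d: d.apply(make_vector, axis=1))
    .drop(columns=makes)
)

# Array-based version: convert the make columns to a list of lists in one go.
with_numpy = (
    cars
    .assign(Make=lambda d: pd.Series(d[makes].to_numpy().tolist(), index=d.index))
    .drop(columns=makes)
)
```

Both chains end up with the same `Make` column. Which one reads better is a matter of taste.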
Let’s continue and use groupby to group the rows by the Feature column and then aggregate them with the agg method. This will give us the mean value of each feature across the dataset.
    # Define a function to create the Make vector
    def make_vector(row):
        makes = [
            'Buick', 'Cadillac', 'Chevy',
            'Pontiac', 'Saab', 'Saturn',
        ]
        return [int(row[make]) for make in makes]

    (
        df
        .melt(
            id_vars=[
                'Price', 'Mileage', 'Buick', 'Cadillac',
                'Chevy', 'Pontiac', 'Saab', 'Saturn',
            ],
            var_name='Feature',
            value_name='Value',
        )
        .assign(
            Make=lambda df: df.apply(make_vector, axis=1),
        )
        .drop(
            columns=[
                'Buick', 'Cadillac', 'Chevy',
                'Pontiac', 'Saab', 'Saturn',
            ],
        )
        .sort_values(
            by='Price',
        )
        .groupby(
            by='Feature',
        )
        .agg(
            {
                'Value': ['mean'],
            }
        )
    )

| Feature | Value (mean) |
|---|---|
| Cruise | 0.752488 |
| Cylinder | 5.268657 |
| Doors | 3.527363 |
| Leather | 0.723881 |
| Sound | 0.679104 |
| convertible | 0.062189 |
| coupe | 0.174129 |
| hatchback | 0.074627 |
| sedan | 0.609453 |
| wagon | 0.079602 |
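Passing a dictionary with a list of functions to `agg` produces two-level column labels (`Value` / `mean`). If flat column names are preferred, named aggregation is one option. A minimal sketch on made-up data, where the output name `mean_value` is a hypothetical choice:

```python
import pandas as pd

# Made-up long-format data, for illustration only.
long = pd.DataFrame(
    {
        "Feature": ["Doors", "Doors", "Leather", "Leather"],
        "Value": [4, 2, 1, 0],
    }
)

summary = (
    long
    .groupby(by="Feature")
    .agg(mean_value=("Value", "mean"))  # output column name = (input column, function)
    .reset_index()                      # turn the Feature index back into a column
)
```

The result has two ordinary columns, `Feature` and `mean_value`, which is convenient if the chain continues after the aggregation.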
To summarize, we could go on forever adding more steps to our method chain, but I think you get the point: we can do a lot with method chaining, and it is a great way to write clean and readable code.
Conclusion
Method chaining in pandas is a powerful technique that offers several advantages for data manipulation and analysis workflows. It involves combining multiple operations on a DataFrame or Series into a single, concise chain of method calls. This approach enhances code readability and maintainability.
Firstly, method chaining reduces the need for intermediate variables, streamlining code and making it more readable. By stringing together operations, such as filtering, transforming, and aggregating, the code becomes a clear and sequential representation of the data transformation process.
Secondly, it encourages the use of functionally composed operations, leading to more modular and reusable code. This modular nature facilitates changes and updates, as adjustments can be made within the chain without affecting other parts of the code.
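One way to get this kind of modularity is to package a step as a small function and slot it into a chain with `DataFrame.pipe`. A minimal sketch, in which the helper functions, their names, and the data are hypothetical:

```python
import pandas as pd


def keep_two_door(cars: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical reusable step: keep only two-door cars."""
    return cars.loc[cars["Doors"] == 2]


def add_price_per_mile(cars: pd.DataFrame, column: str = "PricePerMile") -> pd.DataFrame:
    """Hypothetical reusable step: add price divided by mileage as a new column."""
    return cars.assign(**{column: cars["Price"] / cars["Mileage"]})


# Made-up data, for illustration only.
cars = pd.DataFrame(
    {
        "Price": [22661.05, 21725.01, 29142.71],
        "Mileage": [20105, 13457, 31655],
        "Doors": [4, 2, 2],
    }
)

result = (
    cars
    .pipe(keep_two_door)
    .pipe(add_price_per_mile, column="PricePerMile")
    .sort_values(by="PricePerMile")
)
```

Each function can be tested on its own and reused in other chains, and neither of them modifies `cars` in place.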
Furthermore, chaining can help with memory usage, although it is not a performance optimization in itself. Pandas evaluates each method eagerly and most methods return a new object, so a chain does the same work as the equivalent step-by-step code. The difference is that intermediate results are never bound to variable names, so Python can free them as soon as the next step has finished with them.
Lastly, method chaining aligns well with the “tidy data” philosophy, as it emphasizes a more structured, organized approach to data manipulation. This promotes consistency and clarity in the analysis process, aiding in collaboration and code maintenance.
I hope you enjoyed this post and learned something new. If you have any questions, feel free to contact me. Thanks for reading!