We often need to aggregate data to extract more information. For example, suppose we have a bike-sharing dataset and want to know the mean number of bike rentals per season. In Pandas the solution is easy:
import pandas as pd
bike_sharing = pd.read_csv('bike_data.csv')
# numeric_only=True skips text columns, which have no mean
rents_by_season = bike_sharing.groupby('Seasons').mean(numeric_only=True).reset_index()
rents_by_season[['Seasons', 'Rented Bike Count']]
And… what if we have to aggregate data by some attribute, and we want the “mean” of categorical data?
The statistical tool to use in this scenario is the mode, in plain terms the most common value. At the time of writing, Pandas implements the mode as a method you can call on a DataFrame or a Series, but does not provide it as a built-in reducer for the groupby method.
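To make the idea of the mode concrete, here is a minimal sketch on a tiny in-memory table. The `films` data is invented for illustration and is not the dataset used in this post. Note that `mode()` returns a Series rather than a single value, because several values can tie for most frequent.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
films = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020, 2020],
    "Genre": ["Drama", "Comedy", "Drama", "Horror", "Horror"],
})

# Series.mode() returns every value that reaches the highest count,
# sorted, so a tie between "Drama" and "Horror" yields both.
genre_mode = films["Genre"].mode()
print(genre_mode.tolist())  # ['Drama', 'Horror']
```

This works on the whole column at once; it does not yet give one mode per group.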
Solution in Pandas
As noted above, Pandas does not implement the mode as an aggregation operator for the groupby method, so we need another strategy. Suppose we have a dataset of films that includes, among other columns, Year and Genre.
Let’s suppose we want to know the most frequent genre for each year. Here is the solution.
import pandas as pd
data = pd.read_csv('film-data.csv')
# For each Year, keep the most frequent value of every other column
mode_data = data.groupby(['Year']).agg(lambda x: x.value_counts().index[0]).reset_index()
mode_data[['Year', 'Genre']]
First, choose the aggregation attribute, in this case Year. Second, define a custom function that implements the aggregation strategy and pass it to the agg method. Here we use a lambda function that computes the value counts (x.value_counts()) for each column within each group and takes the label of the first row of the result (index[0]). Remember that value_counts() returns the frequencies sorted in descending order, so the first label is the most frequent value. Because value_counts() ignores missing values by default, a group whose column contains only missing values produces an empty result, and index[0] then raises an IndexError.
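The steps above can be tried without any CSV file. The sketch below uses a small made-up table (the `films` data, the `Country` column, and the helper name `most_frequent` are all assumptions for illustration) and guards against groups that contain only missing values.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
films = pd.DataFrame({
    "Year":    [2019, 2019, 2019, 2020, 2020, 2020],
    "Genre":   ["Drama", "Comedy", "Drama", "Horror", "Horror", "Drama"],
    "Country": ["US", "US", "IT", "IT", "IT", "US"],
})

def most_frequent(column):
    """Return the most frequent value of a column, or None if it is empty."""
    counts = column.value_counts()  # sorted by frequency, descending
    return counts.index[0] if len(counts) else None

# One row per Year, holding the mode of every other column.
mode_by_year = films.groupby("Year").agg(most_frequent).reset_index()
print(mode_by_year[["Year", "Genre"]])
```

With this data, the most frequent genre is Drama for 2019 and Horror for 2020. If two values tie within a group, only one of them is kept; if all tied values matter, returning `column.mode().tolist()` from the helper is one possible variation.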
And… here we are! Now we can aggregate by computing the mode for the qualitative data.