A Powerful EDA Tool: Group-By Aggregation
Photo by Mourizal Zativa on Unsplash

Learn how to use group-by aggregation to uncover insights from your data

Exploratory Data Analysis (EDA) is a core competency of a data analyst. Every day, data analysts are tasked with seeing the "unseen," that is, extracting useful insights from a vast ocean of data.

In this regard, I'd like to share a technique that I find beneficial for extracting relevant insights from data: group-by aggregation.

To this end, the rest of this article is arranged as follows:

- Explanation of group-by aggregation in pandas
- The dataset: Metro Interstate Traffic
- Metro Traffic EDA

Group-By Aggregation

Group-by aggregation is a data manipulation technique that consists of two steps. First, we group the data based on the values of specific columns. Second, we perform aggregation operations on top of the grouped data.

Group-by aggregation is especially useful when our data is granular, as in typical fact tables (transaction data) and time series data with narrow intervals. By aggregating at a higher level than the raw data granularity, we can represent the data in a more compact way and may distill useful insights in the process.

In pandas, we can perform group-by aggregation using the following general syntax.

df.groupby(['base_col']).agg(
    agg_col=('ori_col', 'agg_func')
)

Here, base_col is the column whose values become the grouping basis, and agg_col is the new column created by applying the agg_func aggregation to the ori_col column.

For example, consider the famous Titanic dataset, whose first five rows are displayed below.

import pandas as pd
import seaborn as sns

# import titanic dataset
titanic = sns.load_dataset("titanic")
titanic.head()

Titanic data's first 5 rows (Image by Author)

We can group this data by the survived column and then aggregate it by taking the median of the fare column to get the results below.

Median fare of Titanic passengers, by survival status (Image by Author)

Suddenly, we see an interesting insight: passengers who survived paid a median fare more than double that of those who did not. This could be related to lifeboats being prioritized for higher cabin classes (i.e., passengers with more expensive tickets).

Hopefully, this simple example demonstrates the potential of group-by aggregation for gathering insights from data. Okay then, let's try group-by aggregation on a more interesting dataset!

The Dataset

We will use the Metro Interstate Traffic Volume dataset. It's a publicly available dataset with a Creative Commons 4.0 license (which allows for sharing and adaptation of the dataset for any purpose).

The dataset contains hourly traffic volume for westbound I-94 in Minneapolis-St Paul, MN, along with weather details, covering 2012–2018.
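Before loading the data, note that the agg syntax from the previous section scales naturally to multiple grouping keys and several named aggregations at once; the analyses below all follow this pattern. The snippet is only a generic sketch with placeholder column names (key_col_1, value_col, and friends are not real columns in the traffic data):

# generic sketch: group by one or more columns and compute several named aggregations
summary = df.groupby(['key_col_1', 'key_col_2'], as_index=False).agg(
    total_value=('value_col', 'sum'),   # sum within each group
    avg_value=('value_col', 'mean'),    # mean within each group
    n_rows=('value_col', 'count'),      # number of rows in each group
)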
The data dictionary can be found on its UCI Machine Learning repository page.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load dataset
df = pd.read_csv("dir/to/Metro_Interstate_Traffic_Volume.csv")

# convert date_time column from object to proper datetime format
df['date_time'] = pd.to_datetime(df['date_time'])

# head
df.head()

Traffic data (df) head (Image by Author)

For this blog demo, we will only use data from 2016 onwards, as there is missing traffic data in earlier periods (try checking this yourself as an exercise!).

Furthermore, we will add a new column, is_congested, which has a value of 1 if traffic_volume exceeds 5000 and 0 otherwise.

# only consider 2016-onwards data
df = df.loc[df['date_time'] >= "2016-01-01", :]

# feature engineering: is_congested column
df['is_congested'] = df['traffic_volume'].apply(lambda x: 1 if x > 5000 else 0)

Metro Traffic EDA

Using group-by aggregation as the main weapon, we will try to answer the following analysis questions:

- How does the monthly traffic volume progress over time?
- What is the traffic profile of each day of the week (Monday, Tuesday, etc.)?
- What does typical hourly traffic volume look like across 24 hours, broken down by weekday vs. weekend?
- Which weather conditions correspond to the highest congestion rates?

Monthly progression of traffic volume

This question requires us to aggregate (sum) traffic volumes at the month level. Because we don't have a month column, we need to derive one from the date_time column.

With the month column in place, we can group by it and take the sum of traffic_volume. The code is given below.

# create month column based on date_time
# sample values: 2016-01, 2016-02
df['month'] = df['date_time'].dt.to_period("M")

# get sum of traffic_volume by month
monthly_traffic = df.groupby('month', as_index=False).agg(
    total_traffic=('traffic_volume', 'sum')
)

# convert month column to string for viz
monthly_traffic['month'] = monthly_traffic['month'].astype(str)

monthly_traffic.head()

monthly_traffic head (Image by Author)

We can draw a line plot from this dataframe.

# draw time series plot
plt.figure(figsize=(12,5))
sns.lineplot(data=monthly_traffic, x="month", y="total_traffic")
plt.xticks(rotation=90)
plt.title("Monthly Traffic Volume")
plt.show()

Monthly traffic volume (Image by Author)

The above visualization shows that traffic volume has generally increased over the months within the considered data period.

Daily traffic profile

To analyze this, we need to create two additional columns: date and dayname. The former serves as the primary group-by basis, whereas the latter is used as a breakdown when displaying the data. We will define both columns and then group by them to get the daily sum of traffic_volume.
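To build intuition for grouping by two columns of different granularity, here is a tiny toy illustration with made-up numbers (the frame toy and its values are purely hypothetical, not the traffic data):

import pandas as pd

# toy data: each date belongs to exactly one day name
toy = pd.DataFrame({
    "dayname": ["Monday", "Monday", "Tuesday", "Tuesday"],
    "date":    ["2016-01-04", "2016-01-11", "2016-01-05", "2016-01-12"],
    "traffic": [100, 120, 90, 95],
})

# grouping by (dayname, date) still yields one row per distinct date,
# with the day name carried along as a label
per_date = toy.groupby(["dayname", "date"], as_index=False).agg(
    total_traffic=("traffic", "sum")
)
print(per_date)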
Note that since dayname is coarser (a higher aggregation level) than date, grouping by both columns effectively means we aggregate at the date level.

# create date column from date_time
# sample values: 2016-01-01, 2016-01-02
df['date'] = df['date_time'].dt.to_period('D')

# create dayname column
# sample values: Monday, Tuesday
df['dayname'] = df['date_time'].dt.day_name()

# get sum of traffic, at date level
daily_traffic = df.groupby(['dayname','date'], as_index=False).agg(
    total_traffic=('traffic_volume', 'sum')
)

# map dayname to number for viz later
dayname_map = {
    'Monday': 1,
    'Tuesday': 2,
    'Wednesday': 3,
    'Thursday': 4,
    'Friday': 5,
    'Saturday': 6,
    'Sunday': 7
}
daily_traffic['dayname_index'] = daily_traffic['dayname'].map(dayname_map)
daily_traffic = daily_traffic.sort_values(by='dayname_index')

daily_traffic.head()

daily_traffic head (Image by Author)

The above table contains the different realizations of daily total traffic volume for each day name. Box plots are an appropriate way to show these variations, allowing us to see how traffic volumes differ on Monday, Tuesday, and so on.

# draw boxplot per day name
plt.figure(figsize=(12,5))
sns.boxplot(data=daily_traffic, x="dayname", y="total_traffic")
plt.xticks(rotation=90)
plt.title("Daily Traffic Volume")
plt.show()

The above plot shows that all weekdays (Mon-Fri) have roughly the same traffic density. Weekends (Saturday and Sunday) have lower traffic, with Sunday the lower of the two.

Hourly traffic patterns, broken down by weekend status

As with the previous questions, we need to engineer two new columns to answer this one: hour and is_weekend.

Using the same trick, we group by the is_weekend and hour columns to get the average of traffic_volume.

# extract hour digit from date_time
# sample values: 1, 2, 3
df['hour'] = df['date_time'].dt.hour

# create is_weekend flag based on dayname
df['is_weekend'] = df['dayname'].apply(lambda x: 1 if x in ['Saturday', 'Sunday'] else 0)

# get average traffic at hour level, broken down by is_weekend flag
hourly_traffic = df.groupby(['is_weekend','hour'], as_index=False).agg(
    avg_traffic=('traffic_volume', 'mean')
)

hourly_traffic.head()

hourly_traffic head (Image by Author)

For the visualization, we can use a bar chart with a breakdown on the is_weekend flag.

# draw as barplot with hue = is_weekend
plt.figure(figsize=(20,6))
sns.barplot(data=hourly_traffic, x='hour', y='avg_traffic', hue='is_weekend')
plt.title("Average Hourly Traffic Volume: Weekdays (blue) vs Weekend (orange)", fontsize=14)
plt.show()

Hourly traffic pattern, by weekend status (Image by Author)

A very interesting and rich visualization! Observations:

- Weekday traffic has a bimodal pattern. It peaks between 6 and 8 a.m. and again between 4 and 5 p.m. (16:00 to 17:00). This is intuitive, as those windows correspond to people commuting to and from work.
- Weekend traffic follows a completely different pattern: a unimodal shape with a broad peak window (12:00 to 17:00). Despite being generally lower than the equivalent weekday hours, weekend traffic is actually higher during late-night hours (22:00 to 02:00). This could be because people stay out late on weekend nights.

Top weather associated with congestion

To answer this question, we need to calculate the congestion rate for each weather condition in the dataset (using the is_congested column). Can we calculate it with group-by aggregation? Yes, we can!

The key observation is that the is_congested column is binary.
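To see why that matters, here is a tiny illustrative example with made-up values (four hypothetical hours, two of them congested):

import pandas as pd

# the mean of a 0/1 column is the share of rows equal to 1
flags = pd.Series([1, 0, 0, 1])
print(flags.mean())  # 0.5, i.e. 2 congested hours out of 4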
Thus, the congestion rate can be calculated by simply averaging this column: the average of a binary column equals count(rows with value 1) / count(all rows). Let that sink in for a moment if it's new to you.

Based on this neat observation, all we need to do is take the average (mean) of is_congested grouped by weather_description. Following that, we sort the results in descending order of congested_rate.

# rate of congestion (is_congested), grouped by weather description
congested_weather = df.groupby('weather_description', as_index=False).agg(
    congested_rate=('is_congested', 'mean')
).sort_values(by='congested_rate', ascending=False, ignore_index=True)

congested_weather.head()

congested_weather head (Image by Author)

# draw as barplot
plt.figure(figsize=(20,6))
sns.barplot(data=congested_weather, x='weather_description', y='congested_rate')
plt.xticks(rotation=90)
plt.title('Top Weather with High Congestion Rates')
plt.show()

Top weather based on congestion rate (Image by Author)

From the graph:

- The top three weather conditions with the highest congestion rates are sleet, light shower snow, and very heavy rain.
- Meanwhile, light rain and snow, thunderstorms with drizzle, freezing rain, and squalls show no congestion at all. People must be staying indoors during such extreme weather!

Closing

In this blog post, we covered how to use group-by aggregation in EDA exercises. As we have seen, this technique is highly effective at revealing interesting, useful insights from data, particularly when the data is granular.

I hope you can practice group-by aggregation during your next EDA project! All in all, thanks for reading, and let's connect on LinkedIn! 👋