The Chief Marketing Officer (CMO) is interested in understanding how the sales of different product families are affected by promotions originating from a specific source (e.g., "Facebook promotions").
You are given three hypothetical Pandas DataFrames:
products_df: Contains product information (product_id, product_family).
promotions_df: Contains promotion details (promotion_id, source). For this problem, we're interested in promotions where source == 'facebook'.
Outline the steps using Pandas to:
Merge the necessary DataFrames to link sales to product families and identify promotion sources.
Create a boolean column to identify sales associated with a "valid Facebook promotion" (i.e., promotion_id is present in the Facebook promotions and is not NA/null).
Group the data by product_family and calculate:
Total units sold for sales with a valid Facebook promotion.
Total units sold for all sales (regardless of promotion).
Calculate the percentage of units sold with valid Facebook promotions for each product family.
Ensure any potential NaN values resulting from aggregations (if a product family had no promotional sales, for instance) are handled appropriately (e.g., filled with 0).
Pandas Workflow for Sales Analysis
This problem requires a sequence of data manipulation steps using Pandas to derive the desired insights. Let's assume we have the following initial DataFrames:
We need to combine sales_df with products_df to get the product_family for each sale, and then with promotions_df to get the source of each promotion.
# Merge sales with productsmerged_df=pd.merge(sales_df, products_df, on='product_id', how='left')
# Merge the result with promotions# Using a left merge to keep all sales, even if promotion_id is NaN or not in promotions_dfmerged_df=pd.merge(merged_df, promotions_df, on='promotion_id', how='left')
print("\n--- Fully Merged DataFrame ---\n", merged_df)
A left merge is appropriate here to ensure we keep all sales records and enrich them with product and promotion information where available.
Step 2: Create a Boolean Column for Valid Facebook Promotions
We need to identify promotions that came from 'facebook' and have a valid (non-null) promotion_id. First, let's get a set of valid Facebook promotion IDs.
facebook_promotion_ids=set(promotions_df[promotions_df['source'] =='facebook']['promotion_id'].unique())
print(f"\nValid Facebook Promotion IDs: {facebook_promotion_ids}")
# Create the boolean column# A promotion is valid if its ID is in our set and the promotion_id field is not NaNmerged_df['is_valid_fb_promo'] =merged_df['promotion_id'].apply(
lambdapid: (notpd.isna(pid)) and (pidinfacebook_promotion_ids)
)
# Alternative using .isin() which is often more efficient for checking against a list/set:# merged_df['is_valid_fb_promo'] = merged_df['promotion_id'].isin(facebook_promotion_ids) & \# merged_df['promotion_id'].notna()print("\n--- Merged DataFrame with is_valid_fb_promo column ---\n", merged_df)
The .apply() with a lambda function (or using .isin() and .notna()) checks each promotion_id.
Step 3: Group by Product Family and Sum Units Sold
Now we group by product_family to get total sales and sales from valid Facebook promotions.
# Total units sold per product familytotal_sales_by_family=merged_df.groupby('product_family')['units_sold'].sum().rename('total_units_sold')
print("\n--- Total Sales by Product Family ---\n", total_sales_by_family)
# Units sold with a valid Facebook promotion per product familyfb_promo_sales_by_family=merged_df[merged_df['is_valid_fb_promo'] ==True].groupby('product_family')['units_sold'].sum().rename('fb_promo_units_sold')
print("\n--- Facebook Promo Sales by Product Family ---\n", fb_promo_sales_by_family)
# Combine these into a single summary DataFramesummary_df=pd.concat([total_sales_by_family, fb_promo_sales_by_family], axis=1)
print("\n--- Combined Summary ---\n", summary_df)
Step 4: Calculate Percentage of Units Sold with Valid Promotions
Using the summary DataFrame, calculate the percentage.
# Handle potential NaN if a product family had no FB promo sales (before fillna)summary_df['fb_promo_percentage'] = (summary_df['fb_promo_units_sold'] /summary_df['total_units_sold']) *100print("\n--- Summary with Percentage (before fillna) ---\n", summary_df)
Step 5: Handle Potential NaN Values
If a product family had sales but no Facebook promotional sales, fb_promo_units_sold (and thus fb_promo_percentage) would be NaN. We should fill these with 0.
summary_df['fb_promo_units_sold'] =summary_df['fb_promo_units_sold'].fillna(0)
summary_df['fb_promo_percentage'] =summary_df['fb_promo_percentage'].fillna(0)
# Ensure total_units_sold is not zero for percentage calculation if some families had 0 total sales# However, in this case, if total_units_sold is 0, fb_promo_units_sold will also be 0 or NaN.# The division 0/0 might result in NaN, or x/0 in inf, which fillna(0) handles.print("\n--- Final Summary DataFrame ---\n", summary_df)
Key Discussion Points for an Interview:
Merge Strategy: Justify the use of left merges to retain all sales data. Discuss different types of merges (inner, outer, right) if probed.
Efficiency of Boolean Column Creation: While .apply() with lambda is clear, for very large DataFrames, using vectorized operations like .isin() and boolean indexing (&) can be more performant than row-wise .apply().
Handling Missing Data: Discuss the importance of .fillna(0) for calculations like percentages to avoid NaNs propagating or errors. Also, consider if promotion_id itself could be missing in `sales_df` and how that's handled (.notna() or .map() with a default for NaNs).
Alternative Aggregations: Briefly mention that .groupby().agg() could be used for step 3 to calculate both sums simultaneously if structured correctly, potentially making the code more concise.
defsum_if_promo(x):
returnx[x['is_valid_fb_promo']]['units_sold'].sum()
summary_alt=merged_df.groupby('product_family').agg(
total_units_sold=pd.NamedAgg(column='units_sold', aggfunc='sum'),
fb_promo_units_sold=pd.NamedAgg(column='units_sold', aggfunc=lambdax:x[merged_df.loc[x.index, 'is_valid_fb_promo']].sum())
# The lambda for fb_promo_units_sold in agg needs careful indexing if applied directly# A simpler way is to calculate fb_promo_units separately as done initially, or use apply.
)
# print("\n--- Alternative Summary with agg ---\n", summary_alt)
The initial separate calculation for `fb_promo_units_sold` is often clearer.
Clarity vs. Conciseness: While one-liners are possible in Pandas, breaking down the problem into logical, readable steps is often preferred in an interview to showcase thought process.