Checking for Data Errors & Anomalies
What descriptive statistics would you compute to quickly check for potential data entry errors or anomalies related to purchase amounts from restaurants in areas like Gachibowli and KPHB (e.g., negative values, unusually high values, zero values, or amounts below ₹100)?
Hint
To find negative values or values below ₹100 (the minimum mentioned by Kiran Kumar from Manikonda for Swiggy orders with "TELANGANA25"), what's the quickest statistic? For unusually high values from Paradise Biryani or Chutneys, what would you check? How would you spot zero values?
Solution
Imagine Kiran Kumar from Swiggy's Manikonda office asks you to quickly check if the purchase amounts from restaurants like Paradise Biryani in Gachibowli or Bawarchi in KPHB look okay. He mentioned orders should be at least ₹100 after the "TELANGANA25" discount.
To find problems, I'd quickly look at:
- Smallest and Biggest Orders (Min/Max): The minimum value would instantly tell me if there are any negative amounts (which makes no sense for a sale) or amounts less than ₹100. The maximum value would show if there's an unbelievably high order (e.g., ₹5,00,000 for a single biryani order – likely a typo!).
- How many times each amount appears (Frequency Counts): This would help me see if there are many orders at exactly ₹0, or many orders suspiciously below ₹100.
- A quick look at the average and spread: While not the first check for these specific errors, if the average is very low or the spread (standard deviation) is huge, it hints at data problems.
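As a minimal sketch of the min/max check, the snippet below uses only the Python standard library. The `amounts` list is made-up sample data and `MIN_EXPECTED` is a hypothetical name for the ₹100 floor mentioned above; neither comes from a real Swiggy dataset.

```python
# Hypothetical sample of purchase amounts in rupees (made-up values).
amounts = [250.0, 480.0, -120.0, 0.0, 75.0, 500000.0, 320.0]

MIN_EXPECTED = 100  # assumed floor: the ₹100 minimum quoted in the brief

smallest = min(amounts)
largest = max(amounts)

# A negative minimum means at least one impossible sale amount.
has_negative = smallest < 0
# A minimum under the floor means at least one order breaks the ₹100 rule.
below_minimum = smallest < MIN_EXPECTED

print(f"min = {smallest}, max = {largest}")
print(f"negative present: {has_negative}, below ₹{MIN_EXPECTED} present: {below_minimum}")
```

Two numbers are enough to tell whether a deeper look is needed: here the minimum exposes the negative amount and the maximum exposes the ₹5,00,000 entry.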
To quickly check for potential data entry errors or anomalies in purchase amounts from Swiggy orders (e.g., from restaurants like Paradise Biryani, Chutneys, Bawarchi in areas like Gachibowli and KPHB), I would compute the following descriptive statistics first:
1. Minimum Value:
   - This will immediately reveal if there are any negative purchase amounts (which are impossible for actual sales) or values like zero (which might indicate cancelled orders before payment, free promotional items without a main purchase, or data errors).
   - It will also quickly show if there are values below the expected ₹100 minimum mentioned by Kiran Kumar from Manikonda (after the "TELANGANA25" discount).
2. Maximum Value:
   - This helps identify unusually high purchase amounts. For example, a single food order worth lakhs of rupees would be highly suspicious and likely a data entry error (e.g., extra zeros).
3. Range (Max - Min):
   - While the min and max are key, the range gives a quick sense of the overall spread. An extremely large range often signals the presence of outliers on one or both ends.
4. Frequency Counts (especially for specific suspicious values):
   - Count the occurrences of purchase amounts equal to zero.
   - Count occurrences of purchase amounts greater than ₹0 but below ₹100 to see how many fall below the expected minimum.
   - If specific problematic values are found (e.g., a common typo like ₹10 instead of ₹1000), checking their frequency can indicate a systematic error.
5. Count of Missing Values (NaNs):
   - Determine if there are any orders where the purchase amount is not recorded at all.
6. Basic Central Tendency and Dispersion (as secondary checks):
   - Mean and Median: A mean far from the median suggests skewness, which is often caused by outliers.
   - Standard Deviation: A very high standard deviation relative to the mean can also indicate the presence of extreme values. However, min/max and frequencies are more direct for the specific error types mentioned.
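The checks listed above can be combined into one pass. The sketch below is one possible way to do it with the Python standard library, not a prescribed method: `quick_checks`, `min_expected`, and the `sample` values are all hypothetical names and made-up data, and missing amounts are assumed to arrive as `None` or NaN.

```python
import math
import statistics


def quick_checks(amounts, min_expected=100.0):
    """Summarise purchase amounts; None or NaN is treated as missing."""

    def is_missing(x):
        return x is None or (isinstance(x, float) and math.isnan(x))

    present = [x for x in amounts if not is_missing(x)]
    if not present:
        raise ValueError("no recorded purchase amounts to summarise")

    return {
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
        "n_negative": sum(x < 0 for x in present),
        "n_zero": sum(x == 0 for x in present),
        # Strictly between 0 and the floor, so zeros are not double counted.
        "n_below_min": sum(0 < x < min_expected for x in present),
        "n_missing": len(amounts) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        # Sample standard deviation needs at least two values.
        "stdev": statistics.stdev(present) if len(present) > 1 else 0.0,
    }


# Made-up sample with one negative, one zero, two sub-₹100 orders,
# two missing entries, and one implausibly large amount.
sample = [250.0, 480.0, -120.0, 0.0, 75.0, None, 500000.0, 320.0, float("nan"), 10.0]

report = quick_checks(sample)
for name, value in report.items():
    print(f"{name}: {value}")
```

In this sample the single ₹5,00,000 entry pulls the mean far above the median, which illustrates why the mean/median gap works as a secondary signal while min, max, and the counts pinpoint the specific problems.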
These initial statistics, particularly min, max, and frequency counts around zero and below ₹100, are the fastest way to flag the most obvious data quality issues related to purchase amounts for Raghavendra Analytics' work with Swiggy.