Untitled — Python Coding — Nerchuko Academy

Pandas: `.loc` vs `.iloc`

CONCEPTUAL

Explain the difference between .loc and .iloc indexers in Pandas DataFrames. Provide examples of how and when you would use each.

Explanation: `.loc` vs `.iloc`

In Pandas, .loc and .iloc are the primary accessors used for selecting data from DataFrames (and Series). While both are used for slicing and selecting, they differ fundamentally in how they interpret the input used for selection.

Core Distinction:

.loc (Label-based selection):
- Selects data based on the actual labels of the rows and columns.
- The row and column identifiers you provide to .loc are interpreted as labels from the DataFrame's index and column names.
- When slicing with labels (e.g., df.loc['start_label':'end_label']), both the start and end labels are inclusive.

.iloc (Integer position-based selection):
- Selects data based on the integer positions (0-based index) of the rows and columns, similar to how you would index a Python list or a NumPy array.
- The row and column identifiers you provide to .iloc are interpreted as integer positions.
- When slicing with integer positions (e.g., df.iloc[start_pos:end_pos]), the start position is inclusive, and the end position is exclusive (standard Python slicing behavior).
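
The distinction matters most once labels and positions stop lining up, for example after sorting. The following is a minimal sketch using a hypothetical `scores` table (not part of the original example) to show the two indexers returning different rows for the same key:

```python
import pandas as pd

# Hypothetical table; the default index labels are 0, 1, 2.
scores = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "score": [70, 95, 82]})

# Sorting reorders the rows, but every row keeps its original label.
ranked = scores.sort_values("score", ascending=False)

top_by_position = ranked.iloc[0]  # first row as displayed after sorting
row_labelled_0 = ranked.loc[0]    # row whose index label is 0

print(top_by_position["name"])  # Bob
print(row_labelled_0["name"])   # Ann
```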

Detailed Breakdown & Examples:

Let's consider a sample DataFrame:

```
import pandas as pd
import numpy as np

data = {'col_A': [10, 20, 30, 40, 50],
        'col_B': ['p', 'q', 'r', 's', 't'],
        'col_C': np.random.rand(5)}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3', 'row4', 'row5'])
print("Original DataFrame:\n", df)
# Output (col_C holds random values, so yours will differ):
# Original DataFrame:
#       col_A col_B    col_C
# row1     10     p  0.123...
# row2     20     q  0.456...
# row3     30     r  0.789...
# row4     40     s  0.987...
# row5     50     t  0.654...
```

Using `.loc` (Label-based):

Selecting a single row by label:

```
print("\nRow 'row2' using .loc:\n", df.loc['row2'])
```

Selecting multiple rows by a list of labels:

```
print("\nRows 'row1' and 'row3' using .loc:\n", df.loc[['row1', 'row3']])
```

Selecting a range of rows by label (inclusive):

```
print("\nRows 'row2' to 'row4' using .loc:\n", df.loc['row2':'row4'])
```

Selecting a single cell by row and column labels: df.loc[row_label, col_label]
```
print("\nCell at ('row3', 'col_B') using .loc:", df.loc['row3', 'col_B'])
```

Selecting specific rows and specific columns by labels:

```
print("\nRows 'row1','row4' and Cols 'col_A','col_C' using .loc:\n", df.loc[['row1', 'row4'], ['col_A', 'col_C']])
```

Selecting rows based on a boolean condition (powerful!):

```
print("\nRows where col_A > 25 using .loc:\n", df.loc[df['col_A'] > 25])
```
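
Boolean selection also combines with column selection and with more than one condition. This sketch rebuilds a smaller version of the sample DataFrame (without `col_C`) so it runs on its own; the names `mask` and `subset` are illustrative only:

```python
import pandas as pd

df = pd.DataFrame(
    {"col_A": [10, 20, 30, 40, 50], "col_B": list("pqrst")},
    index=["row1", "row2", "row3", "row4", "row5"],
)

# Each comparison needs its own parentheses; combine with & (and), | (or), ~ (not).
mask = (df["col_A"] > 15) & (df["col_A"] < 45)

# Row mask first, column labels second.
subset = df.loc[mask, ["col_B"]]
print(subset)
```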

Setting values using labels:

```
df_copy = df.copy()
df_copy.loc['row1', 'col_A'] = 100
print("\nDataFrame after setting ('row1', 'col_A') to 100 with .loc:\n", df_copy)
```
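
Label-based assignment also accepts a boolean mask for the rows, which gives a compact way to update one column conditionally. This is a small sketch on a rebuilt copy of the sample data; `df_flagged` and the value `"big"` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"col_A": [10, 20, 30, 40, 50], "col_B": list("pqrst")},
    index=["row1", "row2", "row3", "row4", "row5"],
)

# Work on a copy so the original DataFrame stays unchanged.
df_flagged = df.copy()

# Rows where col_A > 25, column 'col_B': assign in a single .loc call.
df_flagged.loc[df_flagged["col_A"] > 25, "col_B"] = "big"
print(df_flagged)
```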

Using `.iloc` (Integer position-based):

Selecting a single row by integer position:

```
print("\nRow at position 1 (second row) using .iloc:\n", df.iloc[1])
```

Selecting multiple rows by a list of integer positions:

```
print("\nRows at positions 0 and 2 using .iloc:\n", df.iloc[[0, 2]])
```

Selecting a range of rows by integer position (exclusive end):

```
print("\nRows from position 1 up to (not including) 4 using .iloc:\n", df.iloc[1:4])
```
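
Because `.iloc` follows Python list semantics, negative positions and slice steps behave the same way they do on a list. This is a brief sketch on a rebuilt copy of the sample data, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame(
    {"col_A": [10, 20, 30, 40, 50], "col_B": list("pqrst")},
    index=["row1", "row2", "row3", "row4", "row5"],
)

last_row = df.iloc[-1]      # last row, returned as a Series
last_two = df.iloc[-2:]     # last two rows, returned as a DataFrame
every_other = df.iloc[::2]  # positions 0, 2, 4

print(last_row.name)              # row5
print(list(last_two.index))       # ['row4', 'row5']
print(list(every_other.index))    # ['row1', 'row3', 'row5']
```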

Selecting a single cell by row and column integer positions: df.iloc[row_pos, col_pos]
```
print("\nCell at (pos 2, pos 1) using .iloc:", df.iloc[2, 1]) # row3, col_B
```

Selecting specific rows and specific columns by integer positions:

```
print("\nRows at pos 0,3 and Cols at pos 0,2 using .iloc:\n", df.iloc[[0, 3], [0, 2]])
```

Selecting all rows for specific columns by integer position:

```
print("\nAll rows for columns at pos 0 and 2 using .iloc:\n", df.iloc[:, [0, 2]])
```

Setting values using integer positions:

```
df_copy_iloc = df.copy()
df_copy_iloc.iloc[0, 0] = 1000
print("\nDataFrame after setting (pos 0, pos 0) to 1000 with .iloc:\n", df_copy_iloc)
```
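
Sometimes the row is known by position and the column by name. One possible approach, sketched here as a suggestion rather than the only way, is to translate one side so a single indexer handles both: `Index.get_loc` turns a label into a position (an integer when the labels are unique), and `df.index[pos]` turns a position into a label. The name `df_mixed` is illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"col_A": [10, 20, 30, 40, 50], "col_B": list("pqrst")},
    index=["row1", "row2", "row3", "row4", "row5"],
)
df_mixed = df.copy()

# Label -> position, then use .iloc for both axes.
col_pos = df_mixed.columns.get_loc("col_A")  # 0, since column labels are unique
df_mixed.iloc[0, col_pos] = 1000

# Position -> label, then use .loc for both axes.
first_label = df_mixed.index[0]              # 'row1'
df_mixed.loc[first_label, "col_B"] = "z"

print(df_mixed.head(2))
```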

Important Note on Slicing:

df.loc['label1':'label3'] is **inclusive** of 'label3'.
df.iloc[0:3] is **exclusive** of position 3 (i.e., it gets positions 0, 1, 2).
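
The difference is easiest to see when labels and positions are the same numbers. This sketch uses a hypothetical DataFrame `nums` with the default 0..4 index, where the same slice returns a different number of rows depending on the indexer:

```python
import pandas as pd

# Default RangeIndex: labels 0..4 coincide with positions 0..4.
nums = pd.DataFrame({"col_A": [10, 20, 30, 40, 50]})

by_label = nums.loc[0:3]      # labels 0, 1, 2, 3 (end label included)
by_position = nums.iloc[0:3]  # positions 0, 1, 2 (end position excluded)

print(len(by_label))     # 4
print(len(by_position))  # 3
```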

When to Use Which:

| Scenario | Use `.loc` | Use `.iloc` |
| --- | --- | --- |
| You know the row/column labels (names). | ✔️ Yes (e.g., `df.loc['row_name', 'column_name']`) | ❌ No (unless labels happen to be integers that match positions) |
| You want to select data by its position, regardless of labels. | ❌ No | ✔️ Yes (e.g., `df.iloc[0, 1]` for first row, second column) |
| Row/column labels are not integers (e.g., strings, datetimes). | ✔️ Yes | ❌ No (`.iloc` accepts integer positions, slices of them, or boolean arrays, never labels) |
| Row/column labels are integers, but you want to refer to them as labels. | ✔️ Yes (e.g., if index is `[10, 20, 30]`, `df.loc[10]` uses label 10) | Use with caution (`df.iloc[10]` would try to get the 11th row by position, which might be a different row or out of bounds) |
| You need to select rows based on a boolean condition. | ✔️ Yes (e.g., `df.loc[df['col'] > 5]`) | Possible but less direct: `.iloc` rejects a boolean Series, so convert the mask to a NumPy array first, e.g. `df.iloc[df['col'].values > 5]` or `df.iloc[(df['col'] > 5).to_numpy()]`. `.loc` is more natural for this. |
| Slicing behavior for the end point. | Inclusive (`'start':'end'` includes 'end') | Exclusive (`start:end` excludes `end`) |
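
Two rows of the table, integer labels and boolean conditions, can be checked with a short sketch. The DataFrame `df_int` below is hypothetical and uses the `[10, 20, 30]` index mentioned in the table:

```python
import pandas as pd

df_int = pd.DataFrame({"col": [3, 7, 9]}, index=[10, 20, 30])

# Integer labels: .loc reads 10 as a label, .iloc reads 0 as a position.
value_by_label = df_int.loc[10, "col"]   # 3
value_by_position = df_int.iloc[0, 0]    # 3, the same cell

# Boolean condition: .loc takes the Series directly,
# .iloc needs the mask converted to a plain NumPy array.
mask = df_int["col"] > 5
via_loc = df_int.loc[mask]
via_iloc = df_int.iloc[mask.to_numpy()]

print(via_loc.equals(via_iloc))  # True
```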

Key Considerations:

Clarity and Readability: Use .loc when your selection logic is based on meaningful labels. This often makes code easier to understand. Use .iloc when the numerical position is what matters.
Avoiding Ambiguity: Plain [] selection changes meaning with the type of key: df['col_name'] selects a column by label, while a slice such as df[0:2] selects rows. With an integer index (e.g., 0, 1, 2, ...) it is easy to misread whether a number refers to a label or a position. .loc and .iloc are explicit and therefore preferred to avoid this ambiguity.
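Potential for Errors: Mixing up .loc and .iloc usually raises an exception: passing a label to .iloc raises TypeError, asking .loc for a label that does not exist raises KeyError, and an out-of-range position in .iloc raises IndexError. With an integer index the mix-up can instead return the wrong row without any error.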
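
The failure modes and the behavior of plain `[]` can be observed directly. This is a small sketch; `error_name` is a hypothetical helper written for this example and is not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({"col_A": [10, 20, 30]}, index=["row1", "row2", "row3"])


def error_name(fn):
    """Return the name of the exception raised by fn, or None if it succeeds."""
    try:
        fn()
    except Exception as exc:
        return type(exc).__name__
    return None


label_to_iloc = error_name(lambda: df.iloc["row1"])  # label given to .iloc
missing_label = error_name(lambda: df.loc["row9"])   # label that does not exist
out_of_bounds = error_name(lambda: df.iloc[10])      # position past the end

# Plain []: an integer slice selects rows, a single key is looked up as a column label.
first_two = df[0:2]
bare_int_key = error_name(lambda: df[0])             # no column is labelled 0

print(label_to_iloc, missing_label, out_of_bounds, bare_int_key)
```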

In summary: Use .loc for label-based indexing and .iloc for purely integer-based positional indexing. Being explicit with these accessors leads to more robust and readable Pandas code.
