ADVANCED COURSES ARE LIVE !!! HURRY UP JOIN NOW

Data Selection & Indexing in Pandas

pandas full course

1. Label-based Data Selection with loc[]

Definition and Purpose

Label-based data selection in Pandas utilizes the loc[] indexer to access data based on row labels and column labels. Unlike position-based indexing, loc[] allows users to specify explicit labels, facilitating intuitive data retrieval aligned with real-world data descriptions, such as country names, product IDs, or date labels.

Syntax and Usage

The fundamental syntax for label-based data selection is:

df.loc[row_label(s), column_label(s)]
  • To select a specific cell, specify both row and column labels:
    df.loc['row_label', 'column_label']
  • To select entire rows or columns:
    df.loc['row_label']      # Single row
    df.loc[:, 'column_label'] # Single column
    
  • For multiple rows or columns:
    df.loc[['row1', 'row2'], ['col1', 'col2']]

Advanced Label Selection

  • Slicing with loc[] allows selecting a range of data based on labels:
    df.loc['start_label':'end_label']
  • Conditional label-based data extraction employs boolean indexing with loc[]:
    df.loc[df['sales'] > 1000]
    This retrieves rows where the value in ‘sales’ exceeds 1000.

Practical Example

Suppose a DataFrame of sales data with monthly labels:

Month Sales Region
Jan2000North
Feb1500South
Mar1800East
import pandas as pd
data = {'Sales': [2000, 1500, 1800],
        'Region': ['North', 'South', 'East']}
index = ['Jan', 'Feb', 'Mar']
df = pd.DataFrame(data, index=index)

# Select sales in February
feb_sales = df.loc['Feb', 'Sales']
# Select all data for March
mar_data = df.loc['Mar']
# Select sales in Jan and Feb
jan_feb_sales = df.loc['Jan':'Feb', ['Sales']]
Outcome:
– Feb sales: 1500
– March data: Sales=1800, Region=’East’
– Sales for Jan and Feb:
     Sales
Jan    2000
Feb    1500

Practice Questions

  1. How does loc[] differ from iloc[] in data selection?
  2. Write code to select the ‘Region’ column for rows ‘Jan’ and ‘Mar’ using loc[].
  3. How would you select data where the sales value exceeds 1700 using loc[]?
  4. Demonstrate slicing from ‘Feb’ to ‘Mar’ using loc[].
  5. What is the result of df.loc['Jan':'Feb']?

2. Position-based Data Selection with iloc[]

Definition and Purpose

iloc[] is the position-based indexer used in Pandas for selecting data based on integer-based row and column positions. It is essential when data labels are unknown or when working with positional data views, especially in numerical computations or when data is ordered without explicit labels.

Syntax and Usage

The basic syntax:

df.iloc[row_position, column_position]
  • To access a specific cell:
    df.iloc[0, 1]
  • Entire rows or columns:
    df.iloc[0, :]    # First row
    df.iloc[:, 2]    # Third column
    
  • Multiple selections:
    df.iloc[0:2, 0:3]  # Rows 0 and 1, Columns 0 to 2

Slicing with iloc[]

Slicing in iloc[] is inclusive of the start index but exclusive of the end index, similar to standard Python slicing:

df.iloc[1:3, 0:2]

Practical Example

Using a same DataFrame as above:

# Selecting the first row
first_row = df.iloc[0]
# Selecting the 'Sales' column (assuming it's the second column, index 1)
sales_column = df.iloc[:, 1]
# Rows 0 and 1, columns 0 and 1
subset = df.iloc[0:2, 0:2]
Outcome:
– First row:
Sales       2000
Region     North
Name: Jan, dtype: object
– Sales column:
Jan    2000
Feb    1500
Mar    1800
Name: Sales, dtype: int64
– Subset:
      Sales  Region
Jan   2000   North
Feb   1500   South

Practice Questions

  1. How does iloc[] handle slicing differently from loc[]?
  2. Retrieve the data in the second row across all columns using iloc[].
  3. Write code to select the first two rows and first three columns.
  4. What will be the output of df.iloc[1:3, 0:2]?
  5. How can you select the last row using iloc[]?

3. Conditional Filtering for Data Subsetting

Purpose and Importance

Conditional filtering allows for extracting specific data subsets based on criteria, making datasets more manageable and relevant analyses possible. It involves creating boolean masks that filter rows where conditions are met.

Implementing Conditional Filters

  • Use comparison operators (==, !=, >, <, >=, <=) to generate boolean Series:
    mask = df['Sales'] > 1500
  • Combine multiple conditions with logical operators:
    • & for AND
    • | for OR
    • ~ for NOT
  • Ensure parentheses are used for proper evaluation:
    df[(df['Sales'] > 1500) & (df['Region'] == 'North')]

Filtering Examples

Suppose a catalog of product sales:

Product Sales Category
A500Electronics
B2000Furniture
C1500Electronics
D3000Furniture
# Select products with sales over 1000
high_sales = df[df['Sales'] > 1000]

# Select furniture products with sales over 2000
furniture_high = df[(df['Category'] == 'Furniture') & (df['Sales'] > 2000)]
Outcome:
high_sales includes products B and D.
furniture_high includes only D.

Practical Significance

Conditional data subsetting enhances data analysis efficiency, allowing targeted insights such as high-performing products or regional sales trends.

Practice Questions

  1. Write code to select rows where the ‘Category’ is ‘Electronics’ and ‘Sales’ are less than 1600.
  2. How do you select data where sales are between 1000 and 2000?
  3. Demonstrate filtering rows where ‘Region’ is not ‘North’ using boolean masks.
  4. What output do you get from df[df['Sales'] > 1800]?
  5. How can multiple conditions be combined using | and & operators?

4. Setting and Resetting Index for DataFrames

Understanding DataFrame Indexing

The index in a DataFrame uniquely identifies each row, serving as a reference point for data selection. A meaningful custom index facilitates faster retrievals, alignment, and data integrity during analysis.

Setting a New Index

Using set_index(), you can specify a column to serve as the DataFrame’s index:

df.set_index('Product', inplace=True)

This operation promotes the ‘Product’ column to an index, replacing the default integer index.

Resetting the Index

To revert to the default integer indexing, use reset_index():

df.reset_index(inplace=True)

This action converts the current index back into a column, maintaining data completeness.

Handling Index Labels

When setting or resetting indices, be cautious about index label preservation:

  • Use drop=True with set_index() if you do not want to keep the original column.
  • Use drop=False to retain the column along with the index.

Practical Example

# Initial DataFrame
df = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Sales': [500, 2000, 1500],
    'Category': ['Electronics', 'Furniture', 'Electronics']
})

# Setting 'Product' as index
df.set_index('Product', inplace=True)

# Resetting to default index
df.reset_index(inplace=True)
Outcome:
– Post setting index: rows identified by product labels.
– Post resetting index: ‘Product’ becomes a column again.

Practice Questions

  1. How does setting an index improve data retrieval?
  2. Write code to set the ‘Category’ column as index.
  3. What is the effect of drop=True in set_index()?
  4. Why might you want to reset an index after data manipulation?
  5. Demonstrate resetting index without losing the index labels.

Summary

This material elaborates on Data Selection and Indexing Techniques in Pandas, emphasizing the use of label-based (loc[]) and position-based (iloc[]) selection methods for flexible data retrieval. It underscores the importance of conditional filtering to subset datasets efficiently and discusses index management strategies including setting and resetting indices to optimize data accessibility and clarity. Knowledge of these core techniques enhances data analysis workflows, leading to faster, more accurate insights.


Study Resources


This educational material provides a comprehensive, theoretical overview suitable for learners to master Data Selection & Indexing in Pandas with practical insights and example exercises for effective understanding.

More Courses

Enroll Now

Tags:

Share:

You May Also Like

Your Website WhatsApp