11 NumPy and Pandas#

Goal#

Learn the fundamentals of NumPy (numerical computing) and Pandas (data manipulation). These are essential libraries for scientific computing and data analysis.

Prerequisites#

1. Introduction#

NumPy provides efficient arrays and mathematical functions for numerical computing. Pandas builds on NumPy to provide high-level data structures (DataFrames) perfect for tabular data like spreadsheets.

Together, they handle tasks like:

  • Reading and writing data files (CSV, Excel, etc.)

  • Cleaning and transforming data

  • Statistical analysis

  • Filtering and aggregating data

  • Preparing data for visualization or machine learning

2. Installation#

NumPy and Pandas come preinstalled with the Anaconda distribution (Miniconda is minimal and does not include them). Otherwise, install them with:

pip install numpy pandas

3. NumPy Basics#

3.1 NumPy Arrays#

import numpy as np

# Create arrays
arr = np.array([1, 2, 3, 4, 5])
print(arr)                    # [1 2 3 4 5]
print(type(arr))              # <class 'numpy.ndarray'>

# 2D arrays (matrices)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.shape)           # (2, 3) - 2 rows, 3 columns

# Create special arrays
zeros = np.zeros(5)           # [0. 0. 0. 0. 0.]
ones = np.ones((3, 3))        # 3x3 matrix of ones
range_arr = np.arange(0, 10)  # [0 1 2 3 4 5 6 7 8 9]

3.2 Array Operations#

# Element-wise operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b)      # [5 7 9]
print(a * 2)      # [2 4 6]
print(a ** 2)     # [1 4 9]
print(np.sqrt(a)) # [1.        1.41421356 1.73205081]
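Element-wise operations also work between arrays of different but compatible shapes: NumPy "broadcasts" the smaller array across the larger one. A minimal sketch:

```python
import numpy as np

# A (2, 3) matrix and a length-3 row vector
matrix = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])

# The row vector is broadcast across each row of the matrix
result = matrix + row
print(result)
# [[11 22 33]
#  [14 25 36]]
```

Broadcasting avoids writing explicit loops and is one of the main reasons NumPy code is both short and fast.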

3.3 Indexing and Slicing#

arr = np.array([10, 20, 30, 40, 50])

print(arr[0])     # 10 (first element)
print(arr[-1])    # 50 (last element)
print(arr[1:4])   # [20 30 40] (elements 1 to 3)

# 2D indexing
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix[0, :])  # [1 2 3] (first row)
print(matrix[:, 1])  # [2 5] (second column)
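Beyond positional indexing, NumPy arrays support boolean masking: indexing with a condition keeps only the elements where the condition is true. A short example using the same array as above:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# A condition on an array produces a boolean mask
mask = arr > 25
print(mask)                # [False False  True  True  True]
print(arr[mask])           # [30 40 50]

# Or write the condition inline
print(arr[arr % 20 == 0])  # [20 40]
```

This is the same idea Pandas uses for row filtering later in this chapter.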

3.4 Useful Functions#

data = np.array([1, 5, 3, 9, 2])

print(np.mean(data))      # 4.0 (average)
print(np.std(data))       # 2.83... (standard deviation)
print(np.min(data))       # 1
print(np.max(data))       # 9
print(np.sum(data))       # 20
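A few more functions that come up constantly, shown on the same array:

```python
import numpy as np

data = np.array([1, 5, 3, 9, 2])

print(np.median(data))   # 3.0 (middle value when sorted)
print(np.argmax(data))   # 3 (index of the maximum, 9)
print(np.argmin(data))   # 0 (index of the minimum, 1)
print(np.cumsum(data))   # [ 1  6  9 18 20] (running total)
```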

4. Pandas Basics#

4.1 DataFrames (Tables)#

import pandas as pd

# Create from a dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Score": [85.5, 92.3, 78.9]
}

df = pd.DataFrame(data)
print(df)

#      Name  Age  Score
# 0   Alice   25   85.5
# 1     Bob   30   92.3
# 2 Charlie   35   78.9
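DataFrame columns behave like NumPy arrays, so you can create new columns from element-wise operations on existing ones. A sketch using the same table:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Score": [85.5, 92.3, 78.9],
})

# Derive a new column element-wise from an existing one
df["Passed"] = df["Score"] >= 80

print(df.dtypes)           # Name: object, Age: int64, Score: float64, Passed: bool
print(df["Passed"].sum())  # 2 (Alice and Bob passed)
```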

4.2 Reading and Writing Data#

# Read from CSV
df = pd.read_csv("data.csv")

# Write to CSV
df.to_csv("output.csv", index=False)

# Read from Excel
df = pd.read_excel("data.xlsx")

# Write to Excel
df.to_excel("output.xlsx", index=False)
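To try the CSV round trip without creating files on disk, you can pass an in-memory buffer wherever Pandas expects a file. A self-contained sketch (the CSV text here is made up for illustration):

```python
import io
import pandas as pd

# io.StringIO stands in for a real file path
csv_text = "Name,Age\nAlice,25\nBob,30\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (2, 2)

# Write back to a string buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue().splitlines()[0])  # Name,Age
```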

4.3 Accessing Data#

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35]
})

# Access column
print(df["Name"])          # Series of names
print(df["Age"].mean())    # 30.0 (average age)

# Access row
print(df.iloc[0])         # First row as Series
print(df.loc[0, "Name"])  # "Alice" (first row, Name column)

# First/Last rows
print(df.head(2))         # First 2 rows
print(df.tail(1))         # Last row

4.4 Filtering Data#

# Filter rows where Age > 25
older = df[df["Age"] > 25]

# Multiple conditions: combine with & (and) or | (or), each condition in parentheses
# (assumes the DataFrame also has a "Score" column)
young_high_score = df[(df["Age"] < 30) & (df["Score"] > 80)]

# Select specific columns
names_ages = df[["Name", "Age"]]
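Two more filtering idioms worth knowing: `.isin()` for membership tests against a list, and `~` to negate a condition. A sketch on the same small table:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

# Keep rows whose Name appears in a list
subset = df[df["Name"].isin(["Alice", "Charlie"])]
print(subset["Age"].tolist())  # [25, 35]

# Negate a condition with ~
not_bob = df[~(df["Name"] == "Bob")]
print(len(not_bob))            # 2
```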

4.5 Data Cleaning#

# Check for missing values
print(df.isnull())

# Drop rows with missing values
df_clean = df.dropna()

# Fill missing values
df.fillna(0, inplace=True)

# Remove duplicates
df_unique = df.drop_duplicates()

# Rename columns
df.rename(columns={"Age": "Years"}, inplace=True)
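The cleaning steps above can be seen end to end on a small frame with one missing value and one duplicate row (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Score": [85.5, np.nan, 85.5],
})

print(df.isnull().sum()["Score"])   # 1 missing Score
print(len(df.dropna()))             # 2 rows survive
print(len(df.drop_duplicates()))    # 2 (the duplicate Alice row is removed)

# Plain assignment also works instead of inplace=True
df["Score"] = df["Score"].fillna(df["Score"].mean())
print(df["Score"].tolist())         # [85.5, 85.5, 85.5]
```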

4.6 Grouping and Aggregation#

# Group by a column and calculate statistics
df_grouped = df.groupby("Age").agg({
    "Score": ["mean", "min", "max"],
    "Name": "count"
})

# By department, get average salary
df.groupby("Department")["Salary"].mean()
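To make the department example concrete, here is a runnable sketch with a made-up three-row table:

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["Lab", "Lab", "Office"],
    "Salary": [50000, 60000, 45000],
})

# One mean per department
avg = df.groupby("Department")["Salary"].mean()
print(avg["Lab"])     # 55000.0
print(avg["Office"])  # 45000.0
```

The result is a Series indexed by the grouping column, so individual groups can be looked up by name.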

4.7 Sorting#

# Sort by Age (ascending)
df_sorted = df.sort_values("Age")

# Sort by Age (descending)
df_sorted = df.sort_values("Age", ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(["Department", "Salary"])
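Note that sorting keeps the original row labels; `reset_index` gives the sorted frame a fresh 0..n-1 index when you want positional access afterwards. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

df_sorted = df.sort_values("Age", ascending=False)
print(df_sorted["Name"].tolist())   # ['Charlie', 'Bob', 'Alice']

# reset_index renumbers the rows 0, 1, 2 in the new order
df_sorted = df_sorted.reset_index(drop=True)
print(df_sorted.loc[0, "Name"])     # Charlie
```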

5. Combining NumPy and Pandas#

import numpy as np
import pandas as pd

# Create data with NumPy
values = np.random.random(100)  # 100 random numbers

# Put in DataFrame
df = pd.DataFrame({"measurements": values})

# Calculate statistics
print(df["measurements"].describe())  # Count, mean, std, min, etc.
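The conversion also works in the other direction: a DataFrame column turns back into a NumPy array with `.to_numpy()`, which is useful when a library expects plain arrays. A sketch using NumPy's newer seeded generator for reproducibility:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # seeded so reruns give the same values
df = pd.DataFrame({"measurements": rng.random(100)})

# Convert the column back to a plain NumPy array
arr = df["measurements"].to_numpy()
print(type(arr))   # <class 'numpy.ndarray'>
print(arr.shape)   # (100,)
```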

6. Common Workflow Example#

import pandas as pd
import numpy as np

# 1. Load data
df = pd.read_csv("experiment_data.csv")

# 2. Explore
print(df.shape)              # Dimensions
print(df.info())             # Data types and null counts
print(df.describe())         # Statistical summary

# 3. Clean
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())
df = df[df["result"] != "ERROR"]

# 4. Analyze
df["result"] = pd.to_numeric(df["result"])  # column was read as text because of the "ERROR" entries
by_condition = df.groupby("condition")["result"].mean()
print(by_condition)

# 5. Save cleaned data
df.to_csv("cleaned_data.csv", index=False)

7. Resources#

Next Steps#

Now that you can work with data efficiently, let’s visualize it: 12 Matplotlib.