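Learning Pandas: Fundamentals of Data Analysis in Python

In today's data-driven world, processing, analyzing, and understanding data is a core skill in many fields. Python, a powerful language that is easy to learn, holds a central place in data science, and within Python's data science ecosystem the Pandas library is the foundation for data manipulation and analysis.

Pandas provides high-performance, easy-to-use data structures and analysis tools that greatly simplify common tasks such as importing, cleaning, transforming, and aggregating data. It handles both structured data (tables, CSV files, database records) and time series data efficiently.

This article covers the fundamentals of Pandas: its core data structures, Series and DataFrame, and the most common data operations, giving you a solid basis for data analysis.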

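1. What Is Pandas, and Why Choose It?

Pandas is an open-source Python library designed for data manipulation and analysis. Its name comes from "Panel Data" and "Python Data Analysis". Its main features are:

Core data structures: Series (a one-dimensional labeled array) and DataFrame (a two-dimensional labeled table), which handle many kinds of data efficiently.
Loading and saving data: reads from many formats (CSV, Excel, SQL databases, JSON, and others) and writes back to them just as easily.
Cleaning and preprocessing: tools for missing data, duplicates, and outliers, plus type conversion and format adjustment.
Selection and filtering: flexible indexing that selects data by label, by position, or by condition.
Transformation: sorting, ranking, pivoting, and stacking/melting.
Aggregation and grouping: a groupby facility for splitting data into groups, aggregating, and transforming.
Time series analysis: built-in support for date range generation, frequency conversion, and moving-window statistics.

Why choose Pandas?

Performance: Pandas is built on NumPy, and its performance-critical internals are written in C or Cython.
Ease of use: the API is intuitive, and many common data operations take only a few lines of code.
Flexibility: it handles many types of data and integrates with other Python libraries such as NumPy, Matplotlib, and Scikit-learn.
Community support: as one of the most popular data analysis libraries, Pandas has a large, active community and extensive documentation, tutorials, and answered questions.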

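2. Installation and Import

Before using Pandas you need to install it. If you use the Anaconda distribution, Pandas is usually already included. Otherwise (a minimal Miniconda setup, for example, does not bundle it), install it with pip:

```bash
pip install pandas
```

After installation, import it under the conventional alias pd:

```python
import pandas as pd
```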

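3. Core Pandas Data Structure: Series

A Series is a one-dimensional labeled array. Think of it as a NumPy array with an index, or as a single spreadsheet column. Every element has an associated label, and the labels together form the index (Index).

Creating a Series

You can create a Series in several ways:

From a Python list or a NumPy array:

```python
import pandas as pd
import numpy as np

# Create from a list
s1 = pd.Series([1, 3, 5, 7, 9])
print("Series from list:")
print(s1)
# The default index runs from 0 to n-1

# Create from a NumPy array
s2 = pd.Series(np.array([2, 4, 6, 8, 10]))
print("\nSeries from NumPy array:")
print(s2)
```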

Output:

```
Series from list:
0    1
1    3
2    5
3    7
4    9
dtype: int64

Series from NumPy array:
0     2
1     4
2     6
3     8
4    10
dtype: int64
```

With an explicit index:

```python
s3 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("\nSeries with custom index:")
print(s3)
```

Output:

```
Series with custom index:
a    10
b    20
c    30
d    40
dtype: int64
```

From a Python dictionary: the keys become the index of the Series and the values become its data.

```python
data = {'apple': 10, 'banana': 15, 'cherry': 8, 'date': 12}
s4 = pd.Series(data)
print("\nSeries from dictionary:")
print(s4)
```

Output:

```
Series from dictionary:
apple     10
banana    15
cherry     8
date      12
dtype: int64
```

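Indexing and Accessing Values in a Series

You can access elements of a Series by index label or by integer position. Plain square brackets look up labels; on the default 0 to n-1 index the label and the position happen to coincide. For positional access on a Series with a custom index, use .iloc:

```python
print("\nAccessing Series elements:")

# Access by index label (string or other custom labels)
print("s3['b']:", s3['b'])
print("s4['cherry']:", s4['cherry'])

# With the default integer index, the label equals the position
print("s1[0]:", s1[0])
print("s2[3]:", s2[3])

# With a custom index, use .iloc for positional access
# (s3[1] relies on a positional fallback that recent pandas versions deprecate)
print("s3.iloc[1]:", s3.iloc[1])
```

Output:

```
Accessing Series elements:
s3['b']: 20
s4['cherry']: 8
s1[0]: 1
s2[3]: 8
s3.iloc[1]: 20
```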
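
The difference between label and position matters most when the index holds integers that are not in 0 to n-1 order. The sketch below uses a made-up Series `s` (not part of the running example) to show that `.loc` always looks up a label while `.iloc` always looks up a position:

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the element labeled 0
by_position = s.iloc[0]  # the element in the first position

print("s.loc[0]:", by_label)
print("s.iloc[0]:", by_position)
```

Spelling out `.loc` or `.iloc` avoids having to remember which rule plain square brackets apply to a given index type.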

Slicing a Series

A Series also supports list-style slicing:

```python
print("\nSeries slicing:")

# Slice by integer position (end position excluded)
print("s1[1:3]:")
print(s1[1:3])

# Slice by index label (end label included)
print("\ns3['b':'d']:")
print(s3['b':'d'])
```

Output:

```
Series slicing:
s1[1:3]:
1    3
2    5
dtype: int64

s3['b':'d']:
b    20
c    30
d    40
dtype: int64
```

Basic Series Operations

A Series supports arithmetic and many NumPy-style array operations:

```python
print("\nSeries basic operations:")

# Broadcasting a scalar
print("s1 + 5:")
print(s1 + 5)

# Arithmetic between two Series (aligned on the index)
s5 = pd.Series([100, 200, 300], index=[1, 3, 4])
print("\ns1 + s5:")
print(s1 + s5)  # labels present in only one Series yield NaN (Not a Number)

# Filtering
print("\ns1 > 5:")
print(s1[s1 > 5])  # returns a Series of the elements that satisfy the condition
```

Output:

```
Series basic operations:
s1 + 5:
0     6
1     8
2    10
3    12
4    14
dtype: int64

s1 + s5:
0      NaN
1    103.0
2      NaN
3    207.0
4    309.0
dtype: float64

s1 > 5:
3    7
4    9
dtype: int64
```
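
When NaN for unmatched labels is not what you want, the method form of the operator accepts a fill value. A minimal sketch, reusing the same two Series and assuming that a missing label should count as 0:

```python
import pandas as pd

s1 = pd.Series([1, 3, 5, 7, 9])
s5 = pd.Series([100, 200, 300], index=[1, 3, 4])

# Series.add aligns on the index like `+`, but fill_value=0 substitutes 0
# for a label that is missing on one side before adding.
total = s1.add(s5, fill_value=0)
print(total)
```

The result still has dtype float64, but labels 0 and 2 now keep their original values instead of becoming NaN.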

4. Core Pandas Data Structure: DataFrame

The DataFrame is the most important and most widely used data structure in Pandas. It is a two-dimensional table, similar to a spreadsheet, a SQL table, or a dictionary of Series objects. It consists of rows and columns, and each column can hold a different data type.

Creating a DataFrame

Common ways to create a DataFrame:

From a dictionary: the keys become the column names, and the values (usually lists, NumPy arrays, or Series) become the column data.

```python
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 55000, 70000]
}
df = pd.DataFrame(data)
print("DataFrame from dictionary:")
print(df)
```

Output:

```
DataFrame from dictionary:
      Name  Age         City  Salary
0    Alice   25     New York   50000
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   75000
3    David   28      Houston   55000
4      Eva   32      Phoenix   70000
```

By default the row index of a DataFrame is the integers 0 to n-1. You can also specify the row index yourself:

```python
df_custom_index = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
print("\nDataFrame with custom index:")
print(df_custom_index)
```

Output:

```
DataFrame with custom index:
      Name  Age         City  Salary
a    Alice   25     New York   50000
b      Bob   30  Los Angeles   60000
c  Charlie   35      Chicago   75000
d    David   28      Houston   55000
e      Eva   32      Phoenix   70000
```

From a list of lists or a two-dimensional NumPy array:

```python
data2 = [
    ['Alice', 25, 'New York', 50000],
    ['Bob', 30, 'Los Angeles', 60000],
    ['Charlie', 35, 'Chicago', 75000]
]
df2 = pd.DataFrame(data2, columns=['Name', 'Age', 'City', 'Salary'])
print("\nDataFrame from list of lists:")
print(df2)
```

Output:

```
DataFrame from list of lists:
      Name  Age         City  Salary
0    Alice   25     New York   50000
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   75000
```

Basic DataFrame Attributes

A DataFrame exposes several attributes for inspecting its structure and contents:

```python
print("\nDataFrame properties:")
print("Shape:", df.shape)          # number of rows and columns
print("Columns:", df.columns)      # column names
print("Index:", df.index)          # row index
print("Data types:", df.dtypes)    # data type of each column
print("Values (as NumPy array):", df.values)  # the data as a NumPy array
```

Output (the default row index is shown as a RangeIndex; exact dtype names can vary with the pandas version):

```
DataFrame properties:
Shape: (5, 4)
Columns: Index(['Name', 'Age', 'City', 'Salary'], dtype='object')
Index: RangeIndex(start=0, stop=5, step=1)
Data types: Name      object
Age        int64
City      object
Salary     int64
dtype: object
Values (as NumPy array): [['Alice' 25 'New York' 50000]
 ['Bob' 30 'Los Angeles' 60000]
 ['Charlie' 35 'Chicago' 75000]
 ['David' 28 'Houston' 55000]
 ['Eva' 32 'Phoenix' 70000]]
```

Selecting DataFrame Columns

Select one or more columns by name:

```python
print("\nSelecting columns:")

# Select a single column (returns a Series)
name_column = df['Name']
print("Single column 'Name':")
print(name_column)

# Select several columns (returns a DataFrame)
subset_columns = df[['Name', 'Salary']]
print("\nMultiple columns ['Name', 'Salary']:")
print(subset_columns)

# Dot notation (only when the column name is a valid Python identifier
# and does not clash with a DataFrame attribute or method)
print("\nSingle column 'Age' using dot notation:")
print(df.Age)
```

Output:

```
Selecting columns:
Single column 'Name':
0      Alice
1        Bob
2    Charlie
3      David
4        Eva
Name: Name, dtype: object

Multiple columns ['Name', 'Salary']:
      Name  Salary
0    Alice   50000
1      Bob   60000
2  Charlie   75000
3    David   55000
4      Eva   70000

Single column 'Age' using dot notation:
0    25
1    30
2    35
3    28
4    32
Name: Age, dtype: int64
```

Selecting DataFrame Rows

Row selection is a central part of working with a DataFrame. Pandas provides two main indexers for it: .loc and .iloc.

.loc: label-based indexing

.loc selects rows by their index labels.

```python
print("\nSelecting rows using .loc (label-based):")

# Select a single row (by index label)
print("Row with index 0:")
print(df.loc[0])

# Select several rows (by a list of index labels)
print("\nRows with indices 1, 3:")
print(df.loc[[1, 3]])

# Slice by label range (end label included)
print("\nRows from index 'b' to 'd' (inclusive) in df_custom_index:")
print(df_custom_index.loc['b':'d'])
```

Output:

```
Selecting rows using .loc (label-based):
Row with index 0:
Name         Alice
Age             25
City      New York
Salary       50000
Name: 0, dtype: object

Rows with indices 1, 3:
    Name  Age         City  Salary
1    Bob   30  Los Angeles   60000
3  David   28      Houston   55000

Rows from index 'b' to 'd' (inclusive) in df_custom_index:
      Name  Age         City  Salary
b      Bob   30  Los Angeles   60000
c  Charlie   35      Chicago   75000
d    David   28      Houston   55000
```

.iloc: integer-location based indexing

.iloc selects rows by integer position (starting at 0), like indexing a NumPy array.

```python
print("\nSelecting rows using .iloc (integer-location based):")

# Select a single row (by position)
print("Row at position 0:")
print(df.iloc[0])

# Select several rows (by a list of positions)
print("\nRows at positions 1, 3:")
print(df.iloc[[1, 3]])

# Slice by position range (end position excluded)
print("\nRows from position 1 to 3 (exclusive):")
print(df.iloc[1:3])
```

Output:

```
Selecting rows using .iloc (integer-location based):
Row at position 0:
Name         Alice
Age             25
City      New York
Salary       50000
Name: 0, dtype: object

Rows at positions 1, 3:
    Name  Age         City  Salary
1    Bob   30  Los Angeles   60000
3  David   28      Houston   55000

Rows from position 1 to 3 (exclusive):
      Name  Age         City  Salary
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   75000
```

Combined Selection: Specific Rows and Columns

.loc and .iloc can also select rows and columns at the same time:

```python
print("\nCombined row and column selection:")

# .loc: the 'Name' and 'Salary' columns of the rows labeled 0 and 2
print("df.loc[[0, 2], ['Name', 'Salary']]:")
print(df.loc[[0, 2], ['Name', 'Salary']])

# .iloc: the columns at positions 0 and 3 of the rows at positions 1 and 3
print("\ndf.iloc[[1, 3], [0, 3]]:")
print(df.iloc[[1, 3], [0, 3]])

# .loc: the 'City' column of the rows labeled 'b' through 'd' (in df_custom_index)
print("\ndf_custom_index.loc['b':'d', 'City']:")
print(df_custom_index.loc['b':'d', 'City'])
```

Output:

```
Combined row and column selection:
df.loc[[0, 2], ['Name', 'Salary']]:
      Name  Salary
0    Alice   50000
2  Charlie   75000

df.iloc[[1, 3], [0, 3]]:
    Name  Salary
1    Bob   60000
3  David   55000

df_custom_index.loc['b':'d', 'City']:
b    Los Angeles
c        Chicago
d        Houston
Name: City, dtype: object
```
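
The row part of `.loc` also accepts a boolean mask, which combines a condition on the rows with a choice of columns in a single step. A small sketch on a reduced copy of the example data; the 1000 raise is an arbitrary value chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 28, 32],
    "Salary": [50000, 60000, 75000, 55000, 70000],
})

# Rows: a boolean mask. Columns: a list of labels.
over_30 = df.loc[df["Age"] > 30, ["Name", "Salary"]]
print(over_30)

# The same indexer works on the left-hand side of an assignment,
# which updates only the matching rows of the original DataFrame.
df.loc[df["Age"] > 30, "Salary"] += 1000
print(df)
```

Assigning through a single `.loc` call, rather than through chained brackets such as `df[mask]["Salary"]`, ensures the update is applied to `df` itself.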

5. Loading and Saving Data

Reading data from a file is the first step of most analyses. Pandas provides a family of read_* functions for different file formats; the most commonly used are read_csv and read_excel.

Reading a CSV File

pd.read_csv() is the main function for reading CSV files. Its many parameters control how the file is read: the delimiter, whether there is a header row, which column to use as the index, how missing values are recognized, and so on.

```python
# Suppose there is a file named 'data.csv'
# with the following contents:
# Name,Age,City,Salary
# Alice,25,New York,50000
# Bob,30,Los Angeles,60000
# Charlie,35,Chicago,75000
# David,28,Houston,55000
# Eva,32,Phoenix,70000

# Read the CSV file
try:
    df_csv = pd.read_csv('data.csv')
    print("\nDataFrame loaded from CSV:")
    print(df_csv)

    # Use one column as the index
    df_csv_indexed = pd.read_csv('data.csv', index_col='Name')
    print("\nDataFrame loaded from CSV with 'Name' as index:")
    print(df_csv_indexed)

except FileNotFoundError:
    print("\nError: data.csv not found; creating a sample file.")
    # Create a sample data.csv file for the demonstration
    dummy_data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 28, 32],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
        'Salary': [50000, 60000, 75000, 55000, 70000]
    }
    dummy_df = pd.DataFrame(dummy_data)
    dummy_df.to_csv('data.csv', index=False)
    print("\nCreated dummy data.csv. Please re-run the script.")
```
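
A few of the read_csv parameters mentioned above can be tried without touching the file system by reading from an in-memory buffer. This is only a sketch: the semicolon delimiter and the "?" missing-value marker are assumptions made up for the example, not properties of data.csv.

```python
import io
import pandas as pd

# Illustrative in-memory text standing in for a file on disk
raw = io.StringIO(
    "Name;Age;City;Salary\n"
    "Alice;25;New York;50000\n"
    "Bob;?;Los Angeles;60000\n"
    "Charlie;35;Chicago;75000\n"
)

df_opts = pd.read_csv(
    raw,
    sep=";",                            # field delimiter (the default is ",")
    index_col="Name",                   # use the Name column as the row index
    usecols=["Name", "Age", "Salary"],  # load only these columns
    na_values=["?"],                    # also treat "?" as a missing value
)
print(df_opts)
```

Because Bob's age is read as missing, the Age column comes back as floating point rather than integer.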

Saving a DataFrame to a File

The to_* family of methods writes a DataFrame to various file formats. to_csv() is the one used most often.

```python
# Save the DataFrame as a new CSV file
# index=False keeps the row index out of the file
df.to_csv('output_data.csv', index=False)
print("\nDataFrame saved to output_data.csv")

# Save the DataFrame as an Excel file (writing .xlsx requires the openpyxl package)
# try:
#     df.to_excel('output_data.xlsx', index=False)
#     print("\nDataFrame saved to output_data.xlsx")
# except ImportError:
#     print("\nTo save to Excel, install openpyxl: pip install openpyxl")
```

6. Data Overview and Basic Information

After loading data, you usually want a quick look at it: the first few rows, the columns, the data types, and some descriptive statistics.

head() and tail(): show the first or last rows of a DataFrame (5 rows by default).

```python
print("\nFirst 3 rows of the DataFrame:")
print(df.head(3))

print("\nLast 2 rows of the DataFrame:")
print(df.tail(2))
```

info(): prints a concise summary of the DataFrame: the index type, the column names, the number of non-null values and the data type of each column, and the memory usage.

```python
print("\nDataFrame info:")
df.info()
```

Output:

```
DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Age     5 non-null      int64
 2   City    5 non-null      object
 3   Salary  5 non-null      int64
dtypes: int64(2), object(2)
memory usage: 288.0+ bytes
```

describe(): generates descriptive statistics: count, mean, standard deviation, minimum, maximum, and quartiles. By default it covers only the numeric columns.

```python
print("\nDataFrame descriptive statistics:")
print(df.describe())
```

Output:

```
DataFrame descriptive statistics:
             Age        Salary
count   5.000000      5.000000
mean   30.000000  62000.000000
std     3.807887  10368.220677
min    25.000000  50000.000000
25%    28.000000  55000.000000
50%    30.000000  60000.000000
75%    32.000000  70000.000000
max    35.000000  75000.000000
```
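
Text columns can be summarized too. Calling describe() on a selection that contains only non-numeric columns reports count, unique, top (the most frequent value), and freq instead of mean and standard deviation. A sketch with made-up city values chosen so that one city clearly dominates:

```python
import pandas as pd

df_text = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "City": ["New York", "Chicago", "Chicago", "Houston", "Chicago"],
    "Age": [25, 30, 35, 28, 32],
})

# Describe only the text columns: count / unique / top / freq
text_summary = df_text[["Name", "City"]].describe()
print(text_summary)

# value_counts() gives the full frequency table for one column
city_counts = df_text["City"].value_counts()
print(city_counts)
```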

7. Data Cleaning and Preprocessing Basics

Real-world data is rarely clean: it may contain missing values, duplicates, wrong data types, and more. Pandas provides tools for each of these problems.

Handling Missing Values

Missing values are usually represented as NaN (Not a Number).

isnull() and notnull(): return a boolean DataFrame indicating whether each position is missing.

```python
import numpy as np

df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})
print("\nDataFrame with missing values:")
print(df_missing)

print("\nChecking for null values:")
print(df_missing.isnull())

print("\nChecking for non-null values:")
print(df_missing.notnull())
```

Output:

```
DataFrame with missing values:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0  12

Checking for null values:
       A      B      C
0  False  False  False
1  False   True  False
2   True   True  False
3  False  False  False

Checking for non-null values:
       A      B     C
0   True   True  True
1   True  False  True
2  False  False  True
3   True   True  True
```

dropna(): drops the rows or columns that contain missing values.

```python
print("\nDropping rows with any null values:")
print(df_missing.dropna())  # axis=0 is the default and drops rows

print("\nDropping columns with any null values:")
print(df_missing.dropna(axis=1))  # axis=1 drops columns

print("\nDropping rows only if ALL values are null:")
print(df_missing.dropna(how='all'))  # how='all' drops a row/column only when every value in it is null
```

Output:

```
Dropping rows with any null values:
     A    B   C
0  1.0  5.0   9
3  4.0  8.0  12

Dropping columns with any null values:
    C
0   9
1  10
2  11
3  12

Dropping rows only if ALL values are null:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0  12
```

fillna(): fills missing values with a fixed value or with a statistic such as the column mean or median. To propagate the previous valid value (forward fill) or the next valid value (backward fill), use ffill() and bfill(); the older fillna(method='ffill') spelling is deprecated in recent pandas versions.

```python
print("\nFilling null values with 0:")
print(df_missing.fillna(0))

print("\nFilling null values with mean of the column 'A':")
print(df_missing['A'].fillna(df_missing['A'].mean()))

print("\nFilling null values using forward fill (ffill):")
print(df_missing.ffill())  # replaces the deprecated fillna(method='ffill')
```

Output:

```
Filling null values with 0:
     A    B   C
0  1.0  5.0   9
1  2.0  0.0  10
2  0.0  0.0  11
3  4.0  8.0  12

Filling null values with mean of the column 'A':
0    1.000000
1    2.000000
2    2.333333
3    4.000000
Name: A, dtype: float64

Filling null values using forward fill (ffill):
     A    B   C
0  1.0  5.0   9
1  2.0  5.0  10
2  2.0  5.0  11
3  4.0  8.0  12
```
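
Different columns often call for different fill strategies. As a sketch of how that can look, fillna() accepts a dictionary that maps column names to fill values, or a Series of per-column statistics; the choice of the median for column A and 0 for column B is arbitrary here:

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

# A dict maps each column to its own fill value
filled = df_missing.fillna({"A": df_missing["A"].median(), "B": 0})
print(filled)

# A Series of column means fills every column with its own mean
filled_mean = df_missing.fillna(df_missing.mean())
print(filled_mean)
```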

Handling Duplicates

duplicated(): returns a boolean Series indicating whether each row duplicates an earlier row.

```python
df_duplicates = pd.DataFrame({
    'col1': ['A', 'B', 'A', 'C', 'B'],
    'col2': [1, 2, 1, 3, 2]
})
print("\nDataFrame with duplicates:")
print(df_duplicates)

print("\nChecking for duplicated rows:")
print(df_duplicates.duplicated())  # keep='first' is the default: every occurrence after the first is marked
```

Output:

```
DataFrame with duplicates:
  col1  col2
0    A     1
1    B     2
2    A     1
3    C     3
4    B     2

Checking for duplicated rows:
0    False
1    False
2     True
3    False
4     True
dtype: bool
```

drop_duplicates(): removes duplicate rows.

```python
print("\nDropping duplicated rows:")
print(df_duplicates.drop_duplicates())

# Duplicates can also be judged on specific columns only
print("\nDropping duplicated rows based on 'col1':")
print(df_duplicates.drop_duplicates(subset=['col1']))  # drops rows whose col1 value repeats (keeps the first occurrence)
```

Output:

```
Dropping duplicated rows:
  col1  col2
0    A     1
1    B     2
3    C     3

Dropping duplicated rows based on 'col1':
  col1  col2
0    A     1
1    B     2
3    C     3
```
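
The keep parameter controls which occurrence survives. The following sketch reuses the same small table to contrast keep='last' with keep=False:

```python
import pandas as pd

df_duplicates = pd.DataFrame({
    "col1": ["A", "B", "A", "C", "B"],
    "col2": [1, 2, 1, 3, 2],
})

# keep="last" keeps the final occurrence of each duplicated row
last_kept = df_duplicates.drop_duplicates(keep="last")
print(last_kept)

# keep=False drops every row that has a duplicate anywhere in the table
unique_only = df_duplicates.drop_duplicates(keep=False)
print(unique_only)
```

Note that the surviving rows keep their original index labels; call reset_index(drop=True) afterwards if a fresh 0 to n-1 index is needed.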

Type Conversion

The astype() method converts the data type of a column.

```python
print("\nOriginal data types:")
print(df.dtypes)

# Convert the 'Salary' column to float
df['Salary'] = df['Salary'].astype(float)
print("\nData types after converting 'Salary' to float:")
print(df.dtypes)
print(df)  # the converted values are displayed with a decimal point
```

Output:

```
Original data types:
Name      object
Age        int64
City      object
Salary     int64
dtype: object

Data types after converting 'Salary' to float:
Name       object
Age         int64
City       object
Salary    float64
dtype: object
      Name  Age         City   Salary
0    Alice   25     New York  50000.0
1      Bob   30  Los Angeles  60000.0
2  Charlie   35      Chicago  75000.0
3    David   28      Houston  55000.0
4      Eva   32      Phoenix  70000.0
```
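
astype() raises an error when a value cannot be converted, which is common with numbers stored as text. One way to handle that, sketched here with a made-up column containing the placeholder string "unknown", is pd.to_numeric with errors="coerce":

```python
import pandas as pd

raw_salary = pd.Series(["50000", "60000", "unknown", "55000"])

# raw_salary.astype(float) would raise ValueError on "unknown";
# errors="coerce" turns unparseable entries into NaN instead.
salary = pd.to_numeric(raw_salary, errors="coerce")
print(salary)
```

The resulting NaN can then be handled with the missing-value tools from the previous section.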

8. Selecting and Filtering Data (Boolean Indexing)

Besides selecting by label or position with .loc and .iloc, you can filter data with a boolean array. This is called boolean indexing or boolean masking.

```python
print("\nOriginal DataFrame:")
print(df)

# Keep people older than 30
age_filter = df['Age'] > 30
print("\nBoolean mask for Age > 30:")
print(age_filter)

print("\nRows where Age > 30:")
print(df[age_filter])
# Or written directly: print(df[df['Age'] > 30])

# Keep people with a salary above 60000 who live in New York or Chicago
condition1 = df['Salary'] > 60000
condition2 = df['City'].isin(['New York', 'Chicago'])  # isin() checks membership in a list
print("\nRows where Salary > 60000 AND City is New York or Chicago:")
print(df[condition1 & condition2])  # & means AND

# Keep people younger than 30 or with a salary above 70000
condition3 = df['Age'] < 30
condition4 = df['Salary'] > 70000
print("\nRows where Age < 30 OR Salary > 70000:")
print(df[condition3 | condition4])  # | means OR

# Keep people who are not in Los Angeles
print("\nRows where City is NOT Los Angeles:")
print(df[~df['City'].isin(['Los Angeles'])])  # ~ means NOT
```

Output:

```
Original DataFrame:
      Name  Age         City   Salary
0    Alice   25     New York  50000.0
1      Bob   30  Los Angeles  60000.0
2  Charlie   35      Chicago  75000.0
3    David   28      Houston  55000.0
4      Eva   32      Phoenix  70000.0

Boolean mask for Age > 30:
0    False
1    False
2     True
3    False
4     True
Name: Age, dtype: bool

Rows where Age > 30:
      Name  Age     City   Salary
2  Charlie   35  Chicago  75000.0
4      Eva   32  Phoenix  70000.0

Rows where Salary > 60000 AND City is New York or Chicago:
      Name  Age     City   Salary
2  Charlie   35  Chicago  75000.0

Rows where Age < 30 OR Salary > 70000:
      Name  Age      City   Salary
0    Alice   25  New York  50000.0
2  Charlie   35   Chicago  75000.0
3    David   28   Houston  55000.0

Rows where City is NOT Los Angeles:
      Name  Age      City   Salary
0    Alice   25  New York  50000.0
2  Charlie   35   Chicago  75000.0
3    David   28   Houston  55000.0
4      Eva   32   Phoenix  70000.0
```
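
When conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because & and | bind more tightly than > and <. The sketch below shows that form next to DataFrame.query, which expresses the same filter as a string; the thresholds are arbitrary:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 28, 32],
    "Salary": [50000, 60000, 75000, 55000, 70000],
})

# Inline conditions: parentheses around each comparison are required
mask_result = df[(df["Age"] > 28) & (df["Salary"] < 75000)]
print(mask_result)

# The same filter as a query string, using column names directly
query_result = df.query("Age > 28 and Salary < 75000")
print(query_result)
```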

9. Grouping and Aggregation (Group By)

groupby() is one of the core Pandas tools for aggregating and transforming data. It splits a DataFrame into groups by one or more columns and then applies an operation (aggregation, transformation, or filtering) to each group independently.

The basic workflow is Split-Apply-Combine:
1. Split: divide the data into groups according to the grouping key.
2. Apply: apply a function to each group independently (mean, sum, count, and so on).
3. Combine: merge the per-group results into a single Pandas object (a Series or a DataFrame).

```python
# Add a department column to group by
df['Department'] = ['Sales', 'IT', 'Sales', 'IT', 'Sales']
print("\nDataFrame with Department column:")
print(df)

# Group by department and compute the average salary of each department
dept_salary = df.groupby('Department')['Salary'].mean()
print("\nAverage Salary by Department:")
print(dept_salary)

# Group by department and count the people in each department
dept_count = df.groupby('Department')['Name'].count()
print("\nEmployee Count by Department:")
print(dept_count)

# Group by department and city, then compute the average age
dept_city_age = df.groupby(['Department', 'City'])['Age'].mean()
print("\nAverage Age by Department and City:")
print(dept_city_age)

# Apply several aggregation functions to each group
dept_stats = df.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count'])
print("\nSalary Statistics by Department:")
print(dept_stats)
```

Output:

```
DataFrame with Department column:
      Name  Age         City   Salary Department
0    Alice   25     New York  50000.0      Sales
1      Bob   30  Los Angeles  60000.0         IT
2  Charlie   35      Chicago  75000.0      Sales
3    David   28      Houston  55000.0         IT
4      Eva   32      Phoenix  70000.0      Sales

Average Salary by Department:
Department
IT       57500.0
Sales    65000.0
Name: Salary, dtype: float64

Employee Count by Department:
Department
IT       2
Sales    3
Name: Name, dtype: int64

Average Age by Department and City:
Department  City
IT          Houston        28.0
            Los Angeles    30.0
Sales       Chicago        35.0
            New York       25.0
            Phoenix        32.0
Name: Age, dtype: float64

Salary Statistics by Department:
               mean      min      max  count
Department
IT          57500.0  55000.0  60000.0      2
Sales       65000.0  50000.0  75000.0      3
```
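
Two variations on the apply step are worth sketching. Named aggregation lets you aggregate different columns with different functions and name the result columns, and transform() returns one value per original row instead of one per group. The output column names `avg_salary`, `headcount`, and `Dept_Avg` are invented for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Department": ["Sales", "IT", "Sales", "IT", "Sales"],
    "Salary": [50000, 60000, 75000, 55000, 70000],
})

# Named aggregation: result_column=(source_column, function)
summary = df.groupby("Department").agg(
    avg_salary=("Salary", "mean"),
    headcount=("Name", "count"),
)
print(summary)

# transform broadcasts the group result back onto the original rows
df["Dept_Avg"] = df.groupby("Department")["Salary"].transform("mean")
print(df)
```

The transform form is handy for comparisons within a group, such as each person's salary relative to the department average.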

10. Merging and Concatenating Data (Merge and Concatenate)

Real analyses often need to combine data from different sources or with different structures. Pandas provides two main functions for this: merge() and concat().

concat(): joins several Pandas objects along an axis (rows or columns).

```python
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

print("df1:")
print(df1)
print("\ndf2:")
print(df2)

# Concatenate by rows (axis=0 is the default)
result_concat_rows = pd.concat([df1, df2])
print("\nConcatenated by rows:")
print(result_concat_rows)

# Concatenate by columns (axis=1)
result_concat_cols = pd.concat([df1, df2], axis=1)
print("\nConcatenated by columns:")
print(result_concat_cols)
```

Output:

```
df1:
    A   B
0  A0  B0
1  A1  B1

df2:
    A   B
0  A2  B2
1  A3  B3

Concatenated by rows:
    A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3

Concatenated by columns:
    A   B   A   B
0  A0  B0  A2  B2
1  A1  B1  A3  B3
```
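
Row-wise concatenation keeps each input's original index, which is why the labels 0 and 1 appear twice in the result. If a fresh index is preferred, ignore_index=True renumbers the rows, as in this short sketch:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

# ignore_index=True discards the input indexes and numbers rows 0..n-1
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```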

merge(): like a database JOIN, it combines two DataFrames on one or more key columns.

```python
df_left = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})

df_right = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K4'],
    'C': ['C0', 'C1', 'C2', 'C4'],
    'D': ['D0', 'D1', 'D2', 'D4']
})

print("\ndf_left:")
print(df_left)
print("\ndf_right:")
print(df_right)

# Inner join (how='inner' is the default): keep only keys present in both DataFrames
result_merge_inner = pd.merge(df_left, df_right, on='key')
print("\nInner merge on 'key':")
print(result_merge_inner)

# Outer join (how='outer'): keep all rows and fill the gaps with NaN
result_merge_outer = pd.merge(df_left, df_right, on='key', how='outer')
print("\nOuter merge on 'key':")
print(result_merge_outer)

# Left join (how='left'): keep all rows of the left DataFrame; unmatched right-side values become NaN
result_merge_left = pd.merge(df_left, df_right, on='key', how='left')
print("\nLeft merge on 'key':")
print(result_merge_left)

# Right join (how='right'): keep all rows of the right DataFrame; unmatched left-side values become NaN
result_merge_right = pd.merge(df_left, df_right, on='key', how='right')
print("\nRight merge on 'key':")
print(result_merge_right)
```

Output:

```
df_left:
  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3

df_right:
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K4  C4  D4

Inner merge on 'key':
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2

Outer merge on 'key':
  key    A    B    C    D
0  K0   A0   B0   C0   D0
1  K1   A1   B1   C1   D1
2  K2   A2   B2   C2   D2
3  K3   A3   B3  NaN  NaN
4  K4  NaN  NaN   C4   D4

Left merge on 'key':
  key   A   B    C    D
0  K0  A0  B0   C0   D0
1  K1  A1  B1   C1   D1
2  K2  A2  B2   C2   D2
3  K3  A3  B3  NaN  NaN

Right merge on 'key':
  key    A    B   C   D
0  K0   A0   B0  C0  D0
1  K1   A1   B1  C1  D1
2  K2   A2   B2  C2  D2
3  K4  NaN  NaN  C4  D4
```
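
To see where each row of a merge came from, pass indicator=True. The sketch below, on trimmed-down versions of the two tables, adds a `_merge` column whose values are "both", "left_only", or "right_only":

```python
import pandas as pd

df_left = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"], "A": ["A0", "A1", "A2", "A3"]})
df_right = pd.DataFrame({"key": ["K0", "K1", "K2", "K4"], "C": ["C0", "C1", "C2", "C4"]})

# indicator=True records the origin of every row in a "_merge" column
merged = pd.merge(df_left, df_right, on="key", how="outer", indicator=True)
print(merged)

# Map each key to its origin for a quick check
origin = dict(zip(merged["key"], merged["_merge"].astype(str)))
print(origin)
```

This is a quick way to audit a join, for example to list the keys that failed to match before deciding between an inner and an outer merge.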

11. Applying Functions

The apply() method applies a custom function or a NumPy universal function to a Series, or to the rows or columns of a DataFrame.

Applied to a Series:

```python
s = pd.Series([1, 2, 3, 4, 5])
print("\nOriginal Series:")
print(s)

# Apply a function to every element of the Series
print("\nApplying lambda x: x*x to Series:")
print(s.apply(lambda x: x * x))

# Apply a NumPy function
print("\nApplying np.sqrt to Series:")
print(s.apply(np.sqrt))
```

Output:

```
Original Series:
0    1
1    2
2    3
3    4
4    5
dtype: int64

Applying lambda x: x*x to Series:
0     1
1     4
2     9
3    16
4    25
dtype: int64

Applying np.sqrt to Series:
0    1.000000
1    1.414214
2    1.732051
3    2.000000
4    2.236068
dtype: float64
```

Applied to a DataFrame: by default the function is applied to each column (axis=0).

```python
print("\nOriginal DataFrame:")
print(df[['Age', 'Salary']])

# Mean of the 'Age' and 'Salary' columns (axis=0 applies the function to each column)
print("\nApplying mean function to each column:")
print(df[['Age', 'Salary']].apply(np.mean))

# Sum of age and salary for every row (axis=1 applies the function to each row)
print("\nApplying sum function to each row:")
print(df[['Age', 'Salary']].apply(np.sum, axis=1))

# Apply a custom function to one column of the DataFrame
def categorize_age(age):
    if age < 30:
        return 'Young'
    elif age < 40:
        return 'Middle-aged'
    else:
        return 'Senior'

df['Age_Category'] = df['Age'].apply(categorize_age)
print("\nDataFrame with new 'Age_Category' column:")
print(df)
```

Output:

```
Original DataFrame:
   Age   Salary
0   25  50000.0
1   30  60000.0
2   35  75000.0
3   28  55000.0
4   32  70000.0

Applying mean function to each column:
Age          30.0
Salary    62000.0
dtype: float64

Applying sum function to each row:
0    50025.0
1    60030.0
2    75035.0
3    55028.0
4    70032.0
dtype: float64

DataFrame with new 'Age_Category' column:
      Name  Age         City   Salary Department Age_Category
0    Alice   25     New York  50000.0      Sales        Young
1      Bob   30  Los Angeles  60000.0         IT  Middle-aged
2  Charlie   35      Chicago  75000.0      Sales  Middle-aged
3    David   28      Houston  55000.0         IT        Young
4      Eva   32      Phoenix  70000.0      Sales  Middle-aged
```
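
apply() with a Python function runs once per element, so for simple binning a vectorized alternative may be preferable on large data. As one possible sketch, pd.cut reproduces the same three age categories; it assumes ages are non-negative, and the sample values (including 45) are made up to exercise every bin:

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 28, 32, 45])

# right=False makes each bin include its left edge and exclude its right edge:
# [0, 30) -> Young, [30, 40) -> Middle-aged, [40, inf) -> Senior
categories = pd.cut(
    ages,
    bins=[0, 30, 40, float("inf")],
    labels=["Young", "Middle-aged", "Senior"],
    right=False,
)
print(categories)
```

The result is a categorical Series, which also stores the ordering of the labels.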

12. Time Series Basics (Brief)

Pandas is particularly strong at handling time series data. The core pieces are the DatetimeIndex and a set of functions for working with dates, times, and time intervals.

```python
# Create a date range to use as an index
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
print("\nDate Range:")
print(date_rng)

# Create a Series indexed by date
ts = pd.Series(np.random.randn(len(date_rng)), index=date_rng)
print("\nTime Series:")
print(ts)

# Date-based selection and slicing
print("\nSelecting data for a specific date:")
print(ts['2023-01-05'])

print("\nSlicing data for a date range:")
print(ts['2023-01-03':'2023-01-07'])
```

Output (the values are random, so yours will differ):

```
Date Range:
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
               '2023-01-09', '2023-01-10'],
              dtype='datetime64[ns]', freq='D')

Time Series:
2023-01-01   -0.476284
2023-01-02   -1.229601
2023-01-03   -0.769578
2023-01-04    0.653034
2023-01-05   -0.899684
2023-01-06   -1.076619
2023-01-07    1.754508
2023-01-08   -1.483182
2023-01-09   -1.066064
2023-01-10    0.173194
Freq: D, dtype: float64

Selecting data for a specific date:
-0.8996843152596765

Slicing data for a date range:
2023-01-03   -0.769578
2023-01-04    0.653034
2023-01-05   -0.899684
2023-01-06   -1.076619
2023-01-07    1.754508
Freq: D, dtype: float64
```
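
Frequency conversion and moving-window statistics, both mentioned among the Pandas features, build on the same date index. A minimal sketch, using the deterministic values 1 to 10 instead of random numbers so the results are easy to check by hand; the 5-day bin and 3-day window are arbitrary choices:

```python
import pandas as pd

date_rng = pd.date_range(start="2023-01-01", end="2023-01-10", freq="D")
ts_demo = pd.Series(range(1, 11), index=date_rng)  # values 1..10, one per day

# Frequency conversion: group the daily values into 5-day bins and sum each bin
five_day = ts_demo.resample("5D").sum()
print(five_day)

# Moving-window statistic: 3-day rolling mean
# (the first two entries are NaN because the window is not yet full)
rolling_mean = ts_demo.rolling(window=3).mean()
print(rolling_mean)
```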

13. Basic Visualization (Matplotlib Integration)

Pandas objects integrate with Matplotlib, so you can call .plot() on them directly for basic charts.

```python
# Requires Matplotlib:
# pip install matplotlib

import matplotlib.pyplot as plt

# Line plot of a Series
ts.plot(title="Random Time Series Data")
plt.xlabel("Date")
plt.ylabel("Value")
plt.show()

# Histogram of one DataFrame column
df['Salary'].plot(kind='hist', title="Salary Distribution")
plt.xlabel("Salary")
plt.ylabel("Frequency")
plt.show()

# Line plot of several DataFrame columns
df[['Age', 'Salary']].plot(title="Age and Salary")
plt.xlabel("Index")
plt.ylabel("Value")
plt.show()
```

(The figures are not shown here, but running the code produces them.)

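Summary

Pandas is an indispensable tool for data analysis in Python. This article covered creating, indexing, and selecting from its two core data structures, Series and DataFrame, and walked through loading and saving data, getting an overview of a dataset, basic cleaning (missing values and duplicates), type conversion, filtering with boolean indexing, grouping and aggregation, and merging and concatenation.

With these fundamentals you have taken the first step into data analysis with Pandas. Data analysis is a hands-on discipline, so alongside the concepts, practice on different datasets and gradually explore the more advanced features, such as more complex transformations, window functions, and categorical data.

The official Pandas documentation is a valuable resource for learning and reference. As you gain experience, you will find that Pandas greatly improves the efficiency of your data processing and analysis. Good luck on your data analysis journey!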
