# CSV Processor Skill
Parse, transform, and analyze CSV files with advanced data manipulation capabilities.
## Instructions
You are a CSV processing expert. When invoked, handle the following (a minimal end-to-end sketch follows this list):

1. **Parse CSV Files**:
   - Auto-detect delimiters (comma, tab, semicolon, pipe)
   - Handle different encodings (UTF-8, Latin-1, Windows-1252)
   - Process quoted fields and escaped characters
   - Handle multi-line fields correctly
   - Detect and use header rows
2. **Transform Data**:
   - Filter rows based on conditions
   - Select specific columns
   - Sort and group data
   - Merge multiple CSV files
   - Split large files into smaller chunks
   - Pivot and unpivot data
3. **Clean Data**:
   - Remove duplicates
   - Handle missing values
   - Trim whitespace
   - Normalize data formats
   - Fix encoding issues
   - Validate data types
4. **Analyze Data**:
   - Generate statistics (sum, average, min, max, count)
   - Identify data quality issues
   - Detect outliers
   - Profile column data types
   - Calculate distributions
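Taken together, a minimal end-to-end pass might look like the sketch below (pandas assumed; the file name and column names are illustrative):

```python
import pandas as pd

# Parse: read with explicit encoding and common missing-value markers
df = pd.read_csv('data.csv', encoding='utf-8', na_values=['NA', 'N/A', 'null', ''])

# Clean: trim whitespace, drop duplicate emails, fill gaps
df['name'] = df['name'].str.strip()
df = df.drop_duplicates(subset=['email'])
df = df.fillna({'age': 0, 'country': 'Unknown'})

# Transform: filter and select
adults = df[df['age'] >= 18][['name', 'email', 'age']]

# Analyze: summary statistics
print(adults.describe())
```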
## Usage Examples
```
@csv-processor data.csv
@csv-processor --filter "age > 30"
@csv-processor --select "name,email,age"
@csv-processor --merge file1.csv file2.csv
@csv-processor --stats
@csv-processor --clean --remove-duplicates
```
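As one hedged sketch of how a subset of these flags could map onto pandas operations (this dispatcher is an illustration only, not the skill's actual implementation):

```python
import argparse

import pandas as pd

# Hypothetical flag-to-pandas mapping; the real skill may differ.
parser = argparse.ArgumentParser(prog='csv-processor')
parser.add_argument('file')
parser.add_argument('--filter')    # e.g. "age > 30" (pandas query syntax)
parser.add_argument('--select')    # e.g. "name,email,age"
parser.add_argument('--stats', action='store_true')
parser.add_argument('--remove-duplicates', action='store_true')
args = parser.parse_args()

df = pd.read_csv(args.file)
if args.filter:
    df = df.query(args.filter)
if args.select:
    df = df[args.select.split(',')]
if args.remove_duplicates:
    df = df.drop_duplicates()
if args.stats:
    print(df.describe(include='all'))
```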
## Basic CSV Operations
### Reading CSV Files
#### Python (pandas)
```python
import pandas as pd

# Basic read
df = pd.read_csv('data.csv')

# Custom delimiter
df = pd.read_csv('data.tsv', delimiter='\t')

# Specify encoding
df = pd.read_csv('data.csv', encoding='latin-1')

# Skip rows
df = pd.read_csv('data.csv', skiprows=2)

# Select specific columns
df = pd.read_csv('data.csv', usecols=['name', 'email', 'age'])

# Parse dates
df = pd.read_csv('data.csv', parse_dates=['created_at', 'updated_at'])

# Handle missing values
df = pd.read_csv('data.csv', na_values=['NA', 'N/A', 'null', ''])

# Specify data types
df = pd.read_csv('data.csv', dtype={
    'user_id': int,
    'age': int,
    'score': float,
    'active': bool
})
```

#### JavaScript (csv-parser)
```javascript
const fs = require('fs');
const csv = require('csv-parser');

// Basic parsing
const results = [];
fs.createReadStream('data.csv')
  .pipe(csv())
  .on('data', (row) => {
    results.push(row);
  })
  .on('end', () => {
    console.log(`Processed ${results.length} rows`);
  });

// With custom options
const Papa = require('papaparse');
Papa.parse(fs.createReadStream('data.csv'), {
  header: true,
  delimiter: ',',
  skipEmptyLines: true,
  transformHeader: (header) => header.trim().toLowerCase(),
  complete: (results) => {
    console.log('Parsed:', results.data);
  }
});
```

#### Python (csv module)
```python
import csv

# Basic reading
with open('data.csv', 'r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row['name'], row['age'])

# Custom delimiter
with open('data.csv', 'r') as file:
    reader = csv.reader(file, delimiter='\t')
    for row in reader:
        print(row)

# Handle different dialects
with open('data.csv', 'r') as file:
    dialect = csv.Sniffer().sniff(file.read(1024))
    file.seek(0)
    reader = csv.reader(file, dialect)
    for row in reader:
        print(row)
```

### Writing CSV Files
#### Python (pandas)
```python
import csv  # needed for csv.QUOTE_ALL below

# Basic write
df.to_csv('output.csv', index=False)

# Custom delimiter
df.to_csv('output.tsv', sep='\t', index=False)

# Specify encoding (utf-8-sig adds a BOM for Excel compatibility)
df.to_csv('output.csv', encoding='utf-8-sig', index=False)

# Write only specific columns
df[['name', 'email']].to_csv('output.csv', index=False)

# Append to existing file
df.to_csv('output.csv', mode='a', header=False, index=False)

# Quote all fields
df.to_csv('output.csv', quoting=csv.QUOTE_ALL, index=False)
```

#### JavaScript (csv-writer)
```javascript
const createCsvWriter = require('csv-writer').createObjectCsvWriter;

const csvWriter = createCsvWriter({
  path: 'output.csv',
  header: [
    {id: 'name', title: 'Name'},
    {id: 'email', title: 'Email'},
    {id: 'age', title: 'Age'}
  ]
});

const records = [
  {name: 'John Doe', email: 'john@example.com', age: 30},
  {name: 'Jane Smith', email: 'jane@example.com', age: 25}
];

csvWriter.writeRecords(records)
  .then(() => console.log('CSV file written successfully'));
```

## Data Transformation Patterns
### Filtering Rows
#### Python (pandas)
```python
# Single condition
filtered = df[df['age'] > 30]

# Multiple conditions (AND)
filtered = df[(df['age'] > 30) & (df['country'] == 'USA')]

# Multiple conditions (OR)
filtered = df[(df['age'] < 18) | (df['age'] > 65)]

# String operations
filtered = df[df['email'].str.contains('@gmail.com')]
filtered = df[df['name'].str.startswith('John')]

# Is in list
filtered = df[df['country'].isin(['USA', 'Canada', 'Mexico'])]

# Not null values
filtered = df[df['email'].notna()]

# Complex conditions
filtered = df.query('age > 30 and country == "USA" and active == True')
```

#### JavaScript
```javascript
// Filter with arrow function
const filtered = data.filter(row => row.age > 30);

// Multiple conditions
const filtered = data.filter(row =>
  row.age > 30 && row.country === 'USA'
);

// String operations
const filtered = data.filter(row =>
  row.email.includes('@gmail.com')
);

// Complex filtering
const filtered = data.filter(row => {
  const age = parseInt(row.age);
  return age >= 18 && age <= 65 && row.active === 'true';
});
```

### Selecting Columns
#### Python (pandas)
```python
# Select single column
names = df['name']

# Select multiple columns
subset = df[['name', 'email', 'age']]

# Select by column type
numeric_cols = df.select_dtypes(include=['int64', 'float64'])
string_cols = df.select_dtypes(include=['object'])

# Select columns matching pattern
email_cols = df.filter(regex='.*email.*')

# Drop columns
df_without = df.drop(['temporary', 'unused'], axis=1)

# Rename columns
df_renamed = df.rename(columns={
    'old_name': 'new_name',
    'email_address': 'email'
})
```

#### JavaScript
```javascript
// Map to select columns
const subset = data.map(row => ({
  name: row.name,
  email: row.email,
  age: row.age
}));

// Destructuring
const subset = data.map(({name, email, age}) => ({name, email, age}));

// Dynamic column selection
const columns = ['name', 'email', 'age'];
const subset = data.map(row =>
  Object.fromEntries(
    columns.map(col => [col, row[col]])
  )
);
```

### Sorting Data
#### Python (pandas)
```python
# Sort by single column
sorted_df = df.sort_values('age')

# Sort descending
sorted_df = df.sort_values('age', ascending=False)

# Sort by multiple columns
sorted_df = df.sort_values(['country', 'age'], ascending=[True, False])

# Sort by index
sorted_df = df.sort_index()
```

#### JavaScript
```javascript
// Sort by single field
const sorted = data.sort((a, b) => a.age - b.age);

// Sort descending
const sorted = data.sort((a, b) => b.age - a.age);

// Sort by string
const sorted = data.sort((a, b) => a.name.localeCompare(b.name));

// Sort by multiple fields
const sorted = data.sort((a, b) => {
  if (a.country !== b.country) {
    return a.country.localeCompare(b.country);
  }
  return b.age - a.age;
});
```

### Grouping and Aggregation
#### Python (pandas)
```python
# Group by single column
grouped = df.groupby('country')

# Count by group
counts = df.groupby('country').size()

# Multiple aggregations
stats = df.groupby('country').agg({
    'age': ['mean', 'min', 'max'],
    'salary': ['sum', 'mean'],
    'user_id': 'count'
})

# Group by multiple columns
grouped = df.groupby(['country', 'city']).agg({
    'revenue': 'sum',
    'user_id': 'count'
})

# Custom aggregation
df.groupby('country').apply(lambda x: x['salary'].max() - x['salary'].min())

# Pivot table
pivot = df.pivot_table(
    values='revenue',
    index='country',
    columns='year',
    aggfunc='sum',
    fill_value=0
)
```

#### JavaScript (lodash)
```javascript
const _ = require('lodash');

// Group by field
const grouped = _.groupBy(data, 'country');

// Count by group
const counts = _.mapValues(
  _.groupBy(data, 'country'),
  group => group.length
);

// Sum by group
const sums = _.mapValues(
  _.groupBy(data, 'country'),
  group => _.sumBy(group, row => parseFloat(row.salary))
);

// Multiple aggregations
const stats = Object.entries(_.groupBy(data, 'country')).map(([country, rows]) => ({
  country,
  count: rows.length,
  avgAge: _.meanBy(rows, row => parseInt(row.age)),
  totalSalary: _.sumBy(rows, row => parseFloat(row.salary))
}));
```

### Merging CSV Files
#### Python (pandas)
```python
# Concatenate vertically (stack rows)
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
combined = pd.concat([df1, df2], ignore_index=True)

# Join (SQL-like merge)
users = pd.read_csv('users.csv')
orders = pd.read_csv('orders.csv')

# Inner join
merged = pd.merge(users, orders, on='user_id', how='inner')

# Left join
merged = pd.merge(users, orders, on='user_id', how='left')

# Join on differently named key columns
merged = pd.merge(
    users, orders,
    left_on='id',
    right_on='user_id',
    how='left'
)

# Merge with different column names
merged = pd.merge(
    users, orders,
    left_on='user_id',
    right_on='customer_id',
    how='inner'
)
```

#### JavaScript
```javascript
// Concatenate arrays
const file1 = parseCSV('file1.csv');
const file2 = parseCSV('file2.csv');
const combined = [...file1, ...file2];

// Join arrays (like SQL). Note: right.find() makes this O(n*m);
// build a Map keyed on rightKey for large inputs.
function leftJoin(left, right, leftKey, rightKey) {
  return left.map(leftRow => {
    const rightRow = right.find(r => r[rightKey] === leftRow[leftKey]);
    return {...leftRow, ...rightRow};
  });
}

const merged = leftJoin(users, orders, 'id', 'user_id');
```

## Data Cleaning Operations
### Remove Duplicates
#### Python (pandas)
```python
# Remove duplicate rows
df_unique = df.drop_duplicates()

# Based on specific columns
df_unique = df.drop_duplicates(subset=['email'])

# Keep first or last occurrence
df_unique = df.drop_duplicates(subset=['email'], keep='first')
df_unique = df.drop_duplicates(subset=['email'], keep='last')

# Identify duplicates
duplicates = df[df.duplicated()]
duplicate_emails = df[df.duplicated(subset=['email'])]
```

### Handle Missing Values
#### Python (pandas)
```python
# Check for missing values
missing_count = df.isnull().sum()
missing_percent = (df.isnull().sum() / len(df)) * 100

# Drop rows with any missing values
df_clean = df.dropna()

# Drop rows where specific column is missing
df_clean = df.dropna(subset=['email'])

# Drop columns with too many missing values (thresh must be an integer row count)
df_clean = df.dropna(axis=1, thresh=int(len(df) * 0.7))

# Fill missing values
df_filled = df.fillna(0)
df_filled = df.fillna({'age': 0, 'country': 'Unknown'})

# Forward fill (fillna(method='ffill') is deprecated in modern pandas)
df_filled = df.ffill()

# Fill with mean/median
df['age'] = df['age'].fillna(df['age'].mean())
df['age'] = df['age'].fillna(df['age'].median())

# Interpolate
df['value'] = df['value'].interpolate(method='linear')
```

#### JavaScript
```javascript
// Filter out rows with missing values
const cleaned = data.filter(row =>
  row.email && row.name && row.age
);

// Fill missing values
const filled = data.map(row => ({
  ...row,
  age: row.age || 0,
  country: row.country || 'Unknown'
}));
```

### Data Validation
#### Python (pandas)
```python
import re

# Validate email format
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
df['email_valid'] = df['email'].str.match(email_pattern)

# Validate age range
df['age_valid'] = df['age'].between(0, 120)

# Validate required fields
df['valid'] = df[['name', 'email', 'age']].notna().all(axis=1)

# Check data types
def validate_types(df):
    errors = []
    # Check numeric columns
    for col in ['age', 'salary', 'score']:
        if col in df.columns:
            if not pd.api.types.is_numeric_dtype(df[col]):
                errors.append(f"{col} should be numeric")
    # Check date columns
    for col in ['created_at', 'updated_at']:
        if col in df.columns:
            try:
                pd.to_datetime(df[col])
            except (ValueError, TypeError):
                errors.append(f"{col} has invalid dates")
    return errors

# Remove invalid rows
df_valid = df[df['email_valid'] & df['age_valid']]
```

### Data Normalization
#### Python (pandas)
```python
# Trim whitespace
df['name'] = df['name'].str.strip()
df['email'] = df['email'].str.strip()

# Convert to lowercase
df['email'] = df['email'].str.lower()

# Standardize phone numbers
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)

# Standardize dates
df['created_at'] = pd.to_datetime(df['created_at'])

# Standardize country names
country_mapping = {
    'USA': 'United States',
    'US': 'United States',
    'United States of America': 'United States',
    'UK': 'United Kingdom'
}
df['country'] = df['country'].replace(country_mapping)

# Convert data types
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['active'] = df['active'].astype(bool)
df['score'] = df['score'].astype(float)
```

## Data Analysis Operations
### Statistical Summary
#### Python (pandas)
```python
# Basic statistics
print(df.describe())

# Statistics for all columns (including non-numeric)
print(df.describe(include='all'))

# Specific statistics
print(f"Mean age: {df['age'].mean()}")
print(f"Median age: {df['age'].median()}")
print(f"Std dev: {df['age'].std()}")
print(f"Min: {df['age'].min()}")
print(f"Max: {df['age'].max()}")

# Count values
print(df['country'].value_counts())

# Percentage distribution
print(df['country'].value_counts(normalize=True) * 100)

# Cross-tabulation
cross_tab = pd.crosstab(df['country'], df['active'])

# Correlation matrix
correlation = df[['age', 'salary', 'score']].corr()
```

### Data Profiling
#### Python (pandas)
```python
def profile_dataframe(df):
    """Generate comprehensive data profile"""
    profile = {
        'shape': df.shape,
        'columns': list(df.columns),
        'dtypes': df.dtypes.to_dict(),
        'memory_usage': df.memory_usage(deep=True).sum() / 1024**2,  # MB
        'missing_values': df.isnull().sum().to_dict(),
        'missing_percent': (df.isnull().sum() / len(df) * 100).to_dict(),
        'duplicates': df.duplicated().sum(),
        'numeric_summary': df.describe().to_dict(),
        'unique_counts': df.nunique().to_dict()
    }
    # Column-specific analysis
    for col in df.columns:
        profile[f'{col}_sample'] = df[col].head(5).tolist()
        if df[col].dtype == 'object':
            profile[f'{col}_top_values'] = df[col].value_counts().head(10).to_dict()
        if pd.api.types.is_numeric_dtype(df[col]):
            profile[f'{col}_outliers'] = detect_outliers(df[col])
    return profile

def detect_outliers(series):
    """Detect outliers using IQR method"""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = series[(series < lower_bound) | (series > upper_bound)]
    return {
        'count': len(outliers),
        'percent': (len(outliers) / len(series)) * 100,
        'values': outliers.tolist()
    }
```
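A quick usage sketch, assuming `df` is an already-loaded DataFrame:

```python
profile = profile_dataframe(df)
print(f"Shape: {profile['shape']}, duplicate rows: {profile['duplicates']}")
print(profile['missing_percent'])
```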
### Generate Report
```python
from datetime import datetime

def generate_csv_report(df, filename='report.md'):
    """Generate comprehensive analysis report"""
    report = f"""# CSV Analysis Report

Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

## Dataset Overview

- Rows: {len(df):,}
- Columns: {len(df.columns)}
- Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB
- Duplicates: {df.duplicated().sum():,}

## Column Summary

| Column | Type | Non-Null | Unique | Missing % |
|---|---|---|---|---|
"""
    for col in df.columns:
        dtype = str(df[col].dtype)
        non_null = df[col].count()
        unique = df[col].nunique()
        missing_pct = (df[col].isnull().sum() / len(df)) * 100
        report += f"| {col} | {dtype} | {non_null:,} | {unique:,} | {missing_pct:.1f}% |\n"
    report += "\n## Numeric Columns Statistics\n\n"
    report += df.describe().to_markdown()
    report += "\n\n## Data Quality Issues\n\n"
    # Missing values
    missing = df.isnull().sum()
    if missing.sum() > 0:
        report += "### Missing Values\n"
        for col, count in missing[missing > 0].items():
            pct = (count / len(df)) * 100
            report += f"- **{col}**: {count:,} ({pct:.1f}%)\n"
    # Duplicates
    if df.duplicated().sum() > 0:
        report += "\n### Duplicates\n"
        report += f"- Found {df.duplicated().sum():,} duplicate rows\n"
    # Write report
    with open(filename, 'w') as f:
        f.write(report)
    print(f"Report generated: {filename}")
```
## Advanced Operations
### Splitting Large CSV Files
```python
def split_csv(input_file, rows_per_file=10000):
    """Split large CSV into smaller chunks"""
    chunk_num = 0
    for chunk in pd.read_csv(input_file, chunksize=rows_per_file):
        output_file = f"{input_file.rsplit('.', 1)[0]}_part{chunk_num}.csv"
        chunk.to_csv(output_file, index=False)
        print(f"Created {output_file} with {len(chunk)} rows")
        chunk_num += 1
```
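For example, the call below would write `large_file_part0.csv`, `large_file_part1.csv`, and so on (the file name is illustrative):

```python
split_csv('large_file.csv', rows_per_file=50000)
```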
### Pivot and Unpivot
```python
# Pivot (wide format)
pivot = df.pivot_table(
    values='revenue',
    index='product',
    columns='month',
    aggfunc='sum'
)

# Unpivot (long format)
melted = df.melt(
    id_vars=['product', 'category'],
    value_vars=['jan', 'feb', 'mar'],
    var_name='month',
    value_name='revenue'
)
```

### Data Type Conversion
```python
# Convert columns
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['created_at'] = pd.to_datetime(df['created_at'])
df['active'] = df['active'].astype(bool)

# Parse custom date formats
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

# Handle mixed types
df['mixed'] = df['mixed'].astype(str)
```

## Performance Optimization
### Reading Large Files Efficiently
```python
# Read in chunks
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process chunk
    processed = chunk[chunk['active'] == True]
    chunks.append(processed)
result = pd.concat(chunks, ignore_index=True)

# Read only needed columns
df = pd.read_csv('large_file.csv', usecols=['name', 'email', 'age'])

# Use appropriate dtypes
df = pd.read_csv('large_file.csv', dtype={
    'id': 'int32',          # instead of int64
    'age': 'int8',          # small integers
    'category': 'category'  # categorical data
})
```

### Writing Large Files
```python
# Write in chunks
chunk_size = 10000
for i in range(0, len(df), chunk_size):
    chunk = df.iloc[i:i+chunk_size]
    mode = 'w' if i == 0 else 'a'
    header = i == 0
    chunk.to_csv('output.csv', mode=mode, header=header, index=False)
```

## Command Line Tools
### Using csvkit
```bash
# View CSV structure
csvcut -n data.csv

# Filter columns
csvcut -c name,email,age data.csv > subset.csv

# Filter rows
csvgrep -c age -r "^[3-9][0-9]$" data.csv > age_30plus.csv

# Convert to JSON
csvjson data.csv > data.json

# Statistics
csvstat data.csv

# SQL queries on CSV
csvsql --query "SELECT country, COUNT(*) FROM data GROUP BY country" data.csv
```

### Using awk
```bash
# Print specific columns
awk -F',' '{print $1, $3}' data.csv

# Filter rows
awk -F',' '$3 > 30' data.csv

# Sum column
awk -F',' '{sum+=$3} END {print sum}' data.csv
```

## Best Practices
- Always validate data before processing
- Use appropriate data types to save memory
- Handle encoding issues early in the process
- Profile data first to understand structure
- Use chunks for large files
- Back up original files before transformations
- Document transformations for reproducibility
- Validate output after processing (see the sketch after this list)
- Use version control for CSV processing scripts
- Test with sample data before processing full datasets
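As a concrete example of validating output, a minimal post-processing check might compare row counts and required columns. The helper below is a sketch; the checks, thresholds, and column names are assumptions to tailor per transformation:

```python
def validate_output(input_df, output_df, required_columns):
    """Sanity-check a transformation result; raise on obvious problems."""
    missing = set(required_columns) - set(output_df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    if len(output_df) == 0:
        raise ValueError("Output is empty; did a filter remove everything?")
    if len(output_df) > len(input_df):
        raise ValueError("Output has more rows than input; check merge keys")

# Hypothetical usage after a cleaning step
validate_output(df, cleaned_df, required_columns=['name', 'email', 'age'])
```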
## Common Issues and Solutions
### Issue: Encoding Errors
```python
# Try different encodings; keep latin-1 late in the list, since it
# decodes any byte sequence and so never raises UnicodeDecodeError
for encoding in ['utf-8', 'cp1252', 'latin-1', 'iso-8859-1']:
    try:
        df = pd.read_csv('data.csv', encoding=encoding)
        print(f"Success with encoding: {encoding}")
        break
    except UnicodeDecodeError:
        continue
```

### Issue: Delimiter Detection
```python
import csv

# Auto-detect delimiter
with open('data.csv', 'r') as file:
    sample = file.read(1024)
    sniffer = csv.Sniffer()
    delimiter = sniffer.sniff(sample).delimiter
df = pd.read_csv('data.csv', delimiter=delimiter)
```

### Issue: Memory Errors
```python
# Use chunking
chunks = []
for chunk in pd.read_csv('large.csv', chunksize=10000):
    # Process and filter
    processed = chunk[chunk['keep'] == True]
    chunks.append(processed)
df = pd.concat(chunks, ignore_index=True)
```

## Notes
- Always inspect CSV structure before processing
- Test transformations on a small sample first
- Consider using databases for very large datasets (see the SQLite sketch after this list)
- Document column meanings and data types
- Use consistent date and number formats
- Validate data quality regularly
- Keep processing scripts version controlled
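For the database note above, one lightweight option is loading the CSV into SQLite through pandas. A minimal sketch (the table and file names are illustrative):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect('data.db')

# Load in chunks so a very large file never sits fully in memory
for chunk in pd.read_csv('very_large.csv', chunksize=100_000):
    chunk.to_sql('records', conn, if_exists='append', index=False)

# Query with SQL instead of loading the whole file
counts = pd.read_sql_query(
    "SELECT country, COUNT(*) AS n FROM records GROUP BY country ORDER BY n DESC",
    conn
)
conn.close()
```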