cost-prediction
Construction Cost Prediction with Machine Learning
Overview
Based on DDC methodology (Chapter 4.5), this skill enables predicting construction project costs using historical data and machine learning algorithms. The approach transforms traditional expert-based estimation into data-driven prediction.
Book Reference: "Будущее: прогнозы и машинное обучение" / "Future: Predictions and Machine Learning"
"Predictions and forecasts based on historical data allow companies to make more accurate decisions about project costs and schedules." (DDC Book, Chapter 4.5)
Core Concepts
```
Historical Data → Feature Engineering → ML Model → Cost Prediction
      │                   │                │              │
      ▼                   ▼                ▼              ▼
Past projects        Prepare data     Train model    New project
with costs           for ML           on history     cost forecast
```
Quick Start
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Load historical project data
df = pd.read_csv("historical_projects.csv")

# Features and target
X = df[['area_m2', 'floors', 'complexity_score']]
y = df['total_cost']

# Split data (fixed seed so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print(f"R² Score: {r2_score(y_test, predictions):.2f}")
print(f"MAE: ${mean_absolute_error(y_test, predictions):,.0f}")

# Predict new project (same column names as the training data)
new_project = pd.DataFrame([[5000, 10, 3]], columns=X.columns)  # area, floors, complexity
cost = model.predict(new_project)
print(f"Predicted cost: ${cost[0]:,.0f}")
```
Data Preparation
Prepare Historical Dataset
```python
import pandas as pd
import numpy as np

def prepare_cost_dataset(df):
    """Prepare historical project data for ML"""
    # Select relevant features
    features = [
        'area_m2',
        'floors',
        'building_type',
        'location',
        'year_completed',
        'complexity_score',
        'material_quality',
        'total_cost'
    ]
    df = df[features].copy()

    # Handle missing values
    df = df.dropna(subset=['total_cost'])
    df['complexity_score'] = df['complexity_score'].fillna(df['complexity_score'].median())

    # Encode categorical variables
    df = pd.get_dummies(df, columns=['building_type', 'location'])

    # Calculate derived features
    # Note: both are computed from the target, so use them for analysis and
    # benchmarking, not as model inputs (that would leak the target).
    df['cost_per_m2'] = df['total_cost'] / df['area_m2']
    df['cost_per_floor'] = df['total_cost'] / df['floors']

    # Adjust for inflation (to current year prices)
    current_year = 2024
    inflation_rate = 0.03  # 3% annual
    df['years_ago'] = current_year - df['year_completed']
    df['adjusted_cost'] = df['total_cost'] * (1 + inflation_rate) ** df['years_ago']

    return df

# Usage
df = pd.read_csv("projects_history.csv")
df_prepared = prepare_cost_dataset(df)
```
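As a worked example of the inflation adjustment, the sketch below applies the same compound formula to a single hypothetical project. The cost figure, the completion year, and the helper name `adjust_for_inflation` are illustrative assumptions; the 3% rate and the 2024 reference year mirror the constants used in `prepare_cost_dataset`.

```python
def adjust_for_inflation(total_cost, year_completed, current_year=2024, inflation_rate=0.03):
    """Compound a historical cost forward to current-year prices."""
    years_ago = current_year - year_completed
    return total_cost * (1 + inflation_rate) ** years_ago

# Hypothetical project: $1,000,000 completed in 2020 (4 years before 2024)
adjusted = adjust_for_inflation(1_000_000, 2020)
print(f"Adjusted cost: ${adjusted:,.0f}")  # 1,000,000 * 1.03**4 -> $1,125,509
```

A flat annual rate is a simplification. Where a published construction cost index is available for the region, it would likely be a better basis for the adjustment.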
Feature Engineering
```python
import numpy as np
import pandas as pd

def engineer_features(df):
    """Create additional features for better predictions"""
    df = df.copy()  # avoid modifying the caller's DataFrame

    # Interaction features
    df['area_x_floors'] = df['area_m2'] * df['floors']
    df['area_x_complexity'] = df['area_m2'] * df['complexity_score']

    # Polynomial features
    df['area_squared'] = df['area_m2'] ** 2

    # Log transforms (for skewed features)
    df['log_area'] = np.log1p(df['area_m2'])

    # Binned features
    df['size_category'] = pd.cut(
        df['area_m2'],
        bins=[0, 1000, 5000, 10000, float('inf')],
        labels=['small', 'medium', 'large', 'xlarge']
    )

    return df
```
Machine Learning Models
Linear Regression
```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def train_linear_model(X_train, y_train):
    """Train Linear Regression model with scaling"""
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', LinearRegression())
    ])
    pipeline.fit(X_train, y_train)

    # Feature importance (coefficients on standardized features,
    # sorted by absolute size)
    coefficients = pd.DataFrame({
        'feature': X_train.columns,
        'coefficient': pipeline.named_steps['regressor'].coef_
    }).sort_values('coefficient', key=abs, ascending=False)

    return pipeline, coefficients

# Usage
model, importance = train_linear_model(X_train, y_train)
print("Feature Importance:")
print(importance)
```
K-Nearest Neighbors (KNN)
```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

def train_knn_model(X_train, y_train):
    """Train KNN model with optimal k"""
    # Scale features inside a pipeline so that each CV fold fits its own
    # scaler and the returned model accepts unscaled inputs
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsRegressor())
    ])

    # Find optimal k using cross-validation
    param_grid = {'knn__n_neighbors': range(3, 20)}
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_absolute_error')
    grid_search.fit(X_train, y_train)

    print(f"Best k: {grid_search.best_params_['knn__n_neighbors']}")
    print(f"Best MAE: ${-grid_search.best_score_:,.0f}")

    return grid_search.best_estimator_

# Usage (the returned pipeline scales internally, so pass raw features to predict)
knn_model = train_knn_model(X_train, y_train)
```
Random Forest
```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def train_random_forest(X_train, y_train):
    """Train Random Forest model"""
    rf = RandomForestRegressor(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        random_state=42
    )
    rf.fit(X_train, y_train)

    # Feature importance
    importance = pd.DataFrame({
        'feature': X_train.columns,
        'importance': rf.feature_importances_
    }).sort_values('importance', ascending=False)

    return rf, importance

# Usage
rf_model, importance = train_random_forest(X_train, y_train)
print("Feature Importance:")
print(importance.head(10))
```
Gradient Boosting
```python
from sklearn.ensemble import GradientBoostingRegressor

def train_gradient_boosting(X_train, y_train):
    """Train Gradient Boosting model"""
    gb = GradientBoostingRegressor(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    )
    gb.fit(X_train, y_train)
    return gb

# Usage
gb_model = train_gradient_boosting(X_train, y_train)
```
Model Evaluation
Comprehensive Evaluation
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

def evaluate_model(model, X_test, y_test, model_name="Model"):
    """Comprehensive model evaluation"""
    predictions = model.predict(X_test)

    metrics = {
        'MAE': mean_absolute_error(y_test, predictions),
        'RMSE': np.sqrt(mean_squared_error(y_test, predictions)),
        'R²': r2_score(y_test, predictions),
        'MAPE': np.mean(np.abs((y_test - predictions) / y_test)) * 100
    }

    print(f"\n{model_name} Evaluation:")
    print(f"  MAE: ${metrics['MAE']:,.0f}")
    print(f"  RMSE: ${metrics['RMSE']:,.0f}")
    print(f"  R²: {metrics['R²']:.3f}")
    print(f"  MAPE: {metrics['MAPE']:.1f}%")

    return metrics, predictions

# Usage
metrics, predictions = evaluate_model(model, X_test, y_test, "Linear Regression")
```
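To make the four metrics concrete, here is a small hand-checkable sketch using three made-up cost values (illustrative assumptions, not real project data). It computes the same quantities as `evaluate_model` in plain Python so that each step is visible.

```python
import math

y_true = [100.0, 200.0, 400.0]   # actual costs (illustrative)
y_pred = [110.0, 180.0, 400.0]   # model predictions (illustrative)

errors = [t - p for t, p in zip(y_true, y_pred)]
n = len(errors)

# Mean absolute error: average size of the miss
mae = sum(abs(e) for e in errors) / n
# Root mean squared error: penalizes large misses more heavily
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
# Mean absolute percentage error: miss relative to the actual cost
mape = sum(abs(e) / t for e, t in zip(errors, y_true)) / n * 100

# R²: share of the variance in actual costs explained by the predictions
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, MAPE: {mape:.2f}%, R²: {r2:.3f}")
# MAE: 10.00, RMSE: 12.91, MAPE: 6.67%, R²: 0.989
```

RMSE comes out above MAE here because squaring gives the 20-unit miss more weight than the 10-unit one. MAPE is undefined when an actual cost is zero, so it only makes sense for strictly positive targets.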
Compare Multiple Models
```python
import pandas as pd

def compare_models(models, X_test, y_test):
    """Compare multiple models"""
    results = []
    for name, model in models.items():
        metrics, _ = evaluate_model(model, X_test, y_test, name)
        metrics['Model'] = name
        results.append(metrics)

    comparison = pd.DataFrame(results)
    comparison = comparison.set_index('Model')

    print("\nModel Comparison:")
    print(comparison.round(2))

    return comparison

# Usage (model is the linear pipeline returned by train_linear_model)
models = {
    'Linear Regression': model,
    'KNN': knn_model,
    'Random Forest': rf_model,
    'Gradient Boosting': gb_model
}
comparison = compare_models(models, X_test, y_test)
```
Cross-Validation
```python
from sklearn.model_selection import cross_val_score

def cross_validate_model(model, X, y, cv=5):
    """Perform cross-validation"""
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_absolute_error')
    mae_scores = -scores  # scikit-learn reports MAE as a negative score
    print(f"Cross-Validation MAE: ${mae_scores.mean():,.0f} (+/- ${mae_scores.std():,.0f})")
    return mae_scores

# Usage
cv_scores = cross_validate_model(rf_model, X, y)
```
Prediction Pipeline
Complete Prediction Function
```python
import pandas as pd

def create_prediction_pipeline(model, feature_names, scaler=None):
    """Create a reusable prediction pipeline"""

    def predict_cost(project_data):
        """
        Predict cost for new project

        Args:
            project_data: dict with project features

        Returns:
            Predicted cost with a heuristic lower/upper range
        """
        # Create DataFrame from input
        df = pd.DataFrame([project_data])

        # Ensure all required features (missing ones default to 0,
        # keys not in feature_names are ignored)
        for col in feature_names:
            if col not in df.columns:
                df[col] = 0
        df = df[feature_names]

        # Scale if necessary
        if scaler is not None:
            df = scaler.transform(df)

        # Predict
        prediction = model.predict(df)[0]

        # Range estimate: a fixed ±15% margin, not a statistical confidence interval
        margin = 0.15
        lower = prediction * (1 - margin)
        upper = prediction * (1 + margin)

        return {
            'predicted_cost': prediction,
            'lower_bound': lower,
            'upper_bound': upper,
            'margin': f"±{margin * 100:.0f}%"
        }

    return predict_cost

# Usage
predictor = create_prediction_pipeline(rf_model, X.columns.tolist())

# Predict new project
new_project = {
    'area_m2': 5000,
    'floors': 8,
    'complexity_score': 3,
    'material_quality': 2
}
result = predictor(new_project)
print(f"Predicted Cost: ${result['predicted_cost']:,.0f}")
print(f"Range: ${result['lower_bound']:,.0f} - ${result['upper_bound']:,.0f}")
```
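The fixed 15% margin is a rule of thumb. One alternative is to derive the range from the model's own errors on held-out data. The sketch below assumes you already have actual and predicted costs from a test set. The function name `empirical_interval`, the sample values, and the 90% coverage target are all illustrative assumptions, and the approach presumes that relative errors on new projects resemble those seen on the test set.

```python
import numpy as np

def empirical_interval(y_test, test_predictions, new_prediction, coverage=0.90):
    """Build a prediction range from the spread of relative errors on held-out data."""
    y_test = np.asarray(y_test, dtype=float)
    test_predictions = np.asarray(test_predictions, dtype=float)

    # Ratio of actual to predicted cost for each held-out project
    ratios = y_test / test_predictions

    # Keep the central `coverage` share of the observed ratios
    tail = (1 - coverage) / 2
    low_ratio, high_ratio = np.quantile(ratios, [tail, 1 - tail])

    return new_prediction * low_ratio, new_prediction * high_ratio

# Illustrative held-out results (not real project data)
actual = [1.00e6, 2.10e6, 3.30e6, 0.95e6, 5.00e6]
predicted = [1.10e6, 2.00e6, 3.00e6, 1.00e6, 5.20e6]

lower, upper = empirical_interval(actual, predicted, new_prediction=4.0e6)
print(f"Range: ${lower:,.0f} - ${upper:,.0f}")
```

With only a handful of test projects the quantiles are very rough, so a range built this way is only as trustworthy as the size and representativeness of the held-out set.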
Save and Load Model
```python
import joblib

# Save model
def save_model(model, filepath):
    """Save trained model to file"""
    joblib.dump(model, filepath)
    print(f"Model saved to {filepath}")

# Load model
def load_model(filepath):
    """Load model from file"""
    model = joblib.load(filepath)
    print(f"Model loaded from {filepath}")
    return model

# Usage
save_model(rf_model, "cost_prediction_model.pkl")
loaded_model = load_model("cost_prediction_model.pkl")
```
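A saved model is only usable later if the feature names and their order are known. One possible approach, sketched below under stated assumptions, is to store the model and its feature list together in a single file. The tiny synthetic dataset, the `bundle` dictionary layout, and the file name are hypothetical and chosen only for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic training set (illustrative only)
feature_names = ['area_m2', 'floors', 'complexity_score']
X = np.array([[1000, 2, 1], [2500, 5, 2], [4000, 8, 3], [6000, 12, 4]], dtype=float)
y = np.array([1.2e6, 3.1e6, 5.2e6, 8.0e6])
model = LinearRegression().fit(X, y)

# Bundle the model with the metadata needed to use it later
bundle = {'model': model, 'feature_names': feature_names}

path = os.path.join(tempfile.mkdtemp(), "cost_model_bundle.pkl")
joblib.dump(bundle, path)

# Reload and confirm the restored model predicts the same values
restored = joblib.load(path)
same = np.allclose(restored['model'].predict(X), model.predict(X))
print(f"Features: {restored['feature_names']}, predictions match: {same}")
```

Files written by joblib are pickle-based, so load them only from sources you trust, and ideally with the same scikit-learn version that produced them.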
Using with ChatGPT
```python
# Prompt for ChatGPT to help with cost prediction
prompt = """
I have historical construction project data with these columns:
- area_m2: Building area in square meters
- floors: Number of floors
- building_type: residential, commercial, industrial
- total_cost: Total project cost in USD

Write Python code using scikit-learn to:
- Prepare the data for machine learning
- Train a Random Forest model
- Evaluate the model
- Predict cost for a new 3000 m² commercial building with 5 floors
"""
```
Quick Reference
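| Task | Code |
|---|---|
| Split data | `train_test_split(X, y, test_size=0.2)` |
| Linear Regression | `LinearRegression().fit(X_train, y_train)` |
| KNN | `KNeighborsRegressor().fit(X_train, y_train)` |
| Random Forest | `RandomForestRegressor(n_estimators=100).fit(X_train, y_train)` |
| Predict | `model.predict(X_test)` |
| MAE | `mean_absolute_error(y_test, predictions)` |
| R² Score | `r2_score(y_test, predictions)` |
| Cross-validate | `cross_val_score(model, X, y, cv=5)` |
| Save model | `joblib.dump(model, filepath)` |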
Best Practices
- Data Quality: More historical data = better predictions
- Feature Selection: Include relevant project characteristics
- Inflation Adjustment: Normalize costs to current prices
- Regular Retraining: Update model with new completed projects
- Ensemble Methods: Combine multiple models for robustness
- Confidence Intervals: Always provide prediction ranges
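The ensemble recommendation can be sketched with scikit-learn's `VotingRegressor`, which averages the predictions of several regressors. The synthetic data, the cost formula used to generate it, and the choice of three base models are assumptions made purely for illustration. With real data you would pass in your prepared historical features instead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for historical projects (illustrative only)
rng = np.random.default_rng(42)
area = rng.uniform(500, 10000, 200)
floors = rng.integers(1, 20, 200)
X = np.column_stack([area, floors])
y = 1500 * area + 50000 * floors + rng.normal(0, 200000, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Average the predictions of three different model families
ensemble = VotingRegressor([
    ('linear', LinearRegression()),
    ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingRegressor(random_state=42)),
])
ensemble.fit(X_train, y_train)

predictions = ensemble.predict(X_test)
print(f"Ensemble MAE: ${mean_absolute_error(y_test, predictions):,.0f}")
```

Averaging tends to help most when the base models make different kinds of errors. It is still worth comparing the ensemble against each individual model on the same test set before adopting it.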
Resources
- Book: "Data-Driven Construction" by Artem Boiko, Chapter 4.5
- Website: https://datadrivenconstruction.io
- scikit-learn: https://scikit-learn.org
Next Steps
- See `duration-prediction` for project duration forecasting
- See `ml-model-builder` for custom ML workflows
- See `kpi-dashboard` for visualization
- See `big-data-analysis` for large dataset processing