cost-prediction

Construction Cost Prediction with Machine Learning

Overview

Based on DDC methodology (Chapter 4.5), this skill predicts construction project costs from historical data using machine learning algorithms. The approach replaces traditional expert-based estimation with data-driven prediction.
Book Reference: "Будущее: прогнозы и машинное обучение" / "Future: Predictions and Machine Learning"
"Predictions and forecasts based on historical data allow companies to make more accurate decisions about project costs and schedules." (DDC Book, Chapter 4.5)

Core Concepts

Historical Data → Feature Engineering → ML Model → Cost Prediction
    │                    │                │              │
    ▼                    ▼                ▼              ▼
Past projects      Prepare data      Train model    New project
with costs         for ML            on history     cost forecast

Quick Start

python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Load historical project data
df = pd.read_csv("historical_projects.csv")

# Features and target
X = df[['area_m2', 'floors', 'complexity_score']]
y = df['total_cost']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print(f"R² Score: {r2_score(y_test, predictions):.2f}")
print(f"MAE: ${mean_absolute_error(y_test, predictions):,.0f}")

# Predict new project (same column names the model was trained on)
new_project = pd.DataFrame([[5000, 10, 3]], columns=X.columns)  # area, floors, complexity
cost = model.predict(new_project)
print(f"Predicted cost: ${cost[0]:,.0f}")

Data Preparation

Prepare Historical Dataset

python
import pandas as pd
import numpy as np

def prepare_cost_dataset(df):
    """Prepare historical project data for ML"""
    # Select relevant features
    features = [
        'area_m2',
        'floors',
        'building_type',
        'location',
        'year_completed',
        'complexity_score',
        'material_quality',
        'total_cost'
    ]

    df = df[features].copy()

    # Handle missing values
    df = df.dropna(subset=['total_cost'])
    df['complexity_score'] = df['complexity_score'].fillna(df['complexity_score'].median())

    # Encode categorical variables
    df = pd.get_dummies(df, columns=['building_type', 'location'])

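    # Calculate derived features (for analysis only: both are computed from
    # total_cost, so exclude them from model inputs to avoid target leakage)
    df['cost_per_m2'] = df['total_cost'] / df['area_m2']
    df['cost_per_floor'] = df['total_cost'] / df['floors']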

    # Adjust for inflation (to current year prices)
    current_year = 2024
    inflation_rate = 0.03  # 3% annual
    df['years_ago'] = current_year - df['year_completed']
    df['adjusted_cost'] = df['total_cost'] * (1 + inflation_rate) ** df['years_ago']

    return df

# Usage
df = pd.read_csv("projects_history.csv")
df_prepared = prepare_cost_dataset(df)
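
The frame returned by prepare_cost_dataset still contains the target and several columns computed from it. A minimal sketch of separating model inputs from the target, assuming the column names used above. The helper name split_features_target, the TARGET_DERIVED list, and the demo rows are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Columns that are the target itself or are computed from it (hypothetical list)
TARGET_DERIVED = ['total_cost', 'adjusted_cost', 'cost_per_m2', 'cost_per_floor']

def split_features_target(df_prepared, target='adjusted_cost'):
    """Return (X, y) with target-derived columns removed from X."""
    y = df_prepared[target]
    drop_cols = [c for c in TARGET_DERIVED if c in df_prepared.columns]
    X = df_prepared.drop(columns=drop_cols)
    return X, y

# Made-up rows, only to show which columns survive
demo = pd.DataFrame({
    'area_m2': [1000, 2000],
    'floors': [2, 4],
    'total_cost': [1_000_000, 2_500_000],
    'adjusted_cost': [1_030_000, 2_500_000],
    'cost_per_m2': [1000.0, 1250.0],
    'cost_per_floor': [500_000.0, 625_000.0],
})
X_demo, y_demo = split_features_target(demo)
print(list(X_demo.columns))
```

Using adjusted_cost as the target here is a choice, not a requirement; it keeps projects from different years on a comparable price basis.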

Feature Engineering

python
def engineer_features(df):
    """Create additional features for better predictions"""
    # Interaction features
    df['area_x_floors'] = df['area_m2'] * df['floors']
    df['area_x_complexity'] = df['area_m2'] * df['complexity_score']

    # Polynomial features
    df['area_squared'] = df['area_m2'] ** 2

    # Log transforms (for skewed features)
    df['log_area'] = np.log1p(df['area_m2'])

    # Binned features
    df['size_category'] = pd.cut(
        df['area_m2'],
        bins=[0, 1000, 5000, 10000, float('inf')],
        labels=['small', 'medium', 'large', 'xlarge']
    )

    return df
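
The size_category column produced by pd.cut is categorical, and scikit-learn estimators expect numeric input. One way to handle this, sketched under the assumption that one-hot encoding suits the model (the demo areas and the size prefix are made up for illustration):

```python
import pandas as pd

# Made-up areas, one per bin except 'large'
demo = pd.DataFrame({'area_m2': [800, 3000, 12000]})
demo['size_category'] = pd.cut(
    demo['area_m2'],
    bins=[0, 1000, 5000, 10000, float('inf')],
    labels=['small', 'medium', 'large', 'xlarge']
)

# One-hot encode the bins; a bin with no rows still gets its own column
encoded = pd.get_dummies(demo, columns=['size_category'], prefix='size')
print(encoded.columns.tolist())
```

Because every bin gets a column even when empty, the encoded frame keeps the same layout across training and prediction data.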

Machine Learning Models

Linear Regression

python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def train_linear_model(X_train, y_train):
    """Train Linear Regression model with scaling"""
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', LinearRegression())
    ])

    pipeline.fit(X_train, y_train)

    # Feature importance (coefficients)
    coefficients = pd.DataFrame({
        'feature': X_train.columns,
        'coefficient': pipeline.named_steps['regressor'].coef_
    }).sort_values('coefficient', key=abs, ascending=False)

    return pipeline, coefficients

# Usage
model, importance = train_linear_model(X_train, y_train)
print("Feature Importance:")
print(importance)

K-Nearest Neighbors (KNN)

python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

def train_knn_model(X_train, y_train):
    """Train KNN model with optimal k"""
    # Scale inside a pipeline so each CV fold is scaled on its own training split
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsRegressor())
    ])

    # Find optimal k using cross-validation
    param_grid = {'knn__n_neighbors': range(3, 20)}
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_absolute_error')
    grid_search.fit(X_train, y_train)

    print(f"Best k: {grid_search.best_params_['knn__n_neighbors']}")
    print(f"Best MAE: ${-grid_search.best_score_:,.0f}")

    # The returned model scales its own input, so pass it unscaled features;
    # the fitted scaler is returned for inspection only
    best_model = grid_search.best_estimator_
    return best_model, best_model.named_steps['scaler']

# Usage
knn_model, scaler = train_knn_model(X_train, y_train)

Random Forest

python
from sklearn.ensemble import RandomForestRegressor

def train_random_forest(X_train, y_train):
    """Train Random Forest model"""
    rf = RandomForestRegressor(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        random_state=42
    )

    rf.fit(X_train, y_train)

    # Feature importance
    importance = pd.DataFrame({
        'feature': X_train.columns,
        'importance': rf.feature_importances_
    }).sort_values('importance', ascending=False)

    return rf, importance

# Usage
rf_model, importance = train_random_forest(X_train, y_train)
print("Feature Importance:")
print(importance.head(10))

Gradient Boosting

python
from sklearn.ensemble import GradientBoostingRegressor

def train_gradient_boosting(X_train, y_train):
    """Train Gradient Boosting model"""
    gb = GradientBoostingRegressor(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    )

    gb.fit(X_train, y_train)
    return gb

# Usage
gb_model = train_gradient_boosting(X_train, y_train)

Model Evaluation

Comprehensive Evaluation

python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

def evaluate_model(model, X_test, y_test, model_name="Model"):
    """Comprehensive model evaluation"""
    predictions = model.predict(X_test)

    metrics = {
        'MAE': mean_absolute_error(y_test, predictions),
        'RMSE': np.sqrt(mean_squared_error(y_test, predictions)),
        'R²': r2_score(y_test, predictions),
        'MAPE': np.mean(np.abs((y_test - predictions) / y_test)) * 100
    }

    print(f"\n{model_name} Evaluation:")
    print(f"  MAE:  ${metrics['MAE']:,.0f}")
    print(f"  RMSE: ${metrics['RMSE']:,.0f}")
    print(f"  R²:   {metrics['R²']:.3f}")
    print(f"  MAPE: {metrics['MAPE']:.1f}%")

    return metrics, predictions

# Usage
metrics, predictions = evaluate_model(model, X_test, y_test, "Linear Regression")
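
To make the four metrics reported by evaluate_model concrete, here is a small worked example. The three actual and predicted costs are made up so the arithmetic can be checked by hand:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted costs for three projects
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

mae = mean_absolute_error(y_true, y_pred)                  # (10 + 20 + 0) / 3
rmse = np.sqrt(mean_squared_error(y_true, y_pred))         # sqrt((100 + 400 + 0) / 3)
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100   # (10% + 10% + 0%) / 3
r2 = r2_score(y_true, y_pred)                              # 1 - SS_res / SS_tot

print(f"MAE={mae:.2f} RMSE={rmse:.2f} MAPE={mape:.2f}% R2={r2:.3f}")
```

RMSE comes out larger than MAE because squaring weights the 20-unit miss more heavily, while MAPE treats the 10-unit and 20-unit misses as equal since both are 10% of the actual value.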

Compare Multiple Models

python
def compare_models(models, X_test, y_test):
    """Compare multiple models"""
    results = []

    for name, model in models.items():
        metrics, _ = evaluate_model(model, X_test, y_test, name)
        metrics['Model'] = name
        results.append(metrics)

    comparison = pd.DataFrame(results)
    comparison = comparison.set_index('Model')

    print("\nModel Comparison:")
    print(comparison.round(2))

    return comparison

# Usage
models = {
    'Linear Regression': model,  # pipeline returned by train_linear_model
    'KNN': knn_model,
    'Random Forest': rf_model,
    'Gradient Boosting': gb_model
}
comparison = compare_models(models, X_test, y_test)

Cross-Validation

python
from sklearn.model_selection import cross_val_score

def cross_validate_model(model, X, y, cv=5):
    """Perform cross-validation"""
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_absolute_error')
    mae_scores = -scores

    print(f"Cross-Validation MAE: ${mae_scores.mean():,.0f} (+/- ${mae_scores.std():,.0f})")
    return mae_scores

# Usage
cv_scores = cross_validate_model(rf_model, X, y)
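
With an integer cv, cross_val_score splits a regression dataset into consecutive folds without shuffling. If the project history is sorted (for example by size or completion year), each fold can end up covering a different range. A sketch of passing a shuffled KFold instead, on synthetic data generated purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: rows deliberately sorted by area
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(500, 10000, size=(100, 1)), axis=0)
y = 1200 * X[:, 0] + rng.normal(0, 100000, size=100)

# Shuffle rows before splitting; random_state makes the folds repeatable
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='neg_mean_absolute_error')
mae_scores = -scores

print(f"MAE per fold: {np.round(mae_scores)}")
```

The cv object can be handed to cross_validate_model above through its cv parameter, since that value is forwarded to cross_val_score unchanged.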

Prediction Pipeline

Complete Prediction Function

python
import joblib

def create_prediction_pipeline(model, feature_names, scaler=None):
    """Create a reusable prediction pipeline"""

    def predict_cost(project_data):
        """
        Predict cost for new project

        Args:
            project_data: dict with project features

        Returns:
            Predicted cost and confidence interval
        """
        # Create DataFrame from input
        df = pd.DataFrame([project_data])

        # Ensure all required features
        for col in feature_names:
            if col not in df.columns:
                df[col] = 0

        df = df[feature_names]

        # Scale if necessary
        if scaler:
            df = scaler.transform(df)

        # Predict
        prediction = model.predict(df)[0]

        # Heuristic range (a fixed ±15% margin, not a statistical confidence interval)
        margin = 0.15
        lower = prediction * (1 - margin)
        upper = prediction * (1 + margin)

        return {
            'predicted_cost': prediction,
            'lower_bound': lower,
            'upper_bound': upper,
            'margin': f"±{margin*100:.0f}%"
        }

    return predict_cost

# Usage
predictor = create_prediction_pipeline(rf_model, X.columns.tolist())

# Predict new project
new_project = {
    'area_m2': 5000,
    'floors': 8,
    'complexity_score': 3,
    'material_quality': 2
}
result = predictor(new_project)
print(f"Predicted Cost: ${result['predicted_cost']:,.0f}")
print(f"Range: ${result['lower_bound']:,.0f} - ${result['upper_bound']:,.0f}")
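
The 15% range in predict_cost is a fixed rule of thumb. A sketch of deriving the range from held-out errors instead, assuming a fitted model and a test split are available. The function name empirical_range, the coverage parameter, and the synthetic data are all hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def empirical_range(model, X_test, y_test, X_new, coverage=0.8):
    """Scale a point prediction by quantiles of actual/predicted ratios on held-out data."""
    ratios = np.asarray(y_test) / model.predict(X_test)
    tail = (1 - coverage) / 2
    lo, hi = np.quantile(ratios, [tail, 1 - tail])
    point = model.predict(X_new)[0]
    return point * lo, point, point * hi

# Synthetic data for illustration: cost grows with area and floors, plus noise
rng = np.random.default_rng(0)
X = rng.uniform([500, 1], [10000, 20], size=(300, 2))
y = 1200 * X[:, 0] + 50000 * X[:, 1] + rng.normal(0, 100000, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

lower, point, upper = empirical_range(rf, X_test, y_test, np.array([[5000, 8]]))
print(f"{lower:,.0f} <= {point:,.0f} <= {upper:,.0f}")
```

The range here reflects how far the model was off on projects it had not seen, so it widens or narrows with the model's actual accuracy. It is still an approximation: it assumes the relative error is similar across project sizes.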

Save and Load Model

python
import joblib

# Save model
def save_model(model, filepath):
    """Save trained model to file"""
    joblib.dump(model, filepath)
    print(f"Model saved to {filepath}")

# Load model
def load_model(filepath):
    """Load model from file"""
    model = joblib.load(filepath)
    print(f"Model loaded from {filepath}")
    return model

# Usage
save_model(rf_model, "cost_prediction_model.pkl")
loaded_model = load_model("cost_prediction_model.pkl")
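
A saved model is only usable if new inputs arrive with the same columns in the same order as the training data. One possible approach, sketched here, is to store the feature list alongside the model. The dictionary layout, the file name, and the tiny training set are assumptions for illustration:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny made-up training set, only to have a fitted model to save
X = pd.DataFrame({'area_m2': [1000, 2000, 3000], 'floors': [2, 5, 6]})
y = [1_000_000, 2_000_000, 3_000_000]
model = LinearRegression().fit(X, y)

# Store the column order next to the model (this dict layout is an assumption)
bundle = {'model': model, 'feature_names': X.columns.tolist()}
path = os.path.join(tempfile.mkdtemp(), "cost_model_bundle.pkl")
joblib.dump(bundle, path)

# Later: reload and reorder incoming columns to match training
restored = joblib.load(path)
new_row = pd.DataFrame([{'floors': 5, 'area_m2': 2500}])[restored['feature_names']]
predicted = restored['model'].predict(new_row)[0]
print(restored['feature_names'], round(predicted))
```

The incoming row lists floors before area_m2 on purpose; indexing by the stored feature list puts the columns back in training order before predicting.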

Using with ChatGPT

python
# Prompt for ChatGPT to help with cost prediction
prompt = """
I have historical construction project data with these columns:
- area_m2: Building area in square meters
- floors: Number of floors
- building_type: residential, commercial, industrial
- total_cost: Total project cost in USD

Write Python code using scikit-learn to:
1. Prepare the data for machine learning
2. Train a Random Forest model
3. Evaluate the model
4. Predict cost for a new 3000 m² commercial building with 5 floors
"""

Quick Reference

| Task | Code |
|------|------|
| Split data | train_test_split(X, y, test_size=0.2) |
| Linear Regression | LinearRegression().fit(X, y) |
| KNN | KNeighborsRegressor(n_neighbors=5) |
| Random Forest | RandomForestRegressor(n_estimators=100) |
| Predict | model.predict(X_new) |
| MAE | mean_absolute_error(y_true, y_pred) |
| R² Score | r2_score(y_true, y_pred) |
| Cross-validate | cross_val_score(model, X, y, cv=5) |
| Save model | joblib.dump(model, 'file.pkl') |

Best Practices

  1. Data Quality: More, cleaner historical data generally yields better predictions
  2. Feature Selection: Include relevant project characteristics
  3. Inflation Adjustment: Normalize costs to current prices
  4. Regular Retraining: Update model with new completed projects
  5. Ensemble Methods: Combine multiple models for robustness
  6. Confidence Intervals: Always provide prediction ranges
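
The ensemble item above can be sketched with scikit-learn's VotingRegressor, which averages the predictions of its member models. The choice of members, their settings, and the synthetic data are assumptions for illustration, not a recommended configuration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only
rng = np.random.default_rng(1)
X = rng.uniform([500, 1], [10000, 20], size=(200, 2))
y = 1200 * X[:, 0] + 50000 * X[:, 1] + rng.normal(0, 100000, size=200)

ensemble = VotingRegressor([
    ('lr', LinearRegression()),
    ('rf', RandomForestRegressor(n_estimators=50, random_state=0)),
    ('gb', GradientBoostingRegressor(random_state=0)),
])
ensemble.fit(X, y)

x_new = np.array([[5000, 8]])
combined = ensemble.predict(x_new)[0]

# With no weights given, the combined value is the plain mean of the members
members = [est.predict(x_new)[0] for est in ensemble.estimators_]
print(f"Ensemble: {combined:,.0f}  members: {[round(m) for m in members]}")
```

Comparing the member predictions also gives a quick sanity check: a large spread between them suggests the new project lies in a region where the models disagree.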

Resources

Next Steps

  • See duration-prediction for project duration forecasting
  • See ml-model-builder for custom ML workflows
  • See kpi-dashboard for visualization
  • See big-data-analysis for large dataset processing