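📽️ Model Reproduction

Using a Fixed Model

Although the platform provides automated feature engineering and automated machine learning, a modeling study sometimes requires one specific machine learning algorithm, even when that algorithm is not the best choice by the metrics.

The platform lets you specify a fixed model for training in the automated machine learning stage. In the final task-configuration step of creating a new task, the advanced settings include a fixed-model drop-down list from which you can select an algorithm. The system then builds the model and tunes hyperparameters for the selected algorithm only, as shown in the figure below:

This step is not affected by the preceding AutoFE automated feature engineering. If automated feature engineering was run (it is checked by default), the model specified here is trained on the original features together with the AutoFE-derived features.

Because the algorithm is fixed, the entire time budget of the AutoML automated machine learning stage is spent tuning that one algorithm. No time goes to searching other algorithms, which can also improve accuracy.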

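Reproduction Steps

If no fixed model was selected in the modeling configuration, the platform provides a set of algorithms with which the trained model can be reproduced. Click the Reproduce button on the Advanced tab. The procedure has three steps:

① Download the reproduction training dataset.

② Download the reproduction validation dataset.

③ Select the desired algorithm and follow the steps in the reproduction document to complete the reproduction.

Exploration Records

During the automated machine learning stage the platform searches many algorithms and hyperparameter combinations. Click AutoML Exploration Records on the Advanced tab to view them. The records list the algorithm and parameters tried in each round, any of which can be reproduced, which gives more room for choice, as shown in the figure below: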

XGBoost

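Experiments show that some models give different reproduction results on different devices. This is mainly because the random-number settings inside the algorithms differ (XGBoost, for example), and because our AutoML, constrained by its time budget, sometimes stops model training early. As a result, the reproduced results cannot match the platform exactly, but the accuracy stays close: it fluctuates by no more than 0.2%.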
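
One way to check a reproduction against the platform is to compare the two accuracies with an explicit tolerance. The sketch below is not part of the platform: the helper name `within_tolerance` and the accuracy values are hypothetical, and it assumes the 0.2% bound means 0.2 percentage points of accuracy (0.002 on a 0 to 1 scale).

```python
def within_tolerance(platform_acc: float, reproduced_acc: float, tol: float = 0.002) -> bool:
    """Return True if the two accuracies differ by at most `tol` (absolute difference)."""
    return abs(platform_acc - reproduced_acc) <= tol


# Hypothetical values: accuracy reported by the platform vs. accuracy of the local rerun.
print(within_tolerance(0.9130, 0.9121))  # difference of about 0.0009, inside the bound
print(within_tolerance(0.9130, 0.9080))  # difference of about 0.0050, outside the bound
```

If the check fails, the causes named above (random seeds and early stopping) are the first things to look at.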

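Reproduction Steps

Step 1: Download the model's training and validation sets

(1) Click the Advanced button to the right of the task, then click Reproduce in the drop-down menu.

(2) Click the items in the pop-up dialog in order.

① Click Download Training Set to get the (preprocessed) data used to train the model.

② Click Download Validation Set to get the (preprocessed) data used to validate the model.

③ Click the corresponding model to get its reproduction code. For example, if the model is extra_tree, click the extra_tree button.

(3) Load the data in the code.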

import pandas as pd

# Settings
target_name = ""  ### change this to the target column name of your task

# Sanitize target_name: replace special characters with underscores
replace_str = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']
for s in target_name:
    if s in replace_str:
        target_name = target_name.replace(s, '_')

# Read the downloaded training and validation sets
train, validation = pd.read_csv("train.csv"), pd.read_csv("validation.csv")  ### change to the paths of the downloaded data
validation = validation[train.columns]

# Split features and labels
X_train, y_train = train.drop(columns=target_name), train[target_name]
X_val, y_val = validation.drop(columns=target_name), validation[target_name]
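
The loading code replaces special characters in the target name with underscores, presumably so that the name matches the column name in the downloaded (preprocessed) files. The following self-contained sketch shows the effect of that replacement on a sample name; `sanitize_target_name` is a hypothetical helper, and the character list is copied from the code above.

```python
# Characters that the loading code replaces with underscores.
REPLACE_CHARS = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']


def sanitize_target_name(name: str) -> str:
    """Replace every listed special character in `name` with an underscore."""
    return "".join("_" if ch in REPLACE_CHARS else ch for ch in name)


print(sanitize_target_name("price (¥)"))  # price____
print(sanitize_target_name("label"))      # label
```

A name without special characters passes through unchanged, so the step is safe to keep even when it is not needed.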

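Step 2: Copy the model parameters

(1) Click the More button to the right of the task, then click View Model Parameters in the drop-down menu.

(2) Copy the parameters listed under the automated machine learning parameter configuration.

(3) Set the parameters in the code.

# Model parameters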
params = {
  "reg_alpha": 0.010041324684096948,
  "subsample": 1,
  "max_leaves": 5,
  "reg_lambda": 0.06780780830355458,
  "n_estimators": 5,
  "learning_rate": 0.134709002918751,
  "colsample_bytree": 0.8820431130526493,
  "min_child_weight": 0.40726573498774193,
  "colsample_bylevel": 1
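}  ### replace with the parameters reported by changtianml

Step 3: Reproduce the model

Initialize the model in the code and train it.

import xgboost as xgb

# Task type
task_type = 'classification'  ### 'classification' for a classification task, 'regression' for a regression task

# Initialize the model with the parameters above
if task_type == 'classification':
    model = xgb.XGBClassifier(**params)
else:
    model = xgb.XGBRegressor(**params)

# Train the model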
model.fit(X_train, y_train)

Step 4: Validate the model

Validate the trained model in the code.

# Validation function
def eval_score(task_type, y_val, y_pred, y_proba=None):
    """
    Args:
        task_type (str): task type, 'classification' or 'regression'
        y_val (pd.Series): true labels of the validation set
        y_pred (np.ndarray): labels predicted by the model
        y_proba (np.ndarray, optional): class probabilities predicted by the model, of shape
            (n_samples, n_classes); required only for classification tasks. Defaults to None.

    Returns:
        dict: values of the common metrics
    """
    if task_type == 'classification':
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss

        val_loss = log_loss(y_val, y_proba)
        val_f1 = f1_score(y_val, y_pred, average='weighted')
        val_accuracy = accuracy_score(y_val, y_pred)
        val_precision = precision_score(y_val, y_pred, average='weighted')
        val_recall = recall_score(y_val, y_pred, average='weighted')
        if y_proba.shape[1] > 1:
            num_classes = y_proba.shape[1]
            val_auc_roc = 0.0
            for class_idx in range(num_classes):
                val_auc_roc += roc_auc_score((y_val == class_idx).astype(int), y_proba[:, class_idx])
            val_auc_roc /= num_classes
        else:
            val_auc_roc = roc_auc_score(y_val, y_proba)
        return {
            "val_log_loss": val_loss,
            "val_f1": val_f1,
            "val_accuracy": val_accuracy,
            "val_precision": val_precision,
            "val_recall": val_recall,
            "val_auc": val_auc_roc,
        }
    else:
        from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
        import numpy as np

        val_mse = mean_squared_error(y_val, y_pred)
        val_rmse = np.sqrt(val_mse)
        val_mae = mean_absolute_error(y_val, y_pred)
        val_r_squared = r2_score(y_val, y_pred)
        if np.isnan(val_r_squared):
            val_r_squared = -1
        return {
            'val_MSE': val_mse,
            'val_RMSE': val_rmse,
            'val_MAE': val_mae,
            'val_R-squared': val_r_squared,
        }

# Predict on the validation set
pred = model.predict(X_val)
if task_type == 'classification':
    proba = model.predict_proba(X_val)
else:
    proba = None

# Compute the validation metrics
model_res = eval_score(task_type, y_val, pred, proba)

# Print the results
for k, v in model_res.items():
    print(f"{k}: {v}")
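
The per-class AUC loop in `eval_score` compares `y_val` with the column index of `y_proba`, so it implicitly assumes the labels are the integers 0..K-1 in the same order as the probability columns. The downloaded data may already satisfy this. If it does not, a mapping like the following sketch could be applied to the labels first. `encode_labels` is a hypothetical helper, and ordering classes by sorted value is an assumption that mirrors how scikit-learn classifiers order `classes_`.

```python
def encode_labels(labels):
    """Map arbitrary class labels to integers 0..K-1, ordered by sorted label value."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


encoded, classes = encode_labels(["cat", "dog", "cat", "bird"])
print(encoded)  # [1, 2, 1, 0]
print(classes)  # ['bird', 'cat', 'dog']
```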

Step 5: Save the model

Save the trained model in the code.

import pickle
save_path = "model.pkl"  ### change to the path where you want to save the model
with open(save_path, "wb") as file:
    pickle.dump(model, file)
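
A saved model can later be restored with `pickle.load`. The following sketch shows the round trip. To stay self-contained it pickles a plain dictionary as a stand-in for the trained model, and the temporary file location is hypothetical. Only unpickle files you trust, because `pickle.load` can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model object (an assumption for illustration only).
model = {"name": "demo", "n_estimators": 5}

save_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(save_path, "wb") as file:
    pickle.dump(model, file)

# Restore the object; a real model would then be used via restored.predict(...).
with open(save_path, "rb") as file:
    restored = pickle.load(file)

print(restored == model)  # True
```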

Complete Code

import xgboost as xgb
import pandas as pd
import pickle

# Settings
target_name = ""  ### change this to the target column name of your task

# Sanitize target_name: replace special characters with underscores
replace_str = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']
for s in target_name:
    if s in replace_str:
        target_name = target_name.replace(s, '_')

# Task type
task_type = 'classification'  ### 'classification' for a classification task, 'regression' for a regression task

# Model parameters
params = {
  "reg_alpha": 0.010041324684096948,
  "subsample": 1,
  "max_leaves": 5,
  "reg_lambda": 0.06780780830355458,
  "n_estimators": 5,
  "learning_rate": 0.134709002918751,
  "colsample_bytree": 0.8820431130526493,
  "min_child_weight": 0.40726573498774193,
  "colsample_bylevel": 1
}  ### replace with the parameters reported by changtianml

# Read the downloaded training and validation sets
train, validation = pd.read_csv("train.csv"), pd.read_csv("validation.csv")  ### change to the paths of the downloaded data
validation = validation[train.columns]

# Split features and labels
X_train, y_train = train.drop(columns=target_name), train[target_name]
X_val, y_val = validation.drop(columns=target_name), validation[target_name]

# Initialize the model with the parameters above
if task_type == 'classification':
    model = xgb.XGBClassifier(**params)
else:
    model = xgb.XGBRegressor(**params)

# Train the model
model.fit(X_train, y_train)

# Validation function
def eval_score(task_type, y_val, y_pred, y_proba=None):
    """
    Args:
        task_type (str): task type, 'classification' or 'regression'
        y_val (pd.Series): true labels of the validation set
        y_pred (np.ndarray): labels predicted by the model
        y_proba (np.ndarray, optional): class probabilities predicted by the model, of shape
            (n_samples, n_classes); required only for classification tasks. Defaults to None.

    Returns:
        dict: values of the common metrics
    """
    if task_type == 'classification':
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss

        val_loss = log_loss(y_val, y_proba)
        val_f1 = f1_score(y_val, y_pred, average='weighted')
        val_accuracy = accuracy_score(y_val, y_pred)
        val_precision = precision_score(y_val, y_pred, average='weighted')
        val_recall = recall_score(y_val, y_pred, average='weighted')
        if y_proba.shape[1] > 1:
            num_classes = y_proba.shape[1]
            val_auc_roc = 0.0
            for class_idx in range(num_classes):
                val_auc_roc += roc_auc_score((y_val == class_idx).astype(int), y_proba[:, class_idx])
            val_auc_roc /= num_classes
        else:
            val_auc_roc = roc_auc_score(y_val, y_proba)
        return {
            "val_log_loss": val_loss,
            "val_f1": val_f1,
            "val_accuracy": val_accuracy,
            "val_precision": val_precision,
            "val_recall": val_recall,
            "val_auc": val_auc_roc,
        }
    else:
        from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
        import numpy as np

        val_mse = mean_squared_error(y_val, y_pred)
        val_rmse = np.sqrt(val_mse)
        val_mae = mean_absolute_error(y_val, y_pred)
        val_r_squared = r2_score(y_val, y_pred)
        if np.isnan(val_r_squared):
            val_r_squared = -1
        return {
            'val_MSE': val_mse,
            'val_RMSE': val_rmse,
            'val_MAE': val_mae,
            'val_R-squared': val_r_squared,
        }

# Predict on the validation set
pred = model.predict(X_val)
if task_type == 'classification':
    proba = model.predict_proba(X_val)
else:
    proba = None

# Compute the validation metrics
model_res = eval_score(task_type, y_val, pred, proba)

# Print the results
for k, v in model_res.items():
    print(f"{k}: {v}")

# Save the model
save_path = "model.pkl"  ### change to the path where you want to save the model
with open(save_path, "wb") as file:
    pickle.dump(model, file)

XGB_limitdepth

To reproduce this algorithm, follow the reproduction steps for XGBoost.

RandomForest

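Experiments show that some models give different reproduction results on different devices. This is mainly because the random-number settings inside the algorithms differ (XGBoost, for example), and because our AutoML, constrained by its time budget, sometimes stops model training early. As a result, the reproduced results cannot match the platform exactly, but the accuracy stays close: it fluctuates by no more than 0.2%.

Reproduction Steps

Step 1: Download the model's training and validation sets

(1) Click the Advanced button to the right of the task, then click Reproduce in the drop-down menu.

(2) Click the items in the pop-up dialog in order.

① Click Download Training Set to get the (preprocessed) data used to train the model.

② Click Download Validation Set to get the (preprocessed) data used to validate the model.

③ Click the corresponding model to get its reproduction code. For example, if the model is extra_tree, click the extra_tree button.

(3) Load the data in the code.

import pandas as pd

# Settings
target_name = ""  ### change this to the target column name of your task

# Sanitize target_name: replace special characters with underscores
replace_str = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']
for s in target_name:
    if s in replace_str:
        target_name = target_name.replace(s, '_')

# Read the downloaded training and validation sets
train, validation = pd.read_csv("train.csv"), pd.read_csv("validation.csv")  ### change to the paths of the downloaded data
validation = validation[train.columns]

# Split features and labels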
X_train, y_train = train.drop(columns=target_name), train[target_name]
X_val, y_val = validation.drop(columns=target_name), validation[target_name]

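Step 2: Copy the model parameters

(1) Click the More button to the right of the task, then click View Model Parameters in the drop-down menu.

(2) Copy the parameters listed under the automated machine learning parameter configuration.

(3) Set the parameters in the code.

# Model parameters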
params = {
      "criterion": "entropy",
      "max_leaves": 35,
      "max_features": 0.44995214704932407,
      "n_estimators": 34
}  ### replace with the parameters reported by changtianml

# Rename max_leaves to max_leaf_nodes, the name scikit-learn uses
if "max_leaves" in params:
    params["max_leaf_nodes"] = params.get("max_leaf_nodes", params.pop("max_leaves"))
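
The renaming is needed because scikit-learn's forest estimators name this limit `max_leaf_nodes`, while the exported parameters call it `max_leaves`. A small self-contained check of that logic follows; the helper name `rename_max_leaves` is hypothetical, and unlike the in-place code above it returns a copy.

```python
def rename_max_leaves(params: dict) -> dict:
    """Return a copy of params with max_leaves renamed to max_leaf_nodes.

    An explicit max_leaf_nodes already present in params takes precedence.
    """
    out = dict(params)
    if "max_leaves" in out:
        leaves = out.pop("max_leaves")
        out.setdefault("max_leaf_nodes", leaves)
    return out


print(rename_max_leaves({"criterion": "entropy", "max_leaves": 35}))
# {'criterion': 'entropy', 'max_leaf_nodes': 35}
```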

Step 3: Reproduce the model

Initialize the model in the code and train it.

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Task type
task_type = 'classification'  ### 'classification' for a classification task, 'regression' for a regression task

# Initialize the model with the parameters above
if task_type == 'classification':
    model = RandomForestClassifier(**params, random_state=12032022)
else:
    model = RandomForestRegressor(**params, random_state=12032022)

# Train the model
model.fit(X_train, y_train)

Step 4: Validate the model

Validate the trained model in the code.

# Validation function
def eval_score(task_type, y_val, y_pred, y_proba=None):
    """
    Args:
        task_type (str): task type, 'classification' or 'regression'
        y_val (pd.Series): true labels of the validation set
        y_pred (np.ndarray): labels predicted by the model
        y_proba (np.ndarray, optional): class probabilities predicted by the model, of shape
            (n_samples, n_classes); required only for classification tasks. Defaults to None.

    Returns:
        dict: values of the common metrics
    """
    if task_type == 'classification':
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss

        val_loss = log_loss(y_val, y_proba)
        val_f1 = f1_score(y_val, y_pred, average='weighted')
        val_accuracy = accuracy_score(y_val, y_pred)
        val_precision = precision_score(y_val, y_pred, average='weighted')
        val_recall = recall_score(y_val, y_pred, average='weighted')
        if y_proba.shape[1] > 1:
            num_classes = y_proba.shape[1]
            val_auc_roc = 0.0
            for class_idx in range(num_classes):
                val_auc_roc += roc_auc_score((y_val == class_idx).astype(int), y_proba[:, class_idx])
            val_auc_roc /= num_classes
        else:
            val_auc_roc = roc_auc_score(y_val, y_proba)
        return {
            "val_log_loss": val_loss,
            "val_f1": val_f1,
            "val_accuracy": val_accuracy,
            "val_precision": val_precision,
            "val_recall": val_recall,
            "val_auc": val_auc_roc,
        }
    else:
        from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
        import numpy as np

        val_mse = mean_squared_error(y_val, y_pred)
        val_rmse = np.sqrt(val_mse)
        val_mae = mean_absolute_error(y_val, y_pred)
        val_r_squared = r2_score(y_val, y_pred)
        if np.isnan(val_r_squared):
            val_r_squared = -1
        return {
            'val_MSE': val_mse,
            'val_RMSE': val_rmse,
            'val_MAE': val_mae,
            'val_R-squared': val_r_squared,
        }

# Predict on the validation set
pred = model.predict(X_val)
if task_type == 'classification':
    proba = model.predict_proba(X_val)
else:
    proba = None

# Compute the validation metrics
model_res = eval_score(task_type, y_val, pred, proba)

# Print the results
for k, v in model_res.items():
    print(f"{k}: {v}")

Step 5: Save the model

Save the trained model in the code.

import pickle
save_path = "model.pkl"  ### change to the path where you want to save the model
with open(save_path, "wb") as file:
    pickle.dump(model, file)

Complete Code

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import pandas as pd
import pickle

# Settings
target_name = ""  ### change this to the target column name of your task

# Sanitize target_name: replace special characters with underscores
replace_str = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']
for s in target_name:
    if s in replace_str:
        target_name = target_name.replace(s, '_')

# Task type
task_type = 'classification'  ### 'classification' for a classification task, 'regression' for a regression task

# Model parameters
params = {
      "criterion": "entropy",
      "max_leaves": 35,
      "max_features": 0.44995214704932407,
      "n_estimators": 34
}  ### replace with the parameters reported by changtianml

# Rename max_leaves to max_leaf_nodes, the name scikit-learn uses
if "max_leaves" in params:
    params["max_leaf_nodes"] = params.get("max_leaf_nodes", params.pop("max_leaves"))

# Read the downloaded training and validation sets
train, validation = pd.read_csv("train.csv"), pd.read_csv("validation.csv")  ### change to the paths of the downloaded data
validation = validation[train.columns]

# Split features and labels
X_train, y_train = train.drop(columns=target_name), train[target_name]
X_val, y_val = validation.drop(columns=target_name), validation[target_name]

# Initialize the model with the parameters above
if task_type == 'classification':
    model = RandomForestClassifier(**params, random_state=12032022)
else:
    model = RandomForestRegressor(**params, random_state=12032022)

# Train the model
model.fit(X_train, y_train)

# Validation function
def eval_score(task_type, y_val, y_pred, y_proba=None):
    """
    Args:
        task_type (str): task type, 'classification' or 'regression'
        y_val (pd.Series): true labels of the validation set
        y_pred (np.ndarray): labels predicted by the model
        y_proba (np.ndarray, optional): class probabilities predicted by the model, of shape
            (n_samples, n_classes); required only for classification tasks. Defaults to None.

    Returns:
        dict: values of the common metrics
    """
    if task_type == 'classification':
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss

        val_loss = log_loss(y_val, y_proba)
        val_f1 = f1_score(y_val, y_pred, average='weighted')
        val_accuracy = accuracy_score(y_val, y_pred)
        val_precision = precision_score(y_val, y_pred, average='weighted')
        val_recall = recall_score(y_val, y_pred, average='weighted')
        if y_proba.shape[1] > 1:
            num_classes = y_proba.shape[1]
            val_auc_roc = 0.0
            for class_idx in range(num_classes):
                val_auc_roc += roc_auc_score((y_val == class_idx).astype(int), y_proba[:, class_idx])
            val_auc_roc /= num_classes
        else:
            val_auc_roc = roc_auc_score(y_val, y_proba)
        return {
            "val_log_loss": val_loss,
            "val_f1": val_f1,
            "val_accuracy": val_accuracy,
            "val_precision": val_precision,
            "val_recall": val_recall,
            "val_auc": val_auc_roc,
        }
    else:
        from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
        import numpy as np

        val_mse = mean_squared_error(y_val, y_pred)
        val_rmse = np.sqrt(val_mse)
        val_mae = mean_absolute_error(y_val, y_pred)
        val_r_squared = r2_score(y_val, y_pred)
        if np.isnan(val_r_squared):
            val_r_squared = -1
        return {
            'val_MSE': val_mse,
            'val_RMSE': val_rmse,
            'val_MAE': val_mae,
            'val_R-squared': val_r_squared,
        }

# Predict on the validation set
pred = model.predict(X_val)
if task_type == 'classification':
    proba = model.predict_proba(X_val)
else:
    proba = None

# Compute the validation metrics
model_res = eval_score(task_type, y_val, pred, proba)

# Print the results
for k, v in model_res.items():
    print(f"{k}: {v}")

# Save the model
save_path = "model.pkl"  ### change to the path where you want to save the model
with open(save_path, "wb") as file:
    pickle.dump(model, file)

LightGBM

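Experiments show that some models give different reproduction results on different devices. This is mainly because the random-number settings inside the algorithms differ (XGBoost, for example), and because our AutoML, constrained by its time budget, sometimes stops model training early. As a result, the reproduced results cannot match the platform exactly, but the accuracy stays close: it fluctuates by no more than 0.2%.

Reproduction Steps

Step 1: Download the model's training and validation sets

(1) Click the Advanced button to the right of the task, then click Reproduce in the drop-down menu.

(2) Click the items in the pop-up dialog in order.

① Click Download Training Set to get the (preprocessed) data used to train the model.

② Click Download Validation Set to get the (preprocessed) data used to validate the model.

③ Click the corresponding model to get its reproduction code. For example, if the model is extra_tree, click the extra_tree button.

(3) Load the data in the code.

import pandas as pd

# Settings
target_name = ""  ### change this to the target column name of your task

# Sanitize target_name: replace special characters with underscores
replace_str = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']
for s in target_name:
    if s in replace_str:
        target_name = target_name.replace(s, '_')

# Read the downloaded training and validation sets
train, validation = pd.read_csv("train.csv"), pd.read_csv("validation.csv")  ### change to the paths of the downloaded data
validation = validation[train.columns]

# Split features and labels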
X_train, y_train = train.drop(columns=target_name), train[target_name]
X_val, y_val = validation.drop(columns=target_name), validation[target_name]

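Step 2: Copy the model parameters

(1) Click the More button to the right of the task, then click View Model Parameters in the drop-down menu.

(2) Copy the parameters listed under the automated machine learning parameter configuration.

(3) Set the parameters in the code.

# Model parameters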
params = {
      "reg_alpha": 0.0009765625,
      "num_leaves": 17,
      "reg_lambda": 5.050114945171102,
      "log_max_bin": 10,
      "n_estimators": 299,
      "learning_rate": 0.1381012964152795,
      "colsample_bytree": 0.5886737139361939,
      "min_child_samples": 4
}  ### replace with the parameters reported by changtianml

# Convert log_max_bin to LightGBM's max_bin: max_bin = 2**log_max_bin - 1
if "log_max_bin" in params:
    params["max_bin"] = (1 << params.pop("log_max_bin")) - 1
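
The exported `log_max_bin` is the base-2 logarithm of the bin count, and the code above turns it into LightGBM's `max_bin` as 2**log_max_bin - 1. A worked example of the arithmetic, with `to_max_bin` as a hypothetical helper name:

```python
def to_max_bin(log_max_bin: int) -> int:
    """Convert log_max_bin to max_bin: 2**log_max_bin - 1."""
    return (1 << log_max_bin) - 1


print(to_max_bin(10))  # 1023
print(to_max_bin(8))   # 255
```

With the sample parameters above (`log_max_bin` of 10), the model is therefore created with `max_bin=1023`.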

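Step 3: Reproduce the model

Initialize the model in the code and train it.

import lightgbm as lgbm

# Task type
task_type = 'classification'  ### 'classification' for a classification task, 'regression' for a regression task

# Initialize the model with the parameters above
if task_type == 'classification':
    model = lgbm.LGBMClassifier(**params)
else:
    model = lgbm.LGBMRegressor(**params)

# Train the model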
model.fit(X_train, y_train)

Step 4: Validate the model

Validate the trained model in the code.

# Validation function
def eval_score(task_type, y_val, y_pred, y_proba=None):
    """
    Args:
        task_type (str): task type, 'classification' or 'regression'
        y_val (pd.Series): true labels of the validation set
        y_pred (np.ndarray): labels predicted by the model
        y_proba (np.ndarray, optional): class probabilities predicted by the model, of shape
            (n_samples, n_classes); required only for classification tasks. Defaults to None.

    Returns:
        dict: values of the common metrics
    """
    if task_type == 'classification':
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss

        val_loss = log_loss(y_val, y_proba)
        val_f1 = f1_score(y_val, y_pred, average='weighted')
        val_accuracy = accuracy_score(y_val, y_pred)
        val_precision = precision_score(y_val, y_pred, average='weighted')
        val_recall = recall_score(y_val, y_pred, average='weighted')
        if y_proba.shape[1] > 1:
            num_classes = y_proba.shape[1]
            val_auc_roc = 0.0
            for class_idx in range(num_classes):
                val_auc_roc += roc_auc_score((y_val == class_idx).astype(int), y_proba[:, class_idx])
            val_auc_roc /= num_classes
        else:
            val_auc_roc = roc_auc_score(y_val, y_proba)
        return {
            "val_log_loss": val_loss,
            "val_f1": val_f1,
            "val_accuracy": val_accuracy,
            "val_precision": val_precision,
            "val_recall": val_recall,
            "val_auc": val_auc_roc,
        }
    else:
        from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
        import numpy as np

        val_mse = mean_squared_error(y_val, y_pred)
        val_rmse = np.sqrt(val_mse)
        val_mae = mean_absolute_error(y_val, y_pred)
        val_r_squared = r2_score(y_val, y_pred)
        if np.isnan(val_r_squared):
            val_r_squared = -1
        return {
            'val_MSE': val_mse,
            'val_RMSE': val_rmse,
            'val_MAE': val_mae,
            'val_R-squared': val_r_squared,
        }

# Predict on the validation set
pred = model.predict(X_val)
if task_type == 'classification':
    proba = model.predict_proba(X_val)
else:
    proba = None

# Compute the validation metrics
model_res = eval_score(task_type, y_val, pred, proba)

# Print the results
for k, v in model_res.items():
    print(f"{k}: {v}")

Step 5: Save the model

Save the trained model in the code.

import pickle
save_path = "model.pkl"  ### change to the path where you want to save the model
with open(save_path, "wb") as file:
    pickle.dump(model, file)

Complete Code

import lightgbm as lgbm
import pandas as pd
import pickle

# Settings
target_name = ""  ### change this to the target column name of your task

# Sanitize target_name: replace special characters with underscores
replace_str = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']
for s in target_name:
    if s in replace_str:
        target_name = target_name.replace(s, '_')

# Task type
task_type = 'classification'  ### 'classification' for a classification task, 'regression' for a regression task

# Model parameters
params = {
      "reg_alpha": 0.0009765625,
      "num_leaves": 17,
      "reg_lambda": 5.050114945171102,
      "log_max_bin": 10,
      "n_estimators": 299,
      "learning_rate": 0.1381012964152795,
      "colsample_bytree": 0.5886737139361939,
      "min_child_samples": 4
}  ### replace with the parameters reported by changtianml

# Convert log_max_bin to LightGBM's max_bin: max_bin = 2**log_max_bin - 1
if "log_max_bin" in params:
    params["max_bin"] = (1 << params.pop("log_max_bin")) - 1

# Read the downloaded training and validation sets
train, validation = pd.read_csv("train.csv"), pd.read_csv("validation.csv")  ### change to the paths of the downloaded data
validation = validation[train.columns]

# Split features and labels
X_train, y_train = train.drop(columns=target_name), train[target_name]
X_val, y_val = validation.drop(columns=target_name), validation[target_name]

# Initialize the model with the parameters above
if task_type == 'classification':
    model = lgbm.LGBMClassifier(**params)
else:
    model = lgbm.LGBMRegressor(**params)

# Train the model
model.fit(X_train, y_train)

# Validation function
def eval_score(task_type, y_val, y_pred, y_proba=None):
    """
    Args:
        task_type (str): task type, 'classification' or 'regression'
        y_val (pd.Series): true labels of the validation set
        y_pred (np.ndarray): labels predicted by the model
        y_proba (np.ndarray, optional): class probabilities predicted by the model, of shape
            (n_samples, n_classes); required only for classification tasks. Defaults to None.

    Returns:
        dict: values of the common metrics
    """
    if task_type == 'classification':
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss

        val_loss = log_loss(y_val, y_proba)
        val_f1 = f1_score(y_val, y_pred, average='weighted')
        val_accuracy = accuracy_score(y_val, y_pred)
        val_precision = precision_score(y_val, y_pred, average='weighted')
        val_recall = recall_score(y_val, y_pred, average='weighted')
        if y_proba.shape[1] > 1:
            num_classes = y_proba.shape[1]
            val_auc_roc = 0.0
            for class_idx in range(num_classes):
                val_auc_roc += roc_auc_score((y_val == class_idx).astype(int), y_proba[:, class_idx])
            val_auc_roc /= num_classes
        else:
            val_auc_roc = roc_auc_score(y_val, y_proba)
        return {
            "val_log_loss": val_loss,
            "val_f1": val_f1,
            "val_accuracy": val_accuracy,
            "val_precision": val_precision,
            "val_recall": val_recall,
            "val_auc": val_auc_roc,
        }
    else:
        from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
        import numpy as np

        val_mse = mean_squared_error(y_val, y_pred)
        val_rmse = np.sqrt(val_mse)
        val_mae = mean_absolute_error(y_val, y_pred)
        val_r_squared = r2_score(y_val, y_pred)
        if np.isnan(val_r_squared):
            val_r_squared = -1
        return {
            'val_MSE': val_mse,
            'val_RMSE': val_rmse,
            'val_MAE': val_mae,
            'val_R-squared': val_r_squared,
        }

# Predict on the validation set
pred = model.predict(X_val)
if task_type == 'classification':
    proba = model.predict_proba(X_val)
else:
    proba = None

# Compute the validation metrics
model_res = eval_score(task_type, y_val, pred, proba)

# Print the results
for k, v in model_res.items():
    print(f"{k}: {v}")

# Save the model
save_path = "model.pkl"  ### change to the path where you want to save the model
with open(save_path, "wb") as file:
    pickle.dump(model, file)

CatBoost

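Experiments show that some models give different reproduction results on different devices. This is mainly because the random-number settings inside the algorithms differ (XGBoost, for example), and because our AutoML, constrained by its time budget, sometimes stops model training early. As a result, the reproduced results cannot match the platform exactly, but the accuracy stays close: it fluctuates by no more than 0.2%.

Reproduction Steps

Step 1: Download the model's training and validation sets

(1) Click the Advanced button to the right of the task, then click Reproduce in the drop-down menu.

(2) Click the items in the pop-up dialog in order.

① Click Download Training Set to get the (preprocessed) data used to train the model.

② Click Download Validation Set to get the (preprocessed) data used to validate the model.

③ Click the corresponding model to get its reproduction code. For example, if the model is extra_tree, click the extra_tree button.

(3) Load the data in the code.

import pandas as pd

# Settings
target_name = ""  ### change this to the target column name of your task

# Sanitize target_name: replace special characters with underscores
replace_str = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']
for s in target_name:
    if s in replace_str:
        target_name = target_name.replace(s, '_')

# Read the downloaded training and validation sets
train, validation = pd.read_csv("train.csv"), pd.read_csv("validation.csv")  ### change to the paths of the downloaded data
validation = validation[train.columns]

# Split features and labels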
X_train, y_train = train.drop(columns=target_name), train[target_name]
X_val, y_val = validation.drop(columns=target_name), validation[target_name]

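Step 2: Copy the model parameters

(1) Click the More button to the right of the task, then click View Model Parameters in the drop-down menu.

(2) Copy the parameters listed under the automated machine learning parameter configuration.

(3) Set the parameters in the code.

# Model parameters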
params = {
      "n_estimators": 92,
      "learning_rate": 0.07278722591827413,
      "early_stopping_rounds": 13
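}  ### replace with the parameters reported by changtianml

Step 3: Reproduce the model

Initialize the model in the code and train it.

import catboost as cb

# Hold out the tail of the training set as the evaluation set for early stopping
n = max(int(len(y_train) * 0.9), len(y_train) - 1000)
eval_set = cb.Pool(data=X_train.iloc[n:], label=y_train.iloc[n:])

# Task type
task_type = 'classification'  ### 'classification' for a classification task, 'regression' for a regression task

# Initialize the model with the parameters above
if task_type == 'classification':
    model = cb.CatBoostClassifier(**params, random_seed=10242048)
else:
    model = cb.CatBoostRegressor(**params, random_seed=10242048)

# Train the model on the remaining rows
model.fit(X_train.iloc[:n], y_train.iloc[:n], verbose=False, eval_set=eval_set)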
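
The split index `n` keeps the last 10% of the training rows for early stopping, capped at 1000 rows. The sketch below evaluates the same expression for a few training-set sizes; `holdout_split` is a hypothetical helper name used only for illustration.

```python
def holdout_split(n_rows: int) -> tuple:
    """Return (rows used for fitting, rows held out for early stopping)."""
    n = max(int(n_rows * 0.9), n_rows - 1000)
    return n, n_rows - n


for size in (500, 10_000, 50_000):
    print(size, holdout_split(size))
# 500 (450, 50)
# 10000 (9000, 1000)
# 50000 (49000, 1000)
```

Below 10,000 rows the 10% rule applies; above it the held-out part stays at 1000 rows.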

Step 4: Validate the model

Validate the trained model in the code.

# Validation function
def eval_score(task_type, y_val, y_pred, y_proba=None):
    """
    Args:
        task_type (str): task type, 'classification' or 'regression'
        y_val (pd.Series): true labels of the validation set
        y_pred (np.ndarray): labels predicted by the model
        y_proba (np.ndarray, optional): class probabilities predicted by the model, of shape
            (n_samples, n_classes); required only for classification tasks. Defaults to None.

    Returns:
        dict: values of the common metrics
    """
    if task_type == 'classification':
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss

        val_loss = log_loss(y_val, y_proba)
        val_f1 = f1_score(y_val, y_pred, average='weighted')
        val_accuracy = accuracy_score(y_val, y_pred)
        val_precision = precision_score(y_val, y_pred, average='weighted')
        val_recall = recall_score(y_val, y_pred, average='weighted')
        if y_proba.shape[1] > 1:
            num_classes = y_proba.shape[1]
            val_auc_roc = 0.0
            for class_idx in range(num_classes):
                val_auc_roc += roc_auc_score((y_val == class_idx).astype(int), y_proba[:, class_idx])
            val_auc_roc /= num_classes
        else:
            val_auc_roc = roc_auc_score(y_val, y_proba)
        return {
            "val_log_loss": val_loss,
            "val_f1": val_f1,
            "val_accuracy": val_accuracy,
            "val_precision": val_precision,
            "val_recall": val_recall,
            "val_auc": val_auc_roc,
        }
    else:
        from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
        import numpy as np

        val_mse = mean_squared_error(y_val, y_pred)
        val_rmse = np.sqrt(val_mse)
        val_mae = mean_absolute_error(y_val, y_pred)
        val_r_squared = r2_score(y_val, y_pred)
        if np.isnan(val_r_squared):
            val_r_squared = -1
        return {
            'val_MSE': val_mse,
            'val_RMSE': val_rmse,
            'val_MAE': val_mae,
            'val_R-squared': val_r_squared,
        }

# Predict on the validation set
pred = model.predict(X_val)
if task_type == 'classification':
    proba = model.predict_proba(X_val)
else:
    proba = None

# Compute the validation metrics
model_res = eval_score(task_type, y_val, pred, proba)

# Print the results
for k, v in model_res.items():
    print(f"{k}: {v}")

Step 5: Save the model

Save the trained model in the code.

import pickle
save_path = "model.pkl"  ### change to the path where you want to save the model
with open(save_path, "wb") as file:
    pickle.dump(model, file)

Complete Code

import catboost as cb
import pandas as pd
import pickle

# Settings
target_name = ""  ### change this to the target column name of your task

# Sanitize target_name: replace special characters with underscores
replace_str = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']
for s in target_name:
    if s in replace_str:
        target_name = target_name.replace(s, '_')

# Task type
task_type = 'classification'  ### 'classification' for a classification task, 'regression' for a regression task

# Model parameters
params = {
      "n_estimators": 92,
      "learning_rate": 0.07278722591827413,
      "early_stopping_rounds": 13
}  ### replace with the parameters reported by changtianml

# Read the downloaded training and validation sets
train, validation = pd.read_csv("train.csv"), pd.read_csv("validation.csv")  ### change to the paths of the downloaded data
validation = validation[train.columns]

# Split features and labels
X_train, y_train = train.drop(columns=target_name), train[target_name]
X_val, y_val = validation.drop(columns=target_name), validation[target_name]

# Hold out the tail of the training set as the evaluation set for early stopping
n = max(int(len(y_train) * 0.9), len(y_train) - 1000)
eval_set = cb.Pool(data=X_train.iloc[n:], label=y_train.iloc[n:])

# Initialize the model with the parameters above
if task_type == 'classification':
    model = cb.CatBoostClassifier(**params, random_seed=10242048)
else:
    model = cb.CatBoostRegressor(**params, random_seed=10242048)

# Train the model on the remaining rows
model.fit(X_train.iloc[:n], y_train.iloc[:n], verbose=False, eval_set=eval_set)

# Validation function
def eval_score(task_type, y_val, y_pred, y_proba=None):
    """
    Args:
        task_type (str): task type, 'classification' or 'regression'
        y_val (pd.Series): true labels of the validation set
        y_pred (np.ndarray): labels predicted by the model
        y_proba (np.ndarray, optional): class probabilities predicted by the model, of shape
            (n_samples, n_classes); required only for classification tasks. Defaults to None.

    Returns:
        dict: values of the common metrics
    """
    if task_type == 'classification':
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss

        val_loss = log_loss(y_val, y_proba)
        val_f1 = f1_score(y_val, y_pred, average='weighted')
        val_accuracy = accuracy_score(y_val, y_pred)
        val_precision = precision_score(y_val, y_pred, average='weighted')
        val_recall = recall_score(y_val, y_pred, average='weighted')
        if y_proba.shape[1] > 1:
            num_classes = y_proba.shape[1]
            val_auc_roc = 0.0
            for class_idx in range(num_classes):
                val_auc_roc += roc_auc_score((y_val == class_idx).astype(int), y_proba[:, class_idx])
            val_auc_roc /= num_classes
        else:
            val_auc_roc = roc_auc_score(y_val, y_proba)
        return {
            "val_log_loss": val_loss,
            "val_f1": val_f1,
            "val_accuracy": val_accuracy,
            "val_precision": val_precision,
            "val_recall": val_recall,
            "val_auc": val_auc_roc,
        }
    else:
        from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
        import numpy as np

        val_mse = mean_squared_error(y_val, y_pred)
        val_rmse = np.sqrt(val_mse)
        val_mae = mean_absolute_error(y_val, y_pred)
        val_r_squared = r2_score(y_val, y_pred)
        if np.isnan(val_r_squared):
            val_r_squared = -1
        return {
            'val_MSE': val_mse,
            'val_RMSE': val_rmse,
            'val_MAE': val_mae,
            'val_R-squared': val_r_squared,
        }

# Predict on the validation set
pred = model.predict(X_val)
if task_type == 'classification':
    proba = model.predict_proba(X_val)
else:
    proba = None

# Compute the validation metrics
model_res = eval_score(task_type, y_val, pred, proba)

# Print the results
for k, v in model_res.items():
    print(f"{k}: \t{v}")

# Save the model
save_path = "model.pkl"  ### change to the path where you want to save the model
with open(save_path, "wb") as file:
    pickle.dump(model, file)

ExtraTree

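Experiments show that some models give different results when reproduced on different devices. This is mainly because some algorithms (XGBoost, for example) handle random numbers differently internally, and because our AutoML, constrained by its time budget, stops some model training early. As a result, a reproduced model cannot match the platform's result exactly, but the accuracy differs only slightly: the deviation stays within 0.2% in either direction.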
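As a rough way to apply the 0.2% tolerance mentioned above, the sketch below compares a reproduced accuracy with the value reported by the platform. The function name `within_tolerance` and the two example accuracies are hypothetical, and reading the tolerance as 0.2 percentage points of accuracy is an assumption.

```python
def within_tolerance(platform_acc: float, reproduced_acc: float, tol: float = 0.002) -> bool:
    """Return True if the two accuracies differ by at most `tol` (absolute difference)."""
    return abs(platform_acc - reproduced_acc) <= tol


# Hypothetical values: accuracy shown by the platform vs. accuracy of the local reproduction
platform_acc = 0.9130
reproduced_acc = 0.9118

print(within_tolerance(platform_acc, reproduced_acc))
```

If the check fails, it may be worth confirming that the parameters and the downloaded data match the task before suspecting the model itself.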

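Reproduction steps

Step 1: Download the model's training and validation sets

(1) Click the Advanced button to the right of the task, then click the Reproduce button in the dropdown.

(2) Click through the dialog that pops up, in order.

① Click Download training set to get the (preprocessed) data the model was trained on.

② Click Download validation set to get the (preprocessed) data the model was validated on.

③ Click the model you want in order to get its reproduction code. For example, the code below uses the extra_tree model, so click the extra_tree button.

(3) Load the data in code.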

import pandas as pd

# Settings
target_name = ""  ### Change this to the target column name of your task

# Make target_name valid: replace unsupported characters with "_"
replace_str = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']
for s in replace_str:
    target_name = target_name.replace(s, '_')

# Load the downloaded training and validation sets
train, validation = pd.read_csv("train.csv"), pd.read_csv("validation.csv")  ### Change to the paths of the downloaded data
validation = validation[train.columns]

# Split features and labels
X_train, y_train = train.drop(columns=target_name), train[target_name]
X_val, y_val = validation.drop(columns=target_name), validation[target_name]
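To see what the target-name check in the loading code does, here is a small standalone sketch. The helper name `sanitize_target_name` and the sample column name are hypothetical; the character list is the one used above.

```python
# Characters that the loading code replaces with "_"
REPLACE_CHARS = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']


def sanitize_target_name(name: str) -> str:
    """Replace every character listed in REPLACE_CHARS with an underscore."""
    for ch in REPLACE_CHARS:
        name = name.replace(ch, '_')
    return name


print(sanitize_target_name("price (USD)"))  # prints "price__USD_"
```

Since the loading code indexes the downloaded data with the sanitized name, the CSV files presumably use that form of the column name.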

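Step 2: Copy the model parameters

(1) Click the More button to the right of the task, then click the View model parameters button in the dropdown.

(2) Copy the parameters listed under the automated machine learning parameter configuration.

(3) Set the parameters in code.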

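# Set the model parameters
params = {
    "criterion": "gini",
    "max_leaves": 12,
    "max_features": 1,
    "n_estimators": 50
}  ### Change these to the values obtained from changtianml

# scikit-learn names this parameter max_leaf_nodes, so rename max_leaves
if "max_leaves" in params:
    params["max_leaf_nodes"] = params.get("max_leaf_nodes", params.pop("max_leaves"))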
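The renaming step above can be wrapped in a small helper so that the original dictionary is left untouched. This is only a sketch: the function name `to_sklearn_params` is hypothetical, and it covers just the `max_leaves` to `max_leaf_nodes` renaming shown above.

```python
def to_sklearn_params(params: dict) -> dict:
    """Return a copy of `params` with `max_leaves` renamed to `max_leaf_nodes`."""
    converted = dict(params)  # shallow copy so the caller's dict is not modified
    if "max_leaves" in converted:
        max_leaves = converted.pop("max_leaves")
        # An explicitly given max_leaf_nodes takes precedence, as in the code above
        converted.setdefault("max_leaf_nodes", max_leaves)
    return converted


platform_params = {"criterion": "gini", "max_leaves": 12, "max_features": 1, "n_estimators": 50}
print(to_sklearn_params(platform_params))
```

Working on a copy keeps the values pasted from the platform available for later comparison.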

Step 3: Reproduce the model

Initialize the model in code and train it.

from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor

# Set the task type
task_type = 'classification'  ### Use 'classification' for a classification task and 'regression' for a regression task

# Initialize the model
if task_type == 'classification':
    model = ExtraTreesClassifier(**params, random_state=12032022)
else:
    model = ExtraTreesRegressor(**params, random_state=12032022)

# Train the model
model.fit(X_train, y_train)
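The fixed `random_state` is what makes this training step repeatable on a given machine. The sketch below illustrates this on synthetic data from `make_classification` (the data and the parameter values are placeholders, not the platform's): two models built with the same seed give identical predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the downloaded training data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

params = {"criterion": "gini", "max_leaf_nodes": 12, "n_estimators": 50}

# Two independent models trained with the same seed
model_a = ExtraTreesClassifier(**params, random_state=12032022).fit(X, y)
model_b = ExtraTreesClassifier(**params, random_state=12032022).fit(X, y)

# Compare the predicted probabilities of the two models
same = np.array_equal(model_a.predict_proba(X), model_b.predict_proba(X))
print(same)
```

Across machines or library versions the results can still differ, as noted at the top of this section.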

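Step 4: Validate the model

Validate the trained model in code.

# Model validation function
def eval_score(task_type, y_val, y_pred, y_proba=None):
    """
    Compute common metrics on the validation set.

    Args:
        task_type (str): Task type, 'classification' or 'regression'.
        y_val (pd.Series): True labels of the validation set.
        y_pred (np.ndarray): Labels predicted by the model.
        y_proba (np.ndarray, optional): Class probabilities predicted by the model; only needed for classification tasks. Defaults to None.

    Returns:
        dict: Computed values of the common metrics.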
    """
    if task_type == 'classification':
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss

        val_loss = log_loss(y_val, y_proba)
        val_f1 = f1_score(y_val, y_pred, average='weighted')
        val_accuracy = accuracy_score(y_val, y_pred)
        val_precision = precision_score(y_val, y_pred, average='weighted')
        val_recall = recall_score(y_val, y_pred, average='weighted')
        if y_proba.shape[1] > 1:
            num_classes = y_proba.shape[1]
            val_auc_roc = 0.0
            for class_idx in range(num_classes):
                val_auc_roc += roc_auc_score((y_val == class_idx).astype(int), y_proba[:, class_idx])
            val_auc_roc /= num_classes
        else:
            val_auc_roc = roc_auc_score(y_val, y_proba)
        return {
            "val_log_loss": val_loss,
            "val_f1": val_f1,
            "val_accuracy": val_accuracy,
            "val_precision": val_precision,
            "val_recall": val_recall,
            "val_auc": val_auc_roc,
        }
    else:
        from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
        import numpy as np

        val_mse = mean_squared_error(y_val, y_pred)
        val_rmse = np.sqrt(val_mse)
        val_mae = mean_absolute_error(y_val, y_pred)
        val_r_squared = r2_score(y_val, y_pred)
        if np.isnan(val_r_squared):
            val_r_squared = -1
        return {
            'val_MSE': val_mse,
            'val_RMSE': val_rmse,
            'val_MAE': val_mae,
            'val_R-squared': val_r_squared,
        }

# Model predictions
pred = model.predict(X_val)
if task_type == 'classification':
    proba = model.predict_proba(X_val)
else:
    proba = None

# Model validation
model_res = eval_score(task_type, y_val, pred, proba)

# Print the results
for k, v in model_res.items():
    print(f"{k}: {v}")
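The AUC in `eval_score` is a one-vs-rest average over the columns of `y_proba`, which assumes that the labels are encoded as 0, 1, ..., K-1 in the same order as those columns. Under that assumption it should agree with scikit-learn's built-in macro one-vs-rest AUC. The sketch below checks this on synthetic three-class data; the dataset and the model settings are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data with labels 0, 1, 2
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

model = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
proba = model.predict_proba(X_va)

# Manual one-vs-rest average, computed the same way as in eval_score
manual_auc = np.mean([
    roc_auc_score((y_va == k).astype(int), proba[:, k])
    for k in range(proba.shape[1])
])

# Built-in macro one-vs-rest AUC
builtin_auc = roc_auc_score(y_va, proba, multi_class="ovr", average="macro")

print(np.isclose(manual_auc, builtin_auc))
```

If the labels in your data are not 0 to K-1, the columns of `predict_proba` follow `model.classes_`, so the comparison inside the loop would need `model.classes_[k]` instead of `k`.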

Step 5: Save the model

Save the trained model in code.

import pickle

save_path = "model.pkl"  ### Change to the path where you want to save the model
with open(save_path, "wb") as file:
    pickle.dump(model, file)
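To confirm that the saved file is usable, the model can be loaded back with `pickle.load` and its predictions compared with those of the in-memory model. The sketch below does this with a small stand-in model on synthetic data and a temporary directory; these are placeholders for your own `model`, `X_val`, and `save_path`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Stand-in for the trained model and the validation features
X, y = make_classification(n_samples=100, n_features=6, random_state=0)
model = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = os.path.join(tmp_dir, "model.pkl")

    # Save the model
    with open(save_path, "wb") as file:
        pickle.dump(model, file)

    # Load it back
    with open(save_path, "rb") as file:
        loaded_model = pickle.load(file)

# The reloaded model should predict exactly like the original
identical = np.array_equal(model.predict(X), loaded_model.predict(X))
print(identical)
```

Only unpickle files from sources you trust, and where possible load the file with the same scikit-learn version that was used to save it.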

Complete code

from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
import pandas as pd
import pickle

# Settings
target_name = ""  ### Change this to the target column name of your task

# Make target_name valid: replace unsupported characters with "_"
replace_str = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']
for s in replace_str:
    target_name = target_name.replace(s, '_')

# Set the task type
task_type = 'classification'  ### Use 'classification' for a classification task and 'regression' for a regression task

# Set the model parameters
params = {
    "criterion": "gini",
    "max_leaves": 12,
    "max_features": 1,
    "n_estimators": 50
}  ### Change these to the values obtained from changtianml

# scikit-learn names this parameter max_leaf_nodes, so rename max_leaves
if "max_leaves" in params:
    params["max_leaf_nodes"] = params.get("max_leaf_nodes", params.pop("max_leaves"))

# Load the downloaded training and validation sets
train, validation = pd.read_csv("train.csv"), pd.read_csv("validation.csv")  ### Change to the paths of the downloaded data
validation = validation[train.columns]

# Split features and labels
X_train, y_train = train.drop(columns=target_name), train[target_name]
X_val, y_val = validation.drop(columns=target_name), validation[target_name]

# Initialize the model
if task_type == 'classification':
    model = ExtraTreesClassifier(**params, random_state=12032022)
else:
    model = ExtraTreesRegressor(**params, random_state=12032022)

# Train the model
model.fit(X_train, y_train)

# Model validation function
def eval_score(task_type, y_val, y_pred, y_proba=None):
    """
    Compute common metrics on the validation set.

    Args:
        task_type (str): Task type, 'classification' or 'regression'.
        y_val (pd.Series): True labels of the validation set.
        y_pred (np.ndarray): Labels predicted by the model.
        y_proba (np.ndarray, optional): Class probabilities predicted by the model; only needed for classification tasks. Defaults to None.

    Returns:
        dict: Computed values of the common metrics.
    """
    if task_type == 'classification':
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss

        val_loss = log_loss(y_val, y_proba)
        val_f1 = f1_score(y_val, y_pred, average='weighted')
        val_accuracy = accuracy_score(y_val, y_pred)
        val_precision = precision_score(y_val, y_pred, average='weighted')
        val_recall = recall_score(y_val, y_pred, average='weighted')
        if y_proba.shape[1] > 1:
            num_classes = y_proba.shape[1]
            val_auc_roc = 0.0
            for class_idx in range(num_classes):
                val_auc_roc += roc_auc_score((y_val == class_idx).astype(int), y_proba[:, class_idx])
            val_auc_roc /= num_classes
        else:
            val_auc_roc = roc_auc_score(y_val, y_proba)
        return {
            "val_log_loss": val_loss,
            "val_f1": val_f1,
            "val_accuracy": val_accuracy,
            "val_precision": val_precision,
            "val_recall": val_recall,
            "val_auc": val_auc_roc,
        }
    else:
        from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
        import numpy as np

        val_mse = mean_squared_error(y_val, y_pred)
        val_rmse = np.sqrt(val_mse)
        val_mae = mean_absolute_error(y_val, y_pred)
        val_r_squared = r2_score(y_val, y_pred)
        if np.isnan(val_r_squared):
            val_r_squared = -1
        return {
            'val_MSE': val_mse,
            'val_RMSE': val_rmse,
            'val_MAE': val_mae,
            'val_R-squared': val_r_squared,
        }

# Model predictions
pred = model.predict(X_val)
if task_type == 'classification':
    proba = model.predict_proba(X_val)
else:
    proba = None

# Model validation
model_res = eval_score(task_type, y_val, pred, proba)

# Print the results
for k, v in model_res.items():
    print(f"{k}:\t{v}")

# Save the model
save_path = "model.pkl"  ### Change to the path where you want to save the model
with open(save_path, "wb") as file:
    pickle.dump(model, file)

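Linear Regression

Experiments show that some models give different results when reproduced on different devices. This is mainly because some algorithms (XGBoost, for example) handle random numbers differently internally, and because our AutoML, constrained by its time budget, stops some model training early. As a result, a reproduced model cannot match the platform's result exactly, but the accuracy differs only slightly: the deviation stays within 0.2% in either direction.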

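Reproduction steps

Step 1: Download the model's training and validation sets

(1) Click the Advanced button to the right of the task, then click the Reproduce button in the dropdown.

(2) Click through the dialog that pops up, in order.

① Click Download training set to get the (preprocessed) data the model was trained on.

② Click Download validation set to get the (preprocessed) data the model was validated on.

③ Click the model you want in order to get its reproduction code. For example, the code below uses the linear model, so click the button of the corresponding linear model.

(3) Load the data in code.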

import pandas as pd

# Settings
target_name = ""  ### Change this to the target column name of your task

# Make target_name valid: replace unsupported characters with "_"
replace_str = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']
for s in replace_str:
    target_name = target_name.replace(s, '_')

# Load the downloaded training and validation sets
train, validation = pd.read_csv("train.csv"), pd.read_csv("validation.csv")  ### Change to the paths of the downloaded data
validation = validation[train.columns]

# Split features and labels
X_train, y_train = train.drop(columns=target_name), train[target_name]
X_val, y_val = validation.drop(columns=target_name), validation[target_name]

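Step 2: Copy the model parameters

(1) Click the More button to the right of the task, then click the View model parameters button in the dropdown.

(2) Copy the parameters listed under the automated machine learning parameter configuration.

Here, lrc denotes the linear model for classification and lrr denotes the linear model for regression.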
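One way to keep this naming straight in code is a small lookup table. The sketch below maps the two names to the scikit-learn classes used in the reproduction code of this section; the dictionary `LINEAR_MODELS` and the helper `make_linear_model` are hypothetical names introduced for illustration.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# lrc: linear model for classification, lrr: linear model for regression
LINEAR_MODELS = {
    "lrc": LogisticRegression,
    "lrr": LinearRegression,
}


def make_linear_model(name: str):
    """Create an untrained linear model from its short name ("lrc" or "lrr")."""
    if name not in LINEAR_MODELS:
        raise ValueError(f"Unknown linear model name: {name!r}")
    return LINEAR_MODELS[name]()


model = make_linear_model("lrc")
print(type(model).__name__)  # prints "LogisticRegression"
```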

Step 3: Reproduce the model

Initialize the model in code and train it.

from sklearn.linear_model import LogisticRegression, LinearRegression

# Set the task type
task_type = 'classification'  ### Use 'classification' for a classification task and 'regression' for a regression task

# Initialize the model
if task_type == 'classification':
    model = LogisticRegression()
else:
    model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

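Step 4: Validate the model

Validate the trained model in code.

# Model validation function
def eval_score(task_type, y_val, y_pred, y_proba=None):
    """
    Compute common metrics on the validation set.

    Args:
        task_type (str): Task type, 'classification' or 'regression'.
        y_val (pd.Series): True labels of the validation set.
        y_pred (np.ndarray): Labels predicted by the model.
        y_proba (np.ndarray, optional): Class probabilities predicted by the model; only needed for classification tasks. Defaults to None.

    Returns:
        dict: Computed values of the common metrics.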
    """
    if task_type == 'classification':
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss

        val_loss = log_loss(y_val, y_proba)
        val_f1 = f1_score(y_val, y_pred, average='weighted')
        val_accuracy = accuracy_score(y_val, y_pred)
        val_precision = precision_score(y_val, y_pred, average='weighted')
        val_recall = recall_score(y_val, y_pred, average='weighted')
        if y_proba.shape[1] > 1:
            num_classes = y_proba.shape[1]
            val_auc_roc = 0.0
            for class_idx in range(num_classes):
                val_auc_roc += roc_auc_score((y_val == class_idx).astype(int), y_proba[:, class_idx])
            val_auc_roc /= num_classes
        else:
            val_auc_roc = roc_auc_score(y_val, y_proba)
        return {
            "val_log_loss": val_loss,
            "val_f1": val_f1,
            "val_accuracy": val_accuracy,
            "val_precision": val_precision,
            "val_recall": val_recall,
            "val_auc": val_auc_roc,
        }
    else:
        from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
        import numpy as np

        val_mse = mean_squared_error(y_val, y_pred)
        val_rmse = np.sqrt(val_mse)
        val_mae = mean_absolute_error(y_val, y_pred)
        val_r_squared = r2_score(y_val, y_pred)
        if np.isnan(val_r_squared):
            val_r_squared = -1
        return {
            'val_MSE': val_mse,
            'val_RMSE': val_rmse,
            'val_MAE': val_mae,
            'val_R-squared': val_r_squared,
        }

# Model predictions
pred = model.predict(X_val)
if task_type == 'classification':
    proba = model.predict_proba(X_val)
else:
    proba = None

# Model validation
model_res = eval_score(task_type, y_val, pred, proba)

# Print the results
for k, v in model_res.items():
    print(f"{k}: {v}")

Step 5: Save the model

Save the trained model in code.

import pickle

save_path = "model.pkl"  ### Change to the path where you want to save the model
with open(save_path, "wb") as file:
    pickle.dump(model, file)

Complete code

from sklearn.linear_model import LogisticRegression, LinearRegression
import pandas as pd
import pickle

# Settings
target_name = ""  ### Change this to the target column name of your task

# Make target_name valid: replace unsupported characters with "_"
replace_str = [':', '[', ']', '(', ')', '!', '@', '#', '¥', '%', '…', '《', '》', '【', '】', ' ']
for s in replace_str:
    target_name = target_name.replace(s, '_')

# Set the task type
task_type = 'classification'  ### Use 'classification' for a classification task and 'regression' for a regression task

# Load the downloaded training and validation sets
train, validation = pd.read_csv("train.csv"), pd.read_csv("validation.csv")  ### Change to the paths of the downloaded data
validation = validation[train.columns]

# Split features and labels
X_train, y_train = train.drop(columns=target_name), train[target_name]
X_val, y_val = validation.drop(columns=target_name), validation[target_name]

# Initialize the model
if task_type == 'classification':
    model = LogisticRegression()
else:
    model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Model validation function
def eval_score(task_type, y_val, y_pred, y_proba=None):
    """
    Compute common metrics on the validation set.

    Args:
        task_type (str): Task type, 'classification' or 'regression'.
        y_val (pd.Series): True labels of the validation set.
        y_pred (np.ndarray): Labels predicted by the model.
        y_proba (np.ndarray, optional): Class probabilities predicted by the model; only needed for classification tasks. Defaults to None.

    Returns:
        dict: Computed values of the common metrics.
    """
    if task_type == 'classification':
        from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss

        val_loss = log_loss(y_val, y_proba)
        val_f1 = f1_score(y_val, y_pred, average='weighted')
        val_accuracy = accuracy_score(y_val, y_pred)
        val_precision = precision_score(y_val, y_pred, average='weighted')
        val_recall = recall_score(y_val, y_pred, average='weighted')
        if y_proba.shape[1] > 1:
            num_classes = y_proba.shape[1]
            val_auc_roc = 0.0
            for class_idx in range(num_classes):
                val_auc_roc += roc_auc_score((y_val == class_idx).astype(int), y_proba[:, class_idx])
            val_auc_roc /= num_classes
        else:
            val_auc_roc = roc_auc_score(y_val, y_proba)
        return {
            "val_log_loss": val_loss,
            "val_f1": val_f1,
            "val_accuracy": val_accuracy,
            "val_precision": val_precision,
            "val_recall": val_recall,
            "val_auc": val_auc_roc,
        }
    else:
        from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
        import numpy as np

        val_mse = mean_squared_error(y_val, y_pred)
        val_rmse = np.sqrt(val_mse)
        val_mae = mean_absolute_error(y_val, y_pred)
        val_r_squared = r2_score(y_val, y_pred)
        if np.isnan(val_r_squared):
            val_r_squared = -1
        return {
            'val_MSE': val_mse,
            'val_RMSE': val_rmse,
            'val_MAE': val_mae,
            'val_R-squared': val_r_squared,
        }

# Model predictions
pred = model.predict(X_val)
if task_type == 'classification':
    proba = model.predict_proba(X_val)
else:
    proba = None

# Model validation
model_res = eval_score(task_type, y_val, pred, proba)

# Print the results
for k, v in model_res.items():
    print(f"{k}:\t{v}")

# Save the model
save_path = "model.pkl"  ### Change to the path where you want to save the model
with open(save_path, "wb") as file:
    pickle.dump(model, file)
