KuaiRand-1K DIN Multi-Task Baseline

A reproducible baseline built on the KuaiRand-1K standard exposure logs.

  • Model: DIN encoder + pluggable multi-task backbone (currently MMoE)
  • Tasks: is_click / is_like / long_view
  • Experiment tracking: native wandb (offline by default)

Pipeline

  1. preprocess
    • History window: 2022-04-08 ~ 2022-04-21
    • Target window: 2022-04-22 ~ 2022-05-08
    • split
      • train: 2022-04-22 ~ 2022-05-04
      • val: 2022-05-05 ~ 2022-05-06
      • test: 2022-05-07 ~ 2022-05-08
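    • DIN sequences use only historical is_click=1 behaviors and support historical engagement features (is_click / is_like / long_view).
    • Preprocessing flow: read and validate schema -> encode and clean -> build strictly time-ordered samples -> normalize with train-only statistics -> write to disk.
    • Preprocessed results are written to a shared cache by default: artifacts/preprocessed_cache/<signature>/. Runs with the same data and split configuration reuse it automatically; to force a rebuild, set preprocess.force_rebuild = true.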
  2. train
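    • Joint training of three binary-classification tasks with BCE loss.
    • Default baseline: all three tasks use the same capped sqrt(neg/pos) positive-sample reweighting.
    • Optional unweighted baseline: train.loss.positive_weighting = "none" disables positive-sample reweighting.
    • Early stopping on validation mean_gAUC, falling back to mean_roc_auc when gAUC is unavailable.
  3. evaluate
    • Reports ROC-AUC / gAUC / PR-AUC / LogLoss / base_rate for each of the three tasks.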
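
As an illustration of the gAUC metric used for early stopping and evaluation, here is a minimal standard-library sketch. It assumes the common definition (per-user ROC-AUC averaged with impression-count weights, skipping users whose labels are all one class); the repository's metrics.py may differ in weighting or tie handling, and the function names roc_auc and group_auc are hypothetical.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """Rank-based AUC with average ranks for ties; None if only one class is present."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def group_auc(user_ids, labels, scores):
    """Impression-weighted mean of per-user AUC (assumed definition)."""
    groups = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        groups[u][0].append(y)
        groups[u][1].append(s)
    weighted_sum, total = 0.0, 0
    for ys, ss in groups.values():
        auc = roc_auc(ys, ss)
        if auc is None:  # user has only positives or only negatives
            continue
        weighted_sum += auc * len(ys)
        total += len(ys)
    return weighted_sum / total if total else None
```

In this sketch, group_auc returns None when no user has both classes, which is one plausible reading of the "unavailable" case that triggers the fallback to mean_roc_auc.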

Project Layout

configs/
  din_mmoe_baseline.toml
  din_mmoe_interaction_level.toml
  din_mmoe_score_boost.toml
  din_mmoe_unweighted_baseline.toml
kuairand_baseline/
  config.py
  constants.py
  dataset.py
  evaluate.py
  metrics.py
  model.py
  preprocess.py
  torch_utils.py
  train.py
main.py

Install

uv sync

W&B

wandb is enabled by default with mode = "offline".

[wandb]
enabled = true
project = "kuairand"
mode = "offline"
watch_model = false
log_model = false

To upload runs to the online dashboard:

  1. wandb login
  2. Change mode to online

Run

Weighted baseline (default)

python main.py preprocess --config configs/din_mmoe_baseline.toml
python main.py train --config configs/din_mmoe_baseline.toml
python main.py evaluate --config configs/din_mmoe_baseline.toml --split test

Unweighted baseline (no positive-sample reweighting)

python main.py preprocess --config configs/din_mmoe_unweighted_baseline.toml
python main.py train --config configs/din_mmoe_unweighted_baseline.toml
python main.py evaluate --config configs/din_mmoe_unweighted_baseline.toml --split test
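
The weighted and unweighted baselines differ in how positive samples are weighted in the BCE loss. The following standard-library sketch illustrates a capped sqrt(neg/pos) weight and a positively weighted BCE. The cap value of 10.0, the function names, and the mean-over-samples reduction are assumptions for illustration; the actual values and implementation live in the configs and train.py.

```python
import math


def capped_sqrt_pos_weight(labels, cap=10.0):
    """Return min(sqrt(neg / pos), cap); 1.0 when there are no positives."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0:
        return 1.0
    return min(math.sqrt(neg / pos), cap)


def weighted_bce(labels, probs, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy with positive terms scaled by pos_weight."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# One positive among five samples: sqrt(4 / 1) = 2.0
labels = [1, 0, 0, 0, 0]
weight = capped_sqrt_pos_weight(labels)
print(weight)
```

With pos_weight=1.0 the loss reduces to plain BCE, which corresponds to the positive_weighting = "none" setting in this sketch.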

score-boost experiment

python main.py preprocess --config configs/din_mmoe_score_boost.toml
python main.py train --config configs/din_mmoe_score_boost.toml
python main.py evaluate --config configs/din_mmoe_score_boost.toml --split test

interaction-level experiment

python main.py preprocess --config configs/din_mmoe_interaction_level.toml
python main.py train --config configs/din_mmoe_interaction_level.toml
python main.py evaluate --config configs/din_mmoe_interaction_level.toml --split test

Outputs

Shared preprocessing cache (default):

  • artifacts/preprocessed_cache/<signature>/
    • {train,val,test}/
      • sparse.npy
      • dense.npy
      • history.npy
      • history_engagement.npy
      • labels.npy
      • info.npy
    • item_sequence_lookup.npy
    • feature_meta.json
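
A quick way to sanity-check a cache directory against the layout above is to enumerate the expected files and report any that are missing. This is a hedged sketch: the helper names are hypothetical, and the file list is copied from the layout shown here rather than from preprocess.py.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")
SPLIT_FILES = (
    "sparse.npy",
    "dense.npy",
    "history.npy",
    "history_engagement.npy",
    "labels.npy",
    "info.npy",
)
SHARED_FILES = ("item_sequence_lookup.npy", "feature_meta.json")


def expected_cache_files(cache_dir):
    """List every file the layout above implies for one <signature> directory."""
    root = Path(cache_dir)
    paths = [root / split / name for split in SPLITS for name in SPLIT_FILES]
    paths += [root / name for name in SHARED_FILES]
    return paths


def missing_cache_files(cache_dir):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_cache_files(cache_dir) if not p.is_file()]
```

For example, missing_cache_files("artifacts/preprocessed_cache/<signature>") with the real signature substituted returns an empty list when the cache is complete.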

Experiment output directory (one per config output_dir):

  • <output_dir>/checkpoints/best_model.pt
  • <output_dir>/metrics/metrics.json
  • <output_dir>/metrics/training_history.json
  • <output_dir>/metrics/predictions_test.csv
  • <output_dir>/resolved_config.json

Notes

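  • The model.multitask_backbone interface is retained; mmoe is currently the only registered implementation.
  • The optional model.din.interaction_level_feature collapses historical interactions into a single discrete level and concatenates it to the DIN activation input: 0=padding/none, 1=click, 2=long_view, 3=like.
  • The optional model.din.score_boost additionally amplifies the DIN sequence attention score: is_like=15.0, long_view=11.5; when both apply, the larger value is used.
  • labels.npy always has three columns: is_click, is_like, long_view.
  • Normalization statistics are fitted strictly on the train split and then applied to val/test to avoid data leakage.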
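
The train-only normalization in the last note can be sketched as follows. This assumes z-score standardization of dense features, which the README does not confirm; the function names are hypothetical and the real logic is in preprocess.py.

```python
import math


def fit_standardizer(train_rows):
    """Compute per-column mean and std from the train split only."""
    n = len(train_rows)
    n_cols = len(train_rows[0])
    means = [sum(r[c] for r in train_rows) / n for c in range(n_cols)]
    stds = []
    for c in range(n_cols):
        var = sum((r[c] - means[c]) ** 2 for r in train_rows) / n
        stds.append(math.sqrt(var) or 1.0)  # constant column: fall back to 1.0
    return means, stds


def apply_standardizer(rows, means, stds):
    """Apply train-fitted statistics to any split without refitting."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


train = [[0.0, 5.0], [2.0, 5.0]]
val = [[3.0, 6.0]]
means, stds = fit_standardizer(train)
val_norm = apply_standardizer(val, means, stds)
```

The val and test splits never contribute to means or stds, so no information from the evaluation windows leaks into the features seen during training.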