Tuesday group meeting report: Phase-2 SFT experiment results

Terminology

Label simplification examples

LoTE-Animal (21 classes, original name → simplified name):

Original label          Simplified label
Aggregation             grouping
CircumanalGlandSigning  scent marking
DrinkWater              drinking
Defecating              pooping
Exploratory             exploring
Feeding                 eating
Miscellaneous           other
Parental                parenting
Smelling                sniffing
Urinating               peeing
UrineSigning            urine marking

MammalNet (12 classes, original name → simplified name):

Original label                        Simplified label
drinks water                          drinking
eats food                             eating
fights against other animals          fighting
gives birth to a baby                 birthing
grooms/cleans itself or other animal  grooming
hunts other animals                   hunting
mates with other animals              mating
nurses or breastfeeds its baby        nursing

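Why this works: the simplified labels are closer to the everyday verbs that are common in the LLM's pretraining distribution, and the candidates in the pool are also shorter in tokens, so the LLM finds it easier to produce the correct action during multiple-choice generation. The comparison table in Section 1 shows a consistent and substantial gain from label simplification on both datasets and both LLMs: at M=64, Top1 rises by 16.1 to 30.9 points and Macro-F1 by 14.1 to 27.0 points.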
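The mapping step can be sketched as a plain lookup table. This is a minimal illustration, not the project's actual preprocessing code: the dictionaries contain only the example rows listed above (not all 21 and 12 classes), and the names `LOTE_SIMPLIFIED`, `MAMMALNET_SIMPLIFIED`, `simplify_label`, and `build_candidate_pool` are hypothetical.

```python
from typing import Dict, Iterable, List

# Raw -> simplified label maps. Only the example rows from the tables above
# are included; the remaining classes are not shown in this report.
LOTE_SIMPLIFIED: Dict[str, str] = {
    "Aggregation": "grouping",
    "CircumanalGlandSigning": "scent marking",
    "DrinkWater": "drinking",
    "Defecating": "pooping",
    "Exploratory": "exploring",
    "Feeding": "eating",
    "Miscellaneous": "other",
    "Parental": "parenting",
    "Smelling": "sniffing",
    "Urinating": "peeing",
    "UrineSigning": "urine marking",
}

MAMMALNET_SIMPLIFIED: Dict[str, str] = {
    "drinks water": "drinking",
    "eats food": "eating",
    "fights against other animals": "fighting",
    "gives birth to a baby": "birthing",
    "grooms/cleans itself or other animal": "grooming",
    "hunts other animals": "hunting",
    "mates with other animals": "mating",
    "nurses or breastfeeds its baby": "nursing",
}


def simplify_label(raw: str, mapping: Dict[str, str]) -> str:
    """Return the simplified label; unmapped labels pass through unchanged
    (an assumed fallback, chosen here only for illustration)."""
    return mapping.get(raw, raw)


def build_candidate_pool(raw_labels: Iterable[str], mapping: Dict[str, str]) -> List[str]:
    """Simplify every raw label and drop duplicates while keeping order."""
    pool: List[str] = []
    for raw in raw_labels:
        simple = simplify_label(raw, mapping)
        if simple not in pool:
            pool.append(simple)
    return pool


print(build_candidate_pool(["Feeding", "DrinkWater", "Urinating"], LOTE_SIMPLIFIED))
# ['eating', 'drinking', 'peeing']
```

Predictions made in the simplified label space would need to be scored against ground truth mapped through the same table, so that the metrics stay comparable across the two settings.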

1. Label simplification comparison (actions_first, M=64)

Model                                   Dataset      M    Top1   Macro-F1
Qwen2.5-7B (no label simplification)    LoTE-Animal  64   42.81  14.99
Qwen2.5-7B (with label simplification)  LoTE-Animal  64   59.60  29.10
Qwen3-8B (no label simplification)      LoTE-Animal  64   32.51  6.82
Qwen3-8B (with label simplification)    LoTE-Animal  64   63.37  27.35
Qwen2.5-7B (no label simplification)    MammalNet    64   47.22  35.18
Qwen2.5-7B (with label simplification)  MammalNet    64   70.23  62.19
Qwen3-8B (no label simplification)      MammalNet    64   53.12  42.04
Qwen3-8B (with label simplification)    MammalNet    64   69.23  60.41
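For reference, the two metrics reported throughout these tables can be sketched as follows, using their conventional definitions: Top1 is the share of exact matches, and Macro-F1 is the unweighted mean of per-class F1. This is an assumption about the evaluation, not the project's actual scoring code. In particular, averaging over the union of ground-truth and predicted classes is one common convention, and the function name `top1_and_macro_f1` is hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def top1_and_macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float]:
    """Return (Top1, Macro-F1) in percent for single-label predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the true class but was missed

    top1 = sum(tp.values()) / len(y_true)

    # Assumed convention: average over every class seen in truth or prediction.
    classes = set(y_true) | set(y_pred)
    f1_scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)

    return 100.0 * top1, 100.0 * macro_f1


# A majority-class predictor on an imbalanced toy set: high Top1, low Macro-F1.
truth = ["eating"] * 8 + ["drinking", "other"]
preds = ["eating"] * 10
print(top1_and_macro_f1(truth, preds))
```

On this toy input, Top1 is 80.0 while Macro-F1 is about 29.6, because the two rare classes each contribute an F1 of zero. A gap of this kind is what class imbalance produces, which may be relevant when reading the LoTE-Animal rows, where Top1 is roughly twice Macro-F1.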

2. NewP1 + label simplification: analysis_first

Model       Dataset      M    Top1   Macro-F1
Qwen2.5-7B  LoTE-Animal  64   68.49  33.97
Qwen2.5-7B  LoTE-Animal  128  68.54  35.95
Qwen2.5-7B  LoTE-Animal  512  67.34  33.36
Qwen3-8B    LoTE-Animal  64   66.73  36.34
Qwen3-8B    LoTE-Animal  128  67.34  34.17
Qwen3-8B    LoTE-Animal  512  68.14  34.69
Qwen2.5-7B  MammalNet    64   70.45  60.92
Qwen2.5-7B  MammalNet    128  69.93  59.72
Qwen2.5-7B  MammalNet    512  68.68  58.05
Qwen3-8B    MammalNet    64   70.14  61.96
Qwen3-8B    MammalNet    128  70.33  61.17
Qwen3-8B    MammalNet    512  69.87  60.41

3. NewP1 + label simplification: actions_first

Model       Dataset      M    Top1   Macro-F1
Qwen2.5-7B  LoTE-Animal  64   59.60  29.10
Qwen2.5-7B  LoTE-Animal  128  49.60  23.70
Qwen2.5-7B  LoTE-Animal  512  58.39  29.91
Qwen3-8B    LoTE-Animal  64   63.37  27.35
Qwen3-8B    LoTE-Animal  128  63.02  31.50
Qwen3-8B    LoTE-Animal  512  64.72  30.15
Qwen2.5-7B  MammalNet    64   70.23  62.19
Qwen2.5-7B  MammalNet    128  69.29  58.91
Qwen2.5-7B  MammalNet    512  68.71  60.86
Qwen3-8B    MammalNet    64   69.23  60.41
Qwen3-8B    MammalNet    128  69.78  59.56
Qwen3-8B    MammalNet    512  69.38  59.19

4. Auxiliary classification head comparison (actions_first, M=64)

Model                           Dataset      M    Top1   Macro-F1
Qwen2.5-7B (no auxiliary head)  LoTE-Animal  64   59.60  29.10
Qwen2.5-7B (+ auxiliary head)   LoTE-Animal  64   55.18  31.99
Qwen2.5-7B (no auxiliary head)  MammalNet    64   70.23  62.19
Qwen2.5-7B (+ auxiliary head)   MammalNet    64   70.20  59.83

5. p1-own variant comparison (actions_first, M=64)

Model                  Dataset      M    Top1   Macro-F1
Qwen2.5-7B (baseline)  LoTE-Animal  64   59.60  29.10
Qwen2.5-7B (p1-own)    LoTE-Animal  64   53.92  23.96
Qwen3-8B (baseline)    LoTE-Animal  64   63.37  27.35
Qwen3-8B (p1-own)      LoTE-Animal  64   62.11  28.16
Qwen2.5-7B (baseline)  MammalNet    64   70.23  62.19
Qwen2.5-7B (p1-own)    MammalNet    64   72.54  65.53
Qwen3-8B (baseline)    MammalNet    64   69.23  60.41
Qwen3-8B (p1-own)      MammalNet    64   70.90  62.24

6. e2-drop10 variant comparison (actions_first, M=64)

Model                   Dataset      M    Top1   Macro-F1
Qwen2.5-7B (baseline)   LoTE-Animal  64   59.60  29.10
Qwen2.5-7B (e2-drop10)  LoTE-Animal  64   60.65  30.35
Qwen3-8B (baseline)     LoTE-Animal  64   63.37  27.35
Qwen3-8B (e2-drop10)    LoTE-Animal  64   63.92  27.02
Qwen2.5-7B (baseline)   MammalNet    64   70.23  62.19
Qwen2.5-7B (e2-drop10)  MammalNet    64   69.99  61.53
Qwen3-8B (baseline)     MammalNet    64   69.23  60.41
Qwen3-8B (e2-drop10)    MammalNet    64   70.05  60.23

7. flip-balanced variant comparison (actions_first, M=64)

Model                       Dataset      M    Top1   Macro-F1
Qwen2.5-7B (baseline)       LoTE-Animal  64   59.60  29.10
Qwen2.5-7B (flip-balanced)  LoTE-Animal  64   53.27  25.74
Qwen3-8B (baseline)         LoTE-Animal  64   63.37  27.35
Qwen3-8B (flip-balanced)    LoTE-Animal  64   46.38  19.11
Qwen2.5-7B (baseline)       MammalNet    64   70.23  62.19
Qwen2.5-7B (flip-balanced)  MammalNet    64   69.47  61.50
Qwen3-8B (baseline)         MammalNet    64   69.23  60.41
Qwen3-8B (flip-balanced)    MammalNet    64   68.84  60.26

8. Cross-dataset / zero-shot (actions_first, M=64)

Model                        Dataset      M    Top1   Macro-F1
Qwen2.5-7B (MammalNet→LoTE)  LoTE-Animal  64   26.38  14.29
Qwen3-8B (MammalNet→LoTE)    LoTE-Animal  64   10.75  9.57
Qwen2.5-7B (LoTE→MammalNet)  MammalNet    64   34.42  20.79
Qwen3-8B (LoTE→MammalNet)    MammalNet    64   38.70  23.91

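9. Q-former query count analysis (M ∈ {64, 128, 512})

To explain why increasing M does not further improve Top1 or Macro-F1, we ran a diversity analysis on the Q-former query_tokens of the best checkpoints from four newp1_simple runs (Qwen2.5-7B / Qwen3-8B × LoTE / MammalNet), to check whether the queries are actually fully used at each M. Every figure shows the average over the four combinations. Plotting script: plot_qformer_query_diversity.py; output directory: qformer_diversity_analysis/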

9.1 Singular value spectrum (log y-axis)

[Figure: Q-former singular value spectrum]

9.2 Cumulative energy curve

[Figure: Cumulative spectral energy]

9.3 Pairwise abs(cosine) distribution

[Figure: Pairwise abs cosine histogram]

9.4 Effective rank and Gini coefficient

[Figure: Effective rank and Gini]
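The four quantities in 9.1 to 9.4 can be sketched with NumPy as below. This is an illustration under stated assumptions, not the contents of plot_qformer_query_diversity.py. It assumes query_tokens has been reshaped to an (M, D) matrix with one query per row, that no centering is applied, that effective rank is the exponential of the entropy of the normalized singular values, and that the Gini coefficient is taken over the singular values. The function name `query_diversity` is hypothetical.

```python
import numpy as np


def query_diversity(Q: np.ndarray) -> dict:
    """Diversity statistics for an (M, D) matrix of query vectors."""
    Q = np.asarray(Q, dtype=np.float64)
    if Q.ndim != 2:
        raise ValueError("expected a 2-D (M, D) matrix, e.g. query_tokens squeezed to 2-D")

    # 9.1 Singular value spectrum (returned in descending order).
    s = np.linalg.svd(Q, compute_uv=False)

    # 9.2 Cumulative energy: share of squared singular values in the top-k.
    energy = s ** 2
    cumulative_energy = np.cumsum(energy) / energy.sum()

    # 9.3 Pairwise |cosine| between distinct queries (upper triangle only).
    norms = np.linalg.norm(Q, axis=1, keepdims=True)
    Qn = Q / np.clip(norms, 1e-12, None)
    cos = Qn @ Qn.T
    abs_cosine = np.abs(cos[np.triu_indices(Q.shape[0], k=1)])

    # 9.4a Effective rank (assumed definition): exp of the Shannon entropy
    # of the singular values normalized to sum to one.
    p = s / s.sum()
    p = p[p > 0]
    effective_rank = float(np.exp(-(p * np.log(p)).sum()))

    # 9.4b Gini coefficient of the singular values: 0 means a flat spectrum,
    # values near 1 mean a few directions dominate.
    x = np.sort(s)
    n = x.size
    ranks = np.arange(1, n + 1)
    gini = float(2.0 * (ranks * x).sum() / (n * x.sum()) - (n + 1) / n)

    return {
        "singular_values": s,
        "cumulative_energy": cumulative_energy,
        "abs_cosine": abs_cosine,
        "effective_rank": effective_rank,
        "gini": gini,
    }


# Sanity check on mutually orthogonal queries: every direction is used equally.
stats = query_diversity(np.eye(4))
print(stats["effective_rank"], stats["gini"], stats["abs_cosine"].max())
```

Under these definitions, fully used queries would show an effective rank close to min(M, D), a Gini coefficient near 0, and pairwise |cosine| values concentrated near 0. Redundant queries would show the opposite pattern.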

10. Key experiments still missing metrics

These directories already contain predictions.jsonl but no metrics.json yet, so they are not included in the tables above: