Local Dimension Enhancement Representation Learning for Skeleton-Based Action Segmentation

Shaofan Sun, Lilang Lin, Jiahang Zhang, Ling-Yu Duan, Jiaying Liu

Peking University
IEEE TIP 2026

Abstract

Most existing self-supervised learning methods for skeleton-based temporal action segmentation (TAS) fail to capture the short-term motion semantics essential for dense frame-level prediction, as they typically learn representations that are either too coarse or motion-insensitive. This issue is reflected in local dimension collapse, which highlights the limitations of current approaches and suggests directions for improvement. Specifically, to address the issue of local dimension collapse for self-supervised learning in TAS, we propose the Local Dimension Enhancement (LoDE) framework, which introduces the local effective rank (LER) as both a metric to measure this collapse and a learning objective to reduce it. A new fine-grained representation scale, termed a motion unit, is defined as a temporal clip of consecutive skeleton frames to model skeleton data. Centered on this representation scale, we analyze existing methods (sequence-scale and frame-scale learning) using LER and theoretically demonstrate that introducing motion unit-scale learning is essential to alleviate local dimension collapse. Inspired by our theoretical insights, we design a multi-scale semantics module that integrates frame-, sequence-, and motion unit-scale learning, with LER-based regularization to enrich local representation diversity. These designs effectively alleviate local dimension collapse and lead to significant improvements in TAS, as evidenced by LoDE's superior performance over state-of-the-art methods on three large-scale untrimmed datasets: PKU-MMD, TSU, and BABEL.

Framework

Our Local Dimension Enhancement (LoDE) framework. (a) LoDE learns motion unit-scale representations by applying a masked modeling strategy with weight-sharing Siamese encoders. (b) The Multi-scale Action Semantics Learning (MASL) module captures multi-scale semantics by aligning the reconstructed objectives from the masked view with original skeletons and quantized representations from the unmasked view. (c) LER regularization (LERR) is applied to the masked view to encourage a more uniform singular value distribution and increase local intrinsic dimension.

Core Theoretical Insights

1. Quantifying Local Dimension Collapse

We introduce Local Effective Rank (LER) to measure the dimensionality of the latent space within local regions. Given the singular values \( \sigma_1, \dots, \sigma_N \) of the local representation matrix \( Z \), LER is defined as:

\[ \text{LER}(Z) = \exp \left( -\sum_{i=1}^{N} p_i \log p_i \right), \quad p_i = \frac{\sigma_i}{\sum_{j=1}^{N} \sigma_j} \]

As stated in Proposition 1, LER is bounded above by the matrix rank, and the bound is tight: \( 1 \leq \text{LER}(Z) \leq \text{rank}(Z) \). LER therefore serves as a stable indicator of representation diversity.
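The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `local_effective_rank`, the convention that each row is one local representation, and the two toy matrices are assumptions made here for demonstration.

```python
import numpy as np


def local_effective_rank(Z, eps=1e-12):
    """Exponential of the Shannon entropy of the normalized singular values.

    Z is assumed to hold one local representation per row (hypothetical layout).
    """
    sigma = np.linalg.svd(np.asarray(Z, dtype=float), compute_uv=False)
    total = sigma.sum()
    if total <= eps:
        raise ValueError("Z has no non-zero singular values")
    p = sigma / total
    p = p[p > eps]  # treat 0 * log(0) as 0
    return float(np.exp(-(p * np.log(p)).sum()))


# Toy example 1: every row is the same vector, so the matrix has rank 1.
collapsed = np.tile(np.array([1.0, 2.0, 3.0, 4.0]), (8, 1))

# Toy example 2: orthonormal rows, so all singular values are equal.
diverse = np.eye(4)

ler_collapsed = local_effective_rank(collapsed)
ler_diverse = local_effective_rank(diverse)

print("collapsed:", ler_collapsed, "rank:", np.linalg.matrix_rank(collapsed))
print("diverse:  ", ler_diverse, "rank:", np.linalg.matrix_rank(diverse))
```

In both toy cases the value stays between 1 and the matrix rank, and it reaches the rank when the non-zero singular values are all equal.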

Empirically, our analysis confirms that a higher LER corresponds to a larger minimal achievable rank, meaning more information is preserved. LER also correlates strongly and positively with downstream temporal action segmentation (TAS) performance, measured by mean average precision (mAP), as indicated by a high Pearson correlation coefficient (PCC).

(a) LER vs. minimal achievable rank r.

(b) LER vs. downstream TAS performance (PCC=0.986).

Comparison of LER across different modeling scales. F, S, and M denote models trained with frame-, sequence- and motion unit-scale objectives, respectively.

2. Why Motion Unit Scale?

Using LER, we reveal the theoretical limitations of existing modeling paradigms: sequence-scale learning suffers heavily from local dimension collapse because it indiscriminately pulls local representations together, while frame-scale learning is inherently bounded by the low dimensionality of raw skeletons.

Proposition 4 shows that LER admits a lower bound that is inversely proportional to the squared Frobenius norm of the similarity matrix \( Z^\top Z \) of the local representations \( Z \):

\[ \text{LER}(Z) \geq \frac{m^2}{\|Z^\top Z\|_F^2} \]

By introducing Motion Units (clips of consecutive frames), we enrich local short-term semantics, which reduces \( \|Z^\top Z\|_F^2 \) and raises the lower bound of LER, thereby alleviating local dimension collapse.
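The bound can be checked numerically with a small sketch. The page does not define \( m \); the code below assumes that \( m \) is the number of local representations (rows of \( Z \)) and that each row is L2-normalized. The helper names and the synthetic data are hypothetical and only serve the illustration.

```python
import numpy as np


def ler(Z, eps=1e-12):
    """LER: exp of the entropy of the sum-normalized singular values."""
    sigma = np.linalg.svd(Z, compute_uv=False)
    p = sigma / sigma.sum()
    p = p[p > eps]  # treat 0 * log(0) as 0
    return float(np.exp(-(p * np.log(p)).sum()))


def ler_lower_bound(Z):
    """m^2 / ||Z^T Z||_F^2, with m taken as the number of rows (assumption)."""
    m = Z.shape[0]
    return float(m**2 / np.linalg.norm(Z.T @ Z, "fro") ** 2)


def unit_rows(A):
    """L2-normalize each row (assumed preprocessing for the bound)."""
    return A / np.linalg.norm(A, axis=1, keepdims=True)


rng = np.random.default_rng(0)

# Near-collapsed local region: 32 noisy copies of one direction.
base = rng.normal(size=(1, 16))
collapsed = unit_rows(base + 0.05 * rng.normal(size=(32, 16)))

# Diverse local region: 32 independent random directions.
diverse = unit_rows(rng.normal(size=(32, 16)))

bound_collapsed, ler_collapsed = ler_lower_bound(collapsed), ler(collapsed)
bound_diverse, ler_diverse = ler_lower_bound(diverse), ler(diverse)

print("collapsed: bound", bound_collapsed, "LER", ler_collapsed)
print("diverse:   bound", bound_diverse, "LER", ler_diverse)
```

Under these assumptions, the more diverse region has the smaller \( \|Z^\top Z\|_F^2 \) and hence the larger lower bound, which matches the argument for enriching local semantics.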

Experimental Results

Remark 1: Fine-tuning temporal action segmentation performance on PKU-I. The results demonstrate the superior TAS performance achieved with the learned high-quality representations.

Linear evaluation TAS performance on PKU-I

Remark 2: Linear evaluation of temporal action segmentation performance under the transfer learning protocol. The results show that the learned representations generalize well across different input distributions.

Remark 3: Fine-tuning action recognition performance. The results demonstrate the compatibility of our method with action recognition, verifying that the representations generalize across action understanding tasks of different granularities.

BibTeX

@ARTICLE{11481594,
  author={Sun, Shaofan and Lin, Lilang and Zhang, Jiahang and Duan, Ling-Yu and Liu, Jiaying},
  journal={IEEE Transactions on Image Processing}, 
  title={Local Dimension Enhancement Representation Learning for Skeleton-Based Action Segmentation}, 
  year={2026},
  volume={35},
  number={},
  pages={3970-3983}}

If you have any questions, please contact Shaofan Sun (carefree_sun@stu.pku.edu.cn).