May 2026

TalkSummary: Less Time, Same Meaning for Talking Video Summarization

Guangli Hu* Yayun Xiao* Yudong Guo Juyong Zhang

University of Science and Technology of China

* Equal contribution. Corresponding author.

TalkSummary teaser overview
TalkSummary condenses long-form talking-head videos while preserving semantic anchors, multimodal affective trajectories, and digital-human performance continuity.

Demo Video and Comparisons

Demo and comparison videos are in preparation. The slots below are reserved for the main result video and related visual comparisons.

Main demo video

Coming soon

Baseline comparison

Coming soon

Ablation study

Coming soon

Digital human rendering

Coming soon

Emotion trajectory

Coming soon

Abstract

Long-form talking-head footage, such as lectures, keynotes, interviews, and instructional content, carries dense knowledge, presentation rhythm, and affective trajectories expressed through speech prosody, facial dynamics, and body motion, yet it is increasingly consumed in short-form formats.

We define long-form talking-head condensation as reducing a video's duration while preserving its linguistic content, affective trajectory, and on-camera performance continuity, so that the source's information and affective expression are conveyed in less time. Existing text summarization and jump-cut summarization methods each address only one side of this problem, while recent talking-head and digital-human synthesis methods do not directly solve interval selection, affect propagation, or cross-segment continuity.

We present TalkSummary, a multimodal condensation-and-re-synthesis framework that converts a long-form talk into a shorter, performance-preserving talking-head video under a user-specified compression ratio. TalkSummary segments the source by multimodal affective divergence, allocates the compression budget using keyword priors and segment-level importance, predicts compressed transcripts together with animation-ready Affective Control Packages (ACPs), and drives voice, body motion, and lip synchronization with temporal smoothing.
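As a rough, runnable sketch of how these four stages compose, the toy pipeline below uses hypothetical function names and stand-in logic throughout; it illustrates the data flow described above, not the released implementation.

```python
# Toy, runnable sketch of the four stages named above. Every function is a
# hypothetical stand-in; none of this is the authors' released code.

def segment_by_affective_divergence(transcript):
    # Stand-in: treat each sentence as one segment.
    return [s.strip() for s in transcript.split(".") if s.strip()]

def allocate_budget(segments, ratio):
    # Stand-in: keep the longest segments up to a count budget (a fuller
    # duration-budget sketch appears under "Budget-aware condensation").
    n_keep = max(1, round(ratio * len(segments)))
    keep = set(sorted(segments, key=len, reverse=True)[:n_keep])
    return [s for s in segments if s in keep]  # preserve source order

def compress_and_package(segment):
    # Dual-head stand-in: (compressed transcript, Affective Control Package).
    return segment[: max(10, len(segment) // 2)], {"valence": 0.0, "arousal": 0.5}

def resynthesize(compressed):
    # Stand-in for voice cloning, motion generation, lip sync, and smoothing.
    return " | ".join(text for text, _acp in compressed)

talk = "Welcome everyone. Today I present one key idea. It saves your time. Thanks."
kept = allocate_budget(segment_by_affective_divergence(talk), ratio=0.5)
print(resynthesize([compress_and_package(s) for s in kept]))
```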

Method

TalkSummary framework pipeline
Overview of the TalkSummary framework: upstream multimodal affective divergence clustering and dual-head omnimodal compression are connected to downstream emotion-conditioned motion generation and high-fidelity lip synchronization.

Multimodal affective distillation

Text, audio, and visual channels are aligned to estimate semantic importance and affective dynamics across sentence-level segments.
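A minimal sketch of what segmentation by multimodal affective divergence could look like, assuming per-sentence feature vectors for each modality are already extracted. The concatenation fusion, cosine-distance criterion, and threshold below are illustrative assumptions, not the paper's formulation.

```python
import math

def fuse(text_feat, audio_feat, visual_feat):
    # Toy fusion: concatenate per-sentence features from the three modalities.
    return text_feat + audio_feat + visual_feat

def cosine_divergence(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm + 1e-8)

def split_on_divergence(features, threshold=0.3):
    """Open a new segment wherever adjacent affect vectors diverge sharply."""
    boundaries = [0]
    for i in range(1, len(features)):
        if cosine_divergence(features[i - 1], features[i]) > threshold:
            boundaries.append(i)
    return boundaries

sentence_feats = [
    fuse([0.9, 0.1], [0.8], [0.7]),
    fuse([0.88, 0.12], [0.79], [0.71]),  # near-identical affect: no boundary
    fuse([0.1, 0.9], [0.2], [0.1]),      # sharp affective shift: new segment
]
print(split_on_divergence(sentence_feats))  # -> [0, 2]
```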

Budget-aware condensation

Dynamic programming allocates the target compression budget while retaining keyword priors and high-importance affective anchors.
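One natural reading of this stage is a 0/1 knapsack over segment durations, which dynamic programming solves exactly. The sketch below is a hedged illustration of that reading; the additive keyword-prior bonus and all names are assumptions rather than the paper's exact objective.

```python
def allocate(durations, importance, keyword_hits, budget, prior=0.5):
    """Maximize importance (plus a keyword-prior bonus) within a duration budget."""
    score = [imp + prior * hits for imp, hits in zip(importance, keyword_hits)]
    # best[t] = (best total score, chosen segment indices) using at most t seconds.
    best = [(0.0, []) for _ in range(budget + 1)]
    for i, dur in enumerate(durations):
        for t in range(budget, dur - 1, -1):  # iterate downward: 0/1 knapsack
            cand = best[t - dur][0] + score[i]
            if cand > best[t][0]:
                best[t] = (cand, best[t - dur][1] + [i])
    return sorted(best[budget][1])

# Four segments: 30 s intro, 45 s key claim, 60 s digression, 25 s close.
print(allocate(durations=[30, 45, 60, 25],
               importance=[0.4, 0.9, 0.2, 0.8],
               keyword_hits=[0, 2, 0, 1],
               budget=80))
# -> [1, 3]: the key claim and the close fit the 80-second budget.
```

Returning the chosen indices in source order matters downstream: cross-segment continuity is easier to preserve when the kept segments follow the original timeline.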

Digital human re-synthesis

ACP signals guide voice cloning, emotion-conditioned body motion, lip synchronization, and boundary smoothing for continuous output video.
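Boundary smoothing between re-synthesized segments can be pictured as easing each segment's control curve in from the previous segment's endpoint. The sketch below applies a linear ramp to a scalar stand-in for an ACP arousal channel; the window size and all names are hypothetical illustrations, not the paper's scheme.

```python
def smooth_boundary(prev_ctrl, next_ctrl, window=4):
    """Ease the next segment's control curve in from the previous segment's end."""
    last = prev_ctrl[-1]
    n = min(window, len(next_ctrl))
    head = []
    for k in range(n):
        w = (k + 1) / (n + 1)  # ramps from near 0 to near 1 across the window
        head.append((1 - w) * last + w * next_ctrl[k])
    return prev_ctrl + head + next_ctrl[n:]

calm = [0.2] * 6      # low-arousal segment (one control value per frame)
excited = [0.9] * 6   # high-arousal segment
print([round(v, 2) for v in smooth_boundary(calm, excited)])
# -> [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.34, 0.48, 0.62, 0.76, 0.9, 0.9]
# The hard 0.2 -> 0.9 jump at the cut becomes a gradual four-frame ramp.
```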

Paper Figures

Affective control package and temporal blending
Structured affective control and cross-segment temporal blending.
Emotion-conditioned motion graph search
Emotion-conditioned motion graph search for expressive gesture synthesis.
Affective trajectory preservation comparison
Affective trajectory preservation compared across the full method, global compression, no smoothing, and model ablations.
Dual-head omnimodal output structure
Dual-head output structure for compressed transcript and continuous affective controls.
TalkSummary interface illustration
Interface and result visualization prepared for the project.

Resources

Paper

Coming soon

Code

Coming soon

Citation

Coming soon