May 2026

TalkSummary: Less Time, Same Meaning for Talking Video Summarization

Guangli Hu* Yayun Xiao* Yudong Guo Juyong Zhang

University of Science and Technology of China

* Equal contribution. Corresponding author.

TalkSummary teaser overview
TalkSummary condenses long-form talking-head videos while preserving semantic anchors, multimodal affective trajectories, and digital-human performance continuity.

Demo Video and Comparisons

Demo and comparison videos are in preparation. The slots below are reserved for the main result video and related visual comparisons.

Main demo video

Coming soon

Baseline comparison

Coming soon

Ablation study

Coming soon

Digital human rendering

Coming soon

Emotion trajectory

Coming soon

Abstract

Long-form talking-head footage, such as lectures, keynotes, interviews, and instructional content, carries dense knowledge, presentation rhythm, and affective trajectories expressed through speech prosody, facial dynamics, and body motion, yet it is increasingly consumed in short-form formats.

We define long-form talking-head condensation as reducing a video's duration while preserving its linguistic content, affective trajectory, and on-camera performance continuity, so that the source's information and affective expression are conveyed in less time. Existing text summarization and jump-cut summarization methods each address only one side of this problem, while recent talking-head and digital-human synthesis methods do not directly solve interval selection, affect propagation, or cross-segment continuity.

We present TalkSummary, a multimodal condensation-and-re-synthesis framework that converts a long-form talk into a shorter, performance-preserving talking-head video under a user-specified compression ratio. TalkSummary segments the source by multimodal affective divergence, allocates the compression budget using keyword priors and segment-level importance, predicts compressed transcripts together with animation-ready Affective Control Packages (ACPs), and drives voice, body motion, and lip synchronization with temporal smoothing.
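As a rough, runnable sketch of how these four stages compose, the toy pipeline below uses hypothetical function names and stand-in logic throughout; it illustrates the data flow described above, not the released implementation.

```python
# Toy, runnable sketch of the four stages named above. Every function is a
# hypothetical stand-in; none of this is the authors' released code.

def segment_by_affective_divergence(transcript):
    # Stand-in: treat each sentence as one segment.
    return [s.strip() for s in transcript.split(".") if s.strip()]

def allocate_budget(segments, ratio):
    # Stand-in: keep the longest segments up to a count budget (a fuller
    # duration-budget sketch appears under "Budget-aware condensation").
    n_keep = max(1, round(ratio * len(segments)))
    keep = set(sorted(segments, key=len, reverse=True)[:n_keep])
    return [s for s in segments if s in keep]  # preserve source order

def compress_and_package(segment):
    # Dual-head stand-in: (compressed transcript, Affective Control Package).
    return segment[: max(10, len(segment) // 2)], {"valence": 0.0, "arousal": 0.5}

def resynthesize(compressed):
    # Stand-in for voice cloning, motion generation, lip sync, and smoothing.
    return " | ".join(text for text, _acp in compressed)

talk = "Welcome everyone. Today I present one key idea. It saves your time. Thanks."
kept = allocate_budget(segment_by_affective_divergence(talk), ratio=0.5)
print(resynthesize([compress_and_package(s) for s in kept]))
```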

Method

TalkSummary framework pipeline
Overview of the TalkSummary framework: upstream multimodal affective divergence clustering and dual-head omnimodal compression are connected to downstream emotion-conditioned motion generation and high-fidelity lip synchronization.

Multimodal affective distillation

Text, audio, and visual channels are aligned to estimate semantic importance and affective dynamics across sentence-level segments.
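A minimal sketch of what segmentation by multimodal affective divergence could look like, assuming per-sentence feature vectors for each modality are already extracted. The concatenation fusion, cosine-distance criterion, and threshold below are illustrative assumptions, not the paper's formulation.

```python
import math

def fuse(text_feat, audio_feat, visual_feat):
    # Toy fusion: concatenate per-sentence features from the three modalities.
    return text_feat + audio_feat + visual_feat

def cosine_divergence(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm + 1e-8)

def split_on_divergence(features, threshold=0.3):
    """Open a new segment wherever adjacent affect vectors diverge sharply."""
    boundaries = [0]
    for i in range(1, len(features)):
        if cosine_divergence(features[i - 1], features[i]) > threshold:
            boundaries.append(i)
    return boundaries

sentence_feats = [
    fuse([0.9, 0.1], [0.8], [0.7]),
    fuse([0.88, 0.12], [0.79], [0.71]),  # near-identical affect: no boundary
    fuse([0.1, 0.9], [0.2], [0.1]),      # sharp affective shift: new segment
]
print(split_on_divergence(sentence_feats))  # -> [0, 2]
```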

Budget-aware condensation

Dynamic programming allocates the target compression budget while retaining keyword priors and high-importance affective anchors.
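One natural reading of this stage is a 0/1 knapsack over segment durations, which dynamic programming solves exactly. The sketch below is a hedged illustration of that reading; the additive keyword-prior bonus and all names are assumptions rather than the paper's exact objective.

```python
def allocate(durations, importance, keyword_hits, budget, prior=0.5):
    """Maximize importance (plus a keyword-prior bonus) within a duration budget."""
    score = [imp + prior * hits for imp, hits in zip(importance, keyword_hits)]
    # best[t] = (best total score, chosen segment indices) using at most t seconds.
    best = [(0.0, []) for _ in range(budget + 1)]
    for i, dur in enumerate(durations):
        for t in range(budget, dur - 1, -1):  # iterate downward: 0/1 knapsack
            cand = best[t - dur][0] + score[i]
            if cand > best[t][0]:
                best[t] = (cand, best[t - dur][1] + [i])
    return sorted(best[budget][1])

# Four segments: 30 s intro, 45 s key claim, 60 s digression, 25 s close.
print(allocate(durations=[30, 45, 60, 25],
               importance=[0.4, 0.9, 0.2, 0.8],
               keyword_hits=[0, 2, 0, 1],
               budget=80))
# -> [1, 3]: the key claim and the close fit the 80-second budget.
```

Returning the chosen indices in source order matters downstream: cross-segment continuity is easier to preserve when the kept segments follow the original timeline.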

Digital human re-synthesis

ACP signals guide voice cloning, emotion-conditioned body motion, lip synchronization, and boundary smoothing for continuous output video.
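Boundary smoothing between re-synthesized segments can be pictured as easing each segment's control curve in from the previous segment's endpoint. The sketch below applies a linear ramp to a scalar stand-in for an ACP arousal channel; the window size and all names are hypothetical illustrations, not the paper's scheme.

```python
def smooth_boundary(prev_ctrl, next_ctrl, window=4):
    """Ease the next segment's control curve in from the previous segment's end."""
    last = prev_ctrl[-1]
    n = min(window, len(next_ctrl))
    head = []
    for k in range(n):
        w = (k + 1) / (n + 1)  # ramps from near 0 to near 1 across the window
        head.append((1 - w) * last + w * next_ctrl[k])
    return prev_ctrl + head + next_ctrl[n:]

calm = [0.2] * 6      # low-arousal segment (one control value per frame)
excited = [0.9] * 6   # high-arousal segment
print([round(v, 2) for v in smooth_boundary(calm, excited)])
# -> [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.34, 0.48, 0.62, 0.76, 0.9, 0.9]
# The hard 0.2 -> 0.9 jump at the cut becomes a gradual four-frame ramp.
```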

Paper Figures

Affective control package and temporal blending
Structured affective control and cross-segment temporal blending.
Emotion-conditioned motion graph search
Emotion-conditioned motion graph search for expressive gesture synthesis.
Affective trajectory preservation comparison
Affective trajectory preservation compared across the full method, global compression, no smoothing, and model ablations.
Dual-head omnimodal output structure
Dual-head output structure for compressed transcript and continuous affective controls.
TalkSummary interface illustration
Interface and result visualization prepared for the project.

Resources

Paper

Coming soon

Code

Coming soon

Citation

Coming soon