Skip to content

Latest commit

 

History

History
29 lines (15 loc) · 3.91 KB

File metadata and controls

29 lines (15 loc) · 3.91 KB

2018-08-15

这篇文章介绍两篇 ECCV 2018最新的 paper,一篇提出新颖的运动变换变分自动编码器(MT-VAE),用于学习运动序列生成;另一篇提出利用FiLM来调节语言上基于图像的卷积网络计算,解决视推理问题。

VAE

《MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics》

ECCV 2018

Abstract:Long-term human motion can be represented as a series of motion modes---motion sequences that capture short-term temporal dynamics---with transitions between them. We leverage this structure and present a novel Motion Transformation Variational Auto-Encoders (MT-VAE) for learning motion sequence generation. Our model jointly learns a feature embedding for motion modes (that the motion sequence can be reconstructed from) and a feature transformation that represents the transition of one motion mode to the next motion mode. Our model is able to generate multiple diverse and plausible motion sequences in the future from the same input. We apply our approach to both facial and full body motion, and demonstrate applications like analogy-based motion transfer and video synthesis.

摘要:长期(long-term)人体运动可以表示为一系列运动模式 - 捕捉短期时间动态的运动序列 - 它们之间的过渡。我们利用这种结构,提出了一种新颖的运动变换变分自动编码器(MT-VAE),用于学习运动序列生成。我们的模型联合学习运动模式的特征嵌入(可以从中重建运动序列)和表示一个运动模式到下一个运动模式的转换的特征变换。我们的模型能够从相同的输入生成"未来"的多种多样且可信的运动序列。我们将此方法应用于面部和全身运动,并演示了基于类比的运动传递和视频合成等应用。

arXiv:https://arxiv.org/abs/1808.04545

Visual Reasoning

《Visual Reasoning with Multi-hop Feature Modulation》

ECCV 2018

Abstract:Recent breakthroughs in computer vision and natural language processing have spurred interest in challenging multi-modal tasks such as visual question-answering and visual dialogue. For such tasks, one successful approach is to condition image-based convolutional network computation on language via Feature-wise Linear Modulation (FiLM) layers, i.e., per-channel scaling and shifting. We propose to generate the parameters of FiLM layers going up the hierarchy of a convolutional network in a multi-hop fashion rather than all at once, as in prior work. By alternating between attending to the language input and generating FiLM layer parameters, this approach is better able to scale to settings with longer input sequences such as dialogue. We demonstrate that multi-hop FiLM generation achieves state-of-the-art for the short input sequence task ReferIt --- on-par with single-hop FiLM generation --- while also significantly outperforming prior state-of-the-art and single-hop FiLM generation on the GuessWhat?! visual dialogue task.

摘要:最近计算机视觉和自然语言处理方面的突破激发了人们对挑战多模式任务(如视觉问答和视觉对话)的兴趣。对于这样的任务,一种成功的方法是通过特征线性调制(FiLM)层(即,每通道缩放和移位)来调节语言上基于图像的卷积网络计算。我们提出以多跳方式生成在卷积网络的层次结构上的FiLM层的参数,而不是像在先前的工作中那样一次生成。通过在参与语言输入和生成FiLM层参数之间交替,这种方法能够更好地扩展到具有较长输入序列的设置,例如对话(dialogue)。我们证明了多跳FiLM生成实现了短输入序列任务的最新技术参考 - 与单跳FiLM生成相媲美 - 同时也明显优于先前的先进技术GuessWhat上的单跳FiLM生成?!视觉对话任务。

arXiv:https://arxiv.org/abs/1808.04446

注:Amusi觉得将CV与NLP结合有非常大的研究意义和前景。