Manipulating an object (i.e., changing its pose or viewpoint) given only a single RGB image is a highly under-constrained problem. Even from a single image, however, humans can easily imagine what an object would look like under a different pose or viewpoint. In this work, we approach this problem by learning a disentangled representation of object identity (i.e., the object's appearance) and pose.
Our model takes an ID image and a pose image as input, and generates a novel image that combines the identity of the first with the pose of the second. Unlike previous unsupervised works, which rely on cyclic constraints, we use unlabeled videos to automatically produce pseudo-groundtruth targets, allowing our network to train with direct supervision yet without any annotations.
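The key idea behind the pseudo-groundtruth targets can be sketched as follows: since two frames of the same video clip show the same object in different poses, the pose frame itself serves as a direct supervision target. The names below (`Frame`, `make_triplet`) are illustrative, not from the paper:

```python
# Sketch of pseudo-groundtruth construction from unlabeled video,
# assuming all frames of one clip show the same object (identity)
# in different poses. Names here are hypothetical placeholders.
import random
from typing import List, Tuple

Frame = List[float]  # stand-in for an RGB image tensor


def make_triplet(clip: List[Frame]) -> Tuple[Frame, Frame, Frame]:
    """Sample (id_image, pose_image, target) from one video clip.

    Both frames come from the same clip, so they share object
    identity; the pose frame is therefore a valid target for
    "identity of frame A rendered in the pose of frame B".
    """
    id_image, pose_image = random.sample(clip, 2)
    target = pose_image  # direct supervision, no annotations needed
    return id_image, pose_image, target
```

The generator would then be trained with a standard reconstruction loss between its output and `target`, instead of the indirect cyclic constraints used by prior unsupervised methods.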
Both quantitative and qualitative results are presented on the car, bus, and truck classes of the YouTube-BoundingBoxes dataset, and demonstrate improved realism, diversity, and ID/pose disentanglement and preservation compared to existing unsupervised methods. We also present results on the application of our model to image composition.
In submission to a top-tier vision conference.
Our generation results
Image composition demo using images generated by our model
Paper & Source Code
If you have any questions or comments, feel free to contact me, and I will reply as soon as possible.