

2024


HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

Fan, Z., Parelli, M., Kadoglou, M. E., Kocabas, M., Chen, X., Black, M. J., Hilliges, O.

Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2024 (conference)

ps

Paper Project Code [BibTex]

GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

Gao, G., Liu, W., Chen, A., Geiger, A., Schölkopf, B.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024 (conference) Accepted

ei

[BibTex]

AMUSE: Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion

Chhatre, K., Daněček, R., Athanasiou, N., Becherini, G., Peters, C., Black, M. J., Bolkart, T.

Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2024 (conference) To be published

Abstract
Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model, trained to generate gesture motion sequences, is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content and better represent the emotion expressed by the input speech.
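A rough PyTorch sketch of the mechanism described above: the driving speech is encoded into separate content, emotion, and style vectors, and a noise-prediction (diffusion) model over latent gesture codes is conditioned on all three. Module names, dimensions, and the toy noise schedule are illustrative assumptions, not the AMUSE implementation; at inference the content vector of one clip would be combined with the emotion and style vectors of another.

    import torch
    import torch.nn as nn

    class SpeechEncoder(nn.Module):
        """Maps an audio feature sequence to content, emotion, and style vectors."""
        def __init__(self, d_audio=128, d_lat=64):
            super().__init__()
            self.backbone = nn.GRU(d_audio, 256, batch_first=True)
            self.to_content = nn.Linear(256, d_lat)
            self.to_emotion = nn.Linear(256, d_lat)
            self.to_style = nn.Linear(256, d_lat)

        def forward(self, audio):                      # audio: (B, T, d_audio)
            h, _ = self.backbone(audio)
            h = h.mean(dim=1)                          # temporal pooling
            return self.to_content(h), self.to_emotion(h), self.to_style(h)

    class Denoiser(nn.Module):
        """Predicts the noise added to a latent gesture code, conditioned on the
        three speech latents and the diffusion timestep."""
        def __init__(self, d_motion=64, d_lat=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d_motion + 3 * d_lat + 1, 512), nn.SiLU(),
                nn.Linear(512, d_motion))

        def forward(self, z_t, t, c, e, s):            # z_t: (B, d_motion)
            cond = torch.cat([z_t, c, e, s, t.float().unsqueeze(-1)], dim=-1)
            return self.net(cond)

    enc, eps_model = SpeechEncoder(), Denoiser()
    audio = torch.randn(8, 100, 128)                   # driving speech features
    z0 = torch.randn(8, 64)                            # latent gesture codes
    c, e, s = enc(audio)
    t = torch.randint(0, 1000, (8,))
    alpha_bar = torch.cos(t / 1000.0 * torch.pi / 2) ** 2   # toy noise schedule
    noise = torch.randn_like(z0)
    z_t = alpha_bar.sqrt()[:, None] * z0 + (1 - alpha_bar).sqrt()[:, None] * noise
    loss = ((eps_model(z_t, t, c, e, s) - noise) ** 2).mean()
    loss.backward()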

ps

Project Paper Code link (url) [BibTex]

Out-of-Variable Generalization for Discriminative Models

Guo, S., Wildberger, J., Schölkopf, B.

Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024 (conference) Accepted

ei

arXiv [BibTex]

Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding

Pace, A., Yèche, H., Schölkopf, B., Rätsch, G., Tennenholtz, G.

Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024 (conference) Accepted

ei

arXiv [BibTex]

Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion

Meterez*, A., Joudaki*, A., Orabona, F., Immer, A., Rätsch, G., Daneshmand, H.

Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal contribution (conference) Accepted

ei

arXiv [BibTex]

The Expressive Leaky Memory Neuron: an Efficient and Expressive Phenomenological Neuron Model Can Solve Long-Horizon Tasks

Spieler, A., Rahaman, N., Martius, G., Schölkopf, B., Levina, A.

Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024 (conference) Accepted

ei al

arXiv [BibTex]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open X-Embodiment Collaboration (incl. Guist, S., Schneider, J., Schölkopf, B., Büchler, D.).

IEEE International Conference on Robotics and Automation (ICRA), May 2024 (conference) Accepted

ei

[BibTex]

Can Large Language Models Infer Causation from Correlation?

Jin, Z., Liu, J., Lyu, Z., Poff, S., Sachan, M., Mihalcea, R., Diab*, M., Schölkopf*, B.

Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal supervision (conference) Accepted

ei

arXiv [BibTex]

Certified private data release for sparse Lipschitz functions

Donhauser, K., Lokna, J., Sanyal, A., Boedihardjo, M., Hönig, R., Yang, F.

27th International Conference on Artificial Intelligence and Statistics (AISTATS), May 2024 (conference) Accepted

ei

[BibTex]

Ghost on the Shell: An Expressive Representation of General 3D Shapes

(Oral)

Liu, Z., Feng, Y., Xiu, Y., Liu, W., Paull, L., Black, M. J., Schölkopf, B.

In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024 (inproceedings) Accepted

Abstract
The creation of photorealistic virtual worlds requires the accurate modeling of 3D surface geometry for a wide range of objects. For this, meshes are appealing since they 1) enable fast physics-based rendering with realistic material and lighting, 2) support physical simulation, and 3) are memory-efficient for modern graphics pipelines. Recent work on reconstructing and statistically modeling 3D shape, however, has critiqued meshes as being topologically inflexible. To capture a wide range of object shapes, any 3D representation must be able to model solid, watertight, shapes as well as thin, open, surfaces. Recent work has focused on the former, and methods for reconstructing open surfaces do not support fast reconstruction with material and lighting or unconditional generative modelling. Inspired by the observation that open surfaces can be seen as islands floating on watertight surfaces, we parameterize open surfaces by defining a manifold signed distance field on watertight templates. With this parameterization, we further develop a grid-based and differentiable representation that parameterizes both watertight and non-watertight meshes of arbitrary topology. Our new representation, called Ghost-on-the-Shell (G-Shell), enables two important applications: differentiable rasterization-based reconstruction from multiview images and generative modelling of non-watertight meshes. We empirically demonstrate that G-Shell achieves state-of-the-art performance on non-watertight mesh reconstruction and generation tasks, while also performing effectively for watertight meshes.
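Stated as a formula, one way to read the parameterization sketched above (notation ours, not necessarily the paper's): a scalar "manifold signed distance field" g defined on a watertight template surface carves the open surface out as its non-positive region, and watertight shapes are recovered as the special case where g never changes sign.

    \[
      \mathcal{S} \;=\; \{\, x \in \mathcal{T} \;:\; g(x) \le 0 \,\}, \qquad
      g : \mathcal{T} \to \mathbb{R},
    \]
    \[
      g < 0 \ \text{on all of}\ \mathcal{T} \;\Rightarrow\; \mathcal{S} = \mathcal{T}\ \text{(watertight)},
      \qquad
      g \ \text{changes sign} \;\Rightarrow\; \partial\mathcal{S} \neq \emptyset\ \text{(open surface with boundary)}.
    \]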

ei ps

Home Code Video Project [BibTex]

Identifying Policy Gradient Subspaces

Schneider, J., Schumacher, P., Guist, S., Chen, L., Häufle, D., Schölkopf, B., Büchler, D.

Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024 (conference) Accepted

ei

arXiv [BibTex]

Emergent mechanisms for long timescales depend on training curriculum and affect performance in memory tasks

Khajehabdollahi, S., Zeraati, R., Giannakakis, E., Schäfer, T. J., Martius, G., Levina, A.

In The Twelfth International Conference on Learning Representations, ICLR 2024, May 2024 (inproceedings)

al

link (url) [BibTex]

Some Intriguing Aspects about Lipschitz Continuity of Neural Networks

Khromov*, G., Singh*, S. P.

Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal contribution (conference) Accepted

ei

arXiv [BibTex]

Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization

Liu, W., Qiu, Z., Feng, Y., Xiu, Y., Xue, Y., Yu, L., Feng, H., Liu, Z., Heo, J., Peng, S., Wen, Y., Black, M. J., Weller, A., Schölkopf, B.

In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024 (inproceedings) Accepted

Abstract
Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.
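A toy illustration of the structural idea behind the parameter savings: an orthogonal d x d matrix assembled as a product of log2(d) butterfly factors, each made of 2 x 2 rotations, so only (d/2)·log2(d) angles are trained instead of a full orthogonal matrix. This dense construction is our own simplification for clarity, not the released BOFT code.

    import torch

    def butterfly_orthogonal(angles):
        """angles: (m, d//2) rotation angles for a d x d matrix with d = 2**m.
        Returns R = B_m @ ... @ B_1; each factor is orthogonal, hence so is R."""
        m, half = angles.shape
        d = 2 * half
        assert d == 2 ** m
        R = torch.eye(d)
        for k in range(m):
            stride = 2 ** k
            B = torch.zeros(d, d)
            pair = 0
            for i in range(d):
                if i & stride:                  # handled as the partner of i - stride
                    continue
                j = i + stride
                a = angles[k, pair]; pair += 1
                c, s = torch.cos(a), torch.sin(a)
                B[i, i], B[i, j] = c, -s
                B[j, i], B[j, j] = s, c
            R = B @ R
        return R

    d, m = 8, 3
    angles = torch.randn(m, d // 2)             # would be the trainable parameters
    R = butterfly_orthogonal(angles)
    print(torch.allclose(R @ R.T, torch.eye(d), atol=1e-5))   # True: R is orthogonal

    W = torch.randn(d, 16)                      # frozen pretrained weight
    W_finetuned = R @ W                         # orthogonal finetuning rotates its neurons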

ei ps

Home Code HuggingFace project [BibTex]

Skill or Luck? Return Decomposition via Advantage Functions

Pan, H., Schölkopf, B.

Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024 (conference) Accepted

ei

[BibTex]

Transformer Fusion with Optimal Transport

Imfeld*, M., Graldi*, J., Giordano*, M., Hofmann, T., Anagnostidis, S., Singh, S. P.

Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal contribution (conference) Accepted

ei

arXiv [BibTex]

Learning Hierarchical World Models with Adaptive Temporal Abstractions from Discrete Latent Dynamics

Gumbsch, C., Sajid, N., Martius, G., Butz, M. V.

In The Twelfth International Conference on Learning Representations, ICLR 2024, May 2024 (inproceedings)

al

link (url) [BibTex]

Causal Modeling with Stationary Diffusions

Lorch, L., Krause*, A., Schölkopf*, B.

27th International Conference on Artificial Intelligence and Statistics (AISTATS), May 2024, *equal supervision (conference) Accepted

ei

[BibTex]

Multi-View Causal Representation Learning with Partial Observability

Yao, D., Xu, D., Lachapelle, S., Magliacane, S., Taslakian, P., Martius, G., von Kügelgen, J., Locatello, F.

Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024 (conference) Accepted

ei al

arXiv [BibTex]

Towards Meta-Pruning via Optimal Transport

Theus, A., Geimer, O., Wicke, F., Hofmann, T., Anagnostidis, S., Singh, S. P.

Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024 (conference) Accepted

ei

[BibTex]

Stochastic Gradient Descent for Gaussian Processes Done Right

Lin*, J. A., Padhy*, S., Antorán*, J., Tripp, A., Terenin, A., Szepesvari, C., Hernández-Lobato, J. M., Janz, D.

Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal contribution (conference) Accepted

ei

arXiv [BibTex]

PILLAR: How to make semi-private learning more effective

Hu, Y., Pinto, F., Yang, F., Sanyal, A.

2nd IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), April 2024 (conference) Accepted

ei

[BibTex]

Expert Perception of Teleoperated Social Exercise Robots

Mohan, M., Mat Husin, H., Kuchenbecker, K. J.

In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages: 769-773, Boulder, USA, March 2024, Late-Breaking Report (LBR, 5 pages) (inproceedings)

Abstract
Social robots could help address the growing issue of physical inactivity by inspiring users to engage in interactive exercise. Nevertheless, the practical implementation of social exercise robots poses substantial challenges, particularly in terms of personalizing their activities to individuals. We propose that motion-capture-based teleoperation could serve as a viable solution to address these needs by enabling experts to record custom motions that could later be played back without their real-time involvement. To gather feedback about this idea, we conducted semi-structured interviews with eight exercise-therapy professionals. Our findings indicate that experts' attitudes toward social exercise robots become more positive when considering the prospect of teleoperation to record and customize robot behaviors.

hi

DOI Project Page [BibTex]

TADA! Text to Animatable Digital Avatars

Liao, T., Yi, H., Xiu, Y., Tang, J., Huang, Y., Thies, J., Black, M. J.

In International Conference on 3D Vision (3DV 2024), 3DV 2024, March 2024 (inproceedings) Accepted

Abstract
We introduce TADA, a simple-yet-effective approach that takes textual descriptions and produces expressive 3D avatars with high-quality geometry and lifelike textures, that can be animated and rendered with traditional graphics pipelines. Existing text-based character generation methods are limited in terms of geometry and texture quality, and cannot be realistically animated due to inconsistent alignment between the geometry and the texture, particularly in the face region. To overcome these limitations, TADA leverages the synergy of a 2D diffusion model and an animatable parametric body model. Specifically, we derive an optimizable high-resolution body model from SMPL-X with 3D displacements and a texture map, and use hierarchical rendering with score distillation sampling (SDS) to create high-quality, detailed, holistic 3D avatars from text. To ensure alignment between the geometry and texture, we render normals and RGB images of the generated character and exploit their latent embeddings in the SDS training process. We further introduce various expression parameters to deform the generated character during training, ensuring that the semantics of our generated character remain consistent with the original SMPL-X model, resulting in an animatable character. Comprehensive evaluations demonstrate that TADA significantly surpasses existing approaches on both qualitative and quantitative measures. TADA enables creation of large-scale digital character assets that are ready for animation and rendering, while also being easily editable through natural language. The code will be public for research purposes.
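For reference, the standard score distillation sampling (SDS) gradient (as introduced in DreamFusion) that the pipeline above applies to rendered normal and RGB images of the avatar; theta denotes the avatar parameters, x a render, y the text prompt, and epsilon_phi the frozen 2D diffusion model's noise prediction. The exact weighting and conditioning used by TADA may differ.

    \[
      \nabla_\theta \mathcal{L}_{\mathrm{SDS}}
      \;=\; \mathbb{E}_{t,\epsilon}\!\left[
          w(t)\,\bigl(\epsilon_\phi(x_t;\, y, t) - \epsilon\bigr)\,
          \frac{\partial x}{\partial \theta}
        \right],
      \qquad
      x_t \;=\; \sqrt{\bar\alpha_t}\, x + \sqrt{1-\bar\alpha_t}\,\epsilon .
    \]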

ps

Home Code Video [BibTex]

POCO: 3D Pose and Shape Estimation using Confidence

Dwivedi, S. K., Schmid, C., Yi, H., Black, M. J., Tzionas, D.

In International Conference on 3D Vision (3DV 2024), 3DV 2024, March 2024 (inproceedings)

Abstract
The regression of 3D Human Pose and Shape (HPS) from an image is becoming increasingly accurate. This makes the results useful for downstream tasks like human action recognition or 3D graphics. Yet, no regressor is perfect, and accuracy can be affected by ambiguous image evidence or by poses and appearance that are unseen during training. Most current HPS regressors, however, do not report the confidence of their outputs, meaning that downstream tasks cannot differentiate accurate estimates from inaccurate ones. To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass. Specifically, POCO estimates both the 3D body pose and a per-sample variance. The key idea is to introduce a Dual Conditioning Strategy (DCS) for regressing uncertainty that is highly correlated to pose reconstruction quality. The POCO framework can be applied to any HPS regressor and here we evaluate it by modifying HMR, PARE, and CLIFF. In all cases, training the network to reason about uncertainty helps it learn to more accurately estimate 3D pose. While this was not our goal, the improvement is modest but consistent. Our main motivation is to provide uncertainty estimates for downstream tasks; we demonstrate this in two ways: (1) We use the confidence estimates to bootstrap HPS training. Given unlabelled image data, we take the confident estimates of a POCO-trained regressor as pseudo ground truth. Retraining with this automatically-curated data improves accuracy. (2) We exploit uncertainty in video pose estimation by automatically identifying uncertain frames (e.g. due to occlusion) and inpainting these from confident frames.
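A minimal sketch of the core "pose plus per-sample confidence in one forward pass" idea, using a Gaussian negative log-likelihood with a predicted log-variance. Layer sizes and the scalar-variance choice are assumptions for illustration; POCO's Dual Conditioning Strategy and its integration into HMR, PARE, and CLIFF are more involved.

    import torch
    import torch.nn as nn

    class PoseWithConfidence(nn.Module):
        def __init__(self, d_feat=512, d_pose=72):     # 72 = 24 joints x 3 (axis-angle)
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(d_feat, 256), nn.ReLU())
            self.pose_head = nn.Linear(256, d_pose)
            self.logvar_head = nn.Linear(256, 1)       # one uncertainty scalar per sample

        def forward(self, feat):
            h = self.backbone(feat)
            return self.pose_head(h), self.logvar_head(h).squeeze(-1)

    def nll_loss(pred_pose, log_var, gt_pose):
        # Confident (low-variance) samples are penalized more heavily for errors.
        sq_err = ((pred_pose - gt_pose) ** 2).mean(dim=-1)
        return (sq_err * torch.exp(-log_var) + log_var).mean()

    model = PoseWithConfidence()
    feat = torch.randn(16, 512)                        # image features from any HPS backbone
    gt_pose = torch.randn(16, 72)
    pose, log_var = model(feat)
    nll_loss(pose, log_var, gt_pose).backward()

    # Downstream use: keep only the most confident estimates as pseudo ground truth.
    keep = torch.exp(log_var) < torch.exp(log_var).median()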

ps

Paper SupMat Poster link (url) [BibTex]

TECA: Text-Guided Generation and Editing of Compositional 3D Avatars

Zhang, H., Feng, Y., Kulits, P., Wen, Y., Thies, J., Black, M. J.

In International Conference on 3D Vision (3DV 2024), 3DV 2024, March 2024 (inproceedings) To be published

Abstract
Our goal is to create a realistic 3D facial avatar with hair and accessories using only a text description. While this challenge has attracted significant recent interest, existing methods either lack realism, produce unrealistic shapes, or do not support editing, such as modifications to the hairstyle. We argue that existing methods are limited because they employ a monolithic modeling approach, using a single representation for the head, face, hair, and accessories. Our observation is that the hair and face, for example, have very different structural qualities that benefit from different representations. Building on this insight, we generate avatars with a compositional model, in which the head, face, and upper body are represented with traditional 3D meshes, and the hair, clothing, and accessories with neural radiance fields (NeRF). The model-based mesh representation provides a strong geometric prior for the face region, improving realism while enabling editing of the person's appearance. By using NeRFs to represent the remaining components, our method is able to model and synthesize parts with complex geometry and appearance, such as curly hair and fluffy scarves. Our novel system synthesizes these high-quality compositional avatars from text descriptions. The experimental results demonstrate that our method, Text-guided generation and Editing of Compositional Avatars (TECA), produces avatars that are more realistic than those of recent methods while being editable because of their compositional nature. For example, our TECA enables the seamless transfer of compositional features like hairstyles, scarves, and other accessories between avatars. This capability supports applications such as virtual try-on.

ncs ps

arXiv project link (url) [BibTex]

TeCH: Text-guided Reconstruction of Lifelike Clothed Humans

Huang, Y., Yi, H., Xiu, Y., Liao, T., Tang, J., Cai, D., Thies, J.

In International Conference on 3D Vision (3DV 2024), 3DV 2024, March 2024 (inproceedings) Accepted

Abstract
Despite recent research advancements in reconstructing clothed humans from a single image, accurately restoring the "unseen regions" with high-level details remains an unsolved challenge that lacks attention. Existing methods often generate overly smooth back-side surfaces with a blurry texture. But how to effectively capture all visual attributes of an individual from a single image, which are sufficient to reconstruct unseen areas (e.g., the back view)? Motivated by the power of foundation models, TeCH reconstructs the 3D human by leveraging 1) descriptive text prompts (e.g., garments, colors, hairstyles) which are automatically generated via a garment parsing model and Visual Question Answering (VQA), 2) a personalized fine-tuned Text-to-Image diffusion model (T2I) which learns the "indescribable" appearance. To represent high-resolution 3D clothed humans at an affordable cost, we propose a hybrid 3D representation based on DMTet, which consists of an explicit body shape grid and an implicit distance field. Guided by the descriptive prompts + personalized T2I diffusion model, the geometry and texture of the 3D humans are optimized through multi-view Score Distillation Sampling (SDS) and reconstruction losses based on the original observation. TeCH produces high-fidelity 3D clothed humans with consistent & delicate texture, and detailed full-body geometry. Quantitative and qualitative experiments demonstrate that TeCH outperforms the state-of-the-art methods in terms of reconstruction accuracy and rendering quality.

ps

Code Home Video arXiv [BibTex]

ArtiGrasp: Physically Plausible Synthesis of Bi-Manual Dexterous Grasping and Articulation

Zhang, H., Christen, S., Fan, Z., Zheng, L., Hwangbo, J., Song, J., Hilliges, O.

In International Conference on 3D Vision (3DV 2024), 3DV 2024, March 2024 (inproceedings) Accepted

Abstract
We present ArtiGrasp, a novel method to synthesize bi-manual hand-object interactions that include grasping and articulation. This task is challenging due to the diversity of the global wrist motions and the precise finger control that are necessary to articulate objects. ArtiGrasp leverages reinforcement learning and physics simulations to train a policy that controls the global and local hand pose. Our framework unifies grasping and articulation within a single policy guided by a single hand pose reference. Moreover, to facilitate the training of the precise finger control required for articulation, we present a learning curriculum with increasing difficulty. It starts with single-hand manipulation of stationary objects and continues with multi-agent training including both hands and non-stationary objects. To evaluate our method, we introduce Dynamic Object Grasping and Articulation, a task that involves bringing an object into a target articulated pose. This task requires grasping, relocation, and articulation. We show our method's efficacy towards this task. We further demonstrate that our method can generate motions with noisy hand-object pose estimates from an off-the-shelf image-based regressor.

ps

pdf project code [BibTex]

Adversarial Likelihood Estimation With One-Way Flows

Ben-Dov, O., Gupta, P. S., Abrevaya, V., Black, M. J., Ghosh, P.

In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages: 3779-3788, January 2024 (inproceedings)

Abstract
Generative Adversarial Networks (GANs) can produce high-quality samples, but do not provide an estimate of the probability density around the samples. However, it has been noted that maximizing the log-likelihood within an energy-based setting can lead to an adversarial framework where the discriminator provides unnormalized density (often called energy). We further develop this perspective, incorporate importance sampling, and show that 1) Wasserstein GAN performs a biased estimate of the partition function, and we propose instead to use an unbiased estimator; and 2) when optimizing for likelihood, one must maximize generator entropy. This is hypothesized to provide a better mode coverage. Different from previous works, we explicitly compute the density of the generated samples. This is the key enabler to designing an unbiased estimator of the partition function and computation of the generator entropy term. The generator density is obtained via a new type of flow network, called one-way flow network, that is less constrained in terms of architecture, as it does not require a tractable inverse function. Our experimental results show that our method converges faster, produces comparable sample quality to GANs with similar architecture, successfully avoids over-fitting to commonly used datasets and produces smooth low-dimensional latent representations of the training data.
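The generic importance-sampling identity that a tractable generator density makes available for the partition function (notation ours: E is the energy given by the discriminator, q_theta the generator/flow density; the estimator is unbiased whenever q_theta covers the support of e^{-E}):

    \[
      Z \;=\; \int e^{-E(x)}\,\mathrm{d}x
        \;=\; \mathbb{E}_{x \sim q_\theta}\!\left[\frac{e^{-E(x)}}{q_\theta(x)}\right]
        \;\approx\; \frac{1}{N}\sum_{i=1}^{N} \frac{e^{-E(x_i)}}{q_\theta(x_i)},
      \qquad x_i \sim q_\theta .
    \]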

ps

pdf arXiv [BibTex]

Physics-Based Rigid Body Object Tracking and Friction Filtering From RGB-D Videos

Kandukuri, R. K., Strecke, M., Stueckler, J.

In International Conference on 3D Vision (3DV), 2024, preprint arXiv:2309.15703 (inproceedings) Accepted

Abstract
Physics-based understanding of object interactions from sensory observations is an essential capability in augmented reality and robotics. It enables capturing the properties of a scene for simulation and control. In this paper, we propose a novel approach for real-to-sim which tracks rigid objects in 3D from RGB-D images and infers physical properties of the objects. We use a differentiable physics simulation as state-transition model in an Extended Kalman Filter which can model contact and friction for arbitrary mesh-based shapes and in this way estimate physically plausible trajectories. We demonstrate that our approach can filter position, orientation, velocities, and concurrently can estimate the coefficient of friction of the objects. We analyse our approach on various sliding scenarios in synthetic image sequences of single objects and colliding objects. We also demonstrate and evaluate our approach on a real-world dataset. We will make our novel benchmark datasets publicly available to foster future research in this novel problem setting and comparison with our method.
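A generic Extended Kalman Filter skeleton in which a differentiable transition function stands in for the differentiable physics simulator used as the state-transition model. The toy constant-velocity dynamics, position-only observation model, and noise levels are placeholders, not the paper's contact-and-friction model.

    import torch

    def f(x, dt=0.1):
        # state x = [position (3), velocity (3)]; toy constant-velocity dynamics
        pos, vel = x[:3], x[3:]
        return torch.cat([pos + dt * vel, vel])

    def h(x):
        return x[:3]                            # observe position only

    def jac(fn, x):
        return torch.autograd.functional.jacobian(fn, x)

    def ekf_step(x, P, z, Q, R):
        F = jac(f, x)                           # linearize the (differentiable) dynamics
        x_pred = f(x)
        P_pred = F @ P @ F.T + Q
        H = jac(h, x_pred)                      # linearize the measurement model
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ torch.linalg.inv(S)
        x_new = x_pred + K @ (z - h(x_pred))
        P_new = (torch.eye(x.numel()) - K @ H) @ P_pred
        return x_new, P_new

    x = torch.zeros(6); x[3:] = torch.tensor([1.0, 0.0, 0.0])
    P = 0.1 * torch.eye(6)
    Q, R = 1e-3 * torch.eye(6), 1e-2 * torch.eye(3)
    for _ in range(5):
        z = h(f(x)) + 0.01 * torch.randn(3)     # simulated noisy position measurement
        x, P = ekf_step(x, P, z, Q, R)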

ev

preprint supplemental video dataset [BibTex]

Online Calibration of a Single-Track Ground Vehicle Dynamics Model by Tight Fusion with Visual-Inertial Odometry

Li, H., Stueckler, J.

In IEEE International Conference on Robotics and Automation (ICRA), 2024, preprint arXiv:2309.11148 (inproceedings) Accepted

Abstract
Wheeled mobile robots need the ability to estimate their motion and the effect of their control actions for navigation planning. In this paper, we present ST-VIO, a novel approach which tightly fuses a single-track dynamics model for wheeled ground vehicles with visual inertial odometry. Our method calibrates and adapts the dynamics model online and facilitates accurate forward prediction conditioned on future control inputs. The single-track dynamics model approximates wheeled vehicle motion under specific control inputs on flat ground using ordinary differential equations. We use a singularity-free and differentiable variant of the single-track model to enable seamless integration as dynamics factor into VIO and to optimize the model parameters online together with the VIO state variables. We validate our method with real-world data in both indoor and outdoor environments with different terrain types and wheels. In our experiments, we demonstrate that our ST-VIO can not only adapt to the change of the environments and achieve accurate prediction under new control inputs, but even improves the tracking accuracy.
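For intuition, a standard kinematic single-track ("bicycle") model integrated with forward Euler. ST-VIO uses a singularity-free, differentiable dynamics variant whose parameters are calibrated online together with the VIO state; the wheelbase, step size, and commands below are made-up values.

    import math

    def single_track_step(state, v_cmd, steer_cmd, dt=0.02, wheelbase=0.3):
        """state = (x, y, yaw); v_cmd: forward speed [m/s]; steer_cmd: steering angle [rad]."""
        x, y, yaw = state
        x += v_cmd * math.cos(yaw) * dt
        y += v_cmd * math.sin(yaw) * dt
        yaw += v_cmd / wheelbase * math.tan(steer_cmd) * dt
        return (x, y, yaw)

    state = (0.0, 0.0, 0.0)
    for _ in range(100):                         # drive a gentle left arc for 2 seconds
        state = single_track_step(state, v_cmd=1.0, steer_cmd=0.1)
    print(state)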

ev

preprint supplemental video [BibTex]


2023


Transitivity Recovering Decompositions: Interpretable and Robust Fine-Grained Relationships

Chaudhuri, A., Mancini, M., Akata, Z., Dutta, A.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 37th Annual Conference on Neural Information Processing Systems, December 2023 (conference) Accepted

ei

[BibTex]

Spuriosity Didn’t Kill the Classifier: Using Invariant Predictions to Harness Spurious Features

Eastwood*, C., Singh*, S., Nicolicioiu, A. L., Vlastelica, M., von Kügelgen, J., Schölkopf, B.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 36, pages: 18291-18324, (Editors: A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine), Curran Associates, Inc., 37th Annual Conference on Neural Information Processing Systems, December 2023, *equal contribution (conference)

ei

link (url) [BibTex]

CLadder: A Benchmark to Assess Causal Reasoning Capabilities of Language Models

Jin*, Z., Chen*, Y., Leeb*, F., Gresele*, L., Kamal, O., Lyu, Z., Blin, K., Gonzalez, F., Kleiman-Weiner, M., Sachan, M., Schölkopf, B.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 36, pages: 31038-31065, (Editors: A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine), Curran Associates, Inc., 37th Annual Conference on Neural Information Processing Systems, December 2023, *main contributors (conference)

ei

link (url) [BibTex]

Online Learning under Adversarial Nonlinear Constraints

Kolev, P., Martius, G., Muehlebach, M.

In Advances in Neural Information Processing Systems 36, December 2023 (inproceedings)

al

link (url) [BibTex]

Generalized Bayesian Inference for Scientific Simulators via Amortized Cost Estimation

Gao*, R., Deistler*, M., Macke, J. H.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 36, pages: 80191-80219, (Editors: A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine), Curran Associates, Inc., 37th Annual Conference on Neural Information Processing Systems, December 2023, *equal contribution (conference)

ei

link (url) [BibTex]

Meta-in-context learning in large language models

Coda-Forno, J., Binz, M., Akata, Z., Botvinick, M., Wang, J. X., Schulz, E.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 37th Annual Conference on Neural Information Processing Systems, December 2023 (conference) Accepted

ei

[BibTex]

On the Importance of Step-wise Embeddings for Heterogeneous Clinical Time-Series

Kuznetsova*, R., Pace*, A., Burger*, M., Yèche, H., Rätsch, G.

Proceedings of the 3rd Machine Learning for Health Symposium (ML4H), 225, pages: 268-291, Proceedings of Machine Learning Research, (Editors: Hegselmann, S. and Parziale, A. and Shanmugam, D. and Tang, S. and Asiedu, M. N. and Chang, S. and Hartvigsen, T. and Singh, H.), PMLR, December 2023, *equal contribution (conference)

ei

link (url) [BibTex]

Regularity as Intrinsic Reward for Free Play

Sancaktar, C., Piater, J., Martius, G.

In Advances in Neural Information Processing Systems 36, December 2023 (inproceedings)

al

Website Code link (url) [BibTex]

Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data

Guo*, S., Tóth*, V., Schölkopf, B., Huszár, F.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 36, pages: 36463-36475, (Editors: A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine), Curran Associates, Inc., 37th Annual Conference on Neural Information Processing Systems, December 2023, *equal contribution (conference)

ei

link (url) [BibTex]

Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent

Lin*, J. A., Antorán*, J., Padhy*, S., Janz, D., Hernández-Lobato, J. M., Terenin, A.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 36, pages: 36886-36912, (Editors: A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine), Curran Associates, Inc., 37th Annual Conference on Neural Information Processing Systems, December 2023, *equal contribution (conference)

ei

link (url) [BibTex]

Controlling Text-to-Image Diffusion by Orthogonal Finetuning

Qiu*, Z., Liu*, W., Feng, H., Xue, Y., Feng, Y., Liu, Z., Zhang, D., Weller, A., Schölkopf, B.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 36, pages: 79320-79362, (Editors: A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine), Curran Associates, Inc., 37th Annual Conference on Neural Information Processing Systems, December 2023, *equal contribution (conference)

Abstract
Large text-to-image diffusion models have impressive capabilities in generating photorealistic images from text prompts. How to effectively guide or control these powerful models to perform different downstream tasks becomes an important open problem. To tackle this challenge, we introduce a principled finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere. We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. To improve finetuning stability, we further propose Constrained Orthogonal Finetuning (COFT) which imposes an additional radius constraint to the hypersphere. Specifically, we consider two important finetuning text-to-image tasks: subject-driven generation where the goal is to generate subject-specific images given a few images of a subject and a text prompt, and controllable generation where the goal is to enable the model to take in additional control signals. We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.
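A small PyTorch sketch of the basic orthogonal-finetuning mechanic: the pretrained weight stays frozen and only an orthogonal rotation of its neurons is learned, here through a Cayley transform of a skew-symmetric parameter. OFT additionally imposes block-diagonal structure (and COFT a radius constraint); names and sizes below are illustrative, not the released code.

    import torch
    import torch.nn as nn

    class OrthogonalFinetunedLinear(nn.Module):
        def __init__(self, pretrained: nn.Linear):
            super().__init__()
            self.pretrained = pretrained
            for p in self.pretrained.parameters():
                p.requires_grad_(False)                    # pretrained weight stays frozen
            d_out = pretrained.out_features
            self.theta = nn.Parameter(torch.zeros(d_out, d_out))   # the only trained tensor

        def rotation(self):
            # Cayley transform: skew-symmetric S -> orthogonal R = (I - S)(I + S)^{-1}
            S = self.theta - self.theta.T
            I = torch.eye(S.shape[0], device=S.device)
            return (I - S) @ torch.linalg.inv(I + S)

        def forward(self, x):
            W = self.rotation() @ self.pretrained.weight   # rotate the frozen neurons
            return nn.functional.linear(x, W, self.pretrained.bias)

    layer = OrthogonalFinetunedLinear(nn.Linear(64, 32))
    out = layer(torch.randn(4, 64))       # at initialization, identical to the frozen layer
    out.sum().backward()                  # gradients flow only into theta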

ei ps

Home Code link (url) [BibTex]

Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities

Zadaianchuk, A., Seitzer, M., Martius, G.

In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023), Advances in Neural Information Processing Systems 36, December 2023 (inproceedings)

Abstract
Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.

al

arXiv Website OpenReview link (url) [BibTex]

In-Context Impersonation Reveals Large Language Models’ Strengths and Biases

Salewski, L., Alaniz, S., Rio-Torto, I., Schulz, E., Akata, Z.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 37th Annual Conference on Neural Information Processing Systems, December 2023 (conference) Accepted

ei

[BibTex]

Flow Matching for Scalable Simulation-Based Inference

Wildberger*, J., Dax*, M., Buchholz*, S., Green, S. R., Macke, J. H., Schölkopf, B.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 36, pages: 16837-16864, (Editors: A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine), Curran Associates, Inc., 37th Annual Conference on Neural Information Processing Systems, December 2023, *equal contribution (conference)

ei

link (url) [BibTex]

SE(3) Equivariant Augmented Coupling Flows

Midgley*, L. I., Stimper*, V., Antorán*, J., Mathieu*, E., Schölkopf, B., Hernández-Lobato, J. M.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 36, pages: 79200-79225, (Editors: A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine), Curran Associates, Inc., 37th Annual Conference on Neural Information Processing Systems, December 2023, *equal contribution (conference)

Abstract
Coupling normalizing flows allow for fast sampling and density evaluation, making them the tool of choice for probabilistic modeling of physical systems. However, the standard coupling architecture precludes endowing flows that operate on the Cartesian coordinates of atoms with the SE(3) and permutation invariances of physical systems. This work proposes a coupling flow that preserves SE(3) and permutation equivariance by performing coordinate splits along additional augmented dimensions. At each layer, the flow maps atoms’ positions into learned SE(3) invariant bases, where we apply standard flow transformations, such as monotonic rational-quadratic splines, before returning to the original basis. Crucially, our flow preserves fast sampling and density evaluation, and may be used to produce unbiased estimates of expectations with respect to the target distribution via importance sampling. When trained on the DW4, LJ13 and QM9-positional datasets, our flow is competitive with equivariant continuous normalizing flows, while allowing sampling two orders of magnitude faster. Moreover, to the best of our knowledge, we are the first to learn the full Boltzmann distribution of alanine dipeptide by only modeling the Cartesian positions of its atoms. Lastly, we demonstrate that our flow can be trained to approximately sample from the Boltzmann distribution of the DW4 and LJ13 particle systems using only their energy functions.

ei

arXiv link (url) [BibTex]

Neural Harmonics: Bridging Spectral Embedding and Matrix Completion in Self-Supervised Learning

Munkhoeva, M., Oseledets, I.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 36, pages: 60712-60723, (Editors: A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine), Curran Associates, Inc., 37th Annual Conference on Neural Information Processing Systems, December 2023 (conference)

ei

link (url) [BibTex]

Learning Linear Causal Representations from Interventions under General Nonlinear Mixing

Buchholz*, S., Rajendran*, G., Rosenfeld, E., Aragam, B., Schölkopf, B., Ravikumar, P.

Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 36, pages: 45419-45462, (Editors: A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine), Curran Associates, Inc., 37th Annual Conference on Neural Information Processing Systems, December 2023, *equal contribution (conference)

ei

link (url) [BibTex]

Causal Modeling with Stationary Diffusions

Lorch, L., Krause*, A., Schölkopf*, B.

Causal Representation Learning Workshop at NeurIPS 2023, December 2023, *equal supervision (conference)

ei

link (url) [BibTex]
