Towards Video Large Language Models for View-Invariant Video Understanding
Minjoon Jung, Junbin Xiao*, Junghyun Kim, Il-Jae Kwon, Byoung-Tak Zhang, Angela Yao
Technical Report
We introduce EgoExo-Con, a new benchmark comprising synchronized ego-exo videos with human-refined queries, to study view-invariant video understanding in Video-LLMs. We also propose a scalable self-guided learning method that achieves consistent temporal comprehension with performance comparable to state-of-the-art models.

On the Consistency of Video Large Language Models in Temporal Comprehension
Minjoon Jung, Junbin Xiao*, Byoung-Tak Zhang, Angela Yao
Conference on Computer Vision and Pattern Recognition (CVPR), 2025
We reveal that Video Large Language Models (Video-LLMs) struggle to maintain consistency between grounding and verification. We systematically analyze this issue and introduce VTune, an effective instruction-tuning method that leads to substantial improvements in both grounding and consistency.

Background-aware Moment Detection for Video Moment Retrieval
Minjoon Jung, Youwon Jang, Seongho Choi, Joochan Kim, Jin-Hwa Kim*, Byoung-Tak Zhang*
Winter Conference on Applications of Computer Vision (WACV), 2025
We propose the Background-aware Moment Detection TRansformer (BM-DETR), which carefully adopts a contrastive approach for robust prediction. BM-DETR achieves state-of-the-art performance on various benchmarks while being highly efficient.

Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval
Minjoon Jung, Seongho Choi, Joochan Kim, Jin-Hwa Kim*, Byoung-Tak Zhang*
Empirical Methods in Natural Language Processing (EMNLP), 2022
We propose the Modal-specific Pseudo Query Generation Network (MPGN), a self-supervised framework for Video Corpus Moment Retrieval (VCMR). MPGN captures orthogonal axes of information in videos and generates pseudo-queries that provide a considerable performance boost, even without human annotations.

Stagemix Video Generation using Face and Body Keypoints Detection
Minjoon Jung, Seung-Hyun Lee, Eunseon Sim, Minho Jo, Yujin Lee, Hyebin Choi, Junseok Kwon*
Multimedia Tools and Applications (MTAP), 2022
We design a method for automatically creating Stagemix videos, which seamlessly combine multiple stage performances of a singer into a single cohesive video. Our method produces natural-looking Stagemix videos while significantly reducing the effort required compared to manual editing.

Confidence-guided Refinement Reasoning for Zero-shot Question Answering
Youwon Jang, Minjoon Jung, Woo-Suk Choi, Minsoo Lee, Byoung-Tak Zhang
Technical Report
We propose Confidence-guided Refinement Reasoning (C2R), a novel training-free framework applicable to question-answering (QA) tasks across text, image, and video domains. C2R relies solely on model-derived confidence scores, allowing seamless integration with a variety of QA models and consistently improving performance across different models and benchmarks.
Exploring Ordinal Bias in Action Recognition for Instructional Videos
Joochan Kim, Minjoon Jung, Byoung-Tak Zhang*
ICLR Workshop on Spurious Correlation and Shortcut Learning: Foundations and Solutions, 2025
We show that ordinal bias leads action recognition models to over-rely on dominant action pairs, inflating performance without reflecting true video comprehension.

PGA: Personalizing Grasping Agents with Single Human-Robot Interaction
Junghyun Kim, Gi-Cheon Kang, Jaein Kim, Seoyun Yang, Minjoon Jung, Byoung-Tak Zhang*
International Conference on Intelligent Robots and Systems (IROS), 2024 (Oral)
We propose the Personalized Grasping Agent (PGA), which enables robots to grasp user-specific objects from just a single interaction. PGA captures multi-view object data and uses label propagation to adapt its grasping model without requiring extensive annotations, achieving performance close to fully supervised methods.

Toward a Human-Level Video Understanding Intelligence
Yu-Jung Heo, Minsu Lee, Seongho Choi, Woo Suk Choi, Minjung Shin, Minjoon Jung, Jeh-Kwang Ryu, Byoung-Tak Zhang*
AAAI Fall Symposium Series on Artificial Intelligence for Human-Robot Interaction, 2021
We aim to develop an AI agent that can watch video clips and hold a conversation with humans about the video story.