Minjoon Jung

Hi there! My name is Minjoon Jung, and I'm a Ph.D. student at Seoul National University. I have a broad interest in AI systems that interact with multimodal data (Language + X, where X includes vision, video, and robotics). My long-term research goal is to develop fine-grained, trustworthy multimodal AI agents that are both interpretable and highly efficient. Previously, I interned at NUS@CVML, where I had the great opportunity to conduct research with Dr. Junbin Xiao and Prof. Angela Yao. If you’d like to chat about research or potential collaborations, feel free to reach out!

Email  /  CV  /  Google Scholar  /  Github  

News
  • [2025.04] One paper has been accepted by ICLR 2025 Workshop!
  • [2025.02] One paper has been accepted by CVPR 2025!
  • [2024.11] One paper has been accepted by NeurIPS 2024 Workshop! We also released the extended version on arXiv.
  • [2024.08] One paper has been early accepted by WACV 2025 Round 1!
  • [2024.02] One paper has been accepted by IROS 2024!
  • [2024.02] I'll be joining the National University of Singapore as a research intern.
  • [2022.10] One paper has been accepted by EMNLP 2022!
Publications  

Confidence-guided Refinement Reasoning for Zero-shot Question Answering

Youwon Jang, Minjoon Jung, Woo-Suk Choi, Minsoo Lee, Byoung-Tak Zhang

Technical Report

TBD

We propose Confidence-guided Refinement Reasoning (C2R), a novel training-free framework applicable to question-answering (QA) tasks across text, image, and video domains. C2R relies solely on model-derived confidence scores, allowing seamless integration with a variety of QA models and consistently improving performance across different models and benchmarks.

Towards Video Large Language Models for View-Invariant Video Understanding

Minjoon Jung, Junbin Xiao*, Junghyun Kim, Il-Jae Kwon, Byoung-Tak Zhang, Angela Yao

Technical Report

TBD

We introduce EgoExo-Con, a new benchmark comprising synchronized ego-exo videos with human-refined queries, to study view-invariant video understanding in Video-LLMs. We also propose a scalable self-guided learning method that achieves consistent temporal comprehension with performance comparable to state-of-the-art models.

On the Consistency of Video Large Language Models in Temporal Comprehension

Minjoon Jung, Junbin Xiao*, Byoung-Tak Zhang, Angela Yao

Conference on Computer Vision and Pattern Recognition (CVPR), 2025
*An earlier version was accepted at the NeurIPS 2024 Workshop on Video-Language Models.
paper / code

We reveal that Video Large Language Models (Video-LLMs) struggle to maintain consistency in grounding and verification. We systematically analyze this issue and introduce VTune, an effective instruction-tuning method that yields substantial improvements in both grounding and consistency.

Exploring Ordinal Bias in Action Recognition for Instructional Videos

Joochan Kim, Minjoon Jung, Byoung-Tak Zhang*

ICLR Workshop on Spurious Correlation and Shortcut Learning: Foundations and Solutions, 2025

paper / code

We show that ordinal bias leads action recognition models to over-rely on dominant action pairs, inflating their benchmark performance without genuine video comprehension.

Background-aware Moment Detection for Video Moment Retrieval

Minjoon Jung, Youwon Jang, Seongho Choi, Joochan Kim, Jin-Hwa Kim*, Byoung-Tak Zhang*

Winter Conference on Applications of Computer Vision (WACV), 2025
*Early accepted in Round 1.
paper / code

We propose the Background-aware Moment Detection TRansformer (BM-DETR), which adopts a contrastive approach for robust moment prediction. BM-DETR achieves state-of-the-art performance on various benchmarks while being highly efficient.

PGA: Personalizing Grasping Agents with Single Human-Robot Interaction

Junghyun Kim, Gi-Cheon Kang, Jaein Kim, Seoyun Yang, Minjoon Jung, Byoung-Tak Zhang*

International Conference on Intelligent Robots and Systems (IROS), 2024 (Oral)
paper / code

We propose Personalized Grasping Agent (PGA), which enables robots to grasp user-specific objects from just a single interaction. PGA captures multi-view object data and uses label propagation to adapt its grasping model without requiring extensive annotations, achieving performance close to fully supervised methods.

Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval

Minjoon Jung, Seongho Choi, Joochan Kim, Jin-Hwa Kim*, Byoung-Tak Zhang*

Empirical Methods in Natural Language Processing (EMNLP), 2022
paper / code

We propose the Modal-specific Pseudo Query Generation Network (MPGN), a self-supervised framework for Video Corpus Moment Retrieval (VCMR). MPGN captures orthogonal axes of information in videos and generates pseudo-queries that provide a considerable performance boost, even without human annotations.

Stagemix Video Generation using Face and Body Keypoints Detection

Minjoon Jung, Seung-Hyun Lee, Eunseon Sim, Minho Jo, Yujin Lee, Hyebin Choi, Junseok Kwon*

Multimedia Tools and Applications (MTAP), 2022
paper / code

We design a method for automatically creating Stagemix videos, which seamlessly combine multiple stage performances of a singer into a single cohesive video. Our method produces natural-looking Stagemix videos while significantly reducing the effort involved compared to manual editing.

Toward a Human-Level Video Understanding Intelligence

Yu-Jung Heo, Minsu Lee, Seongho Choi, Woo Suk Choi, Minjung Shin, Minjoon Jung, Jeh-Kwang Ryu, Byoung-Tak Zhang*

AAAI Fall Symposium Series on Artificial Intelligence for Human-Robot Interaction, 2021
paper

We aim to develop an AI agent that can watch video clips and hold a conversation with humans about the video story.


Minjoon Jung. BI LAB, Seoul National University