Minjoon Jung

I am a Ph.D. student at Seoul National University, advised by Prof. Byoung-Tak Zhang. Previously, I interned at NUS@CVML, where I conducted research under the guidance of Dr. Junbin Xiao and Prof. Angela Yao. I earned my Bachelor's degree in Software Engineering from Chung-Ang University. Please feel free to reach out to discuss research and opportunities for collaboration.

Email  /  CV  /  Google Scholar  /  GitHub

Research

I am broadly interested in video modeling and trustworthy video comprehension. Recently, I have been working on video large language models for fine-grained video understanding.

News
  • [2025.04] One paper has been accepted by ICLR 2025 Workshop!
  • [2025.02] One paper has been accepted by CVPR 2025!
  • [2024.11] One paper has been accepted by NeurIPS 2024 Workshop! We also released the extended version on arXiv.
  • [2024.08] One paper has been accepted early in WACV 2025 Round 1!
  • [2024.02] One paper has been accepted by IROS 2024!
  • [2024.02] I'll be joining the National University of Singapore as a research intern.
  • [2022.10] One paper has been accepted by EMNLP 2022!
Publications

On the Consistency of Video Large Language Models in Temporal Comprehension

Minjoon Jung, Junbin Xiao, Byoung-Tak Zhang*, Angela Yao*

Conference on Computer Vision and Pattern Recognition (CVPR), 2025
*An earlier version was accepted by the NeurIPS 2024 Workshop on Video-Language Models.
paper / code

We reveal that Video Large Language Models (Video-LLMs) struggle to maintain consistency in grounding and verification. We systematically analyze this issue and introduce VTune, an effective instruction tuning method, leading to substantial improvements in both grounding and consistency.

Background-aware Moment Detection for Video Moment Retrieval

Minjoon Jung, Youwon Jang, Seongho Choi, Joochan Kim, Jin-Hwa Kim*, Byoung-Tak Zhang*

Winter Conference on Applications of Computer Vision (WACV), 2025
*Accepted early in Round 1.
paper / code

We propose Background-aware Moment Detection TRansformer (BM-DETR), which carefully adopts a contrastive approach for robust prediction. BM-DETR achieves state-of-the-art performance on various benchmarks while being highly efficient.

PGA: Personalizing Grasping Agents with Single Human-Robot Interaction

Junghyun Kim, Gi-Cheon Kang, Jaein Kim, Seoyun Yang, Minjoon Jung, Byoung-Tak Zhang*

International Conference on Intelligent Robots and Systems (IROS), 2024 (Oral)
paper / code

We propose Personalized Grasping Agent (PGA), which enables robots to grasp user-specific objects from just a single interaction. PGA captures multi-view object data and uses label propagation to adapt its grasping model without requiring extensive annotations, achieving performance close to fully supervised methods.

Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval

Minjoon Jung, Seongho Choi, Joochan Kim, Jin-Hwa Kim*, Byoung-Tak Zhang*

Empirical Methods in Natural Language Processing (EMNLP), 2022
paper / code

We propose Modal-specific Pseudo Query Generation Network (MPGN), a self-supervised framework for Video Corpus Moment Retrieval (VCMR). MPGN captures orthogonal axes of information in videos and generates pseudo-queries that provide a considerable performance boost, even without human annotations.

Stagemix Video Generation using Face and Body Keypoints Detection

Minjoon Jung, Seung-Hyun Lee, Eunseon Sim, Minho Jo, Yujin Lee, Hyebin Choi, Junseok Kwon*

Multimedia Tools and Applications (MTAP), 2022
paper / code

We design a method for automatically creating Stagemix videos, which seamlessly combine multiple stage performances of a singer into a single cohesive video. Our method produces natural-looking Stagemix videos while significantly reducing the effort involved compared to manual editing.


Minjoon Jung. BI LAB, Seoul National University