Artificial Intelligence and Image Understanding Lab
Research Field
Jun-Cheng Chen is an Associate Research Fellow at the Research Center for Information Technology Innovation (CITI), Academia Sinica. He joined CITI as an assistant research fellow in 2019. He received the B.S. and M.S. degrees in Computer Science and Information Engineering from National Taiwan University, Taiwan (R.O.C.), in 2004 and 2006, respectively, advised by Prof. Ja-Ling Wu, and the Ph.D. degree in Computer Science from the University of Maryland, College Park, USA, in 2016, advised by Prof. Rama Chellappa. From 2017 to 2019, he was a postdoctoral research fellow at the University of Maryland Institute for Advanced Computer Studies. His research interests include computer vision, machine learning, deep learning, and their applications to biometrics (such as face recognition and facial analytics), activity recognition and detection in visual surveillance, visual generative AI, and AI safety. His work has appeared in prestigious journals and conferences in the field, including PNAS, TBIOM, CVPR, ICCV, ECCV, FG, WACV, and ICLR. He received the ACM Multimedia Best Technical Full Paper Award in 2006, the APSIPA ASC Best Paper Award in 2023, and the IEEE CE Magazine Best Paper Award in 2025.
The Artificial Intelligence & Image Understanding Lab is dedicated to advancing the frontiers of deep learning for computer vision, generative AI, embodied intelligence, and AI safety. Our mission is to build intelligent visual systems that are not only powerful and creative, but also robust, trustworthy, and aligned with real-world needs.
Our research spans four major directions:
🔹 Intelligent Visual Perception
We develop advanced methods for face recognition, action understanding, object detection and tracking, and anomaly detection. Our goal is to build robust visual perception systems capable of operating reliably under complex real-world conditions.
🔹 Visual Generative AI & 3D/4D World Modeling
We explore diffusion/flow-matching models, GANs, and large multimodal models for image, video, and 3D generation. Our research includes NeRF and Gaussian Splatting–based 3D reconstruction, digital twins, and physically consistent generative modeling. We aim to bridge perception and generation to enable controllable and interactive visual intelligence.
🔹 Embodied AI & Vision-Language Agents
We study multimodal embodied intelligence, focusing on Vision-Language Navigation (VLN), Vision-Language-Action (VLA) alignment, and policy optimization. We design agents that can understand, reason, and act in complex environments by integrating visual perception with language understanding.
🔹 AI Safety & Trustworthy AI
As generative AI becomes increasingly powerful, safety becomes critical. We work on deepfake detection, adversarial robustness for multimodal large language models, machine unlearning, watermarking, and secure AI systems to ensure responsible deployment of AI technologies.
My research focuses on deep learning for computer vision, spanning four major categories:
1. Traditional Computer Vision and Biometrics: Research topics include face recognition, action recognition, object detection and tracking, and anomaly detection. This line of work investigates robust visual representation learning, biometric authentication, and reliable perception systems under real-world variations and domain shifts.
2. Visual Generative AI: This direction explores generative models for visual content creation and understanding, including image and video generation with diffusion/flow-matching models and GANs, 3D generation and reconstruction (e.g., NeRF, Gaussian Splatting), multimodal generative AI, detection and analysis of AI-generated content, and digital twin modeling. The goal is to develop accurate, controllable, and physically consistent generative systems that bridge 2D–3D–4D representations and enable interactive world modeling.
3. Embodied AI: This research focuses on multimodal embodied intelligence grounded in vision and language, including Vision-Language Navigation (VLN), Vision-Language-Action (VLA) alignment, policy optimization for embodied agents, and multimodal perception and decision-making. The objective is to design intelligent agents capable of understanding, reasoning, and acting in complex environments through integrated visual and linguistic representations.
4. AI Safety and Trustworthy AI: This direction addresses security, robustness, and reliability in AI systems, including deepfake detection, adversarial examples and prompt attacks against multimodal large language models (MLLMs), machine unlearning, digital watermarking, and computer vision–based AI security mechanisms.
- CE Magazine Best Paper Award, ICCE 2025, Las Vegas, USA (2025)
- Best Paper Implementation Award, CVGIP 2025, Taiwan (2025)
- First Place, Task 6 (Grounded VideoQA), Second Perception Test Challenge, ECCV Workshop, Italy (2024)
- APSIPA ASC Best Paper Award, Taiwan (2023)
- Graduate Summer Research Fellowship, University of Maryland, College Park, USA (2016)
- Excellent Work Award, 1st Acer Long-term Smile Contest, Taiwan (2006)
- Best Technical Full Paper Award, ACM Multimedia Conference, Santa Barbara, USA (2006)
- Champion, Best Mind-Stimulating Award, and Best Creativity Award, National Creative Mobile Game Competition, Taiwan (2004)
- Ph.D., Computer Science, University of Maryland, College Park, United States (2008/9 – 2016/12)
- M.S., Computer Science, University of Maryland, College Park, United States (2008/9 – 2012/6)
- M.S., Computer Science and Information Engineering, National Taiwan University, Taiwan (2004/9 – 2006/6)
- B.S., Computer Science and Information Engineering, National Taiwan University, Taiwan (2000/9 – 2004/6)
Job Description
The internship lasts at least two months and may be extended to three months, subject to NSTC regulations.
There are two periods for the internship application.
If you prefer the August–October period, please apply by April 15, 2026; for the November–December period, please apply by July 15, 2026.
Preferred Intern Educational Level
Undergraduate and graduate students in Computer Science, Electrical Engineering, AI, or related fields are encouraged to apply.
Skill Sets or Qualities
Desired Skill Set and Background Qualifications
We are looking for international interns interested in deep learning, computer vision, and vision-language models (VLMs). Applicants should have basic knowledge of machine learning and deep learning, experience with Python and frameworks such as PyTorch or JAX, and familiarity with core topics in computer vision or multimodal learning. Prior project or research experience is a plus, and publications in well-known conferences such as CVPR, ICCV, ECCV, WACV, and ICIP are especially welcome. Familiarity with Linux and Git is also preferred. Strong problem-solving skills, research motivation, and the ability to work both independently and collaboratively are highly valued.