I am currently a Ph.D. student at Georgia Institute of Technology, advised by Prof. Bo Dai. Before starting my Ph.D., I received my M.Sc. and B.Sc. degrees from Nanjing University.

My research lies at the intersection of reinforcement learning (RL), representation learning, and generative models. Much of my work leverages generative models to transform the RL pipeline, using modern architectures and training techniques to improve the efficiency and scalability of RL algorithms. I also have industry experience applying RL to real-world problems, such as game AI design @ ByteDance and agentic training for LLMs @ Moonshot.AI.

Feel free to contact me if you are interested in my research!

🔥 News

  • 2025.05: 🎉🎉 BDPO is accepted by ICML 2025!
  • 2025.05: 🎉🎉 RIBBO is accepted by IJCAI 2025!
  • 2024.09: 🎉🎉 DiffSR is accepted by NeurIPS 2024!
  • 2024.03: 🎉🎉 ReDM is accepted by ICLR 2024!

📝 Publications

preprint

FlowRL: A Taxonomy and Modular Framework for Reinforcement Learning with Diffusion Policies

Chenxiao Gao, Edward Chen, Tianyi Chen, Bo Dai

Code

  • We organize the fast-growing landscape of RL with diffusion and flow policies into a unified taxonomy, clarifying the design choices behind existing algorithms.
  • We release a modular, JAX-based toolkit with JIT-compiled training and standardized benchmarks, making algorithms easy to compose, compare, and select across generative-modeling and robotics tasks.

preprint

GeMPO: Generalized Measure Matching for Online Diffusion Reinforcement Learning

Haitong Ma, Chenxiao Gao, Tianyi Chen, Na Li, Bo Dai

Code

  • We cast online diffusion RL through a measure-matching lens, generalizing the reweighting scheme from softmax to general monotonic functions.
  • By constructing a virtual target policy and minimizing divergence via reweighted flow matching, GeMPO enables flexible weight design and justifies negative reweighting that actively repels the policy from suboptimal actions.
preprint

Spectral Representation-based Reinforcement Learning

Chenxiao Gao, Haotian Sun, Na Li, Dale Schuurmans, Bo Dai

Code | Project

  • We derive spectral representations from a decomposition of the transition operator, yielding an effective abstraction of the system dynamics with a clear theoretical characterization for downstream policy optimization.
  • The framework covers latent-variable and energy-based dynamics, extends to POMDPs, and matches or exceeds strong baselines on 20+ DeepMind Control Suite tasks.

Tech Report

Kimi K2: Open Agentic Intelligence

Kimi Team (incl. Chenxiao Gao)

  • Kimi K2 is a 1T-parameter (32B active) mixture-of-experts model trained with the MuonClip optimizer for stable large-scale pre-training, achieving strong agentic and coding results (e.g., 65.8 on SWE-Bench Verified).
  • As part of the Kimi Team during my internship at Moonshot.AI, I worked on strengthening the model’s interactive tool-use capability across agent scaffolding, supervised fine-tuning, and reinforcement learning.
ICML 2025

Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning

Chenxiao Gao, Chenyang Wu, Mingjun Cao, Chenjun Xiao, Yang Yu, Zongzhang Zhang

Code | Project Page

  • We introduce pathwise KL to estimate and control the divergence between two diffusion policies.
  • Leveraging pathwise KL, we propose an actor-critic framework with two-time-scale temporal difference learning to efficiently optimize diffusion policies with behavior regularization.
IJCAI 2025

Reinforced In-Context Black-Box Optimization

Lei Song*, Chenxiao Gao*, Ke Xue, Chenyang Wu, Dong Li, Jianye Hao, Zongzhang Zhang, Chao Qian

Code

  • RIBBO distills and reinforces existing black-box optimization algorithms by fitting regret-augmented learning histories of the behavior algorithms.
  • By specifying a suitable regret-to-go, RIBBO generates better query decisions by auto-regressively predicting the next points.
NeurIPS 2024

Diffusion Spectral Representation for Reinforcement Learning

Chenxiao Gao*, Dmitry Shribak*, Yitong Li, Chenjun Xiao, Bo Dai

Code | Project

  • We leverage the flexibility of diffusion models to extract spectral representations (Diff-SR) that capture the structure of the dynamics.
  • Diff-SR is sufficient to represent the value function of any policy, paving the way for efficient planning and exploration in downstream RL optimization.
preprint

Hindsight Preference Learning for Offline Preference-based Reinforcement Learning

Chenxiao Gao, Shengjun Fang, Chenjun Xiao, Yang Yu, Zongzhang Zhang

Code

  • We identify shortcomings of the widely used preference modeling method in existing PbRL settings.
  • HPL leverages the vast unlabeled dataset to facilitate credit assignment, providing robust and advantageous rewards for downstream RL optimization.
IJCAI 2024

Efficient and Stable Offline-to-online Reinforcement Learning via Continual Policy Revitalization

Rui Kong, Chenyang Wu, Chenxiao Gao, Yang Yu, Zongzhang Zhang

Code

  • We identify two pain points in offline-to-online reinforcement learning: 1) value overestimation causes fluctuations during learning, and 2) the primacy bias hinders the policy from further improvement.
  • With the proposed Continual Policy Revitalization, we can fine-tune pre-trained policies efficiently and stably.
ICLR 2024

Policy Rehearsing: Training Generalizable Policies for Reinforcement Learning

Chengxing Jia*, Chenxiao Gao*, Hao Yin, Fuxiang Zhang, Xiong-Hui Chen, Tian Xu, Lei Yuan, Zongzhang Zhang, Yang Yu, Zhi-Hua Zhou

  • We explore the idea of rehearsal for offline reinforcement learning: we generate diverse yet eligible dynamics models from an extremely limited amount of data and optimize a contextual policy with the generated models.
  • By recognizing the context, the policy is able to generalize to the environment at hand during the online stage.
AAMAS 2024

Disentangling Policy from Offline Task Representation Learning via Adversarial Data Augmentation

Chengxing Jia, Fuxiang Zhang, Yi-Chen Li, Chenxiao Gao, Xu-Hui Liu, Lei Yuan, Zongzhang Zhang, Yang Yu

Code

  • Learned task representations from previous OMRL methods tend to correlate spuriously with the behavior policy instead of the task.
  • We disentangle the effect of behavior policies from representation learning by adversarial data augmentation.
AAAI 2024

ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning

Chenxiao Gao, Chenyang Wu, Mingjun Cao, Rui Kong, Zongzhang Zhang, Yang Yu

Code

  • We identify failure modes of existing return-conditioned decision-making systems and suggest using advantages as the property token for conditional generation.
AAAI 2024

Generalizable Task Representation Learning for Offline Meta-Reinforcement Learning with Data Limitations

Renzhe Zhou, Chenxiao Gao, Zongzhang Zhang, Yang Yu

Code

  • Real-world RL applications face data limitations, such as limited tasks and limited behavior diversity.
  • We develop GENTLE, a simple yet effective task representation learning method to extract generalizable and accurate task representations from offline contextual datasets.

📝 Academic Services

  • Reviewer for conferences: ICML 2025-2026, NeurIPS 2025, ICLR 2025-2026, IJCAI 2025, AAAI 2025-2026, UAI 2025
  • Reviewer for journals: TMLR
  • Teaching Assistant: CX4240 - Computing for Data Analysis

🎖 Honors and Awards

  • 2021.12 National Scholarship
  • 2020.10 Chow Tai Fook Scholarship
  • 2020.10 People’s Scholarship of Nanjing University

📖 Education

  • 2022.09 - now, Computer Science, Nanjing University
  • 2018.09 - 2022.06, Computer Science, Nanjing University