NeurIPS Poster On-the-Fly Adapting Code Summarization on Trainable Cost-Effective Language Models

Yufan Cai · Yun Lin · Chenyan Liu · Jinglian Wu · Yifan Zhang · Yiming Liu · Yeyun Gong · Jin Song Dong

Great Hall & Hall B1+B2 (level 1) #420

Abstract: Deep learning models are emerging that summarize source code into comments, supporting code documentation and program comprehension. Scaled-up large language models trained on large open corpora have achieved good performance on such tasks. In practice, however, the code in a particular project can be highly specific and may not align with the overall training corpus. Some code samples from other projects may be contradictory and introduce inconsistencies when a model tries to fit all the samples. In this work, we introduce a novel approach, Adacom, that improves the performance of comment generators through on-the-fly model adaptation. This research is motivated by the observation that deep comment generators must strike a balance because they need to fit all the training samples. Specifically, for a given target code $c$, some training samples $S_p$ could have contributed more, while other samples $S_o$ could have counter-effects. However, a traditionally fine-tuned model must fit both $S_p$ and $S_o$ from a global perspective, which leads to compromised performance on the target code $c$. In this context, we design Adacom to (1) detect whether the model might have compromised performance on a target code $c$, (2) retrieve a few helpful training samples $S_p$ that have contradictory samples in the training dataset, and (3) adapt the model on the fly by re-training it on $S_p$ to strengthen the helpful samples and unlearn the harmful samples. Our extensive experiments on 7 comment generators and 4 public datasets show that (1) Adacom can significantly boost the performance of comment generation (on average, BLEU-4 by 14.9%, METEOR by 12.2%, and ROUGE-L by 7.4%), (2) adapting the model on one code sample is cost-effective and acceptable as an on-the-fly solution, and (3) Adacom adapts well to out-of-distribution code samples.
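The detect, retrieve, and adapt steps described in the abstract can be illustrated with a deliberately tiny toy. This is a minimal sketch under stated assumptions, not Adacom's implementation: it assumes token-set Jaccard similarity as the retriever, a hypothetical agreement threshold to split the retrieved neighbours of $c$ into $S_p$ and $S_o$, and a bag of comment-word weights standing in for a trainable comment generator, so that "strengthen" and "unlearn" become adding and subtracting weights. All function names (`retrieve`, `partition`, `needs_adaptation`, `adapt`) and the example data are hypothetical.

```python
import re
from collections import Counter

Example = tuple[str, str]  # hypothetical (code, comment) training pair


def tokens(text: str) -> set[str]:
    """Lower-cased alphanumeric tokens; a crude stand-in for a real code tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a or b else 0.0


def retrieve(target: str, train: list[Example], k: int) -> list[Example]:
    """Return the k training pairs whose code is lexically closest to the target code c."""
    t = tokens(target)
    return sorted(train, key=lambda ex: jaccard(t, tokens(ex[0])), reverse=True)[:k]


def partition(neighbours: list[Example], threshold: float = 0.3) -> tuple[list[Example], list[Example]]:
    """Split neighbours into S_p and S_o.

    Assumed heuristic: a neighbour goes to S_p if its comment agrees with the closest
    neighbour's comment (comment-token Jaccard >= threshold), otherwise to S_o.
    """
    anchor = tokens(neighbours[0][1])
    s_p = [ex for ex in neighbours if jaccard(anchor, tokens(ex[1])) >= threshold]
    s_o = [ex for ex in neighbours if jaccard(anchor, tokens(ex[1])) < threshold]
    return s_p, s_o


def needs_adaptation(s_o: list[Example]) -> bool:
    """Toy detector: flag c when its neighbourhood contains contradictory comments."""
    return len(s_o) > 0


def adapt(weights: Counter, s_p: list[Example], s_o: list[Example], lr: float = 1.0) -> Counter:
    """Toy on-the-fly update on a copy of the global weights (the global copy is untouched)."""
    local = Counter(weights)
    for _, comment in s_p:
        for tok in tokens(comment):
            local[tok] += lr  # strengthen words from helpful samples
    for _, comment in s_o:
        for tok in tokens(comment):
            local[tok] -= lr  # crude stand-in for unlearning harmful samples
    return local


if __name__ == "__main__":
    train = [
        ("def add(a, b): return a + b", "return the sum of two numbers"),
        ("def add(x, y): return x + y", "return the sum of x and y"),
        ("def add(a, b): return a + b", "concatenate two lists"),  # contradictory comment
        ("def read(path): return open(path).read()", "read a file"),
    ]
    target = "def add(p, q): return p + q"
    s_p, s_o = partition(retrieve(target, train, k=3))
    if needs_adaptation(s_o):
        local = adapt(Counter(), s_p, s_o)
        print(sorted(w for w, _ in local.most_common(4)))  # → ['of', 'return', 'sum', 'the']
```

In this toy, each target code gets its own temporary copy of the weights, which mirrors the per-sample, on-the-fly framing of the abstract: the shared model is never modified, so adapting to one target cannot degrade predictions for another. A real comment generator would replace the weight arithmetic with gradient updates on a neural model.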
