Paper: Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code. 2025
1 What do they do, and what’s the result?
1.1 What do they do?
1. Two primary methodologies are used when applying LLMs to code: prompt engineering and fine-tuning. In this paper, they conduct a comparative analysis of these two methodologies.
2. Quantitative analysis of GPT-4:
(1) evaluate GPT-4 using 3 prompt engineering strategies: basic prompting, in-context learning, and task-specific prompting.
Basic prompting: directly query GPT-4 with the input (code or description) and ask it to generate solutions in the form of the desired output.
In-context prompting: together with the basic prompt, a set of input/output examples is given to GPT-4. The idea is very similar to few-shot learning with fine-tuned language models.
Task-specific engineered prompting: together with the basic prompt, additional prompts are designed to guide GPT-4 toward better results for each task (a minimal sketch of the three strategies appears at the end of this subsection).
(2) compare GPT-4 against 17 fine-tuned models across 3 tasks: code summarization, code generation, and code translation.
3. They conduct a qualitative user study with 37 users.
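A minimal sketch of how the three automated prompting strategies could be assembled (illustrative templates for code summarization, in my own wording rather than the paper's exact prompts):

```python
def basic_prompt(code: str) -> str:
    # Basic prompting: only the input plus a request for the desired output form.
    return f"Summarize the following code in one sentence:\n{code}"

def in_context_prompt(code: str, examples: list[tuple[str, str]]) -> str:
    # In-context prompting: prepend a few input/output demonstrations (few-shot style).
    shots = "\n\n".join(f"Code:\n{c}\nSummary: {s}" for c, s in examples)
    return f"{shots}\n\nCode:\n{code}\nSummary:"

def task_specific_prompt(code: str) -> str:
    # Task-specific prompting: add extra guidance tailored to the task on top of the basic prompt.
    return ("Summarize the following code in one sentence. "
            "Mention the main purpose and the key inputs/outputs; do not restate the code.\n"
            f"{code}")
```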
1.2 What’s the result?
1. Our results indicate that GPT-4 with prompt engineering does not consistently outperform fine-tuned models.
(1) In code summarization, GPT-4 with task-specific prompting outperforms the top fine-tuned model by 8.33 percentage points in BLEU score.
(2) In code generation, GPT-4 is outperformed by fine-tuned models by 28.3 percentage points on the MBPP dataset.
(3) In code translation, GPT-4 shows mixed results.
2. The user study shows that GPT-4 with conversational prompts (incorporating human feedback during the interaction) significantly improves performance compared to automated prompting: by 15.8 percentage points for code summarization, 18.3 for code generation, and 16.1 for code translation.
3. We identified 9 types of conversational prompts from the chat logs with GPT-4 (a minimal interaction-loop sketch follows at the end of this subsection):
(1) Request improvements with certain keywords;
(2) Provide more context;
(3) Add specific instructions; -> found to be the most common type for the code generation and translation tasks;
(4) Point mistakes then request fixes;
(5) Ask questions to guide the model toward the correct approach;
(6) Request verification;
(7) Request more examples;
(8) Request more detailed description;
(9) Request an alternative or different version of the generated output.
4. [52] found that the most prevalent translation bugs were data-related, i.e., data types, parsing input data, and output formatting issues.
[52] Lost in translation: A study of bugs introduced by large language models while translating code. 2024 ICSE
5. Our qualitative user study revealed that the key potential of GPT-4 lies in conversation-based prompting, which keeps human feedback in the loop.
6. Two takeaway messages from our results:
(1) To leverage LLMs to their best, a sequence of back-and-forth queries with the model may be needed.
(2) Human feedback still plays a crucial role in optimizing the prompts, as shown by conversational prompting. Thus, to remove humans from the loop completely, future studies are needed to analyze developer-written prompts more carefully and extract patterns that can be implemented as rules, fitness functions, rewards, and policies.
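A minimal sketch of the conversational, human-in-the-loop prompting workflow described above (query_llm and human_review are hypothetical stand-ins, not an API from the paper):

```python
def conversational_session(query_llm, human_review, task_prompt: str, max_turns: int = 5) -> str:
    # Each follow-up message corresponds to one of the 9 conversational prompt
    # categories above (request improvement, add context, point out a mistake, ...).
    messages = [{"role": "user", "content": task_prompt}]
    answer = query_llm(messages)
    for _ in range(max_turns):
        feedback = human_review(answer)  # returns a follow-up prompt, or None to accept
        if feedback is None:
            break
        messages += [{"role": "assistant", "content": answer},
                     {"role": "user", "content": feedback}]
        answer = query_llm(messages)
    return answer
```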
2: Can I use any of it in my work? If yes, how?
2.1 Research Questions
RQ1: how do prompt files (e.g., .cursorrules) compare to conversational prompts?
RQ2: how do participants refine their prompts when interacting with LLM-based tools and models (e.g., Cursor, GPT-4)?
RQ3: what is the impact of different prompt evolution patterns? This question aims to investigate which category of conversation prompts contributes to better results.
For code generation, the top 5 conversational prompt categories that contribute the most are:
(1) Add more context
(2) Add instructions
(3) Request improvements
(4) Ask questions
(5) Point mistake then fix
For code summarization: requesting improvements worked best.
For code translation: asking questions worked best.
2.2 Experiments
1. Three typical automated code-related software engineering tasks
(1) Code Summarization task:
The model generates a short natural language summary from a source code snippet (SC-to-NL).
[45] Few-shot training LLMs for project-specific code-summarization. 2022
[46] Automatic semantic augmentation of language model prompts (for code summarization). 2024
[47] On the evaluation of neural code summarization. 2022
(2) Code Generation task:
The model generates the corresponding code snippet from a natural language description (NL-to-SC).
[48] ClarifyGPT: A framework for enhancing LLM-based code generation via requirements clarification. 2024
[49] Competition-level code generation with AlphaCode. Science, 2022
[50] Self-collaboration code generation via ChatGPT. 2024
(3) Code Translation task:
The model translates a source code snippet into another programming language (SC-to-SC).
[51] Bridging gaps in LLM code translation: Reducing errors with call graphs and bridged debuggers. 2024
[53] Exploring and unleashing the power of large language models in automated code translation. 2024
2. Evaluation Metrics
(1) For code generation, we use pass@k [56], defined as the probability that at least one of the top k generated samples passes the unit tests; the paper sets k = 1 (a small estimator sketch follows the reference list below).
(2) For code summarization, we use BLEU [70], which assesses similarity to the ground truth using n-gram precision.
(3) For code translation, we use BLEU, ACC [71], and CodeBLEU [72], which combines n-gram precision, keyword matching, AST matching, and dataflow matching.
[56] Evaluating large language models trained on code. 2021
[70] BLEU: a method for automatic evaluation of machine translation. 2002
[71] Selecting and interpreting measures of thematic classification accuracy. 1997
[72] CodeBLEU: a method for automatic evaluation of code synthesis. 2020
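For reference, a minimal sketch of the unbiased pass@k estimator from [56], assuming n samples are generated per problem and c of them pass the unit tests; at k = 1 (as used in the paper) it reduces to c/n:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples, drawn without replacement from
    # n generations of which c are correct, passes the unit tests.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 correct -> pass@1 = 0.3
assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-9
```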
3: Sparked ideas, thought or questions.
1. Traditional conversational prompts vs. prompt files (e.g., .cursorrules): which performs better?
2. The paper below reports that AI-generated code performs worse in "testing", although it performs well on other tasks.
Can LLMs Generate Higher Quality Code Than Human? An Empirical Study. 2025
This paper gives another view: GPT-4-generated code often seems plausible and is executable, but it fails to pass the tests, e.g., due to off-by-one errors, imprecise logical operators (choosing between the and and or operators), or missing intermediate steps that are not mentioned in the NL description (a tiny illustration appears at the end of this item). A possible reason is that, since the model is not fine-tuned, it is harder for it to generate code that requires project-/data-specific knowledge, which could otherwise be leveraged to produce functionally correct code.
Also, an experienced developer argues that AI-generated code cannot be used in industry-level projects, because it does not account for the business logic, the architecture of the project, etc. And code that "cannot pass the tests" cannot be shipped.
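A tiny, hypothetical illustration (my own example, not taken from the paper) of the plausible-but-test-failing bugs mentioned above:

```python
# Intended behavior: return True only if value lies strictly between lo and hi.

def in_range_buggy(value: int, lo: int, hi: int) -> bool:
    # Typical slips: off-by-one boundaries (<=) and the wrong logical operator (or).
    return lo <= value or value <= hi

def in_range_fixed(value: int, lo: int, hi: int) -> bool:
    return lo < value < hi

assert in_range_fixed(5, 1, 10) and not in_range_fixed(1, 1, 10)
assert in_range_buggy(1, 1, 10)  # looks plausible and runs, but violates the boundary requirement
```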
4: Knowledge or useful tools, tricks.
1. Two primary methodologies are used in LLMs for code: prompt engineering and fine-tuning.
Prompt engineering: involves applying different strategies to query LLMs, like ChatGPT;
Fine-tuning: further adapts pre-trained models, such as CodeBERT, by training them on task-specific data. [So, CodeBERT is a pre-trained model that is then adapted to task-specific data? Yes: CodeBERT is a RoBERTa-style encoder pre-trained on paired natural language and code, and it is fine-tuned on downstream task data.]
2. Datasets: HumanEval, MBPP
3. MRR – Mean Reciprocal Rank. What’s this?
"For each category of conversational prompt, we calculate the mean reciprocal rank (MRR) to find which type of prompt constitutes the most to a higher rank."
4. Studies have explored LLMs and prompt engineering to tackle code tasks with various prompting strategies, such as:
-> basic prompting
-> in-context learning:
[29] Constructing effective in-context demonstration for code intelligence: an empirical study. 2023
They identified 3 key factors in in-context learning for code tasks: selection, order, and number of examples.
They found that both similarity and diversity in example selection were crucial for performance and stability (an illustrative selection sketch appears after this list).
-> task-specific prompting:
-> chain-of-thought prompting:
[30] Prompting is all you need: Automated Android bug replay with large language models. 2024
They introduced AdbGPT, an LLM-based approach for reproducing bugs using few-shot learning and chain-of-thought reasoning.
-> auto-prompting
-> soft prompting
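An illustrative sketch of similarity-based demonstration selection in the spirit of [29]'s finding above; the token-overlap similarity is my own simplification, and the study itself also examines the ordering and the number of examples:

```python
def select_demonstrations(query_code: str, pool: list[dict], k: int = 3) -> list[dict]:
    # Rank candidate examples (each a {"code": ..., "summary": ...} dict) by
    # token-level Jaccard similarity to the query and keep the top k.
    def jaccard(a: str, b: str) -> float:
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0
    return sorted(pool, key=lambda ex: jaccard(query_code, ex["code"]), reverse=True)[:k]
```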