Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code

Paper: Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code. 2025


1 What do they do, and what’s the result?


1.1 What do they do?

1. Two primary methodologies are used in LLMs for code: prompt engineering and fine-tuning. In this paper, they conduct a comparative analysis between these two methodologies.

 

2. Quantitative analysis of GPT-4:

(1) evaluate GPT-4 using 3 prompt engineering strategies: basic prompting, in-context learning, and task-specific prompting.

Basic prompting: directly query GPT-4 with the input (code or description) and ask it to generate solutions in the form of the desired output.

In-context prompting: together with the basic prompt, we give a set of input/output examples to GPT-4. This idea is very similar to few-shot learning in the context of fine-tuned language models.

Task-specific prompting: together with the basic prompt, we design additional task-tailored prompts to guide GPT-4 toward better results for each task (see the prompt sketch after item (2)).

 

(2) compare GPT-4 against 17 fine-tuned models across 3 tasks: code summarization, code generation, and code translation.
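A minimal sketch of how the three prompting strategies in (1) could be expressed in code; the templates, example, and model name are illustrative, not the paper's exact prompts:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    code = "def add(a, b):\n    return a + b"

    # (1) Basic prompting: only the task instruction and the input.
    basic = f"Summarize the following Python function in one sentence:\n{code}"

    # (2) In-context prompting: prepend input/output examples (few-shot style).
    examples = (
        "Code: def is_even(n): return n % 2 == 0\n"
        "Summary: Checks whether a number is even.\n\n"
    )
    in_context = examples + basic

    # (3) Task-specific prompting: add task-tailored guidance to the basic prompt.
    task_specific = (
        "You are summarizing code for documentation; focus on the function's "
        "purpose, not its implementation details.\n" + basic
    )

    for prompt in (basic, in_context, task_specific):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        print(response.choices[0].message.content)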

 

3. A qualitative study involving 37 users.


1.2 What’s the result?

1. Our results indicate that GPT-4 with prompt engineering does not consistently outperform fine-tuned models.

(1) In code summarization, GPT-4 with task-specific prompting outperforms the top fine-tuned model by 8.33 percentage points in BLEU score.

(2) In code generation, GPT-4 is outperformed by fine-tuned models by 28.3 percentage points on the MBPP dataset.

(3) In code translation, GPT-4 shows mixed results.

 

2. The user study shows that GPT-4 with conversational prompts, i.e., incorporating human feedback during the interaction, significantly improves performance compared to automated prompting: by 15.8 percentage points for code summarization, 18.3 for code generation, and 16.1 for code translation.

 

3. We identified 9 types of conversational prompts from the chat logs with GPT-4:

(1) Request improvements with certain keywords;

(2) Provide more context;

(3) Add specific instructions (found to be the most common category for the code generation and translation tasks);

(4) Point out mistakes, then request fixes;

(5) Ask questions to guide the model toward the correct approach;

(6) Request verification;

(7) Request more examples;

(8) Request a more detailed description;

(9) Request another or a different version of the generated output.
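As a concrete illustration of these categories, a few follow-up prompts a user might send within one conversation; the wording is hypothetical, not taken from the study's chat logs:

    # Hypothetical follow-up prompts within one conversation, tagged with the
    # category they instantiate (assistant replies elided). Wording is made up.
    followups = [
        ("Provide more context",
         "The CSV fields may be quoted and can contain commas."),
        ("Add specific instructions",
         "Use only the standard library's csv module."),
        ("Point out mistakes, then request fixes",
         "Your version splits inside quoted fields; please fix that."),
        ("Request verification",
         "Show the output for the input '\"a,b\",c' to verify the fix."),
        ("Request another version",
         "Give an alternative version that does not use the csv module."),
    ]
    for category, prompt in followups:
        print(f"[{category}] {prompt}")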

 

4. [52] found that the most prevalent translation bugs were data-related, i.e., issues with data types, parsing input data, and output formatting.

[52] Lost in translation: A study of bugs introduced by large language models while translating code. 2024 ICSE

 

5. Our qualitative user study revealed that the key potential of GPT-4 lies in conversation-based prompting, which keeps human feedback in the loop.

 

6. Two takeaway messages from our results:

(1) To leverage LLMs to their full potential, a sequence of back-and-forth queries with the model may be needed.

(2) Human feedback still plays a crucial role in optimizing the prompts, as shown by conversational prompting. Thus, to remove humans from the loop completely, future studies are needed to analyze developer-written prompts more carefully and extract patterns that can be implemented as rules, fitness functions, rewards, and policies.

 


2: Can I use any of it in my work? If yes, how?


2.1 Research Questions

RQ1: How do prompt files (e.g., .cursorrules) compare to conversational prompts?

 

RQ2: How do participants refine their prompts when interacting with LLM-based tools (e.g., Cursor, GPT-4)?

 

RQ3: What is the impact of different prompt evolution patterns? This question aims to investigate which category of conversational prompts contributes to better results.

For code generation, the top 5 conversational prompt categories that contribute the most are:

(1) Add more context

(2) Add instructions

(3) Request improvements

(4) Ask questions

(5) Point out mistakes, then request fixes

For code summarization, requesting improvements worked best.

For code translation, asking questions worked best.

 


2.2 Experiments

1. Three typical automated code-related software engineering tasks

(1) Code Summarization task:

The model generates a short natural language summary from a source code snippet (SC-to-NL).

[45] Few-shot training llms for project-specific code-summarization. 2022

[46] Automatic semantic augmentation of language model prompts (for code summarization). 2024

[47] On the evaluation of neural code summarization. 2022

(2) Code Generation task:

The model generates the corresponding code snippet from a natural language description (NL-to-SC).

[48] Clarifygpt: A framework for enhancing llm-based code generation via requirements clarification. 2024

[49] Competition-level code generation with alphacode. 2022 Science.

[50] Self-collaboration code generation via chatgpt. 2024

(3) Code Translation task:

The model translates a source code snippet into another programming language (SC-to-SC).

[51] Bridging gaps in llm code translation: Reducing errors with call graphs and bridged debuggers. 2024

[53] Exploring and unleashing the power of large language models in automated code translation. 2024
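To make the three directions concrete, one toy input/output pair per task; the snippets are made up, not from the benchmarks:

    # One toy (input, output) pair per task.
    code_summarization = {  # SC-to-NL
        "input":  "def add(a, b): return a + b",
        "output": "Returns the sum of two numbers.",
    }
    code_generation = {     # NL-to-SC
        "input":  "Write a function that returns the sum of two numbers.",
        "output": "def add(a, b): return a + b",
    }
    code_translation = {    # SC-to-SC (here, Python to Java)
        "input":  "def add(a, b): return a + b",
        "output": "static int add(int a, int b) { return a + b; }",
    }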

 

2. Evaluation Metrics

(1) For code generation, we use pass@k [56], defined as the probability that at least one of the top k generated samples passes the unit tests; k is set to 1 (a small estimator sketch follows the references below).

(2) For code summarization, we use BLEU [70], which assesses similarity to the ground truth using n-gram precision.

(3) For code translation, we use BLEU, ACC [71], and CodeBLEU [72], which combines n-gram precision, keyword matching, AST matching, and dataflow matching.

[56] Evaluating large language models trained on code. 2021

[70] Bleu: a method for automatic evaluation of machine translation. 2002

[71] Selecting and interpreting measures of thematic classification accuracy. 1997

[72] Codebleu: a method for automatic evaluation of code synthesis. 2020
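A small sketch of the unbiased pass@k estimator from [56], plus a corpus-level BLEU computation (using sacrebleu here, which may differ from the paper's exact BLEU variant; numbers and strings are made up):

    import numpy as np
    import sacrebleu  # one common BLEU implementation; may differ from the paper's variant

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator from [56]: n samples per problem, c of them correct."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Toy numbers: 10 generated samples for one problem, 3 pass the unit tests.
    print(pass_at_k(n=10, c=3, k=1))  # ~0.3

    # Corpus BLEU between a generated summary and a reference (made-up strings).
    hyps = ["returns the sum of two numbers"]
    refs = [["return the sum of two integers"]]
    print(sacrebleu.corpus_bleu(hyps, refs).score)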

 


3: Sparked ideas, thought or questions.


1. Traditional conversational prompts vs. prompt files (e.g., .cursorrules): which performs better?

 

2. The paper below reports that AI-generated code performs worse on "testing", although it performs well on other tasks.

Can LLMs Generate Higher Quality Code Than Human? An Empirical Study. 2025

While this paper gives another view: GPT-4-generated code often looks plausible and is executable, but it fails to pass the tests, e.g., due to off-by-one errors, imprecise logical operators (choosing between the "and" and "or" operators), or missing intermediate steps that are not mentioned in the NL description. A possible reason is that, since GPT-4 is not fine-tuned, it is harder for it to generate code that requires project-/data-specific knowledge, which could otherwise be leveraged to produce functionally correct code.
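A hypothetical illustration (not taken from the paper) of the first two failure modes:

    # Hypothetical illustrations of the failure modes mentioned above.

    # Off-by-one error: intended to sum 1..n, but range(n) stops at n - 1.
    def sum_first_n(n):
        return sum(range(n))          # buggy: computes 0 + 1 + ... + (n - 1)
        # return sum(range(1, n + 1))  # fixed

    # Imprecise logical operator: an in-range check needs "and", not "or".
    def in_range(x, lo, hi):
        return x >= lo or x <= hi      # buggy: always True whenever lo <= hi
        # return lo <= x <= hi         # fixed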

Also, an experienced developer argues that AI-generated code cannot be used in industry-level projects, because it does not account for the business logic, the architecture of the project, etc. And "cannot pass the tests" means the code cannot be used.

 


4: Knowledge or useful tools, tricks.


1. Two primary methodologies are used in LLMs for code: prompt engineering and fine-tuning.

Prompt engineering: involves applying different strategies to query LLMs, like ChatGPT;

Fine-tuning: further adapts pre-trained models, such as CodeBERT, by training them on task-specific data. [So, CodeBERT is a pre-trained model that can be further trained on task-specific data?]
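A minimal sketch of what "further training on task-specific data" looks like with Hugging Face Transformers and CodeBERT; the task (binary code classification), labels, and hyperparameters are placeholders, not the paper's setup:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/codebert-base", num_labels=2  # placeholder: binary code classification
    )

    # Toy task-specific data: (code snippet, label) pairs.
    snippets = ["def add(a, b): return a + b", "while True: pass"]
    labels = torch.tensor([0, 1])

    batch = tokenizer(snippets, padding=True, truncation=True, return_tensors="pt")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    outputs = model(**batch, labels=labels)  # loss against the task labels
    outputs.loss.backward()
    optimizer.step()  # one gradient step; a real run iterates over many batches/epochs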

 

2. Datasets: HumanEval, MBPP
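A quick way to load both benchmarks, assuming the Hugging Face Hub identifiers commonly used for them ("openai_humaneval" and "mbpp") and their usual field names:

    from datasets import load_dataset

    humaneval = load_dataset("openai_humaneval", split="test")
    mbpp = load_dataset("mbpp", split="test")

    print(humaneval[0]["prompt"])    # function signature + docstring to complete
    print(mbpp[0]["text"])           # natural-language task description
    print(mbpp[0]["test_list"][0])   # one of the unit tests used for pass@k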

 

3. MRR (Mean Reciprocal Rank): the average, over queries, of the reciprocal of the rank at which the relevant result appears.

“For each category of conversational prompt, we calculate the mean reciprocal rank (MRR) to find which type of prompt contributes the most to a higher rank.”
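A minimal sketch with made-up numbers (each entry is the rank a prompt category achieved in one conversation):

    # Mean Reciprocal Rank: average of 1/rank over queries. Ranks below are made up.
    def mean_reciprocal_rank(ranks):
        return sum(1.0 / r for r in ranks) / len(ranks)

    # Hypothetical ranks of "Add specific instructions" across 4 conversations.
    print(mean_reciprocal_rank([1, 2, 1, 4]))  # (1 + 0.5 + 1 + 0.25) / 4 = 0.6875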

 

4. Studies have explored LLMs and prompt engineering to tackle code tasks with various prompting strategies, such as:

-> basic prompting

-> in-context learning:

[29] Constructing effective in-context demonstration for code intelligence: an empirical study. 2023

They identified 3 key factors in in-context learning for code tasks: selection, order, and number of examples.
They found that both similarity and diversity in example selection were crucial for performance and stability (a small selection sketch follows this list).

-> task-specific prompting:

-> chain-of-thought prompting:

[30] Prompting is all you need: Automated android bug replay with large language models. 2024

They introduced AdbGPT, an LLM-based approach for reproducing bugs using few-shot learning and chain-of-thought reasoning.

-> auto-prompting

-> soft prompting
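A small sketch of the similarity-based demonstration selection idea studied in [29]; the encoder, candidate pool, and k are placeholders, not [29]'s actual setup:

    # Sketch of similarity-based demonstration selection for in-context learning.
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    pool = [  # (code, summary) candidate demonstrations
        ("def is_even(n): return n % 2 == 0", "Checks whether a number is even."),
        ("def area(r): return 3.14159 * r * r", "Computes the area of a circle."),
        ("def rev(s): return s[::-1]", "Reverses a string."),
    ]
    query_code = "def is_odd(n): return n % 2 == 1"

    # Rank candidates by cosine similarity to the query and keep the top k = 2.
    pool_emb = embedder.encode([code for code, _ in pool], convert_to_tensor=True)
    query_emb = embedder.encode(query_code, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, pool_emb)[0]
    top_k = scores.argsort(descending=True)[:2].tolist()

    # Assemble the few-shot prompt from the selected demonstrations.
    prompt = "".join(f"Code: {pool[i][0]}\nSummary: {pool[i][1]}\n\n" for i in top_k)
    prompt += f"Code: {query_code}\nSummary:"
    print(prompt)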