Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy

Paper: Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy. 2025


1 What do they do, and what’s the result?


1.1 What do they do?

They explore the trade-off between model accuracy and energy consumption to help developers select a suitable language model.

They investigate the performance of 18 LLMs on software development tasks, running them on both a commodity GPU and a powerful AI-specific GPU.

 


1.2 What’s the result?

Employing a larger LLM with a higher energy budget does not always translate into significantly improved accuracy.

Quantized versions of large models generally offer better efficiency and accuracy compared to full-precision versions of medium-sized ones.

No single model is suitable for all types of software development tasks.

 


1.3 Others

There is a growing interest in local solutions, with many individuals and organizations seeking to set up their own AI coding assistant using open-access, often smaller language models.

 


2: Can I use any of it in my work? If yes, how?


2.1 For introduction/background

1. There is a growing interest in local solutions, with many individuals and organizations seeking to set up their own AI coding assistant using open-access, often smaller language models.

2. The adoption of generative AI nearly doubled in under six months [2].

3. 76% of respondents in Stack Overflow’s annual survey reported that they currently use or plan to use AI code assistants [3].

4. Developers who participated in their trial quickly integrated GitHub Copilot into their daily workflows and found it extremely valuable [4].

[2] AI at Work Is Here. Now Comes the Hard Part. 2024

[3] Developers get by with a little help from AI: Stack Overflow knows code-assistant pulse survey results. 2024

[4] Research: Quantifying GitHub Copilot's impact in the enterprise with Accenture. 2024

 


2.2 For experiments

1. Which datasets can be used in my experiments?

See 4 – 1

 

2. How to assess the outputs?

(1) See 4 – 2

(2) See 4 – 3

 


3: Sparked ideas, thoughts, or questions.


None.

 


4: Knowledge or useful tools, tricks.


1. Datasets: HumanEval, HumanEvalPack
In studies assessing the coding proficiency of LLMs, HumanEval [21] is the most widely used benchmark and has become the de facto standard for this purpose.
It consists of 164 hand-written Python programming problems, and each problem comes with its own set of test cases.
HumanEvalPack [20] is an extended version of HumanEval.
It includes more rigorous test cases and encompasses coding tasks covering six programming languages, such as:
Code Repair (bug fixing),
Code Explanation (generating docstrings),
Code Synthesis (generating code), and
Test Assertion Generation.
(A loading sketch follows.)
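
As a side note (not from the paper), a minimal sketch of loading HumanEval, assuming the Hugging Face datasets library and the openai_humaneval dataset ID; the field names are assumptions as well and may need adjusting.

# Minimal sketch: loading HumanEval with the Hugging Face datasets library.
# The dataset ID "openai_humaneval" and the field names below are assumptions.
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")  # 164 Python problems

problem = humaneval[0]
print(problem["task_id"])      # e.g. "HumanEval/0"
print(problem["prompt"])       # function signature + docstring given to the model
print(problem["test"])         # hand-written test cases used for grading
print(problem["entry_point"])  # name of the function under test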

2. To assess the correctness of the outputs generated by LLMs, they employed the mean pass@k (mean success rate) evaluation metric, as defined in the Codex evaluation setup (see the HumanEval reference [21]).

It considers a problem solved if any of the k generated solutions passes all test cases.

They focus on pass@1, which gives the probability of the model solving a problem in one try. Specifically, for code generation tasks, the correctness of the generated code is determined by whether or not it passes all the hand-written test assertions provided in the dataset. (An estimator sketch follows.)
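
For reference, a minimal sketch of the unbiased pass@k estimator defined in the Codex/HumanEval paper [21]; the (n, c) example values below are hypothetical.

# pass@k = 1 - C(n - c, k) / C(n, k), averaged over problems, where n is the
# number of generated samples per problem and c the number that pass all tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples drawn from n is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Mean pass@1 over a benchmark: average the per-problem estimates.
results = [(10, 3), (10, 0), (10, 10)]  # hypothetical (n, c) pairs, one per problem
mean_pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"mean pass@1 = {mean_pass_at_1:.3f}")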

 

3. To assess the thoroughness of the generated tests, they combined the canonical solution with the generated test assertions in a Python file and measured coverage using coverage.py [56], the Python code coverage analysis tool. (A coverage sketch follows.)
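
A minimal sketch of this coverage step, using the coverage.py command-line interface [56]; the file name combined_0.py is hypothetical and stands for a file containing the canonical solution followed by the generated test assertions.

# Minimal sketch: measure how thoroughly the generated tests exercise the canonical solution.
# combined_0.py is a hypothetical file: canonical solution + generated test assertions.
import subprocess

subprocess.run(["coverage", "run", "combined_0.py"], check=True)  # execute the assertions under coverage.py
subprocess.run(["coverage", "report", "-m"], check=True)          # print the line-coverage summary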

[20] OctoPack: Instruction tuning code large language models. 2023

[21] Evaluating large language models trained on code. 2021

[56] Coverage.py: The code coverage tool for Python.