Paper: Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy. 2025
1 What do they do, and what’s the result?
1.1 What do they do?
They explore the trade-off between model accuracy and energy consumption to help developers select a suitable language model.
They investigate the performance of 18 LLMs in software development tasks on both a commodity GPU and a powerful AI-specific GPU.
1.2 What’s the result?
Employing a big LLM with a higher energy budget does not always translate to significantly improved accuracy.
Quantized versions of large models generally offer better efficiency and accuracy compared to full-precision versions of medium-sized ones.
No single model is suitable for all types of software development tasks.
1.3 Others
There is a growing interest in local solutions, with many individuals and organizations seeking to set up their own AI coding assistant using open-access, often smaller language models.
2 Can I use any of it in my work? If yes, how?
2.1 For introduction/background
1. There is a growing interest in local solutions, with many individuals and organizations seeking to set up their own AI coding assistant using open-access, often smaller language models.
2. The adoption of generative AI nearly doubled in under six months [2].
3. 76% of respondents in Stack Overflow’s annual survey reported that they currently use or plan to use AI code assistants [3].
4. Developers who participated in the GitHub Copilot trial at Accenture quickly integrated it into their daily workflows and found it extremely valuable [4].
[2] AI at Work Is Here. Now Comes the Hard Part. 2024
[4] Research: Quantifying GitHub Copilot's Impact in the Enterprise with Accenture. 2024
2.2 For experiments
1. Which datasets can be used in my experiments?
See 4 – 1
2. How to assess the outputs?
(1) See 4 – 2
(2) See 4 – 3
3 Sparked ideas, thoughts, or questions.
No.
4 Knowledge, useful tools, or tricks.
2. To assess the correctness of the outputs generated by LLMs, they employ the mean pass@k (mean success rate) metric, as defined in the Codex/HumanEval evaluation [21].
It counts a problem as solved if any of the k generated solutions passes all test cases.
They focus on pass@1, which gives the probability of the model solving a problem in a single attempt. Specifically, for code generation tasks, the correctness of the generated code is determined by whether it passes all the hand-written test assertions provided in the dataset (see the sketch below).
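A minimal sketch of the unbiased pass@k estimator from the Codex/HumanEval paper [21], pass@k = 1 - C(n-c, k) / C(n, k) averaged over problems, where n is the number of samples generated per problem and c the number that pass all tests; the per-problem counts below are made up for illustration:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimator 1 - C(n-c, k) / C(n, k), computed as a stable product.
        # n: samples generated for the problem; c: samples that pass all tests.
        if n - c < k:
            return 1.0
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Mean pass@1 over a benchmark is the average of the per-problem estimates.
    per_problem = [(10, 7), (10, 0), (10, 10)]  # hypothetical (n, c) pairs
    mean_pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)
    print(f"mean pass@1 = {mean_pass_at_1:.3f}")  # (0.7 + 0.0 + 1.0) / 3 ≈ 0.567

For k = 1 this reduces to c/n, the fraction of generated samples that pass all tests.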
3. To assess the thoroughness of the generated tests, they combined the canonical solution with the generated test assertions in a Python file and measured coverage using coverage.py, the Python code coverage analysis tool [56] (see the sketch below).
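A minimal sketch of this measurement (the file names and the toy solution/tests are hypothetical; unlike the paper, which combines everything in one file, the solution sits in its own module here so the coverage report isolates it):

    import subprocess
    import sys
    import textwrap

    # Hypothetical canonical solution and LLM-generated test assertions.
    solution = textwrap.dedent("""\
        def add(a, b):
            return a + b
    """)
    tests = textwrap.dedent("""\
        from solution import add

        assert add(1, 2) == 3
        assert add(-1, 1) == 0
    """)

    with open("solution.py", "w") as f:
        f.write(solution)
    with open("test_solution.py", "w") as f:
        f.write(tests)

    # Run the generated assertions under coverage.py [56], then report line
    # coverage; the solution.py row is the thoroughness figure of interest.
    subprocess.run([sys.executable, "-m", "coverage", "run", "test_solution.py"], check=True)
    subprocess.run([sys.executable, "-m", "coverage", "report", "-m"], check=True)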
[20] Octopack: Instruction tuning code large language models. 2023
[21] Evaluating large language models trained on code. 2021
[56] Coverage.py: The code coverage tool for Python.