Code Metrics Tools or Static Analysis Tools

This blog post summarises several code metrics tools introduced in the paper ‘Can LLMs Generate Higher Quality Code Than Humans? An Empirical Study’. Some content is referenced directly from the paper; for more details, please see the paper.

1 Code Quality Metric

1.1 Code Quality Metric

1.2 Human-centric metrics

1.3 Critical code quality metrics

2 Tools for Code Quality Metric

To evaluate code quality across human-written and AI-generated codebases, we leverage four widely used static analysis tools: Radon, Complexipy, Bandit, and Pylint. Each of these tools provides specific insights into a different aspect of code quality. Collectively, they provide a multi-faceted evaluation covering complexity, readability, security, and stylistic conformity, which is essential for an objective comparison between human-written and AI-generated code.

2.1 Radon

Radon is a Python tool that computes various metrics from source code. Radon can compute McCabe’s complexity (i.e. cyclomatic complexity), raw metrics (LOC, LLOC, SLOC, comment lines, etc.), Halstead metrics, and the Maintainability Index. Metrics like cyclomatic complexity and the maintainability index help identify areas of high logical complexity that may affect readability and maintainability. Radon outputs numerical scores that represent the complexity of individual functions or modules, with higher values indicating greater complexity.
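
As a rough sketch of driving Radon from Python rather than its CLI (the command-line equivalents are radon cc, radon raw, radon mi, and radon hal), assuming Radon is installed; module and attribute names below follow recent Radon releases and may differ slightly in older ones:

from radon.complexity import cc_visit
from radon.raw import analyze
from radon.metrics import mi_visit, h_visit

source = '''
def classify(x):
    if x < 0:
        return "negative"
    elif x == 0:
        return "zero"
    return "positive"
'''

# Cyclomatic complexity, reported per function/class block
for block in cc_visit(source):
    print(block.name, block.complexity)

# Raw metrics: LOC, LLOC, SLOC, comment lines, blank lines, ...
print(analyze(source))

# Maintainability Index (multi=True counts multi-line strings as comments)
print(mi_visit(source, multi=True))

# Halstead metrics aggregated over the whole snippet
print(h_visit(source).total)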

2.1.1 Cyclomatic Complexity

Cyclomatic Complexity (CC) corresponds to the number of decisions a block of code contains, plus 1. This number (also called the McCabe number) is equal to the number of linearly independent paths through the code.
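
For example, the following hypothetical function contains three decision points (a for loop, an if, and an elif), so its cyclomatic complexity is 3 + 1 = 4:

def clamp_and_sum(scores):
    total = 0
    for s in scores:       # decision 1
        if s < 0:          # decision 2
            continue
        elif s > 100:      # decision 3
            s = 100
        total += s
    return total           # CC = 3 decisions + 1 = 4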

2.1.2 Maintainability Index

Maintainability Index (MI) is a software metric which measures how maintainable (easy to support and change) the source code is. The maintainability index is calculated from a factored formula combining SLOC (Source Lines Of Code), cyclomatic complexity, and Halstead volume.
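
The exact coefficients vary between implementations; a common variant, close to the one Radon documents, rescales the score to a 0 to 100 range and adds a comment-weight bonus. The sketch below uses that formulation with made-up input numbers, purely for illustration:

import math

def maintainability_index(halstead_volume, cyclomatic_complexity, sloc, comment_fraction=0.0):
    """Sketch of a common MI variant, rescaled to 0-100 (higher is more maintainable)."""
    raw = (171
           - 5.2 * math.log(halstead_volume)
           - 0.23 * cyclomatic_complexity
           - 16.2 * math.log(sloc)
           + 50 * math.sin(math.sqrt(2.4 * comment_fraction)))  # reward for comments
    return max(0.0, raw * 100 / 171)

# Hypothetical values, only to show how the inputs pull the score around:
print(maintainability_index(halstead_volume=250.0, cyclomatic_complexity=6, sloc=40))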

2.1.3 Halstead’s Metrics

Halstead’s metrics form a set of quantitative measures designed to assess the complexity of software based on operator and operand usage within the code. Using the base counts of operators and operands, Halstead’s metrics derive additional measures that estimate the cognitive and structural effort involved in understanding and maintaining code (the standard formulas are sketched in code after this list):

– Program Vocabulary (η): Defined as the sum of distinct operators and distinct operands. This metric reflects the unique linguistic elements in the code, with a larger vocabulary suggesting a more complex codebase.

– Program Length (N): The total number of operators and operands, representing the size of the program in terms of tokens used. Program length can indicate code verbosity, where excessive length may imply redundancy or inefficiency.

– Volume (V): Volume measures the size of the program in “mental space” and indicates the amount of information the code contains. Higher volume implies increased difficulty in understanding the program, as it occupies more cognitive resources.

– Difficulty (D): This metric quantifies the difficulty of understanding the program. Difficulty highlights how challenging the program might be for developers, with higher difficulty suggesting a more intricate flow of logic.

– Effort (E): Effort provides an estimation of the mental workload needed to implement or maintain the code. Effort serves as an indicator of development time, with higher values suggesting increased complexity and, consequently, a greater likelihood of defects.

– Time required to program (T): This metric provides an estimate of the time required for a programmer to implement or comprehend the code.

– Number of delivered bugs (B): This metric estimates the potential number of defects in the code, based on its complexity.
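
The snippet below restates Halstead’s standard formulas for these measures; it is a plain re-statement of the definitions, not Radon’s implementation, and the counts fed into it are hypothetical:

import math

def halstead_metrics(n1, n2, N1, N2):
    """n1/n2: distinct operators/operands; N1/N2: total operators/operands."""
    vocabulary = n1 + n2                       # η = η1 + η2
    length = N1 + N2                           # N = N1 + N2
    volume = length * math.log2(vocabulary)    # V = N * log2(η)
    difficulty = (n1 / 2) * (N2 / n2)          # D = (η1 / 2) * (N2 / η2)
    effort = difficulty * volume               # E = D * V
    time = effort / 18                         # T in seconds (Stroud number 18)
    bugs = volume / 3000                       # B (Halstead also used E**(2/3) / 3000)
    return {"vocabulary": vocabulary, "length": length, "volume": volume,
            "difficulty": difficulty, "effort": effort, "time": time, "bugs": bugs}

print(halstead_metrics(n1=10, n2=15, N1=40, N2=55))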

2.2 Complexipy

Cognitive complexity (CogC) captures the mental effort required to understand code, emphasizing readability and simplicity. The output consists of cognitive complexity scores, where lower values are preferable because they indicate easier-to-understand code. The complexipy tool checks the cognitive complexity of a Python file or function; if it is greater than the default threshold (15), the return code will be 1, otherwise it will be 0.
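
The two hypothetical functions below illustrate what drives cognitive complexity up: nesting and interrupted linear flow are penalised progressively (roughly +1 per construct plus +1 per level of nesting under the usual Sonar-style rules that tools like complexipy follow), whereas a flat, early-return structure stays cheap. Exact scores depend on the tool version. On the command line, complexipy is pointed at a file or directory, e.g. complexipy my_module.py.

# Deeply nested logic: each nested construct pays an extra nesting penalty.
def find_pairs(items, target):
    pairs = []
    for i in items:                  # +1
        for j in items:              # +2 (nested once)
            if i + j == target:      # +3 (nested twice)
                pairs.append((i, j))
    return pairs                     # cognitive complexity ≈ 6

# Flat, early-return logic of comparable size scores far lower.
def is_valid(user):
    if user is None:                 # +1
        return False
    if not user.get("active"):       # +1
        return False
    return user.get("age", 0) >= 18  # cognitive complexity ≈ 2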

2.3 Bandit

Bandit is a tool that performs static analysis to find potential security vulnerabilities in Python code. Bandit processes each file, builds an AST from it, and runs the appropriate plugins against the AST nodes. Once Bandit has finished scanning all the files, it generates a report. It scans for common security risks, such as insecure imports or improper use of cryptography. Bandit’s output includes a list of issues flagged with severity levels, enabling targeted remediation of critical vulnerabilities.
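
As an illustration, running bandit -r over a project containing code like the following deliberately insecure (and entirely hypothetical) snippet would typically produce findings for a hardcoded password, a weak hash function, and a shell-injection risk, each tagged with a severity and confidence level:

import hashlib
import subprocess

PASSWORD = "hunter2"                      # typically flagged: possible hardcoded password

def fingerprint(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()  # typically flagged: weak/insecure hash function

def run(command: str) -> int:
    # typically flagged: subprocess call with shell=True enables shell injection
    return subprocess.call(command, shell=True)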

2.4 Pylint

Pylint is a static code analysis tool that checks for errors, enforces a coding standard, looks for code smells, and can make suggestions about how the code could be refactored. Pylint locates and outputs messages for errors (E), potential refactorings (R), warnings (W), and code convention (C) violations in the code.
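
As a quick sketch, running pylint over a small module like the hypothetical one below would typically emit convention (C), warning (W), and error (E) messages, plus an overall score out of 10; the identifiers in the comments are Pylint’s usual message names, but the exact output depends on the Pylint version and configuration:

"""Small module written only to provoke typical Pylint messages."""

import os                    # W0611 (unused-import): 'os' is never used

def Add(first, second):      # C0103 (invalid-name): function names should be snake_case
    """Return the sum of the two arguments."""
    unused = first - second  # W0612 (unused-variable): 'unused' is never used
    return first + second

def broken():
    """Trigger an error-level message."""
    return undefined_name    # E0602 (undefined-variable)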


3 How to use these tools
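
Each of the four tools ships a command-line entry point, so one simple way to run them all over the same codebase is to shell out to them from a small driver script. The sketch below assumes the tools are installed and that my_project/ is a hypothetical target path; flags and report formats vary across versions, so treat it as a starting point rather than a reference:

import subprocess

TARGET = "my_project/"   # hypothetical path to the codebase under analysis

COMMANDS = [
    ["radon", "cc", "-s", TARGET],   # cyclomatic complexity, with scores shown
    ["radon", "mi", "-s", TARGET],   # maintainability index
    ["radon", "raw", TARGET],        # raw metrics: LOC, SLOC, comments, ...
    ["complexipy", TARGET],          # cognitive complexity
    ["bandit", "-r", TARGET],        # security issues, scanned recursively
    ["pylint", TARGET],              # expects an importable module or package path
]

for cmd in COMMANDS:
    print("$", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)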


4 TOPSIS

The Technique for Order of Preference by Similarity to Ideal Solution, TOPSIS, is a multi-criteria decision analysis method.

TOPSIS is based on the concept that the chosen alternative should have the shortest geometric distance from the positive ideal solution (PIS), or ideal best, and the longest geometric distance from the negative ideal solution (NIS), or ideal worst. It compares a set of alternatives by normalising the scores for each criterion and calculating the geometric distance between each alternative and the ideal alternative, which has the best score in each criterion.
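
The sketch below implements the standard TOPSIS steps (vector normalisation, weighting, ideal best/worst, Euclidean distances, closeness score) and then ranks the alternatives. The example matrix, weights, and benefit/cost flags are made-up placeholders, not values from the paper:

import numpy as np

def topsis(matrix, weights, benefit):
    """Rank alternatives (rows) against criteria (columns).

    matrix:  raw scores, shape (n_alternatives, n_criteria)
    weights: criterion weights summing to 1
    benefit: True where larger is better, False where smaller is better
    Returns closeness scores in [0, 1]; higher means closer to the ideal.
    """
    m = np.asarray(matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    benefit = np.asarray(benefit, dtype=bool)

    # 1. Vector-normalise each criterion, then apply the weights.
    v = m / np.linalg.norm(m, axis=0) * w

    # 2. Positive ideal (best) and negative ideal (worst) value per criterion.
    pis = np.where(benefit, v.max(axis=0), v.min(axis=0))
    nis = np.where(benefit, v.min(axis=0), v.max(axis=0))

    # 3. Euclidean distance of every alternative to both ideal points.
    d_pos = np.linalg.norm(v - pis, axis=1)
    d_neg = np.linalg.norm(v - nis, axis=1)

    # 4. Relative closeness to the ideal solution.
    return d_neg / (d_pos + d_neg)

# Hypothetical example: two codebases scored on CC, MI, and Bandit issue count.
scores = [[4.2, 71.0, 3],
          [5.1, 64.0, 1]]
weights = [0.4, 0.4, 0.2]
benefit = [False, True, False]   # lower CC, higher MI, fewer issues are better
closeness = topsis(scores, weights, benefit)
print(closeness, closeness.argsort()[::-1])  # scores and the resulting ranking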

5 Ranking the results


6 Other reflections