Bachelor Thesis / Software Engineering & Management

Measuring the Impact of AI-Generated Code in MERN Applications

We evaluated whether AI-generated code can serve as a realistic drop-in replacement in full-stack MERN applications. Across three open-source projects, selected files were regenerated with Phi 4 14B Reasoning Plus and Llama 3.1 70B Instruct, integrated one file at a time, and measured with static, runtime, performance, and security tooling.

Research Question

How does AI-generated code compare with human-written code across reliability, maintainability, good practice, performance, and security in realistic MERN applications?

Experiment

Three MERN projects, five target files per app, two model variants per file, and isolated branch-based replacements, so that each measurement could be attributed to a single generated file.

Measurement

SonarQube, ESLint, and a custom analyzer for static quality; Lighthouse for performance and visual stability; OWASP ZAP for active security probing; Wilcoxon signed-rank tests for paired analysis.
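To illustrate the paired-analysis step, the Wilcoxon signed-rank statistic for a human-vs-AI metric pair can be sketched in pure Python. This is a minimal illustration, not the thesis's actual analysis pipeline, and the metric values in the usage example are hypothetical:

```python
def wilcoxon_signed_rank(before, after):
    """Paired Wilcoxon signed-rank statistic W = min(W+, W-).

    Zero differences are discarded and tied absolute differences
    receive average ranks, as in the standard procedure.
    """
    diffs = [b - a for b, a in zip(before, after) if b != a]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # group tied absolute differences, then assign the average rank
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)

# hypothetical paired metric values (e.g. one score per target file)
w = wilcoxon_signed_rank([5, 6, 7, 8, 9], [4, 7, 5, 8, 6])
```

In practice a library routine such as SciPy's `scipy.stats.wilcoxon` would also return the p-value; the sketch above only computes the test statistic.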

Key Findings

AI-generated code was competitive with human-written code on many dimensions and improved Halstead effort, Halstead bugs, and composite Lighthouse performance.
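For context, Halstead effort and the related bug estimate are derived from operator and operand counts. A minimal sketch of the standard formulas follows; the counts in the usage example are hypothetical, and B = V/3000 is one common variant of the bug estimate:

```python
from math import log2

def halstead(n1, n2, N1, N2):
    """Core Halstead metrics from operator/operand counts.

    n1, n2: distinct operators and operands
    N1, N2: total operator and operand occurrences
    """
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * log2(vocabulary)   # V = N * log2(eta)
    difficulty = (n1 / 2) * (N2 / n2)    # D = (n1/2) * (N2/n2)
    effort = difficulty * volume         # E = D * V
    bugs = volume / 3000                 # common estimate B = V/3000
    return {"volume": volume, "difficulty": difficulty,
            "effort": effort, "bugs": bugs}

# hypothetical counts for a small file
metrics = halstead(n1=4, n2=5, N1=10, N2=12)
```

Lower effort and bug estimates for the regenerated files indicate code that is, by these measures, easier to comprehend and statistically less error-prone.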

Human-written code remained safer on security and visual stability: AI-generated files often introduced more ZAP findings and a higher cumulative layout shift (CLS).

Model choice mattered: Llama 3.1 tended to generate simpler, lower-complexity code, while Phi 4 produced fewer security findings.

Resources

Static Analysis Data
Runtime Analysis Data
Custom ESLint Analyzer