Bachelor Thesis / Software Engineering & Management
Measuring the Impact of AI-Generated Code in MERN Applications
We evaluated whether AI-generated code can serve as a realistic drop-in replacement in full-stack MERN applications. Across three open-source projects, selected files were regenerated with Phi-4 14B Reasoning Plus and Llama 3.1 70B Instruct, integrated one file at a time, and measured with static, runtime, performance, and security tooling.
Research Question
How does AI-generated code compare with human-written code in reliability, maintainability, adherence to good practices, performance, and security in realistic MERN applications?
Experiment
Three MERN projects, five target files per project, two model variants per file, and isolated branch-based replacements, so that each measurement could be attributed to a single generated file.
Measurement
SonarQube, ESLint, and a custom analyzer for static quality; Lighthouse for performance and visual stability; OWASP ZAP for active security probing; Wilcoxon signed-rank tests for paired statistical analysis.
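The paired analysis compares each metric on the human-written file against the same metric on its AI-generated replacement. As an illustration only (the thesis likely used a statistics package; the function name and normal-approximation details here are assumptions, not the thesis code), a minimal sketch of a paired Wilcoxon signed-rank test:

```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test (two-sided, normal approximation).

    x, y: paired metric values (e.g. human vs. AI-generated per file).
    Returns (W, p) where W is the smaller signed-rank sum.
    """
    # Drop zero differences, as the standard test does.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank |d| ascending, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = min(w_plus, w_minus)
    # Normal approximation to the null distribution of W.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd  # z <= 0 since w <= mean
    p = 1 + math.erf(z / math.sqrt(2))  # two-sided p = 2 * Phi(z)
    return w, min(p, 1.0)
```

For the small per-file samples in a study like this, an exact-distribution variant (or a library routine such as SciPy's `scipy.stats.wilcoxon`) would be preferable to the normal approximation sketched here.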
Key Findings
AI-generated code was competitive with human-written code on many dimensions, improving on Halstead effort, Halstead estimated bugs, and composite Lighthouse performance.
Human-written code remained stronger on security and visual stability: AI-generated files often introduced more ZAP findings and higher cumulative layout shift (CLS).
Model choice mattered: Llama 3.1 tended to generate simpler, lower-complexity code, while Phi-4 produced fewer security findings.