The problem with manual grading at scale
In modern education — particularly in computer science and programming courses — evaluating student assignments manually is time-consuming and difficult to scale. As cohort sizes grow, instructors struggle to deliver detailed, timely feedback for every submission. Traditional automated grading systems partially address this by checking whether a program passes predefined test cases, but they typically return only binary pass/fail results. That binary verdict tells a student what failed, not why it failed or how to fix it.
At the same time, Large Language Models (LLMs) have demonstrated a remarkable ability to generate human-like explanations. Relying on generative AI alone, however, introduces a different risk: hallucinated or factually incorrect feedback that can actively mislead a learner.
My thesis proposes a third path — a Hybrid Neuro-Symbolic Grading System that combines the precision of deterministic code evaluation with the explanatory power of generative AI, wrapped in a learning analytics layer that tracks student progress over time.
Deployment status: The frontend of this platform is live and accessible. A full end-to-end demo — including the code evaluation engine, AI feedback generator, and analytics dashboard — is currently in progress and will be available soon. The backend microservices are still being finalised for production deployment.
Core concept: hybrid neuro-symbolic architecture
The grading pipeline is built around two complementary layers of AI reasoning.
Symbolic evaluation (deterministic). Every submission first passes through objective code analysis tools — static analysers, compilers, linters, and unit testing frameworks. This layer verifies syntactic correctness and logical validity against predefined test cases, producing structured execution logs.
Generative AI feedback. Only after the deterministic layer has verified (or characterised) the program's behaviour does the system invoke a generative AI model. The AI receives a structured prompt grounded in the actual execution output, then generates explanatory feedback covering why the code failed, which concepts need attention, and concrete debugging guidance.
This verify-then-generate protocol is the key safety mechanism. Because the AI's response is anchored to real program output rather than speculation, hallucination risk is substantially reduced.
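As a minimal sketch of this protocol — with hypothetical function names and test-case format, not the thesis's actual API — the deterministic layer runs first, and the generative layer only ever sees a prompt built from its logged output:

```python
from dataclasses import dataclass, field

@dataclass
class EvalResult:
    passed: list = field(default_factory=list)
    failed: list = field(default_factory=list)  # (test_name, error detail)

def run_symbolic_layer(submission_fn, test_cases):
    """Deterministic layer: execute predefined test cases, log outcomes."""
    result = EvalResult()
    for name, args, expected in test_cases:
        try:
            actual = submission_fn(*args)
            if actual == expected:
                result.passed.append(name)
            else:
                result.failed.append((name, f"expected {expected!r}, got {actual!r}"))
        except Exception as exc:
            result.failed.append((name, f"raised {type(exc).__name__}: {exc}"))
    return result

def build_grounded_prompt(result):
    """Generative layer input: a prompt anchored to real execution output."""
    lines = [f"{len(result.passed)} passed, {len(result.failed)} failed."]
    for name, err in result.failed:
        lines.append(f"- {name}: {err}")
    lines.append("Explain why each failure occurred and suggest a fix.")
    return "\n".join(lines)

# Hypothetical student submission with an off-by-one bug
def student_sum_to_n(n):
    return sum(range(n))  # bug: should be range(n + 1)

tests = [("sum_to_3", (3,), 6), ("sum_to_0", (0,), 0)]
report = run_symbolic_layer(student_sum_to_n, tests)
prompt = build_grounded_prompt(report)
```

Because the prompt contains only observed results (which tests failed and how), the model explains real failures instead of inventing them.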
System architecture
The platform is built as a modular microservices system to ensure scalability and independent deployability of each component.
- Submission Interface — students submit programming assignments through the platform UI; each submission triggers the automated grading pipeline.
- Code Evaluation Engine — runs static analysis, compilation checks, unit test execution, and error detection. Records detailed execution logs including compilation errors, test-case failure counts, and performance metrics.
- AI Feedback Generator — receives structured prompts from the evaluation engine and returns natural-language explanations tailored to the student's specific errors and learning context.
- Data Logging System — every submission generates a rich telemetry record: compilation error counts, test-case failures, time between submissions, number of re-submissions, and execution latency.
- Learning Analytics Dashboard — aggregates submission telemetry into insights: student progress curves, repeated-error patterns, concept mastery trends, and early identification of at-risk learners.
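To make the service boundaries concrete, here is a hypothetical sketch of the structured record the Code Evaluation Engine might emit for downstream services — the field names are illustrative assumptions, not the platform's real schema:

```python
import json

# Hypothetical payload emitted by the Code Evaluation Engine after one run.
evaluation_payload = {
    "submission_id": "sub-001",
    "compile_ok": True,
    "tests": {"total": 12, "passed": 10, "failed": 2},
    "failures": [
        {"test": "test_empty_input", "error": "IndexError: list index out of range"},
    ],
    "latency_ms": 245,
}

# Both downstream services consume the same structured record:
# the AI Feedback Generator grounds its prompt in `failures`, while
# the Data Logging System persists the counts and latency metrics.
message = json.dumps(evaluation_payload)
restored = json.loads(message)
```

Passing one shared record between services keeps the feedback and analytics layers consistent: both describe the same observed execution, not independent interpretations of it.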
Learning analytics and longitudinal performance tracking
Traditional grading systems store only a final mark. This platform captures high-resolution learning telemetry from the entire coding process.
Tracked metrics include:
- Syntax error frequency per session
- Compilation latency
- Test-case failure patterns across attempts
- Time elapsed between error introduction and resolution
- Number of re-submissions per assignment
These metrics feed into Longitudinal Skill Profiles — continuous representations of how each student's abilities evolve across assignments and over the course. Instructors can immediately spot students who struggle with specific concepts, take disproportionately long to debug, or repeatedly make the same class of mistake — enabling early intervention rather than post-hoc remediation.
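A simplified sketch of how raw submission events could be folded into such a profile — the event fields and the trend heuristic are illustrative assumptions, not the thesis's exact aggregation logic:

```python
from collections import defaultdict

def build_skill_profiles(events):
    """Aggregate per-student submission events into simple trend profiles.

    Each event is a dict with illustrative fields: student, assignment,
    syntax_errors, tests_failed, resubmissions.
    """
    per_student = defaultdict(list)
    for e in events:
        per_student[e["student"]].append(e)

    summary = {}
    for student, evs in per_student.items():
        n = len(evs)
        summary[student] = {
            "assignments": n,
            "avg_syntax_errors": sum(e["syntax_errors"] for e in evs) / n,
            "avg_resubmissions": sum(e["resubmissions"] for e in evs) / n,
            # Crude early-warning flag: error rate trending upward.
            "worsening": evs[-1]["syntax_errors"] > evs[0]["syntax_errors"],
        }
    return summary

events = [
    {"student": "s1", "assignment": "a1", "syntax_errors": 2,
     "tests_failed": 1, "resubmissions": 1},
    {"student": "s1", "assignment": "a2", "syntax_errors": 5,
     "tests_failed": 3, "resubmissions": 4},
]
profiles = build_skill_profiles(events)
```

Even this toy aggregation shows the principle: the profile is a rolling view over the whole event stream, so a student whose error rate climbs across assignments is flagged long before a final grade exists.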
Custom metric: Time-Weighted Error Quotient
The research introduced a novel metric, the Time-Weighted Error Quotient (EQ), which quantifies the intensity and persistence of coding errors during development. By weighting error events by the time the student spent in an erroneous state, EQ provides a more nuanced picture of cognitive difficulty than simple error counts alone.
Correlating EQ with final grades confirmed a statistically significant relationship, validating EQ as a practical predictor of student struggle available well before final assessments.
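One plausible formalisation of the idea — the thesis text above does not give the exact formula, so this definition is an illustrative assumption — is the fraction of a session spent in an erroneous state, so a persistent error weighs more than one fixed quickly:

```python
def time_weighted_error_quotient(error_intervals, session_seconds):
    """Illustrative EQ: fraction of the session spent in an erroneous state.

    error_intervals: (start_s, end_s) pairs during which the code
    failed to compile or failed tests. Persistent errors accumulate
    more erroneous time and therefore a higher quotient.
    """
    if session_seconds <= 0:
        return 0.0
    erroneous = sum(end - start for start, end in error_intervals)
    return erroneous / session_seconds

# A student who stayed broken for 2 minutes of a 10-minute session:
eq = time_weighted_error_quotient([(0, 60), (300, 360)], 600)  # 0.2
```

Under this definition, two students with identical error counts can have very different EQ values, which is exactly the nuance a raw count misses.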
Experimental evaluation
The system was evaluated across three research questions using simulated course environments with multiple cohorts of student submissions.
| Research Question | Finding |
|---|---|
| Reliability of AI feedback | Hybrid verify-then-generate significantly reduced hallucinated explanations vs. direct LLM prompting |
| Pedagogical effectiveness | Analytics successfully detected performance patterns that predicted final grades, enabling early identification of struggling learners |
| System scalability | Microservices architecture handled high submission volumes with minimal latency under load testing |
Technologies used
Python · Machine Learning · Static Code Analysis · Unit Testing Frameworks · Generative AI Models · Microservices Architecture · REST APIs · Data Analytics
Key contributions
- Hybrid AI Grading Architecture — a novel framework combining static program analysis with generative AI to ensure accurate, non-hallucinated automated grading.
- Longitudinal Learning Analytics — a methodology for transforming raw submission event logs into continuous student learning profiles.
- Scalable Educational Platform — a microservices-based system capable of supporting large online programming courses.
- Data-Driven Student Support — an early-warning mechanism that identifies at-risk students through performance analytics before final exams.
What I learned
Building this system reinforced that reliable AI in high-stakes domains cannot be a black box. The verify-then-generate protocol was the hardest architectural decision to get right — balancing the latency cost of running deterministic checks first against the drastic improvement in feedback trustworthiness. It also deepened my appreciation for data engineering: the analytics layer is only as useful as the telemetry it ingests, and designing a clean, queryable event schema from day one saved enormous effort downstream.
