When Candidates Use Copilot: How to Assess Real Skill in the Age of AI Coding Assistants
80% of candidates now use AI on coding tests. Banning it does not work, and ignoring it is reckless. Here is how brief-driven, real-project assessment reveals genuine capability regardless of which tools were used.
Eighty percent of candidates are using large language models on your coding tests. Not some of the time. Not on the easy parts. On the whole thing.
That number comes from Karat's co-founder, reported by David Haney, and it tracks with what every hiring team we speak to is experiencing. The submissions look polished. The code is syntactically clean. And nobody can tell who actually understands what they submitted.
Welcome to the hardest problem in technical hiring right now.
The AI Cheating Crisis
80%
Of candidates use LLMs on coding tests
Karat co-founder via David Haney's blog
The scale of AI usage in technical assessments has caught hiring teams off guard. According to the CoderPad State of Tech Hiring 2026 report, 52% of hiring leaders now identify AI-generated solutions as their top cheating concern — ahead of plagiarism, collaboration, and time-limit violations.
The response has been fragmented. Some companies have gone to extreme measures. Google's CEO recently suggested returning to in-person interviews as a response to AI-assisted cheating. Others have tried technological solutions: AI-detection tools, proctored environments, keystroke monitoring.
94%
Of tech hiring processes will use AI tools in some form in 2026
CoderPad State of Tech Hiring 2026
But here is the uncomfortable reality: you are fighting a tool that 94% of tech hiring processes are themselves adopting. The tension is obvious. We want developers who can use AI effectively, yet we punish them for using it during the hiring process.
Why Banning AI Is a Losing Strategy
Banning Copilot from a coding test is like banning Google from a research task. You are testing compliance, not competence.
Banning AI tools creates three problems simultaneously.
It is unenforceable. Unless you are proctoring every minute of a take-home test with screen recording and keystroke analysis — which creates its own candidate experience nightmare — you cannot verify compliance. Honest candidates follow the rules. Dishonest ones ignore them. Your ban filters for honesty, not skill.
It penalises your best candidates. Senior developers who use Copilot, ChatGPT, or Claude as part of their daily workflow are now being asked to code without their standard tools. That is like asking a carpenter to build a cabinet without a power drill. They can do it, but the result does not reflect how they actually work.
It ignores professional reality. If your developers use AI tools on the job — and by 2026, most do — then assessing candidates without those tools tells you nothing about on-the-job performance. You are measuring an artificial constraint, not capability.
The enforcement paradox
The harder you try to enforce an AI ban, the worse the candidate experience becomes. Proctoring software, screen recording, and keystroke monitoring signal distrust before the relationship has even started. The best candidates, the ones with options, will simply decline to participate.
The Vibe Coding Problem
If banning AI does not work, should you simply allow it and move on? Not without changing what you assess.
An Indian tech company recently ran an experiment that illustrates why. They explicitly allowed ChatGPT during their technical assessment. Twelve thousand candidates applied. Four hundred and fifty reached the interview stage. Zero were hired.
The problem was what engineers have started calling "vibe coding" — using AI to generate plausible-looking code without understanding what it does. Candidates could produce working solutions to the initial challenge, but when interviewers probed their understanding of the code, the architecture, the trade-offs, the candidates fell apart.
0
Hires from 12,000 AI-allowed applicants at an Indian tech company
Reported in industry analysis of AI-assisted hiring outcomes
This is the real threat. Not that candidates use AI, but that AI makes it trivially easy to produce output without understanding. A candidate who prompts ChatGPT to "build a REST API with authentication" gets something that runs. But ask them why they chose that authentication pattern, how they would handle token refresh at scale, or what happens when the database connection pool is exhausted, and the facade crumbles.
The data backs this up
Research from Interviewing.io found that ChatGPT solves standard coding problems 73% of the time. But for custom problems designed for specific roles, that success rate drops to just 25%. The more specific and contextual the challenge, the less useful AI becomes as a wholesale replacement for understanding.
This is where assessment design becomes the critical lever.
What Real-Project Assessment Reveals That Sandbox Tests Cannot
Standard coding tests — the kind run in browser-based sandboxes with timed algorithmic challenges — are precisely the format most vulnerable to AI. They present well-defined problems with known solution patterns. They are the 73%.
Real-project assessment works differently. When a candidate builds against a structured brief, pushes code to a GitHub repository, and submits a complete working project, the signals available for evaluation multiply dramatically.
Architecture reveals understanding. How did the candidate structure their project? Did they separate concerns appropriately? Did they make sensible choices about state management, error handling, and data flow? These decisions require understanding the problem domain, not just generating code that compiles.
Testing reveals discipline. Did the candidate write tests? What did they test? A candidate relying entirely on AI-generated code rarely writes meaningful tests, because writing good tests requires understanding what can go wrong. The real cost of reviewing these submissions manually is another problem entirely, but the signal from test coverage is invaluable.
Brief compliance reveals comprehension. When the brief specifies particular requirements — handle concurrent updates, implement optimistic UI with rollback, ensure the interface remains responsive with 50+ items — evaluating compliance tells you whether the candidate understood the problem. AI can generate generic solutions. Meeting specific, contextual requirements demands genuine engagement.
Git history reveals process. A GitHub repository preserves the development timeline. Did the candidate build incrementally? Did they refactor as complexity grew? Did they commit working increments or dump everything in a single commit? The process is as revealing as the product.
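As a rough illustration of the git-history signal, the sketch below summarises a repository's commit timeline. It assumes the log was exported with `git log --pretty=format:'%H|%at|%s'` (commit hash, author timestamp, subject); the `HistorySummary` type and `summarize_history` function are hypothetical names, not part of any tool mentioned here. The output is a prompt for reviewer questions, not proof of anything: a single commit can be perfectly legitimate, and author timestamps can be rewritten.

```python
from dataclasses import dataclass


@dataclass
class HistorySummary:
    commit_count: int
    span_hours: float   # time between the first and last commit
    single_dump: bool   # True when everything arrived in one commit


def summarize_history(log_text: str) -> HistorySummary:
    """Summarise output of: git log --pretty=format:'%H|%at|%s'"""
    timestamps = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most twice so a '|' inside the subject is preserved.
        _sha, author_time, _subject = line.split("|", 2)
        timestamps.append(int(author_time))

    if not timestamps:
        return HistorySummary(commit_count=0, span_hours=0.0, single_dump=False)

    span_hours = (max(timestamps) - min(timestamps)) / 3600
    return HistorySummary(
        commit_count=len(timestamps),
        span_hours=span_hours,
        single_dump=len(timestamps) == 1,
    )
```

A reviewer might use a summary like this to decide which submissions deserve a closer look at how the work evolved, rather than as a pass/fail gate.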
The question is no longer whether candidates use AI. The question is whether your assessment can tell the difference between someone who understands what they built and someone who does not.
How Brief-Driven Evaluation Naturally Handles AI-Assisted Submissions
The key insight is this: when your evaluation criteria are specific, contextual, and tied to real role requirements, AI-assisted submissions are assessed on exactly the same terms as manually written ones. The tool becomes irrelevant. The output is what matters.
This is how well-designed briefs naturally handle the AI problem:
Custom problems defeat generic AI. Remember that 73% versus 25% stat from Interviewing.io? Standard problems are AI-friendly. Custom, role-specific briefs are not. When your brief asks candidates to "build an order management interface that handles concurrent updates from multiple restaurant staff during peak hours," the generic ChatGPT solution will not cut it.
Explicit criteria create accountability. When the brief states exactly what will be evaluated — and candidates know their submission will be analysed against those criteria — the incentive shifts from "produce something that looks good" to "produce something that meets the stated requirements." AI can help with the latter, but the candidate still needs to direct it intelligently.
Structured analysis catches shallow work. When submissions are evaluated fairly and consistently against brief criteria, shallow AI-generated code is exposed by what it lacks: coherent architecture, meaningful error handling, tests that cover actual edge cases, and design decisions that reflect the problem context rather than generic patterns.
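One way to make "explicit criteria" concrete is to write the rubric down as data and score every submission against it in the same way. The sketch below is a minimal example under stated assumptions: the criterion names, the weights, and the 0 to 4 rating scale are all illustrative choices, not a description of how any particular platform scores submissions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance; weights need not sum to 1


MAX_RATING = 4  # assumed scale: 0 (absent) to 4 (excellent)


def score_submission(criteria: list[Criterion], ratings: dict[str, int]) -> float:
    """Return the weighted average rating across all brief criteria."""
    missing = [c.name for c in criteria if c.name not in ratings]
    if missing:
        # Every criterion must be rated so all candidates are judged alike.
        raise ValueError(f"unrated criteria: {missing}")

    for c in criteria:
        if not 0 <= ratings[c.name] <= MAX_RATING:
            raise ValueError(f"rating for {c.name!r} must be 0..{MAX_RATING}")

    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("criteria weights must sum to a positive number")

    weighted = sum(c.weight * ratings[c.name] for c in criteria)
    return weighted / total_weight


# Hypothetical rubric for the order-management brief described above.
RUBRIC = [
    Criterion("architecture", 3),
    Criterion("error_handling", 2),
    Criterion("tests", 2),
    Criterion("brief_compliance", 3),
]
```

Because the function refuses to score a submission with an unrated criterion, two reviewers cannot quietly apply different standards, which is the consistency the brief promises candidates.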
The assessment design principle
Design your assessment so that the best AI-assisted submission and the best manually written submission look the same: thoughtful, well-structured, and clearly produced by someone who understood the problem. When you achieve that, the question of tool usage becomes irrelevant.
What to Look for When Candidates Use AI Tools
The skills that matter in technical hiring are already shifting. CoderPad's 2026 data shows the change in what hiring teams evaluate:
66%
Of hiring leaders seek candidates who can catch and fix AI mistakes
CoderPad State of Tech Hiring 2026
This shift makes sense. If AI handles much of the initial code generation, the premium skills become the ones AI is weakest at: understanding systems holistically, debugging subtle issues, making architectural trade-offs, and critically evaluating AI-generated output.
Here is what to prioritise when evaluating submissions in an AI-assisted world:
Coherence over cleverness. Does the codebase tell a consistent story? Are the architectural decisions aligned with each other? A candidate who uses AI for individual functions but understands the whole system will produce coherent code. One who assembles AI-generated snippets without understanding will produce a patchwork.
Error handling depth. AI-generated code typically handles the happy path well and ignores edge cases. Look at how the candidate handles failures, invalid input, network errors, and concurrent access. This is where genuine understanding shows.
Test quality over test quantity. Anyone can prompt an AI to "write tests for this function." The question is whether the tests cover meaningful scenarios. Do they test error paths? Do they verify behaviour under realistic conditions? Do they document the candidate's understanding of what can go wrong?
Trade-off documentation. Ask candidates to explain their decisions in a README or inline comments. "I chose this approach because..." is hard to fake convincingly. A candidate who understands their code can articulate the trade-offs; one who merely generated it will usually struggle to.
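To illustrate the difference between happy-path code and the error-handling and test depth described above, here is a small hypothetical example. The `parse_quantity` function and its limits are invented for illustration (loosely themed on a restaurant order brief); the point is the contrast between the first test, which only confirms the code runs, and the second, which documents what can go wrong.

```python
def parse_quantity(raw: str, max_quantity: int = 99) -> int:
    """Parse an order-line quantity, rejecting values staff could not act on."""
    try:
        quantity = int(raw.strip())
    except (ValueError, AttributeError) as exc:
        # ValueError: not a whole number; AttributeError: raw was not a string.
        raise ValueError(f"quantity must be a whole number, got {raw!r}") from exc
    if quantity < 1:
        raise ValueError(f"quantity must be at least 1, got {quantity}")
    if quantity > max_quantity:
        raise ValueError(f"quantity must be at most {max_quantity}, got {quantity}")
    return quantity


def test_happy_path():
    # The kind of test that is easy to generate: it only shows the code runs.
    assert parse_quantity("3") == 3


def test_error_paths():
    # The kind of test that shows understanding: each case is a real failure mode.
    for bad in ["", "abc", "3.5", "0", "-2", "100", None]:
        try:
            parse_quantity(bad)
        except ValueError:
            continue
        raise AssertionError(f"{bad!r} should have been rejected")
```

When reviewing a submission, the presence of tests like the second one is a reasonable (though not conclusive) sign that the candidate thought about invalid input rather than accepting the first version that ran.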
When the brief demands architectural reasoning, test coverage, and coherent design decisions, vibe coding collapses under its own weight.
The Path Forward
The industry is converging on a clear direction. Banning AI is untenable. Ignoring it is reckless. The sustainable approach is to design assessments where AI is just another tool — useful, but not sufficient.
This means three things for your hiring process:
- Move from sandbox puzzles to real-project briefs. Algorithmic challenges in browser sandboxes are the format most vulnerable to AI. Structured briefs that require building something contextual and complete are far more resilient.
- Make evaluation criteria explicit. When candidates know what you are assessing and your evaluation is consistent, the playing field is level regardless of tool usage. This is exactly what brief-driven assessment provides.
- Evaluate understanding, not just output. Architecture, testing, brief compliance, and documented trade-offs reveal whether the candidate understands what they built. These signals survive AI assistance because they require the kind of contextual reasoning that AI tools cannot reliably provide on their own.
The teams that figure this out first will have a significant advantage. They will assess candidates on the skills that actually predict job performance, attract senior developers who refuse to be proctored like exam students, and build hiring pipelines that work regardless of what tools candidates use.
Assess what actually matters
MeritDeck evaluates real GitHub repos against your role-specific brief. Architecture, testing, brief compliance — the signals that reveal genuine capability, regardless of what tools candidates used.