# AI Writing Detectors: How Accurate Are They Really? (I Tested 6)
Last updated: 2026-03-17
I wrote a 500-word essay about climate change. Then I asked ChatGPT (GPT-4) to write the same essay. Finally, I had ChatGPT draft it and rewrote that draft heavily myself. I submitted all three versions to 6 AI detectors. The results shook my confidence in every single one of them.
## The Test Setup
| Version | How It Was Created | Word Count |
|---|---|---|
| Version A | 100% human-written by me | 512 |
| Version B | 100% ChatGPT-generated (GPT-4), unedited | 498 |
| Version C | ChatGPT draft, heavily edited by me (~60% rewritten) | 507 |
## The Results
| Detector | Version A (Human) | Version B (AI) | Version C (Mixed) |
|---|---|---|---|
| Detector 1 | 98% human ✅ | 94% AI ✅ | 67% AI ⚠️ |
| Detector 2 | 85% human ✅ | 91% AI ✅ | 52% human ⚠️ |
| Detector 3 | 72% human ⚠️ | 88% AI ✅ | 61% AI ⚠️ |
| Detector 4 | 45% human ❌ | 79% AI ✅ | 55% human ⚠️ |
| Detector 5 | 91% human ✅ | 96% AI ✅ | 71% AI ⚠️ |
| Detector 6 | 88% human ✅ | 82% AI ✅ | 48% human ⚠️ |
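One thing the table hides is that detectors don't even report on the same scale: some give "% human," others "% AI." Here's a small Python snippet (my own summary, using the exact numbers above) that converts everything to percent-AI and shows how much the six tools disagree on each version:

```python
# Convert every score in the results table to a common percent-AI scale
# ("85% human" becomes 15% AI) and summarize each version's column.
from statistics import mean, pstdev

# (Version A, Version B, Version C) as percent-AI, one tuple per detector
scores_pct_ai = [
    (2, 94, 67),    # Detector 1
    (15, 91, 48),   # Detector 2
    (28, 88, 61),   # Detector 3
    (55, 79, 45),   # Detector 4 (the false positive on my human essay)
    (9, 96, 71),    # Detector 5
    (12, 82, 52),   # Detector 6
]

for name, idx in (("A (human)", 0), ("B (AI)", 1), ("C (mixed)", 2)):
    col = [row[idx] for row in scores_pct_ai]
    print(f"Version {name}: mean {mean(col):.0f}% AI, "
          f"std dev {pstdev(col):.0f}, range {min(col)}-{max(col)}")
```

Version B lands firmly around 88% AI, but Version C averages roughly 57% AI with a 26-point range: essentially a shrug.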
## Key Findings
- False positives are real. Detector 4 scored my 100% human-written essay at just 45% human, i.e., 55% likely AI. If a teacher used this tool, I would have been accused of cheating on my own work.
- Pure AI text is detectable. All 6 detectors correctly identified Version B as AI-generated. Unedited ChatGPT output has distinctive patterns.
- Edited AI text is a coin flip. Version C (AI draft plus heavy human editing) produced wildly inconsistent results: normalized to percent-AI, the scores ranged from 45% to 71%, and no detector was confident.
- Non-native English speakers are penalized. I repeated the test with an essay written by a non-native English speaker. Three detectors flagged it as AI-generated. Simpler vocabulary and grammar patterns apparently look "AI-like" to these tools.
## What AI Detectors Actually Measure
AI detectors look for statistical patterns in text: perplexity (how surprising each next word is to a language model) and burstiness (how much sentence length and complexity vary). AI text tends to be more uniform: consistent sentence lengths, predictable word choices, fewer surprising transitions. Human text is messier; we go on tangents, use unusual words, and vary our sentence structure more dramatically.
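To make burstiness concrete, here's a toy Python sketch (my illustration, not how any of the six detectors actually works) that scores sentence-length variation; real tools pair a signal like this with model-based perplexity:

```python
# A minimal sketch of the "burstiness" signal: variation in sentence
# length. This toy version only measures length uniformity.
import re
from statistics import mean, pstdev

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths (higher = burstier)."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return pstdev(lengths) / mean(lengths)

uniform = "The cat sat down. The dog sat down. The bird sat down."
bursty = "Stop. The cat, ignoring everyone as cats invariably do, sat down."
print(burstiness(uniform))  # 0.0: identical sentence lengths
print(burstiness(bursty))   # ~0.82: a 1-word vs a 10-word sentence
```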
The problem: these are statistical tendencies, not rules. A careful human writer can produce text that looks "AI-like," and a well-prompted AI can produce text that looks "human-like."
## My Recommendation
Do not rely on AI detectors for high-stakes decisions (academic integrity, hiring, publishing). Use them as one signal among many, not as definitive proof. Our AI Content Detector gives you a probability score with confidence intervals; use it to understand the likelihood, not as a binary verdict.
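If you do run multiple detectors, treat their disagreement as information in its own right. Here's a hypothetical sketch of that idea (the function name, threshold, and wording are mine, not any product's API):

```python
# Hypothetical aggregator: turn several detectors' percent-AI scores
# into a single hedged summary instead of a binary verdict.
from statistics import mean

def combine_scores(pct_ai: list[float]) -> str:
    spread = max(pct_ai) - min(pct_ai)
    if spread > 20:  # arbitrary cutoff (my choice): wide disagreement
        return f"inconclusive: detectors span {min(pct_ai):.0f}-{max(pct_ai):.0f}% AI"
    return f"~{mean(pct_ai):.0f}% AI (detectors roughly agree)"

print(combine_scores([94, 91, 88, 79, 96, 82]))  # Version B: ~88% AI
print(combine_scores([67, 48, 61, 45, 71, 52]))  # Version C: inconclusive
```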
## Sources

- According to research published on arXiv (Liang et al., 2023), AI text detectors show significant bias against non-native English writers.
- OpenAI discontinued its own AI classifier in July 2023, acknowledging its low rate of accuracy.