Math Drill — Philosophy & Design
A Kumon-style incremental learning engine where every decision is deterministic, and AI is used only at the edges.
1 · The problem, and the one idea behind the solution
The goal is modest to state and hard to do well: have a young child spend ten focused minutes a day on maths and steadily get both more accurate and faster — the Kumon promise. A worksheet is printed, the child writes on paper, a parent marks it, and the next batch should be just right: never so hard it discourages, never so easy it bores, always one small step beyond what was mastered yesterday.
The temptation in 2026 is to point a large language model at this and let it "generate worksheets and
track progress." That is precisely the wrong instinct. A system that decides what a child practises, and
judges whether they have mastered it, must be correct, explainable, and reproducible. An
LLM that occasionally computes 7 × 8 = 54, or quietly advances a struggling child, or produces a
different worksheet each time you ask, fails all three. The single idea this whole design turns on is a
boundary:
The core thesis
All judgement is deterministic. The LLM only ever decorates. Every number, every mastery decision, every difficulty step is computed in plain code and is reproducible from a seed. The language model is confined to one job — phrasing an explanation warmly for a child — and is handed the answer so it cannot get the maths wrong.
2 · The determinism boundary
Drawing this line precisely is the most important design act in the project. Everything that constitutes the child's learning record and trajectory sits on the deterministic side. The LLM sits on the far side of a wall, receiving already-solved examples and returning prose that is validated before use.
● Deterministic — plain code
- Generating every question
- Computing every answer & answer key
- Attributing the supervisor's marks to skills
- Deciding mastery (accuracy + speed)
- Choosing the next skill & difficulty
- Sizing the daily worksheet
- Selecting which concepts to reteach
◐ LLM — variable prose only
- Re-phrasing a concept warmly for a 7-year-old
- Encouragement and tone on the weekly sheet
The LLM is handed solved examples and told not to alter numbers. Its output is validated, and a deterministic template is used as the fallback.
If you removed the language model entirely, the system would still teach correctly — it would simply explain a little less warmly. That is the test of a well-placed AI edge.
3 · Design principles
Six principles fall out of the thesis and govern every component.
| Principle | What it means in practice |
|---|---|
| Determinism first | Maths and decisions are pure functions. Given the same inputs, the system always produces the same worksheet and the same verdict. |
| Reproducibility from a seed | A week is fully defined by a seed string. The same week regenerates byte-for-byte — including its answer key — so printing, auditing, and debugging are trivial. |
| Mastery gates progress | A child advances only after proving accuracy and speed, sustained over several sessions. No time-based promotion. |
| Never too hard, never too long | New skills always start at their easiest setting; when a child struggles, difficulty only ever falls, never rises; the day is time-boxed, so a harder skill yields fewer questions. |
| Small steps | The curriculum is decomposed into many narrow skills, so each advancement is a gentle increment rather than a cliff. |
| Explainability | Every decision traces to a rule and a number. You can always answer "why did she get this sheet?" |
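To make the "reproducibility from a seed" principle concrete, here is a minimal sketch. It assumes the seed string is hashed to a 32-bit state (an FNV-1a-style hash) that drives a small PRNG (mulberry32). Both are well-known techniques swapped in for illustration, not necessarily what the engine uses, and the seed format and the `generateAdditions` helper are hypothetical.

```typescript
// Hash a seed string to an unsigned 32-bit integer.
// FNV-1a-style, applied to UTF-16 code units (an illustrative choice).
function hashSeed(seed: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < seed.length; i++) {
    h ^= seed.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// mulberry32: a tiny deterministic PRNG returning floats in [0, 1).
function mulberry32(state: number): () => number {
  let a = state >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Uniform integer in [lo, hi], both inclusive.
function randInt(rng: () => number, lo: number, hi: number): number {
  return lo + Math.floor(rng() * (hi - lo + 1));
}

// Hypothetical generator: 2-digit + 1-digit sums with their answers.
// The same seed always yields the same questions and the same key.
function generateAdditions(seed: string, count: number): string[] {
  const rng = mulberry32(hashSeed(seed));
  const out: string[] = [];
  for (let i = 0; i < count; i++) {
    const a = randInt(rng, 10, 99);
    const b = randInt(rng, 1, 9);
    out.push(`${a} + ${b} = ${a + b}`);
  }
  return out;
}
```

Calling `generateAdditions("ada:2026-W07:mon", 20)` twice returns identical arrays, which is the property that makes reprinting and auditing a week straightforward. No clock and no `Math.random()` are involved.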
4 · Architecture
The system is a thin stack around a pure engine. The engine has no clock, no randomness beyond a seeded generator, and no input/output — which is exactly what makes it testable and portable. It runs identically on a laptop today and inside a Cloudflare Worker later.
★ The engine is the crown jewel and is unit-tested in isolation. Everything above it is replaceable plumbing; the engine is the part that must be correct.
5 · The learning model
5.1 · A curriculum of small skills (Kumon B → C → D)
The curriculum is an ordered ladder of narrow skills, mapped to Kumon levels B (vertical addition and subtraction with carrying and borrowing), C (multiplication and division), and D (long multiplication, long division, and an introduction to fractions). Each skill is data, not code: it names a generator, the parameters for its easiest setting, optional scaffold-down overrides, a speed target, and a plain-language description. Adding a skill is adding a row.
The decomposition is deliberately fine — for example, addition alone steps through 2-digit + 1-digit no-carry → with carry → 2-digit + 2-digit no-carry → with carry → 3-digit → column addition of three or more numbers. Fine granularity is what makes each advancement feel like a small, winnable step.
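One way to picture "each skill is data, not code" is a typed row per skill. The sketch below is illustrative only: the field names, skill ids, parameter names, and speed targets are assumptions made for the example, not values taken from the real curriculum.

```typescript
// Generator parameters are plain data. The keys are hypothetical.
type Params = Record<string, number | boolean>;

interface Skill {
  id: string;
  level: "B" | "C" | "D";           // Kumon level the skill maps to
  generator: string;                // name of the question generator
  easiest: Params;                  // parameters for the easiest setting
  scaffoldDown?: Params;            // optional overrides when scaffolding down
  targetSecondsPerQuestion: number; // speed target (placeholder numbers below)
  description: string;              // plain-language description
}

// The first rungs of the addition ladder described above.
const LADDER: readonly Skill[] = [
  { id: "add-2d1d-nocarry", level: "B", generator: "addition",
    easiest: { digitsA: 2, digitsB: 1, carry: false },
    targetSecondsPerQuestion: 5,
    description: "2-digit + 1-digit, no carrying" },
  { id: "add-2d1d-carry", level: "B", generator: "addition",
    easiest: { digitsA: 2, digitsB: 1, carry: true },
    scaffoldDown: { maxOperand: 49 },
    targetSecondsPerQuestion: 6,
    description: "2-digit + 1-digit, with carrying" },
  { id: "add-2d2d-nocarry", level: "B", generator: "addition",
    easiest: { digitsA: 2, digitsB: 2, carry: false },
    targetSecondsPerQuestion: 7,
    description: "2-digit + 2-digit, no carrying" },
  { id: "add-2d2d-carry", level: "B", generator: "addition",
    easiest: { digitsA: 2, digitsB: 2, carry: true },
    targetSecondsPerQuestion: 9,
    description: "2-digit + 2-digit, with carrying" },
];

// The next rung after a given skill, or undefined at the top of the ladder
// (or when the id is unknown).
function nextSkill(currentId: string): Skill | undefined {
  const i = LADDER.findIndex((s) => s.id === currentId);
  return i >= 0 ? LADDER[i + 1] : undefined;
}
```

Under this shape, adding a skill really is adding one object to the array; nothing in `nextSkill` changes.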
5.2 · The question bank — two flavours, one shape
Each skill owns a bank. The flavour depends on the size of the question space.
| Flavour | Used for | Why |
|---|---|---|
| Enumerable | Times tables, division facts | The space is small and finite, so we enumerate it and guarantee coverage — and over-sample the specific facts a child keeps missing until they are automatic. |
| Generative | Multi-digit add / sub / mul / div, fractions | The space is effectively infinite, so we sample it deterministically from a seed — endless fresh questions, no realistic repeats within a sheet. |
Both resolve to one materialized shape, which is all the printer, scorer, and analyzer ever see:
{ kind, operands, answer, render, tags }
// tags drive error analysis: 'carry', 'borrow', 'across_zero', 'fact:7x8', 'review' …
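As a rough sketch of the enumerable flavour, the block below types the materialized shape and enumerates one times table. The field types, the `maxFactor` default of 12, and the `withOversampling` helper are assumptions made for illustration; only the field names and the `fact:7x8` tag style come from the text above.

```typescript
// The materialized shape shared by both bank flavours.
interface Question {
  kind: string;
  operands: number[];
  answer: number;
  render: string;  // what gets printed on the sheet
  tags: string[];  // drive error analysis
}

// Enumerable bank: every fact in one times table, so coverage is guaranteed.
function timesTableBank(table: number, maxFactor: number = 12): Question[] {
  const bank: Question[] = [];
  for (let n = 1; n <= maxFactor; n++) {
    bank.push({
      kind: "mul",
      operands: [table, n],
      answer: table * n,
      render: `${table} × ${n} =`,
      tags: [`fact:${table}x${n}`],
    });
  }
  return bank;
}

// Hypothetical over-sampling: append extra copies of the facts a child
// keeps missing, identified by their tags.
function withOversampling(bank: Question[], weakTags: string[], extraCopies: number): Question[] {
  const weak = bank.filter((q) => q.tags.some((t) => weakTags.indexOf(t) >= 0));
  const out = bank.slice();
  for (let i = 0; i < extraCopies; i++) out.push(...weak);
  return out;
}
```

For example, `withOversampling(timesTableBank(7), ["fact:7x8"], 2)` yields the twelve facts of the 7× table plus two extra copies of 7 × 8.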
5.3 · Marking — the exact wrong questions
Because every printed question carries a stable number, the supervisor marks by exception: they enter only
the question numbers the child got wrong, plus the single session time. Everything unmarked is assumed
correct. This is low-effort for the parent and high-signal for the system — each error inherits its
tags, so the engine learns not just that a child erred but what kind of
error it was (a borrow-across-zero slip, a weak 7× fact), which directly drives both reteaching and
over-practice.
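Marking by exception can be sketched as a small pure function. The type and function names here are hypothetical, and the choice to reject unknown question numbers is an assumption rather than documented behaviour.

```typescript
interface PrintedQuestion {
  number: number;  // the stable number printed on the sheet
  tags: string[];
}

interface SessionResult {
  total: number;
  wrong: number;
  accuracy: number;            // fraction in [0, 1]
  secondsPerQuestion: number;  // session time spread over all questions
  errorTagCounts: Record<string, number>;
}

// The supervisor supplies only the wrong question numbers and the session time.
// Everything unmarked is treated as correct.
function scoreSession(
  sheet: PrintedQuestion[],
  wrongNumbers: number[],
  sessionSeconds: number,
): SessionResult {
  if (sheet.length === 0) throw new RangeError("sheet is empty");
  const known = new Set(sheet.map((q) => q.number));
  const unknown = wrongNumbers.filter((n) => !known.has(n));
  if (unknown.length > 0) {
    throw new RangeError(`unknown question numbers: ${unknown.join(", ")}`);
  }
  const wrong = new Set(wrongNumbers); // duplicate entries collapse

  // Each error inherits the tags of its question.
  const errorTagCounts: Record<string, number> = {};
  for (const q of sheet) {
    if (!wrong.has(q.number)) continue;
    for (const tag of q.tags) {
      errorTagCounts[tag] = (errorTagCounts[tag] ?? 0) + 1;
    }
  }

  return {
    total: sheet.length,
    wrong: wrong.size,
    accuracy: (sheet.length - wrong.size) / sheet.length,
    secondsPerQuestion: sessionSeconds / sheet.length,
    errorTagCounts,
  };
}
```

The tag counts are what a reteaching or over-practice step would consume: a pile-up under `across_zero` says something quite different from a pile-up under `fact:7x8`.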
5.4 · Time-boxed, adaptive volume
A fixed "100 questions a day" is the wrong primitive, because 100 easy sums and 100 long divisions are not the same ask. Instead we hold the time roughly constant and let the count float from the child's observed speed on the upcoming skill:
count = clamp( round( targetSeconds / estimatedSecondsPerQuestion ) )
This produces two self-protecting behaviours for free:
- A harder skill is slower per question, so it yields fewer questions — the session never runs long.
- A faster child gets more questions in the same ten minutes — more fluency volume, exactly when they can absorb it.
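The sizing rule can be written out as a short sketch. The formula above does not state its clamp bounds, so the `minCount` and `maxCount` defaults below are placeholders chosen for illustration.

```typescript
// Restrict a value to the range [min, max].
function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Size a worksheet so the session stays near targetSeconds.
// minCount and maxCount are hypothetical bounds, not documented values.
function worksheetCount(
  targetSeconds: number,
  estimatedSecondsPerQuestion: number,
  minCount: number = 10,
  maxCount: number = 100,
): number {
  if (estimatedSecondsPerQuestion <= 0) {
    throw new RangeError("estimatedSecondsPerQuestion must be positive");
  }
  const raw = Math.round(targetSeconds / estimatedSecondsPerQuestion);
  return clamp(raw, minCount, maxCount);
}
```

With a ten-minute box (600 seconds), a child averaging 6 seconds per question gets 100 questions, while a skill that takes 40 seconds per question yields 15. The clamp only matters at the extremes: 3 seconds per question would give 200 and is capped at 100, and 90 seconds per question would give 7 and is raised to 10.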
A consequence worth noting: since the session length is now held roughly constant by construction, "fast enough" can no longer be measured in minutes. The speed gate becomes seconds-per-question against the skill's target, which is the same number the mastery rule already uses. The design stays internally consistent.
5.5 · The mastery state machine
After each session the engine updates a rolling record (accuracy and median seconds-per-question) for the active skill. At the week boundary, a small, fully explainable policy decides what comes next.
| Rolling state (sustained ≥ 2–3 sessions) | Decision |
|---|---|
| ≥ 95% accurate and within the speed target | Advance one skill, entering at its easiest setting |
| ≥ 95% accurate but slower than the speed target | Repeat the skill to build speed |
| 85% to under 95% accurate | Repeat the skill to consolidate |
| Below 85% accurate | Scaffold down: reduce difficulty to rebuild confidence |
How "never too hard" is guaranteed
Advancement is gated on mastery, so a child cannot be pushed forward before they are ready. A brand-new skill always enters at ease 0, its gentlest form. When a child struggles, difficulty decreases. And the ladder moves at most one rung per week. There is no path by which the next sheet is a cliff.
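The policy is small enough to sketch as a single pure function. This version assumes the rolling state passed in already summarises the sustained window of sessions, and it treats exactly 95% as accurate and exactly 85% as consolidating; both are assumptions about details the policy table leaves open. The type and function names are hypothetical.

```typescript
type Decision =
  | "advance"
  | "repeat-for-speed"
  | "repeat-to-consolidate"
  | "scaffold-down";

interface RollingState {
  accuracy: number;                 // fraction in [0, 1] over the rolling window
  medianSecondsPerQuestion: number; // over the same window
}

// Every branch maps to one row of the policy table.
function decide(state: RollingState, targetSecondsPerQuestion: number): Decision {
  if (state.accuracy < 0.85) return "scaffold-down";
  if (state.accuracy < 0.95) return "repeat-to-consolidate";
  return state.medianSecondsPerQuestion <= targetSecondsPerQuestion
    ? "advance"
    : "repeat-for-speed";
}
```

Because the function has no hidden state, "why did she get this sheet?" reduces to reading off two numbers and one branch. For instance, `decide({ accuracy: 0.97, medianSecondsPerQuestion: 7 }, 5)` returns `"repeat-for-speed"`.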
5.6 · Cumulative review
Mastered skills decay if never revisited, so roughly a fifth of each day is drawn from skills the child has already passed, interleaved and shuffled in. This keeps old fluency alive and mirrors Kumon's own cumulative design — without the child noticing a separate "revision" task.
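A minimal sketch of the review mix follows. It splits the day's count into fresh and review questions and spreads the review items evenly through the sheet. The even-spacing rule is an illustrative stand-in for the seeded shuffle described above, and both function names are hypothetical.

```typescript
// Split a day's question count into fresh and review portions.
function splitCounts(total: number, reviewFraction: number = 0.2): { fresh: number; review: number } {
  const review = Math.round(total * reviewFraction);
  return { fresh: total - review, review };
}

// Spread review items evenly among the fresh ones, so review never forms
// a separate block. Deterministic: the same inputs give the same order.
function interleave<T>(fresh: T[], review: T[]): T[] {
  const out: T[] = [];
  const total = fresh.length + review.length;
  let f = 0;
  let r = 0;
  for (let i = 0; i < total; i++) {
    // Emit a review item once its share of the sheet so far has fallen due,
    // or when the fresh items have run out.
    const due = (r + 1) * total <= (i + 1) * review.length;
    const wantReview = r < review.length && (f >= fresh.length || due);
    out.push(wantReview ? review[r++] : fresh[f++]);
  }
  return out;
}
```

For a 20-question day, `splitCounts(20)` gives 16 fresh and 4 review questions. Interleaving eight fresh items with two review items places the review items fifth and tenth.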
5.7 · The weekly concept sheet
Each packet opens with a short, friendly explainer of the one or two concepts the child most struggled with. The selection and the worked examples are entirely deterministic — the engine ranks the weakest concept tags from real errors and generates fully-solved examples. Only the wording is optionally handed to the LLM, which is told to use the supplied, already-correct examples and not to invent maths. If the model is unavailable or its output fails validation, a deterministic template is used instead. The child always gets a correct, encouraging sheet; the AI only changes how warm it sounds.
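The validate-or-fall-back step might look like the sketch below. The text above does not say how validation works, so the rule used here is an assumption: every number in the prose must come from the supplied solved examples, and every solved example must appear verbatim. All function names and the template wording are hypothetical.

```typescript
// Extract every run of digits from a piece of text.
function numbersIn(text: string): string[] {
  return text.match(/\d+/g) ?? [];
}

// Assumed validation rule: no invented numbers, and every solved example
// (e.g. "47 + 38 = 85") is quoted exactly as supplied.
function validateProse(prose: string, solved: string[]): boolean {
  const allowed = new Set(numbersIn(solved.join(" ")));
  const noInventedNumbers = numbersIn(prose).every((n) => allowed.has(n));
  const allExamplesPresent = solved.every((ex) => prose.includes(ex));
  return noInventedNumbers && allExamplesPresent;
}

// Deterministic fallback: plain but always correct.
function templateProse(concept: string, solved: string[]): string {
  return [
    `This week we are practising ${concept}. Here is how it works:`,
    ...solved.map((ex) => `  ${ex}`),
    "Take your time. You can do it!",
  ].join("\n");
}

// Use the LLM's wording only if it exists and passes validation.
function conceptSheet(concept: string, solved: string[], llmProse: string | null): string {
  if (llmProse !== null && validateProse(llmProse, solved)) return llmProse;
  return templateProse(concept, solved);
}
```

This rule is deliberately conservative: prose that mentions any stray number, even a harmless one, is rejected in favour of the template. That matches the priority stated above, where correctness is guaranteed and warmth is optional.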
6 · The weekly lifecycle
End to end, a week flows through seven steps. Five are pure computation; reteaching pairs deterministic selection with optional LLM phrasing, and printing is a manual step.
1. Decide the next skill & difficulty from the child's competency record. (deterministic)
2. Size the worksheet to the ten-minute box from observed speed. (deterministic)
3. Generate six days of seeded questions plus answer keys, mixing in cumulative review. (deterministic)
4. Reteach: select the weakest concepts and solved examples; wrap them in prose. (selection: deterministic; phrasing: LLM or template)
5. Print the packet; the child works on paper for ~10 min/day.
6. Mark: the supervisor enters the exact wrong questions and the session time. (deterministic)
7. Update competency; the dashboards reflect new mastery, speed, and error patterns. (deterministic)
The loop then returns to step 1 for the following week, now with fresh evidence.
7 · Two children, one ladder
The system is multi-student from the start, but intentionally simple about it: there is one shared curriculum ladder, and each child carries their own pointer and their own competency record along it. Two sisters can climb the very same B → C → D path entirely independently — one consolidating two-digit carrying while the other has moved on to times tables — with no duplicated content and no interaction between their trajectories. The same model extends upward to Year 13 simply by lengthening the ladder.
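The "one ladder, many pointers" idea fits in a few lines. In this sketch the skill ids and the `Student` shape are hypothetical, and the competency record is reduced to a single index for brevity.

```typescript
// One shared ladder of skill ids (illustrative names).
const SHARED_LADDER: readonly string[] = [
  "add-2d1d-nocarry",
  "add-2d1d-carry",
  "times-table-2",
  "times-table-3",
];

// Each child carries only a pointer into the shared ladder.
interface Student {
  name: string;
  skillIndex: number;
}

function currentSkill(student: Student): string {
  return SHARED_LADDER[student.skillIndex];
}

// Move one rung at most, never past the top. Returns a new record,
// so no other student's state can be touched.
function advance(student: Student): Student {
  const next = Math.min(student.skillIndex + 1, SHARED_LADDER.length - 1);
  return { ...student, skillIndex: next };
}
```

Advancing one child returns a new record for that child alone. Extending the curriculum means appending ids to `SHARED_LADDER`, with no change to any student record.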
8 · Infrastructure — minimal by design
The deployment target is deliberately humble: it should cost nothing and run on managed edge infrastructure. Because the engine is pure and dependency-free, the move from the local build to Cloudflare is a wiring exercise, not a rewrite.
| Concern | Local build (today) | Cloudflare (next) |
|---|---|---|
| Logic / engine | Pure modules run by a CLI | Same modules inside a Worker |
| Storage | A JSON file | D1 (SQLite) — schema already mirrors it 1:1 |
| Weekly generation | A command | A Cron Trigger (Sunday night) |
| Dashboards & marking | Static HTML | Pages, behind Zero Trust access |
| Printing | Print-CSS HTML → browser "Save as PDF" | Unchanged — no PDF library needed |
| Concept prose | Deterministic template | A single OpenRouter call per child per week |
Printing as HTML rather than a generated PDF is a quiet but important choice: a worksheet is just a grid, print-CSS renders it crisply on A4, and we avoid heavyweight PDF libraries that strain a Worker's limits.
9 · Decisions & trade-offs
A few choices were made consciously, each trading a little flexibility for a lot of trust.
| Decision | Why | What we gave up |
|---|---|---|
| Difficulty is constant within a week | A child never meets a mid-week spike; the six sessions give a clean read on one skill | A small lag — mastery mid-week is acted on next week, not the same day |
| Questions are materialized, not just regenerated on the fly | Simple marking-by-number and rich error analysis | A little storage (trivial at this scale) |
| One skill per day, plus review | The single session timer maps cleanly to one skill's fluency | Less variety within a day (mitigated by the review mix) |
| Time-boxed count over a fixed count | Sessions stay ~10 min regardless of difficulty; volume scales with ability | The daily count varies (and is therefore an outcome, not a fixed promise) |
10 · What is deliberately deferred
Phase 0 builds the part that must be correct — the engine, curriculum, adaptation, printing, and dashboards — and proves it with a test suite and an end-to-end demo. Everything that is plumbing is left for later phases: the database swap, the hosted marking interface, automated weekly generation, and the live LLM concept prose. The sequencing is intentional. The hardest thing to get right is the judgement, so that is built and verified first; the cloud is the easy part, added once the brain is trustworthy.
In one sentence
A child's learning is too important to be probabilistic — so the maths, the marking, and the progression are deterministic and reproducible, and the language model is invited only to make a correct explanation sound kind.