Math Drill — Philosophy & Design

A Kumon-style incremental learning engine where every decision is deterministic, and AI is used only at the edges.

Subject: Design rationale for a home maths-drilling system · Scope: Curriculum, question bank, adaptation, infrastructure · Status: Phase 0 (local) built & tested · Audience: Builder / engineering handoff

1 · The problem, and the one idea behind the solution

The goal is modest to state and hard to do well: have a young child spend ten focused minutes a day on maths and steadily get both more accurate and faster — the Kumon promise. A worksheet is printed, the child writes on paper, a parent marks it, and the next batch should be just right: never so hard it discourages, never so easy it bores, always one small step beyond what was mastered yesterday.

The temptation in 2026 is to point a large language model at this and let it "generate worksheets and track progress." That is precisely the wrong instinct. A system that decides what a child practises, and judges whether they have mastered it, must be correct, explainable, and reproducible. An LLM that occasionally computes 7 × 8 = 54, or quietly advances a struggling child, or produces a different worksheet each time you ask, fails all three. The single idea this whole design turns on is a boundary:

The core thesis

All judgement is deterministic. The LLM only ever decorates. Every number, every mastery decision, every difficulty step is computed in plain code and is reproducible from a seed. The language model is confined to one job — phrasing an explanation warmly for a child — and is handed the answer so it cannot get the maths wrong.

2 · The determinism boundary

Drawing this line precisely is the most important design act in the project. Everything that constitutes the child's learning record and trajectory sits on the deterministic side. The LLM sits on the far side of a wall, receiving already-solved examples and returning prose that is validated before use.

● Deterministic — plain code

Generating every question
Computing every answer & answer key
Attributing the supervisor's marks to skills
Deciding mastery (accuracy + speed)
Choosing the next skill & difficulty
Sizing the daily worksheet
Selecting which concepts to reteach

◐ LLM — variable prose only

Re-phrasing a concept warmly for a 7-year-old
Encouragement and tone on the weekly sheet

Handed solved examples · told not to alter numbers · output validated · deterministic template used as fallback.

If you removed the language model entirely, the system would still teach correctly — it would simply explain a little less warmly. That is the test of a well-placed AI edge.

3 · Design principles

Six principles fall out of the thesis and govern every component.

Principle	What it means in practice
Determinism first	Maths and decisions are pure functions. Given the same inputs, the system always produces the same worksheet and the same verdict.
Reproducibility from a seed	A week is fully defined by a seed string. The same week regenerates byte-for-byte — including its answer key — so printing, auditing, and debugging are trivial.
Mastery gates progress	A child advances only after proving accuracy and speed, sustained over several sessions. No time-based promotion.
Never too hard, never too long	New skills always start at their easiest; when a child struggles, difficulty within the skill only ever moves down; the day is time-boxed, so a harder skill yields fewer questions.
Small steps	The curriculum is decomposed into many narrow skills, so each advancement is a gentle increment rather than a cliff.
Explainability	Every decision traces to a rule and a number. You can always answer "why did she get this sheet?"
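
As a rough illustration of "reproducibility from a seed", the sketch below derives a generator from a seed string and draws operands from it. The names `seedFromString`, `makeRng`, `randInt`, and `drawOperands`, the seed format, and the choice of a Park-Miller generator are all assumptions made for this example; the engine's actual generator may differ.

```typescript
// Hash a seed string to an integer in [1, 2147483646].
// Zero is avoided because it would make the generator below emit zeros forever.
function seedFromString(seed: string): number {
  let h = 7;
  for (let i = 0; i < seed.length; i++) {
    h = (h * 31 + seed.charCodeAt(i)) % 2147483647;
  }
  return h === 0 ? 1 : h;
}

// Park-Miller generator: state' = state * 48271 mod (2^31 - 1).
// Every product stays below 2^53, so plain number arithmetic is exact.
function makeRng(seed: string): () => number {
  let state = seedFromString(seed);
  return function (): number {
    state = (state * 48271) % 2147483647;
    return state / 2147483647; // strictly between 0 and 1
  };
}

// Integer in [lo, hi], inclusive at both ends.
function randInt(rng: () => number, lo: number, hi: number): number {
  return lo + Math.floor(rng() * (hi - lo + 1));
}

// Draw n two-digit operands; the seed alone determines the result.
function drawOperands(seed: string, n: number): number[] {
  const rng = makeRng(seed);
  const out: number[] = [];
  for (let i = 0; i < n; i++) {
    out.push(randInt(rng, 10, 99));
  }
  return out;
}
```

Calling `drawOperands` twice with the same seed string returns the same array, which is the property that lets a week and its answer key be regenerated for printing or auditing.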

4 · Architecture

The system is a thin stack around a pure engine. The engine has no clock, no randomness beyond a seeded generator, and no input/output — which is exactly what makes it testable and portable. It runs identically on a laptop today and inside a Cloudflare Worker later.

Presentation — printable weekly packet (concept cover → 6 day grids → answer key); supervisor & kid dashboards

▲

Orchestration — weekly planning: pick next skill, size the worksheet, mix in review, persist

▲

Pure engine ★ — seeded RNG · generators · question bank · count · scorer · mastery state machine · error analysis

▲

Persistence — students, weeks, worksheets, questions, results, competency (JSON locally; the same schema maps 1:1 to a SQL database)

◐

AI edge — optional concept-sheet prose, invoked once per child per week

★ The engine is the crown jewel and is unit-tested in isolation. Everything above it is replaceable plumbing; the engine is the part that must be correct.

5 · The learning model

5.1 · A curriculum of small skills (Kumon B → C → D)

The curriculum is an ordered ladder of narrow skills, mapped to Kumon levels B (vertical addition and subtraction with carrying and borrowing), C (multiplication and division), and D (long multiplication, long division, and an introduction to fractions). Each skill is data, not code: it names a generator, the parameters for its easiest setting, optional scaffold-down overrides, a speed target, and a plain-language description. Adding a skill is adding a row.

The decomposition is deliberately fine — for example, addition alone steps through 2-digit + 1-digit no-carry → with carry → 2-digit + 2-digit no-carry → with carry → 3-digit → column addition of three or more numbers. Fine granularity is what makes each advancement feel like a small, winnable step.
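
To make "each skill is data, not code" concrete, here is a minimal sketch of what a few ladder rows might look like. The field names, skill ids, parameter names, and speed targets are placeholders invented for illustration; only the list of things a row carries (generator, easiest-setting parameters, optional scaffold-down overrides, speed target, description) comes from the text above.

```typescript
// Hypothetical shape of one curriculum row; field names are assumptions.
interface Skill {
  id: string;
  level: "B" | "C" | "D";
  generator: string; // name of the generator this skill uses
  easiest: { [param: string]: number | boolean }; // parameters at the easiest setting
  scaffoldDown?: { [param: string]: number | boolean }; // optional overrides when struggling
  targetSecondsPerQuestion: number; // placeholder values below
  description: string;
}

// Three rows from the addition sequence; adding a skill means adding a row.
const LADDER: Skill[] = [
  {
    id: "add_2d_1d_no_carry",
    level: "B",
    generator: "addition",
    easiest: { digitsA: 2, digitsB: 1, carry: false },
    targetSecondsPerQuestion: 5,
    description: "2-digit + 1-digit, no carrying",
  },
  {
    id: "add_2d_1d_carry",
    level: "B",
    generator: "addition",
    easiest: { digitsA: 2, digitsB: 1, carry: true },
    scaffoldDown: { carry: false },
    targetSecondsPerQuestion: 7,
    description: "2-digit + 1-digit, with carrying",
  },
  {
    id: "add_2d_2d_no_carry",
    level: "B",
    generator: "addition",
    easiest: { digitsA: 2, digitsB: 2, carry: false },
    targetSecondsPerQuestion: 8,
    description: "2-digit + 2-digit, no carrying",
  },
];

// Next rung on the ladder; the last skill returns itself, an unknown id returns undefined.
function nextSkill(ladder: Skill[], currentId: string): Skill | undefined {
  for (let i = 0; i < ladder.length; i++) {
    if (ladder[i].id === currentId) {
      return i + 1 < ladder.length ? ladder[i + 1] : ladder[i];
    }
  }
  return undefined;
}
```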

5.2 · The question bank — two flavours, one shape

Each skill owns a bank. The flavour depends on the size of the question space.

Flavour	Used for	Why
Enumerable	Times tables, division facts	The space is small and finite, so we enumerate it and guarantee coverage — and over-sample the specific facts a child keeps missing until they are automatic.
Generative	Multi-digit add / sub / mul / div, fractions	The space is effectively infinite, so we sample it deterministically from a seed — endless fresh questions, no realistic repeats within a sheet.

Both resolve to one materialized shape, which is all the printer, scorer, and analyzer ever see:

{ kind, operands, answer, render, tags }
// tags drive error analysis: 'carry', 'borrow', 'across_zero', 'fact:7x8', 'review' …
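
The sketch below shows how both flavours could resolve to that one shape. The `Question` field types, the function names, and the `kind` strings are assumptions; the text only fixes the five field names and the tag vocabulary.

```typescript
// The materialized shape, with assumed field types.
interface Question {
  kind: string;
  operands: number[];
  answer: number;
  render: string;
  tags: string[];
}

// Enumerable flavour: list every fact in one times table, so coverage is guaranteed.
function timesTableBank(table: number, maxMultiplier: number): Question[] {
  const bank: Question[] = [];
  for (let m = 1; m <= maxMultiplier; m++) {
    bank.push({
      kind: "mul_fact",
      operands: [table, m],
      answer: table * m,
      render: table + " × " + m + " =",
      tags: ["fact:" + table + "x" + m],
    });
  }
  return bank;
}

// True if adding a and b carries in any column.
function hasCarry(a: number, b: number): boolean {
  while (a > 0 && b > 0) {
    if ((a % 10) + (b % 10) >= 10) return true;
    a = Math.floor(a / 10);
    b = Math.floor(b / 10);
  }
  return false;
}

// Generative flavour: operands would come from the seeded generator;
// here they are passed in so the example stays deterministic.
function additionQuestion(a: number, b: number): Question {
  return {
    kind: "add",
    operands: [a, b],
    answer: a + b,
    render: a + " + " + b + " =",
    tags: hasCarry(a, b) ? ["carry"] : [],
  };
}
```

Because the answer and tags are computed when the question is built, the printer, scorer, and analyzer never need to know which flavour produced it.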

5.3 · Marking — the exact wrong questions

Because every printed question carries a stable number, the supervisor marks by exception: they enter only the question numbers the child got wrong, plus the single session time. Everything unmarked is assumed correct. This is low-effort for the parent and high-signal for the system — each error inherits its tags, so the engine learns not just that a child erred but what kind of error it was (a borrow-across-zero slip, a weak 7× fact), which directly drives both reteaching and over-practice.
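
A minimal sketch of marking by exception follows, assuming a hypothetical `markByException` function and result shape. It takes only the wrong question numbers and the session time, treats everything else as correct, and tallies the tags of the missed questions.

```typescript
// Minimal view of a printed question: its stable number and its tags.
interface PrintedQuestion {
  number: number;
  tags: string[];
}

interface SessionResult {
  total: number;
  wrong: number;
  accuracy: number; // fraction in [0, 1]
  secondsPerQuestion: number;
  errorTags: { [tag: string]: number }; // how often each tag was missed
}

function markByException(
  questions: PrintedQuestion[],
  wrongNumbers: number[],
  sessionSeconds: number
): SessionResult {
  const errorTags: { [tag: string]: number } = {};
  let wrong = 0;
  for (const q of questions) {
    // Anything the supervisor did not list is assumed correct.
    if (wrongNumbers.indexOf(q.number) === -1) continue;
    wrong++;
    for (const tag of q.tags) {
      errorTags[tag] = (errorTags[tag] || 0) + 1;
    }
  }
  const total = questions.length;
  return {
    total: total,
    wrong: wrong,
    accuracy: total === 0 ? 0 : (total - wrong) / total,
    secondsPerQuestion: total === 0 ? 0 : sessionSeconds / total,
    errorTags: errorTags,
  };
}
```

Numbers that do not appear on the sheet are ignored, and a number entered twice is counted once, since the loop runs over the printed questions rather than over the supervisor's input.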

5.4 · Time-boxed, adaptive volume

A fixed "100 questions a day" is the wrong primitive, because 100 easy sums and 100 long divisions are not the same ask. Instead we hold the time roughly constant and derive the count from the child's observed speed on the upcoming skill, clamped between a minimum and a maximum sheet length:

count = clamp( round( targetSeconds / estimatedSecondsPerQuestion ), minCount, maxCount )

This produces two self-protecting behaviours for free:

A harder skill is slower per question, so it yields fewer questions — the session never runs long.
A faster child gets more questions in the same ten minutes — more fluency volume, exactly when they can absorb it.

A consequence worth noting: since the session length is now constant by construction, "fast enough" can no longer be measured in minutes. The speed gate becomes seconds-per-question against the skill's target — which is the same number the mastery rule already uses. The design stays internally consistent.
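
The sizing rule can be sketched in a few lines. The bounds `minCount` and `maxCount`, their example values, and the fallback for a missing speed estimate are assumptions for illustration, not values taken from the system.

```typescript
// Restrict value to the closed interval [min, max].
function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Number of questions that fit the time box at the child's estimated pace.
function worksheetCount(
  targetSeconds: number,
  estimatedSecondsPerQuestion: number,
  minCount: number,
  maxCount: number
): number {
  // Assumed fallback: with no usable speed estimate, start with the shortest sheet.
  if (estimatedSecondsPerQuestion <= 0) return minCount;
  const raw = Math.round(targetSeconds / estimatedSecondsPerQuestion);
  return clamp(raw, minCount, maxCount);
}

// A ten-minute box (600 s) with hypothetical bounds of 5 and 120 questions.
const easySums = worksheetCount(600, 6, 5, 120); // 600 / 6 = 100
const longDivision = worksheetCount(600, 45, 5, 120); // 600 / 45 = 13.3, rounds to 13
const veryFast = worksheetCount(600, 2, 5, 120); // 300, clamped to 120
```

The same ten minutes yields 100 questions at 6 seconds each but only 13 at 45 seconds each, which is the self-protecting behaviour described above.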

5.5 · The mastery state machine

After each session the engine updates a rolling record (accuracy and median seconds-per-question) for the active skill. At the week boundary, a small, fully explainable policy decides what comes next.

Rolling state (sustained over at least 2–3 sessions)	Decision
≥ 95% accurate and within the speed target	Advance one skill, entering at its easiest setting
≥ 95% accurate but slower than the speed target	Repeat the skill to build speed
85% to just under 95% accurate	Repeat the skill to consolidate
Below 85%	Scaffold down: reduce difficulty to rebuild confidence
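
The policy table translates almost directly into a pure function. In this sketch the names `RollingRecord`, `decide`, and the decision strings are hypothetical, and the handling of too few sessions (repeat and gather more evidence) is an assumption the table does not spell out.

```typescript
type Decision =
  | "advance"
  | "repeat_for_speed"
  | "repeat_to_consolidate"
  | "scaffold_down"
  | "repeat_more_evidence"; // assumed outcome when too few sessions exist

interface RollingRecord {
  accuracy: number; // rolling fraction correct, 0 to 1
  medianSecondsPerQuestion: number;
  sessions: number; // sessions contributing to the rolling record
}

function decide(
  record: RollingRecord,
  targetSecondsPerQuestion: number,
  minSessions: number
): Decision {
  // The table only applies once the state has been sustained.
  if (record.sessions < minSessions) return "repeat_more_evidence";
  if (record.accuracy < 0.85) return "scaffold_down";
  if (record.accuracy < 0.95) return "repeat_to_consolidate";
  // From here on the child is at least 95% accurate; only speed is in question.
  if (record.medianSecondsPerQuestion <= targetSecondsPerQuestion) return "advance";
  return "repeat_for_speed";
}
```

Each branch corresponds to one row of the table, so any verdict can be explained by pointing at the branch that fired and the two numbers that triggered it.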

How "never too hard" is guaranteed

Advancement is gated on mastery, so a child cannot be pushed forward before they are ready. A brand-new skill always enters at ease 0, its gentlest form. When a child struggles, difficulty decreases. And the ladder moves at most one rung per week. There is no path by which the next sheet is a cliff.

5.6 · Cumulative review

Mastered skills decay if never revisited, so roughly a fifth of each day's questions are drawn from skills the child has already passed, interleaved and shuffled in. This keeps old fluency alive and mirrors Kumon's own cumulative design, without the child noticing a separate "revision" task.
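
One way to mix review into a day deterministically is sketched below: split the count, take that many items from each pool, and shuffle with a seeded generator. The function names, the exact 0.2 share, and the small generator are assumptions for illustration.

```typescript
// Small seeded generator (Park-Miller); stands in for the engine's own RNG.
function makeLcg(seed: number): () => number {
  let state = Math.floor(Math.abs(seed)) % 2147483647;
  if (state === 0) state = 1;
  return function (): number {
    state = (state * 48271) % 2147483647;
    return state / 2147483647;
  };
}

// Split a day's total into fresh and review counts.
function splitCount(total: number, reviewShare: number): { fresh: number; review: number } {
  const review = Math.round(total * reviewShare);
  return { fresh: total - review, review: review };
}

// Fisher-Yates shuffle on a copy, driven by the supplied generator.
function seededShuffle<T>(items: T[], rng: () => number): T[] {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

// Take the first items from each pool, then interleave them by shuffling.
function buildDay<T>(
  fresh: T[],
  review: T[],
  total: number,
  reviewShare: number,
  rng: () => number
): T[] {
  const split = splitCount(total, reviewShare);
  const picked = fresh.slice(0, split.fresh).concat(review.slice(0, split.review));
  return seededShuffle(picked, rng);
}
```

Because the shuffle is driven by the seeded generator, the interleaving is part of what regenerates identically for a given week.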

5.7 · The weekly concept sheet

Each packet opens with a short, friendly explainer of the one or two concepts the child most struggled with. The selection and the worked examples are entirely deterministic — the engine ranks the weakest concept tags from real errors and generates fully-solved examples. Only the wording is optionally handed to the LLM, which is told to use the supplied, already-correct examples and not to invent maths. If the model is unavailable or its output fails validation, a deterministic template is used instead. The child always gets a correct, encouraging sheet; the AI only changes how warm it sounds.
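
The validate-then-fall-back step might look like the sketch below. The validation rule shown (the prose may mention only numbers that appear in the supplied examples, and must mention every answer) is one conservative possibility assumed for illustration, as are the names `SolvedExample`, `isProseValid`, `templateProse`, and `conceptProse`. No model is called here; `llmOutput` stands for whatever text came back, or `null` if the call failed.

```typescript
// A fully solved example produced by the deterministic engine.
interface SolvedExample {
  render: string; // e.g. "47 + 38"
  answer: number; // e.g. 85
}

// All runs of digits in a piece of text, as strings.
function extractNumbers(text: string): string[] {
  const found = text.match(/\d+/g);
  return found === null ? [] : found;
}

// Assumed rule: no invented numbers, and no answer left out.
function isProseValid(prose: string, examples: SolvedExample[]): boolean {
  let allowed: string[] = [];
  for (const ex of examples) {
    allowed = allowed.concat(extractNumbers(ex.render), [String(ex.answer)]);
  }
  const used = extractNumbers(prose);
  for (const n of used) {
    if (allowed.indexOf(n) === -1) return false; // a number the engine never supplied
  }
  for (const ex of examples) {
    if (used.indexOf(String(ex.answer)) === -1) return false; // an answer was dropped
  }
  return prose.trim().length > 0;
}

// Deterministic fallback wording.
function templateProse(examples: SolvedExample[]): string {
  const lines: string[] = [];
  for (const ex of examples) {
    lines.push(ex.render + " = " + ex.answer);
  }
  return "Let's look at these together:\n" + lines.join("\n");
}

// Use the model's wording only if it passes validation.
function conceptProse(llmOutput: string | null, examples: SolvedExample[]): string {
  if (llmOutput !== null && isProseValid(llmOutput, examples)) return llmOutput;
  return templateProse(examples);
}
```

A rule this strict will also reject harmless prose that mentions an intermediate sum or the child's age; in that case the child simply gets the template, which is the safe direction to fail.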

6 · The weekly lifecycle

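End to end, a week flows through seven steps. Five are pure computation; the reteach step pairs a deterministic selection with optional LLM phrasing, and the remaining step is the child working on paper.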

1. Decide the next skill & difficulty from the child's competency record. [deterministic]
2. Size the worksheet to the ten-minute box from observed speed. [deterministic]
3. Generate six days of seeded questions plus answer keys, mixing in cumulative review. [deterministic]
4. Reteach: select the weakest concepts and solved examples; wrap them in prose. [select: deterministic · phrase: optional LLM]
5. Print the packet; the child works on paper for ~10 min/day.
6. Mark: the supervisor enters the exact wrong questions and the session time. [deterministic]
7. Update competency; the dashboards reflect new mastery, speed, and error patterns. [deterministic]

The loop then returns to step 1 for the following week, now with fresh evidence.

7 · Two children, one ladder

The system is multi-student from the start, but intentionally simple about it: there is one shared curriculum ladder, and each child carries their own pointer and their own competency record along it. Two sisters can climb the very same B → C → D path entirely independently — one consolidating two-digit carrying while the other has moved on to times tables — with no duplicated content and no interaction between their trajectories. The same model extends upward to Year 13 simply by lengthening the ladder.
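
In data terms this can be as small as one shared array and one index per child. The skill ids and the `Student` shape below are hypothetical, and the real record also carries competency data that is omitted here.

```typescript
// One shared ladder of skill ids (placeholders for illustration).
const SHARED_LADDER: string[] = [
  "add_2d_1d_no_carry",
  "add_2d_1d_carry",
  "mul_table_2",
];

// Each child carries only a pointer into the shared ladder.
interface Student {
  name: string;
  ladderIndex: number;
}

function currentSkill(student: Student, ladder: string[]): string {
  return ladder[student.ladderIndex];
}

// Returns a new record one rung higher, never past the top of the ladder.
function advance(student: Student, ladder: string[]): Student {
  const next = Math.min(student.ladderIndex + 1, ladder.length - 1);
  return { name: student.name, ladderIndex: next };
}
```

Advancing one child produces a new record for that child only, so the other child's position cannot change as a side effect.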

8 · Infrastructure — minimal by design

The deployment target is deliberately humble: it should cost nothing and run on managed edge infrastructure. Because the engine is pure and dependency-free, the move from the local build to Cloudflare is a wiring exercise, not a rewrite.

Concern	Local build (today)	Cloudflare (next)
Logic / engine	Pure modules run by a CLI	Same modules inside a Worker
Storage	A JSON file	D1 (SQLite) — schema already mirrors it 1:1
Weekly generation	A command	A Cron Trigger (Sunday night)
Dashboards & marking	Static HTML	Pages, behind Zero Trust access
Printing	Print-CSS HTML → browser "Save as PDF"	Unchanged — no PDF library needed
Concept prose	Deterministic template	A single OpenRouter call per child per week

Printing as HTML rather than a generated PDF is a quiet but important choice: a worksheet is just a grid, print-CSS renders it crisply on A4, and we avoid heavyweight PDF libraries that strain a Worker's limits.

9 · Decisions & trade-offs

A few choices were made consciously, each trading a little flexibility for a lot of trust.

Decision	Why	What we gave up
Difficulty is constant within a week	A child never meets a mid-week spike; the six sessions give a clean read on one skill	A small lag — mastery mid-week is acted on next week, not the same day
Questions are materialized, not just regenerated on the fly	Simple marking-by-number and rich error analysis	A little storage (trivial at this scale)
One skill per day, plus review	The single session timer maps cleanly to one skill's fluency	Less variety within a day (mitigated by the review mix)
Time-boxed count over a fixed count	Sessions stay ~10 min regardless of difficulty; volume scales with ability	The daily count varies (and is therefore an outcome, not a fixed promise)

10 · What is deliberately deferred

Phase 0 builds the part that must be correct — the engine, curriculum, adaptation, printing, and dashboards — and proves it with a test suite and an end-to-end demo. Everything that is plumbing is left for later phases: the database swap, the hosted marking interface, automated weekly generation, and the live LLM concept prose. The sequencing is intentional. The hardest thing to get right is the judgement, so that is built and verified first; the cloud is the easy part, added once the brain is trustworthy.

In one sentence

A child's learning is too important to be probabilistic — so the maths, the marking, and the progression are deterministic and reproducible, and the language model is invited only to make a correct explanation sound kind.