Walkthrough 2: Understanding LLMs

Second dry-run stress test. This topic stresses the pipeline differently than design patterns did: the field moves fast, sources contradict each other, the depth range is enormous, and political dimensions are unavoidable.

Gaps found and resolutions

1. Temporal sensitivity in sources

Problem: LLM capabilities change monthly. A 2023 source saying “context windows max at 8k” was true then and wrong now. The self-check validates internal consistency but not currency.

Resolution: The refinement chain (Stage 3b) explicitly handles time-sensitive claims in two ways:

  • Hedge statements: When a value could change, the script contextualizes it. “In 2023, most context windows topped out around 8k tokens, but this has been steadily improving. The principle of why context window size matters is more durable than the specific number.”
  • Explicit currency checks: When the script contains numbers or capability claims that are likely to shift, the accuracy review flags them with a prompt like “What is the current standard value for X?” and verifies against the most recent available source.

The goal isn’t to be a news service — it’s to teach durable concepts and be honest about which specific values have a shelf life.
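One way the currency check could work is a simple pre-pass that flags sentences likely to contain time-sensitive claims before the accuracy review runs. A minimal sketch, assuming a text-only script; the function name and regex patterns are illustrative, not part of the actual pipeline:

```python
import re

# Patterns that often mark time-sensitive claims: specific capability
# numbers ("8k tokens", "175B parameters") or absolute phrasing.
TIME_SENSITIVE_PATTERNS = [
    r"\b\d+[kKmMbB]?\s*(tokens?|parameters?)\b",
    r"\b(current|latest|state[- ]of[- ]the[- ]art|max(imum)?)\b",
]

def flag_time_sensitive_claims(script_text: str) -> list[str]:
    """Return sentences containing likely time-sensitive claims, so the
    accuracy review can verify each against the most recent source."""
    flagged = []
    # Naive sentence split on terminal punctuation; fine for a pre-pass.
    for sentence in re.split(r"(?<=[.!?])\s+", script_text):
        if any(re.search(p, sentence, re.IGNORECASE) for p in TIME_SENSITIVE_PATTERNS):
            flagged.append(sentence)
    return flagged
```

Durable conceptual statements pass through untouched; only sentences with shelf-life values get routed to the "what is the current standard value for X?" verification step.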

2. Learning arc doesn’t need to be optimal

Problem: Not every topic has a consensus teaching order. LLMs could be taught bottom-up (math → transformers), top-down (prompt → layers peeled back), or problem-oriented (why does it hallucinate?).

Resolution: The agent already makes these decisions. When someone types “how does an LLM work” into a chat, it picks a teaching order without agonizing. Some pedagogical reasoning is good, but imperfection is a feature, not a bug.

A podcast host meanders. They follow a thread that’s interesting. The learning arc should feel like a thoughtful conversation, not a textbook table of contents. If Episode 3 takes a slight detour because the Expert got excited about tokenization edge cases, that’s authentic and engaging.

The curriculum agent should apply basic pedagogical reasoning (don’t explain attention before the learner knows what a token is) but shouldn’t over-optimize the sequence.

3. Source synthesis and useful analogies

Problem: Different sources explain the same concept differently. Some simplifications are useful, some are misleading. How does the agent choose?

Resolution: The agent synthesizes across sources and the refinement chain catches harmful simplifications. But also: stealing good analogies is useful. If Andrej Karpathy has a great metaphor for attention and the Anthropic blog has a clear explanation of the training loop, use both. The script should grab the best explanation from each source and credit it.

The line between “useful simplification” and “misleading oversimplification” is exactly what Stage 3c (self-check Feynman) is designed to catch. “The model understands what you mean” would fail falsification — “what would break if that were true?” reveals that the model has no grounding, contradicting the claim of understanding.

4. Host echoes the learner’s actual misconceptions

Problem: The Host/Expert dialogue is most powerful when the Host voices the learner’s real wrong mental models, not generic ones.

Resolution: This is critical. The pre-assessment maps specific misconceptions. These feed directly into script generation. If the learner said “tokens are basically words” in the pre-assessment, the Host says exactly that in the episode:

Host: “So it breaks my sentence into words —”
Expert: “Not quite words. This is the part that surprises people…”

The learner hears their own misunderstanding voiced and corrected in real time. This is a form of personalization almost no other educational format offers. The script generation prompt must receive the pre-assessment’s “shaky” findings as input and use them to write the Host’s dialogue.

For the shareable version of the course, the Host would use common misconceptions from the topic research rather than personal ones. Same mechanism, different source.
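The mechanism described above can be sketched as a small prompt-assembly helper. This is a hypothetical illustration: the function name, parameter names, and prompt wording are assumptions, not the actual script-generation prompt:

```python
def build_host_briefing(misconceptions: list[str], personal: bool = True) -> str:
    """Fold misconception findings into the script-generation prompt so
    the Host voices them and the Expert corrects them on air.

    `personal=True` uses the learner's own pre-assessment findings;
    `personal=False` is the shareable-course path, which draws on
    common misconceptions from topic research instead.
    """
    source = ("the learner's pre-assessment" if personal
              else "topic research on common misconceptions")
    lines = [f"The Host should voice these misconceptions (from {source}), "
             "nearly verbatim, for the Expert to correct:"]
    lines += [f"- {m}" for m in misconceptions]
    return "\n".join(lines)
```

Same mechanism, different source: only the `personal` flag changes between the personalized and shareable versions.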

5. The syllabus is a guide, not a cage

Problem: When the self-check finds a gap, what happens? And what about when the learner brings up something outside the syllabus?

Resolution: The syllabus is not a holy relic. It’s a starting guide. The system should be responsive:

  • If the self-check finds a minor gap (missing explanation, weak analogy), it re-runs the previous stage with the note as additional context. No human needed.
  • If the self-check finds a major gap (fundamental concept missing from the curriculum), it flags for review but doesn’t block.
  • If the learner brings up a related concept during a tutor session (e.g., mentions nearest neighbor while discussing embeddings), the tutor explores it. If it’s substantive enough, it can generate a bonus episode or weave it into the next episode’s script.
  • If listener questions pull the topic in an interesting direction, the system can adapt. The syllabus was generated before the learner started — it shouldn’t be treated as more authoritative than the learner’s actual learning needs.

The rigidity is in the pipeline structure, not the content.
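The first two branches above amount to a small dispatch rule. A sketch under stated assumptions: `GapSeverity`, the callback names, and the return convention are all illustrative, not the pipeline's real interfaces:

```python
from enum import Enum, auto

class GapSeverity(Enum):
    MINOR = auto()   # missing explanation, weak analogy
    MAJOR = auto()   # fundamental concept absent from the curriculum

def handle_self_check_gap(severity: GapSeverity, note: str,
                          rerun_stage, flag_for_review):
    """Route a self-check finding: minor gaps re-run the previous stage
    with the note as extra context (no human needed); major gaps flag
    for human review without blocking the pipeline."""
    if severity is GapSeverity.MINOR:
        return rerun_stage(extra_context=note)
    flag_for_review(note)   # non-blocking: the pipeline continues
    return None
```

The key property is that neither branch halts the pipeline; review is advisory, matching the syllabus-as-guide stance.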

6. TTS voice consistency

Problem: Each of the two distinct voices needs to remain consistent across episodes generated at different times.

Resolution: Voice profiles (model, pitch, speed, vocal quality) are stored in the course settings and applied consistently. The Feynman tutor, when quizzes have an oral option, would use a stored voice profile as well.

Implementation detail, not a design gap.
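A minimal sketch of what "stored in the course settings" might look like. Field names and model identifiers are placeholders; the real parameters depend on the TTS provider:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VoiceProfile:
    """Stored once in course settings and reapplied to every episode,
    so each speaker sounds identical across generations."""
    speaker: str        # "host", "expert", or "tutor"
    model: str          # TTS model identifier (provider-specific)
    pitch: float = 0.0  # offset from the model's default pitch
    speed: float = 1.0  # playback-rate multiplier

# Hypothetical course settings: one profile per speaker, including the
# Feynman tutor for courses where quizzes have an oral option.
course_settings = {
    "voices": {
        "host":   VoiceProfile("host",   "tts-model-a", pitch=1.5),
        "expert": VoiceProfile("expert", "tts-model-b", speed=0.95),
        "tutor":  VoiceProfile("tutor",  "tts-model-a"),
    }
}
```

Freezing the dataclass makes the profiles immutable, so a later episode generation can't drift from the voices established in Episode 1.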

7. Controversial topics and false balance

Problem: LLMs as a topic inevitably touches on politics (bias, alignment, values, labor displacement). How does the system handle it?

Resolution: Transparency about disagreement is not the same as false balance. The principle:

  • Legitimate scientific/technical disagreement: Present both sides. “Researchers actively debate whether RLHF or DPO produces better alignment results” — both positions are grounded in evidence and published research.
  • Manufactured controversy where consensus exists: Don’t give equal weight. Climate change is real. Vaccines don’t cause autism. LLMs are not conscious. The system can acknowledge that some people hold contrary views and explain why those views exist (mistrust of institutions, misunderstanding of mechanism) without presenting them as equally valid scientific positions.
  • Value judgments: “Should AI replace jobs?” isn’t a factual question. The system presents the factual landscape (what’s happening, what research shows about impacts) and lets the learner form their own position.

The Expert stays grounded in evidence. The Host can voice the learner’s concerns and skepticism. The Expert addresses them honestly without either dismissing or amplifying.