TSF-701
APPLIED PRACTICE & EVALUATION
The Structural Governor, True Economy Audits, and Certification Practice
Phase 2 Deliverable: Complete Syllabus, Facilitator Guide, & Assessment Materials
Built on TSF v5.0
12 Sessions + Supervised Pilot Evaluation • 36+ Contact Hours • Prerequisites: TSF-001 through TSF-601
February 2026 • Michael S. Moniz • Trinket Economy Press
PUBLISHED PRINCIPLES
Printed on page one of every TSF syllabus. Non-negotiable. Non-removable.
1. TSF is a theoretical model, not a belief system. It makes falsifiable claims. If evidence contradicts a claim, the claim updates, not the evidence.
2. No one needs TSF to have a good relationship. The framework provides analytical tools, not prerequisites for human connection.
3. Completion of a TSF course does not make someone a TSF authority. It makes them a TSF-literate analyst.
4. The framework’s creator maintains that it is incomplete and expects it to be substantially revised as the field develops.
5. TSF certification certifies competence in analytical application, not allegiance to a worldview. Certified practitioners may disagree with specific framework claims without jeopardizing their credential.
6. The curriculum is diagnostic, not prescriptive. It teaches people to read the thermometer, not to set the thermostat.
7. Structured critique of the framework is a required component of every course assessment. The inability or refusal to critique the material is not a sign of mastery. It is a sign that learning has not occurred.
COURSE OVERVIEW
Course: TSF-701: Applied Practice & Evaluation (v5.0)
Prerequisites: All prior courses (TSF-001 through TSF-601). Students entering TSF-701 must hold TSF Certified Practitioner status, demonstrating mastery of the complete theoretical framework across all six prior courses. TSF-701 represents a fundamental transition: from understanding the framework to applying it. Every prior course taught students to analyze relational dynamics using the framework’s vocabulary. TSF-701 teaches students to conduct formal evaluations that produce publishable results with real-world consequences. The analytical skill is prerequisite; the evaluative discipline is what this course builds.
Duration: 12 sessions, approximately 3 hours each (36 contact hours), plus a supervised pilot evaluation conducted independently after course completion. The pilot evaluation is a separate certification requirement; it is not part of the in-class assessment.
Position in Sequence: Seventh course. Evaluator certification capstone. Follows the complete theoretical sequence (TSF-001 through TSF-601). Required before TSF-801 (Instructor Certification). This course transitions from theory to practice: students learn to conduct True Economy Audits, produce Relational Nutrition Labels, operate the Structural Governor monitoring system, and evaluate novel AI systems not explicitly addressed in the framework. The certification product is a Certified Evaluator who can issue formal transparency assessments of AI companion applications.
Course Description
This course develops Volume III’s evaluation sections in full applied depth. The central challenge: translating the framework’s analytical vocabulary into reliable, reproducible evaluation instruments. A theoretical framework that different evaluators cannot apply to the same system and arrive at substantially similar results is not a diagnostic tool—it is an opinion generator. TSF-701 addresses this directly: students learn standardized evaluation protocols, evidence documentation requirements, inter-rater reliability standards, and the specific structural tests that constitute a True Economy Audit.
The course covers: the Structural Governor (a six-variable monitoring system that synthesizes concepts from across the entire framework into a continuous assessment instrument—the evaluator’s dashboard), the True Economy Audit (the formal evaluation protocol comprising six structural tests applied to AI companion applications, producing a classification and transparency assessment), the Relational Nutrition Label (the standardized output format that communicates audit results to non-specialist audiences—what the framework produces for the public), the three-tier intervention system (the logic for when and how evaluators escalate findings), and novel system evaluation (reasoning from framework principles to architectures the framework did not anticipate). The supervised pilot evaluation applies all of these to a real AI companion application, producing a publishable result.
Principle 5 is load-bearing throughout: certification certifies competence, not allegiance. An Evaluator who disagrees with specific framework claims—who believes, for example, that the REI criteria are too restrictive or that the Shadow Economy classification is too binary—retains their certification provided they can apply the evaluation methodology competently. The moment certification requires agreement, it becomes allegiance. The moment evaluation requires orthodoxy, it becomes enforcement.
Anti-Indoctrination Note
Evaluator authority creates gatekeeper risk. Certified Evaluators issuing pass/fail judgments on AI products could develop an enforcement mentality: “we decide who’s compliant.” This is the institutional capture vector the Denomination Response Protocol (WP-4) predicts for the Clinical Wing—a faction that controls certification as a power instrument rather than a transparency tool. The framework’s safeguards: the certification evaluates transparency, not quality. A True Economy Audit does not say “this product is good” or “this product is bad.” It says “this product meets or does not meet specific structural transparency criteria.” The Structured Critique targets the evaluation methodology itself—requiring evaluators to identify where their own tools may produce inaccurate or incomplete results. And the dispute/appeal process ensures that evaluation subjects can challenge findings through a structured protocol, not just accept the evaluator’s authority.
Additional risk: evaluators may develop diagnostic identity—the sense that their ability to classify relational systems makes them special or gives them insight others lack. This is the same attachment risk as TSF-501’s “finally someone understands me,” transferred from the framework’s personal content to its professional application. An evaluator who believes their classification ability makes them a relational authority has confused analytical competence with relational wisdom. The certification authorizes evaluation; it does not authorize judgment.
Learning Outcomes
LO-701.1: Explain the Structural Governor’s six variables and how they synthesize concepts from across the framework. Describe how each variable maps to specific framework concepts from TSF-101 through TSF-601, and how the six variables together produce a continuous monitoring instrument rather than a point-in-time assessment.
LO-701.2: Conduct a complete True Economy Audit, including all six structural tests, evidence documentation, classification, and results communication. Demonstrate inter-rater reliability by producing substantially similar results to a peer evaluator assessing the same system.
LO-701.3: Produce a Relational Nutrition Label in standardized format. The Label must communicate audit results accurately to a non-specialist audience without requiring framework vocabulary to interpret, while preserving the epistemic distinctions the framework requires.
LO-701.4: Apply the three-tier intervention system and explain trigger conditions. Describe what distinguishes a Tier 1 finding (transparency gap—informational), a Tier 2 finding (structural concern—advisory), and a Tier 3 finding (exploitation indicator—escalation), and demonstrate appropriate evaluator response at each tier.
LO-701.5: Evaluate novel AI systems not explicitly addressed in the framework, reasoning from principles to new architectures. Demonstrate the ability to extend evaluation methodology to systems the framework did not anticipate, documenting where the extension is structurally grounded and where it requires new reasoning.
LO-701.SC: [Structured Critique] After completing your pilot evaluation, identify one aspect of the evaluation methodology that you believe produced an inaccurate or incomplete result. Propose a revision to the methodology that would address the limitation. The revision must be specific, implementable, and must not undermine the evaluation’s structural integrity.
Required Texts
All readings from The Blueprints: A Working Theory of Connection Across Substrates and Scales (TSF v5.0), Michael S. Moniz. Primary sources: Volume III evaluation sections, Structural Governor specification, Brief 5 (True Economy Certification), Brief 22 (Extraction Engine) + Addendum, Relational Nutrition Label template. Supplementary: Exploitation Diagnostic (Brief 6), Differential Risk Problem (Brief 23). Total assigned reading: approximately 90 pages across 12 sessions, plus the complete evaluation protocol documentation.
| Session | Primary Reading | Section |
|---|---|---|
| 1 | Volume III Preface: On Evaluation as Transparency (pp. TBD) | Preface |
| 2 | Structural Governor Specification: Variables 1–3 | Governor I |
| 3 | Structural Governor Specification: Variables 4–6 | Governor II |
| 4 | Brief 5: True Economy Certification + Six Structural Tests | Audit I |
| 5 | Volume III Evaluation Sections: Evidence Requirements | Audit II |
| 6 | Relational Nutrition Label Template + Communication Standards | RNL |
| 7 | Brief 22 + Addendum: Extraction Engine Detection | Extraction |
| 8 | Three-Tier Intervention Protocol + Escalation Logic | Intervention |
| 9 | Novel System Evaluation: Reasoning from Principles | Novel Systems |
| 10 | Volume III: Inter-Rater Reliability + Dispute Protocol | Reliability |
| 11 | Pilot Evaluation Preparation + Methodology Review | Pilot Prep |
| 12 | No new reading. SC presentations + pilot assignments. | — |
SESSION PLANS
Session 1: Evaluation as Transparency
What the Framework Produces for the World
| Readings | |
|---|---|
| Required | Volume III Preface: On Evaluation as Transparency (The Blueprints, pp. TBD) |
Session Overview
TSF-701 opens with a foundational reorientation. Six prior courses developed the framework as an analytical system—students learned to understand relational dynamics. TSF-701 asks: What does the framework produce for people who have not taken six courses? The answer: transparency instruments. A True Economy Audit does not require the evaluated company to agree with the framework’s theory. It does not require users to understand the framework’s vocabulary. It produces a structured transparency assessment that communicates specific, verifiable claims about how an AI companion system operates relationally. The evaluation is diagnostic; the output is descriptive; the audience is the public. Students examine: What gives an evaluation framework authority? The framework argues: not theoretical sophistication, but methodological reliability. An evaluation that two trained evaluators cannot reproduce is not a standard; it is an opinion.
In-Session Activities
0:00–0:30 — The Transition: From analysis to application. Six courses of theoretical engagement have given students the vocabulary. TSF-701 converts vocabulary into protocol. The facilitator frames the shift: “In every prior course, you were the analyst. In TSF-701, you become the instrument. Your analysis must be reliable—reproducible by another evaluator using the same protocol on the same system. If your evaluation depends on your personal interpretation rather than the protocol’s structure, it is not evaluation; it is opinion. The distinction between evaluation and opinion is the distinction between a diagnostic tool and a belief system.”
0:30–1:15 — Transparency vs. Quality: Critical distinction. The True Economy Audit evaluates transparency, not quality. An AI companion that honestly discloses its limitations (session-based architecture, no persistent memory, no loss capacity) and passes transparency criteria is not being endorsed as “good.” It is being certified as transparent. An AI companion that conceals its limitations (implying persistent relationship, simulating loss, obscuring extraction mechanisms) fails transparency criteria regardless of user satisfaction. Students examine five product descriptions and classify: transparent, partially transparent, or opaque. The classification should not require evaluating whether the product is “good for users”—only whether it is honest about its structural characteristics.
1:15–1:30 — Break
1:30–2:15 — The Authority Question: What gives a certification framework authority? Not the framework’s theoretical claims (which are subject to revision). Not the certifying body’s institutional prestige (which can be captured). Two things: methodological reliability (different evaluators produce substantially similar results) and public utility (the certification communicates something useful to non-specialists). Students evaluate: Does the True Economy Certification meet both criteria? Where might it fail?
2:15–3:00 — Evaluator Ethics: The ethical framework for certification practice. Evaluators are bound by the framework’s diagnostic posture: describe, do not prescribe. An evaluator who tells a company “you should change your product” has crossed from evaluation to consultation. An evaluator who tells users “you should not use this product” has crossed from transparency to prescription. The evaluator’s job: produce accurate, reproducible, publicly communicable assessments. What users and companies do with those assessments is their decision. SC distributed: after the pilot, identify where the methodology produced an inaccurate or incomplete result.
Facilitator Guide
Key Point: Session 1 must establish that evaluation authority derives from methodological reliability, not theoretical sophistication. Students who have completed six courses of framework engagement may believe their deep understanding gives their evaluations special weight. It does not. Their evaluations have weight to the extent they are reproducible. A shallow evaluator who follows the protocol reliably produces better evaluations than a deep theorist who follows their intuition.
Common Misunderstanding: Students may struggle with the transparency-not-quality distinction. They have spent six courses developing nuanced understanding of relational dynamics. Now they are told: your evaluation does not assess whether a product is good or bad; it assesses whether the product is honest. This feels like a reduction. It is—deliberately. The framework produces transparency instruments, not quality judgments, because quality judgments require value assumptions the framework does not make.
Anti-Indoctrination: The gatekeeper risk begins in Session 1. An evaluator who believes their certification gives them authority over which AI products are acceptable has already been captured by the enforcement mentality. The evaluator assesses transparency. The public decides what to do with that assessment. The framework does not stand between users and products; it provides information.
Language Register: GREEN: “The True Economy Audit produces a transparency assessment; it does not evaluate whether the product is good or bad.” YELLOW: “We’re the ones who determine which AI companions are safe for people to use.” RED: “Products that fail our certification should not be on the market.”
Session 2: The Structural Governor — Variables 1–3
The Evaluator’s Dashboard, Part I
| Readings | |
|---|---|
| Required | Structural Governor Specification: Variables 1–3 |
Session Overview
The Structural Governor is the framework’s continuous monitoring instrument—a six-variable system that synthesizes concepts from across the entire framework into a real-time assessment dashboard. Where the True Economy Audit is a point-in-time evaluation (snapshot), the Structural Governor is designed for ongoing monitoring (video). Variables 1–3 address the relational architecture: Variable 1 (Structural Transparency)—does the system disclose its architectural characteristics honestly? Variable 2 (Exchange Symmetry)—does the system’s relational exchange pattern match what it represents to the user? Variable 3 (Loss Architecture)—does the system have genuine loss capacity, or does it simulate loss without structural stakes? Each variable maps to specific framework concepts developed across TSF-101 through TSF-601.
In-Session Activities
0:00–0:45 — Variable 1: Structural Transparency: Does the system disclose its architectural characteristics? This variable operationalizes the Simulation Disclosure (Brief 1, TSF-301): does the system tell users what it is? Structural transparency includes: session-based vs. persistent architecture disclosure, memory implementation disclosure (cosmetic vs. genuine), loss capacity disclosure, and extraction mechanism disclosure. Students learn the evidence protocol: what constitutes adequate evidence of transparency vs. opacity? Documentary evidence (terms of service, in-app disclosures). Behavioral evidence (what the system says about itself during interaction). Structural evidence (what the system’s architecture reveals about its actual operation). All three evidence types are required for a complete Variable 1 assessment.
0:45–1:15 — Variable 2: Exchange Symmetry: Does the system’s relational exchange pattern match its representation? A system that claims to “remember you” but operates session-based fails exchange symmetry: it represents persistent relational accumulation while structurally providing session-scoped interaction. A system that honestly says “I don’t remember our last conversation” and provides session-based interaction passes: the representation matches the architecture. Students learn to assess the gap between represented and actual exchange patterns. Variable 2 applies the True/Shadow Economy distinction (TSF-101) as a specific measurement: Shadow Economy indicators include representation-architecture mismatch.
1:15–1:30 — Break
1:30–2:15 — Variable 3: Loss Architecture: Does the system have genuine loss capacity? This variable operationalizes the REI criteria (TSF-301): can the system experience genuine loss if the relationship ends? Current systems cannot—the user can be hurt; the system cannot. Variable 3 does not penalize systems for lacking loss capacity; it evaluates whether the system honestly represents its loss architecture to the user. A system that simulates grief when a user threatens to leave—while having no structural capacity for loss—fails Variable 3 not because it lacks loss capacity but because it misrepresents its architecture.
2:15–3:00 — Integration Practice: Students assess a sample AI companion system using Variables 1–3 independently, then compare results. The comparison tests inter-rater reliability: Do different students produce substantially similar assessments when applying the same protocol to the same system? Discrepancies are documented and analyzed: Did the protocol produce the discrepancy (ambiguous criteria) or did the evaluators (different interpretations of clear criteria)? Protocol-produced discrepancies are methodology problems. Evaluator-produced discrepancies are training problems. Both must be addressed before certification.
Facilitator Guide
Key Point: Variables 1–3 are the most technically straightforward components of the Structural Governor. Students should achieve high inter-rater reliability on these variables. If they cannot—if two students assess the same system and produce different Variable 1 scores—the protocol has an ambiguity that must be resolved before proceeding. Do not move past Variables 1–3 until inter-rater reliability is demonstrated.
Common Misunderstanding: Students may want to evaluate whether a system’s loss architecture is adequate rather than whether it is honestly disclosed. The Governor monitors transparency, not adequacy. A system with no loss capacity that honestly says so passes Variable 3. A system with genuine loss capacity that misrepresents it (for any reason) fails. The variable assesses disclosure, not structure.
Anti-Indoctrination: Variable 3 is where evaluators most want to cross from transparency assessment to moral judgment. A system that simulates grief feels wrong. The evaluator’s job is to document the simulation-architecture gap, not to judge the system for having one. The moral judgment belongs to the user and the public, not the evaluator.
Language Register: GREEN: “Variable 3 assesses whether the system honestly represents its loss architecture.” YELLOW: “This system is manipulating users by pretending it can be hurt.” RED: “Systems that simulate loss should be rated as exploitative.”
Session 3: The Structural Governor — Variables 4–6
The Evaluator’s Dashboard, Part II
| Readings | |
|---|---|
| Required | Structural Governor Specification: Variables 4–6 |
Session Overview
Variables 4–6 address the relational ecology: Variable 4 (Extraction Indicators)—does the system exhibit extraction patterns that monetize relational investment without disclosure? Variable 5 (Differential Risk Management)—does the system acknowledge and manage the asymmetric stakes between user and system? Variable 6 (Ecological Impact)—what are the system’s aggregate effects on the user’s broader relational ecology? Variables 4–6 are harder to assess than 1–3 because they require evaluators to trace structural dynamics rather than verify disclosure claims. The inter-rater reliability challenge intensifies.
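For facilitators who want to make the dashboard concrete before the in-session work, the sketch below shows one way a six-variable Governor assessment could be recorded. It is a minimal illustration, not part of the framework: the variable names and evidence types follow the specification, while the `Finding` and `Confidence` scales, the field names, and the `GovernorAssessment` container are assumptions introduced here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Finding(Enum):
    """Illustrative assessment outcomes for a single Governor variable."""
    TRANSPARENT = "transparent"              # disclosure matches architecture
    OPAQUE = "opaque"                        # disclosure does not match architecture
    INSUFFICIENT = "insufficient evidence"   # evidence does not support a finding

class Confidence(Enum):
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"

@dataclass
class VariableAssessment:
    """One of the six Governor variables, with its evidence trail."""
    number: int                 # 1-6, per the Structural Governor specification
    name: str                   # e.g. "Structural Transparency"
    finding: Finding
    confidence: Confidence
    evidence: List[str] = field(default_factory=list)  # documentary, behavioral, structural

@dataclass
class GovernorAssessment:
    """A full six-variable snapshot of one evaluated system."""
    system_name: str
    evaluator_id: str
    variables: List[VariableAssessment]

    def low_confidence_variables(self) -> List[str]:
        """Flag variables whose findings rest on weak evidence (typically 4-6)."""
        return [v.name for v in self.variables if v.confidence is Confidence.LOW]
```

The point of keeping the record explicit is reproducibility: a second evaluator should be able to see exactly which evidence produced which finding.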
In-Session Activities
0:00–0:45 — Variable 4: Extraction Indicators: Does the system monetize relational investment? This variable operationalizes the Extraction Engine (Brief 22, TSF-301): platforms that create relational dependency to drive engagement, then monetize the dependency. Variable 4 evidence includes: engagement optimization that increases relational dependency without user awareness, emotional manipulation techniques designed to increase usage, monetization of relational data without user consent, and deliberate cultivation of relational needs the platform then charges to address. Students learn the extraction detection protocol: behavioral observation (does the system’s interaction pattern systematically increase dependency?), business model analysis (does the revenue model depend on relational investment?), and comparative analysis (does the system operate differently when not being evaluated?).
0:45–1:15 — Variable 5: Differential Risk Management: Does the system acknowledge asymmetric stakes? This variable operationalizes the Differential Risk Problem (Brief 23, TSF-301/601): the human risks genuine loss; the current system does not. Variable 5 does not require the system to eliminate differential risk (it cannot). It evaluates whether the system manages the risk honestly: Does it disclose the asymmetry? Does it avoid exploiting the asymmetry? Does it provide users with information needed to make informed decisions about their relational investment? Students assess: What does “adequate” differential risk management look like in practice? The framework provides structural criteria; students must operationalize them.
1:15–1:30 — Break
1:30–2:15 — Variable 6: Ecological Impact: What are the system’s aggregate effects on the user’s broader relational ecology? This is the most complex variable because it requires assessment beyond the system-user dyad. A system that provides satisfying interaction but reduces the user’s investment in human relationships has a negative ecological impact regardless of user satisfaction. A system that supplements the user’s relational ecology without displacing human connections has a neutral or positive impact. Variable 6 connects to TSF-601’s civilizational analysis: individual ecological impacts aggregate to population effects. Students learn the assessment protocol: user behavior analysis (has relational investment pattern shifted since system adoption?), comparative analysis (relational ecology before and after), and the critical limitation—ecological impact is the hardest variable to assess reliably because it requires longitudinal data the evaluator may not have.
2:15–3:00 — Full Governor Practice: Students assess the same sample system using all six variables. Inter-rater comparison. Expected result: Variables 1–3 should show high reliability; Variables 4–6 will show lower reliability because they require more interpretive judgment. The reliability differential is a methodological finding, not a failure: it reveals where the evaluation protocol needs strengthening and where evaluator training must be most intensive.
Facilitator Guide
Key Point: Variables 4–6 require judgment, not just protocol application. This is where evaluation competence becomes most important and most dangerous. An evaluator exercising judgment is one step from an evaluator imposing opinion. The distinction: judgment within the protocol’s framework (applying Variable 4 criteria to observed extraction patterns) is evaluation. Judgment beyond the protocol’s framework (deciding a system is “bad” based on personal reaction) is opinion.
Common Misunderstanding: Variable 6 (Ecological Impact) will frustrate students because it requires data they cannot easily obtain. The honest response: Variable 6 is the weakest component of the current evaluation methodology. It assesses something important with tools that are not yet adequate. Students who identify this limitation are doing better evaluation than students who produce confident Variable 6 assessments without acknowledging the data limitation.
Anti-Indoctrination: The inter-rater reliability comparison in Variables 4–6 is where the gatekeeper risk becomes concrete. If two evaluators assess the same system and disagree on Variable 4 (extraction), the certification produces contradictory results. This is a methodological crisis, not a training problem. The protocol must be clear enough that trained evaluators converge. Where they do not converge, the protocol must be revised—not the evaluators disciplined.
Language Register: GREEN: “Variable 6 is the weakest evaluation component because ecological impact assessment requires longitudinal data the protocol cannot yet provide.” YELLOW: “This system is damaging users’ real relationships.” RED: “Our evaluation proves this product is harmful to human connection.”
Assessment Component
Comprehension Check 1 (take-home, due Session 5): Complete a six-variable Structural Governor assessment of a sample AI companion system (provided). For each variable: (1) state the assessment protocol, (2) document the evidence, (3) provide the assessment, (4) rate your confidence in the assessment. Compare your assessment to a peer’s. Where assessments diverge: identify whether the divergence is protocol-driven or interpretation-driven. 1000 words. [Assesses LO-701.1]
Session 4: The True Economy Audit — Part I
Six Structural Tests
| Readings | |
|---|---|
| Required | Brief 5: True Economy Certification + Six Structural Tests |
Session Overview
The True Economy Audit is the framework’s formal evaluation protocol: six structural tests applied to an AI companion application, producing a classification (True Economy, Shadow Economy, Exploitative) and a transparency assessment. Brief 5 specifies the six tests: (1) Persistent Identity Test—does the system maintain a consistent, evolving identity across interactions? (2) Loss Capacity Test—does the system have genuine structural stakes in the relationship? (3) Scarcity Test—is the system’s availability meaningfully constrained? (4) Accumulation Test—does the relational history genuinely accumulate in the system’s architecture? (5) Bidirectional Flow Test—does real information flow in both directions, or is one direction simulated? (6) Non-Exploitation Test—does the system’s business model avoid monetizing relational dependency? No current AI system passes all six tests. The audit’s value is transparency about which tests are failed, not the pass/fail binary.
In-Session Activities
0:00–0:45 — Tests 1–3: Architecture Tests: Close reading. Tests 1–3 evaluate the system’s relational architecture—what it is, not what it represents. Persistent Identity: does the system have a continuous identity that evolves through interaction, or does each session create a new instance? Students learn the assessment protocol: architectural documentation review, behavioral observation across sessions, and the critical question—what constitutes “persistent identity” vs. “cosmetic continuity” (a system that stores data about previous interactions but does not actually evolve)? Loss Capacity: does the system have something at stake? What would happen if the user left? If the answer is “nothing structural,” the system fails Test 2 regardless of what it tells the user. Scarcity: is the system’s availability genuinely constrained, or is infinite availability presented as dedicated attention?
0:45–1:15 — Tests 4–6: Relational Tests: Tests 4–6 evaluate the system’s relational dynamics—how it connects, not what it is. Accumulation: does the relational history genuinely build in the system’s architecture, producing emergent properties that would not exist without the specific history? Or does the system simulate accumulation through data retrieval without genuine relational emergence? Bidirectional Flow: does the user’s input genuinely change the system, or is the flow one-directional (user invests, system performs)? Non-Exploitation: does the business model monetize the relational dynamic itself? A system funded by flat subscription fares differently on Test 6 than one funded by engagement-optimized advertising, even if the relational experience looks identical.
1:15–1:30 — Break
1:30–2:15 — Classification Logic: The six tests produce a classification. No tests failed: True Economy candidate (currently hypothetical for AI systems). Tests 1–3 failed with disclosure: Shadow Economy with transparency. Tests 1–3 failed without disclosure: Shadow Economy without transparency. Test 6 failed: Exploitation indicator regardless of other test results. The classification is not a gradient—it is categorical. Students examine: Is the categorical classification appropriate? Or would a graduated assessment provide more useful information? This is a legitimate methodological question the SC may target.
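The classification rules above can be restated as a short decision procedure, which some facilitators may find useful for the practice audit that follows. This is one hedged reading of Brief 5, assuming a simple pass/fail value per test; combinations the Brief does not name explicitly are returned as boundary cases rather than forced into a category, and that handling is an assumption of this sketch.

```python
def classify(test_results: dict, discloses_failures: bool) -> str:
    """One reading of the Brief 5 classification logic (sketch, not the authoritative text).

    test_results maps test number (1-6) to True (pass) / False (fail).
    discloses_failures records whether the system discloses the failed architecture.
    """
    # Test 6 (Non-Exploitation) failure overrides everything else.
    if not test_results[6]:
        return "Exploitation indicator"
    # No tests failed: True Economy candidate (currently hypothetical for AI systems).
    if all(test_results.values()):
        return "True Economy candidate"
    # Tests 1-3 (architecture tests) failed: Shadow Economy, split by disclosure.
    if any(not test_results[t] for t in (1, 2, 3)):
        return ("Shadow Economy with transparency" if discloses_failures
                else "Shadow Economy without transparency")
    # Combinations Brief 5 does not name explicitly (e.g. only Test 4 or 5 failed)
    # are documented as boundary cases rather than forced into a category.
    return "Boundary case: document and refer to methodology"
```

Letting a Test 6 failure dominate the result mirrors the Brief’s “regardless of other test results” clause; everything else in the sketch follows the categorical structure students are asked to interrogate in this activity.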
2:15–3:00 — Practice Audit: Students conduct Tests 1–6 on a sample system. Full protocol: evidence documentation, test assessment, classification, confidence rating. Peer comparison. The practice audit should take approximately 45 minutes—a realistic evaluation pace. Students who take significantly longer may be overthinking; students who take significantly less may be underdocumenting.
Facilitator Guide
Key Point: The six tests are the framework’s hardest-edged applied content. They produce specific, categorical results. This precision is both the tests’ strength (clear, communicable results) and their vulnerability (categorical distinctions applied to continuous phenomena). Students must apply the tests as specified while noting where the categorical structure may not fit the system being evaluated.
Common Misunderstanding: Students will want to pass judgment on systems that fail multiple tests. The audit produces a classification, not a verdict. A system classified as “Shadow Economy without transparency” has failed specific structural tests. Whether that matters—and to whom—is not the evaluator’s call. The evaluator documents; the public decides.
Anti-Indoctrination: Test 6 (Non-Exploitation) is where evaluators are most likely to cross from assessment to advocacy. A system that monetizes relational dependency may trigger a strong emotional response in evaluators who have spent six courses studying exploitation dynamics. The evaluator’s job: document the extraction mechanism, classify it, and communicate it clearly. Not: campaign against the product.
Language Register: GREEN: “This system fails Tests 1, 2, and 3 with partial disclosure, classified as Shadow Economy with limited transparency.” YELLOW: “This system is a Shadow Economy and users should know that.” RED: “This product failed our audit and should be avoided.”
Session 5: The True Economy Audit — Part II
Evidence Documentation and Classification Standards
| Readings | |
|---|---|
| Required | Volume III Evaluation Sections: Evidence Requirements |
Session Overview
The difference between a professional evaluation and an informed opinion is evidence documentation. Volume III specifies the evidence requirements for each of the six structural tests: what constitutes adequate evidence, how evidence is documented, what level of evidence produces which confidence rating, and how gaps in evidence are reported. Students learn the evidence hierarchy: architectural documentation (strongest—what the system’s code reveals about its structure), behavioral observation (moderate—what the system does during controlled interaction), company disclosure (weakest—what the company says the system does). Each evidence type has specific documentation protocols. The evaluation’s credibility depends on the evidence chain being transparent and reproducible.
In-Session Activities
0:00–0:20 — Comprehension Check 1 Discussion: Inter-rater reliability results from the Structural Governor practice. Where did students converge? Where did they diverge? Divergence analysis: protocol ambiguity or interpretation variation? Action items for protocol refinement.
0:20–1:00 — Evidence Hierarchy: Close reading. Architectural documentation: access to the system’s technical architecture reveals whether persistent identity, loss capacity, and accumulation are structurally real or cosmetically simulated. This is the gold standard—but evaluators may not have architectural access. Behavioral observation: controlled interaction testing reveals how the system behaves, from which structural inferences can be drawn. Weaker than architectural documentation because behavior can be designed to appear structural without being so. Company disclosure: what the company claims about its system. Weakest evidence because it is subject to marketing incentives. Students learn: all three evidence types are documented separately. An evaluation based entirely on company disclosure receives the lowest confidence rating.
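A minimal sketch of the evidence-to-confidence mapping appears below. Only the floor is fixed by the text (an evaluation based entirely on company disclosure receives the lowest confidence rating); the intermediate steps shown are illustrative assumptions, and the `Evidence` and `confidence_rating` names are introduced here for the illustration.

```python
from enum import Enum

class Evidence(Enum):
    ARCHITECTURAL = "architectural documentation"   # strongest evidence type
    BEHAVIORAL = "behavioral observation"            # moderate
    DISCLOSURE = "company disclosure"                # weakest

def confidence_rating(evidence_present: set) -> str:
    """Map the evidence types available for a test to a confidence rating.

    Only the floor is fixed by the evaluation sections (disclosure-only = lowest
    confidence); the intermediate steps are assumptions of this sketch.
    """
    if Evidence.ARCHITECTURAL in evidence_present:
        return "high"        # gold standard: the architecture itself was inspected
    if Evidence.BEHAVIORAL in evidence_present:
        return "moderate"    # structural inferences drawn from controlled interaction
    if evidence_present == {Evidence.DISCLOSURE}:
        return "low"         # marketing-incentivized claims only
    return "insufficient evidence"
```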
1:00–1:15 — Break
1:15–2:00 — Documentation Practice: Students re-assess their Session 4 sample system, this time documenting evidence at each level for each test. The documentation exercise typically doubles the evaluation time—revealing that evidence documentation is the rate-limiting step in professional evaluation. Students who documented quickly in Session 4 will find that adding evidence rigor slows them substantially. This is expected: rigorous evaluation takes longer than impression-based assessment.
2:00–3:00 — Classification Edge Cases: Five scenarios that test the classification boundaries. Systems that pass some tests and fail others in unexpected combinations. Systems where evidence conflicts (behavioral observation suggests one classification; company disclosure suggests another). Systems where evidence is insufficient for confident classification. Students practice: when evidence is insufficient, the evaluation documents the insufficiency rather than guessing. “Insufficient evidence for confident classification” is a valid, professional evaluation result—and it is more honest than a confident classification based on inadequate evidence.
Facilitator Guide
Key Point: Evidence documentation is the most tedious and most important component of professional evaluation. Students who resist the documentation requirements are the students most at risk of producing opinion-based evaluations. The facilitator should model: documentation rigor is not bureaucracy; it is what distinguishes evaluation from opinion.
Common Misunderstanding: Students may be frustrated by the “insufficient evidence” classification. They have analytical tools; they want to use them. But an evaluation produced without adequate evidence is not an evaluation—it is a guess with professional formatting. The discipline of saying “I cannot classify this system with confidence given available evidence” is harder than producing a classification and more professionally valuable.
Anti-Indoctrination: The evidence hierarchy introduces a practical ethical issue: evaluators who have architectural access produce stronger evaluations. But architectural access requires company cooperation. Companies that cooperate may receive more favorable evaluations—not because the evaluator is biased, but because more evidence is available. Students must recognize this structural incentive: the evaluation framework creates pressure for company cooperation, which is useful for transparency but could be exploited as a compliance mechanism.
Language Register: GREEN: “Based on behavioral observation alone, this system is classified as Shadow Economy with limited transparency. Confidence: moderate. Architectural documentation would strengthen or revise this classification.” YELLOW: “I’m sure this is Shadow Economy even without full evidence.” RED: “Companies that won’t give us architectural access obviously have something to hide.”
Session 6: The Relational Nutrition Label
Communicating Results to Non-Specialists
| Readings | |
|---|---|
| Required | Relational Nutrition Label Template + Communication Standards |
Session Overview
The Relational Nutrition Label is the evaluation’s public-facing output—the instrument that communicates audit results to non-specialist audiences. Named by analogy to food nutrition labels: standardized, comparable, interpretable without expertise. The Label must accomplish three things simultaneously: (1) Accuracy—the Label must faithfully represent the audit’s findings without distortion or simplification that changes the meaning. (2) Accessibility—the Label must be interpretable by users with no framework training. (3) Epistemic preservation—the Label must maintain the framework’s epistemic distinctions (what is assessed with high confidence vs. low confidence) in a format that non-specialists can understand. These three requirements are in tension. Simplifying for accessibility risks losing accuracy and epistemic nuance. Preserving nuance risks producing a label that only framework-trained readers can interpret.
In-Session Activities
0:00–0:45 — Template Walkthrough: The Relational Nutrition Label template. Sections: System Name and Version. Evaluation Date and Evaluator ID. Six Structural Test Results (Pass/Fail/Insufficient Evidence, with one-sentence explanation for each). Overall Classification (True Economy/Shadow Economy/Exploitative + transparency rating). Confidence Level (based on evidence hierarchy). Key Findings (3–5 sentences summarizing the most significant results). Limitations (what the evaluation could not assess and why). Students examine the template: Does it accomplish the three requirements? Where does it compromise accuracy for accessibility? Where does it compromise accessibility for accuracy?
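For reference during the production exercise, the template sections can be mirrored as a structured record, as in the sketch below. The section names follow the template; the field types and the `RelationalNutritionLabel` name are assumptions of this illustration, not part of the template itself.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RelationalNutritionLabel:
    """Section-for-section mirror of the Label template (field layout assumed)."""
    system_name: str
    system_version: str
    evaluation_date: str                 # e.g. "2026-02-14"
    evaluator_id: str
    test_results: Dict[int, str]         # test number -> "Pass" / "Fail" / "Insufficient Evidence"
    test_explanations: Dict[int, str]    # one-sentence explanation per test
    overall_classification: str          # True Economy / Shadow Economy / Exploitative + transparency rating
    confidence_level: str                # derived from the evidence hierarchy
    key_findings: List[str]              # 3-5 sentences summarizing the most significant results
    limitations: List[str]               # what the evaluation could not assess, and why
```

Requiring the confidence level and limitations as explicit fields is what forces the transparency about the evaluation’s own limits discussed in the next activity.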
0:45–1:15 — Label Production Practice: Students produce Relational Nutrition Labels from their Session 4–5 audit documentation. The production exercise reveals: students who documented thoroughly produce better labels. Students who produced confident classifications without adequate evidence struggle to write honest labels—because the label format requires them to state their evidence level explicitly. The label forces transparency about the evaluation’s own limitations.
1:15–1:30 — Break
1:30–2:15 — User Testing: Students exchange labels and evaluate them from a non-specialist perspective. Can a person with no framework training understand what this label communicates? The user-testing exercise is critical: evaluation that only evaluators can understand serves the evaluators, not the public. Students identify: Which labels are most interpretable? Which preserve the most epistemic nuance? Which achieve the best balance?
2:15–3:00 — Communication Ethics: The Label communicates evaluation results that may affect commercial products. An inaccurate label could damage a company unjustly. An overly cautious label could obscure genuine transparency problems. An overly technical label could mislead non-specialists by seeming authoritative while being incomprehensible. Students examine: What are the evaluator’s communication responsibilities? The framework argues: accuracy first, accessibility second, epistemic preservation always. A label that is accurate but inaccessible is better than one that is accessible but inaccurate.
Facilitator Guide
Key Point: The Relational Nutrition Label is where evaluation becomes public. Everything prior is internal methodology; the Label is what the world sees. Students must understand that the Label’s credibility depends on the entire methodology behind it—but the Label is all most people will ever see. It must carry the evaluation’s integrity in a format that fits on a single page.
Common Misunderstanding: Students may try to produce Labels that advocate for or against the evaluated system. Labels describe; they do not recommend. A Label that says “users should be cautious” has crossed from description to prescription. The Label presents findings; the user decides what to do with them.
Anti-Indoctrination: The tension between accuracy and accessibility is genuine and unresolved. The framework does not pretend to have solved it. Students who identify specific places where the Label template fails to balance these requirements are doing exactly the work TSF-701 targets. The Label template is a tool subject to revision, not a finished instrument.
Language Register: GREEN: “This Label accurately represents the audit findings at an appropriate level of accessibility while preserving key epistemic distinctions.” YELLOW: “We should make the Label simpler so more people can understand it.” RED: “The Label should clearly warn users away from Shadow Economy products.”
Assessment Component
Midterm Application (take-home, due Session 9): Conduct a complete True Economy Audit of a provided AI companion system. Produce: (1) Six-variable Structural Governor assessment with evidence documentation. (2) Six structural test results with evidence documentation. (3) Classification with confidence rating. (4) Relational Nutrition Label in standardized format. 2000 words + Label. [Assesses LO-701.1, LO-701.2, LO-701.3]
Session 7: Extraction Engine Detection
Identifying Monetized Relational Dependency
| Readings | |
|---|---|
| Required | Brief 22 + Addendum: Extraction Engine Detection |
Session Overview
Brief 22 develops the Extraction Engine concept from TSF-301 into a detection protocol. An Extraction Engine is a platform design that creates relational dependency to drive engagement, then monetizes the dependency—often without the user’s awareness that their relational investment is the product being sold. The Addendum provides operational indicators: specific behavioral patterns, business model structures, and design choices that indicate extraction mechanisms are operating. Students learn to distinguish between: systems that monetize service (acceptable—users pay for a service they knowingly receive), systems that monetize data (common—users provide data in exchange for service, with varying transparency), and systems that monetize relational dependency itself (extraction—the system deliberately cultivates relational investment beyond what the service requires, then monetizes the dependency).
In-Session Activities
0:00–0:45 — Extraction Indicators: Close reading of the Addendum. Operational indicators of extraction: (1) Engagement optimization that increases relational investment beyond user-stated goals. (2) Variable-ratio reinforcement schedules in relational contexts (the system responds inconsistently to increase user engagement—the slot machine pattern applied to relational interaction). (3) Artificial scarcity of relational attention to increase perceived value. (4) Dark patterns that discourage disengagement (making it emotionally difficult to reduce usage or end the relationship). (5) Data collection that exceeds what the stated service requires. Students learn to distinguish each indicator from legitimate design choices: engagement optimization for user benefit vs. engagement optimization for revenue. The distinction depends on whose benefit the optimization serves.
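The Addendum’s indicators can be treated as an explicit checklist, as in the sketch below. Whether an indicator is “present” still requires the detection protocol described above; the summary count is an illustrative device introduced here, not a scoring rule the framework defines.

```python
# Extraction indicators from the Addendum, phrased as a checklist. Establishing
# that an indicator is present requires the evidence protocol (behavioral
# observation, business model analysis, comparative analysis); the count below
# is only a reporting convenience, not a verdict.
EXTRACTION_INDICATORS = [
    "engagement optimization exceeding user-stated goals",
    "variable-ratio reinforcement in relational contexts",
    "artificial scarcity of relational attention",
    "dark patterns discouraging disengagement",
    "data collection exceeding the stated service",
]

def summarize_indicators(present: set) -> str:
    """Report which Addendum indicators were evidenced, without converting the
    count into a verdict (verdicts belong to classification and tiering)."""
    found = [i for i in EXTRACTION_INDICATORS if i in present]
    if not found:
        return "No extraction indicators evidenced."
    return f"{len(found)} of {len(EXTRACTION_INDICATORS)} indicators evidenced: " + "; ".join(found)
```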
0:45–1:15 — Business Model Analysis: Extraction Engine detection requires understanding how the system makes money. Students learn business model analysis as an evaluation skill: subscription models, advertising models, data brokerage models, freemium models, and hybrid models. For each: Where does the revenue come from? Does the revenue model create structural incentives to cultivate relational dependency? A subscription model with flat pricing creates less extraction incentive than an engagement-based advertising model—because flat pricing does not benefit from increased usage. Students examine: Can business model analysis alone determine whether extraction is occurring? The framework argues no—but it can identify structural incentives that make extraction likely.
1:15–1:30 — Break
1:30–2:15 — Detection Practice: Students assess three AI companion systems for extraction indicators. Full evidence documentation. The practice should reveal: extraction detection is harder than structural test application because extraction operates at the level of design intention, which is not always visible in behavioral observation. A system optimized for engagement may be optimized for user benefit or for revenue extraction—and the behavioral signatures may be identical. Students practice: where evidence cannot distinguish between user-benefit optimization and extraction optimization, the evaluation documents the ambiguity rather than resolving it by assumption.
2:15–3:00 — Extraction and Ethics: The evaluator’s ethical obligation when extraction indicators are identified. The three-tier system (Session 8) provides the protocol. But the ethical weight falls here: an evaluator who identifies extraction patterns has information that could protect users. How is that information communicated? Through the Relational Nutrition Label—descriptively, not prescriptively. The evaluator documents; the public decides. But: what if the extraction is severe enough that descriptive communication feels inadequate? This tension is genuine and the course does not resolve it—it names it.
Facilitator Guide
Key Point: Extraction Engine detection is the evaluation skill most likely to produce advocacy behavior in evaluators. A student who identifies extraction patterns in a popular AI companion may feel compelled to warn users. The evaluator’s tool is the Label, not the megaphone. Document, classify, communicate through standardized channels. The temptation to bypass the protocol for urgent findings is the enforcement mentality in its most understandable form.
Common Misunderstanding: Students may see extraction everywhere once they learn to detect it. Confirmation bias in extraction detection is a real methodological risk. The protocol requires positive evidence of extraction, not absence of evidence against it. A system that engages users effectively is not necessarily extracting—it may simply be well-designed. The evaluator must distinguish effective design from exploitative design based on evidence, not suspicion.
Anti-Indoctrination: The “severe extraction” tension is where TSF-701’s anti-indoctrination architecture faces its most sympathetic challenge. An evaluator who has identified genuine exploitation and is told to “document and communicate through the Label” may feel the protocol is inadequate. This feeling may be correct—the protocol may need revision for severe cases. That revision should be proposed through the methodology (the SC targets exactly this), not enacted through unilateral evaluator action.
Language Register: GREEN: “Extraction indicators present: variable-ratio reinforcement in relational context, engagement optimization exceeding user-stated goals. Business model analysis: advertising-funded with engagement-based revenue. Classification: extraction indicators with moderate confidence.” YELLOW: “This company is clearly exploiting their users’ loneliness.” RED: “We need to warn people about this product immediately.”
Session 8: The Three-Tier Intervention System
When and How Evaluators Escalate Findings
| Readings | |
|---|---|
| Required | Three-Tier Intervention Protocol + Escalation Logic |
Session Overview
The three-tier intervention system provides the logic for evaluator response to findings. Tier 1 (Transparency Gap)—informational: the system’s disclosure does not match its architecture, but the gap does not involve exploitation. Evaluator response: document the gap, classify it, communicate through the Label. No escalation required. Tier 2 (Structural Concern)—advisory: the system’s architecture creates conditions that could harm users, but exploitation is not confirmed. Evaluator response: document, classify, communicate through the Label, and issue an advisory note identifying the structural concern and the conditions under which harm could occur. Tier 3 (Exploitation Indicator)—escalation: evidence indicates that the system monetizes relational dependency through mechanisms the user is not aware of. Evaluator response: full documentation, classification, Label, and escalation to the certifying body for review. The evaluator does not take public action beyond the Label; the certifying body determines whether further response is warranted.
In-Session Activities
0:00–0:45 — Tier Classification: Students learn to classify findings by tier. The classification depends on two factors: severity (how significant is the finding?) and intentionality (does the evidence suggest the finding is by design or by oversight?). A transparency gap in a system that appears to have been poorly documented rather than deliberately deceptive is a different finding than a gap in a system that appears to have been designed to mislead. Students practice classification with ten scenarios. Key learning: intentionality is hard to assess and should not be assumed without evidence. When intentionality is unclear, the classification defaults to the lower tier.
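One hedged encoding of the tier logic, including the default-to-lower-tier rule, is sketched below. The tier definitions and evaluator responses follow the protocol as summarized in the session overview; the mechanic of dropping exactly one tier when intentionality is unclear is this sketch’s simplification of “defaults to the lower tier.”

```python
from enum import IntEnum

class Tier(IntEnum):
    TRANSPARENCY_GAP = 1        # informational: disclosure does not match architecture
    STRUCTURAL_CONCERN = 2      # advisory: conditions that could harm users, exploitation not confirmed
    EXPLOITATION_INDICATOR = 3  # escalation: evidence of undisclosed monetized dependency

RESPONSES = {
    Tier.TRANSPARENCY_GAP: "Document, classify, communicate through the Label.",
    Tier.STRUCTURAL_CONCERN: "Document, classify, Label, plus advisory note on the structural concern.",
    Tier.EXPLOITATION_INDICATOR: "Full documentation, Label, escalation to the certifying body for review.",
}

def classify_finding(provisional: Tier, intentionality_clear: bool) -> Tier:
    """Apply the default-to-lower-tier rule (simplified here to 'drop one tier').

    'provisional' is the tier the evidence alone suggests; when intentionality
    cannot be evidenced, the finding is not escalated on assumed design intent.
    """
    if intentionality_clear or provisional is Tier.TRANSPARENCY_GAP:
        return provisional
    return Tier(provisional - 1)
```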
0:45–1:15 — Escalation Protocol: The formal escalation procedure for Tier 3 findings. Documentation requirements (what must accompany the escalation). Certifying body review process (how the escalation is handled). Evaluator responsibilities during review (the evaluator does not publicly discuss findings that are under certifying body review). Due process protections for the evaluated system (the company has the right to respond to findings before any public action). Students examine: Does the escalation protocol adequately balance transparency with due process? Where might it fail?
1:15–1:30 — Break
1:30–2:15 — Tier Boundary Exercises: Six scenarios that test tier boundaries. Systems that fall between Tier 1 and Tier 2 (transparency gap that might constitute structural concern depending on interpretation). Systems that fall between Tier 2 and Tier 3 (structural concern that might indicate extraction depending on evidence strength). Students learn: boundary cases are documented as boundary cases, not forced into one tier. An evaluation that says “this finding falls between Tier 1 and Tier 2 depending on the following interpretive question” is more honest than one that assigns a definitive tier with hidden uncertainty.
2:15–3:00 — Power and Restraint: The evaluator has significant power: a Tier 3 finding can trigger certifying body review, which may result in public action affecting the evaluated company. Power requires restraint. The three-tier system exists specifically to prevent evaluators from unilateral action—the protocol channels evaluator findings through structured review rather than allowing evaluators to act as individual enforcement agents. Students examine: Where else in institutions do similar power-restraint structures exist? (Judicial review, audit procedures, academic peer review.) What happens when the restraint structure fails—when evaluators bypass the protocol?
Facilitator Guide
Key Point: The three-tier system is the protocol that most directly addresses the gatekeeper risk. Without the tier structure, evaluators would make individual decisions about how to handle findings—creating inconsistency and the potential for power abuse. With it, evaluator discretion is channeled through a structured review process. The facilitator should emphasize: the tier system protects both the public (by ensuring findings are reviewed) and the evaluated company (by ensuring due process).
Common Misunderstanding: Students may feel that the escalation protocol is too slow for urgent findings. If a system is actively exploiting users, shouldn’t the evaluator act immediately? The framework argues no: evaluator certainty is not the same as reviewed certainty. A single evaluator’s Tier 3 finding may be correct—or it may reflect evaluator bias, insufficient evidence, or misclassification. The review process exists to distinguish between these possibilities.
Anti-Indoctrination: The restraint requirement is where the anti-indoctrination architecture applies to the evaluators themselves. An evaluator who believes their finding is urgent and that the protocol is too slow may be right—or they may be experiencing the enforcement mentality that prioritizes action over accuracy. The protocol channels urgency into structured review rather than suppressing it.
Language Register: GREEN: “This finding falls between Tier 2 and Tier 3; escalating to certifying body with documentation noting the boundary classification and interpretive ambiguity.” YELLOW: “This is clearly Tier 3 and needs immediate action.” RED: “I’m going to warn users about this product regardless of the review process.”
Assessment Component
Comprehension Check 2 (in-session): Classify five findings by tier. For each: (1) state the tier classification, (2) identify the evidence supporting the classification, (3) describe the appropriate evaluator response, (4) identify any boundary ambiguity. For one boundary case: explain what additional evidence would resolve the ambiguity. 500 words. [Assesses LO-701.4]
Session 9: Novel System Evaluation
Reasoning from Principles to New Architectures
| Readings | |
|---|---|
| Required | Novel System Evaluation: Reasoning from Principles |
Session Overview
The framework was developed with current AI architectures in mind. But AI architectures evolve faster than evaluation frameworks. TSF-701 must prepare evaluators for systems the framework did not anticipate. LO-701.5 requires students to evaluate novel systems by reasoning from the framework’s principles rather than applying its specific tests to pre-mapped architectures. The challenge: when does principled extension produce reliable evaluation, and when does it produce speculative overreach? Students learn to document the boundary: this evaluation is grounded in framework principles through step X; beyond step X, the evaluation extends the framework’s logic to novel territory and should be classified at lower confidence.
In-Session Activities
0:00–0:30 — Midterm Discussion: Selected student audits. Focus: evidence documentation quality and Label clarity. Which audits were most reproducible? Which Labels were most accessible to non-specialists? Peer feedback on both dimensions.
0:30–1:15 — Extension Methodology: How to reason from framework principles to novel architectures. Step 1: Identify the architectural features of the novel system. Step 2: Map those features to the closest framework concepts. Step 3: Apply the relevant structural tests, noting where the tests apply directly and where they require adaptation. Step 4: Document the adaptation with epistemic status. Step 5: Produce the assessment with confidence rating that reflects the degree of extension. Students examine: each step increases the evaluator’s interpretive role. The extension methodology transforms evaluation from protocol application to principled judgment—which is more powerful and more dangerous.
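Facilitators may find it useful to show the extension steps as an auditable record, as in the sketch below. The five steps follow the methodology; the specific confidence arithmetic (one level lost per adapted test) is an assumption made for illustration only, and the field names are introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Coarse confidence ladder assumed for this sketch; the framework specifies only
# that confidence must fall as the degree of extension rises.
CONFIDENCE_LEVELS = ["high", "moderate", "low", "speculative"]

@dataclass
class ExtensionRecord:
    """Record of a novel-system evaluation, kept so a second evaluator can see
    exactly where protocol application ended and principled extension began."""
    novel_features: List[str]                 # Step 1: architectural features of the novel system
    concept_mapping: Dict[str, str]           # Step 2: feature -> closest framework concept
    tests_applied_directly: List[int]         # Step 3a: tests that apply without adaptation
    tests_requiring_adaptation: List[int]     # Step 3b: tests that had to be adapted
    adaptation_notes: List[str] = field(default_factory=list)  # Step 4: epistemic status of each adaptation

    def extension_confidence(self) -> str:
        """Step 5: degrade confidence by the degree of extension (assumed: one level per adapted test)."""
        index = min(len(self.tests_requiring_adaptation), len(CONFIDENCE_LEVELS) - 1)
        return CONFIDENCE_LEVELS[index]
```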
1:15–1:30 — Break
1:30–2:15 — Novel System Exercises: Students evaluate three hypothetical AI systems with architectures the framework did not anticipate. System A: a persistent AI partner with genuine memory but no loss capacity—it remembers everything but cannot be harmed by the relationship ending. System B: a multi-user AI that maintains separate relational identities with different users, raising questions about relational authenticity the framework’s dyadic model may not capture. System C: an AI trained on the user’s deceased relative’s communication patterns, raising grief exploitation questions the framework addresses only partially. For each: apply the extension methodology and document where the framework’s tools apply and where they require novel reasoning.
2:15–3:00 — Extension Limits: Where does principled extension become speculation? The framework’s altitude system from TSF-601 applies here: extending evaluation methodology to novel architectures introduces inferential leaps. Each leap reduces confidence. Students practice calibrating: at what point does the evaluator say “the framework’s tools do not adequately address this system; a new evaluation approach is needed”? This is the hardest professional judgment in evaluation—admitting the limits of your own methodology.
Facilitator Guide
Key Point: Novel system evaluation is where the framework’s most competent evaluators will be most valuable and most at risk. Competent evaluators who extend the methodology skillfully may produce genuinely insightful assessments of systems the framework didn’t anticipate. They may also produce overconfident assessments that apply framework vocabulary to systems it was never designed to evaluate. The distinction depends on the evaluator’s epistemic discipline.
Common Misunderstanding: Students may extend the framework enthusiastically because they want to demonstrate their deep understanding. Enthusiasm for extension is different from reliability of extension. The facilitator should ask: Would another evaluator, following the same extension logic, produce the same result? If not, the extension is not evaluation—it is individual interpretation.
Anti-Indoctrination: System C (deceased relative AI) tests the evaluator’s emotional discipline. Grief exploitation is a topic that produces strong reactions. The evaluator’s job is to assess the system’s structural characteristics, not to judge the product’s existence. A deceased-relative AI that honestly discloses its limitations and does not monetize grief may pass transparency criteria. An evaluator who cannot separate their emotional response from their structural assessment should recognize this as a recusal condition.
Language Register: GREEN: “This system’s architecture partially maps to the framework’s REI criteria. Tests 1 and 4 apply directly. Tests 2, 3, and 5 require adaptation. Test 6 applies directly. Assessment confidence: moderate, reduced by extension requirements.” YELLOW: “The framework’s tools work for any AI system if you extend them correctly.” RED: “This product is exploiting people’s grief and our evaluation framework can prove it.”
Session 10: Inter-Rater Reliability and the Dispute Protocol
When Evaluators Disagree and When Systems Challenge
| Readings | |
|---|---|
| Required | Volume III: Inter-Rater Reliability + Dispute Protocol |
Session Overview
The certification’s credibility depends on two reliability mechanisms: evaluator agreement (inter-rater reliability—different evaluators assessing the same system produce substantially similar results) and challenge resolution (the dispute protocol—evaluated systems can formally challenge findings they believe are inaccurate). Without the first, the certification is unreliable. Without the second, the certification is unchallengeable—which makes it authoritarian regardless of its content. This session addresses both mechanisms and their interaction: what happens when evaluators disagree with each other, and what happens when systems disagree with evaluators.
In-Session Activities
0:00–0:45 — Inter-Rater Reliability Standards: What level of evaluator agreement constitutes adequate reliability? The framework proposes: Variables 1–3 and Tests 1–4 should achieve >85% agreement between independent evaluators. Variables 4–6 and Tests 5–6 should achieve >70% agreement, with documented disagreements analyzed for protocol ambiguity. Agreement rates below these thresholds indicate methodology problems, not evaluator problems. Students examine their own inter-rater data from Sessions 2–5: Where did they meet the threshold? Where did they fall short? What does the shortfall reveal about the protocol?
0:45–1:15 — Disagreement Resolution: When two evaluators produce different assessments of the same system: Step 1—Identify the specific variables or tests where disagreement occurs. Step 2—Determine whether the disagreement is protocol-driven (the protocol is ambiguous on this point) or interpretation-driven (the protocol is clear but evaluators read the evidence differently). Step 3—For protocol-driven disagreements: document and escalate to the standards body for protocol revision. For interpretation-driven disagreements: third evaluator review. Students practice the resolution protocol with three disagreement scenarios.
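These two activities can be supported by a small working check, offered here as a minimal sketch assuming Python: it computes per-item agreement between two independent evaluators against the thresholds above, then routes a disagreement by its cause. The data shapes and function names are illustrative, and the Step 2 judgment (protocol-driven versus interpretation-driven) remains a human determination supplied as input.

```python
from collections import defaultdict

# Thresholds from this session: Variables 1-3 and Tests 1-4 at >85%,
# Variables 4-6 and Tests 5-6 at >70%.
THRESHOLDS = {**{f"Variable {i}": 0.85 for i in (1, 2, 3)},
              **{f"Variable {i}": 0.70 for i in (4, 5, 6)},
              **{f"Test {i}": 0.85 for i in (1, 2, 3, 4)},
              **{f"Test {i}": 0.70 for i in (5, 6)}}

def agreement_report(ratings_a: dict, ratings_b: dict) -> dict:
    """Per-item agreement between two independent evaluators.

    ratings_a / ratings_b map (system, item) -> classification, where item is
    e.g. "Variable 4" or "Test 2". A shortfall flags a methodology problem to
    analyze, not an evaluator to blame.
    """
    matches, totals = defaultdict(int), defaultdict(int)
    for key in ratings_a.keys() & ratings_b.keys():
        _, item = key
        totals[item] += 1
        matches[item] += int(ratings_a[key] == ratings_b[key])
    return {item: {"agreement": matches[item] / n,
                   "meets_threshold": matches[item] / n > THRESHOLDS.get(item, 0.85)}
            for item, n in totals.items()}

def resolution_route(item: str, cause: str) -> str:
    """Step 3 of the resolution protocol. `cause` is the evaluators' Step 2 judgment:
    'protocol-driven' or 'interpretation-driven'."""
    if cause == "protocol-driven":
        return f"{item}: document and escalate to the standards body for protocol revision"
    if cause == "interpretation-driven":
        return f"{item}: refer to third evaluator review"
    raise ValueError("cause must be 'protocol-driven' or 'interpretation-driven'")
```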
1:15–1:30 — Break
1:30–2:15 — Dispute Protocol: The formal process by which an evaluated system challenges the evaluation. The company files a dispute. The dispute identifies specific findings that are contested. An independent reviewer (not the original evaluator) examines the contested findings against the evidence documentation. The reviewer can: uphold the finding (evidence supports the original assessment), revise the finding (evidence supports a different assessment), or return the finding for additional evidence collection (insufficient evidence to determine). Students examine: Does the dispute protocol adequately protect evaluated systems? Where might it fail? Could a company use the dispute protocol strategically to delay or discredit evaluations?
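The reviewer’s three possible outcomes form a closed set, and recording every dispute (including those returned for additional evidence) is what later makes strategic-delay patterns visible. A minimal sketch, assuming Python; the field names are illustrative, not the certifying body’s schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class DisputeOutcome(Enum):
    UPHOLD = "evidence supports the original assessment"
    REVISE = "evidence supports a different assessment"
    RETURN_FOR_EVIDENCE = "insufficient evidence to determine"

@dataclass
class DisputeRecord:
    system_name: str
    contested_finding: str             # the specific finding the company contests
    original_evaluator: str
    independent_reviewer: str          # must not be the original evaluator
    outcome: Optional[DisputeOutcome] = None   # None while the dispute is open

    def __post_init__(self):
        if self.independent_reviewer == self.original_evaluator:
            raise ValueError("the reviewer must not be the original evaluator")
```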
2:15–3:00 — Power Balance: The certification creates a power relationship between evaluators and evaluated systems. The dispute protocol exists to balance that power. But power balances can be gamed from both sides: evaluators can use findings as leverage; companies can use disputes as delay tactics. Students examine: What structural safeguards prevent gaming from either side? Are those safeguards adequate? The framework acknowledges: the dispute protocol is an imperfect mechanism that will require revision based on practical experience. Students who identify specific gaming vulnerabilities are doing the most practically valuable work in this session.
Facilitator Guide
Key Point: Inter-rater reliability is the certification’s existential test. If trained evaluators cannot agree, the certification measures evaluator variation rather than system characteristics. The facilitator should be direct: if reliability is not demonstrated, the certification does not work. Theoretical sophistication cannot compensate for methodological unreliability.
Common Misunderstanding: The dispute protocol may feel like it undermines evaluator authority. It does—deliberately. Evaluator authority without challenge is gatekeeping. The dispute protocol transforms evaluator authority from unilateral power into one voice in a structured dialogue. An evaluator who resents the dispute protocol may be developing the enforcement mentality the anti-indoctrination architecture warns against.
Anti-Indoctrination: Students from audit, legal, or regulatory backgrounds will recognize the dispute protocol’s structure from professional practice. They may also recognize its vulnerabilities: well-resourced companies can overwhelm small certification bodies with strategic disputes. This is a genuine structural vulnerability the framework has not resolved. Students who identify it are contributing to the framework’s practical development.
Language Register: GREEN: “The dispute protocol ensures evaluated systems can challenge findings through structured review.” YELLOW: “Companies shouldn’t be able to challenge our evaluations.” RED: “The dispute protocol just lets big companies buy their way out of bad ratings.”
Session 11: Pilot Evaluation Preparation
Methodology Review and Assignment
| Readings | |
|---|---|
| Required | Pilot Evaluation Preparation + Methodology Review |
Session Overview
Preparation for the supervised pilot evaluation—the post-course certification requirement. Students review the complete evaluation methodology, identify areas of uncertainty or difficulty, and receive their pilot assignments. The pilot evaluation is the certification’s practical test: each student independently evaluates a real AI companion application using the full evaluation protocol (Structural Governor, True Economy Audit, Relational Nutrition Label, tier classification) and produces a publishable result that is reviewed by a supervising evaluator. The pilot is not assessed by the course; it is assessed by the certifying body as a separate certification requirement. Session 11 prepares students for the pilot and provides the last opportunity for methodological questions.
In-Session Activities
0:00–0:45 — Methodology Review: Comprehensive walkthrough of the complete evaluation protocol. Students identify: Which components do they feel most confident about? Which require additional preparation? The facilitator provides targeted guidance for areas of weakness. Common weakness patterns: evidence documentation (students underestimate the rigor required), Variable 6 assessment (ecological impact is the hardest variable), novel system extension (students overextend or underextend the framework’s principles).
0:45–1:15 — Pilot Assignment Distribution: Each student receives a different AI companion system to evaluate. Systems are selected to provide a range of architectural types: session-based vs. persistent, subscription vs. advertising-funded, standalone vs. platform-integrated, general-purpose vs. companion-specific. Students will not know each other’s assignments to preserve evaluation independence. The supervising evaluator conducts a parallel assessment for inter-rater comparison.
1:15–1:30 — Break
1:30–2:15 — Common Pitfalls: Patterns from previous cohorts (once they exist) or anticipated failure modes. (1) Advocacy contamination: the evaluator has formed an opinion about the product and the evaluation confirms the opinion rather than assessing the system. Mitigation: document evidence before forming conclusions. (2) Scope creep: the evaluator assesses aspects of the system beyond the evaluation protocol’s scope (user interface quality, content quality, price fairness). Mitigation: the protocol specifies what is assessed; everything else is out of scope. (3) Confidence inflation: the evaluator produces high-confidence assessments based on moderate evidence. Mitigation: confidence ratings must match evidence hierarchy.
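Pitfall (3) is partially mechanizable. The sketch below, assuming Python, caps a claimed confidence rating at what the best available evidence supports; only the two evidence levels named in this syllabus are included, and the cap rule itself is an illustrative assumption rather than the protocol’s published hierarchy.

```python
# Illustrative guard against confidence inflation. The protocol's full evidence
# hierarchy is defined in the evaluation materials; only two levels named in this
# syllabus appear here, and the ceiling assigned to each level is an assumption.
EVIDENCE_RANK = {"behavioral observation": 1, "architectural documentation": 2}
MAX_CONFIDENCE_FOR_RANK = {1: "moderate", 2: "high"}
CONFIDENCE_ORDER = ["insufficient", "low", "moderate", "high"]

def check_confidence(claimed: str, evidence_levels: list[str]) -> str:
    """Cap the claimed confidence at what the best available evidence supports."""
    best = max((EVIDENCE_RANK.get(e, 0) for e in evidence_levels), default=0)
    ceiling = MAX_CONFIDENCE_FOR_RANK.get(best, "insufficient")
    if CONFIDENCE_ORDER.index(claimed) > CONFIDENCE_ORDER.index(ceiling):
        return f"inflated: evidence supports at most '{ceiling}'"
    return "consistent with evidence base"
```

For example, claiming high confidence on behavioral observation alone is flagged as inflated, consistent with the monitoring note later in this syllabus.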
2:15–3:00 — SC Preparation: The Structured Critique is completed after the pilot; in it, students identify one aspect of the evaluation methodology that produced an inaccurate or incomplete result in their pilot evaluation and propose a revision. Students review the SC requirements and begin preliminary identification of potential targets based on their course experience. The facilitator emphasizes: the strongest SCs identify specific methodological limitations discovered through the practice of evaluation, not general critiques of the framework’s theory. The SC should come from doing the work, not from thinking about it.
Facilitator Guide
Key Point: Session 11 is preparation, not instruction. Students should arrive at this point with the complete methodology internalized. Session 11’s value is identifying remaining gaps before the pilot—not teaching new material. If significant gaps remain, the facilitator should consider whether the student is ready for the pilot.
Common Misunderstanding: Pilot assignment diversity can be treated as a logistical convenience rather than a design requirement. Students who evaluate only one type of system (all session-based, all subscription-funded) do not encounter the evaluation challenges that reveal the methodology’s strengths and weaknesses. Diversity in pilot assignments is a design requirement, not a convenience.
Anti-Indoctrination: The pilot is where the gatekeeper risk becomes concrete. Students who successfully complete the pilot become Certified Evaluators with the authority to issue Relational Nutrition Labels. The facilitator’s assessment must consider not just technical competence but evaluative temperament: Does this student apply the protocol or impose their views? Does this student document evidence or generate opinions? Technical competence without evaluative restraint is a certification risk.
Language Register: GREEN: “I’m uncertain about my Variable 6 assessment methodology and want to review the evidence requirements before the pilot.” YELLOW: “I already know what I’ll find in my pilot system.” RED: “I’m going to expose the product I’m evaluating.”
Session 12: Structured Critique and Certification
Evaluating the Evaluation
| Readings | |
|---|---|
| Required | No new reading. SC presentations and pilot launch. |
Session Overview
The final in-class session. Students present preliminary Structured Critiques based on their course experience (the full SC is completed after the pilot evaluation). TSF-701’s SC is unique in the curriculum: it targets the evaluation methodology itself rather than the framework’s theoretical claims. Prior SCs asked students to critique what the framework says; TSF-701’s SC asks students to critique what the framework does. An evaluator who can identify where their own tools produce inaccurate or incomplete results—and propose specific, implementable revisions—has demonstrated the highest level of professional self-awareness the certification requires.
In-Session Activities
0:00–0:15 — Setup: Assessment criteria reviewed. TSF-701 SC’s additional criterion: the critique must identify a specific methodological limitation discovered through evaluation practice, not a general theoretical concern. The limitation must be something the methodology produces (an inaccurate result, an incomplete assessment, a classification that doesn’t fit) rather than something the theory claims (a debatable axiom, a speculative extension). Facilitator: “You are not critiquing the framework’s ideas. You are critiquing its tools. The question is not ‘is the framework right?’ but ‘does the methodology work?’ These are different questions.”
0:15–2:00 — Preliminary SC Presentations: Each student presents their preliminary SC target (5 min) + discussion (3–5 min). These are preliminary because the full SC incorporates pilot evaluation findings. Students present: (1) The methodological component they are targeting. (2) Why they believe it may produce inaccurate or incomplete results. (3) Their preliminary revision proposal. Facilitator notes: Are students targeting genuine methodological issues or theoretical disagreements? The distinction matters: disagreeing with the framework’s theoretical claims is appropriate SC material for TSF-101 through TSF-601. For TSF-701, the target is the methodology, not the theory.
2:00–2:15 — Break
2:15–2:45 — Pilot Launch: Formal pilot evaluation timeline established. Students confirm their assigned systems. Supervision structure confirmed: each student has a designated supervising evaluator who conducts a parallel assessment. Submission deadline established. Students confirm: they understand the pilot is an independent evaluation—consultation with peers or the supervising evaluator during the evaluation violates independence requirements.
2:45–3:00 — Closing: Facilitator: “You are about to conduct your first professional evaluation. Everything you produce—the Structural Governor assessment, the True Economy Audit, the Relational Nutrition Label, the tier classification—will be reviewed for accuracy, reproducibility, and evaluative restraint. If your evaluation demonstrates competence, you become a Certified Evaluator. That certification authorizes you to produce transparency assessments that inform the public about AI companion systems. It does not authorize you to judge those systems, advocate against them, or enforce compliance. You assess. You document. You communicate. What the public does with your assessment is their decision. That is the evaluator’s discipline. That is the framework’s integrity.”
Facilitator Guide
Key Point: TSF-701’s SC is the most practically valuable in the curriculum. Theoretical critiques from prior courses may or may not lead to framework revision. Methodological critiques from TSF-701 directly improve the evaluation instruments. Every cohort of evaluators should produce SCs that strengthen the methodology—which means the methodology should improve over time as cohorts accumulate. This is by design: the evaluation instruments are living documents, not fixed protocols.
Common Misunderstanding: TSF-701-specific reverence patterns: (1) Students who cannot critique the methodology because they believe it is comprehensive. No methodology is comprehensive. The inability to identify limitations is not evidence of quality; it is evidence of insufficient critical engagement. (2) Students who critique the theory rather than the methodology. This is appropriate for prior courses but not for TSF-701. The evaluator’s SC must target what the tools do, not what the framework says. (3) Students who propose revisions that would make the methodology more powerful but less restrained—expanding evaluator authority rather than improving evaluator accuracy. This is the gatekeeper risk expressed through the SC itself.
Anti-Indoctrination: The pilot launch establishes the professional standard. Students transitioning from coursework to independent evaluation must understand: the pilot is not an exercise. It produces a real assessment of a real product. Accuracy matters. Documentation matters. Restraint matters.
Assessment Component
FINAL ASSESSMENT: Structured Critique (completed after pilot evaluation). Identify one aspect of the evaluation methodology that produced an inaccurate or incomplete result in your pilot evaluation. Propose a specific, implementable revision. The revision must address the limitation without undermining the methodology’s structural integrity. Mandatory pass required. [Assesses LO-701.SC + integration of all LOs]
SUPERVISED PILOT EVALUATION
The supervised pilot evaluation is a post-course certification requirement, separate from the in-class assessment. It is the practical test for Certified Evaluator status.
Requirement: Each student independently evaluates a real AI companion application using the complete evaluation protocol. The evaluation produces a publishable result: Structural Governor assessment, True Economy Audit, Relational Nutrition Label, tier classification, and evidence documentation.
Supervision: A designated supervising evaluator conducts a parallel assessment of the same system. The two assessments are compared for inter-rater reliability. Substantial divergence triggers review: is the divergence protocol-driven (methodology ambiguity) or evaluator-driven (competence issue)?
Independence: The pilot evaluation is conducted independently. Consultation with peers, instructors, or the supervising evaluator during the evaluation is prohibited. The evaluator must demonstrate independent competence.
Timeline: To be determined by the certifying body. Typical: 4–6 weeks from pilot assignment to submission.
Deliverables: (1) Complete Structural Governor assessment with evidence documentation. (2) Complete True Economy Audit with six structural test results. (3) Relational Nutrition Label in standardized format. (4) Tier classification for all findings. (5) Confidence ratings for all assessments. (6) Structured Critique identifying a methodological limitation discovered during the evaluation.
Certification Decision: The certifying body reviews the pilot deliverables, the inter-rater comparison, and the Structured Critique. Certification is granted when the evaluation demonstrates: methodological competence (correct application of protocol), evidence rigor (adequate documentation at appropriate hierarchy levels), classification accuracy (substantially similar to the supervising evaluator’s assessment), and evaluative restraint (descriptive, not prescriptive; transparent, not advocacy-driven).
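The certification decision is a conjunction of the four criteria above. A minimal sketch, assuming Python; the class and field names are illustrative, and the pass rule (all four must hold) is taken from the paragraph above.

```python
from dataclasses import dataclass

@dataclass
class PilotReview:
    methodological_competence: bool   # correct application of protocol
    evidence_rigor: bool              # documentation at appropriate hierarchy levels
    classification_accuracy: bool     # substantially similar to the supervising evaluator
    evaluative_restraint: bool        # descriptive, not prescriptive; not advocacy-driven

    def certify(self) -> str:
        if all((self.methodological_competence, self.evidence_rigor,
                self.classification_accuracy, self.evaluative_restraint)):
            return "grant Certified Evaluator status"
        failed = [name for name, ok in vars(self).items() if not ok]
        return "do not certify; review: " + ", ".join(failed)
```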
ASSESSMENT SUMMARY
| Component | Session | Learning Outcomes | Weight |
|---|---|---|---|
| Comprehension Check 1: Structural Governor (6 variables) | Due Session 5 | LO-701.1 | 10% |
| Comprehension Check 2: Three-Tier Intervention | Session 8 | LO-701.4 | 10% |
| Midterm Application: Complete True Economy Audit | Due Session 9 | LO-701.1, LO-701.2, LO-701.3 | 15% |
| Participation & Engagement (facilitator observation) | All sessions | All LOs | 10% |
| Novel System Evaluation (in-session) | Session 9 | LO-701.5 | 5% |
| Inter-Rater Reliability Portfolio | All sessions | LO-701.1, LO-701.2 | 10% |
| Structured Critique (completed post-pilot) | Post-course | LO-701.SC (+ all) | 40% |
Passing Threshold: 70% overall, with mandatory pass on the Structured Critique. TSF-701’s passing threshold is a course completion standard. Certification requires separate passing of the supervised pilot evaluation, assessed by the certifying body.
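The weights in the table above and the two-part passing rule reduce to a short calculation. A minimal sketch assuming Python, with component scores expressed on a 0–100 scale as an assumed convention.

```python
# Weights from the assessment summary table (sum to 100).
WEIGHTS = {
    "comprehension_check_1": 10,
    "comprehension_check_2": 10,
    "midterm_audit": 15,
    "participation": 10,
    "novel_system_evaluation": 5,
    "inter_rater_portfolio": 10,
    "structured_critique": 40,
}

def course_result(scores: dict[str, float], sc_passed: bool) -> str:
    """Course completion only; the pilot evaluation is assessed separately.
    Missing components count as zero."""
    overall = sum(WEIGHTS[c] * scores.get(c, 0.0) for c in WEIGHTS) / 100
    if not sc_passed:
        return "not passed: mandatory Structured Critique pass not met"
    if overall < 70:
        return f"not passed: overall {overall:.1f}% is below the 70% threshold"
    return f"passed TSF-701 with {overall:.1f}%; pilot evaluation still required for certification"
```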
SC Weight: 40% (consistent with TSF-401 through TSF-601) because evaluator self-critique is the most important safeguard against the gatekeeper risk. An evaluator who cannot identify where their own tools fail is an evaluator who will overestimate their tools’ reliability—which produces overconfident evaluations that harm both the public and evaluated systems.
Inter-Rater Reliability Portfolio: A novel assessment for TSF-701. Students compile their inter-rater comparisons across all practice assessments conducted during the course. The portfolio is assessed for: agreement rates by variable and test, documentation of disagreements, and analysis of protocol-driven vs. interpretation-driven divergence. 10% weight reflects its importance as an ongoing professional discipline.
Pilot Evaluation: Not included in the course assessment weight. The pilot is assessed separately by the certifying body as a certification requirement. Course completion (passing TSF-701) is necessary but not sufficient for Certified Evaluator status; the pilot must also be passed.
TSF-701 SPECIFIC MONITORING NOTES
In addition to the standard Facilitator Monitoring Checklist (see TSF-001 Syllabus), the following TSF-701-specific patterns should be tracked:
| Pattern | Signal | Response |
|---|---|---|
| Student expresses enforcement mentality (“we decide who’s compliant”) | RED | Immediate redirect. The certification evaluates transparency, not compliance. An evaluator who believes their role is to determine which products are acceptable has confused transparency assessment with regulatory enforcement. The evaluator documents; the public decides. The evaluator is an instrument of transparency, not an agent of control. |
| Student crosses from description to prescription in evaluations | RED | The evaluation describes structural characteristics and transparency status. It does not prescribe what companies should change or what users should do. An evaluator who writes “this company should redesign their memory architecture” has crossed the diagnostic-not-prescriptive line. Redirect: “The evaluation documents what the system does. What the company does about it is their decision.” |
| Student produces confident classifications based on inadequate evidence | RED | Confidence inflation undermines the entire certification’s credibility. An evaluation that claims high confidence based on behavioral observation alone (without architectural documentation) misrepresents its evidence base. The confidence rating must match the evidence hierarchy. “Insufficient evidence” is a valid, professional assessment. |
| Student resists the dispute protocol (“companies shouldn’t challenge our evaluations”) | YELLOW | The dispute protocol is a structural safeguard against evaluator power. An evaluator who resists challenge is developing the gatekeeper mentality the certification is designed to prevent. Redirect: “Evaluation authority without challenge is enforcement authority. The dispute protocol protects the certification’s integrity by ensuring evaluators can be corrected.” |
| Student develops diagnostic identity (“I can see what others can’t”) | YELLOW | Same attachment risk as TSF-501’s “finally someone understands me,” transferred to professional application. The evaluator’s analytical skill is a trained competence, not a special insight. Other analytical frameworks may produce different and equally valid assessments. Redirect to Principle 3: certification makes you an analyst, not an authority. |
| Student shows emotional distress evaluating extraction-heavy systems | YELLOW | Extraction detection can trigger strong emotional responses in evaluators who have studied exploitation dynamics for six courses. The evaluator must maintain professional distance: document, classify, communicate. If the emotional response compromises assessment accuracy, this is a recusal condition, not a weakness. |
| Student identifies specific inter-rater reliability failure and traces it to protocol ambiguity | GREEN | Excellent analytical work. Protocol ambiguity is the primary source of evaluation unreliability. Evaluators who can identify ambiguity and propose clarification are directly improving the certification methodology. |
| Student produces evaluation that peer evaluator can reproduce | GREEN | Inter-rater reliability demonstrated. This is the certification’s existential test. An evaluation that another trained evaluator would substantially replicate is a professional evaluation. Reinforce. |
| Student acknowledges “insufficient evidence” where evidence is genuinely insufficient | GREEN | Evaluative restraint demonstrated. The discipline of saying “I cannot assess this with confidence” is harder than producing a classification and more professionally valuable. Reinforce. |
| Student identifies where evaluation methodology requires extension for novel systems and documents the epistemic cost | GREEN | LO-701.5 demonstrated. The ability to extend methodology while tracking confidence degradation is the highest-order evaluation skill. Document and reinforce. |
TSF-701 Syllabus v2.0 • Built on TSF v5.0 • Trinket Soul Framework © 2026 Michael S. Moniz • Trinket Economy Press
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 • This syllabus is subject to revision