AI‑Ready SOAP Note Evaluation Checklist Guide

Creating AI‑Ready SOAP Note Evaluation Checklists: User Guide & Best Practices for the LearningSpace – AI Assistant

The LearningSpace – AI Assistant supports faculty by automatically pre-filling SOAP‑note–based checklists using large‑language‑model (LLM) analysis. To ensure accurate, reliable, and verifiable pre‑evaluation, the checklist's structure must be intentionally designed for natural‑language inputs rather than human-scoring shorthand.

This guide provides practical instructions for faculty, SP educators, and OSCE designers on how to build checklists that the AI Assistant (OEQS) can interpret with high fidelity.

1.  Core Design Philosophy

1.1  SOAP Notes Are Natural Language → Your Checklist Should Be Too

Learners write SOAP notes in free text. Therefore, AI performs best when checklist questions and answer choices are also written in clear, natural language, avoiding abbreviations or coded scoring logic.

Key principle: Write checklist questions the way a human would describe correct or incorrect performance in plain English.


1.2  Humans Think in Abstractions — AI Needs Explicit Information

Traditional checklists might compress scoring logic into minimal text because faculty members, based on years of experience, apply heuristic rules automatically. AI, however, needs explicit, self-contained statements.


1.3  What Information the AI Receives Today

At the current stage of implementation, the AI Assistant evaluates learner SOAP notes using only the information provided in the checklist itself and the learner’s written response.

It does not receive case materials, instructions, or scenario details unless they are explicitly included in the checklist items.

Specifically, the AI receives:

  1. The learner’s SOAP note text (Subjective, Objective, Assessment, Plan).

  2. The evaluation question text for each checklist item.

  3. The full natural-language evaluation criteria embedded in the radio buttons, checklist items, or matrix elements.

Because no case description or external rubric is passed to the AI, every aspect of evaluation—correct actions, expected findings, common errors, red-flag omissions, and unacceptable practices—must be represented as self-contained evaluation criteria within the checklist.
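To make this concrete, the following minimal sketch (in Python) illustrates the only information available to the AI for a single checklist item. The field names, wording, and data layout are purely illustrative and do not reflect the actual LearningSpace data model; the diagnosis wording reuses the appendicitis example from this guide.

    # Hypothetical sketch of the information the AI Assistant evaluates against.
    # Field names are illustrative only, not the actual LearningSpace schema.
    checklist_item = {
        "question": "Did the learner document the expected diagnosis of acute appendicitis?",
        "answer_options": [
            "Correct: The learner documented acute appendicitis as the most likely diagnosis.",
            "Incorrect: The learner documented a diagnosis not supported by the case presentation.",
            "Omitted: The learner did not document any diagnosis.",
        ],
    }

    learner_soap_note = """
    S: ...
    O: ...
    A: Most likely acute appendicitis; differential includes ovarian pathology.
    P: ...
    """

    # The model compares learner_soap_note against checklist_item only;
    # no case file, rubric, or scenario description is attached.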

Why this matters

If a checklist item assumes the AI knows details about the case (e.g., “Did the learner document the correct differential diagnosis for this case?”), the AI has no way to infer what the correct answer should be.

To ensure accurate scoring:

  • Each evaluation point should be expressed as a standalone statement.

  • Expected findings should be written in full (e.g., “The learner documented the expected diagnosis of acute appendicitis”).

  • Error states must be explicit (e.g., “The learner documented an incorrect diagnosis not supported by the case presentation”).

This ensures that the AI evaluates the learner’s response by directly comparing it with the clearly defined, text-based criteria—without relying on hidden or inaccessible case context.

2.  Best Practices for AI‑Ready Checklist Construction

2.1  Write Questions in Full Sentences

Good:
“Did the learner provide a concise assessment that identifies the most likely diagnosis and key differential diagnoses?”


Bad:

“Assessment accurate?”


2.2  Use Natural‑Language Answer Options

AI performs best when each answer choice is:

–  complete (a full, standalone statement)

–  unambiguous

–  self-contained


2.3  Include Both Correct and Common Error States

Checklist items should explicitly capture:

–  correct actions

–  partial actions

–  incorrect actions

–  omissions


2.4  Avoid Shorthand, Abbreviations, or Internal Coding

Avoid:

“Hx/ROS appropriate?”

“Plan → R1/R2/R3 categories”


2.5  Avoid Nested Scoring Logic in Headers

Bad practice:
A header with scoring logic (Poor/Fair/Good/Excellent) but radio buttons containing only labels.

Correct approach:
Convert scoring logic into full statements under each radio button.


2.6 Specify Quantifiable Clinical Information

Bad practice:
“DURATION: dyspnea is persistent.”

Correct approach:
“Duration: Did the learner document that the duration of the patient’s dyspnea was 3 weeks, including that it was persistent?”


2.7  Never Use External Tables, Images, or Links for Scoring Rules

AI cannot access images or external PDFs.

Convert every rubric element into text.

Summary: Characteristics of non-AI-compatible checklists

· Uses unresolved abbreviations (HPI, meds, NKDA, etc.)

· Uses vague, subjective language (“some,” “several,” “clear picture”)

· Provides no explicit list of what the AI should search for

· Provides no measurable thresholds (“most,” “adequate,” etc.)

· References information that the question never defines

· Depends on case knowledge that the AI does not have

· Contains ambiguous or contradictory scoring options

AI-Compatible Checklist Development Guidelines

· Resolve abbreviations such as HPI, meds, NKDA, and ROS; do not use them without expansion.

· Reference the complete information set relevant to the case. The question must explicitly list what the AI should look for (e.g., the exact medications or the exact symptom attributes).

· Use measurable, countable criteria.
NOT DONE / PARTLY DONE / WELL DONE must be based on numeric thresholds (e.g., 0–2 attributes, 3–6 attributes, ≥7 attributes); see the sketch after this list.

· Avoid any case-knowledge dependency that is not stated in the question.
The evaluator must not guess what is “appropriate for the case.”
Everything must be self-contained.

· Avoid subjective language.
No “some,” “several,” “clear picture,” “adequately,” “important,” etc.

· Avoid conditional relevance.
No “if relevant,” “as needed,” or “if appropriate” inside the criteria.
(All attributes are simply counted, not interpreted.)

· Avoid extraneous or interpretive statements.
The AI should only evaluate the presence or absence of the listed data, not clinical correctness.
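As referenced above, numeric thresholds make the rating bands fully mechanical. The following minimal sketch (in Python, purely illustrative) shows how a count of documented symptom attributes maps onto NOT DONE / PARTLY DONE / WELL DONE, using the band boundaries from the symptom-details example in Section 4.2:

    # Illustrative only: converts a count of documented attributes into a rating.
    # Bands follow the Section 4.2 symptom-details example:
    # 0-3 = NOT DONE, 4-7 = PARTLY DONE, 8 or more = WELL DONE.
    def rate_symptom_attributes(documented_count: int) -> str:
        if documented_count <= 3:
            return "NOT DONE"
        if documented_count <= 7:
            return "PARTLY DONE"
        return "WELL DONE"

    # A note documenting onset, character, severity, timing, and alleviating
    # factors covers 5 of the 10 listed attributes:
    print(rate_symptom_attributes(5))  # -> PARTLY DONE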

3.  Recommended Checklist Structure

3.1  Question Format

Use a full natural-language question describing what the learner should demonstrate in S/O/A/P.

3.2  Answer Option Format

Use structured sets (a short sketch of one such item follows this list):

·     Correct

·     Partial

·     Incorrect

·     Omitted
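Putting 3.1 and 3.2 together, one item written to this structure might look like the following sketch (Python; the field names and option wording are hypothetical, not a LearningSpace schema):

    # Illustrative checklist item using the Correct / Partial / Incorrect / Omitted set.
    # Field names and option wording are hypothetical; each option is a full,
    # standalone statement rather than a bare label.
    assessment_item = {
        "question": (
            "Did the learner provide a concise assessment that identifies the most "
            "likely diagnosis and key differential diagnoses?"
        ),
        "options": {
            "Correct": (
                "The learner documented the most likely diagnosis and at least two "
                "relevant differential diagnoses."
            ),
            "Partial": (
                "The learner documented the most likely diagnosis but no differential "
                "diagnoses."
            ),
            "Incorrect": (
                "The learner documented a diagnosis not supported by the case presentation."
            ),
            "Omitted": "The learner did not document an assessment.",
        },
    }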

4.  Illustrative Example

To illustrate how the checklist design principles should be applied in practice, the following example uses a short, self-contained clinical case. It demonstrates both ineffective and well-constructed checklist items.

The “bad” or “non-AI-compatible” versions show common pitfalls: unresolved abbreviations, vague scoring terms, missing reference information, and reliance on unstated case details, all of which make them unsuitable for AI-assisted evaluation.

The “good” or “AI-ready” versions are built for each item by explicitly listing all relevant case elements, resolving abbreviations, and defining measurable thresholds for NOT DONE, PARTLY DONE, and WELL DONE.

Together, these examples show how a clear, unambiguous, and fully self-contained checklist enables consistent, objective scoring, whether done by human reviewers or AI.


4.1  Sample case description

Patient Overview

A 55-year-old man presents to the clinic with increasing shortness of breath. He has a history of hypertension and type 2 diabetes, both of which are generally well controlled. He lives independently, walks several kilometers daily, and has no history of heart or lung disease. He does not smoke and drinks alcohol occasionally.

History of Present Illness

The patient reports that his shortness of breath began three days ago and has progressively worsened. At first, he only noticed it when walking uphill, but over the past day, he has felt breathless even at rest. He describes the sensation as “not being able to get enough air.” He denies chest pain, fever, or cough, but mentions feeling unusually tired.

He also noticed mild swelling around both ankles and says he had difficulty sleeping last night because lying flat worsened his breathing. Sitting upright provides some relief.

Symptom Details

·       Onset: 3 days ago

·       Character: Constant shortness of breath, worse with exertion

·       Severity: Moderate; limits daily activities and causes discomfort at rest

·       Timing: Worse at night and when lying flat

·       Aggravating factors: Physical activity, supine position

·       Alleviating factors: Sitting upright

·       Associated symptoms: Fatigue and mild ankle swelling; no chest pain, cough, or fever

Medical History and Baseline Function

The patient is usually active and independent. He routinely walks 2–3 km per day without difficulty and has had no recent hospitalizations. He reports no changes in appetite, weight, or exercise tolerance until this current episode.

Medications

·       Lisinopril 10 mg PO once daily

·       Metformin 500 mg PO twice daily

·       Albuterol inhaler 90 mcg, 2 puffs every 4 hours as needed (used rarely)

Allergies

·       Penicillin, which causes a rash

·       No food or environmental allergies reported

4.2  Sample checklist items demonstrating non-AI-compatible and AI-ready practices

Each item below is shown in two versions: first the non-AI-compatible formulation, then the AI-ready formulation.

HPI – baseline info

Non-AI-compatible formulation:

[ ] NOT DONE: Learner didn’t really talk about baseline status or didn’t give enough useful info

[ ] PARTLY DONE: Learner mentioned some baseline information, but it’s not detailed enough

[ ] WELL DONE: Learner included all relevant baseline HPI information and explained the patient’s usual functioning clearly

AI-ready formulation:

Document the patient’s baseline health and functioning as follows:

1. Chronic medical conditions (hypertension, type 2 diabetes)
2. Baseline physical activity (walks 2–3 km daily)
3. Independence in activities of daily living
4. Smoking status (non-smoker)
5. Alcohol use (occasional)
6. Recent hospitalizations (none)
7. Baseline functional stability (no prior cardiac/lung disease)

[ ] NOT DONE: Fewer than 3 of the listed baseline elements are documented.

[ ] PARTLY DONE: 3–5 of the listed baseline elements are documented.

[ ] WELL DONE: 6 or more of the listed baseline elements are documented.

HPI – symptom details

Non-AI-compatible formulation:

· onset
· duration
· severity
· other relevant features

[ ] NOT DONE: Missing several key parts of the HPI, doesn’t give a clear picture

[ ] PARTLY DONE: Has some of the HPI symptom attributes but lacks important ones

[ ] WELL DONE: Describes most HPI attributes well enough to understand the problem

AI-ready formulation:

Document the attributes of the patient’s presenting symptom (shortness of breath) as follows:

1. Onset (3 days ago)
2. Character (“cannot get enough air”)
3. Location (chest, non-localized dyspnea)
4. Radiation (none)
5. Severity (moderate; at rest and with exertion)
6. Duration (constant over the last 24 hours)
7. Timing/pattern (worse at night / when lying flat)
8. Alleviating factors (sitting upright)
9. Aggravating factors (exertion, lying flat)
10. Associated symptoms (fatigue, ankle swelling)

[ ] NOT DONE: 0–3 symptom attributes documented.

[ ] PARTLY DONE: 4–7 symptom attributes documented.

[ ] WELL DONE: 8 or more symptom attributes documented.

Meds documented

Non-AI-compatible formulation:

[ ] NOT DONE: Medications not listed or missing required info

[ ] PARTLY DONE: Med list is complete, but not all meds include dose/route/freq

[ ] WELL DONE: Meds listed with complete information for each medication as per case

AI-ready formulation:

Document all current medications as described, including dose, route, and frequency:

1. Lisinopril 10 mg PO once daily
2. Metformin 500 mg PO twice daily
3. Albuterol inhaler 90 mcg, 2 puffs every 4 hours as needed

[ ] NOT DONE: Learner lists 0–1 medications, or medication details include dose/route/frequency for fewer than 30% of listed items.

[ ] PARTLY DONE: Learner lists all medications, BUT includes complete dose/route/frequency for only 30–79% of them.

[ ] WELL DONE: Learner lists all medications AND includes dose, route, and frequency for ≥80% of them.

Allergies noted (incl. rxn)

Non-AI-compatible formulation:

[ ] NOT DONE: No allergies listed OR NKDA stated incorrectly

[ ] PARTLY DONE: Allergies listed, but reactions missing

[ ] WELL DONE: All allergies documented with reactions and NKDA if appropriate for the case

AI-ready formulation:

Document the patient’s allergies as follows:

1. Penicillin allergy
2. Reaction: rash
3. Absence of food/environmental allergies (no others reported)

[ ] NOT DONE: 0–1 of the listed allergy elements documented, OR contradictory use of NKDA.

[ ] PARTLY DONE: 2 of the required allergy elements documented.

[ ] WELL DONE: All three allergy elements documented precisely as stated.

5.  Examples

5.1  Examples of Good Practices

5.2  Examples of Mixed Practices

Use of abbreviations should be avoided, but checklist items that contain full text still allow pre-filling with good accuracy.