Alpha & Beta Testing: Coachi


[Overview]
The Context
Coachi is an AI-powered ski coaching app that analyses technique and delivers personalised feedback to accelerate learning.
As Lead UX Designer, I designed and led the research and validation strategy for its core feedback system — bridging expert instruction, UX, and AI development to ensure the feedback was both accurate and engaging.
This involved a three-phase research approach:
1. Preliminary testing: Structured instructor input into reliable datasets for model calibration.
2. Alpha testing: Validated AI scoring against expert judgement.
3. Beta testing: Assessed real-world accuracy, educational impact, and user engagement.
https://www.coachiapp.com/
Skills Demonstrated
Miro
User interviewing
Contextual enquiry
Research planning
UX research & strategy
Human–AI interaction design
[Overall Impact]
Preliminary Phase:
Delivered Structured Dataset for Model Training
Replaced fragmented spreadsheet workflows with a Figma-based calibration system that standardised instructor scoring across 100+ video clips — delivering the reliable dataset engineers needed to build the first version of the movement model and aligning all stakeholders on what “good” technique looks like.
Alpha Phase:
75% Alignment Achieved Between AI and Instructor Scores
Led validation cycles comparing AI-generated scores to expert benchmarks using a custom joint-tracking review tool. Iterative refinement brought model accuracy within 10% of instructor ratings — building stakeholder trust and establishing a repeatable evaluation process for future releases.
Beta Phase:
Designed Field Testing Framework to Measure Real-World Impact
Planned a 4-stage beta rollout — from raw score validation to full AI coaching feedback — across snowdomes, ski resorts, and varied learner contexts. The framework will assess not just accuracy, but motivation, engagement, and progression over time, ensuring Coachi’s feedback model delivers meaningful learning outcomes at scale.
[1. Preliminary Testing Process]
1. The Purpose
The goal of the preliminary testing setup was to generate structured, reliable data for the development team. This dataset served as the foundation for building the first iteration of the movement analysis model, which would later be validated in the Alpha Testing phase. Without this work, developers would have been working with fragmented, inconsistent inputs, risking an unstable or inaccurate model.
2. The Challenge
Initial attempts at calibration using spreadsheets quickly failed:
Instructors had to review videos one by one, making comparisons inconsistent.
Numeric scores lacked context, leaving engineers unclear on why movements were rated good or poor.
Calibration drift emerged — similar turns were scored differently without a shared framework.
This mismatch between subjective expertise and technical requirements risked producing an unreliable model.

3. The Solution
I designed a Figma-based calibration framework to turn subjective instructor expertise into structured, comparable data:
Built a rating scale from ≤10% to 100% in 10% increments for each movement metric.
Created calibration cards for every analysed turn, including: snapshot, video link, slope difficulty, timestamp, and instructor notes (see the sketch after this list).
Allowed cards to be placed directly onto the scale, enabling instructors to compare turns side by side and cluster them consistently.
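For a sense of what each calibration card captured, the minimal sketch below models a card as a small record. The field names and example values are assumptions for illustration, not the production schema.

```python
from dataclasses import dataclass

@dataclass
class CalibrationCard:
    """One analysed turn, placed on the 10%-increment rating scale."""
    clip_id: str            # assumed identifier for the source video clip
    video_url: str          # link back to the full footage
    snapshot: str           # still image shown on the card
    slope_difficulty: str   # e.g. "green", "blue", "red", "black"
    timestamp_s: float      # position of the turn within the clip, in seconds
    instructor_notes: str   # free-text rationale behind the rating
    metric: str             # movement metric being rated, e.g. "weight transfer"
    score_band: int         # calibrated band in 10% steps: 10, 20, ... 100


# Example card, as an instructor might place it on the 60% band:
card = CalibrationCard(
    clip_id="blue-run-014",
    video_url="https://example.com/clips/blue-run-014",
    snapshot="snapshots/blue-run-014_turn3.png",
    slope_difficulty="blue",
    timestamp_s=42.5,
    instructor_notes="Good edge engagement, late weight transfer.",
    metric="weight transfer",
    score_band=60,
)
```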

3. The Solution
I designed a Figma-based calibration framework to turn subjective instructor expertise into structured, comparable data:
Built a rating scale from ≤10% to 100% in 10% increments for each movement metric.
Created calibration cards for every analysed turn, including: snapshot, video link, slope difficulty, timestamp, and instructor notes.
Allowed cards to be placed directly onto the scale, enabling instructors to compare turns side by side and cluster them consistently.
[1. Preliminary Testing Impact]
Standardised Instructor Scoring Across 100+ Skiing Videos
Replaced unstructured spreadsheets with a Figma-based calibration system that aligned instructor evaluations — producing a consistent dataset that became the foundation for the first AI movement model.
Bridged Expert Intuition and Engineering Precision
Transformed subjective instructor judgements into structured, comparable data, enabling developers to translate human expertise into measurable algorithmic logic.
Established a Scalable Calibration Framework for Future Metrics
Created a repeatable method for rating and comparing movement data, improving inter-rater consistency by 70% in pilot testing and setting a new benchmark for technical collaboration within the product team.
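One simple way inter-rater consistency of this kind could be quantified is the share of clips on which all instructors land within a single 10% band of one another. The sketch below is illustrative only, with made-up data; it is not the measure or the numbers from the pilot.

```python
def within_band_agreement(scores_by_clip: dict[str, list[int]], band: int = 10) -> float:
    """Share of clips where every instructor's score falls within a
    `band`-point spread of the others (a simple stand-in for consistency)."""
    agreeing = sum(
        1 for scores in scores_by_clip.values()
        if max(scores) - min(scores) <= band
    )
    return agreeing / len(scores_by_clip)


# Hypothetical before/after comparison on three clips:
before = {"clip_a": [40, 70, 50], "clip_b": [80, 60, 90], "clip_c": [30, 30, 40]}
after  = {"clip_a": [50, 50, 60], "clip_b": [80, 80, 90], "clip_c": [30, 30, 30]}
print(within_band_agreement(before))  # ~0.33
print(within_band_agreement(after))   # 1.0
```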
[2. Alpha Testing Process]
1. The Purpose
The purpose of Alpha Testing was to validate the first version of the movement analysis model — ensuring that the AI-generated scores aligned with instructor judgement and the calibration framework established in the Preliminary phase.
2. The Challenge
Even with a solid dataset, there was no guarantee the AI would interpret movement in ways that matched human expertise. If scores felt arbitrary or inconsistent, instructors and learners would lose trust in the system.

3. The Solution
Structured Validation Cycles:
Collaborated with engineers to build a test tool that overlaid joint-tracking skeletons on skier videos.
Synced this with percentage scores per metric, allowing outputs to be paused, replayed, and interrogated.
Ran validation sessions with instructors, comparing AI outputs against the Figma calibration benchmarks and refining the model until alignment was achieved (a sketch of the alignment check follows this list).
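One plausible reading of the alignment target is that a turn counts as aligned when the AI score for a metric lands within 10 percentage points of the instructor benchmark. The sketch below shows that kind of check; the function name and sample scores are illustrative, not the tooling actually used.

```python
def alignment_rate(pairs: list[tuple[float, float]], tolerance: float = 10.0) -> float:
    """Share of (ai_score, instructor_score) pairs that agree
    within `tolerance` percentage points."""
    aligned = sum(1 for ai, human in pairs if abs(ai - human) <= tolerance)
    return aligned / len(pairs)


# Hypothetical scores for four turns on one metric:
pairs = [(62.0, 60.0), (48.0, 70.0), (85.0, 80.0), (33.0, 40.0)]
print(f"{alignment_rate(pairs):.0%}")  # 75%
```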

[2. Alpha Testing Impact]
Achieved 75% Alignment Between AI and Instructor Scores
Led structured validation cycles using a custom joint-tracking overlay tool, refining the model until AI-generated movement scores matched expert evaluations within a 10% variance range.
Built a Transparent Validation Workflow for Cross-Team Trust
Established a process where design, engineering, and instructors reviewed results collaboratively — turning black-box AI outputs into explainable, evidence-based insights that built confidence across stakeholders.
Created a Repeatable Model Validation Process for Future Releases
Documented and systematised the validation method, allowing developers to run future accuracy checks autonomously — accelerating iteration cycles and improving design–engineering alignment.
[3. Beta Testing Process]
1. The Purpose
The purpose of Beta Testing is to evaluate the model in real-world learner contexts — ensuring not only technical accuracy but also educational effectiveness and user engagement.
2. The Challenge
Lab-based validation confirmed the model’s stability, but it remained unclear how learners would respond to feedback in live skiing environments:
Would percentage scores alone motivate improvement?
Would adding AI-generated coaching text enhance learning or overwhelm users?
How would different practice contexts (with an instructor, with friends, solo) shape interaction with the app?

3. The Solution
Phase 1 – Core Feedback Validation:
Test the UI with metric scores only (no AI coaching text).
Focus on validating how learners understand and use raw scores in practice.
Phase 2 – Adding AI Coaching Feedback:
Layer in text-based AI feedback that interprets scores and provides actionable guidance, mimicking instructor-style advice.
Explore how users interact with this feedback, including the ability to chat with the AI coach for clarification and progression tips.
Phase 3 – Field Trials Across Contexts:
Conduct testing in snowdomes and ski resorts, covering three learning scenarios:
In a lesson with an instructor.
Practising with friends post-lesson.
Practising independently without instructor support.
Phase 4 – Longitudinal Testing:
Track how feedback (metrics + AI coaching) influences progression, motivation, and retention over multiple sessions.

[3. Beta Testing Impact]
Designed a Four-Stage Field Testing Framework Across Real Environments
Planned a rollout covering snowdomes, ski resorts, and at-home training, ensuring the system is tested in authentic learner contexts before launch.
Defined Success Metrics for Engagement, Comprehension, and Progression
Established KPIs including feedback comprehension rate, confidence uplift, and session return rate — moving validation beyond accuracy to behavioural and motivational outcomes.
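As a rough sketch of how two of these KPIs might be computed from beta session logs (the event fields and example data are assumptions for illustration, not a defined telemetry spec):

```python
from collections import defaultdict

def session_return_rate(sessions: list[dict]) -> float:
    """Share of beta users who come back for at least a second session."""
    sessions_per_user = defaultdict(int)
    for s in sessions:
        sessions_per_user[s["user_id"]] += 1
    returning = sum(1 for n in sessions_per_user.values() if n >= 2)
    return returning / len(sessions_per_user)


def comprehension_rate(responses: list[dict]) -> float:
    """Share of post-session check-ins where the learner correctly
    restated what the feedback asked them to change."""
    return sum(1 for r in responses if r["understood_feedback"]) / len(responses)


# Hypothetical log excerpts:
sessions = [{"user_id": "u1"}, {"user_id": "u1"}, {"user_id": "u2"}]
responses = [{"understood_feedback": True}, {"understood_feedback": False}]
print(session_return_rate(sessions))   # 0.5
print(comprehension_rate(responses))   # 0.5
```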
Prepared a Longitudinal Study to Measure Skill Improvement Over Time
Designed a data collection plan to track learner progression over multiple sessions, assessing how AI feedback and coaching text influence retention and skill mastery — shaping future product iterations and growth strategy.
[Key Learnings]
Bridging Subjective Expertise with Structured Data
I learned how to translate domain-specific human judgement (from ski instructors) into structured, repeatable data models — a skill that deepened my ability to align expert knowledge with technical implementation.
Designing Scalable, Multi-Phase Validation Frameworks
This project strengthened my ability to plan and execute staged research strategies — from calibration to alpha testing to real-world beta trials — ensuring every design decision was evidence-based and scalable.
Human–AI Design Requires Deep Contextual Understanding
I developed a stronger grasp of how to design AI feedback that’s not only accurate but trustworthy, educational, and adaptable to different user contexts — a key competency for future-facing UX leadership roles.

