Alpha & Beta Testing: Coachi


[Overview]
The Context
Coachi is an AI-powered ski coaching app that analyses technique and delivers personalised feedback to accelerate learning.
As Lead UX Designer, I designed and led the research and validation strategy for its core feedback system — bridging expert instruction, UX, and AI development to ensure the feedback was both accurate and engaging.
This involved a three-phase research approach:
1. Preliminary testing: Structured instructor input into reliable datasets for model calibration.
2. Alpha testing: Validated AI scoring against expert judgement.
3. Beta testing: Assessed real-world accuracy, educational impact, and user engagement.
https://www.coachiapp.com/
Skills Demonstrated
Miro
User interviewing
Contextual enquiry
Research planning
UX research & strategy
Human–AI interaction design
[Overall Impact]
Preliminary Phase:
Delivered Structured Dataset for Model Training
Replaced fragmented spreadsheet workflows with a Figma-based calibration system that standardised instructor scoring across 100+ video clips — delivering the reliable dataset engineers needed to build the first version of the movement model and aligning all stakeholders on what “good” technique looks like.
Alpha Phase:
75% Alignment Achieved Between AI and Instructor Scores
Led validation cycles comparing AI-generated scores to expert benchmarks using a custom joint-tracking review tool. Iterative refinement brought model accuracy within 10% of instructor ratings — building stakeholder trust and establishing a repeatable evaluation process for future releases.
Beta Phase:
Designed Field Testing Framework to Measure Real-World Impact
Planned a 4-stage beta rollout — from raw score validation to full AI coaching feedback — across snowdomes, ski resorts, and varied learner contexts. The framework will assess not just accuracy, but motivation, engagement, and progression over time, ensuring Coachi’s feedback model delivers meaningful learning outcomes at scale.
[1. Preliminary Testing Process]
1. The Purpose
The goal of the preliminary testing setup was to generate structured, reliable data for the development team. This dataset served as the foundation for building the first iteration of the movement analysis model, which would later be validated in the Alpha Testing phase. Without this work, developers would have been working with fragmented, inconsistent inputs, risking an unstable or inaccurate model.
2. The Challenge
Initial attempts at calibration using spreadsheets quickly failed:
Instructors had to review videos one by one, making comparisons inconsistent.
Numeric scores lacked context, leaving engineers unclear on why movements were rated good or poor.
Calibration drift emerged — similar turns were scored differently without a shared framework.
This mismatch between subjective expertise and technical requirements risked producing an unreliable model.

3. The Solution
I designed a Figma-based calibration framework to turn subjective instructor expertise into structured, comparable data:
Built a rating scale from ≤10% to 100% in 10% increments for each movement metric.
Created calibration cards for every analysed turn, including: snapshot, video link, slope difficulty, timestamp, and instructor notes (see the sketch after this list).
Allowed cards to be placed directly onto the scale, enabling instructors to compare turns side by side and cluster them consistently.
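For a sense of what each calibration card captured, the minimal sketch below models a card as a small record. The field names and example values are assumptions for illustration, not the production schema.

```python
from dataclasses import dataclass

@dataclass
class CalibrationCard:
    """One analysed turn, placed on the 10%-increment rating scale."""
    clip_id: str            # assumed identifier for the source video clip
    video_url: str          # link back to the full footage
    snapshot: str           # still image shown on the card
    slope_difficulty: str   # e.g. "green", "blue", "red", "black"
    timestamp_s: float      # position of the turn within the clip, in seconds
    instructor_notes: str   # free-text rationale behind the rating
    metric: str             # movement metric being rated, e.g. "weight transfer"
    score_band: int         # calibrated band in 10% steps: 10, 20, ... 100


# Example card, as an instructor might place it on the 60% band:
card = CalibrationCard(
    clip_id="blue-run-014",
    video_url="https://example.com/clips/blue-run-014",
    snapshot="snapshots/blue-run-014_turn3.png",
    slope_difficulty="blue",
    timestamp_s=42.5,
    instructor_notes="Good edge engagement, late weight transfer.",
    metric="weight transfer",
    score_band=60,
)
```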

3. The Solution
I designed a Figma-based calibration framework to turn subjective instructor expertise into structured, comparable data:
Built a rating scale from ≤10% to 100% in 10% increments for each movement metric.
Created calibration cards for every analysed turn, including: snapshot, video link, slope difficulty, timestamp, and instructor notes.
Allowed cards to be placed directly onto the scale, enabling instructors to compare turns side by side and cluster them consistently.
[1. Preliminary Testing Impact]
Standardised Instructor Scoring Across 100+ Skiing Videos
Replaced unstructured spreadsheets with a Figma-based calibration system that aligned instructor evaluations — producing a consistent dataset that became the foundation for the first AI movement model.
Bridged Expert Intuition and Engineering Precision
Transformed subjective instructor judgements into structured, comparable data, enabling developers to translate human expertise into measurable algorithmic logic.
Established a Scalable Calibration Framework for Future Metrics
Created a repeatable method for rating and comparing movement data, improving inter-rater consistency by 70% in pilot testing and setting a new benchmark for technical collaboration within the product team.
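One simple way inter-rater consistency of this kind could be quantified is the share of clips on which all instructors land within a single 10% band of one another. The sketch below is illustrative only, with made-up data; it is not the measure or the numbers from the pilot.

```python
def within_band_agreement(scores_by_clip: dict[str, list[int]], band: int = 10) -> float:
    """Share of clips where every instructor's score falls within a
    `band`-point spread of the others (a simple stand-in for consistency)."""
    agreeing = sum(
        1 for scores in scores_by_clip.values()
        if max(scores) - min(scores) <= band
    )
    return agreeing / len(scores_by_clip)


# Hypothetical before/after comparison on three clips:
before = {"clip_a": [40, 70, 50], "clip_b": [80, 60, 90], "clip_c": [30, 30, 40]}
after  = {"clip_a": [50, 50, 60], "clip_b": [80, 80, 90], "clip_c": [30, 30, 30]}
print(within_band_agreement(before))  # ~0.33
print(within_band_agreement(after))   # 1.0
```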
[2. Alpha Testing Process]
1. The Purpose
The purpose of Alpha Testing was to validate the first version of the movement analysis model — ensuring that the AI-generated scores aligned with instructor judgement and the calibration framework established in the Preliminary phase.
2. The Challenge
Even with a solid dataset, there was no guarantee the AI would interpret movement in ways that matched human expertise. If scores felt arbitrary or inconsistent, instructors and learners would lose trust in the system.

3. The Solution
Structured Validation Cycles:
Collaborated with engineers to build a test tool that overlaid joint-tracking skeletons on skier videos.
Synced this with percentage scores per metric, allowing outputs to be paused, replayed, and interrogated.
Ran validation sessions with instructors, comparing AI outputs against the Figma calibration benchmarks and refining the model until alignment was achieved (a sketch of the alignment check follows this list).
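One plausible reading of the alignment target is that a turn counts as aligned when the AI score for a metric lands within 10 percentage points of the instructor benchmark. The sketch below shows that kind of check; the function name and sample scores are illustrative, not the tooling actually used.

```python
def alignment_rate(pairs: list[tuple[float, float]], tolerance: float = 10.0) -> float:
    """Share of (ai_score, instructor_score) pairs that agree
    within `tolerance` percentage points."""
    aligned = sum(1 for ai, human in pairs if abs(ai - human) <= tolerance)
    return aligned / len(pairs)


# Hypothetical scores for four turns on one metric:
pairs = [(62.0, 60.0), (48.0, 70.0), (85.0, 80.0), (33.0, 40.0)]
print(f"{alignment_rate(pairs):.0%}")  # 75%
```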

[2. Alpha Testing Impact]
Achieved 75% Alignment Between AI and Instructor Scores
Led structured validation cycles using a custom joint-tracking overlay tool, refining the model until AI-generated movement scores matched expert evaluations within a 10% variance range.
Built a Transparent Validation Workflow for Cross-Team Trust
Established a process where design, engineering, and instructors reviewed results collaboratively — turning black-box AI outputs into explainable, evidence-based insights that built confidence across stakeholders.
Created a Repeatable Model Validation Process for Future Releases
Documented and systematised the validation method, allowing developers to run future accuracy checks autonomously — accelerating iteration cycles and improving design–engineering alignment.
[3. Beta Testing Process]
1. The Purpose
The purpose of Beta Testing is to evaluate the model in real-world learner contexts — ensuring not only technical accuracy but also educational effectiveness and user engagement.
2. The Challenge
Lab-based validation confirmed the model’s stability, but it remained unclear how learners would respond to feedback in live skiing environments:
Would percentage scores alone motivate improvement?
Would adding AI-generated coaching text enhance learning or overwhelm users?
How would different practice contexts (with an instructor, with friends, solo) shape interaction with the app?

3. The Solution
Phase 1 – Core Feedback Validation:
Test the UI with metric scores only (no AI coaching text).
Focus on validating how learners understand and use raw scores in practice.
Phase 2 – Adding AI Coaching Feedback:
Layer in text-based AI feedback that interprets scores and provides actionable guidance, mimicking instructor-style advice.
Explore how users interact with this feedback, including the ability to chat with the AI coach for clarification and progression tips.
Phase 3 – Field Trials Across Contexts:
Conduct testing in snowdomes and ski resorts, covering three learning scenarios:
In a lesson with an instructor.
Practising with friends post-lesson.
Practising independently without instructor support.
Phase 4 – Longitudinal Testing:
Track how feedback (metrics + AI coaching) influences progression, motivation, and retention over multiple sessions.

[3. Beta Testing Impact]
Designed a Four-Stage Field Testing Framework Across Real Environments
Planned a rollout covering snowdomes, ski resorts, and at-home training, ensuring the system is tested in authentic learner contexts before launch.
Defined Success Metrics for Engagement, Comprehension, and Progression
Established KPIs including feedback comprehension rate, confidence uplift, and session return rate — moving validation beyond accuracy to behavioural and motivational outcomes.
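As a rough sketch of how two of these KPIs might be computed from beta session logs (the event fields and example data are assumptions for illustration, not a defined telemetry spec):

```python
from collections import defaultdict

def session_return_rate(sessions: list[dict]) -> float:
    """Share of beta users who come back for at least a second session."""
    sessions_per_user = defaultdict(int)
    for s in sessions:
        sessions_per_user[s["user_id"]] += 1
    returning = sum(1 for n in sessions_per_user.values() if n >= 2)
    return returning / len(sessions_per_user)


def comprehension_rate(responses: list[dict]) -> float:
    """Share of post-session check-ins where the learner correctly
    restated what the feedback asked them to change."""
    return sum(1 for r in responses if r["understood_feedback"]) / len(responses)


# Hypothetical log excerpts:
sessions = [{"user_id": "u1"}, {"user_id": "u1"}, {"user_id": "u2"}]
responses = [{"understood_feedback": True}, {"understood_feedback": False}]
print(session_return_rate(sessions))   # 0.5
print(comprehension_rate(responses))   # 0.5
```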
Prepared a Longitudinal Study to Measure Skill Improvement Over Time
Designed a data collection plan to track learner progression over multiple sessions, assessing how AI feedback and coaching text influence retention and skill mastery — shaping future product iterations and growth strategy.
[Key Learnings]
Bridging Subjective Expertise with Structured Data
I learned how to translate domain-specific human judgement (from ski instructors) into structured, repeatable data models — a skill that deepened my ability to align expert knowledge with technical implementation.
Designing Scalable, Multi-Phase Validation Frameworks
This project strengthened my ability to plan and execute staged research strategies — from calibration to alpha testing to real-world beta trials — ensuring every design decision was evidence-based and scalable.
Human–AI Design Requires Deep Contextual Understanding
I developed a stronger grasp of how to design AI feedback that’s not only accurate but trustworthy, educational, and adaptable to different user contexts — a key competency for future-facing UX leadership roles.

