Methodology
Most scoring tools give you a number. FirstPass gives you the number and shows every moment that produced it — tagged, traceable, calibrated against human judgment. Here's how it works.
01 The premise
A discovery call lasts 45 minutes and contains roughly 200 turns of speech. Most scoring tools collapse all of that into a single number — sometimes a letter grade, sometimes a 1-to-10 rating, occasionally a four-quadrant rubric.
The trouble is that the number doesn't tell the rep anything they can act on. A 6.4 isn't actionable. "Strong discovery, weak close" is closer, but it still doesn't get to the actual move that worked or the one that didn't.
We started from the opposite direction. Instead of asking "What's the score for this call?", we asked "What specifically happened in this call that's worth noticing?" The answer turned out to be a lot of specific things: a strategic question asked, a buying signal missed, a price objection acknowledged but not diagnosed. We made those specific moments the unit of measurement.
The score follows from the tags. Not the other way around.
Everything below explains how that works — the taxonomy, how tags become a score, the role of human annotation, and what happens when a team needs the system tuned to their specific motion.
02 The taxonomy
70 sales-rep tags organized by call stage. 40 customer tags organized by what the customer is doing in response. The taxonomy is the same backbone for every call type — but tag weights and presence differ. A tradeshow chat doesn't get scored on Decision Criteria Explored the way a discovery call does.
Tags marked with a filled cyan dot are scorable — they move the score up or down. Tags marked with an empty dot are diagnostic — observed and recorded for context, but they don't move the score.
The customer-side taxonomy is shorter and almost entirely diagnostic by design. The customer's behavior tells you whether the call is going somewhere — but it's the rep we're scoring. The exceptions are tags that act as outcome indicators: a customer sharing business objectives, asking strategic questions, or expressing genuine enthusiasm about the solution. Those are signals that something the rep did landed.
03 How scoring works
Of the 110 tags, only 29 are scorable. The rest are diagnostic — observed and recorded, but they don't move the score up or down.
The reason is straightforward: not every observation is a quality signal. Open-Ended Question is descriptively useful but isn't itself a quality marker. A rep who asks ten open-ended questions — none strategic, none layered, none demonstrating business acumen — would score low on Discovery despite high open-ended-question volume. The tag is still applied, still useful for analysis, but it doesn't affect the score. Scorable tags are the moves a coach would notice: Strategic Question Asked, Layered Follow-Up Question, Gap Impact Quantified, Effective Objection Resolution.
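The scorable/diagnostic split can be pictured as a flag on each tag. The following is a minimal sketch, not FirstPass's actual data model: the `Tag` class, the `TAXONOMY` dictionary, the `split_tags` helper, and the stage assignments are all assumptions made for illustration. Only the tag names and their scorable status come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    name: str
    stage: str      # assumed stage assignment, for illustration only
    scorable: bool  # True: moves the score; False: diagnostic only


# Hypothetical slice of the taxonomy. The scorable flags follow the text above.
TAXONOMY = {
    t.name: t
    for t in [
        Tag("Open-Ended Question", "Discovery", scorable=False),
        Tag("Strategic Question Asked", "Discovery", scorable=True),
        Tag("Layered Follow-Up Question", "Discovery", scorable=True),
    ]
}


def split_tags(applied):
    """Split applied tag names into (scorable, diagnostic) lists."""
    scorable = [name for name in applied if TAXONOMY[name].scorable]
    diagnostic = [name for name in applied if not TAXONOMY[name].scorable]
    return scorable, diagnostic


# Ten open-ended questions: all recorded, none of them able to move the score.
scorable, diagnostic = split_tags(["Open-Ended Question"] * 10)
print(len(scorable), len(diagnostic))
```

In this sketch the ten open-ended questions stay available for analysis in the diagnostic list, while the scorable list that feeds the score is empty.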
Once tags are applied, the score is computed in 3 steps:
1. Each call type has its own weight profile. A formulary review weighs Use of Evidence or Proof Point heavily; a tradeshow chat barely touches it.
2. Calls don't reach every stage. The score normalizes for stages actually present rather than penalizing the absence of stages that aren't relevant.
3. The aggregate is calibrated against thousands of human-annotated calls so a 4.0 means roughly the same thing across call types and over time.
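The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production formula: every weight, the tag-to-stage mapping, the "Presentation" stage name, the linear calibration, and the 1-to-5 reporting range are invented for the example.

```python
from collections import defaultdict

# Hypothetical weight profiles per call type (all numbers invented).
WEIGHTS = {
    "formulary_review": {
        "Use of Evidence or Proof Point": 3.0,
        "Strategic Question Asked": 1.0,
    },
    "tradeshow_chat": {
        "Use of Evidence or Proof Point": 0.25,
        "Strategic Question Asked": 1.0,
    },
}

# Hypothetical tag-to-stage mapping.
TAG_STAGE = {
    "Use of Evidence or Proof Point": "Presentation",
    "Strategic Question Asked": "Discovery",
}


def weighted_stage_totals(call_type, tag_counts):
    """Step 1: apply the call type's weight profile, grouped by stage."""
    totals = defaultdict(float)
    for tag, count in tag_counts.items():
        if tag not in WEIGHTS[call_type]:
            continue  # tags without a weight in this profile contribute nothing
        totals[TAG_STAGE[tag]] += WEIGHTS[call_type][tag] * count
    return dict(totals)


def normalize(stage_totals, stages_present):
    """Step 2: average over the stages the call reached, ignoring the rest."""
    values = [stage_totals.get(stage, 0.0) for stage in stages_present]
    return sum(values) / len(values) if values else 0.0


def calibrate(raw, slope, intercept, lo=1.0, hi=5.0):
    """Step 3: map the raw aggregate onto an assumed reporting scale.

    In practice slope and intercept would be fitted against human-annotated
    calls; here they are passed in as plain, made-up numbers.
    """
    return max(lo, min(hi, slope * raw + intercept))


counts = {"Use of Evidence or Proof Point": 2, "Strategic Question Asked": 3}
stages = ["Discovery", "Presentation"]
for call_type in WEIGHTS:
    raw = normalize(weighted_stage_totals(call_type, counts), stages)
    print(call_type, calibrate(raw, slope=0.5, intercept=1.0))
```

With identical tag counts, the formulary review comes out higher than the tradeshow chat because its profile weighs evidence more heavily. A stage that is absent from `stages` is left out of the average rather than counted as zero.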
The output is a number, but what matters is that the number is traceable. Every score links to the specific tags that produced it, and every tag links to the moment in the transcript where it was applied. Nothing is opaque. A rep whose Discovery score dropped from 4.2 to 3.6 can see exactly why: Strategic Question Asked went from 6 occurrences to 2, Premature Solution Offering appeared, and Decision Criteria Explored was missed.
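One way to picture this traceability is to store each applied tag alongside the transcript turn it points to, then compare tag counts between two calls. The sketch below reproduces the Discovery example; the `TagEvent` record, the `tag_deltas` helper, and the turn numbers are hypothetical and not part of FirstPass's documented interface.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TagEvent:
    tag: str
    turn: int  # index of the transcript turn the tag was applied to


def tag_deltas(before, after):
    """Return {tag: (count_before, count_after)} for tags whose count changed."""
    b = Counter(event.tag for event in before)
    a = Counter(event.tag for event in after)
    return {tag: (b[tag], a[tag]) for tag in sorted(set(b) | set(a)) if b[tag] != a[tag]}


# Turn numbers are invented; the counts mirror the example in the text.
earlier_call = [TagEvent("Strategic Question Asked", turn) for turn in (4, 11, 19, 27, 40, 52)]
earlier_call.append(TagEvent("Decision Criteria Explored", 61))

later_call = [TagEvent("Strategic Question Asked", turn) for turn in (6, 33)]
later_call.append(TagEvent("Premature Solution Offering", 15))

for tag, (was, now) in tag_deltas(earlier_call, later_call).items():
    print(f"{tag}: {was} -> {now}")
```

Because every `TagEvent` keeps its turn index, each changed count can be followed back to the exact moments in the transcript.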
04 Annotation
The reason most AI scoring is unreliable isn't that the AI is bad at recognizing tags — it's that nobody's checking the AI's work against ground truth. LLM-only scoring drifts. The same call scored twice on different days can come back differently. Without a feedback signal, the engine has no way to know it's wrong.
FirstPass maintains a continuously expanding corpus of calls annotated by humans — sales coaches, methodology specialists, and trained annotators — who tag every moment manually against the same taxonomy. The AI's tagging is then evaluated against that ground truth. Mismatches feed back into prompt engineering to handle ambiguous cases. This calibration loop runs continuously, which is why scoring accuracy improves over time rather than drifting.
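The text does not say which agreement metric is used to compare AI tags with human annotations. As one plausible illustration, the sketch below computes per-tag precision and recall over (turn, tag) pairs. The `tag_agreement` function and the sample data are assumptions made for the example.

```python
def tag_agreement(human, model):
    """Compare two sets of (turn, tag) pairs.

    Returns {tag: (precision, recall)}, treating the human set as ground truth.
    """
    results = {}
    for tag in sorted({tag for _, tag in human | model}):
        h = {pair for pair in human if pair[1] == tag}
        m = {pair for pair in model if pair[1] == tag}
        hits = len(h & m)
        precision = hits / len(m) if m else 0.0
        recall = hits / len(h) if h else 0.0
        results[tag] = (precision, recall)
    return results


# Invented annotations for a single call.
human = {
    (3, "Strategic Question Asked"),
    (9, "Strategic Question Asked"),
    (14, "Premature Solution Offering"),
}
model = {
    (3, "Strategic Question Asked"),
    (14, "Premature Solution Offering"),
    (20, "Premature Solution Offering"),
}

for tag, (precision, recall) in tag_agreement(human, model).items():
    print(f"{tag}: precision={precision:.2f} recall={recall:.2f}")
```

Low recall on a tag means the AI is missing moments that humans tagged; low precision means it is applying the tag where humans did not. Either pattern marks the kind of mismatch that would feed back into the calibration loop.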
05 Co-definition
The 110-tag taxonomy covers the moves that recur across most B2B sales motions. For most teams, it's enough as-is.
For teams with specific methodologies — MEDDPICC, SPICED, Sandler, Challenger, and many more — the taxonomy extends. We work with your sales operations and enablement leads to map your methodology onto existing tags, identify gaps, and add tags that capture moves specific to your motion. Scoring weights get tuned alongside the taxonomy.
A typical co-definition engagement runs 3–5 weeks:
1. Review your existing methodology, scoring rubrics, top-rep call recordings, and field-leadership feedback. Identify moves that don't yet map to existing tags. Output: a gap analysis with proposed additions.
2. Write new tags with explicit definitions, pass criteria, and example moments. These are added to your team's instance of the taxonomy without affecting other customers.
3. Calibrate the AI customer's behavior to push back the way your real customers do. Industry knowledge is loaded in: the regulations they care about, the objections that actually slow deals down.
4. Annotate a sample set of your team's existing calls against the extended taxonomy. This becomes the ground truth for the AI's tagging in your instance.
5. Reps practice against the calibrated system. The first two weeks include light review of scoring outputs to catch any drift; after that, the engine runs against your team's standard.
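Adding tags to one team's instance without touching anyone else's can be pictured as a per-team layer over the shared taxonomy. This is a minimal sketch using Python's `collections.ChainMap`, not a description of how FirstPass stores its data: the tag definitions, the `team_taxonomy` helper, and the "Economic Buyer Identified" tag are hypothetical.

```python
from collections import ChainMap

# Shared base taxonomy (definition text invented for illustration).
BASE_TAXONOMY = {
    "Decision Criteria Explored": "Rep asks how the buyer will evaluate options.",
}


def team_taxonomy(extensions):
    """Layer a team's tags over the shared base without mutating the base."""
    return ChainMap(dict(extensions), BASE_TAXONOMY)


# Hypothetical methodology-specific tag added for one team only.
acme = team_taxonomy({"Economic Buyer Identified": "Rep confirms who controls budget."})

print(sorted(acme))           # base tags plus the team's additions
print(sorted(BASE_TAXONOMY))  # shared base is unchanged
```

Lookups check the team layer first and fall back to the base, so a team sees the full taxonomy while its additions stay invisible to every other instance.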
The outcome is FirstPass calibrated to how your team specifically sells. What each role sees day-to-day:
For the rep
After every call: a tagged transcript with strengths and gaps, coaching tied to specific moments, and a score that breaks down to specific tags. Over time, a personal trend — skills improving, patterns identified.
For the manager
Tag-level visibility into the team — which reps consistently miss Decision Criteria Explored, who's improving on objection handling, where coaching time will return the most.
For sales ops & enablement
Tag definitions, scoring weights, AI persona profiles, call type structures — all editable, all versionable. Co-definition produces documentation that becomes part of how your team understands what good looks like.
Your reps are developing their skills on live calls with real revenue on the line. There's a better way.