How to Navigate a World of Exploding Measures and Estimates

The number of, well, numbers we track during training is exploding, but they’re not all made equal. Some represent actual measurements while others are just estimates. We discuss the implications.


Episode Transcript

Trevor Connor 00:04

Hello and welcome to Fast Talk: your source for the science of endurance performance. I’m your host Trevor Connor here with Dr. Stephen Seiler doing co-hosting duties. We can debate the value of various technological changes over the last half century but, hands down, the biggest evolution in training is the shift from the train-purely-by-feel approach of the 60s and 70s to having numbers to track not only every moment of your workout, but basically every moment of your day. Believe it or not, there was a time not all that long ago when a device that just showed your speed was revolutionary. What’s important to understand is that these numbers or training metrics aren’t all made equal. Metrics and measures don’t mean the same thing. Some things are actually measured, such as heart rate and cadence. But many metrics, including some of the most exciting new numbers, are actually estimates, such as sleep scores or training stress. These estimates are calculations often based on assumptions or not fully validated correlations, which raises serious questions about their validity. For example, how accurately can we truly analyze something like sleep based on changes in HRV, especially when you question how accurately HRV is being measured using light sensors on your wrist? If you’re wondering how complex this issue of measures versus estimates gets, remember, even power is not a measure. It is actually a calculation based on cadence and torque that does require some estimation. Here to help us navigate this tricky but fascinating subject are none other than physiologist Dr. Stephen Seiler and founder of HRV For Training Dr. Marco Altini, who’s a key science advisor for the Oura rings. Together, we’ll explain the differences between measures and estimates and how both can have their issues. We’ll discuss how most metrics try to get at the concepts of load, stress and strain. Then we’ll dig a little deeper into what Dr. Altini calls “known and unknown metrics”, those that we can and can’t validate. We’ll then shift gears to talk about the psychological impacts of these various metrics, particularly sleep ratings, on athletes and their performance. Finally, we’ll round out the conversation with ways to differentiate not only good measures from bad measures, but more importantly, good estimates from bad estimates. Some estimates are actually pretty valuable, but be careful when marketing teams start getting involved in deciding what metrics to include. During the conversation, we’ll hear from elite cyclist and author of “How to Become a Pro-Cyclist” Jack Burke, physiologist and Examine.com writer Brady Holmer, Tri Doc podcast host Dr. Jeff Sankoff, and elite cyclist and coach Taylor Warren. So think critically about how you want to measure your fast and let’s make you fast.

Chris Case 02:35

Today’s episode of Fast Talk is brought to you by Alter Exploration created by me, Fast Talk Labs co-founder Chris Case. Alter Exploration crafts challenging, transformative cycling journeys in some of the world’s most stunning destinations. A mantra is a powerful tool used to focus your mind on a particular goal and create calm during challenging situations. Our mantra? “Transformation begins where comfort ends.” This mantra isn’t meant to be intimidating, on the contrary, it should be invigorating. For many people everyday life is filled with convenience, monotony, and lack of time spent in nature. Alter Exploration facilitates the exact opposite – challenging, invigorating, life altering experiences in the natural world. Alter’s journeys aren’t so much a vacation as an exploration of you and the destination. At the end of every day, be preoccupied as much by the transformative experience, as by the satisfaction of exhaustion. Life. Altered. Learn more about my favorite adventure destinations and start dreaming at alterexploration.com.

Trevor Connor 03:41

Welcome, Marco, welcome Dr. Seiler, to the episode. I’m kind of excited about this because, Dr. Seiler, this was an episode that you brought up to us that you really wanted to do. I know this is something you’re a little bit passionate about. So I kind of see this as you’re the host of this episode, and I’m the co-host. How do you feel about that?

Dr. Stephen Seiler 04:00

Hey, well, we’ll see how it goes, but you’re right – the background for this is two areas. One, I came back into this research, you know, full speed and had to catch up a bit a couple of years ago, a few years ago, and I was just seeing all these metrics, all these amazing numbers for different things like training load, and I was like, “Wow, where’d that come from? Has that been validated?” So a lot of metrics, you know, because of digital tools, we live in this amazing time when we’re able to measure things we’d never have been able to measure before, but we also are trying to measure things that we’re not actually able to measure. And so, Marco Altini, who is with us today and has the wonderful HRV For Training application, he tweeted not so long ago a very basic thing. He said, “Look, a lot of these watches, they measure one thing but they estimate 10 more, or estimate many things”, you know, so he was basically saying “we measure things and sometimes we estimate”, and I think a lot of our listeners, a lot of cyclists, a lot of runners, they may be easily fooled into believing that the estimates that they’re getting on their watches are more precise than they really are. So that’s kind of the starting point for this: what are we measuring, what’s useful, a little bit about some of the catch words like validity and reliability, and then we’ll take it from there.

Why Do We Monitor Our Training?

Trevor Connor 05:06

Before we dive into some of this complexity, let’s go to some real basics here and just ask a simple question: “Why do we monitor our training?”

Dr. Stephen Seiler 05:33

I think of it this way: the first thing I think about when I think of measurement is, if I, as a coach, tell my daughter “I want you to do four times eight minutes today” or “I want you to keep it easy today”, I want to see whether or not her execution, or the athlete’s execution, matches the prescription. That is simple, but it’s surprisingly important – point one. And then that creates a starting point for individualization: as a coach, or self-coached, you know, coaching myself, I can make adjustments, I can individualize things, because I start to see how my body responds to certain prescriptions and execution of training – so that’s point two. And then third, building on that again, I can detect deviations. You know, if my heart rate is running low or running high after workouts or a period of high load, that tells me something, and hopefully, I can make some adjustments. Very often the adjustments involve rest or reduced loads, because often the telltale signs are associated with having pushed too hard or too long, and so forth, or not having had any rest. So that’s the third point, just being able to make adjustments early enough so that it doesn’t become a big problem. And then finally, I think it’s more about institutional knowledge. Whether your institution is a huge team of pro cyclists or me as just a coach of my daughter, I’m trying to build a library of understanding that helps me to coach both those specific athletes and future athletes.

Trevor Connor 07:11

I found that really interesting, because we had Dr. Coggan and Hunter Allen on the show, and we were talking about training zones with them. And that was something they brought up, which was a big misinterpretation of training zones. They said “training zones were never really meant for analysis”, to say “you had this exact physiological effect, because you were in this zone”. They said they were designed to be prescriptive, they were a communication tool, they were a way to make it easier for a coach to say to an athlete, “here’s how to do this particular workout”. It was never meant to be an “if you’re at 98% of threshold, you’re getting this training adaptation, but if you’re at 102% of threshold, you’re getting a different training adaptation”. I heard you saying a lot of that – it just helps you coach and guide your daughter, it gives you that tool for communication.

Dr. Stephen Seiler 07:59

Well, yeah, not just me. I think that’s one of the biggest success stories – for example, in Norwegian endurance sport, which has kind of punched above its weight class in endurance for some decades – is just having the same intensity scale that everybody kind of understands the same way – you know, if they say zone four, we all know what that means, you know, and there’s some sport specificity to it – but it’s universal enough that we have a good starting point for interaction, for communication. The other thing is, of course, I agree, and I think, number one, from a stimulus-for-adaptation standpoint, there is so much overlap among these various training zones in terms of generating a stimulus for altered protein synthesis. You know, getting down into the rabbit hole of what’s happening in the cells, the muscle cells are not clearly distinguishing zone two from three from four; there’s overlap. But maybe what we’re really using the intensity zones for is to manage stress responses, to manage “how is the body recovering”, making sure that that flow is sustainable. So that’s been my kind of 25-year take-home on all this: no, I don’t think I’m controlling precisely the stimuli, but I’m doing a better job of controlling the ebb and flow of stress and trying to keep on the plus side for the athlete over time.

Trevor Connor 09:30

Marco, what are your thoughts on this?

Marco Altini 09:31

Yeah, maybe I would add that, outside of what Stephen was mentioning about monitoring training itself, I think the other aspect that we consider linked to monitoring training is monitoring not the session, but what happens to the body after the session, and I think this links back to a lot of the aspects we want to discuss in terms of what we are measuring and what we are estimating – for example, in terms of the body’s response to the session, right? So we could look at data collected during the session and maybe compare what we had prescribed with how the athlete responded, but we can also look at the data collected after the session through different technologies, apps and wearables that athletes are now using a lot, devices that you can use first thing in the morning or in the night to measure our physiology, and try to see if these measurements accurately reflect the body’s response to the stimulus. If we have a response that is not what we expect, that is also an indication that maybe the stimulus was not appropriate for an athlete at a given point in time, or that maybe there were other stressors that were playing a role, right, because when we isolate training and look only at training data, sometimes we forget that a lot else is happening. Maybe we’re traveling, maybe there’s some work-related stress or some other things that will also have, in a way, an impact on our ability to assimilate the training stressor and respond positively to it. So as we start looking at monitoring the response through different technologies and devices, I think it can get a bit confusing, as many of these devices come up with numbers and scores and estimates that are not necessarily even things that actually exist, that you can measure with other devices, right? So, some are sort of made up, you know, readiness, recovery scores, things like that.
I think it’s important, maybe later, that we frame these as something in a bit of a different category with respect to actually looking at the physiology, which could be just your resting heart rate, or your heart rate variability, or things that you’re actually measuring, as opposed to things that you are estimating and that you cannot even, in some cases, possibly validate, because they are not quantities that have a reference device.

Categories of Training Data

Dr. Stephen Seiler 11:58

It’s useful to have a couple of frameworks here that kind of give us some pegs to hang things on. One framework would be, for me, this idea that in this monitoring process, we kind of have this triangle or triangulation where we’ve got some things we can measure that tell us just what was done, the external load. For the runners, that is distance and time, you know, how far did you run? How long did it take to do so? You can get an average velocity, you can get a number of kilometers, you can get pace and those kinds of characteristics; the cyclists can get power times duration. And those are basic things that are measurable. There is essentially a gold standard for power, and there are even ISO industrial standards for instrumentation for some of these things that ensure at least some degree of precision in those measurements. You cannot sell certain products unless they are able to measure correctly. The treadmill has to provide you with a reasonably accurate measure of treadmill speed or you can’t sell it. So we have some protection on those kinds of measures. Then we go over to some physiology, and we’ve got our old standby of heart rate, which I think, you know, we can say fairly confidently that “yes, we can measure that”. But there are caveats in terms of the ECG standard, meaning the electrical, the belt, versus the photoplethysmography on the wrist. And there, there are methodological issues and user issues – the user just uses it wrong. So there are problems there, but we know what the problems are, and we can fix them, for the most part. So heart rate can be a very useful tool, and it’s valid and it’s reliable, if we do it right. We’ve got lactate. Lactate’s tricky to measure; you’ve got to do it right, you have to have skills, especially in the field, you know, so people make a lot of mistakes. But it’s a technology that works; it just depends on user skills.
We’re getting online with some other things like breathing with these shirts. We can measure perceived exertion, we can ask people how they feel, and that puts us over in that other category, which is the perceptual. So external, internal physiology and internal perception – those are kind of three main categories of data, at least in the training process itself. You know, we can measure what we actually did, we can measure how we responded to it, there and then. And then we get to where Marco’s taken us, which brings up another framework, which is this idea that engineers use, and that’s load, stress, strain. Load is just what you do. Stress is how you respond to it there and then, and then strain is – at least in a biological context – I would argue that the strain is lingering effects that don’t go away after 24 hours, that there is somewhat of a lingering negative consequence or fatigue or a change in heart rate responses, a change in heart rate variability, a change in readiness to train or some other perceptual measure of “how do I feel”. So that takes us to that post-training, that something is changing and “I’ve got to think about this, do I need rest, do I need to ease up a bit”, you know – so that’s kind of a framework that I work with, load, stress, strain. I stole it from engineers. But I find it kind of useful in this forest of data.

Trevor Connor 15:39

And I love the analogy you use for it, because it makes it really clear: you have this plank of wood that’s sitting across two beams, and you put a large weight or a brick or something on that piece of wood. So that brick is the load, and the wood’s response trying to hold up that brick is the stress, and over time, you might see a bending in the board or the piece of wood, and that’s your strain.

Dr. Stephen Seiler 16:06

And then you take away the load – and of course the strain disappears – but it may not disappear immediately. That, you know, that wood may be bent, and it slowly returns to its starting point. So there are some analogies that are kind of useful. They’re not perfect. There’s no model that’s perfect. But I find the load, stress, strain from engineers to be somewhat useful to me, to try to categorize some of these variables that we’re measuring and say “where are we in this process?” And I think there are some of these measures that we use that maybe are misplaced – I’ve often harped on Training Stress Score from Training Peaks, and I love the guys at Training Peaks, but I don’t think that’s appropriately named, because it can’t really measure stress. It just measures what you’ve done in a kind of a calibrated way. So I would say that should just be a load score, you know, and then we’d be interested in saying, “All right, what’s the stress response to those loads?”
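
Editor's sketch: the point about Training Stress Score can be seen from the commonly published formula, which uses only power, duration and a threshold (FTP). Nothing in it comes from heart rate, breathing or how the athlete felt. The function name and the example rider below are illustrative assumptions, not something discussed in the episode.

```python
def training_stress_score(duration_s: float, normalized_power_w: float, ftp_w: float) -> float:
    """Commonly published TSS formula.

    TSS = (seconds * NP * IF) / (FTP * 3600) * 100, where IF = NP / FTP.
    Every input is external load (power and time) scaled by a threshold.
    """
    intensity_factor = normalized_power_w / ftp_w
    return (duration_s * normalized_power_w * intensity_factor) / (ftp_w * 3600) * 100


# Hypothetical rider with a 250 W threshold (an assumption for illustration).
one_hour_at_threshold = training_stress_score(3600, 250, 250)
two_easy_hours = training_stress_score(7200, 150, 250)

print(round(one_hour_at_threshold, 1))
print(round(two_easy_hours, 1))
```

Two riders who produce the same power for the same time get the same number here, whatever their bodies went through to produce it, which is why it reads more naturally as a load score.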

What Power Does and Doesn’t Say

Trevor Connor 17:07

Power is a very direct and clear measure of load: 300 watts is 300 watts. But what it doesn’t necessarily capture in the numbers is the subtle differences in the stress experienced for what appears, by the numbers, to be the same load. Here’s Jack Burke to explain.

Jack Burke 17:24

So this is something that took me so long to figure out in my career, and it was such a game changer when I did. And so going back to, like, power, like it’s a measure of torque, right? And so there’s a very big difference in how you make power at 50 kilometers an hour versus 8 kilometers an hour. So for me, when I was a junior, I lived in Toronto, we didn’t have any climbs, so I would do all my intervals on the TT bike, and I got very good at time trialing. When I moved to the West Coast, suddenly we had all these mountains and I just wanted to go do my efforts there. I started doing all my efforts on the climbs and now suddenly I can’t put out the same power on the flats. So if you want to be able to put out the same watts on the climb versus the flats, you have to train an equal amount of time and intensity on the flats versus the climbs, and it all comes down to inertia – like the way your muscles fire at 50 kilometers an hour compared to 8 kilometers an hour is completely different. So knowing how to train depending on what you’re training for – if you’re training to be a climber versus a time trialist versus a sprinter, something like that – that was just something I never factored in, because I always just thought “watts are watts, power is power”. But it actually matters how fast you’re going, because the way the muscles fire is different – and it’s like you have the engine already, you just need to teach the transmission to fire differently, you need to give the transmission a different set of gears. And that’s why, like, motor pacing can be really effective, because you need to teach the muscles to fire at race speeds.
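
Editor's sketch: a power meter reports power as crank torque multiplied by angular velocity, so the same 300 watts can come from very different combinations of force and leg speed. The example below uses cadence as a simple stand-in for the climb-versus-flat difference described above; it does not model inertia or road speed, and the function names and cadence values are assumptions for illustration.

```python
import math


def crank_power_w(torque_nm: float, cadence_rpm: float) -> float:
    """Power = torque * angular velocity, with cadence converted to rad/s."""
    angular_velocity_rad_s = 2 * math.pi * cadence_rpm / 60
    return torque_nm * angular_velocity_rad_s


def torque_for_power_nm(power_w: float, cadence_rpm: float) -> float:
    """Average crank torque needed to hold a given power at a given cadence."""
    angular_velocity_rad_s = 2 * math.pi * cadence_rpm / 60
    return power_w / angular_velocity_rad_s


# Same 300 W, two hypothetical cadences (assumed values, not from the episode).
torque_low_cadence = torque_for_power_nm(300, 60)
torque_high_cadence = torque_for_power_nm(300, 95)

print(round(torque_low_cadence, 1), "N·m at 60 rpm")
print(round(torque_high_cadence, 1), "N·m at 95 rpm")
```

At 60 rpm the rider needs roughly 48 N·m of average torque, against roughly 30 N·m at 95 rpm, even though the head unit shows 300 watts in both cases.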

Trevor Connor 18:43

You and I have talked about this a lot. And if I was looking at the stress response to that, we’ve often looked more at things like what’s the cardiac drift over the course of the workout –

Dr. Stephen Seiler 18:52

Right.

Trevor Connor 18:52

And there’s where you’re seeing the body’s response.

Dr. Stephen Seiler 18:55

We’re seeing ventilatory drift is an even stronger measure, you know, so we’re coming online with some new – you know, with taking ventilation out of the laboratory and into the field, and that relationship between heart rate drift and ventilatory drift is really interesting. But we’re not ready to make a variable, we’re not ready to say “Well, here’s the Seiler breathing index” – we’ve got more work to do! But it’s tempting, because it’s so easy with digital tools to just make up a new thing, you know, all we’ve got to do is divide a numerator by a denominator or multiply this times that or take the fourth power and then the fourth root of something, and we’ve got a new number. So this is dangerous, I think, and it’s part of our issue here today.
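
Editor's sketch: heart rate drift, mentioned above as one view of the stress response, is often summarized as the percentage change in average heart rate between the first and second half of a steady effort. This is one simple definition among several, not necessarily the one the speakers use, and the sample values are made up for illustration.

```python
from statistics import mean


def heart_rate_drift_pct(heart_rate_bpm: list[float]) -> float:
    """Percent change in mean heart rate from the first half to the second half.

    Assumes the samples are evenly spaced and the external load was held constant.
    """
    if len(heart_rate_bpm) < 2:
        raise ValueError("need at least two heart rate samples")
    half = len(heart_rate_bpm) // 2
    first_half = heart_rate_bpm[:half]
    second_half = heart_rate_bpm[half:]
    return (mean(second_half) - mean(first_half)) / mean(first_half) * 100


# Made-up samples from a steady effort (illustrative only).
samples = [140, 141, 142, 143, 146, 148, 150, 152]
drift = heart_rate_drift_pct(samples)
print(round(drift, 1))
```

With these made-up samples the drift is about 5.3 percent. The arithmetic is trivial, which is the warning in the conversation: a number this easy to compute still needs validation before it earns a name.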

Trevor Connor 19:43

So before we go there one thing I just want to say in response to everything you’ve just told us, I think you’ve made a really important point that I really want the listeners to remember as we continue this conversation, which is we tend to think “well, we got these metrics” and you go into whatever software platform you use – Training Peaks, Golden Cheetah, WKO, and you see all these different metrics and you think they’re all made equally, but they’re not – and you just brought up there’s different categories and there’s a different quality of those metrics, so you raise the fact that there’s some that are external load and some that are internal response. We did do an episode on this, I’ll put it in the show notes, but as you said, you know, power, distance – those are external. That’s what you’ve done. Doesn’t say anything of what’s going on in your body. Heart rate – that’s an internal metric – but you’ve just raised another thing, which is the direct measure versus estimates. And Marco, I think you had a few other categories to bring up, which was the known and unknown parameters, and also the health readiness versus training vectors that just describe what happened. Do you want to quickly explain what you mean by those two categories?

Measurements versus Estimates and Unknown versus Known Parameters

Marco Altini 20:49

Yeah, for sure. I think that the first distinction we can make is measurements and estimates – and sometimes it’s maybe easier to look at wearable metrics, because here’s where we have more of this. When we look at training data and everything we talked about so far, a lot of the things we look at are actually measured, right, so we measure power in cycling, we measure heart rate, we measure distance. During training, there’s a fair amount of measuring and maybe less estimating. When it comes to the response to the stressor, then I think we get into a bit of a more complex way of assessing this kind of response, in that we have some measurements, but then, in the past few years, we have seen a lot of estimates coming up from wearables – the typical ones would be sleep scores, recovery scores, readiness scores. And the way I further classify this, apart from the measurements and estimates – and again, the measurements here would be the things that your wearable can measure because there is a dedicated sensor in the device that can measure that parameter. For example, if you have an optical sensor, you know, there’s a green light that you see there and there is a detector that is going to measure the light that is transmitted or reflected, then you have a sensor there that is measuring changes in blood volume, and thus measuring your pulse rate, and therefore your heart rate – so that is something that has been designed for that job and it can do that job. It does not mean that it does it perfectly, right? Context matters – measurements are not perfect. So, at rest, it might read very well; as you move around and maybe move your wrist and things like that, there will be added noise and the measurement might become inaccurate. But there is a sensor there that is designed to measure that parameter. Outside of pulse rate, heart rate and heart rate variability, almost nothing is measured, right?
The devices provide a lot of additional parameters to track your behavior or your response, and those I’d like to distinguish between two groups, as you were mentioning: the known and the unknown parameters. That is simply something that allows us to distinguish between the things that we can actually validate with reference systems – and these will be the known parameters, things like calories or sleep stages, right, we can get indirect calorimetry or direct calorimetry and measure calories and then we can see what our body is providing, or we can get polysomnography and measure brainwaves and look at what that is providing with respect to our wearable that is measuring sleep stages – but there are other parameters that are not something we can measure, even with another device or reference system. And those will be the unknown parameters. And there’s a lot of these now, right? There’s stress estimates, recovery, readiness, sleep scores – and all of this is sort of made up – so that is, I think, one of the most challenging things, because it’s also difficult to evaluate. It’s easy for people that maybe have invested in the device to think that this is working because we want it to work, right – maybe we pay a subscription every month, you know, there’s a lot of marketing that tries to convince us that it is working – but there is really no reference for that. And I think we need to be a lot more careful with these kinds of estimates and what the implications are in terms of how we might try to adjust training or assess the impact of training based on these parameters. And to get there, to try to do that a bit better, I think we really need to understand how these are built, so that we understand their limitations and we can either try to use them a bit more effectively.
Or we simply decide that we ignore those and we use the wearable to look at the actual physiology, since it is measuring it, it is providing it to us, and that is probably what it can do best – a measurement of your resting physiology, as opposed to building maybe things on top of that that might be tricky to evaluate or not particularly relevant for athletes. I will maybe just say one thing about this: in the context of people interested in exercise, or athletes, or coaches, or people that use wearables to understand how training is going, I think it is particularly problematic at times to use these scores, because they combine our physiology and our behavior, and I think that sometimes it’s not too clear. So when an athlete looks at their readiness score or recovery score, they might think, “hey, it is very low, something is wrong in my body”. But that is often just part of what is making up the score. Part of the score is just your behavior, which means that if you sleep a bit less or if you were a bit more active, you will also get a lower score, because the device makes an assumption, based on a generic model that is not dependent on you, that maybe less sleep requires more recovery. But that does not mean that your physiology was impacted negatively; it could actually be perfectly normal – so there was no change, no sign in your body that sleeping a bit less was detrimental for you and that you needed more recovery, but the device will still tell you so because it relies on this generic model that is also using your behavior to determine the output. So I think especially in the context of training and working with athletes, when we do things like manipulate training, or go to an altitude camp, or try different things, we really want to see the body’s response.
We don’t want to see this cumulative score that maybe makes us think that it’s actually more informative just because it’s putting together multiple pieces of information – but, in fact, it’s less informative, because we do not really know “has the body responded negatively” or is this just an assumption that this model is making based on a change in behavior, for example.
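
Editor's sketch: the concern about composite scores can be shown with a deliberately simple, entirely hypothetical readiness formula. No vendor's algorithm is public in this form; the weights, the sleep target and the function name are all assumptions chosen only to show how a behavior term can move the score while the measured physiology stays the same.

```python
def readiness_score(
    hrv_ms: float,
    baseline_hrv_ms: float,
    sleep_hours: float,
    target_sleep_hours: float = 8.0,  # assumed generic target, not personalized
    physiology_weight: float = 0.6,   # assumed weight
    behavior_weight: float = 0.4,     # assumed weight
) -> dict[str, float]:
    """Hypothetical composite: weighted blend of a physiology term and a behavior term."""
    physiology = 100 * min(1.0, hrv_ms / baseline_hrv_ms)
    behavior = 100 * min(1.0, sleep_hours / target_sleep_hours)
    score = physiology_weight * physiology + behavior_weight * behavior
    return {"physiology": physiology, "behavior": behavior, "score": score}


# Same measured HRV on both mornings; only time asleep differs.
normal_night = readiness_score(hrv_ms=60, baseline_hrv_ms=60, sleep_hours=8.0)
short_night = readiness_score(hrv_ms=60, baseline_hrv_ms=60, sleep_hours=6.0)

print(normal_night)
print(short_night)
```

In this toy model the score falls from 100 to about 90 after the short night, even though the physiology term is identical on both mornings. Looking at the components separately answers the question the blended number hides.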

Dr. Stephen Seiler 26:56

I think this is really important, but it also takes us into another aspect, the psychology of measurement. Because if my measurement device tells me “Oh, you didn’t get enough sleep last night, you’re sleep deprived”, I felt fine, but now I don’t feel so fine, because now I’ve been told that I have a sleep deficit from last night – now we’re talking placebo effects. So this is one of the big concerns I have with monitoring: in almost any setting, whether it’s academia or business, the things we measure can easily be gamified, they can easily have psychological impacts on the organization, they can have unintended consequences that are not positive. We perform to the metrics instead of what we really want – “we really want to sell better cars, but we’re measuring some other aspect of it in the sales division and we perform to that metric”, or in research, we have metrics related to publishing, “I really want to impact people to help them train better, but what I’m measuring is how many times I publish in a year”, for example – and those two things may not be related. So we see this everywhere, but in the context of training, I think we do need to be concerned, we need to be somewhat critical of what we’re measuring. And just before we started this podcast, I was speaking with my daughter, and I asked her, you know, some things about her max heart rate and so forth. And then she said, “you know, dad, I don’t use heart rate variability. I think it might be really good, but I just don’t want to spin out of control on the data”. Because she knows she’s a bit OCD, obsessive compulsive disorder, she can easily go down rabbit holes on data – she’s self-aware on that – so she purposefully tries to limit what kinds of information are flowing into her head, and mostly tries to go on feeling.
You know, she just says, “I trust, if I feel like not training, I trust that”, you know, and I said, “Yeah, I think the research actually supports that”. So far, you know, the psychological, kind of perceptual stuff triggers quicker than a lot of the physiology when it comes to those long-term strain type issues. So I do think that we need to also remember, our brains are still pretty darn good. They’re pretty useful if we let ourselves use them, unencumbered by too much data.

Trevor Connor 29:35

A great example of what you’re talking about that we’ve seen is with the Oura rings and the Whoop, where you get athletes buying these and they get a recovery score, they get an estimation of their sleep, and they – as you said, they start gamifying it and they start going, “Oh, I need to get more sleep, I need to get better sleep, I need to get a better Whoop score for my sleep”. And then they get stressed about it and start developing insomnia because they’re not going to bed relaxed, they’re going to bed feeling they have to perform.

Dr. Stephen Seiler 30:03

It’s just a spiral.

Trevor Connor 30:06

So you know, I laugh at this because I have that gene where I’m a short sleeper, so every morning I wake up and Whoop’s like “you didn’t get nearly enough sleep, you now need to catch up”. And I just kind of laugh. But you know, not everybody can do that.

Marco Altini 30:20

Certainly. In the original paper, actually, that introduced the word orthosomnia, right, that we now use as a term to define the people that have this sort of obsession, basically, to optimize the sleep metrics and the wearable metrics in a way that becomes unhealthy for them, the cases were like this. They were people with insomnia that had these wearables, and then they understood quickly that the more time you spend in bed, the higher the score, typically, and so they would end up forcing themselves to stay in bed even more, and that would result in even worse insomnia, right? So we end up getting better scores, and making our health or performance worse by optimizing the scores. So clearly, we have an issue there. And I think maybe there’s a meaningful difference with Stephen’s example, where your daughter does not want to include HRV, which, I would say, is about “do we want to track something or not”, and it can be totally fine not to, and to rely on feel and on what we think works best for us and our psychology. But in that case, at least we are measuring something. I think it’s really crazy, in a way, that when we look at something that is completely made up and has absolutely no relation with health or performance, like a sleep score, that can lead to negative consequences for health and performance, because it’s not some health-related data that we have received, or something that is really wrong in our body; it’s something that is almost entirely made up, and, again, there is no evidence anywhere that the sleep scores are associated with poor health if you have a lower one.

Dr. Stephen Seiler 32:05

Let’s go back to how is sleep being estimated. They’re not putting any kind of a halo on your head to actually measure brainwaves. They are measuring how your hand flops around in the bed, as far as I can tell, you know, so movement of the arm, that’s one indicator that you’re not sleeping, and then I guess perhaps there may be some heart rate associated issues that are brought into the algorithm to try to quantify sleep. But the point is, is that neither one of those are anywhere near actually measuring sleep. And this is exemplary.

Trevor Connor 32:43

Here’s something that’s interesting for you, because I have read a study on this, where they took all these different devices that measured sleep – I believe this was before the Oura ring, so I don’t think the Oura ring was included in this – but if you ever look at, you know, the Garmin watch or the Whoop, they’ll try to tell you how much time you were in deep sleep, how much time you were in REM sleep, and it’ll actually break it down over the course of the night. And in this study that I read, it basically said most of them are not even close to what you see in the lab, but they did say that of all of them, the Whoop was actually close enough that you could potentially use it for research. So I was, same thing – I was highly skeptical, but was surprised to see that.

Marco Altini 33:28

I developed it myself, together with the team. I am an advisor for them, and years ago I was doing some more technical work, so I actually developed the new-generation algorithm that is in the ring. And these, let’s say, latest-generation algorithms perform decently, let’s say better than anything before – the Whoop or even, I would say, the new versions in the Apple Watch are very similar. They maybe get 80% right over four stages – which is, again, as good as it gets without measuring brainwaves, basically, you know, the previous generation was maybe 60-65% – so it’s been a decent step up, but it’s still far from what you get with polysomnography and actually measuring EEG. And I think there are various aspects there that all require, say, maybe a long conversation – one would be that even if we were to detect how much REM sleep and deep sleep we get and all of this, we really have no idea what to do with that. Like even when we do it with polysomnography and we get sleep stages, it’s not that this information is particularly actionable. The second aspect there is that when we look at what these devices are estimating, they’re doing it mostly through measurements of autonomic nervous system activity, like again, changes in temperature and heart rate variability and heart rate during sleep, and also, of course, movement. But I would say the step up recently has been due to measuring autonomic nervous system activity, which is not too bad, but it’s not the same as measuring brain activity, brain states, so it is something different. And I was recently part of this sort of committee with a series of scientists and sleep specialists to write recommendations and guidelines for using wearables in sleep research. 
And one of the common themes there was that maybe we’re even looking at this the wrong way – so we shouldn’t really be looking much at trying to emulate polysomnography with a wearable, trying to guess sleep stages with errors that have remained quite large – but maybe we should embrace what the wearable is actually providing us, which is different and possibly more insightful, which is autonomic nervous system activity. So maybe that is what we want to study in the context of sleep, but also in the context of performance and training and all of that. So seeing your body responding through – again, changes in temperature, heart rate, heart rate variability, those kinds of measurements that are, at this stage, very accurate when collected with these devices as you sleep – might be more insightful than trying to mimic what another device is capturing. Let alone that PSG has its own problems. I think the reference device here is something that requires experts to look at the data, and then agree that this 30-second segment is deep sleep and this 30-second segment is REM sleep. And when you put three experts together to do this, they agree 85% of the time, so that 85% agreement is actually your reference, what you consider 100% when you develop your algorithm, and then you get 80% of that accurate with the wearable – so there are many layers here of different errors that are introduced to estimate something that, even if we were able to measure it correctly, we wouldn’t really know exactly what to do with. Nor should we probably expect to need the same amount of light, deep or REM sleep every day, every week. Depending on stress in our lives and the training sessions that we did, most likely the distribution of these stages should change over time – all things that we don’t really know.
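To put rough numbers on the layers of error described here, a back-of-the-envelope sketch follows. It assumes, as a simplification that nobody in the conversation makes, that the two agreement rates simply multiply; the function name is hypothetical.

```python
def compounded_agreement(reference_agreement: float, device_agreement: float) -> float:
    """Rough estimate of how often a wearable's stage label is 'right' when the
    reference itself is only partly agreed upon.

    Assumes the two sources of disagreement are independent and multiply,
    which is a simplification for illustration only.
    """
    return reference_agreement * device_agreement


# Figures quoted in the conversation: human scorers agree about 85% of the
# time, and the wearable matches that scored reference about 80% of the time.
print(f"{compounded_agreement(0.85, 0.80):.0%}")
```

Under that simplifying assumption the combined figure lands well below either number on its own, which is the point being made about stacking estimates on an imperfect reference.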

Dr. Stephen Seiler 37:19

Marco, you know, you’re going down the rabbit hole on sleep. And it’s really an important issue, because a lot of people are being misled, but it’s a very popular thing to try to measure. But I think within that it’s also useful to kind of back up a little bit and think about, “alright, if the measurements are incorrect, what are the sources of incorrectness or error?” Sleep is a fairly complex one, but we can take something that the cycling community will understand much better – just power, or some derivative thereof like normalized power – where we’ve introduced various algorithms, but power itself has become our gold standard tool in cycling, and even there, we’re being challenged by certain problems. Because, you know, if we assume that most cyclists want to know about cycling out in the real world on real roads, and we test them in the laboratory on various devices, trainers, you know, ergometers – we’re finding out that even if they take their bicycle from the road and put it on a trainer, the same trainer they may have used otherwise, and they put that bike on that trainer in the laboratory, they’re getting different power measurements with the power meter on their bicycle when it’s on a trainer than they’re getting outdoors actually cycling on that bike.

Trevor Connor 38:43

Very quickly, something I want to bring up, this is a pet peeve of mine, even though it’s a very minor thing. Power is not a measure, power is a calculation. We measure torque and we measure cadence, you don’t actually measure power.
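As a minimal sketch of the calculation being described: mechanical power is torque multiplied by angular velocity, so a power figure can be derived from a torque reading and a cadence reading. The function name and the sample values are hypothetical, and real power meters add their own sampling, averaging and calibration steps that are not shown here.

```python
import math


def power_watts(torque_nm: float, cadence_rpm: float) -> float:
    """Power (W) = torque (N·m) x angular velocity (rad/s)."""
    # Convert crank revolutions per minute to radians per second.
    angular_velocity = cadence_rpm * 2 * math.pi / 60
    return torque_nm * angular_velocity


# Hypothetical example: 30 N·m of average crank torque at 90 rpm
# works out to roughly 283 W.
print(round(power_watts(30.0, 90.0), 1))
```

Any error in either input carries straight through to the power number, which is why the two underlying measurements matter.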

Measuring in a Lab versus in Real Life

Dr. Stephen Seiler 38:54

Right, that’s true. So yeah, we’re already deriving. Good correction – so yeah, we’re using a strain gauge that’s somewhere in the crank or pedal system, cadence is being measured, and that combination is giving us a measure of power. And, you know, my cycling community friends, they’re even having to use correction factors because our lab, which was supposed to be the gold standard, laboratory testing of power and doing the lactate profiling, it’s not transferring to the field in a one-to-one way. And I even have a colleague, Espen Aareskjold, who’s been on this podcast, and he basically just says “well yeah, we use an 8% correction, the powers are higher out in the field”. We don’t know for sure if it’s just a function of some biomechanical issues, you know, what it is exactly – so I just exemplified that this applies even to the fundamental gold standard kinds of measurements. I can give you another example from the laboratory that we’re working a lot with. For decades, we have measured breathing by either using what’s called a Hans Rudolph valve, where there’s a mouthpiece in their mouth and we clamp their nose, or we put a mask on our athletes and then it’s got a tube so that we’re collecting that ventilatory exhaust and sampling it for O2 and CO2, and we’re also measuring the amount. And that’s our gold standard way of measuring the metabolics, but what we found out is when we take off the mask, our athletes breathe differently. They breathe on average at higher frequencies, because there’s less resistance, and so our brains adjust the breathing cycle, the breathing frequency and tidal volume are adjusted in a different way. Now, we think the overall volume is about the same, but the brain seems to regulate breathing differently to solve for, or to counteract, that resistance from the mask that is part of our gold standard technology. So we are influencing the physiology with the technology. 
When we take away the mask and just use the shirt, we calibrate the shirt and the mask, then we see – oh my goodness, some of these athletes, they hit peak breathing frequencies during the max test that are 20 breaths per minute faster. And on average, 10 or 12 faster, and at the very extreme, some of them are 30 breaths – you know, they’re hitting 90 breaths per minute without a mask and hitting 65 with a mask. So we’re in a crisis right now, my physiology friends, because our wonderful laboratories that are gold standard kind of halos where we collect data are being challenged fundamentally, because the data doesn’t always match up with what happens out in the real world. So we’re dealing with this right now.

Trevor Connor 41:56

That’s a conversation I’ve had with physiologists. I’ve had that experience myself, and I’ve done a lot of lab testing, and I can tell you, I go in, they don’t want me on my bike, they put me on the scientific Velotron, which they try to match up with my bike but never quite get the same position, it’s a different saddle. Then they come, you know, block off your nose, put a mask on your face, all these other things, and they had me start riding at 150 watts and I’m sitting there going, “this already feels hard”. Not because 150 watts is hard, but just because I’m in such an uncomfortable state, and you know, the whole experience is going to be different. And so I’m a very big believer that when you get athletes in the lab to test them, you don’t put them on the Velotron, you have them on their bike, because I think it’s a mistake to test them on this Velotron and give them these numbers. And you go, “How do you know if that matches up with their power meter?”, and I’ll have physiologists say, “well, the Velotron is accurate”, and I go “but that doesn’t matter because they’re not doing their training on the Velotron”.

Dr. Stephen Seiler 42:50

We have two beautiful Lode Excalibur devices in our lab – we never use them because any cyclist that has any capacity, they’re saying “I want to be tested on my bike”. So they’re bringing in their own equipment and we’re accommodating that, or we may be using a Wattbike, but we’re not using the fancy Lode – you know, so they’re sitting there collecting dust. And that’s just the reality of it, but the most common situation will be that if we do testing in the lab now – and this is with professional athletes, you know, cyclists – they’re bringing in their own equipment, they’re bringing in their bike, and they’re bringing in their trainer. And we’re just serving as some reference to try to standardize the process, but some of the measurement devices are the same ones they’re using in the field, and that gives them a bit more security in what they’re measuring.

Rob Pickels 43:41

Hey, listeners, it’s Rob Pickels, co-host of the Fast Talk Podcast. Believe it or not, I don’t just talk about training on the airwaves. I also talk about the science of training with the athletes that I coach. If you like hearing from me on the show, then you can have me as a resource to help you achieve your goals, whether that’s racing, a big adventure, or improving your fitness. For more information about coaching with me or great coaches like Grant Holicky, US Cyclocross Champ Steven Hyde, or many other amazing coaches, check out foreverendurance.com.

What are the Properties of Good Metrics?

Trevor Connor 44:14

Dr. Seiler, I’m going to flip this around because I think you’re touching on a really important point and I’m actually going to ask you a question that you put in the outline, which is, what then are the properties of good metrics? Because we’ve just raised a whole bunch of concerns that we’ve seen with metrics, so what do you look for to say “this is a valuable metric”? What do you look for to say “this is a good metric, this is something we can use”.

Dr. Stephen Seiler 44:35

Well, I want to know where all the parts are coming from. So first principles, I want to understand where each part – like as you pointed out, power is not actually being measured, it’s cadence times torque. Okay, how’s cadence being measured – I get that, that’s fairly straightforward. What about torque? Is that straight- you know, so I want to know the pieces to the puzzle. I am allergic, as a scientist, to black boxes, if you understand what I mean by that. So if a proprietary company says, “Hey, we’ve got our black box measurement for whatever this metric is”, I’m out of there, because that is, for me, a red flag. But I understand it’s business – they have a business that they’re trying to protect – so I’m cognizant of that, but you know, I feel comfortable with measurements where I know where the various elements are coming from, whether it’s VO2 or heart rate variability, or whatever it might be – so that’s the actual parts of the measurement, and then the other is the execution of the measurement, which is user error. You know, with some of these measurements, it depends on how you do it. With blood lactate, for example, a lot of recreational athletes buy blood lactate monitors and boxes of strips, and they waste a lot of money because they just make mistakes in the actual process of sampling their own blood, you know, because it’s not easy. I mean, it’s not easy if you’ve never done it, right? So there’s a technical error aspect. So there’s measurement error that is device-associated, the actual elements that are being measured and how they’re being technically achieved; there’s user error; and I guess the third source of error is not really error at all but just variation, and that’s day-to-day variability in these measurements that we assume are going to be static. We assume maximum heart rate is going to be maximum heart rate every day – but it’s not. 
We assume peak lactate is always peak lactate – it’s not – because there is biological variation. So you’ve got three kinds of sources of variability in these measurements that you try to minimize. You know, you can’t minimize the day-to-day, so you do it in the lab with testing and you say, “Well, I want you to show up the same time of day, wearing the same shoes, having not had caffeine” – you know, so we try to standardize some things like that, and often for testing we actually standardize in ways that cyclists wouldn’t actually do in the real world. I wouldn’t go without a meal for the three hours before a workout, but that’s what I do in the lab. I would not abstain from a cup of coffee for three hours before, but I do that in the lab. You know, so we’ve created some kind of artificial situations that actually don’t apply to what our athletes are really doing in the field.

Trevor Connor 47:38

As Dr. Seiler just pointed out, even a good measure we trust such as heart rate can vary and still ultimately lead to estimates. Here’s physiologist Brady Holmer to explain a little more.

Brady Holmer 47:47

So I mean, maybe I think the most common and most popular one would be sort of thinking about like zone two training. So a measure or a metric of whether you’re in zone two or not would actually be to measure your lactate levels – if you did a lactate test, you could actually kind of determine where your different thresholds are and whether you were in zone two or not, versus an estimate, which is what most people are probably doing when they say they’re training in zone two, which is training based on – it’s almost an estimate of an estimate, because I think what most people do when they’re estimating their zone two is estimating their maximal heart rate, probably using 220 minus age, which is a very, very rough estimate. And then, from that, they’re kind of estimating what their zone two is based on, you know, say maybe 60 to 70% of their estimated maximal heart rate, so you know, hopefully, “this is kind of like what you’re looking for”. But yeah, I think, you know, regarding that, while that can be valuable and maybe get you somewhere towards your zone two training, it’s certainly not as accurate as a measure of lactate would be, and I think a problem even with a measure is that that measure is constantly going to be changing. So even if I have a measure – someone had even posted something on X or Twitter about this a while ago – but say I measure my zone two using a lactate meter, and so I know what my zone two, say, power output is or running pace is – that could also change from day-to-day. So unless I’m actually measuring my lactate every single day to determine where my zone two is that day, perhaps an estimate might even be better in that case, because you get like a wider range.
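The “estimate of an estimate” described here can be written out in a few lines. This is only an illustrative sketch of the rule of thumb as stated in the conversation (220 minus age, then 60 to 70% of that); the function names are hypothetical, and it is not a recommendation to set zones this way.

```python
def estimated_max_hr(age_years: int) -> int:
    """First estimate: the rough '220 minus age' rule for maximal heart rate."""
    return 220 - age_years


def estimated_zone_two(age_years: int, low: float = 0.60, high: float = 0.70) -> tuple[int, int]:
    """Second estimate, built on the first: a fixed percentage band of the
    estimated (not measured) maximal heart rate."""
    max_hr = estimated_max_hr(age_years)
    return round(max_hr * low), round(max_hr * high)


# Hypothetical 40-year-old: estimated max of 180 bpm gives a band of 108-126 bpm.
print(estimated_zone_two(40))
```

Neither step involves a measurement of the individual, so any error in the first estimate is carried into the zone boundaries.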

Trevor Connor 49:29

So you just brought up basically validity, reliability, and then you called it functionality. I’m going to throw something out here just to get a conversation going. To me, of these three, reliability is the most important. Meaning, if I have a power meter that’s not valid, that’s – say – 50 watts off, I’m okay with that as long as it’s always 50 watts off, because then I can still use it for training.

Dr. Stephen Seiler 49:58

Not me, it would be okay if it’s 50 watts too high.

Trevor Connor 50:03

Right, exactly.

Dr. Stephen Seiler 50:06

So my, my ego is so intimately connected to those power numbers that there’s no way in heck, I’m going to be able to survive a 50 watt deficit. So the psychology is still important.

Are Reliable Metrics Useful?

Trevor Connor 50:18

We’ll give you 50 watts too high, but how do you two feel about that? If you have something that is reliable – it’s fairly consistent day-to-day – can it be a useful metric?

Marco Altini 50:28

I would say that, in general, it can be useful in certain situations. I mean, my expertise has been mostly related to measuring, you know, the stress response and heart rate variability with wearables, and I’ve seen a bit of the evolution of these devices that started measuring HRV in the night, different wearables, and what they measure is not even really heart rate variability, right – by definition, that is only coming from the activity of the heart – so what you measure as pulse rate variability will just be different because by the time, you know, blood has traveled to the wrist or the finger, there is additional variability due to changes in blood pressure and things like that. But even if the absolute value there is not really the same as your ECG-derived heart rate variability, the ability of these devices to capture day-to-day changes in response to stressors – so the changes with respect to your own data, the previous days, or your normal range, or, you know, whichever way you quantify your previous data, and how it changes over time – is very good. These devices can do that very well, in a way that is very similar to what you would capture with a nice ECG, even though the absolute values will be a bit off. So, certainly, I think reliability in that sense is more important than the device being actually valid at measuring heart rate variability. So it gives us a way to use it in certain contexts. I think that this technology also tried to solve some of the issues you both brought up earlier in the context of what we could call ecological validity, right? So how there is a difference between measuring things in the lab and measuring things in real life. With sleep, this is also the case, right? 
Yes, sure, we can use polysomnography again and measure brainwaves and all of that, but when you are in a sleep lab, and you have all these devices linked to you, and you’re sleeping there, maybe for one night or two – how does that relate to your actual sleep when you’re at home, right? If you take that data and think that that’s how you sleep and try to maybe implement changes in your behavior, or sleep routine based on that, I don’t think it’s particularly meaningful. And that’s probably the case also for some of the tests you mentioned earlier, measuring in the lab with the oxygen mask, or lactate, or a different position on the bike and all of that. As the measurements move outside of the lab, I think we do have something to gain there with our ability to capture these changes in a way that is more representative of real life. It’s the same with heart rate variability, right? In the past, we were doing things like Stephen was saying before – do not eat, do not drink coffee, do not do a lot of exercise, then come to the lab. And then three hours later, you head to the lab, and then they tell you, “hey, now lay down and relax”. And then, you know, you’re ordered to relax, and then you measure your heart rate variability, and that probably has nothing to do with your resting physiology. The way you measure it now, during the night or first thing in the morning at home, is a lot more useful in terms of how we interpret the data with respect to the alternative that we had before. So there is something to gain, I think, as long as we are measuring things. It also gets, I think, a bit more challenging when we start not measuring things, but estimating them and making them up a bit. 
When you build a model, right, even if we say that, “okay, we use this very complex machine learning model that uses all these parameters”, you are still learning from certain data that you have collected that might not be really representative of the person that is using the device now. So if I use this device and it is measuring something, that is accurate, typically. If it’s estimating something, it might be accurate if I’m very similar to the people who were used to create the dataset and create the model that is used now to develop this sleep-staging estimation algorithm or any other algorithm, but if those people are a bit different from me, then maybe what I get is not particularly useful. So it’s always difficult also to generalize, right? If we take it in the simplest form possible, an estimate would be, you know, what’s your maximal heart rate based on your age, right? None of us would use that, but there are people, of course, for whom this will work perfectly, exactly matching the maximum heart rate that they would get in a maximal test, because there will be those people. But that does not mean that that is a good method, so I think that’s also sometimes difficult for people to understand, that there is a lot of interindividual variability in how these things work, and even when they are validated, maybe they’re not validated on, you know, an individual that maybe is older or has a different behavior, which might be someone that is using this device, but was not the person used to develop the model.

Dr. Stephen Seiler 50:54

So I want to go back to your 50 watt issue or your 50 watt example where you say reliability is more important than validity. And I would argue this, I would say, “look, if we’re within a person, within a specific individual, I can buy your argumentation and say yes, if I have to choose, then repeatability trumps absolute validity” – although I don’t really think you can separate them, because now, what happens if I don’t have validity? I am 50 watts off: I think I’m cycling at 400 but I’m really cycling at 350 watts, it’s always 50 watts, I’m feeling great about myself, but then I go to the race and I compete against my opponents and suddenly I get this “ah-ha experience”: holy cow, I’m not what I thought I was, my 400 is actually 350 – I’m getting my butt kicked here. So as soon as I want to make comparisons within a team, or across individuals, validity really matters. I’ve got to have standards. If I know, you know, if I’m the coach of Visma/Lease-a-Bike and they’re trying to prepare to win Monuments against, you know, some of the other great athletes like Van der Poel of Alpecin-Deceuninck, and Fenix – well, that power needs to be valid, okay? They need to have an upfront understanding of “look, when we come, we’re going to have to be able to generate these kinds of powers in these particular situations to get where we want to be”, so validity matters. Within individuals, it’s about patterns. Across individuals, we need some absolute calibrations, I think, and that’s a worthwhile thing to remember, because sometimes we do want to compare across and see “where are my athletes relative to some standard”, you know, “what does it take to qualify for the World Championships in the 1500?” You know, you’ve got to have a certain speed. So that better be valid during your training.
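The distinction between within-person patterns and across-person comparisons can be illustrated with a small sketch. It assumes a power meter with a constant 50 W offset, as in the example discussed here; the numbers and names are made up for illustration.

```python
# Hypothetical true power (watts) for one rider across four tests.
true_power = [300, 310, 325, 340]

# A reliable but invalid meter: always reads exactly 50 W too high.
OFFSET_W = 50
measured_power = [p + OFFSET_W for p in true_power]


def changes(series: list[int]) -> list[int]:
    """Test-to-test differences, i.e. the within-person pattern."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Within the individual, the pattern of change is preserved exactly.
print(changes(true_power) == changes(measured_power))

# Against an external standard, every absolute value is off by the offset.
print(measured_power[-1] - true_power[-1])
```

A constant offset cancels out when only changes are tracked, but it is fully present the moment the number is compared with another rider or a race requirement.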

Trevor Connor 56:15

I actually agree with that. I’ve raised that intentionally to ruffle some feathers. I agree with you.

Dr. Stephen Seiler 57:17

That’s okay, but I think it’s a useful distinction to remember: what are we using variables for? Internal monitoring versus standards, you know, making sure – you know, where are we relative to certain targets, you know, for performance. And even in our training process – like Canova, one of the coaches who’s worked with some of the best Kenyan runners in the world, you know, he would say that during early phase training, the metrics that they’ll be using will be more perceptual, or maybe heart rate when they’re doing hard sessions, but then as they move towards the season, they will move to very pace-oriented, very, you know, validity-dependent measures – they have to be able to run these paces, these lap times, if they’re going to be competitive for 10,000. So that construct, this idea, you know, sometimes you need the absolute numbers to match up pretty well.

Marco Altini 58:12

Yeah, that’s actually a great point. A great side application of having valid measurements or estimates is something I see a lot with, you know, new devices, wearables and things. You know, every day there’s either a new device or a new parameter that is estimated, right. And if there is a level of validity, then we can try to figure out if this parameter is worth something even before it has been validated in scientific research by comparison with a reference system, if that is a parameter that does have a reference system, which is something that might take two years by the time the device has been validated and the peer-review process and everything has been done. So a way that we can look at metrics is typically to compare a couple of devices worn on the same person, right? If these are parameters that we can measure reliably – or estimate reliably even, and not all estimates maybe are so bad – then we expect all of these wearables to provide us with very similar data related to this parameter. So if I measure the resting heart rate, or heart rate during the night, with three or four wearables, I want to see these data be very close in absolute terms, but also in relative terms as they change over time, over days, right, for the same person, and then the same for HRV. But then if I look at sleep stages, it will be all over the place when I look at multiple wearables for the same person over time, and the same if I look at calories. So this tells us something, I think, which is that maybe we are unable to estimate these parameters with the required validity or reliability at this stage. 
And so a simple, I would say, exercise we can do if we have access to more wearables and we have some doubts about which metrics or estimates we can rely on: use a number of wearables, and for the ones that track very well between each other, it means that we are actually measuring something or estimating something in a way that we can trust. But if the data is all over the place – and we’re talking about, you know, the major players out there – it’s very unlikely that one is better. I mean, we’re talking about companies with millions of dollars of budget, hundreds of employees, a lot of smart people, right? There is no reason to think any one is better than the others at this stage, right, they’re all doing the same things with the same signals. So either we do it well, and we can do it, or, if it is all over the place from multiple devices, then maybe they’re really not worth our time.
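The cross-device exercise described here can be sketched in a few lines: wear several devices, collect the same parameter from each over a number of nights, and check how well the devices track one another. Everything below is hypothetical, including the device names, the made-up readings, and the choice of Pearson correlation as the agreement measure; the conversation itself does not specify a statistic or a cut-off.

```python
from itertools import combinations
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def min_pairwise_correlation(readings: dict[str, list[float]]) -> float:
    """Worst agreement between any two devices for one parameter."""
    return min(pearson(a, b) for a, b in combinations(readings.values(), 2))


# Made-up nightly values for one person over five nights.
resting_hr = {
    "device_a": [52, 54, 51, 58, 55],
    "device_b": [53, 55, 52, 59, 56],
    "device_c": [51, 54, 51, 57, 55],
}
deep_sleep_minutes = {
    "device_a": [60, 95, 40, 80, 70],
    "device_b": [90, 50, 85, 45, 100],
    "device_c": [30, 70, 75, 110, 40],
}

for name, readings in [("resting HR", resting_hr), ("deep sleep", deep_sleep_minutes)]:
    print(name, round(min_pairwise_correlation(readings), 2))
```

Correlation only captures whether the devices move together from night to night. Checking that they also agree in absolute terms, as mentioned above for resting heart rate, would need an additional comparison of the raw differences.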

Dr. Stephen Seiler 1:00:46

It’s not doable, yeah. I’m gonna give an example – and I do not want to disparage this technology, or the users of it, who find it very useful – but I find this to be a challenge with near-infrared spectroscopy, the NIRS method. Muscle oxygenation. Muscle oxygenation is an interesting thing for me because it’s a technology using light refraction just like you were talking about for photoplethysmography – so there’s different wavelengths and so forth – it’s a combination of blood flow and oxygenation of the hemoglobin that’s passing under the tissue. But the challenge, the big difference for NIRS versus heart rate, or versus ventilation, is that when I measure heart rate or ventilation, I’ve got a system wide response, it’s very robust, it’s very – I’m looking at the entire body’s ventilatory output or the entire body’s, the heart rate response is a function of the entire system’s responses. When I look at NIRS, I am sampling a really small amount of tissue because that light only travels maybe one or two centimeters deep – I’m giving an extra credit with two centimeters to put it that way – and so it is sampling a small amount of tissue in one muscle and then it’s trying to say something about all – you know – all the musculature that’s active. And that’s challenging and we’re seeing that if you put multiple NIRS devices on multiple muscles, you get different values at the same time. So also, it’s not necessarily that NIRS itself is wrong, it’s just, EMG, NIRS, some of these issues, we’re just not able to sample enough tissue – we’re taking a very small sample and trying to make a really big projection about how the whole body is responding. So those technologies are more vulnerable than the technologies that are built more on kind of robust system wide responses. And I – again, not to disparage NIRS, probably NIRS is one of those things that if you use it every day, you start to detect patterns that are useful, that you’re able to help your athlete. 
But if you don’t use it every day, and you don’t get that detailed understanding of the variability and of what that athlete’s pattern is, then probably it’s not so easy to use.

What to Look for in Estimates

Trevor Connor 1:02:59

So I’ve got one last big question for both of you – and before I go there, I just want to say that the reason I used that power example with you, Dr. Seiler, and to fully agree with you here: I bought a new power meter in 2016 that read about 40 watts too high and spent the whole spring thinking I was on great form – and it’s exactly like you say, I was feeling great until I went to my first race and went “Oh, no, I’m not on great form at all”. So I agree with you completely, but I just want to ask one last big question here, because we have been talking about a lot of the metrics, but the theme of this episode is also talking about some of these estimates. So right now I’m thinking of things like training load, TSS, PMCs, these sorts of things – what do you look for with these estimates to be able to say “this is something I can use”? Because, you know, I always get concerned – we just spent a fair amount of time talking about things that claim to be direct measures and all the particular issues with those, so those have their issues, they have their points where they don’t work – now you have these estimates, which are taking these imperfect measures and then doing calculations on them, often then doing calculations on those calculations, and do you hit a point where you’re having something that could do more harm than good? How do you make that assessment?

Dr. Stephen Seiler 1:04:23

Well, I mean, we can start with something as simple as FTP: functional threshold power. And no disparaging to those who created it, but if we go back to the origins of it, it was fairly straightforward: “what’s your average power for an hour?”, right? The Hour of Power. Been done for decades. It’s not a fun test, but it is a truth teller, right? Well, then you have people say, “well, that’s kind of hard, can’t we estimate with a shorter distance?” and so you’ve had this slow industry of “well, you know, let’s take – do a hard five minute effort and then we’re going to do a 20 minute, and then we’re going to take point nine five and” – but as soon as you try to estimate the average power the person can maintain for 60 minutes from an average power for some other duration, there’s already – you’re introducing error, because individuals are different. Their fatigue curves, their fiber type compositions and so forth dictate that, yeah, there’s not going to be a one-to-one correspondence between or a perfect correlation between their 20 minute power and their 60 minute power, even if you measure both of those beautifully and perfectly, so we shouldn’t even expect that to be a perfect correlation, okay. So those kinds of issues are rampant and then you have this metrics type thing where we get athletes that, unfortunately, will take every way of trying to get a higher FTP, and then they’ll use the highest FTP they ever had when the power meter was probably not quite calibrated right – and that will become their benchmark. And that’s tough, you know, because it’s probably 30 watts off, you know, so I had this happen to me on Zwift one time. I had my regular bike, my Tacx Neo bike had to be repaired, I was using another bike, and I’m riding in this race and all of a sudden, for whatever reason, I was like, “Holy crap, I am feeling great”. And I was just flying, I was just kicking butt and setting power records along the way and I was like, “Where did this come from? 
Where did my magic come from?”. And then I realized that, at the end of the ride I realized that a towel had gone down – and I was using the Concept2 – and it was slightly covering the intake. And anyway, it was disrupting the calibration of the ergometer and I was getting a bunch of free watts. And I have just lived with that because there’s – those numbers are still on my Zwift, you know, all-time performance numbers. They’re, they’re wrong! But I can’t get rid of them. And they haunt me, you know, they still haunt me to this day. But it just shows that we have such a struggle with understanding that, look, the best calibration for us is, “what can you do on a so-so day?” You know, what is your daily grind FTP? That’s the one you need to really use as a reference, not that one where probably the calibration was off, you’re on an all-time high, you just can’t – you know, don’t use that, because that ends up creating problems for us. That’s just one example.
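To make the estimation problem above concrete, here is a minimal Python sketch. It assumes the fixed 0.95 multiplier mentioned in the conversation and three entirely hypothetical riders who share the same 20-minute power but differ in what they can actually hold for 60 minutes. The rider names and wattages are invented for illustration only, not data from the episode.

```python
# Sketch: estimating 60-minute power from a 20-minute test with one fixed factor.
# All rider numbers are hypothetical and chosen only to illustrate the point.

FTP_FACTOR = 0.95  # the "point nine five" multiplier mentioned in the discussion


def estimate_ftp(twenty_min_power: float, factor: float = FTP_FACTOR) -> float:
    """Estimate 60-minute power as a fixed fraction of 20-minute power."""
    return twenty_min_power * factor


# Hypothetical riders: (20-minute power, power actually sustained for 60 minutes)
riders = {
    "rider_a": (300.0, 285.0),
    "rider_b": (300.0, 270.0),
    "rider_c": (300.0, 291.0),
}

# Positive error means the estimate flatters the rider.
errors = {}
for name, (p20, p60) in riders.items():
    errors[name] = estimate_ftp(p20) - p60
    print(f"{name}: estimated {estimate_ftp(p20):.0f} W, "
          f"actual {p60:.0f} W, error {errors[name]:+.0f} W")
```

Even with both efforts measured perfectly, a single multiplier can only be right for riders whose fatigue curve happens to match it; for everyone else the estimate carries an individual error of unknown sign.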

Trevor Connor 1:07:35

All right, maybe our group here isn’t going to offer a lot of positives about estimates. So let’s hear from the Tri-doc himself, Jeff Sankoff, who shares a few calculations and estimates that he finds valuable, including FTP.

Jeff Sankoff 1:07:48

The major things that I use are bike-derived or bike-measured. I use WKO to interface with Training Peaks and, because of that, I get a lot of data that’s pulled out of Training Peaks so most of my athletes have power meters, if they don’t have power meters, I encourage them to get power meters. I have, I think, one athlete who still doesn’t have a power meter and so for them, I’ll use a lot of subjective kind of information. But for all my other athletes that have power meters, I’m getting a lot of calculated data that is coming through WKO – and let’s face it, I mean, I’m still one of those people that really leverages FTP. I don’t do specific FTP tests – WKO has its own protocol for testing and then calculates an ongoing FTP – and I’ll use that. But, I do like FTP, I do think it’s a good metric. I don’t use it. I’m not slavish to it in any way, I don’t think it’s the end all to be all, but I do think it’s a good number that I like to follow. I think it gives me a good sense of where my athletes are in terms of their ability. And then I kind of pair that with other things – so because I’m coaching triathlon mostly, I – for running, I will use pace, I use a threshold pace, but I kind of pair it with… I don’t love heart rate as an individual metric on its own because I think there are so many things that can impact that, especially for age groupers, and we’re not dedicated professional athletes so we’re not spending, you know, getting up in the morning and immediately training. Most of us have jobs, most of us are squeezing our training in whenever we can, and so there are so many things that can impact our heart rate. And so for that reason, I look specifically at how heart rate is interacting with things like pace or interacting with power output and I want – I’m looking for any signs of decoupling – so any decoupling where I don’t expect it to come, I take that as a sign of fatigue or I take that as a sign of overreaching. 
So those are both measured; pace is also measured, obviously. And then the other data that I’ll really use, I guess, is going to be related to things like cadence. I really feel like cadence is very important – especially running cadence. There’s so much science showing how running cadence is important for running economy and so I harp on my athletes continuously on the importance of cadence, I have a lot of them using the metronome function on their Garmin watches to try and get them to run at a good cadence. And spinning cadence as well, I think, is a very important number – so most of my data is measured and a lot of it is things that I try to follow on a workout-by-workout basis.
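The speaker does not say how he quantifies decoupling. One simple approach, sketched here in Python as an assumption rather than as his method or any software’s method, is to compare the power-to-heart-rate ratio in the first half of a steady effort with the ratio in the second half. The function name and the sample data are hypothetical.

```python
from statistics import mean


def decoupling_percent(power: list[float], heart_rate: list[float]) -> float:
    """Percent drop in power-per-heartbeat from the first half to the second half."""
    if len(power) != len(heart_rate) or len(power) < 2:
        raise ValueError("power and heart_rate need equal length of at least 2")
    half = len(power) // 2
    first = mean(power[:half]) / mean(heart_rate[:half])
    second = mean(power[half:]) / mean(heart_rate[half:])
    return (first - second) / first * 100.0


# Hypothetical steady ride: constant power while heart rate drifts upward.
power = [200.0] * 8
heart_rate = [140.0, 140.0, 140.0, 140.0, 147.0, 147.0, 147.0, 147.0]
drift = decoupling_percent(power, heart_rate)
print(f"decoupling: {drift:.1f}%")
```

A positive value means heart rate rose relative to output. Whether a given percentage signals fatigue depends on the athlete and the session, which is why the comparison is made against what is expected for that individual.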

Dr. Stephen Seiler 1:10:24

The normalized power – it’s an arbitrary decision made by someone to say, “well, it looks like, you know, taking the fourth power of these numbers, and then the fourth root to get this normalization” – it kind of looks right, but there is absolutely no gold standard to say “this makes sense, it has to be the fourth power. Third power is not right. Fifth power is not right. It’s fourth power, right?” No, that was never the case. And then now we know that in Training Peaks, they don’t even actually use the normalized power in the raw format, they take a 30 second smoothing version of the raw – of the normalized power because that turns out to end up matching up better. So even the normalized power has been, in a sense, bastardized to tweak it so that it looks a little more correct. And we see this, if we take true normalized power from some of these highly stochastic races, we get absolutely crazy values. But it’s perpetuated in – we’ve just gotten used to it.
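The two variants being contrasted can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any vendor’s implementation: the “raw” version takes the fourth root of the mean of the fourth powers of the unsmoothed samples, and the smoothed version applies a 30-second rolling average first, which is how the calculation is commonly described. The 1 Hz power trace is hypothetical.

```python
def fourth_power_mean(samples: list[float]) -> float:
    """Fourth root of the mean of the fourth powers of the samples."""
    return (sum(p ** 4 for p in samples) / len(samples)) ** 0.25


def rolling_mean(samples: list[float], window: int) -> list[float]:
    """Trailing rolling average; only full windows are kept."""
    return [
        sum(samples[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(samples))
    ]


# Hypothetical 1 Hz trace: ten repeats of 30 s at 400 W followed by 30 s at 100 W.
power = ([400.0] * 30 + [100.0] * 30) * 10

average_power = sum(power) / len(power)
np_raw = fourth_power_mean(power)                        # no smoothing
np_smoothed = fourth_power_mean(rolling_mean(power, 30)) # 30 s smoothing first

print(f"average power:        {average_power:.0f} W")
print(f"fourth-power, raw:    {np_raw:.0f} W")
print(f"fourth-power, 30 s:   {np_smoothed:.0f} W")
```

On a surging trace like this one the three numbers separate clearly: smoothing pulls the result down toward the average, and the exponent of four is a modelling choice that could be swapped for another value in `fourth_power_mean` to see how sensitive the number is to it.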

Trevor Connor 1:11:23

Here’s my issue with normalized power – and I, whenever athletes send it to me, I immediately tell them, “No, tell me your average power” and they go, “Oh, Trevor, you’re so outdated, you got to stick with normalized”. But let’s even give the benefit of the doubt and say normalized power does what it says it’s supposed to do. What they are trying to do is take an external measure, and give an estimate of internal stress. So, normalized power is supposed to be “here’s what the race or the ride or whatever felt like to your body” – so even though you average, you’re doing a crit, even though you average 275 watts, to you, it felt like 400 watts. But I have athletes send me the normalized power all the time and say, “look how hard I was going and how fast I was going” and I go, “that’s not what the normalized power tells you, it’s actually the exact opposite, stop sending me that number”. And athletes won’t stop because normalized power is always higher than average power, it’s a nice number.

Dr. Stephen Seiler 1:12:21

We’re actually using those types of metrics in race analytics to say, “where are the power losses happening?” because what we would like to do is reduce the energy bleed – you know, every time you go around a turn, every time you get your head in the wind, every time you have to jump back on a wheel because you got a little lazy and you fell back a few meters, you have to do these power surges, and they are costly energetically. So we’re actually using normalized power in a different way to help because we would like to bring it down, we would like to be, you know, using the famous “eat from the plates of the others before we start licking our own”, you know, we want to reduce because you can only do so much with power in, so we’ve got to look at power losses, and say, “Where’s the wasted energy coming from?” Where can we – in a classic, in the Paris-Roubaix, where can we kind of save some energy because if we can save some of those surges – so in that respect, we want raw NP, we want to use the raw data, because it’s telling us the actual power surges, not some smoothed variant thereof. So I think sometimes there are ways to use some of these constructs that have been developed but just be very cognizant of what you’re – you know, if you smooth out, for example, if you use 30-second smoothing as in Training Peaks with that normalized power, you’re actually missing, you’re losing data, you’re losing information about how did the cyclist actually cycle? How did they actually solve the problems of the course? And so we’re finding that it’s more interesting, it’s more truth telling to use the raw normalized power, not the Training Peaks version.

Trevor Connor 1:14:07

That’s interesting.

Marco Altini 1:14:08

I found it particularly interesting that in this context, or the examples you guys have made, we talk about estimates where we are estimating something from actually the same thing collected maybe over a shorter time period, right, like your power over an hour but we start from your power over 20 minutes. And that already has a million issues, right? As you were describing. But now with the devices we have, we actually estimate things from things that are completely different, not even the same parameter measured over a different timeframe or in different contexts, right – again, brain activity from heart activity. So I think that says a lot about how things can easily go wrong, and how important it is to understand the limitations. It’s not that it’s all bad or not useful, right? It’s just that we need to understand limitations and to know that there can be issues if we do an FTP estimation from a 20 minute test, depending on the type of athletes we have, and also different variables. And the same is true for all other estimates – there might be a use for it but there is likely a margin of error, we might understand where that error comes from. And that maybe allows us to use the data more effectively, or sometimes we might even not know where the error is coming from. And that makes it a lot harder maybe to use the data effectively.

Dr. Stephen Seiler 1:15:32

I want to say just another – I work with companies a bit and I teach a course in sports technology and so in that respect, I’ve gotten to know some of these fantastic innovators that develop companies and try to bring technologies to the market – and what I find, and some of my colleagues, we find is that young companies, hungry, almost bankrupt, you know, engineers or scientists, they want to do things right and they want to get it right, they want to measure everything right and they’re just dedicated to their hardware development and their user interface, and they’re trying to make things happen – those are the ones I love working with. But if the company gets really successful, they start really making money selling a lot of product, what happens? Well, the engineers get pushed out of the head office and the marketing people come in and then they start to say, “Hello, we cannot afford to do another hardware iteration here. 2.0 is fine. But we’re going to do some more algorithmic work, we’re going to build in some new, some software changes so that we can make some more estimates”. Then we start talking about these big companies like Polar or Catapult, which does all the inertial measurement units for team sports; they can measure hundreds of variables, they will tell you. They don’t. They estimate hundreds, they measure a few – they become the world champions of estimates, because they don’t want to iterate hardware. That’s too expensive. They want to iterate software, because that’s relatively cheap to do. And your marketing whizzes can fool the public into believing that they’re buying a product that has a lot of new features that gives them more information when actually it’s fuzzier than ever. So this is just the nature of the beast that we need to – consumers need to be aware of how the process works in the technology business.

Marco Altini 1:17:35

Yes, right. I think we see it also in the wearables that we use these days, right? The technological innovations stopped almost 10 years ago, right? Once they introduced PPG, that was the huge change and improvement from the previous just accelerometers, right? So before you had the Fitbit, and now we have the PPG – so we have HRV, heart rate and everything. But then, the hardware has been exactly the same. For a long time now, there has been no innovation from a hardware point of view – sensing, actually sensing something differently – it’s just all software, with the problems that we discussed.

Dr. Stephen Seiler 1:18:09

An example of that is if the heart rate company, if Polar wanted to, they could build a belt that had a stretch transducer in it that could measure at least breathing frequency quite well, quite accurately. And it would be on first principles, it would actually be measuring thoracic excursion, you know, which is the gold standard way in the field to capture breathing frequency. But instead, they don’t want to do that because that would imply, you know, a big shift technologically and have a new hardware and going back to a chest strap – and they don’t want to do that they want to put everything into one device – and so instead they say, “well, we’ll measure it through heart rate variability”, and that’s not going to work because heart rate variability just keeps going down, down down as intensity increases and breathing frequency increases so you can’t capture that accurately, that – maybe at rest, you can be reasonably accurate but at 90% heart rate and going up a mountain, and your breathing frequency’s going up to 75, you’re not able to capture that – so it’s a dead end, I would argue it’s a dead end. It can’t work, at least across the whole spectrum. But it’s cheap. It’s relatively cheap to create an algorithm and give people a number. But it’s wrong. You can’t trust it. And it’s unfortunate.

Trevor Connor 1:19:22

I’ve got one final question here – and this might be a bit of a challenge, because we’ve been sitting here, we’ve beat up on metrics a bit, we’re really beating up on estimates here. So I just want to flip this around and ask the question, where can these estimates be valuable? Or can you think of particular estimates where you go, “even though that’s a bunch of calculations, I find that useful”?

Dr. Stephen Seiler 1:19:45

Well, I’m gonna go with the fundamentals. What’s going on up at the brain is important. It’s useful. It’s interesting. The fundamental is just “hey, how are you feeling,” right? That’s our good basic communication. “I feel tired” – so tired is a construct, right? It’s something that the, even here, we’re having to estimate, right? Our brains are estimating some Gestalt from lots of inputs, lots of different neurons that are firing less or more, and the net result is my daughter says “I feel tired”, right? But that only exists up in her brain. Well, it’s an estimate and I find it very important. So, I do think that, you know, perceived exertion, for example, RPE, that’s an estimate, you’re taking a very complex brain kind of – the brain is trying to capture information, it’s a construct up in the brain even and now we’re trying to metricize it, you know, turn it into a number and, Borg scale, if you really think about it, can it truly be possible? Can – is it conceivable that sitting on a sofa with an RPE of 6 watching your favorite silly show versus coming at the end of the Paris-Roubaix, three abreast with two of the other greatest cyclists in the world – who’s going to get their wheel ahead, driving heart rates at max, that that’s only 3.33 times higher perception of exertion than sitting on the sofa? Because that’s what the scale says, it goes from 6 to 20, you’re with me? It’s inconceivable. It can’t be true. But yet, the scale has existed for decades and it has some utility, but probably the scale should go from 0 to 500 if it was going to truly capture the difference in brain activity and perceptual, just, lightning going on up in my head in that moment, trying to get across the finish line first versus sitting doing absolutely nothing – but those are estimates and we’re using them. But I can’t even conceive that they’re truly representative of brain activity differences, if that makes sense. 
So I really wiped out Borg scale there – it’s just always captured my fant– it’s fascinating to think about.
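The 3.33 figure quoted above is simply the ratio of the two ends of the 6-to-20 scale. A short Python check of that arithmetic, with variable names chosen here for illustration:

```python
# Endpoints of the 6-20 rating scale as described in the conversation.
BORG_MIN, BORG_MAX = 6, 20

# Ratio between the highest and lowest possible ratings.
ratio = BORG_MAX / BORG_MIN
print(f"max rating is {ratio:.2f} times the min rating")
```

Reading the scale as a ratio is itself an assumption. The speaker’s point is that if it were read that way, the implied difference between resting and an all-out sprint finish would be implausibly small.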

Marco Altini 1:22:20

Yeah, for sure.

Trevor Connor 1:22:24

Brr. Winter. The air is cold. But again, back to conditioning and looking to rev up your training. If you haven’t already, now is a great time of year to reflect on the past season, specifically when it comes to data and recovery: two very important metrics in endurance sports. Visit Fast Talk Labs and take a look at our pathways on recovery and data analysis. These two in-depth guides can help you get the most from your offseason. See more at fasttalklabs.com/pathways. According to Coach Taylor Warren, while even perception has issues, it can still be a valuable coaching metric. Let’s hear what he has to say.

Taylor Warren 1:23:05

Think about these measurements, right? It’s like, yeah, power is a calculation. And then you have all these internal measurements, heart rate, HRV, core body temperature – these are all internal objective measurements. The measurement that I found to be – this is really old school – but I think one of the most impactful measurements is the subjective measurement of RPE, or just rate of perceived exertion. And it’s really just like learning how an effort feels, learning how an effort feels when you’re fatigued, learning how an effort feels when you’re fresh. And using that internal subjective measurement to guide your training in a purposeful way, I think is very, very valuable. And a big part of the training process is learning your body. It’s understanding the relationship between workload and fatigue. And if you can master this ratio, I think that goes a really long way in how you’re planning your training day-to-day and how you’re planning training blocks.

Marco Altini 1:23:59

I think as long as we understand that these are approximations, in some cases, you can use them, as long as maybe they check some of the boxes we talked about, right, in terms of, for example, being able to measure – or estimate, sorry – things reliably and how they change over time, at least within individuals so that there is some level of reliability. And, again, if we understand we are approximating something, and we are not taking it as the reference on which to base everything else that we do, or to interpret it, then in some cases, there can probably be a use.

Dr. Stephen Seiler 1:24:34

And in some of these measures like Readiness to Train and Profile of Mood States and things like that, they use what’s called a “Likert scale” – you know, they use the scale that goes from 1 to 7 where you completely disagree, and then in the middle is both yes and no, and then completely agree, 100% agree. And so it’s – so again, you’re taking some kind of a very fuzzy idea of “how much do I agree with this” and you metricize it, you turn it into a number. And what we – what the research shows is that the assumption of linearity, meaning that the difference between a one and a two is the same as the difference between a two and a three and a three and a four, that that interval is equivalent across that scale – the reality is that the human brain usually doesn’t actually treat it like that. The extremes are bigger, they’re less likely to be used – they feel, you know, if you go all the way down to a one that feels like, “wow”, you know, “he really disagrees” or “really agrees” so we tend to use the middle of the scale a bit and the variation is smaller. And even in cultures, the Italians may interpret a scale differently than the Scandinavians, just as a cultural bias. Or even within Norway, I would say the Norwegians up in the north, they’re more willing to use the extremes, they cuss more in their language, you know, if they don’t feel good, they’re gonna tell you, whereas where I come from in the Bible Belt of Norway, they’re going to stick to the middle of the scale. Well, this translates and transfers out into how they use these various measures related to training and so forth, that they – it’s difficult to get them to say, “I’m really tired, I really feel great”. They always say, “I feel fine”. Well, then – then your measure’s not very sensitive, right? 
Because you need the measure, you need people to say “when I feel really great, I want to use the sixes and sevens, when I feel really sucky I want to use the ones and twos, I don’t want to always just give the coach either a three, four or a five”, that’s not going to be very useful. So even these kinds of – that’s why POMS and some of these are challenging, and particularly they’re challenging to start comparing across people or across cultures or across teams, and so forth.
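The sensitivity problem described above can be illustrated with a small Python sketch. It assumes two hypothetical athletes going through the same training week, one willing to use the whole 1-to-7 scale and one who sticks to the middle. The ratings are invented for illustration and do not come from any study.

```python
from statistics import pstdev

# Hypothetical daily readiness ratings on a 1-7 scale over the same week.
uses_full_scale = [2, 6, 1, 7, 3, 6, 2]
sticks_to_middle = [3, 5, 3, 5, 4, 5, 3]

for label, scores in (("full scale", uses_full_scale),
                      ("middle only", sticks_to_middle)):
    print(f"{label}: range {min(scores)}-{max(scores)}, "
          f"spread (SD) {pstdev(scores):.2f}")
```

The second athlete’s ratings move in the same direction on the same days, but the signal is compressed into three values, so a real change in state shifts the number much less. That also suggests why raw scores are hard to compare between people who use the scale differently.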

Trevor Connor 1:26:51

So don’t be stoic.

Dr. Stephen Seiler 1:26:53

Just to add to the complexity here, we haven’t really talked about it but as a, you know, a property of these variables should be that they’re sensitive to change, if we want to use it as an effective monitoring tool, it’s sensitive to, if my state of readiness is changing, if my physiology is changing, it’s captured in a robust way by that variable. That’s what we’re looking for in our metrics, that they’re valid, they’re reliable, but they’re also sensitive to real changes.

Trevor Connor 1:27:27

Marco, final thoughts?

Marco Altini 1:27:29

I would say, stick to measurements. And if you really want to go into estimates, try to look at estimates that are consistent across devices that can help you understand which ones we are actually able to estimate with reasonable accuracy and reliability as opposed to the ones that we might not be able to yet.

Takehomes and Final Remarks

Trevor Connor 1:27:50

Well, I hate to say it, guys, we’ve been going a while now so I think we need to wrap this up. Marco, you’re new to the show, we always finish with what we call “our one minutes” where we give everybody on the show one minute to summarize what they think is the most important lesson to learn from the entire episode. So Marco, we’ll give you a little time to think about it. Dr. Seiler, why don’t you go first?

Dr. Stephen Seiler 1:28:14

Okay, think of it as a heads-up display, you want to have the minimum number of metrics or measurements that give you the maximum amount of information. So be careful what you trust, and keep it simple, as much as possible – so, you know, yes, measure – but if you’re measuring 12 different things every day, probably that’s not helping, you know, and think like the fighter pilot, they’ve got to keep their eyes up, looking forward at what matters. And they need a few numbers on their display but they can’t be too many. So pick your metrics carefully.

Trevor Connor 1:28:49

Marco, do you have your thoughts?

Marco Altini 1:28:51

Measuring can be useful. It’s not necessarily for everyone, right? We might have people that have a healthy relationship with the measures they are looking at and people that don’t. We might end up risking focusing too much on what we are able to measure, sometimes losing focus on the actual outcomes that matter, which could be health- or performance-related, simply because they might not be as easy to quantify – so those are things I think that we need to think about apart from everything that we just mentioned, related to which metrics are derived in which ways and what that means in terms of trusting them more or less.

Trevor Connor 1:29:35

Good answer. So I guess I’ll wrap this up here. Mine is going to take a little more than a minute –

Dr. Stephen Seiler 1:29:40

Oh, now, you’re cheating.

Trevor Connor 1:29:41

I’m cheating a little.

Dr. Stephen Seiler 1:29:42

You’re not supposed to cheat, you’re cheating.

Trevor Connor 1:29:44

Remember, I only have something that measures five minutes and I don’t have something that measures one minute, so… so mine is, I think the most valuable lesson I ever learned about measurements and calculations and estimates was a class that I took on body composition – so for anybody listening who doesn’t know what that means, it’s basically just measuring body fat percentage – and we spent this entire course learning all the different methods of measuring body composition, our professor had spent most of his career studying this, and I couldn’t resist at the end of the class, I asked him “so which is most accurate?” And he goes, “eye-test”. Being an idiot, I’m like, “what do you mean by eye-test?” And he goes, “you look at them”. And we continue this discussion and, you know, he brought up this really good point that no matter what method – hydrostatic weighing, you know, bioelectric impedance, you know, your, your calipers, all these different methods – he said, “ultimately, you’re using the calculations”. And this gets a little bit gross, but how did they come up with those calculations? They took corpses, they did these measurements on them, and then they literally cut up the corpses and weighed the fat tissue, the muscle tissue, the bone tissue to figure it out. So if you are a similar body type to those corpses, you’re gonna get a somewhat accurate measurement. If you’re not, it’s actually not going to be that great for you. So it’s still an estimate. Now, does that mean we throw all these out? No, I still have one of those scales that does the bioelectric impedance. And it gives me some useful information. But I know that it’s not perfect. It’s not accurate. What I learned from this class is, it’s the eye test. And the equivalent to me in endurance sports is “it’s feel”, it’s RPE. I think all these metrics are great. But at the end of the day, you got to trust feel, it is the best.

Dr. Stephen Seiler 1:31:42

I don’t know what it is, maybe they don’t trust their own eyes, or they’re looking for a confirmation, they’re looking for something that tells them something different than that reality they see or they feel. And, sometimes like you Trevor, just how you feel and what you see, those still are pretty useful measures.

Trevor Connor 1:31:57

Maybe you’ve just hit on the most valuable aspect of metrics: they are a reality check.

Marco Altini 1:32:03

Yeah, I think they lead to awareness in many situations, right? So the ultimate goal is that we rely on feel and perception and all of that more as athletes, but the data can help, I think, in that process. I think for some people, then it goes sideways and you think about the metric and you use the metric and you completely ignore how you actually feel and obviously in that case, that’s not the proper way of using the metric, but in many cases, I think it helps us to pause for a second and self-assess how we feel and over time become maybe better at using feel and perceived effort.

Dr. Stephen Seiler 1:32:38

We recalibrate our own feeling or perception sometimes, you know, and that can be useful.

Trevor Connor 1:32:43

Well, guys, I hate to say it, we need to call it there but that was a great episode – it was a lot of fun talking.

Dr. Stephen Seiler 1:32:48

Thanks, Marco, for being part of this and, Trevor, thanks for pulling this off.

Trevor Connor 1:32:52

My pleasure.

Marco Altini 1:32:53

Thank you everyone.

Trevor Connor 1:32:54

That was another episode of Fast Talk. The thoughts and opinions expressed on Fast Talk are those of the individual. Subscribe to Fast Talk wherever you prefer to find your favorite podcasts. Be sure to leave us a rating or review. As always, we love your feedback. Tweet us at @fasttalklabs. Join the conversation at forums.fasttalklabs.com. Or learn from our experts at fasttalklabs.com. For Dr. Stephen Seiler, Dr. Marco Altini, Jack Burke, Brady Holmer, Dr. Jeff Sankoff, and Taylor Warren, I’m Trevor Connor. Thanks for listening!
