Welcome to part two of the Steppin’ Up Challenge analysis! If you missed part one, I encourage you to check it out here.
Last time we looked at the scatterplot of everyone’s 4 week totals. Today I want to look more closely at a few individuals, to see exactly how much I can intuit about their lives from the Steppin’ Up data. In data science, there is a line between comforting (oh thanks, Google Calendar, for putting my flight information from Gmail onto my calendar!) and creepy (How on earth did Hulu know, based on my viewing preferences alone, that I am an unmarried 30-35 year old male?).[theory ref] [google ref] I’m going to look around at a few interesting things and see what we can figure out. But first I have to focus my attention, because I don’t have time to crawl through each person’s data manually.
A couple of individuals stand out for exceptionally high or exceptionally low steps per minute. They can be found on the scatterplot by looking for the people who are furthest from the linear regression line. To do this, imagine a line drawn from each point perpendicular to the linear regression line. We pick the competitors who are associated with the longest purple perpendicular lines. In this case, P, G, and L are all similarly above the linear regression line. These three competitors all took more steps per minute of exercise than the average. Competitors N, A, and F are all similarly below the linear regression line. These competitors logged more active minutes than the average competitor, yet they have lower step counts. I’m going to choose competitors P and N for a deeper dive. I’m choosing ‘P’ because I know she is a runner. Let’s call her ‘Powerhouse’. And I’m choosing ‘N’ because that one’s me, and it seems only fair that I analyze my own data in public as much as I’m analyzing anyone else’s. I’ll call myself ‘Namaste’.
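For anyone who wants to try the "furthest from the line" selection on their own data, here is a minimal sketch. The numbers in the usage lines are made up purely for illustration (they are not the challenge data), and the function name is my own. It fits a least-squares line with NumPy and returns each point's signed perpendicular distance from it.

```python
import numpy as np


def perpendicular_distances(minutes, steps):
    """Signed perpendicular distance of each (minutes, steps) point
    from the least-squares line; positive means above the line."""
    minutes = np.asarray(minutes, dtype=float)
    steps = np.asarray(steps, dtype=float)
    # Fit steps = slope * minutes + intercept.
    slope, intercept = np.polyfit(minutes, steps, 1)
    # Vertical residual of each point from the fitted line.
    residuals = steps - (slope * minutes + intercept)
    # Point-to-line distance: residual scaled by sqrt(slope^2 + 1).
    return residuals / np.hypot(slope, 1.0)


# Hypothetical 4 week totals, for illustration only.
labels = ["A", "B", "C", "D", "E"]
total_minutes = [1000, 2000, 3000, 4000, 5000]
total_steps = [300000, 340000, 460000, 420000, 500000]

distances = perpendicular_distances(total_minutes, total_steps)
furthest_above = labels[int(np.argmax(distances))]
furthest_below = labels[int(np.argmin(distances))]
print(furthest_above, furthest_below)
```

One thing worth noticing: the perpendicular distance is just the ordinary vertical residual divided by a constant, so ranking competitors by either measure picks out the same people.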
Across all days, steps and minutes combined, Powerhouse completed 246 steps per active minute. In contrast, Namaste only completed 145 steps per active minute. If we subtract the 9124 steps we computed in the last article as the standard deduction, then Powerhouse has 84 steps/min and Namaste has 19 steps/min. From the last article we know that the group average was 49 steps/min. Thus, Powerhouse has almost twice the group average and Namaste has less than half the group average. Clearly there is a lot of variation!
Let’s take a look at Powerhouse’s time series of minutes and steps to try to understand more about her month.
It looks like Powerhouse had at least three days where she took a long run: the first day of the competition, as well as two other days where her steps spike above 20,000. However, it’s also possible there were other days where Powerhouse went for a run. For example, we can see that there are a few days in early July where Powerhouse had more than 15,000 steps for 4 days in a row. This is probably not just casual walking. There are probably some short runs in there, just based on the steps. There is also a very interesting relationship between the minutes and the steps graphs.
Why are there some days where the number of minutes seems to peak (or valley) at the same time as her steps? Well, let’s take July 5th. Here is a low point in both steps and minutes. Chances are good Powerhouse took a rest day and only did the obligatory walking required by her normal day. But what about her peak days: June 27th (25458 steps, 214 min) and July 6th (22963 steps, 152 min)? Probably these are days where she did a lot of activity. We can figure out her steps per active minute on those days by subtracting off the average steps that can’t be associated with exercise. But we don’t want to use the group average (9124) like we did above; we want to find Powerhouse’s personal average of non-activity based steps. Personalization! It’s what makes things comforting and/or creepy.
Powerhouse has a non-activity average of 8865 steps per day. Thus, we can determine that June 27th has 16593 exercise-based steps over 214 min and July 6th has 14098 exercise-based steps over 152 min. On these days she is hitting 77.5 steps/min and 92.75 steps/min. These are certainly underestimates of her steps/min when she is actually running. They come from data points which land on the upper right of the scatterplot, almost exactly on the linear regression line. Last time we calculated a few basic steps/min ratios for common activities (see below). From this quick analysis we can conclude that she is probably including some low intensity activity (like walking) on her highest activity days.
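The subtraction above is easy to reproduce. Here is a small sketch using only the numbers quoted in this post; the function and variable names are my own, chosen for illustration.

```python
def exercise_steps_per_minute(total_steps, active_minutes, baseline_steps):
    """Steps per active minute after removing the non-activity baseline."""
    if active_minutes <= 0:
        raise ValueError("active_minutes must be positive")
    return (total_steps - baseline_steps) / active_minutes


# Powerhouse's personal non-activity average, in steps per day.
POWERHOUSE_BASELINE = 8865

# (total steps, active minutes) for her two peak days.
peak_days = {
    "June 27th": (25458, 214),
    "July 6th": (22963, 152),
}

for day, (steps, minutes) in peak_days.items():
    rate = exercise_steps_per_minute(steps, minutes, POWERHOUSE_BASELINE)
    print(f"{day}: {steps - POWERHOUSE_BASELINE} exercise steps, "
          f"{rate:.2f} steps/min")
```

This gives roughly 77.5 and 92.75 steps/min for the two days. Swapping in the group baseline of 9124 instead of the personal one shifts both rates down slightly, which is the whole point of personalizing the deduction.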
But perhaps it’s possible to cherry-pick a particular day where Powerhouse only went for a run. And perhaps we can figure out how fast she runs? Or at least get a lower bound on her speed. Let’s take the days which are furthest from her average. One of the days which is furthest from the linear regression line is June 22nd. On this day, Powerhouse did 22 min of activity and took 16362 steps, which translates to approximately 7497 exercise steps. When we look at this day on her time series, there is nothing particularly notable about it. Clearly the relationship between steps and active minutes is not a simple one. So, the linear extrapolation we are making is probably not the most reliable method of determining running speed. If we translate Powerhouse’s numbers, she did roughly 340 steps per exercise minute on June 22nd, which means, based on this data, she ran 6 minute miles for almost 4 miles. My next step was to validate. How fast did Powerhouse normally run during this month? How close will I get to her actual run speed?
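Here is the June 22nd arithmetic as a quick sketch. The step and minute figures are the ones quoted above. The conversion from steps to miles is an assumption on my part: I use a round 2,000 steps per mile, which is not a measured value for Powerhouse.

```python
# Figures quoted in the post for June 22nd.
TOTAL_STEPS = 16362
ACTIVE_MINUTES = 22
PERSONAL_BASELINE = 8865  # Powerhouse's non-activity steps per day

# ASSUMPTION (not measured): about 2,000 running steps per mile.
ASSUMED_STEPS_PER_MILE = 2000


def implied_pace(total_steps, active_minutes, baseline, steps_per_mile):
    """Return (exercise_steps, steps_per_min, miles, minutes_per_mile)."""
    exercise_steps = total_steps - baseline
    steps_per_min = exercise_steps / active_minutes
    miles = exercise_steps / steps_per_mile
    minutes_per_mile = active_minutes / miles
    return exercise_steps, steps_per_min, miles, minutes_per_mile


exercise_steps, rate, miles, pace = implied_pace(
    TOTAL_STEPS, ACTIVE_MINUTES, PERSONAL_BASELINE, ASSUMED_STEPS_PER_MILE
)
print(f"{exercise_steps} exercise steps at {rate:.0f} steps/min")
print(f"about {miles:.2f} miles at {pace:.2f} min/mile")
```

Under that assumption the sketch lands at a little under 4 miles and a little under 6 minutes per mile. The result is only as good as the two inputs it leans on: the stride assumption, and the idea that everything above the baseline happened inside those 22 active minutes.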
I spoke with Powerhouse and she told me something very unexpected! She told me that she didn’t run at all. Nada. Zip. Zilch. She also told me that she tore her MCL (a ligament on the inner side of the knee) a while back and hasn’t been able to run since. All activity in this challenge was walking, biking, and workout videos. Her job involves a lot of one-on-one conversations, which are generally done while walking. The good news is that our computed step rate of 77.5 to 92.75 steps/min was definitely walking. And this implies that her highest stepping days mainly consisted of many hours of walking (because that’s what most days consisted of). So, some of the conclusions were reasonable. However, my attempts to cherry-pick a running speed were totally thwarted.
What a fabulous and terrible dilemma! At this point, I could definitely go back and edit this post to disguise and hide my previous attempt to determine her run speed. If this were a peer-reviewed article, I almost certainly would decline to share my false conclusions. But this isn’t peer reviewed! And I think there is something really valuable about seeing that even a professional can be led astray when she (or he) starts trying to get more information out of a data set than is appropriate. If I continue to slice and dice the data into smaller partitions, I’m likely to find all kinds of weird things that may or may not have anything to do with reality. Data Skeptic recently published a podcast with Chris Stucchio about multiple comparisons and p-hacking. So, I want to leave my erroneous analysis in the article, because this is a great problem in data mining. How far do you mine before you are just making stuff up?
In summary, I thought it could be really cool to figure out Powerhouse’s run speed. But, in fact, she didn’t run at all. My basic assumption about my predictive methods was wrong. Perhaps I found a day where she did a workout video (lots of steps over a short period of time). Or perhaps there was some combination of exercise that just didn’t hit the 10 minute mark to make her FitBit count it as active minutes. Or perhaps there was some human error in reporting (which is also totally possible). Who knows! My original basis for inquiry (that she was running) was wrong, so many of my subsequent conclusions were also wrong. Perhaps there is a moral here about understanding the context of the problem before diving into the math?
This brings up a tangential but interesting point: data science can only be creepy if it’s accurate. We’ve probably all received advertisements on social media that are totally off base. In this situation we are amused or annoyed, but we are definitely not creeped out. I expect a computer algorithm not to understand me, so when it fails to predict something about me, I’m not surprised. But there are always human assumptions behind those algorithms. There’s someone back there fundamentally assuming that Powerhouse is running, even when she isn’t. They might assume that I’m single because I’m watching the Bachelorette, even though I’m not. What I’m trying to allude to is the idea that there can be metaphorical fingerprints on data science results, depending on how the model was built. And in this case, they led me to very wrong conclusions about Powerhouse.
Tune in next time when I divulge the details of my month and conclude the series of posts on this challenge.