Steppin’ Up Challenge, Part 2

Welcome to part two of the Steppin’ Up Challenge analysis! If you missed part 1, I encourage you to check it out here.

Last time we looked at the scatterplot of everyone’s 4 week totals. But today I want to look more closely at a few individuals. The reason for doing this is to see exactly how much I can intuit about their lives from the Steppin’ Up data. In data science, there is a line between comforting (oh thanks Google Calendar for putting my flight information from Gmail onto my calendar!) and creepy (How on earth did Hulu know, based on my viewing preferences alone, that I am an unmarried 30-35 year old male?).[theory ref] [google ref] I’m going to look around at a few interesting things and see what we can figure out. But first I have to focus my attention, because I don’t have time to crawl through each person’s data manually.

There are a couple of individuals who stand out for having exceptionally high or exceptionally low steps per minute. These individuals can be found on the scatterplot by looking for the people who are furthest from the linear regression line. To do this, imagine a line drawn from each competitor’s point perpendicular to the linear regression line. We pick the competitors who are associated with the longest purple perpendicular lines. In this case, P, G, and L are all similarly far above the linear regression line. These three competitors all took more steps per minute of exercise than the average. Competitors N, A, and F are all similarly far below the linear regression line. Thus these competitors logged more active minutes than the average competitor while having lower step counts. I’m going to choose competitors P and N for a deeper dive. I’m choosing ‘P’ because I know she is a runner. Let’s call her ‘Powerhouse’. And I’m choosing ‘N’ because that one’s me, and it seems only fair that I analyze my data in public as much as I’m analyzing anyone else’s. I’ll call myself ‘Namaste’.
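As a rough sketch of how "furthest from the regression line" could be computed, the snippet below measures the signed perpendicular distance from a point to the line steps = m * minutes + b. The slope (49) and intercept (255487) are the values reported in Part 1; the two competitor totals are hypothetical, made up purely for illustration. Keep in mind that perpendicular distance depends on how the two axes are scaled, so this is one possible way to rank outliers, not necessarily how the graph was drawn.

```python
import math


def signed_perpendicular_distance(minutes, steps, slope, intercept):
    """Signed distance from the point (minutes, steps) to the line
    steps = slope * minutes + intercept. Positive means above the line."""
    residual = steps - (slope * minutes + intercept)
    return residual / math.sqrt(slope ** 2 + 1)


# Regression values reported in Part 1 (4 week totals).
SLOPE = 49.0
INTERCEPT = 255487.0

# Hypothetical (active minutes, total steps) pairs, NOT real competitor data.
competitors = {"X": (2000, 400000), "Y": (3000, 380000)}

distances = {
    name: signed_perpendicular_distance(m, s, SLOPE, INTERCEPT)
    for name, (m, s) in competitors.items()
}

# Rank by how far each point sits from the line, in either direction.
for name in sorted(distances, key=lambda n: abs(distances[n]), reverse=True):
    side = "above" if distances[name] > 0 else "below"
    print(f"{name}: {abs(distances[name]):.1f} units {side} the line")
```

Competitors above the line get more steps per active minute than the fit predicts, and competitors below it get fewer.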

Scatterplot_distance

Across all days, steps and minutes combined, Powerhouse completed 246 steps per active minute. In contrast, Namaste only completed 145 steps per active minute. If we subtract the 9124 steps per day we computed in the last article as the standard deduction, then Powerhouse has 84 steps/min and Namaste has 19 steps/min. From the last article we know that the group average was 49 steps/min. Thus, Powerhouse has almost twice the group average and Namaste has less than half the group average. Clearly some variation is possible!

Let’s take a look at Powerhouse’s time series of minutes and steps to try to understand more about her month.

Powerhouse_time_series_socialmath

It looks like Powerhouse had at least three days where she took a long run: the first day of the competition, as well as two other days where her steps spike above 20,000 steps per day. However, it’s also possible there were other days where Powerhouse went for a run. For example, we can see that in early July Powerhouse had more than 15,000 steps for 4 days in a row. This is probably not just casual walking. There are probably some short runs in there, just based on the steps. However, there is a very interesting relationship between the minutes and the steps graphs.

Why are there some days where the number of minutes seems to peak (or valley) at the same time as her steps? Well, let’s take July 5th. Here is a low point in both steps and minutes. Chances are good that Powerhouse took a rest day and only did the obligatory walking required by her normal day. But what about her peak days: June 27th (25458 steps, 214 min) and July 6th (22963 steps, 152 min)? These are probably days where she did a lot of activity. We can figure out her steps per active minute on those days by subtracting off the average steps that can’t be associated with exercise. But we don’t want to use the group average (9124) like we did above; we want to find Powerhouse’s personal average of non-activity based steps. Personalization! It’s what makes things comforting or creepy.

Powerhouse_Scatter_socialmath

Powerhouse has a non-activity average of 8865 steps per day. Thus, we can determine that June 27th has 16593 exercise based steps over 214 min and July 6th has 14098 exercise based steps over 152 min. On these days she is hitting 77.5 steps/min and 92.75 steps/min. These are certainly underestimates of her steps/min when she is actually running. They come from data points which land on the upper right of the scatterplot, almost exactly on the linear regression line. Last time we calculated a few basic steps/min ratios for common activities (see below). From this quick analysis we can conclude that she is probably including some low intensity activity (like walking) on her highest activity days.
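The arithmetic behind those two rates can be sketched in a few lines. The baseline and the daily figures are the ones quoted in this post; the function and variable names are just illustrative.

```python
def exercise_steps_per_minute(total_steps, active_minutes, baseline_steps):
    """Steps per active minute after removing the daily non-activity baseline."""
    if active_minutes <= 0:
        raise ValueError("active_minutes must be positive")
    return (total_steps - baseline_steps) / active_minutes


# Powerhouse's personal non-activity average (steps per day), from the post.
POWERHOUSE_BASELINE = 8865

# (total steps, active minutes) on her two peak days, from the post.
peak_days = {"June 27": (25458, 214), "July 6": (22963, 152)}

rates = {
    day: exercise_steps_per_minute(steps, minutes, POWERHOUSE_BASELINE)
    for day, (steps, minutes) in peak_days.items()
}

for day, rate in rates.items():
    print(f"{day}: {rate:.2f} exercise steps per active minute")
```

Swapping in the group baseline of 9124 instead of the personal one would lower both rates slightly, which is why the personal baseline matters here.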

steps_per_min_exercise_SocialMath

But perhaps it’s possible to cherry-pick a particular day where Powerhouse only went for a run. And perhaps we can figure out how fast she runs? Or at least get a lower bound on her speed. Let’s take the days which are furthest from her average. One of the days which is furthest from the linear regression line is June 22nd. On this day, Powerhouse did 22 min of activity and took 16362 steps, which translates to approximately 7497 exercise steps. When we look at this day on her time series, there is nothing particularly notable about it. Clearly the relationship between steps and active minutes is not a simple one. So, the linear extrapolation we are making is probably not the most reliable method of determining running speed. If we translate Powerhouse’s numbers, she did about 340 steps per exercise minute on June 22nd. This means that, based on this data and roughly 2000 steps per mile, she ran 6 minute miles for almost 4 miles. My next step was to validate. How fast did Powerhouse normally run during this month? How close will I get to her actual run speed?
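Here is a small sketch of that extrapolation, under the assumptions in play at this point in the analysis: that every exercise step on June 22nd came from running, and that a mile is about 2000 steps (the rough figure from Part 1). The function name is hypothetical.

```python
def implied_run(total_steps, active_minutes, baseline_steps, steps_per_mile=2000):
    """Return (steps per minute, miles, minutes per mile), assuming every
    step above the baseline was taken during the logged active minutes."""
    exercise_steps = total_steps - baseline_steps
    rate = exercise_steps / active_minutes
    miles = exercise_steps / steps_per_mile
    pace = active_minutes / miles
    return rate, miles, pace


# June 22nd figures and personal baseline, as quoted in the post.
rate, miles, pace = implied_run(16362, 22, 8865)

print(f"{rate:.0f} steps per exercise minute")
print(f"{miles:.2f} miles at {pace:.1f} minutes per mile")
```

Every number here inherits the assumption that the extra steps and the active minutes describe the same activity, which is exactly the kind of assumption worth checking.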

I spoke with Powerhouse and she told me something very unexpected! She told me that she didn’t run at all. Nada. Zip. Zilch. She also told me that she tore her MCL (a ligament on the inner side of the knee) a while back and hasn’t been able to run. All activity in this challenge was walking, biking, and workout videos. Her job involves a lot of one-on-one conversations, which are generally done while walking. The good news is that our computed step rate of 77.5 to 92.75 steps/min was definitely walking. And this implies that her highest stepping days mainly consisted of many hours of walking (because that’s what most days consisted of). So, some of the conclusions were reasonable. However, my attempts to cherry-pick a running speed were totally thwarted.

What a fabulous and terrible dilemma! At this point, I could definitely go back and edit this post to disguise and hide my previous attempt to determine her run speed. If this were a peer-reviewed article, I would almost certainly decline to share my false conclusions. But this isn’t peer reviewed! And I think there is something really valuable about seeing that even a professional can be led astray when she (or he) starts trying to get more information out of a data set than is appropriate. If I continue to slice and dice the data into smaller partitions, I’m likely to find all kinds of weird things that may or may not have anything to do with reality. Data Skeptic recently published a podcast with Chris Stucchio about multiple comparisons and p-hacking. So, I want to leave my erroneous analysis in the article, because this is a great problem in data mining. How far do you mine before you are just making stuff up?

In summary, I thought it could be really cool to figure out Powerhouse’s run speed. But, in fact, she didn’t run at all. My basic assumption behind my predictive methods was wrong. Perhaps I found a day where she did a workout video (lots of steps over a short period of time). Or perhaps there was some combination of exercise that just didn’t hit the 10 minute mark to make her FitBit count it as active minutes. Or perhaps there was some human error in reporting (which is also totally possible). Who knows! My original basis for inquiry (that she was running) was wrong, so many of my subsequent conclusions were also wrong. Perhaps there is a moral here about understanding the context of the problem before diving into the math?

This brings up a tangential but interesting point: data science can only be creepy if it’s accurate. We’ve probably all received advertisements on social media that are totally off base. In this situation we are amused or annoyed, but we are definitely not creeped out. It’s like I expect a computer algorithm to not be able to understand me. So when it can’t predict something about me, I’m not surprised. I expect that. But there are always human assumptions behind those algorithms. There’s someone back there fundamentally assuming that Powerhouse is running, even when she isn’t. They might assume that I’m single because I’m watching the Bachelorette, even though I’m not. What I’m trying to allude to is the idea that there can be metaphorical fingerprints on data science results, depending on how the model was built. And in this case, they brought me to very wrong conclusions about Powerhouse.

Fascinating!

Tune in next time when I divulge the details of my month and conclude the series of posts on this challenge.

 

Posted in Exercise

Steppin’ Up Challenge, Part 1

Last month, I participated in a team building “Steppin Up Challenge” at work. We all had pedometers and logged our daily step count for 4 weeks. We also included our number of active minutes. There were 20 individual competitors who signed up for the event. For the next few posts I’m going to dig around the data and share some of the most interesting mathematical findings in this competition.
steppin_challenge_shoes

In total, the 19 people in our competition walked a total of 7.1 million steps in 4 weeks. That’s roughly 13,383 steps per person per day. Wow! Now, we’ve probably all heard the recommendation that each person should take 10,000 steps per day. Why is that? The Centers for Disease Control and Prevention actually recommends getting 150 minutes of activity a week. For walkers, this amounts to about 10,000 steps per day, speed depending. So, on average, our competitors walked more than the suggested number of steps. Good job to us! But the next question is: did we achieve the minutes of exercise goal as well?

As it turns out, we were completing an average of 607 active minutes per week, or about 10 hours per week. Definitely above the CDC’s recommended values. So, how is it possible that we exceeded the goal of 10,000 steps per day by only about 34%, but we completely smashed the goal of 150 active minutes per week, reaching roughly 400% of it?

This is probably because we were doing things that were active, but not step generating. Consider bicycling. A friend gave me some rough math from his bike commute; he found that he got about 450 “steps” on his pedometer for each mile he biked. This is less than a quarter of what he would get if he actually walked the mile. But then we also need to consider that he is taking less time to travel that mile. So a pedestrian might spend 20 minutes to walk 2000 steps (=1 mile) while a bicyclist could spend 6 minutes biking the same distance to get 450 steps. If we normalize these, the pedestrian gets 100 steps per minute while the bicyclist gets 75 steps per minute. What about runners? If a runner can run an eight minute mile, they are running 2000 steps in 8 min. Thus a runner does approximately 250 steps per minute. (edit: normal runners do not do 250 steps/min! Because a running stride is longer than a walking stride, it takes fewer steps to complete a mile. Perhaps a reasonable value would be 1500 steps/mile for an 8-10 minute mile. Thus an 8-10 minute mile would work out to roughly 150-190 steps/min.)
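The normalization above is just steps divided by minutes. A quick sketch using the rough per-mile figures from this post (these are the post’s ballpark numbers, not measured values):

```python
# (steps, minutes) to cover one mile, using the rough figures quoted in the post.
one_mile = {
    "walking": (2000, 20),
    "biking": (450, 6),
    "running, 8 min mile": (2000, 8),
}

steps_per_minute = {
    name: steps / minutes for name, (steps, minutes) in one_mile.items()
}

# Revised running estimate from the edit note: about 1500 steps per mile.
REVISED_RUN_STEPS_PER_MILE = 1500
revised_fast = REVISED_RUN_STEPS_PER_MILE / 8   # 8 minute mile
revised_slow = REVISED_RUN_STEPS_PER_MILE / 10  # 10 minute mile

for name, rate in steps_per_minute.items():
    print(f"{name}: {rate:.0f} steps/min")
print(f"revised running: {revised_slow:.1f} to {revised_fast:.1f} steps/min")
```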

steps_per_min_exercise_SocialMath

So, what was the average steps per minute of our competitors? Are we more like runners or like walkers?

The average steps per active minute for the competitors over the 4 weeks of our challenge was 165.9 steps per minute of exercise. Wait a second- this is a value that is WAY above the steps/min we just computed for pedestrians and bicyclists! Why could that be? One option is that we are all runners. But I know I’m not a runner and did no running during the competition. I also know of many others in the same boat. We didn’t get high steps because we are avid runners. So why is that value so high?

Well, this is probably because there are a certain number of steps you take in a day that are definitely not associated with activity. All the steps you earn from walking back and forth from the refrigerator to get another diet orange Fanta don’t count towards your active minute total. But how do we decide what counts as activity and what does not? For this challenge, we decided to use the definition of active minutes that FitBit uses:

“Active Minutes” are awarded after 10 minutes of continuous moderate-to-intense activity. This includes walking at a brisk pace:

  • E.g. You walk briskly to work for 11 minutes = 11 active minutes
  • E.g. You walk briskly to get lunch in 5 minutes = 0 active minutes

So, we should be able to partition the steps into activity based steps and non-activity based steps. That is, there must be some average number of steps that everyone takes every day that have nothing to do with the active exercise one does. This is a great question for a basic linear regression. With a linear regression, we are finding the best fit line of the data. We’ll get our regression results as an “m” and “b” as part of the y = mx + b equation for a line. The m tells us what our average steps per active minute was. And the b is the y-intercept of the line; in this situation it represents the number of steps that we all took regardless of activity. For our data, b = 255487 steps. Take a look at the graph below for a scatterplot of the 19 competitors as well as the best fit line.

Scatter_4wk_socialmath
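For readers who want to try this at home, here is a minimal least-squares sketch in plain Python. The minutes and steps lists are hypothetical (they are constructed to sit exactly on a line so the result is easy to check) and are not the competition data. Only the intercept of 255487 and the 28 day length come from the challenge.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical 4 week totals lying exactly on steps = 50 * minutes + 250000.
minutes = [1000, 2000, 3000, 4000]
steps = [50 * m + 250000 for m in minutes]

m, b = fit_line(minutes, steps)
print(f"m = {m:.1f} steps per active minute, b = {b:.0f} baseline steps")

# Converting the intercept reported in the post to a per-day baseline.
REPORTED_INTERCEPT = 255487
DAYS_IN_CHALLENGE = 28
baseline_per_day = REPORTED_INTERCEPT / DAYS_IN_CHALLENGE
print(f"baseline: about {baseline_per_day:.1f} steps per day")
```

With real data the points will scatter around the line, and m and b become the averages described above.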

So if b = 255487 for the four weeks, this means we are, on average, taking 9124 steps per day that are not exercise related. For reference, the average American supposedly walks between 300 and 3000 steps per day. Clearly there is a selection bias in our competitors! The people who signed up for the Steppin’ challenge have active lives even when they are not exercising. (Well, except maybe contestant R, who appears to be a bit of an outlier.)

Now let’s get back to the steps per minute computation. We want to see if our competitors are more like bikers (75 steps per minute) or runners (250 steps per minute). We originally computed 165.9 steps per active minute, but this included the 9124 steps we took every day that had nothing to do with exercise. If we subtract that average from everyone and recompute the steps per active minute we get: 48.7 steps per active minute.
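A back-of-the-envelope version of that recomputation can be done with the group averages quoted in this post. This is only a sketch: it uses rounded group-level figures rather than each competitor’s own numbers, so it lands close to the adjusted value reported here without reproducing it exactly.

```python
# Group averages quoted in the post.
AVG_STEPS_PER_DAY = 13383          # steps per person per day
BASELINE_STEPS_PER_DAY = 9124      # non-exercise steps per day (intercept / 28)
ACTIVE_MINUTES_PER_WEEK = 607      # average active minutes per person per week

exercise_steps_per_week = (AVG_STEPS_PER_DAY - BASELINE_STEPS_PER_DAY) * 7
adjusted_rate = exercise_steps_per_week / ACTIVE_MINUTES_PER_WEEK

print(f"{exercise_steps_per_week} exercise steps per week")
print(f"{adjusted_rate:.1f} steps per active minute")
```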

That feels really low! (Also note, the m from the regression gives us the slope between active minutes and steps. In this case, m = 49 steps per active minute. This is pretty close to our gross approximation of the 48.7 steps per minute.) As mentioned earlier, it’s probable that we were all doing exercise that doesn’t generate a lot of steps. But, we are even lower than biking? How?

Well, pedometers are very bad at measuring activity that doesn’t involve walking motions. And I know from anecdotal conversations that contestant M lifts a lot of weights. Weight lifting is definitely physical activity even though you get very few steps per minute. And personally, I do a lot of CorePower yoga. Yoga is definitely low step count. But if my sweat is any indication, it’s a good workout! Based on my experience, I get about 200 steps per hour of yoga. That’s only 3.3 steps per minute. If a bunch of contestants were all doing a few hours of activity that involved only 200 steps/hour, then our global average would definitely drop below bicycling levels.
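To see how quickly low-step activity drags the average down, here is a small sketch of a time-weighted blend. The weekly split (5 hours of walking, 5 hours of yoga) is hypothetical; the 100 steps/min walking rate and 200 steps/hour yoga rate are the rough figures from this post.

```python
def blended_rate(sessions):
    """Time-weighted steps per active minute.
    sessions is a list of (minutes, steps_per_minute) pairs."""
    total_minutes = sum(minutes for minutes, _ in sessions)
    total_steps = sum(minutes * rate for minutes, rate in sessions)
    return total_steps / total_minutes


WALKING_RATE = 100.0      # steps per minute, from the post
YOGA_RATE = 200 / 60      # 200 steps per hour, from the post

# Hypothetical week: 300 minutes of walking and 300 minutes of yoga.
week = [(300, WALKING_RATE), (300, YOGA_RATE)]
rate = blended_rate(week)

print(f"blended rate: {rate:.1f} steps per active minute")
```

Even with half the active time spent walking briskly, the blend in this made-up week falls below the 75 steps per minute estimated for biking.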

Based on this initial analysis, I’m concluding that this group does a lot of walking in their day to day lives. In contrast, the exercise completed by this group (on the whole) involves a lot of low stepping activities. However, there are some exceptions to this rule… But that’s a topic for next time!

Posted in Exercise

Pinpoint the moment when…

Graphs and data are always there for us, especially if we want to look back in time and find the exact moment that something changed. Today I’m presenting a bit of a poem of images about time series.

  • Here’s the exact moment that Robert De Niro “gave up” on his career.

  • In episode #576 of NPR’s podcast Planet Money, they highlight the moment when women stopped coding.

women_majors_npr

infection_graph

  • Even the Onion gets in on the action. They claim that FB has a new feature which tells you exactly when things are over.

Onion_Screenshot

Time series can be intuitive and very easy to understand. They are a great tool for looking at how something has changed over time. In fact, the Six Sigma methodology has built a whole theory around controlling processes with time series graphs.

Time series are great for helping pinpoint the moment when…

Posted in Business, Communicating Math, Dancing and Performance, Nature

Big Data

Big_data_is_like


Social Example 1

Futurama_Writer


Athlete Mathematicians?

SportsAndNerd

 
