Steppin’ Up Challenge, Part 1

Last month, I participated in a team-building “Steppin’ Up Challenge” at work. We all had pedometers and logged our daily step count for 4 weeks. We also included our number of active minutes. There were 20 individual competitors who signed up for the event. For the next few posts I’m going to dig around the data and share some of the most interesting mathematical findings in this competition.
In total, the 19 people in our competition walked 7.1 million steps in 4 weeks. That’s roughly 13,383 steps per person per day. Wow! Now, we’ve probably all heard the recommendation that each person should take 10,000 steps per day. Why is that? The Centers for Disease Control and Prevention actually recommends getting 150 minutes of activity a week. For walkers, this amounts to about 10,000 steps per day, depending on speed. So, on average, our competitors walked more than the suggested number of steps. Good job to us! But the next question is: did we achieve the minutes of exercise goal as well?

As it turns out, we averaged 607 active minutes per week, or about 10 hours per week. That is definitely above the CDC’s recommended value. So, how is it possible that we exceeded the goal of 10,000 steps per day by about 30%, but completely smashed the goal of 150 active minutes per week, logging roughly 400% of it?

This is probably because we were doing things that were active, but not step generating. Consider bicycling. A friend gave me some rough math from his bike commute; he found that he got about 450 “steps” on his pedometer for each mile he biked. This is less than a quarter of what he would get if he actually walked the mile. But then we also need to consider that he is taking less time to travel that mile. So a pedestrian might spend 20 minutes to walk 2000 steps (= 1 mile) while a bicyclist could spend 6 minutes biking the same distance to get 450 steps. If we normalize these, the pedestrian gets 100 steps per minute while the bicyclist gets 75 steps per minute. What about runners? If a runner can run an eight-minute mile, they are running 2000 steps in 8 minutes. Thus a runner does approximately 250 steps per minute.
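The normalization above is just steps per mile divided by minutes per mile. Here is a minimal sketch of that arithmetic; the inputs are the rough figures quoted above, and the dictionary and function names are made up for illustration.

```python
# Rough per-mile figures quoted in the post (estimates, not measurements).
EXAMPLES = {
    "walking": {"steps_per_mile": 2000, "minutes_per_mile": 20},
    "biking": {"steps_per_mile": 450, "minutes_per_mile": 6},
    "running": {"steps_per_mile": 2000, "minutes_per_mile": 8},
}


def steps_per_minute(steps_per_mile, minutes_per_mile):
    """Normalize a per-mile step count by the time taken to cover the mile."""
    return steps_per_mile / minutes_per_mile


rates = {name: steps_per_minute(**figures) for name, figures in EXAMPLES.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} steps per minute")
```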

[Figure: steps per minute by type of exercise]

So, what was the average steps per minute of our competitors? Are we more like runners or like walkers?

The average steps per active minute for the competitors over the 4 weeks of our challenge was 165.9 steps per minute of exercise. Wait a second: this is a value that is WAY above the steps/min we just computed for pedestrians and bicyclists! Why could that be? One option is that we are all runners. But I know I’m not a runner and did no running during the competition. I also know of many others in the same boat. We didn’t get high steps because we are avid runners. So why is that value so high?

Well, this is probably because there are a certain number of steps you take in a day that are definitely not associated with activity. All the steps you earn from walking back and forth from the refrigerator to get another diet orange Fanta don’t count towards your active minute total. But how do we decide what counts as activity and what does not? For this challenge, we decided to use the definition of active minutes that Fitbit uses:

“Active Minutes” are awarded after 10 minutes of continuous moderate-to-intense activity. This includes walking at a brisk pace:

  • E.g. You walk briskly to work for 11 minutes = 11 active minutes
  • E.g. You walk briskly to get lunch in 5 minutes = 0 active minutes
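Read literally, the rule quoted above is a threshold on the length of each continuous bout. The sketch below is a simplified reading of that quote, not Fitbit’s actual algorithm; the function name is hypothetical, and treating a bout of exactly 10 minutes as counting is an assumption.

```python
def active_minutes(bout_lengths, threshold=10):
    """Sum the minutes of bouts that last at least `threshold` minutes.

    Simplified reading of the quoted rule: a continuous bout of
    moderate-to-intense activity counts in full once it reaches the
    threshold, and counts for nothing otherwise. Counting a bout of
    exactly `threshold` minutes is an assumption.
    """
    return sum(minutes for minutes in bout_lengths if minutes >= threshold)


# The two examples from the list above: an 11 minute walk to work
# and a 5 minute walk to lunch.
print(active_minutes([11]))
print(active_minutes([5]))
print(active_minutes([11, 5]))
```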

So, we should be able to partition the steps into activity-based steps and non-activity-based steps. That is, there must be some average number of steps that everyone takes every day that have nothing to do with the active exercise one does. This is a great question for a basic linear regression. With a linear regression, we are finding the best fit line of the data, with each competitor’s active minutes as x and total steps as y. We’ll get our regression results as an “m” and “b” as part of the y = mx + b equation for a line. The m tells us what our average steps per active minute was. And the b is the y-intercept of the line; in this situation it represents the number of steps that we all took regardless of activity. For our data, b = 255487 steps. Take a look at the graph below for a scatterplot of the 19 competitors as well as the best fit line.

[Figure: scatterplot of total steps versus active minutes over the 4 weeks, with best fit line]
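For readers who want to see the mechanics of the fit, here is a minimal least-squares sketch in plain Python. The five data points are made up for illustration (the competitors’ real numbers are not reproduced here), and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x  # the fitted line passes through the point of means
    return m, b


# Made-up 4-week totals for five hypothetical competitors.
active_minutes = [1200, 1800, 2400, 3000, 3600]
total_steps = [310_000, 345_000, 370_000, 410_000, 430_000]

m, b = fit_line(active_minutes, total_steps)
print(f"m = {m:.1f} steps per active minute")
print(f"b = {b:.0f} steps taken regardless of activity")
```

One property worth noticing: a least-squares line always passes through the point of means, so the average total steps minus b, divided by the average active minutes, gives back m.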

So if b = 255487 for the four weeks, this means we are, on average, taking 9124 steps per day that are not exercise related. For reference, the average American supposedly walks between 300 and 3000 steps per day. Clearly there is a selection bias in our competitors! The people who signed up for the Steppin’ challenge have active lives even when they are not exercising. (Well, except maybe contestant R, who appears to be a bit of an outlier.)

Now let’s get back to the steps per minute computation. We want to see if our competitors are more like bikers (75 steps per minute) or runners (250 steps per minute). We originally computed 165.9 steps per active minute, but this included the 9124 steps we took every day that had nothing to do with exercise. If we subtract that baseline from everyone and recompute the steps per active minute, we get 48.7 steps per active minute.
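One way to reproduce these figures from the rounded totals quoted in this post is sketched below. It works from group totals rather than per-person data, which is an assumption about how the numbers were computed; the variable names are made up for illustration.

```python
# Rounded figures quoted in the post.
TOTAL_STEPS = 7_100_000          # all competitors, 4 weeks
COMPETITORS = 19
DAYS = 28
WEEKS = 4
ACTIVE_MINUTES_PER_WEEK = 607    # average per competitor
INTERCEPT_STEPS = 255_487        # b from the regression, per competitor

# Non-exercise steps per day implied by the intercept.
baseline_per_day = INTERCEPT_STEPS / DAYS

# Remove the baseline from the group total, then divide by total active minutes.
exercise_steps = TOTAL_STEPS - COMPETITORS * INTERCEPT_STEPS
total_active_minutes = COMPETITORS * ACTIVE_MINUTES_PER_WEEK * WEEKS
net_rate = exercise_steps / total_active_minutes

print(f"baseline: about {baseline_per_day:.0f} steps per day")
print(f"net rate: {net_rate:.1f} steps per active minute")
```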

That feels really low! (Also note, the m from the regression gives us the slope between active minutes and steps. In this case, m = 49 steps per active minute. This is pretty close to our gross approximation of the 48.7 steps per minute.) As mentioned earlier, it’s probable that we were all doing exercise that doesn’t generate a lot of steps. But, we are even lower than biking? How?

Well, pedometers are very bad at measuring activity that doesn’t involve walking motions. And I know from anecdotal conversations that contestant M lifts a lot of weights. Weight lifting is definitely physical activity even though you get very few steps per minute. And personally, I do a lot of CorePower yoga. Yoga is definitely low step count. But if my sweat is any indication, it’s a good workout! Based on my experience, I get about 200 steps per hour of yoga. That’s only 3.3 steps per minute. If a bunch of contestants were all doing a few hours of activity that involved only 200 steps/hour, then our global average will definitely drop below bicycling levels.
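To see how quickly low-step exercise drags the average down, here is a small sketch of a time-weighted blend. The walking and yoga rates come from this post; the even split of active minutes is purely hypothetical, and `blended_rate` is a made-up helper name.

```python
def blended_rate(minutes_by_activity, rate_by_activity):
    """Overall steps per active minute for a mix of activities."""
    total_minutes = sum(minutes_by_activity.values())
    total_steps = sum(
        minutes * rate_by_activity[activity]
        for activity, minutes in minutes_by_activity.items()
    )
    return total_steps / total_minutes


# Steps per minute from the post: brisk walking and yoga (200 steps per hour).
RATES = {"walking": 100.0, "yoga": 200 / 60}

# Hypothetical week: active minutes split evenly between walking and yoga.
week = {"walking": 300, "yoga": 300}

rate = blended_rate(week, RATES)
print(f"{rate:.1f} steps per active minute")
```

With this made-up split, the blended rate already falls below the 75 steps per minute estimated for biking.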

Based on this initial analysis, I’m concluding that this group does a lot of walking in their day-to-day lives. In contrast, the exercise completed by this group (on the whole) involves a lot of low-step activities. However, there are some exceptions to this rule… But that’s a topic for next time!

Posted in Exercise

Pinpoint the moment when…

Graphs and data are always there for us, especially if we want to look back in time and find the exact moment that something changed. Today I’m presenting a bit of a poem of images about time series.

  • Here’s the exact moment that Robert De Niro “gave up” on his career.

  • In episode #576 of NPR’s podcast Planet Money, they highlight the moment when women stopped coding.

women_majors_npr

infection_graph

  • Even the Onion gets in on the action. They claim that FB has a new feature which tells you exactly when things are over.

Onion_Screenshot

Time series can be intuitive and very easy to understand. They are a great tool for looking at how something has changed over time. In fact, the Six Sigma methodology has built a whole theory around controlling processes with time series graphs.

Time series are great for helping pinpoint the moment when…

Posted in Business, Communicating Math, Dancing and Performance, Nature

Big Data

Big_data_is_like


Social Example 1

Futurama_Writer


Athlete Mathematicians?

SportsAndNerd

 


Variations of Data Science

I spent some time last month at a SIAM conference. Since I graduated and joined industry two years ago, I hadn’t been to a conference. The attendees were mostly academics. But there were a few data scientists coming from industry as well. Why so few? Why don’t we get to go to conferences? There are lots of questions about what data science is and what it isn’t. This made me wonder what the current state of data scientists is. Because data scientists are a weird breed. How many are there? How many should a company have?

We know a lot about how many people get PhDs. Less than 2% of Americans have PhDs. How many of those end up being data scientists? There are about 52k new PhDs every year [1]. But how many are prepared to become data scientists? Finding that stat isn’t super obvious. And since the data isn’t super rigorous, I’m going to do some estimations. That’s something I’m much more comfortable with now that I’ve spent a few years giving “directional” analysis summaries. Suffice it to say, a back-of-the-envelope computation will probably be good enough for this article. Let’s try it.

There were 1,900 new math PhDs in 2014 [2]. But people who aren’t mathematicians can be data scientists. Let’s expand to include everyone in STEM. This may be an overstatement, but let’s soldier on anyway. Of those 52k new PhDs, there were 40,588 PhDs in Science & Engineering fields in 2016 [3]. Of those, about 60% have non-academic jobs [4]. So there are 24.5k new PhDs looking for industrial work each year, potentially as data scientists. Let’s assume these people all have at least 25 good years of working. So there are about 612k PhDs who are, theoretically, capable of being data scientists and currently working. There are 308M people in the US, so the (potential) data scientists make up about 0.2% of the total American population, plus or minus a few thousand who leave or enter the country over time.
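The back-of-the-envelope chain above can be written out in a few lines. This sketch uses exactly 60% for the non-academic share, which is an assumption on my part; it gives about 24.4k per year rather than the 24.5k quoted, but the rounded result is the same.

```python
# Figures quoted in the post; the 60% share is used as an exact value here,
# which is an assumption (the post says "about 60%").
NEW_SE_PHDS_PER_YEAR = 40_588
NON_ACADEMIC_SHARE = 0.60
CAREER_YEARS = 25
US_POPULATION = 308_000_000

new_industry_phds = NEW_SE_PHDS_PER_YEAR * NON_ACADEMIC_SHARE
working_pool = new_industry_phds * CAREER_YEARS
population_share = working_pool / US_POPULATION

print(f"new industry PhDs per year: {new_industry_phds:,.0f}")
print(f"working pool: {working_pool:,.0f}")
print(f"share of US population: {population_share:.2%}")
```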

At this point in our analysis, our starting point is that about 0.2% of all Americans could be doing work that looks similar to that of a data scientist. If data scientists were equally distributed across all the companies in the US, I would expect to see data scientists make up a maximum of 0.2% of each company… maybe. Because, of course, there are hundreds of thousands of companies with no data scientists at all.

Now I want to know how many companies are actually employing data scientists. This brings us to another big reason for uncertainty; there are lots of intrinsic biases and opportunities for error in researching employment numbers. Primarily, job title nomenclature is fairly arbitrary. There are a lot of data analysts and engineers out there who aren’t called “data scientists” but who are doing data science work. Additionally, a company might start giving out data science titles because it’s the “hot” thing to do right now. However, I would argue that if a company is embracing the idea of data scientists and you want a job that has the specialization of a data scientist, then it’s worth it for you to know who is hiring a “data scientist”. Regardless, there is definitely error coming from how a company decided to title their employees.

The second bias could come from my research methods for determining how many Data Scientists a particular company has. I’ll go into my methods in the next paragraph. But at a high level, public companies are a little easier to get information on than a private company.  And it’s difficult to find out how many data scientists a company has no matter their public/private status. Let’s look at my methodology next.

As a first step, I chose a few companies which are either big or popular right now. Then I used LinkedIn to get an estimate of how many people claimed to be data scientists at each company. Since LinkedIn is subject to self-selection bias and to people who don’t update their profiles, I think these numbers under-represent reality. So, additionally, I looked for companies where I could find a more reliable source for the number of data scientists. From these few companies, I can build a modeled view of how many data scientists there are. For example, Yahoo! Labs says they have 200 employees in the lab, but LinkedIn says they only have 34 data scientists. Meanwhile, Google has 231 data scientists on LinkedIn and their website says they have 982 people in their research lab. From these, and a few others, I inferred an “effective” number of data scientists per company. With my newly created effective data scientist title, I’m trying to measure the number of actual data scientists plus the employees who act like data scientists. Thus, the number of effective data scientists is greater than or equal to the number of titled data scientists. From here on out I’m mostly going to be talking about “effective” data scientists.
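The post does not spell out how the “effective” number was inferred, so the following is only one plausible sketch: take the ratio of reported lab headcount to LinkedIn-titled data scientists for the two companies named above, pool them, and use the pooled ratio as a multiplier. The pooling choice, the helper name, and the example company with 50 titled data scientists are all assumptions for illustration.

```python
# The two calibration points given in the post.
SOURCES = {
    "Yahoo! Labs": {"linkedin": 34, "reported": 200},
    "Google": {"linkedin": 231, "reported": 982},
}

# Per-company ratio of reported lab headcount to LinkedIn-titled data scientists.
ratios = {
    name: counts["reported"] / counts["linkedin"]
    for name, counts in SOURCES.items()
}

# Pooled multiplier (an assumed method, not necessarily the author's).
pooled_multiplier = sum(c["reported"] for c in SOURCES.values()) / sum(
    c["linkedin"] for c in SOURCES.values()
)


def effective_data_scientists(linkedin_count, multiplier=pooled_multiplier):
    """Scale a LinkedIn title count up to an 'effective' headcount estimate."""
    return linkedin_count * multiplier


for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} reported per LinkedIn-titled data scientist")
print(f"pooled multiplier: {pooled_multiplier:.2f}")

# Hypothetical company with 50 LinkedIn-titled data scientists.
print(f"estimate: {effective_data_scientists(50):.0f} effective data scientists")
```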

[Figure: headquarters headcount and effective data scientists by company]

Above are the results from my initial research. I focused on headquarters population only. Satellite and store employees are not included in the company size. Note: some of these companies are quite small; I’m looking at you, Snapchat. Thus, it may be more useful to look at what percentage of each company’s headquarters staff are effective data scientists. Here are those results:

[Figure: percentage of effective data scientists at headquarters by company]

As you can see, Uber and Snapchat have the highest percentage of data scientists of any of the companies I considered. So, perhaps there is a start-up bias to this… or, once we notice that the top four companies are all located in Silicon Valley, perhaps Silicon Valley is the reason why?

[Figure: map of company headquarters locations in the US]

So, location can play a large role in how many data scientists a particular company hires. However, these apparently high levels of data scientists could be due to a methodology problem. I made an assumption that the number of people who are titled “data scientists” is a fixed ratio compared to the number of people who do data-science-type work. This may be a flaw in my analysis which affects Silicon Valley companies. If most of the effective data scientists are actually titled data scientists within Silicon Valley, that is, for Silicon Valley,

effective data scientists = titled data scientists

then my inferred results may be overestimating the percentage of effective data scientists. But I’m also willing to believe that Silicon Valley companies are more focused on data-supported results. So these companies might, as a consequence or cause of their location, believe that more data scientists will result in higher earnings. Startups also contain a higher than average percentage of data scientists. But who knows if this is because they are startups, or because they are hip to the hotness of the data science title, or because they actually use that many data scientists?

Looking past the “Silicon Valley Effect”, the companies with higher percentages of data scientists are companies which are known for their data science. Netflix is famous for its data science, and their percentage data show that. Meanwhile, Walmart has a negative reputation for not being able to keep data scientists and they don’t have as many [5]. Maybe there is something to this relationship?

Lastly, I took an informal survey of a small collection of my friends with effective data science titles (n=15). I asked them to help me make a totally subjective guess at the relative reputation for good data science at each of these companies. I took the mean of the responses I received and plotted this against the percentage of effective data scientists.

[Figure: data science reputation versus percentage of effective data scientists]

With this fairly random-looking scatterplot, I have no great conclusions. Clearly the respectability of the data science department is not a function of its size for my data set. But, beyond that, there isn’t much to say. I don’t have a recommendation about how many data scientists a company should have because the (limited) data I’ve collected does not yield any strong correlations. What do you think? How many data scientists/mathematicians is appropriate for a particular company to employ?

This is something I’ll continue to investigate.  I’m also planning to get some resources together for academics who want to transfer into the world of “Data Science”.  So, perhaps in a few months or a few years, we’ll have a better answer on what it means for a company to have data scientists and what kind of value those data scientists bring.

Posted in Business