I spent some time last month at a SIAM conference. Since I graduated and joined industry two years ago, I hadn’t been to a conference. The attendees were mostly academics, but there were a few data scientists from industry as well. Why so few? Why don’t we get to go to conferences? There are lots of questions about what Data Science is and what it isn’t. This made me wonder what the current state of data scientists is, because data scientists are a weird breed. How many are there? How many should a company have?
We know a lot about how many people get PhDs. Less than 2% of Americans have PhDs. How many of those end up being data scientists? There are about 52k new PhDs every year [1]. But how many are prepared to become data scientists? That stat isn’t easy to find. And since the data isn’t super rigorous, I’m going to do some estimations, something I’m much more comfortable with now that I’ve spent a few years giving “directional” analysis summaries. Suffice it to say, a back-of-the-envelope computation will probably be good enough for this article. Let’s try it.
There were 1,900 new math PhDs in 2014 [2]. But people who aren’t mathematicians can be Data Scientists. Let’s expand to include everyone in STEM. This may be an overstatement, but let’s soldier on anyways. Of those 52k new PhDs, there were 40,588 PhDs in Science & Engineering fields in 2016 [3]. Of those, about 60% have non-Academic jobs [4]. So there are 24.5k new PhDs looking for industrial work each year, potentially as data scientists. Let’s assume these people all have at least 25 good years of working. So there are about 612k PhDs who are, theoretically, capable of being Data Scientists and currently working. There are 308M Americans in the US, so the (potential) data scientists make up about 0.2% of the total American population. …plus or minus a few thousand who leave or enter the country over time.
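The arithmetic above can be checked in a few lines of Python. This is only a sketch: the variable names are mine, and it takes “about 60%” as exactly 0.60. That gives slightly lower intermediate numbers (roughly 24.4k per year and 609k in total) than the rounded figures in the text, but the final share still rounds to 0.2%.

```python
# Back-of-the-envelope estimate using the figures quoted in the text.
# Variable names are illustrative; 0.60 is an assumption for "about 60%".
new_se_phds_per_year = 40_588    # Science & Engineering PhDs in 2016 [3]
non_academic_share = 0.60        # share taking non-academic jobs [4]
working_years = 25               # assumed length of a working career
us_population = 308_000_000      # population figure used in the text

# New PhDs entering industry each year.
industry_phds_per_year = new_se_phds_per_year * non_academic_share

# Everyone who entered over the last 25 years and is still working.
potential_data_scientists = industry_phds_per_year * working_years

# Share of the total population.
share_of_population = potential_data_scientists / us_population

print(f"{industry_phds_per_year:,.0f} new industry PhDs per year")
print(f"{potential_data_scientists:,.0f} potential data scientists")
print(f"{share_of_population:.2%} of the population")
```

Changing `non_academic_share` or `working_years` shows how sensitive the 0.2% figure is to those two guesses.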
At this point in our analysis, our starting estimate is that 0.2% of all Americans are doing work that looks similar to that of a data scientist. If data scientists were equally distributed across all the companies in the US, I would expect to see data scientists make up a maximum of 0.2% of each company… maybe. Because, of course, there are hundreds of thousands of companies with no data scientists at all.
Now I want to know how many companies are actually employing data scientists. This brings us to another big reason for uncertainty; there are lots of intrinsic biases and opportunities for error in researching employment numbers. Primarily, job title nomenclature is fairly arbitrary. There are a lot of data analysts and engineers out there who aren’t called “data scientists” but who are doing data science work. Additionally, a company might start giving out data science titles because it’s the “hot” thing to do right now. However, I would argue that if a company is embracing the idea of data scientists and you want a job that has the specialization of a data scientist, then it’s worth it for you to know who is hiring a “data scientist”. Regardless, there is definitely error coming from how a company decided to title their employees.
The second bias could come from my research methods for determining how many Data Scientists a particular company has. I’ll go into my methods in the next paragraph. But at a high level, public companies are a little easier to get information on than a private company. And it’s difficult to find out how many data scientists a company has no matter their public/private status. Let’s look at my methodology next.
As a first step, I chose a few companies which are either big or popular right now. Then I used LinkedIn to get an estimate of how many people claimed to be data scientists at each company. Since LinkedIn is subject to self-selection bias and to people who don’t update their profiles, I think these numbers under-represent reality. So, additionally, I looked for companies where I could find a more reliable source for the number of data scientists and compared the two. From these few companies, I can build a modeled view of how many data scientists there are. For example, Yahoo Labs! says they have 200 employees in the Lab but LinkedIn says they only have 34 data scientists. Meanwhile, Google has 231 data scientists on LinkedIn and their website says they have 982 people in their research lab. From these, and a few others, I inferred an “effective” number of data scientists per company. With my newly created effective data scientist title, I’m trying to measure the number of actual data scientists + the employees who act like data scientists. Thus, the set of effective data scientists is greater than or equal to the set of titled data scientists. From here on out I’m mostly going to be talking about “effective” data scientists.
Above are the results from my initial research. I focused on headquarters population only. Satellite and store employees are not included in the company size. Note: some of these companies are quite small; I’m looking at you, Snapchat. Thus, it may be more useful to look at what percentage of each company is made up of effective data scientists. Here are those results:
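To make the idea of an “effective” count concrete, here is a rough sketch of one way the scaling could work. The post does not spell out how the comparison companies were combined, so two things here are my assumptions: a simple average of the lab-to-LinkedIn ratios, and treating the lab headcount as the effective count. The function names are hypothetical. Only the Yahoo and Google figures quoted above are used.

```python
# Lab headcount vs. LinkedIn "data scientist" count, as quoted in the text.
companies = {
    "Yahoo Labs": {"lab_headcount": 200, "linkedin_titled": 34},
    "Google": {"lab_headcount": 982, "linkedin_titled": 231},
}


def multiplier(lab_headcount, linkedin_titled):
    """Ratio of people doing data-science-type work to titled data scientists."""
    return lab_headcount / linkedin_titled


# One ratio per comparison company.
ratios = {name: multiplier(**counts) for name, counts in companies.items()}

# Assumption: combine the ratios with a plain average.
average_multiplier = sum(ratios.values()) / len(ratios)


def effective_data_scientists(linkedin_titled, m=average_multiplier):
    """Hypothetical scaling step: titled count times the average multiplier."""
    return linkedin_titled * m


for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} effective per titled data scientist")
print(f"Average multiplier: {average_multiplier:.2f}")
```

With these two companies the multiplier comes out at roughly five, which also shows how much the result depends on which comparison companies are included.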
As you can see, Uber and Snapchat have the highest percentage of data scientists of any of the companies I considered. So, perhaps there is a start-up bias to this… or, once we notice that the top four companies are all located in Silicon Valley, perhaps Silicon Valley is the reason why?
So, location can play a large role in how many data scientists a particular company hires. However, these apparently high levels of data scientists could be due to a methodology problem. I made an assumption that the number of people who are titled “data scientists” is a fixed ratio compared to the number of people who do data science-type work. This may be a flaw in my analysis which affects Silicon Valley companies. If most of the effective data scientists are actually titled data scientists within Silicon Valley, that is, for Silicon Valley,
effective data scientists = titled data scientists
then my inferred results may be overestimating the percentage of effective data scientists. But I’m also willing to believe that Silicon Valley companies are more focused on data-supported results. So these companies might, as a consequence or cause of their location, believe that more data scientists will result in higher earnings. Startups also contain a higher than average percentage of data scientists. But who knows if this is because they are startups, or because they are hip to the hotness of the data science title, or because they actually use that many data scientists?
Looking past the “Silicon Valley Effect”, the companies with higher percentages of data scientists are companies which are known for their data science. Netflix is famous for its data science, and their percentage data show that. Meanwhile, Walmart has a negative reputation for not being able to keep data scientists and they don’t have as many [5]. Maybe there is something to this relationship?
Lastly, I took an informal survey of a small collection of my friends with Effective Data Science titles (n=15). They helped me make a totally subjective guess at the relative reputation for good data science of each of these companies. I took the mean of the responses I received and plotted it against the percentage of effective data scientists.
With this fairly random looking scatterplot, I have no great conclusions. Clearly the respectability of the data science department is not a function of its size for my data set. But, beyond that, there isn’t much to say. I don’t have a recommendation about how many data scientists a company should have because the (limited) data I’ve collected does not yield any strong correlations. What do you think? How many data scientists/mathematicians is appropriate for a particular company to employ?
This is something I’ll continue to investigate. I’m also planning to get some resources together for academics who want to transfer into the world of “Data Science”. So, perhaps in a few months or a few years, we’ll have a better answer on what it means for a company to have data scientists and what kind of value those data scientists bring.
Note: This post was also republished on SIAM NEWS in July 2016.
Math is an interesting thing…
Surprising information. Keep sharing such interesting facts!
Math…. never my strong subject. I was an English major. I would have loved to take a math class on nature geometry or some other hippie crap. Nice post, even if math made me want to tear my left brain apart!
Thank you! I try to make the articles as approachable as possible, but sometimes that’s still difficult. Thanks for looking around!
Very nice post
I’m much more interested in Data Science now than I was before I read this post. I love the methodical and easy-to-follow way that you present your findings. I’ve always loved mathematics, and I am so glad I found your blog!
Thank you! My goal is to be understandable and entertaining. So I’m glad you found it! If you are interested in seeing some great math memes and the best articles from other sites, I recommend liking Social Math’s Facebook Page.
I will check that out. You are awesome!
Nice try! Very interesting conclusion.
An awesome post on my favourite subject… have a look at my post at https://sakshibhojane.wordpress.com/
I understand that there are more titled data scientists than actual data scientists. My question is, doesn’t this affect the actual industry of data science? Data science is an interesting and in-depth field with wide possibilities and potential for great work, yet here are some companies that refer to their simple data analysis as data science. Doesn’t this affect the future prospects of the field? And misguide those who look for a future in it?
I agree, there is something deeper to be said about people who do data science work but do not hold the title of data scientist. I definitely didn’t get into that in this article, but it’s an excellent point for a growing field. I haven’t personally done much research on how a misrepresentative title can impact the field. Perhaps this is a topic for a different day.
Math is my favorite
Too bad your scatterplot didn’t provide a trend/definitive conclusion, but very interesting post
Yeah, very disappointing. But my sample size was quite small, maybe with more responses, we would see a different result.
Very interesting! I’d love to see the results with a larger sample size. Data science is one potential career path I’m considering, I took a number of econometrics courses during my undergrad and it really got me interested in the field.
I’m working on some more content specifically related to moving into the field of data science. So, if you check back in a few weeks there might be some more resources here for that. 🙂
Perfect, will do!
Samantha, my experience is that the role of the Data Scientist is so new, only the most integrated matrix organizations would know how to use one. Many functional organizations have HR departments that have not even created the requisition for the role. In those organizations, senior management is just as likely to be clueless about delegating activities.
The most interesting map was the location of large ($billion rev annually) companies who you targeted as employing DS’s. Indicates there is more potential for the role to be identified in complex matrix organizations. Not the type of organization I think of when I think of Walmart or Target or government and state orgs.
It is my experience that if you can show management the bene’s of highly statistical research and investigations and communicate the information in creative visual ways so non mathematically inclined people get it, then you qualify as a data scientist.
I think these are excellent points! Though, I wonder about the details of what “highly statistical research and investigations” really means. This is only because I think there might be something dividing BI&A and Data Science. Both may do extensive statistical research, but I believe there should be a difference between the two job titles. Though, I’m not sure the industry has really decided what the difference is. I expect that there is a distinction in rigor, specialization or repeatability. I’ve heard others talk about the difference between a general practitioner medical doctor and a specialist. You, as a human, need both to exist, but you will use their services at different times and sometimes the general practitioner will refer you to a specialist. What do you think? Is there a distinction between BI&A work and Data Science?
I loved this post, the review of the industry state and effective data scientists was awesome, thanks.
I do believe all of the ideas you have presented in your post.
They’re very convincing and can definitely work. Still, the posts are too short for newbies.
May you please lengthen them a bit from next time?
Thanks for the post.
Extremely good read. Thank you Samantha.