I spent some time last month at a SIAM conference. Since I graduated and joined industry two years ago, I hadn’t been to a conference. The attendees were mostly academics, but there were a few Data Scientists from industry as well. Why so few? Why don’t we get to go to conferences? There are lots of questions about what Data Science is and what it isn’t. This made me wonder about the current state of Data Scientists, because data scientists are a weird breed. How many are there? How many should a company have?
We know a lot about how many people get PhDs. Fewer than 2% of Americans have PhDs. How many of those end up being data scientists? There are about 52k new PhDs every year. But how many are prepared to become data scientists? Finding that stat isn’t straightforward. And since the data isn’t super rigorous, I’m going to do some estimation, something I’m much more comfortable with now that I’ve spent a few years giving “directional” analysis summaries. Suffice it to say, a back-of-the-envelope computation will probably be good enough for this article. Let’s try it.
There were 1,900 new math PhDs in 2014. But people who aren’t mathematicians can be Data Scientists. Let’s expand to include everyone in STEM. This may be an overstatement, but let’s soldier on anyway. Of those 52k new PhDs, there were 40,588 PhDs in Science & Engineering fields in 2016. Of those, about 60% have non-academic jobs. So there are roughly 24.5k new PhDs looking for industrial work each year, potentially as data scientists. Let’s assume these people all have at least 25 good years of working. Multiplying 24.5k per year by 25 years gives about 612k PhDs who are, theoretically, capable of being Data Scientists and currently working. There are 308M people in the US, so the (potential) data scientists make up about 0.2% of the total American population, plus or minus a few thousand who leave or enter the country over time.
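The back-of-the-envelope arithmetic above can be sketched in a few lines of Python. The figures are the ones quoted in this post; the function and variable names are only illustrative.

```python
def potential_pool(new_phds_per_year, working_years, population):
    """Return (pool size, share of population) for a steady yearly inflow."""
    pool = new_phds_per_year * working_years
    return pool, pool / population

# Sanity check on the yearly inflow: 60% of the 40,588 S&E PhDs.
se_phds = 40_588
industry_fraction = 0.60  # "about 60%" in the text, so this is approximate
approx_inflow = se_phds * industry_fraction
print(f"approximate yearly inflow: {approx_inflow:,.0f}")

# Use the rounded 24.5k figure from the text for the rest of the estimate.
pool, share = potential_pool(24_500, 25, 308_000_000)
print(f"potential pool: {pool:,}")         # 612,500
print(f"share of population: {share:.2%}")  # 0.20%
```

Note that a strict 60% of 40,588 comes out slightly under 24.5k, which is well within the precision of an estimate like this.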
At this point in our analysis, we have a starting point: about 0.2% of all Americans are potentially doing work that looks similar to that of a data scientist. If data scientists were equally distributed across all the companies in the US, I would expect to see data scientists make up a maximum of 0.2% of each company… maybe. Of course, there are hundreds of thousands of companies with no data scientists at all.
Now I want to know how many companies are actually employing data scientists. This brings us to another big reason for uncertainty: there are lots of intrinsic biases and opportunities for error in researching employment numbers. Primarily, job title nomenclature is fairly arbitrary. There are a lot of data analysts and engineers out there who aren’t called “data scientists” but who are doing data science work. Additionally, a company might start giving out data science titles because it’s the “hot” thing to do right now. However, I would argue that if a company is embracing the idea of data scientists and you want a job that has the specialization of a data scientist, then it’s worth it for you to know who is hiring a “data scientist”. Regardless, there is definitely error coming from how a company decides to title its employees.
The second bias could come from my research methods for determining how many Data Scientists a particular company has. At a high level, public companies are a little easier to get information on than private companies, and it’s difficult to find out how many data scientists a company has no matter its public/private status. Let’s look at my methodology next.
As a first step, I chose a few companies which are either big or popular right now. Then I used LinkedIn to get an estimate of how many people claimed to be data scientists at each company. Since LinkedIn is subject to self-selection bias and to people who don’t update their profiles, I think these numbers under-represent reality. So, additionally, I looked for companies where I could find a more reliable source for the number of data scientists. From these few companies, I can build a modeled view of how many Data Scientists there are. For example, Yahoo Labs says they have 200 employees in the Lab but LinkedIn says they only have 34 data scientists. Meanwhile, Google has 231 data scientists on LinkedIn and their website says they have 982 people in their research lab. From these, and a few others, I inferred an “effective” number of data scientists per company. With my newly created effective data scientist title, I’m trying to measure the number of actual data scientists plus the employees who act like data scientists. Thus, the set of effective data scientists contains the set of titled data scientists. From here on out I’m mostly going to be talking about “effective” data scientists.
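One way such an inference could work is sketched below in Python. The post doesn’t spell out the exact model, so averaging the reported-to-LinkedIn ratios is purely an assumption for illustration, and the names `scaling_factor` and `effective_count` are hypothetical. Only the two reference companies quoted above are used.

```python
from statistics import mean

# Reference companies quoted in the text: LinkedIn-titled count vs. the
# headcount the company itself reports for its lab.
reference = {
    "Yahoo Labs": {"linkedin": 34, "reported": 200},
    "Google": {"linkedin": 231, "reported": 982},
}

def scaling_factor(reference):
    """Average ratio of reported headcount to LinkedIn-titled headcount.

    Assumption: a simple mean of ratios; the original analysis may have
    used a different model and more reference companies.
    """
    return mean(c["reported"] / c["linkedin"] for c in reference.values())

def effective_count(linkedin_titled, factor):
    """Scale a LinkedIn count up, never going below the titled count."""
    return max(linkedin_titled, round(linkedin_titled * factor))

factor = scaling_factor(reference)
print(f"scaling factor: {factor:.2f}")
print(f"100 titled on LinkedIn -> about {effective_count(100, factor)} effective")
```

The `max` call encodes the idea that effective data scientists always include the titled ones, so the estimate can never fall below the LinkedIn count.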
Above are the results from my initial research. I focused on headquarters population only; satellite and store employees are not included in the company size. Note: some of these companies are quite small; I’m looking at you, Snapchat. Thus, it may be more useful to look at the percentage of effective data scientists at each company. Here are those results:
As you can see, Uber and Snapchat have the highest percentage of data scientists of any of the companies I considered. So, perhaps there is a start-up bias to this… or, once we notice that the top four companies are all located in Silicon Valley, perhaps Silicon Valley is the reason why?
So, location can play a large role in how many data scientists a particular company hires. However, these apparently high levels of data scientists could be due to a methodology problem. I assumed that the number of people who do data science-type work is a fixed multiple of the number of people who are titled “data scientists”. This may be a flaw in my analysis which affects Silicon Valley companies. If most of the effective data scientists are actually titled data scientists within Silicon Valley, that is, for Silicon Valley,
effective data scientists = titled data scientists
then my inferred results may be overestimating the percentage of effective data scientists. But I’m also willing to believe that Silicon Valley companies are more focused on data-supported results. So these companies might, as a consequence or cause of their location, believe that more data scientists will result in higher earnings. Startups also contain a higher than average percentage of data scientists. But who knows if this is because they are startups, or because they are hip to the hotness of the data science title, or because they actually use that many data scientists?
Looking past the “Silicon Valley Effect”, the companies with higher percentages of data scientists are companies which are known for their data science. Netflix is famous for its data science, and its percentage shows that. Meanwhile, Walmart has a reputation for not being able to keep data scientists, and it doesn’t have as many. Maybe there is something to this relationship?
Lastly, I took an informal survey of a small collection of my friends with Effective Data Science titles (n=15). They helped me make a totally subjective guess at the relative reputation for good data science of each of these companies. I took the mean of the responses I received and plotted this against the percentage of effective data scientists.
With this fairly random-looking scatterplot, I have no great conclusions. Clearly the respectability of the data science department is not a function of its size for my data set. But, beyond that, there isn’t much to say. I don’t have a recommendation about how many data scientists a company should have because the (limited) data I’ve collected does not yield any strong correlations. What do you think? How many data scientists/mathematicians is appropriate for a particular company to employ?
This is something I’ll continue to investigate. I’m also planning to get some resources together for academics who want to transfer into the world of “Data Science”. So, perhaps in a few months or a few years, we’ll have a better answer on what it means for a company to have data scientists and what kind of value those data scientists bring.
Note: This post was also republished on SIAM NEWS in July 2016.