## Variations of Data Science

I spent some time last month at a SIAM conference. Since I graduated and joined industry two years ago, I hadn’t been to a conference. The attendees were mostly academics, but there were a few Data Scientists from industry as well. Why so few? Why don’t we get to go to conferences? There are lots of questions about what Data Science is and what it isn’t. This made me wonder about the current state of Data Scientists, because data scientists are a weird breed. How many are there? How many should a company have?

We know a lot about how many people get PhDs. Less than 2% of Americans have PhDs. How many of those end up being data scientists? There are about 52k new PhDs every year [1]. But how many are prepared to become data scientists? That stat isn’t easy to find. And since the data isn’t super rigorous, I’m going to do some estimations, something I’m much more comfortable with now that I’ve spent a few years giving “directional” analysis summaries. Suffice to say, a back-of-the-envelope computation will probably be good enough for this article. Let’s try it.

There were 1,900 new math PhDs in 2014 [2]. But people who aren’t mathematicians can be Data Scientists. Let’s expand to include everyone in STEM. This may be an overstatement, but let’s soldier on anyways. Of those 52k new PhDs, there were 40,588 PhDs in Science & Engineering fields in 2016 [3]. Of those, about 60% have non-Academic jobs [4]. So there are 24.5k new PhDs looking for industrial work each year, potentially as data scientists. Let’s assume these people all have at least 25 good years of working. So there are about 612k PhDs who are, theoretically, capable of being Data Scientists and currently working. There are 308M Americans in the US, so the (potential) data scientists make up about 0.2% of the total American population. …plus or minus a few thousand who leave or enter the country over time.
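The estimate above can be replayed in a few lines. This is only a sketch using the figures quoted in the paragraph. The variable names are mine, and taking "about 60%" as exactly 0.60 gives roughly 24.4k per year and a pool of roughly 609k, slightly below the rounded 24.5k and 612k in the text.

```python
# Back-of-the-envelope estimate using the figures quoted above.
se_phds_per_year = 40_588      # new Science & Engineering PhDs in 2016 [3]
non_academic_share = 0.60      # "about 60%" have non-academic jobs [4] (assumed exact here)
career_years = 25              # assumed working years per PhD
us_population = 308_000_000    # population figure used in the text

# New PhDs heading to industry each year.
industry_phds_per_year = se_phds_per_year * non_academic_share

# Everyone from the last 25 cohorts is assumed to still be working.
working_pool = industry_phds_per_year * career_years

# Share of the total population.
share_of_population = working_pool / us_population

print(f"new industry PhDs per year: {industry_phds_per_year:,.0f}")
print(f"working pool: {working_pool:,.0f}")
print(f"share of population: {share_of_population:.2%}")
```

Changing `career_years` or `non_academic_share` shows how sensitive the 0.2% figure is to those two guesses.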

At this point in our analysis, we have a starting point of **0.2% of all Americans are doing work that looks similar to that of a data scientist**. If data scientists were equally distributed across all the companies in the US, I would expect to see data scientists make up a maximum of 0.2% of each company… maybe. Because, of course, there are hundreds of thousands of companies with no data scientists at all.

Now I want to know how many companies are actually employing data scientists. This brings us to another big reason for uncertainty: there are lots of intrinsic biases and opportunities for error in researching employment numbers. Primarily, job title nomenclature is fairly arbitrary. There are a lot of data analysts and engineers out there who aren’t called “data scientists” but who are doing data science work. Additionally, a company might start giving out data science titles because it’s the “hot” thing to do right now. However, I would argue that if a company is embracing the idea of data scientists and you want a job that has the specialization of a data scientist, then it’s worth it for you to know who is hiring a “data scientist”. Regardless, there is definitely error coming from how a company decides to title its employees.

The second bias could come from my research methods for determining how many Data Scientists a particular company has. I’ll go into my methods in the next paragraph. But at a high level, public companies are a little easier to get information on than a private company. And it’s difficult to find out how many data scientists a company has no matter their public/private status. Let’s look at my methodology next.

As a first step, I chose a few companies which are either big or popular right now. Then I used LinkedIn to get an estimate of how many people claimed to be data scientists at each company. Since LinkedIn is subject to self-selection bias and to people who don’t update their profiles, I think these numbers under-represent reality. So, additionally, I looked for companies where I could find a more reliable source for the number of data scientists. From these few companies, I can build a modeled view of how many Data Scientists there are. For example, Yahoo Labs! says they have 200 employees in the Lab but LinkedIn says they only have 34 data scientists. Meanwhile, Google has 231 data scientists on LinkedIn and their website says they have 982 people in their research lab. From these, and a few others, I inferred an “effective” number of data scientists per company. With my newly created *effective data scientist* title, I’m trying to measure the number of actual data scientists + the employees who act like data scientists. Thus, the set of effective data scientists is greater than or equal to the set of titled data scientists. From here on out I’m mostly going to be talking about “effective” data scientists.
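One way to turn those two comparisons into an “effective” multiplier is sketched below. The post does not spell out the actual model, and it used a few more companies than the two quoted here, so the plain average and the helper name `effective_data_scientists` are assumptions for illustration only.

```python
# (headcount reported by the company, titled data scientists found on LinkedIn)
reference_points = {
    "Yahoo Labs": (200, 34),
    "Google": (982, 231),
}

# Ratio of reported headcount to LinkedIn-titled data scientists, per company.
ratios = {
    name: reported / titled
    for name, (reported, titled) in reference_points.items()
}

# Assumption: a single shared multiplier, taken as the plain mean of the ratios.
multiplier = sum(ratios.values()) / len(ratios)


def effective_data_scientists(titled_on_linkedin: int) -> float:
    """Scale a LinkedIn count up to an 'effective' count (illustrative only)."""
    return titled_on_linkedin * multiplier


for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} effective per titled data scientist")
print(f"mean multiplier: {multiplier:.2f}")
```

With only two reference points, the two ratios differ noticeably, which is one reason to treat any single multiplier as rough.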

Above are the results from my initial research. I focused on headquarters population only. Satellite and store employees are not included in the company size. Note: some of these companies are quite small; I’m looking at you, Snapchat. Thus, it may be more useful to look at the **percentage** of effective data scientists at each company. Here are those results:

As you can see, Uber and Snapchat have the highest percentage of data scientists of any of the companies I considered. So, perhaps there is a start-up bias to this… or, once we notice that the top four companies are all located in Silicon Valley, perhaps Silicon Valley is the reason why?

So, location can play a large role in how many data scientists a particular company hires. However, these apparently high levels of data scientists could be due to a methodology problem. I made an assumption that the number of people who are titled “data scientists” is a fixed ratio of the number of people who do data science-type work. This may be a flaw in my analysis which affects Silicon Valley companies. If most of the effective data scientists are actually titled data scientists within Silicon Valley, that is, for Silicon Valley,

effective data scientists = titled data scientists

then my inferred results may be overestimating the percentage of effective data scientists. But I’m also willing to believe that Silicon Valley companies are more focused on data-supported results. So these companies might, as a consequence or cause of their location, believe that more data scientists will result in higher earnings. Startups also contain a higher than average percentage of data scientists. But who knows if this is because they are startups, or because they are hip to the hotness of the data science title, or because they actually use that many data scientists?

Looking past the “Silicon Valley Effect”, the companies with higher percentages of data scientists are companies which are known for their data science. Netflix is famous for its data science, and their percentage data show that. Meanwhile, Walmart has a negative reputation for not being able to keep data scientists and they don’t have as many [5]. Maybe there is something to this relationship?

Lastly, I took an informal survey of a small collection of my friends with Effective Data Science titles (n=15). I asked them to help me make a totally subjective guess at the relative reputation for good data science of each of these companies. I took the mean of the responses I received and plotted this against the percentage of effective data scientists.

With this fairly random-looking scatterplot, I have no great conclusions. Clearly the respectability of the data science department is not a function of its size for my data set. But, beyond that, there isn’t much to say. I don’t have a recommendation about how many data scientists a company should have because the (limited) data I’ve collected does not yield any strong correlations. What do you think? How many data scientists/mathematicians is appropriate for a particular company to employ?

This is something I’ll continue to investigate. I’m also planning to get some resources together for academics who want to transfer into the world of “Data Science”. So, perhaps in a few months or a few years, we’ll have a better answer on what it means for a company to have data scientists and what kind of value those data scientists bring.

## The Mathematics of Aliens

Do aliens exist? The evidence is varied across time and across the globe. But mathematicians who study aliens always impress me. These are individuals who stand firmly on the pillar of logical reasoning (mathematics, that is). And yet, they are considering one of the most contested ideas of humankind… Are we alone?

Mathematicians (along with a lot of other really smart people) sometimes appear crazy. Their ideas are too advanced and so, as Arthur C. Clarke once put it, the ideas are indistinguishable from magic. And everyone knows that magic doesn’t exist. You can’t make something move without touching it. Except… my garage door opens every day and I’ve never touched that thing! Honestly, I enjoy viewing the world as though every scientific thing is actually magical. A new line of code that makes my life simpler? Magic! Internal combustion engines? Magic! Hot Water?! You get the idea.

But while I love attributing scientific advances to magic, I don’t enjoy passing off scientifically unproven theories (or magic) as science. Unlike the scientists who founded the Jet Propulsion Laboratory (as told by this Cracked Podcast), who believed in and cast spells on a regular basis, I don’t actually believe in magic. And similarly, I don’t actually believe in aliens. Not seriously anyways. Not until it’s proven.

And while all the green alien paraphernalia in Area 51 cannot convince me, mathematics might… Woodruff Sullivan and Adam Frank recently published a paper, summarized in the NY Times article “Yes, There Have Been Aliens,” which describes how mathematics shows that intelligent life probably existed in our universe at some point. It employs the Drake Equation, which is basically the multiplication of a bunch of different probabilities. Let’s take a quick look at the details of that equation:

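$$N = R_* \cdot f_p \cdot n_e \cdot f_l \cdot f_i \cdot f_c \cdot L$$

where $N$ is the number of civilizations whose communications could be detected, and

- $R_*$: rate of star formation (known)
- $f_p$: fraction of stars that have habitable planets (current research)
- $n_e$: number of habitable planets per star that has them (current research)
- $f_l$: fraction that develop life (unknown)
- $f_i$: fraction of life that is intelligent (unknown)
- $f_c$: fraction of intelligent life that releases detectable communications (unknown)
- $L$: length of time to release communications (we could make a guess)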

For a while, the fraction of stars with habitable planets ($f_p$) and the number of habitable planets per star ($n_e$) were expected to be the limiting factors in this equation. However, as scientists discover more and more stars and planets, it seems the former is very close to 1 and the latter is at least 1. At this point, the biggest unknowns are the probability that life forms and the probability that this life is intelligent.

The authors plug some values into the equation to get a sense of what the fraction of planets that develop life ($f_l$) and the fraction of that life that is intelligent ($f_i$) would need to be to make intelligent life unlikely. This is kind of like saying: “If I know how many lottery tickets are winners, how many are printed, how many people play and how many tickets each player buys, then I can tell you the likelihood that you’ll win the lottery.” In this case, the authors are saying, “I can tell you the likelihood that someone won the lottery at some point in history.” And, certainly the chances of someone winning the lottery over the entire history of the lottery’s existence are higher than the chances of **me** winning the lottery. *(And if you want to know more about winning the lottery and how to game this system, check out Planet Money’s episode on the subject.)*

So, Sullivan and Frank are trying to show the limiting value of the probability that a habitable planet produces a civilization. Historically, that probability was pessimistically put at one in 10 billion per planet. Sullivan and Frank show that unless it is less than one in 10 billion trillion, life is likely to have existed. This means that even if you take the pessimistic number of 1/10 billion, then 1 trillion civilizations have existed across the universe. Magic!
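The jump from “one in 10 billion” to “1 trillion civilizations” is plain arithmetic, sketched below. I am reading the one-in-10-billion-trillion threshold as roughly one over the number of habitable planets; that reading, and the variable names, are my assumptions rather than something taken from the paper.

```python
from fractions import Fraction

# "One in 10 billion trillion": 10 billion = 10**10, a trillion = 10**12.
threshold = Fraction(1, 10**10 * 10**12)

# The historically "pessimistic" per-planet probability: one in 10 billion.
pessimistic = Fraction(1, 10**10)

# Assumption: the threshold is about 1 / (number of habitable planets),
# so below it the expected number of civilizations drops under one.
habitable_planets = 1 / threshold

# Expected number of civilizations under the pessimistic probability.
expected_civilizations = habitable_planets * pessimistic

print(f"habitable planets: {int(habitable_planets):.0e}")
print(f"expected civilizations: {int(expected_civilizations):,}")  # 10**12, i.e. 1 trillion
```

Exact fractions are used instead of floats so that the powers of ten cancel without rounding.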

Before I close this inquiry into alien life, I have to point out the counterargument. The Atlantic recently posted a rebuttal article. The core of the rebuttal is that while 1/10 billion was seen as pessimistic, we don’t have any idea what the real number is. Like, 1/10 billion trillion is really small, but it’s possible that we are the **only** planet with humans on it. This argument is factual: we don’t actually know the value of that probability…

But, wouldn’t it be cool if we did? Or maybe we should be willing to believe (in the face of scientific argument) that this probability could be large enough to make intelligent life probable. Or maybe not? I guess it depends what kind of mathematician you are. Are you willing to believe that magic is just science we haven’t solidified yet? Or are you a mathematician who believes magic is worthless until the moment when it is definitively proved and can be reproduced ad nauseam?

## Game, Set, Match

Games provide structure. And this is a particular kind of structure, because the structure of a game often leads to creative thought. It’s a casual environment where the mind can wander. Set is a particularly famous game amongst mathematicians. If you were a mathematics major as an undergraduate, then you probably came in contact with Set.

The premise of Set is simple: identify a set of 3 cards where, for each attribute, the cards either all share that attribute or all differ on it. There are four such attributes in Set: color, shape, shading and number. Usually the game is played with 12 face-up cards. And usually there are multiple sets within those 12 cards. But sometimes there isn’t a set in those 12 cards, and when that happens 3 more cards are added until a set is found.
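The rule above is easy to check mechanically. In this sketch each card is a tuple of four attribute values coded 0, 1 or 2; the encoding and the function names are my own choices, not part of the game.

```python
from itertools import combinations, product

# A card is a 4-tuple (color, shape, shading, number), each coded as 0, 1 or 2.
DECK = list(product(range(3), repeat=4))


def is_set(a, b, c):
    """True if, on every attribute, the three cards are all equal or all different."""
    # Exactly two distinct values on an attribute is the only way to fail.
    return all(len({x, y, z}) != 2 for x, y, z in zip(a, b, c))


def find_sets(cards):
    """Return every 3-card set among the given cards (brute force)."""
    return [trio for trio in combinations(cards, 3) if is_set(*trio)]


# The full deck has 81 cards; count how many sets it contains.
total_sets = len(find_sets(DECK))
print(len(DECK), total_sets)
```

Calling `find_sets` on the 12 face-up cards tells you whether the table needs 3 more cards: an empty list means there is no set to find.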

This simple game provides endless opportunities for combinatorial proofs for new and seasoned mathematicians. There are easy proofs: Prove why there are no sets among these 12 cards. And there are hard proofs: How many cards can be added to the table and there are still no sets? That is: Identify the largest number of cards which can potentially have no sets.

This last question plagued mathematicians for a long time. And it came from playing with this game! Well, if I’m being honest, the question didn’t come from the game. The problem equivalent to “the largest collection of cards with 4 attributes containing no set” was solved in 1971. The game Set wasn’t created until 1974. In fact, mathematicians were worried about this problem long before the game was developed. But the game, so easy a 6-year-old can play, introduces many new ideas about combinatorics to new mathematical thinkers. And since I’m being honest, this is a game where the youngest player at the table generally wins.

But that’s beside the point! The point is that we solved it! …Well, not quite! But we came up with a better upper bound, which is pretty awesome. For a collection of cards, we can give a much improved bound on how many cards are needed to guarantee the existence of a set. Previously, for a version of the game with 200 attributes, we could only bound the number of cards needed at 0.5% of the deck. Now we know it’s only 0.0000043% of the deck.
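The two percentages can be reproduced with a quick calculation. This sketch reads the 200 as the number of attributes, so the deck has 3^200 cards. It also assumes the earlier bound grows like 3^n / n and the new one like 2.756^n. Those forms are my assumption, chosen because they match the quoted figures, and they ignore constants and lower-order terms.

```python
n = 200  # number of attributes; the deck then has 3**n cards

# Assumed shape of the earlier bound: about 3**n / n cards,
# i.e. a fraction 1/n of the deck.
old_fraction = 1 / n

# Assumed shape of the new bound: about 2.756**n cards,
# i.e. a fraction (2.756 / 3)**n of the deck.
new_fraction = (2.756 / 3) ** n

print(f"old bound: {old_fraction:.1%} of the deck")
print(f"new bound: {new_fraction * 100:.7f}% of the deck")
print(f"improvement factor: {old_fraction / new_fraction:,.0f}x")
```

The gap widens quickly as `n` grows, because the new fraction shrinks exponentially while the old one only shrinks like 1/n.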

The coolest part is the **way** this bound was proved. I’m not going to go into the details; the folks at Quanta Magazine and Gowers’s Weblog do a great job of that. But, suffice to say, even Terence Tao called the proof “sort of magical.” Mathematicians are currently rushing to use the techniques from this proof to prove other things, which is one of the coolest parts about math. It’s abstract enough that a new technique can be applied in lots of areas. One example is using the technique to improve things like matrix multiplication, which has very little to do with a card game built around finding sets of similar attributes.