Friday, November 27, 2015

What has surprised me about data science

It's been six months since I decided to become a data scientist, and I want to take a moment to reflect on what's been surprising about the journey. When I started, I had a study plan that involved machine learning and statistics courses, and plenty of programming practice.

Surprising things about the physics -> data science transition:

  1. Data scientists love hypothesis testing. Specifically, they are serious about binary hypothesis testing, and they tend to take the classical view of null and alternative hypotheses. In business this is called A/B testing, and it's used to make high-impact decisions at major companies every day. This surprises me because I didn't run into formal hypothesis testing very much in physics (although it's more popular in biology, for example). On the other hand, it makes sense that anywhere people stand to make or lose money based on decisions, those decisions should probably be considered formally. The emphasis on hypothesis testing is symptomatic of rationality, but we have to be careful about defining the hypothesis space. Not everything that seems binary is binary.
  2. Cognitive bias takes a back seat. I'm used to thinking about rationality as the study of a trifecta of concepts: decision theory, probability, and cognitive bias. People in DS seem very concerned about the first two, while the third is just kind of kept in mind while we do our analysis. It's a hidden variable that causes the results of A/B tests to come out the way they do, and it plays a role in the way we communicate our results to the decision makers. But there doesn't seem to be much attention paid to the pitfalls of cognitive bias in our own analysis. I also haven't found anyone trying to exploit it to influence decisions, even though many data scientists work in marketing. I wonder if this is a place where I can contribute with a unique skill set.
  3. Statistics is hard. I thought I would be able to knock this out really quickly since I'm extremely comfortable with probability, but that has not been the case. Part of the problem is that it's typically taught from a tool-oriented viewpoint. We learn that in situation $A$, with $N$ samples from $M$ ensembles, test $X$ is appropriate. Contrast this with the bottom-up approach of Bayesian probability, where we start with the question of how to define a measure of likelihood, and we write out a complete hypothesis space before any problem. This may be why concepts like confidence intervals and p-values are commonly misunderstood even by expert practitioners. I have struggled to reconcile the tools of statistics with the formal logic of probability theory, but I can at least use the tools appropriately. The rest will come with practice.
  4. Machine learning is easy. Or at least, for most problems. ML makes up a large percentage of the data science news and tutorials I see, and there's a lot of emphasis on figuring out when to apply which method (again, a tool-oriented approach). But in basically every real example I've found, you can throw any classifier at the problem and it's pretty much OK. Furthermore, unless you have a huge amount of data or a very large parameter space, you can set up a pipeline in Python that tries different classifiers with different hyperparameters (like regularization parameters) and find the one that performs best. You just have to be careful about setting aside test and validation sets. All this to say that we can often afford a brute-force approach to machine learning.
  5. Interviews take a lot of resources. I had assumed that it would be rational for companies to do phone screens for any candidate who was possibly qualified for a position, since it's well known that resumes are not good predictors of success. By asking specific technical and non-technical questions, I assumed that a competent hiring manager could separate the wheat from the chaff. Sending a data challenge seems like a similarly good idea. But this ignores the fact that even half-hour phone screens require time and effort that's not being put toward high-priority projects, and that someone has to review the data challenge results, which is a lot like grading a test. Which sucks. So it seems like companies are still sort of stuck choosing interviewees based on resume keywords. That seems like a bad idea, but I don't have a great solution.
  6. People are afraid of hiring academics. This is something Kathy Copic mentioned at a panel I attended, and it sounded ridiculous at the time, but I can tell you that it's absolutely true: managers are afraid of hiring "stereotypical" academic researchers, who prefer to work alone on very difficult problems for a long time, and generate theoretically perfect results that are of no use to anyone. They also prefer well-defined problems, are not intellectually agile, and are culturally incompetent. I don't know if this fear is founded - maybe there are horror stories about previous hires who fit this description. But a good academic does none of these things either: she works efficiently on small problems in pursuit of bigger ones, adapts her strategy according to previous successes or failures, and is able to collaborate with others and communicate her results.
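As a concrete illustration of the A/B testing in point 1: most of these tests boil down to a two-sample proportion z-test. Here's a minimal sketch in plain Python (standard library only); the conversion counts are made up for illustration.

```python
# A two-sample proportion z-test, the workhorse behind most A/B tests.
# The counts below are invented for illustration.
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z statistic, two-sided p-value) for H0: p_a == p_b."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z_test(conv_a=200, n_a=2400, conv_b=260, n_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Note that this machinery only compares the two hypotheses you wrote down, which is exactly the caveat above: if the real hypothesis space isn't binary, a small p-value can't tell you so.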
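And the brute-force pipeline from point 4 can be sketched with scikit-learn. The classifiers, grids, and synthetic dataset below are all arbitrary choices for illustration, not a recommendation; `GridSearchCV` handles the validation folds, and a held-out test set gives the final comparison.

```python
# A brute-force model search: try a few classifiers with small hyperparameter
# grids and keep whichever scores best on held-out data. Requires scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
]

best_score, best_model = 0.0, None
for estimator, grid in candidates:
    search = GridSearchCV(estimator, grid, cv=5)   # validation via 5-fold CV
    search.fit(X_train, y_train)
    score = search.score(X_test, y_test)           # final check on the test set
    if score > best_score:
        best_score, best_model = score, search.best_estimator_

print(best_model, best_score)
```

With a huge dataset or parameter space this loop gets expensive fast, which is the limit of the brute-force approach noted above.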
That's it for now. Thanks for reading!


3 comments:

  1. Hello, I was wondering what you have thought about insight so far. Thanks.

    1. Hi Damien. I'm going to write a more detailed post about this, but in short, my answer is that Insight sped up my career transition significantly. I built a fairly complex project to act as a portfolio, learned many concepts quickly, and took advantage of their industry connections to get interviews. I expect that it would have taken me another 4-6 months to be in the same position if I hadn't gone there.

  2. This comment has been removed by the author.