Tuesday, May 1, 2018

A model-training template I sort of like

I hope someone can prove me wrong about this, but I don't think any model training script is appropriate for all cases. Even with the advent of automated machine learning (TPOT, AutoML, etc.), every data set has its pathologies, and every project has its bizarre use cases and requirements.

After a large handful of projects in which I've made a variety of mistakes, I have put together a modeling template that I'm more or less happy with. The tension in making such a script is that on one hand we want something general enough that it can be cloned and applied to the widest variety of modeling tasks, and on the other hand that it does each modeling task very well. I think this realization is a fair compromise between the two.

You can clone the repo here.

Project structure, and the point of it all

The idea is that the main script, process.py, implements the usual modeling processing steps at an abstracted layer, and that you can edit the contents of the individual steps depending on your project needs. An example is included.

The project skeleton is based on CookieCutter Data Science's template, which I've adjusted for my needs. I also left a bunch of the original template that I thought would be useful, like the notebooks/ directory for exploratory studies. That template contains placeholders for data processing and modeling modules, and I've implemented those to a limited degree. 

Project structure is critical, and you may want to rearrange it to suit your needs. As the saying goes, loosely paraphrased, "code is meant to be read by humans." That applies to the project structure as well: we want collaborators or strangers to be able to understand what we are doing, and to continue or replicate our work. In this project template, there's only one script in the home directory, and it refers to modules with obvious names.

The modeling script itself loosely follows suggestions published on MacieJ's blog, but I've added and subtracted to conform to my preferred workflow. It looks something like this:
  1. Load and pre-process raw data. Element-wise cleaning
  2. Construct a pipeline that includes feature building and training the model
  3. Run pipeline on training data
  4. Test on test data
  5. Save results

Usage

To set up and run the example modeling task, follow these steps.
  1. Clone the repo
  2. Create and activate the Conda environment. If you don't have Conda you can install it here. To create the environment on Windows, open a command prompt in example_pipeline/example_pipeline/ and run the command
    conda env create -f environment.yml
    Once that finishes, run the command

    activate example_pipeline

    to activate the environment.
  3. Run the script using
    python process.py

    Results will appear in a timestamped directory inside
    results/.
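For anyone wondering what a timestamped results directory involves, here is a minimal sketch using only the standard library. It is not the code from the repo: the function names make_results_dir and save_report, the file names, and the timestamp format are all assumptions made for illustration.

```python
import json
from datetime import datetime
from pathlib import Path


def make_results_dir(root="results"):
    """Create and return a new directory named after the current time."""
    # Hypothetical naming scheme: YYYYMMDD_HHMMSS, so directories sort by run time.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(root) / stamp
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_report(results_dir, report_text, details):
    """Write a text report and a JSON file of run details into results_dir."""
    (results_dir / "classification_report.txt").write_text(report_text)
    (results_dir / "run_details.json").write_text(json.dumps(details, indent=2))
```

Calling make_results_dir() creates a directory with a name like results/20180501_093000 and returns its path, which can then be handed to save_report along with whatever you want to record about the run.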

The script

I want to discuss the choices I made when building this script, roughly in the order they appear.
  • Set parameters
    • Usually I would use a config file here, especially if I needed credentials for the script. In this case I just hard-coded the file paths, and there were no other free parameters.
  • Load data
    • This is a custom function that will change depending on your situation. The output should be a multi-index data frame, where training data are indexed as 'train', and test data are indexed as 'test'. Here, I'm loading from two files. If you're loading all data from a single file and then selecting train/test data from that, then it should be done in the load_data() function.
  • Load variable codes
    • Every time I have a modeling task, I make a variable codes file. For each base feature in the data, this file contains the feature name, types, levels if categorical or ordinal, and indicators for whether or not to use each feature in a particular model. We can also add columns to indicate processing steps or to identify weight or target data columns. Not only is it convenient to use in code, but I catch data problems by going through the exercise of building it.
  • Pre-process data
    • This is where we strip white space from strings, and relabel the target, etc. Note that this is not the same as feature creation. That will come later. The pre-processing step is only allowed to include element-wise operations that are agnostic of any other elements. Only you can prevent data leakage.
    • In this specific case the target can have two values, but they're labeled slightly differently in the train and test sets. I provide a dictionary to do the mapping.
  • Split into train and test
    • I prefer for this to be done by label during the load step. The earlier we can distinguish between test and train, the better.
  • Build the pipeline
    • This will, of course, be project specific. But the pipeline should include any feature selection/creation steps as well as one or more model fitting steps.
    • In this case, I pass the pipeline builder a list of features to include. I prefer to read this list from the variable codes file, where we can always add columns for additional models, but feel free to pass a hard-coded list. 
    • For the example, I've chosen to use a standard scaler on numerical columns, and one-hot encoding on categorical columns. I'm using gradient boosting as the example model here.
    • For the transformations, I'm using sklearn_pandas, in which each feature is given a transformation, and we filter the list of transformations according to which features we want to include.
    • I also include a gridsearch for good measure. As currently set up, it's awkward to set the parameter grid. You could read it from a config file, or modify it manually. I'm thinking ahead to a future blog post in which I'll do automated machine learning. Hopefully this issue will go away at that time.
  • Train the model
    • Split into X and y for convenience. I'm passing DataFrames instead of arrays to scikit-learn.
    • Fit the pipeline and choose the best estimator. In principle this step can be much more complicated, involving training hundreds of models and ensembling. 
  • Test the model
    • Same thing as above, but predict using the best pipeline instead of running the full gridsearch.
    • Evaluate the results in whatever way you prefer. Here I'm just printing out a classification report.
  • Save the results
    • I err on the side of recording too much information. In this example, I save the model, the classification report, and the details of the pipeline object. In practice, I would also save a snapshot of the .py scripts unless I would reveal sensitive info by doing so.
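
The steps above can be compressed into a toy sketch. This is not the code in the repo, and everything in it is an assumption made for illustration: the data are invented, the column names, TARGET_MAP, and the use_in_model_1 flag are hypothetical, and I've swapped in scikit-learn's ColumnTransformer where the real script uses sklearn_pandas. It only shows the shape of the workflow: a train/test-indexed frame, element-wise cleaning, a feature list read from a variable codes table, and a grid search over a pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mapping: the toy test set labels the target with a trailing period.
TARGET_MAP = {"yes": 1, "no": 0, "yes.": 1, "no.": 0}


def load_data():
    """Toy stand-in for the real loader: one frame keyed by 'train' / 'test'."""
    train = pd.DataFrame({
        "age": [25, 32, 47, 51, 38, 29, 60, 44],
        "job": [" clerk", "tech ", "clerk", "tech", "clerk", "tech", "clerk", "tech"],
        "target": ["no", "yes", "no", "yes", "no", "yes", "no", "yes"],
    })
    test = pd.DataFrame({
        "age": [30, 55],
        "job": ["tech", "clerk "],
        "target": ["yes.", "no."],
    })
    # Passing a dict to concat puts its keys in the outer index level.
    return pd.concat({"train": train, "test": test})


def preprocess(data):
    """Element-wise cleaning only: nothing here looks at other rows."""
    data = data.copy()
    data["job"] = data["job"].str.strip()
    data["target"] = data["target"].map(TARGET_MAP)
    return data


def build_pipeline(codes, model_column):
    """Build a grid search over a pipeline using the features flagged in codes."""
    used = codes[codes[model_column] == 1]
    numeric = used.loc[used["type"] == "numeric", "name"].tolist()
    categorical = used.loc[used["type"] == "categorical", "name"].tolist()
    features = ColumnTransformer([
        ("scale", StandardScaler(), numeric),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    pipeline = Pipeline([
        ("features", features),
        ("model", GradientBoostingClassifier(random_state=0)),
    ])
    param_grid = {"model__n_estimators": [10, 20]}  # deliberately tiny grid
    return GridSearchCV(pipeline, param_grid, cv=2), numeric + categorical


# Hypothetical variable codes table: one row per base feature.
variable_codes = pd.DataFrame({
    "name": ["age", "job"],
    "type": ["numeric", "categorical"],
    "use_in_model_1": [1, 1],
})

data = preprocess(load_data())
train, test = data.loc["train"], data.loc["test"]

search, feature_names = build_pipeline(variable_codes, "use_in_model_1")
search.fit(train[feature_names], train["target"])

# Predict with the refit best pipeline rather than rerunning the search.
best_pipeline = search.best_estimator_
predictions = best_pipeline.predict(test[feature_names])
report = classification_report(test["target"], predictions)
print(report)
```

Because the scaler and encoder live inside the pipeline, they are fit only on the training rows of each cross-validation fold, which is the point of keeping feature creation out of the pre-processing step.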

Some comments on complexity

This exercise has reminded me again of the trade-off between general applicability and complexity. As we adapt a strategy to be applicable to more distinct cases, we must make the strategy more complex, or we must accept that it will handle the cases worse on average. I think this modeling project structure and script strike a good balance - they should be adaptable to most modeling tasks, and most of the added complexity takes the form of custom modules for each sub-task. Thanks to CookieCutter Data Science's prior work, the structure is intuitive despite being complex.

Monday, April 30, 2018

Some new post ideas

It's been two and a half years since I started my first job as a data scientist, and I find that I have a few things to say by now. So I'm starting up the blog again!

Here are some post ideas, in no particular order:

  • What I discuss with data science applicants in interviews
  • What can go wrong if you don't use pipelines in model training
  • The relationship between model complexity and validation robustness
  • The changing role of data science in business
  • Exotic types of data leakage
  • Why neural nets are usually not very good
  • Very general approaches to new data science projects
  • Ethics of data and data science
  • How to succeed by making yourself obsolete
We'll see what I can get to in my free time. See you soon.

- b

Saturday, August 27, 2016

Epilogue

I started my first job as a data scientist about seven months ago, in January. Since then I've logged a ton of hours in SQL, Python, R, and Excel. I've also ridden a horse to a company function, and currently hold all five high scores on the KISS Pinball machine in the office. Here are some reflections on my time so far.


Technical skills I've used

When I was at Insight I was given a two-page list of skills to brush up on, from abstract data structures to algorithms to interview tips. I'm sure that each of these is useful across the wide range of job descriptions that go with the title of data scientist, but here are the ones that are useful to me:
  • SQL: I spend a lot of time writing database queries, and my SQL coding has improved drastically. I've learned that the capabilities of the language go far beyond what's covered in online tutorials, and that there are many things that can go wrong. There are also multiple ways to accomplish the same task, and they may vary greatly in efficiency. I think the only way to learn this is through experience.
  • R and Python: My usage is split about 40/60 in favor of Python. I've found that R is convenient for quick manipulation of data frames. By comparison to R's dplyr package, Pandas in Python is long-winded and unintuitive. But Python is better for longer scripting projects for a few reasons - not the least of which is that package version control is easier. The point is, learn them both.
  • Microsoft Office: First of all, Excel is the bomb. I hadn't really used it since high school, but for very standard analysis like filtering, histograms, and pivot charts on small-ish data sets, it can't be beat. It blows IPython notebook out of the water for speed, and the chart styles have come a long way since 1996. PowerPoint is still the gold standard of deck-building, like it or not. And I work at a consulting company, so I build decks. PPT gets the job done.
  • Machine learning: Here's a helpful hint about machine learning. Gradient boosted decision trees get you 90% of the way there 90% of the time.

What my supervisors expect and appreciate

  • Technical ability is essential, but taken as a given. 90% of my job is technical in nature, but very little of the interview process or later evaluations directly tested those abilities. It's also taken as given that I will be rigorous and intellectually honest. It's in my interest and the company's to test my results at every step along the way, and to ask other people to look over things when I need a pair of eyes outside the problem.
  • When I've received explicitly positive feedback, it has without exception been due to my ability to translate my results to our clients. 
  • Catching mistakes, before or after they happen, is crucial. I think this skill follows nicely from the skepticism that is learned from academic research.
  • My workplace is a community, and my contribution to building that community is appreciated. I trust my colleagues to be highly competent and helpful, and they trust the same of me. My former PI used to say that in order to be successful in research you need (1) devotion to work, (2) creativity, and (3) the ability to work with others. You can do it with only two of these, but it's much harder. I've found that this is not true for business. You must have all three.

Where I'm going next

Day by day, I'm choosing a trajectory in data science. When I express interest in certain tasks, volunteer to take on responsibilities, and perform well, I become more likely to be assigned similar tasks in the future. To balance that, my supervisors have an incentive to make me a well-rounded employee so I can be applied to a wider scope of problems. But where should I aim to go?

Should I try to become the modeling expert in the company? Should I learn more about data engineering to be a more well-rounded technical resource? Should I aim for project management and client interaction? There is little feedback from my supervisors on this, mostly because they want me to do what I like, and to accomplish my own goals. I'll be useful to them regardless. 

My inclination has always been to increase breadth of expertise, sometimes at the expense of depth. I find that by having more context I can work efficiently and be creative. It helps me aim for a few big wins instead of many small wins. I can also take on more diverse projects that way, which is part of the reason I transitioned to data science. Right now this means trying to get more client interaction, and absorbing as much domain knowledge as I can. That's a frustratingly slow process, but I'm not in a rush. I'm having fun.

Wednesday, December 16, 2015

Book summary: Predictably Irrational


A future work colleague recently recommended a book to me that might be useful as a consultant working with data. It's called Predictably Irrational, by Dan Ariely, and it contains a catalogue and commentary on some of the cognitive biases that cause humans to make irrational decisions, particularly in economic settings. He describes experiments (both virtual and real) that demonstrate the effects, argues for their pervasiveness in our lives, and proposes ways to mitigate them.

I wanted to put up this post as a summary of the book and a record of some of my reactions to it, which will hopefully force me to read actively rather than skimming. Hopefully you'll find it useful too.

Summaries of chapters are on the left in black,
my reactions are on the right in purple.

Introduction

Economics is based on a model that says humans make rational decisions according to their preferences, and with complete information. In contrast, behavioral economics considers that people sometimes make irrational decisions. These decisions are not random, however, but can be tested experimentally and predicted. Some of these errors may be mitigated by learning about the forces that shape our irrational choices.


I was just writing about how cognitive bias takes a back seat in data science, but it's driving the car in this book. Having read Dan Kahneman's Thinking, Fast and Slow and spent some time on lesswrong.com, I'm familiar with the concept. It sounds like this book is principally concerned with cognitive bias as it applies to economic decisions, which I assume will be useful in my job. I don't know what Ariely's motivation is yet - is he going to teach us how to use behavioral economics to sell more hamburgers, or to make sure we buy the best hamburger? 

1. The Truth about Relativity:  Why Everything is Relative -- Even When It Shouldn't Be

It's hard to assess the value of something unless it's compared to other similar things. Given three options (A, B, C) with B and C being similar but B superior to C, people will tend to choose option B regardless of the quality of A. This is called the decoy effect. Ariely suggests that it's partly responsible for the creep of materialism, in that when we buy a car we tend to compare it to slightly better cars. He also suggests that to limit the decoy effect's influence on us, we can try to limit the options we have access to. Experiments are presented that show that the effect is real and strong.

I hadn't really thought about how pervasive this effect is in our lives. Ariely points out that it's responsible for the "wingman effect," where you can bring a less attractive friend to a bar when you're looking for a date. You not only look better compared to your friend, but to the rest of the bar as well. It's also used extensively in advertising to make us "creep" up to more expensive items or packages. 

Ariely's writing is very conversational and a bit hyperbolic. The analysis is qualitative, and it's not exactly clear how strong this effect is, and whether the results are statistically significant. I'm interested in how the decoy effect competes or cooperates with updating preferences based on prior information. We don't assign value to a set of speakers without hearing similar speakers in part because we haven't heard any of the speakers yet.

I'm somewhat surprised by the suggestion that we intentionally and artificially limit our access to options in order to curb this effect. This is not the usual approach to cognitive bias, but it's consistent with the view that ignorance is bliss.

2. The Fallacy of Supply and Demand: Why the Price of Pearls -- and Everything Else -- Is Up in the Air

The law of supply and demand is supposed to set the price of goods and services, but this is based on a rational economic model. In reality, the price of things is pretty arbitrary. Ariely introduces the concept of arbitrary coherence, where prices and behaviors may be based on previous prices or behaviors, but the original ones were set arbitrarily. He uses the well-known bias of anchoring, in which our beliefs about numerical quantities are influenced by numbers we've seen or heard recently, even if they have no logical connection. He talks about some experiments he's done that show that not only does anchoring affect the price we're willing to pay for something, but it can even make the difference between being willing to pay for something or demanding to be paid for it, if the thing in question has ambiguous value.

He also shows that the influence of anchors is long-lived: you can't replace them with new anchors easily. Some theories as to why we are willing to pay so much for Starbucks coffee are put forward (we don't compare to similar products because of the difference in atmosphere).

Ariely also talks about self herding, in which we think about how often we've done something before, and take that as independent evidence that that thing is good to do. This shores up the theory that our habits may be quite arbitrary at their roots. His advice is to become aware of these vulnerabilities and question our habits. Consider the rewards and costs of each decision. Mathematically speaking, this theory suggests that supply and demand are not independent, but coupled.

We tend to rationalize our choices after the fact, and Ariely argues that this might not be a bad thing as long as it makes us happy.


Arbitrary coherence is an interesting idea. It reminds me of filamentation in nonlinear optics, or actually just anything in nonlinear optics. Or just anything nonlinear with a random seed. It's also a pretty bold claim, and I've never heard anyone else make it. Granted, I don't run in sociological circles. I'll have to run it by some friends. Self-herding as a mechanism is sexy.

This chapter leaves me wanting to see some equations. What happens formally when you assume various kinds of coupling terms between supply and demand? Surely this is solved. I assume it results in prices that are either saturating exponentials, oscillatory functions, or hyperbolic trig functions. This seems like a simple opportunity to extend conventional economics into the field of behavioral economics.

Finally, I feel strongly that rationalization is bad. If you have arbitrary preferences for one car over another one, they can be taken into account formally using decision theory. Even with the craziest preferences in the world, you can still make consistent decisions as long as you don't violate a few axioms. As soon as you allow ad hoc rationalization, decision theory goes out the window. That's a sacrifice I'm not willing to make.

3. The Cost of Zero Cost: Why We Often Pay Too Much When We Pay Nothing


People respond to a cost of $0.00 differently than to other similar costs. For example, consider the following two situations.
People make decisions about these choices as if the relative value of A over B were much greater in Choice 1 than Choice 2, even though the difference in cost is identical, and $0.01 is certainly below the wealth-depletion regime. Ariely gives an example of Amazon, where the inclusion of free shipping for orders above a threshold entices people to spend more than the cost of shipping, and points out that this does not work if they reduce the cost of shipping to a few cents.

He also mentions a thought experiment where you are given a choice between two gift certificates (again, for Amazon). The first is a free $10 gift card, and the second is a $20 gift card that requires you to first pay $7. He speculates that people would be more likely to choose the first, even though the second leaves you $13 ahead instead of $10.


Seems reasonable. I take issue with the Amazon example though. On one hand, suppose we replace "Amazon" with "turpentine factory." I don't value turpentine, so the decision to take the free option is quite rational. Now suppose we replace "Amazon" with "cold, hard cash." Everyone would obviously choose the second option because everyone values cash as much as cash. So the question seems to depend on how much we value the commodity. I don't see why that's irrational.


4. The Cost of Social Norms: Why We are Happy to Do Things, But Not When We are Paid to Do Them.

There are market norms, and there are social norms, and it seems that people act as if there is a sharp division between the two. For example, we don't offer payment for a family Thanksgiving dinner, but we might bring an expensive bottle of wine. Essentially, if we apply social norms to situations we tend to undervalue our own time and effort.

Companies try to sell us things by trying to forge a connection to us. Open source software is another example, where very skilled people spend a lot of their time working for free, but would not do so for a small amount of money. Ariely hypothesizes that this effect is responsible for police being willing to risk their lives for relatively small amounts of money.

Now I feel like we're getting somewhere. This is a familiar concept of course, but this chapter puts it in concrete terms. It reminds me of the subscription program on Twitch.tv, where I subscribe to people for essentially no personal benefit. It puts a name to the awkwardness I feel offering money to friends when they babysit for me. 

It also calls to mind the office culture in my soon-to-be office. They have a family-like atmosphere, unlimited paid time off, and even award a bonus for taking seven consecutive days off during the year. This may be effective at generating more committed workers.

I wonder what kind of norms are at play in academic research science. It is certainly true that the amount of money I made was far below what I could have made in the private sector. I also reviewed papers for free even though I paid publication fees when I had papers accepted at journals. Was I motivated by social norms like the quest for scientific truth, the solidarity between academic scientists in a similar situation? Or was I motivated more by the eventual promise of tenure, which is a market-like reward?

5. The Power of a Free Cookie: How Free Can Make Us Less Selfish.

If something is free instead of cheap, we tend to regulate it with social norms instead of supply and demand. So if you offer a tray of free cookies to a group of people sequentially, they will tend to take a small number of cookies. But if you sell those same cookies for $0.05 each to those same people, people tend to buy many of them, even though five cents is basically free.

Ariely suggests that this phenomenon will keep programs like cap-and-trade from working. By putting a price on pollution instead of making it free, companies may feel free to pollute more, since they've paid for it, whereas if it were free they might be bound by social norms.

This seems reasonable, and I can think of anecdotal evidence to support it. Ariely's experiments also show it to be true. I have reservations about his statements on cap-and-trade, since I think the total amount of credits is kept constant. The companies couldn't pollute more in total than before. I also don't think companies are bound strongly by social norms in the first place, but that's speculation on my part.

Frankly, I feel pretty bad when I buy all of something cheap at a grocery store, similar to how people apparently treat free things. Not sure how sharp this division is.

6. The Influence of Arousal: Why Hot is Much Hotter than We Realize.

Ariely describes some fairly interesting experiments that demonstrate two things. (1) We make different decisions when sexually aroused, and (2) we can't accurately predict how different those decisions will be ahead of time. The first of these is pretty obvious, and the second is less obvious and a little worrying.

Specifically, test subjects were more likely to agree to propositions like "I would not use a condom if I thought my partner would change his/her mind while I went to get it" while aroused than while not aroused. In light of this, Ariely suggests that asking teens to make good decisions while heated up is basically useless. If we want to prevent pregnancy and STD transmission in teens, we either need to keep them from being aroused, or we need to make things like condoms so universally available that teens never really have to think about whether to use them or not.

In this chapter Ariely makes explicit what he's only implied before: in his model of human thought, we are like Dr. Jekyll and Mr. Hyde. The latter makes decisions that the former would not agree with or necessarily predict correctly. He also brings up id, ego, and superego more than once, and paraphrases Sigmund Freud. He goes so far as to say that "We may, in fact, be an agglomeration of multiple selves." Not being a psychologist, I'm not sure what to make of this model. It was my understanding that Freud's theories were no longer thought to be an accurate description of psychological phenomena, but maybe I'm wrong. In any case, it seems that there should at least be a continuous transition between Dr. Jekyll and Mr. Hyde.

7. The Problem of Procrastination and Self-control: Why We Can't Make Ourselves Do What We Want to Do.

People don't save as much money as they used to, and also not as much as they say they want to. In fact, the average US family has thousands of dollars of credit card debt. People also procrastinate, and are often looking for ways to avoid it. In this chapter, Ariely tells a success story of Ford Motor Company, who managed to get people to follow their maintenance schedule by simplifying it at the cost of part efficiency (i.e. some parts were inspected before it was strictly necessary). He suggests that a similar idea might be applied to medicine, where people tend to fail to get routine checkups done.

He also pitches an idea for a credit card where consumers could decide ahead of time how much money they wanted to spend on certain categories of goods, and what penalties they would face if they attempted to exceed these limits.

Just yesterday I was doing some Pomodoros, which is a way of setting an artificial deadline. Ariely's advice for overcoming procrastination also includes rewarding yourself for doing things you don't want to do, which is pretty standard advice.

A bit off subject, I'm wondering how much confirmation bias is coming into play while I'm reading this book. Am I too willing to accept that these effects are real? After all, the experimental data are presented anecdotally, with no statistical significances quoted. The person designing them could in principle have fallen victim to any number of well known biases in scientific studies. For the moment, my own recognition of these effects in my life is suspending my disbelief. It's making me very curious to what degree we can quantify these effects.

8. The High Price of Ownership: Why We Overvalue What We Have.

People tend to assign higher value to what they have, and lower value to what other people have. The main experiment presented in this chapter involved interviewing students who had won Duke basketball tickets in a lottery. Owners quoted an average selling price of $2400, and non-owners quoted an average buying price of $175 - a huge difference. Both owners and non-owners were more concerned with what they had to lose in the sale than what they had to gain. Ariely speculates that part of this effect comes from the transference of our emotional attachment to the buyer. Also, buyers are ignorant of the emotional history that the seller has with a house or a car.

This effect is important on auction sites like eBay, where "partial ownership" - the feeling of ownership generated by leading a bidding war for some time - might explain overbidding in the last few minutes of the auction. It also may apply to politics, religion, or other ideologies. We value the ones we hold and undervalue those of others.

How does this compare to "the grass is always greener on the other side of the fence," - where we tend to undervalue what we have and overvalue what others have? Could we set up similar experiments to demonstrate exactly the opposite effect? Could we set up an experiment that quantifies the relationship between these two effects?

9. Keeping Doors Open: Why Options Distract Us From Our Main Objective.

People tend to avoid committing to a single option when many are available, and will sacrifice expected value in order to keep these options available. In experiments, subjects were given a screen with three doors to open. Clicking inside each door gave a randomized payout, and subjects had a limited number of clicks to spend. They turned out to be pretty good at adapting their strategy to maximize expected value. Then a new condition was added, where after 8 consecutive clicks elsewhere a door would be locked permanently. Subjects preferred to jump around keeping doors open, even when the expected value for each door was advertised.

Ariely advises us to focus down on a smaller number of opportunities, and stop investing in things that are getting us nowhere, explicitly citing a woman he knew trying to choose between two boyfriends. He goes so far as to suggest that we "stop sending holiday cards to people who have moved on to other lives and friends." He also models the US Congress as a person reluctant to choose one option among many, which leads to gridlock.

I have a few issues with this chapter. The results of the door experiment are completely nuts, first of all. I can tell you that neither I nor my poker buddies would think twice about letting a low-EV door close permanently. In fact, I wonder how robust these effects are against training. I know for example that anchoring is almost impossible to avoid even by experts at cognitive bias studies. Maybe this "commitment bias" is less robust.

Second, while I understand that failing to commit to a serious relationship can have bad consequences, I'm pretty sure that sending holiday cards has basically zero cost. I don't know why cutting off old friends is going to improve my life unless I'm spending so much time writing Christmas cards that I don't have time to hang out with my other friends.

Third, Congress is in deadlock not because it's a single entity that fails to commit to a course of action, but because it's made of at least two entities completely committed to conflicting courses of action. I don't see why this bias applies.

10. The Effect of Expectations: Why the Mind Gets What it Expects.

The basic message of this chapter is that our experience depends on what we expect from it. Expensive wine tastes better, even if it's identical to cheap wine. World-famous classical musicians play unnoticed in subways. This effect may also be responsible for brand loyalty to some extent. MRI images taken from people drinking Coke and Pepsi show that the area of the brain associated with higher functions is preferentially active for Coke, meaning that it's not simply a taste experience, but also a memory experience. Ariely makes a few suggestions to help us make less biased decisions, which involve blinding ourselves to labels.

The mere fact that prior information influences our beliefs in a subjective way is not interesting - it's a fundamental tenet of probability theory. The interesting thing here is that it changes our perception in a way that depends on the time that we learn the information. For example, in a taste test, our reported experience depends strongly on whether we learn the labels before or after tasting, even if reports are made after everything.

Ariely says something interesting in this chapter. Regarding manipulating people's expectations intentionally, he says "I am not endorsing the morality of such actions, just point to the expected outcomes." So he explicitly takes an agnostic position on the ethics of learning about or using cognitive bias effects.

11. The Power of Price: Why a 50 Cent Aspirin Can Do What a Penny Aspirin Cannot.

This chapter covers the placebo effect. In experiments, pills labeled as more expensive work better. In another experiment, an inert drink was offered as a physical or mental boosting agent and had the advertised effect. Ariely suggests that it is ethical to exploit the placebo effect by intentionally prescribing placebos. He also comments on the ethics of experiments which test whether a treatment functions through the placebo effect.

This seems to be a special case of chapter 10's discussion of expectations. I'm interested in the ethics discussion. By now it's clear that Ariely is in favor of exploiting cognitive bias to put ourselves in positive situations, as opposed to attempting to eliminate it.

12. The Cycle of Distrust: Why We Don't Believe What Marketeers Tell Us.

People are always looking for a catch. In an experiment literally giving away free money to passersby on a college campus, only 20% of people took a $50 bill. Ariely also discusses the Tragedy of the Commons, which is a game theory scenario in which cooperation is optimal but unstable. Defection is preferred over the short term, but causes everyone to lose in the long term.

Examples are given of companies who were able to recover from PR disasters by sustained transparency efforts. Ariely says that companies can get away with lying to some small extent, but then they lose people's trust. Once it's gone it's very hard to regain.

In another experiment, people were very willing to accept obviously true statements like "the sun is yellow" if they came from unnamed sources, but if attributed to entities like Procter & Gamble or The Democratic Party, they became suspicious (specifically, they started wondering about orange or red colors in the sun). This showed that they were looking for excuses not to believe these entities.

This chapter drives home that trust and distrust don't just cancel out when it comes to people and companies (also probably people and people). Maybe obvious in retrospect, but worth noting. Not sure I have anything more to say about that.

13. The Context of Our Character, Part 1: Why We Are Dishonest, and What We Can Do About It.

This chapter is about dishonesty, and how we think about dishonesty involving cash as different from other dishonesty, even when it has greater equivalent cost.

White-collar crimes cause much more financial damage than, say, petty theft. We put lots of resources into dealing with the latter, and very little into dealing with the former. Ariely conducted an experiment at Harvard, which gave people an opportunity to cheat when reporting test scores, in a setting where better scores earned more money. He found that people cheated to a small degree when given the opportunity, but that varying the risk of being caught didn't have much of an effect. He also exposed test subjects to neutral text as well as the Ten Commandments before the test, finding that the latter group cheated less.

He suggests that requiring various professionals to take oaths of honesty would curb cheating, even without enforcement.

Again, Ariely uses Freud and the concept of the superego to explain this phenomenon. He states that criminals do not perform cost/benefit analyses before committing crimes. I don't know whether this is true, but I'm sure I personally would do such an analysis.


14. The Context of Our Character, Part 2: Why Dealing With Cash Makes Us More Honest.

People cheat more when cash is not directly involved, even when the link to cash is explicit. In an experiment similar to the one above, where test takers won tokens immediately redeemable for cash, cheating increased significantly. Ariely points out that the rates of cheating measured in this experiment should be taken as a baseline, having been done on otherwise honest students in controlled conditions, with the implication that real-world cheating should occur at a much higher rate. He also contends that companies are very happy to cheat consumers out of money as long as it's not technically cash. He uses blackout days of frequent flier miles as an example: having to spend more miles to make a purchase is effectively equivalent to having to spend more cash.

The results of this experiment were surprising to me, and this is the kind of thing I don't feel like I have a good intuition for. People fail to make consistent decisions when cash vs. cash equivalents are involved. I assume this has implications not only for security, but also for marketing. I'll have to reflect on this for a while.


15. Beer and Free Lunches: What Is Behavioral Economics, and Where Are The Free Lunches?

Ariely discusses an experiment he ran in which he gave away free beer. Given four choices of beer, groups of people tended to disproportionately choose different beers from each other when asked to order out loud in series. Others ordered in secret and tended to gravitate to some beers more than others. When asked to rate the beers afterward, people rated the beers higher if they were ordered in secret. This implies that people are willing to sacrifice expected value in order to appear to be unique.

By "free lunch," Ariely is referring to win-win situations. He suggests that behavioral economics may be able to find many such lunches.

The beer experiment seems to be in conflict with what we've learned about ownership bias. If I chose some particular beer, wouldn't I be likely to overvalue it and rate it highly? Even so, maybe the effect causes a constant bias regardless of whether it was a secret or public order. The main lesson of the book, Ariely says, is that we are "pawns in a game whose forces we largely fail to comprehend." Or in other words, if you think you're making rational decisions, you're fooling yourself.


Final Thoughts

I learned about a few new things from this book, so I'm happy I read it. I suspect that some of these ideas will be useful in my new job, but it's hard to know for sure without context. Already I'm seeing certain of these biases in my everyday life, which is a good indicator that I've internalized some of this information.

It's not clear to me how strong these effects are, particularly with respect to each other. I wonder to what extent they can be quantified. Ariely has convinced me that they influence our behavior to some degree, but without a full statistical analysis it's hard for me to know how much confidence to have in each one. The premise of the book is that we can predict irrational behavior, but it's going to be hard to make decisions without a quantitative measure, i.e. the probability and magnitude of an irrational decision.

I'm more interested in the way Ariely advocates dealing with cognitive biases. His stance is not that we try to eliminate them, or even adjust our calculations to account for known biases, but instead to embrace them. We should put ourselves in situations where our biases lead to good decisions, which may involve limiting the information we allow ourselves to access. And we should recognize that if we are happy because of a bias, that still counts as being happy. It may not be worth trying to correct a bias when doing so would ultimately make us less happy.

Ariely doesn't say anything about the ethics of using these biases to manipulate others, even though he does indeed suggest that certain manipulations would work. I'm looking forward to discussing this with more of the data science community as I meet them.


Sunday, December 6, 2015

New job

I started this blog the day after I decided to begin a career transition, so that would be May 12th of this year. It's been six months and 23 days since then, and I have to say that things have gone just about as smoothly and quickly as they could have gone.

Last Friday, the 4th of December, I signed a work agreement to be a data scientist at a consulting company starting in January. They work with media and technology companies to better market their products. The company has a great culture and some very smart people, and I'm extremely excited to start. I now feel comfortable calling myself a professional data scientist.

The question now is what to do with the blog. It was intended to serve two purposes: build a portfolio, and help other people with a similar transition by talking about my experience. The former goal seems moot now, and I only need a few more posts to talk about the latter. I think the most likely thing is that I'll finish those posts and then start a new blog about data in general.

Specifically, I'm interested in the culture of data science and how it relates to research science. I would like to become a voice in the community, but I'm not sure yet what I have to say. Let's find out.

Saturday, December 5, 2015

Digit recognition part 2: a validation pipeline

[Link to part 1]

I've been looking recently at the MNIST data set, which contains thousands of hand-written digits like this:
Example hand-written numerals from the MNIST data set

where we also have a label for each digit $\in \left[0,9\right]$. We would like to use these examples to assign labels to a set of unknown digits.

In part 1 of this series, I looked at the data set and did some preliminary analysis, concluding that:
  1. There's not much variance within each digit label, i.e. all 5's look pretty much the same.
  2. Most inter-numeral variance occurs near the center of the field, implying that we can probably throw away the pixels near the edge.
Rather than jumping right into optimizing a classifier in part 2, I'd like to build a validation pipeline. Any time we do machine learning, we want to try to quantify how well our regression or classification should perform on future data. To do otherwise is to leave ourselves prone to errors like overfitting. Validation in this case will apply the classifier to a new set of digits, and then compare the predicted labels to the actual labels.

The Methodology

Here is a pretty concise description of the usual validation methodology. Basically, we break the data into three chunks before we start: a training set, validation set, and test set. Every time we train a classifier we use the training set, and then evaluate its performance on the validation set. We do that iteratively while tuning metaparameters until we're happy with the classifier, and then test it on the test set. Since we use the validation set to tune the classifier, the tuning sort of "contaminates" the classifier with information from that set, which is why we need the pristine test set. It gives us a better indicator of how the classifier will perform with new data.

The pipeline

What do we want our validation suite to look like? It might include:
  1. Standard goodness-of-fit scores, like precision, accuracy, or F1 scores.
  2. Confusion matrices, which illustrate what numerals are likely to be assigned which incorrect labels (e.g. "6" is likely to be labeled "8")
  3. Classifier-specific performance plots to evaluate hyperparameters, like regularization constants. These show the training and test error vs. each hyperparameter.

Example: logistic classification

It will be helpful to have a classifier to train in order to build the validation pipeline, so let's choose a simple one. A logistic classifier is a logistic regression in which we apply a threshold to the predicted probability to classify a data point. Besides being simple, it's also not going to work very well. For illustrative purposes, that's perfect. I'd like to look at how the performance changes with the hyperparameters, which won't be possible if the performance is close to perfect.

I'm using IPython Notebook again, and I've uploaded the notebook to GitHub so you can follow along, but I'll also paste in some code in case you just want to copy it (please copy away!).

We're just going to use the logistic regression functionality from SciKit-Learn. First I import the data and split it into three groups. 70% goes to training, and 15% each to validation and test sets.

Partitioning the data into training, validation, and test sets.
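
A minimal sketch of the 70/15/15 split described above, assuming the samples can be addressed by integer index. The function name `split_indices` and the fixed seed are illustrative choices, not taken from the notebook. Calling scikit-learn's `train_test_split` twice would do the same job.

```python
import random

def split_indices(n_samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle the indices 0..n_samples-1 and cut them into three disjoint groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = int(round(train_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever is left over, roughly 15%
    return train, val, test

# Hypothetical data set size, for illustration only
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The test indices are set aside at this point and not touched again until the classifier has been tuned on the validation set.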



Here I implement a logistic regression with a linear kernel from SciKit-learn. To do some basic validation, I'll just choose a regularization parameter (C in this case) and train the classifier.

Then we can create a validation report, which includes precision, recall, and F1 score for each numeral. 
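
As a rough sketch of what such a report computes, here are per-numeral precision, recall, and F1 worked out by hand from paired lists of true and predicted labels. The function name `per_class_report` and the toy labels are made up for illustration and are not results from the MNIST run. scikit-learn's `classification_report` produces a similar table.

```python
def per_class_report(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} from paired true/predicted labels."""
    report = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Guard against empty denominators when a label is never predicted or never present
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1)
    return report

# Toy labels, for illustration only
y_true = [4, 4, 4, 9, 9, 8]
y_pred = [4, 9, 9, 9, 9, 8]
report = per_class_report(y_true, y_pred, labels=[4, 8, 9])
for label, (p, r, f) in sorted(report.items()):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case the "4" has perfect precision but poor recall, because two of the three 4's were labeled "9".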




It's a bit easier for me to parse things in visual format, so I also made an image out of the confusion matrix. I set the diagonal elements (which were classified correctly) to zero to increase the contrast.
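
A small sketch of that step, assuming integer labels starting at zero. The helper names `confusion_matrix` and `zero_diagonal` and the toy labels are illustrative, not the notebook's code.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i][j] counts inputs with true label i that were predicted as j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def zero_diagonal(matrix):
    """Copy the matrix with the correct classifications removed, leaving only mistakes."""
    return [[0 if i == j else count for j, count in enumerate(row)]
            for i, row in enumerate(matrix)]

# Toy labels, for illustration only
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 0]
cm = confusion_matrix(y_true, y_pred, 3)
errors = zero_diagonal(cm)
for row in errors:
    print(row)
```

The `errors` grid can then be handed to an image routine such as matplotlib's `imshow` with a grayscale colormap, so that larger off-diagonal counts show up as brighter squares.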


Whiter squares indicate more misclassifications. We see that the most frequent mistake is that "4" tends to get classified as "9", but we also tend to over-assign the numeral "8" to inputs of "1", "2", and "5". Interestingly, this is not a symmetric matrix, so for example we tend to assign the right label to "8" as an input.


Hyperparameters

If we stick with models that are linear in each pixel value, the only hyperparameter we need to choose for logistic regression is the regularization constant, which controls to what degree we weight the input pixels. The two common regularization choices I'll consider are $l2$ (ridge regression or Tikhonov regularization), and $l1$ (lasso). The former tends to result in a "smooth" weighting, where we put similar weights on everything, but the total overall weight is small. The latter results in "sparse" weighting, where we eliminate many of the inputs as being noninformative.

If we regularize too little, we'll find that while we have low fit error on the training set, we have large errors on the validation set, which is called overfitting. If we regularize too much, we'll find that we're ignoring important information from the input, resulting in large errors for the training and validation sets. This is called underfitting, and the error is called bias.

It can be useful to plot the training and validation error as a function of the regularization constants to see where the regularization performs best. And since we have a pretty large data set, I'll take only a small fraction of the training set. This will make the training go faster, and will just give us an idea of the parameters we should use in the classifier. Let's look at l2 regularization first.




In this plot, larger values mean that the classifier is doing a better job, with 1.00 implying perfect classification. On the horizontal axis, larger values mean less regularization. The red squares show that as we weaken the regularization, the classifier does a better job with the training data. But the performance on the validation data improves for a bit, and then slowly degrades. So for very little regularization, we have overfitting. From a probabilistic point of view, the classifier is no longer representative of the ensemble from which we draw the data.

The validation score peaks around $C\approx 10^{-2.5}$, so even though I've trained on a small subset of the data, I would use this value moving forward.
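
The sweep behind a plot like this can be sketched as follows. This is not the notebook's code. It assumes scikit-learn is installed and uses its small bundled 8x8 digits set (`load_digits`) as a stand-in for MNIST, so the best value of `C` it finds will not match the number quoted above. The subset size of 300 and the grid of `C` values are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled 8x8 digits, not the full MNIST set
X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values in this set run from 0 to 16
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Keep only a small fraction of the training set so the sweep runs quickly
X_small, y_small = X_train[:300], y_train[:300]

C_values = [10.0 ** k for k in range(-4, 3)]  # larger C means weaker regularization
train_scores, val_scores = [], []
for C in C_values:
    clf = LogisticRegression(C=C, max_iter=2000)  # l2 penalty is the default
    clf.fit(X_small, y_small)
    train_scores.append(clf.score(X_small, y_small))  # accuracy on training data
    val_scores.append(clf.score(X_val, y_val))        # accuracy on validation data

best_C = C_values[val_scores.index(max(val_scores))]
for C, tr, va in zip(C_values, train_scores, val_scores):
    print(f"C={C:g}  train={tr:.3f}  validation={va:.3f}")
print("best C on validation:", best_C)
```

Plotting `train_scores` and `val_scores` against `C_values` on a log axis gives the kind of curve discussed above.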

Now let's make the same graph using $l1$ regularization.
The same trends are present here, but the exact value of the optimum is different - around $C\approx 10^{-5.5}$. As a nice illustration, we can run the classifier with this value and see which pixels it eliminates. To do that, we retrieve the coefficients from the classifier, of which we get one per pixel per numeral. Keeping only those pixels whose coefficients are $>0$ for at least one of the numerals generates this map:

So to recap, white pixels are those the classifier decides to keep if we tell it to get rid of the least informative ones. Compare this to our map of the variance of each pixel:

and we see that our hunch was correct. The classifier preferentially kept the high-variance pixels.
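
The pixel map described above boils down to a simple reduction over the coefficient array. The sketch below assumes the coefficients have already been pulled out of the fitted classifier as one row per numeral and one column per pixel, which is the layout of scikit-learn's `coef_` attribute. The name `kept_pixel_mask` and the toy coefficients are hypothetical. Here a pixel counts as kept if any numeral gives it a nonzero weight, negative weights included, which is an assumption. Swap in a strict `> 0` test to match the description above exactly.

```python
def kept_pixel_mask(coef, n_rows, n_cols, tol=0.0):
    """coef[k][j] is the weight of pixel j for numeral k.

    Returns an n_rows x n_cols grid with 1 where at least one numeral
    has a weight larger than tol in magnitude, and 0 elsewhere.
    """
    n_pixels = n_rows * n_cols
    if any(len(row) != n_pixels for row in coef):
        raise ValueError("each coefficient row must have n_rows * n_cols entries")
    flat = [1 if any(abs(row[j]) > tol for row in coef) else 0
            for j in range(n_pixels)]
    # Fold the flat pixel list back into image rows
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

# Hypothetical coefficients for 2 numerals on a 2x3 image
coef = [
    [0.0, 0.5, 0.0, 0.0, -0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.0],
]
mask = kept_pixel_mask(coef, 2, 3)
for row in mask:
    print(row)
```

For the MNIST images the grid would be 28 by 28, and the resulting mask can be displayed next to the per-pixel variance map for comparison.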

Now that we have this pipeline, we should be able to use it for other classifiers. The exact analysis will likely change, but at least we'll have a basis for comparison.

Friday, November 27, 2015

What has surprised me about data science

It's been six months since I decided to become a data scientist, and I want to take a moment to reflect on what's been surprising about the journey. When I started, I had a study plan that involved machine learning and statistics courses, and plenty of programming practice.

Surprising things about the physics -> data science transition:

  1. Data scientists love hypothesis testing. Specifically, they are serious about binary hypothesis testing, and they tend to take the classical view of null and alternative hypotheses. In business this is called A/B testing, and it's used to make high-impact decisions at major companies every day. This surprises me because I didn't run into formal hypothesis testing very much in physics (although it's more popular in biology, for example). On the other hand, we expect that anywhere people stand to make or lose money based on decisions, those decisions should probably be considered formally. The emphasis on hypothesis testing is symptomatic of rationality, but we have to be careful about defining the hypothesis space. Not everything that seems binary is binary.
  2. Cognitive bias takes a back seat. I'm used to thinking about rationality as the study of a trifecta of concepts: decision theory, probability, and cognitive bias. People in DS seem very concerned about the first two, while the third is just kind of kept in mind while we do our analysis. It's a hidden variable that causes the results of A/B tests to come out the way they do, and it plays a role in the way we communicate our results to the decision makers. But there doesn't seem to be much attention paid to the pitfalls of cognitive bias in our own analysis. I also haven't found anyone trying to exploit it to influence decisions, even though many data scientists work in marketing. I wonder if this is a place where I can contribute with a unique skill set.
  3. Statistics is hard. I thought I would be able to knock this out really quickly since I'm extremely comfortable with probability, but that has not been the case. Part of the problem is that it's typically taught from a tool-oriented viewpoint. We learn that in situation $A$, with $N$ samples from $M$ ensembles, test $X$ is appropriate. Contrast this with the bottom-up approach of Bayesian probability, where we start with the question of how to define a measure of likelihood, and we write out a complete hypothesis space before any problem. This may be why concepts like confidence intervals and p-values are commonly misunderstood even by expert practitioners. I have struggled to reconcile the tools of statistics with the formal logic of probability theory, but I can at least use the tools appropriately. The rest will come with practice.
  4. Machine learning is easy. Or at least, for most problems. ML makes up a large percentage of the data science news and tutorials I see, and there's a lot of emphasis on figuring out when to apply which method (again, a tool-oriented approach). But in basically every real example I've found, you can throw any classifier at the problem and it's pretty much OK. Furthermore, unless you have a huge amount of data or a very large parameter space, you can set up a pipeline in Python that tries different classifiers with different hyperparameters (like regularization parameters) and find the one that performs best. You just have to be careful about setting aside test and validation sets. All this to say that we can often afford a brute-force approach to machine learning.
  5. Interviews take a lot of resources. I had assumed that it would be rational for companies to do phone screens for any candidate who was possibly qualified for a position, since it's well known that resumes are not good predictors of success. By asking specific technical and non-technical questions, I assume that a competent hiring manager could separate the wheat from the chaff. Sending a data challenge seems like a similarly good idea. But this ignores the fact that even half-hour phone screens require time and effort that's not being put toward high-priority projects, and that someone has to review the data challenge results, which is a lot like grading a test. Which sucks. So it seems like companies are still sort of stuck choosing interviewees based on resume keywords. Seems like a bad idea, but I don't have a great solution.
  6. People are afraid of hiring academics. This is something Kathy Copic mentioned at a panel I attended, and it sounded ridiculous at the time, but I can tell you that it's absolutely true: managers are afraid of hiring "stereotypical" academic researchers, who prefer to work alone on very difficult problems for a long time, and generate theoretically perfect results that are of no use to anyone. They also prefer well-defined problems, are not intellectually agile, and are culturally incompetent. I don't know if this fear is founded - maybe there are horror stories about previous hires who fit this description. But a good academic does none of these things either: she works efficiently on small problems in pursuit of bigger ones, adapts her strategy according to previous successes or failures, and is able to collaborate with others and communicate her results.
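
The brute-force approach from item 4 can be sketched in a few lines. Everything here is a toy: the two "classifiers" (`fit_majority`, `fit_threshold`), the one-dimensional data, and the helper `brute_force_search` are hypothetical stand-ins meant only to show the shape of the loop. In practice something like scikit-learn's `GridSearchCV` covers the same ground for a single estimator.

```python
from itertools import product

def accuracy(predict, data):
    """Fraction of (x, label) pairs that predict() labels correctly."""
    return sum(1 for x, label in data if predict(x) == label) / len(data)

def fit_threshold(train, threshold):
    """Toy classifier: label 1 when x exceeds a fixed threshold (ignores train)."""
    return lambda x: int(x > threshold)

def fit_majority(train):
    """Toy classifier: always predict the most common training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def brute_force_search(candidates, train, validation):
    """Try every classifier with every hyperparameter combination; keep the best."""
    best = None
    for name, fit, grid in candidates:
        names = sorted(grid)
        # product() over an empty grid yields one empty combination
        for values in product(*(grid[n] for n in names)):
            params = dict(zip(names, values))
            score = accuracy(fit(train, **params), validation)
            if best is None or score > best[0]:
                best = (score, name, params)
    return best

# Toy data, for illustration only
train = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
validation = [(0.3, 0), (0.45, 0), (0.55, 1), (0.8, 1)]
candidates = [
    ("majority", fit_majority, {}),
    ("threshold", fit_threshold, {"threshold": [0.25, 0.5, 0.75]}),
]
best = brute_force_search(candidates, train, validation)
print(best)
```

Because the winner is chosen on the validation set, its score there is optimistic. A separate test set is still needed for the final estimate.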
That's it for now. Thanks for reading!