The times they are a changin’ for advanced analytics
Statistical modelers urged to embrace machine learning, open-source tools for the road ahead.
By Sameer Chopra
My thesis below addresses the following points:
- While statistical modeling is not going away, analytics groups are advised to leverage machine-learning approaches as well.
- While traditional statistical modeling software packages are not going away, analytics groups need to actively build new skill sets in emerging software such as open-source tools (e.g., R, MongoDB) and Big Data tools (e.g., Hadoop). Big Data is just getting bigger, and new tools are emerging that round out the tool suite of analytics groups.
Statistical Modeling vs. Machine Learning
Since the mid-1990s I have used statistical modeling software such as SAS as my primary tool for advanced analytics. I would place myself squarely in the camp of “statistical modelers” (vs. my machine learning friends – though I realize some might quibble with this distinction). Over the years I have led teams of statistical analysts who have primarily used statistical packages such as SAS, SPSS and S-Plus as their go-to analysis tools.
In my current capacity, I am responsible for advanced analytics at Orbitz Worldwide. Advanced analytics is a strategic lever at Orbitz and has the good fortune of executive support at the highest levels. Competing on analytics is feasible only if there is buy-in at the highest levels.
I lead the traditional statistical modelers as well as the chief scientist and the machine learning (ML) crew. At Orbitz, we have found value in employing both types of data mining professionals (machine learners and statistical modelers) because there are many problems well suited to each camp. For example, the statistical modelers effectively address areas such as marketing mix analysis, predictive models across online marketing channels, customer lifetime value models, churn models, credit card fraud models, etc. Similarly, the machine learning staff deploys its algorithms in Big Data areas where system feedback is used to learn quickly from patterns and self-improve, such as the Hotel Recommendation Engine and Hotel Sort on the Orbitz Web site.
Conceptually, both camps are “data mining” professionals, so there is a lot of overlap. For instance, both fields do work with some common methods such as decision trees and clustering algorithms. I also find that the camps often use different jargon for the same basic concepts (“weights” vs. “parameters,” “learning” vs. “fitting,” etc.).
However, I find the machine learning area to be clearly cut from a different cloth: the contrast in tools and approaches between ML practitioners and statistical modelers is rather stark. The following are but a few examples to illustrate some differences between the two sides:
- Apart from cosmetic differences in the labels used, statistical modeling takes a probabilistic approach with a strong emphasis on parametric assumptions, regression diagnostics, inference, hypothesis testing, interpretability of the model and so on, areas that receive far less emphasis in the ML world.
- On the flip side, ML practitioners regularly use tools such as support vector machines (SVM) that are not commonly used by statistical modelers. ML focuses on predictive accuracy and not much on interpretation of models. Note that ML has its roots in artificial intelligence (AI), and practitioners of machine learning tend to have a strong computer science background, which is another key difference.
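The difference in emphasis described in the two points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synthetic data, the helper names (`fit_line`, `slope_t_statistic`, `holdout_mse`) and the 150/50 train/test split are all hypothetical and not taken from any Orbitz model. The same fitted line is examined twice: once the way a statistical modeler might (is the slope significantly different from zero?) and once the way an ML practitioner might (how well does it predict data it has not seen?).

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def slope_t_statistic(xs, ys, a, b):
    """Inference view: t-statistic for the null hypothesis 'slope == 0'."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / (n - 2)            # residual variance estimate
    se_b = math.sqrt(sigma2 / sxx)    # standard error of the slope
    return b / se_b


def holdout_mse(xs, ys, a, b):
    """Prediction view: mean squared error on data not used for fitting."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption for illustration): y = 2 + 0.5*x + noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + random.gauss(0, 1) for x in xs]

train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

a, b = fit_line(train_x, train_y)
t_stat = slope_t_statistic(train_x, train_y, a, b)   # modeler's question
mse = holdout_mse(test_x, test_y, a, b)              # ML practitioner's question

print(f"slope={b:.3f}  t-statistic={t_stat:.1f}  held-out MSE={mse:.3f}")
```

Both numbers come from the same model; the two camps simply tend to ask different questions of it.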
The comparison sparked the following question: “Which side of this analytics fence lends itself better to the road ahead?” My (likely controversial) response: “At this point in time, machine learning!” In fact, the need for machine-learning skills has never been as urgent as it is today. I am not implying that statistical modeling is going away, but I am stating that machine learning is rapidly increasing in relevance and prominence. It makes sense for analytical teams to complement their skill sets by incorporating machine-learning approaches in order to be better positioned for the road ahead.
Not surprisingly, general interest in machine learning has exploded in the past year. Late last year, Stanford University offered a free online course in ML/AI that went viral to the point of having well over 100,000 students register from around the world in a matter of weeks! (This speaks both to the growing interest in ML and to a fundamental paradigm shift in the making in how education is delivered.)
Big Data & Open Source Analytics
Machine learning lends itself well to situations where algorithms must be designed and developed against high-dimensional data and where computational issues are very important. The Big Data paradigm shift, along with open source tools, is ideally suited for ML to leverage.
The open source language R has become the data-mining tool of choice for machine learners for the following reasons:
- R has very good integration with Hadoop, an area where established commercial statistical tools have frankly been playing catch-up over the past year. (Note: At the time of this writing, some established statistical solution providers were announcing an access interface to Hadoop.)
- Many startups and smaller firms do not have deep pockets and are embracing open source tools such as the R programming language and NoSQL database systems such as MongoDB.
- R is a leading language for developing new statistical methods, and it is a platform for statistical innovation and collaboration across both the corporate world and academia. In my opinion, for the first time in years, the stronghold of established commercial players seems to be potentially threatened; open source tools are better suited for Big Data and will slowly but surely continue to take share away from commercialized statistical packages. In fact, traditional statistical vendors have recognized that R is a force to be reckoned with. In response, many of these vendors have developed hooks into R so users can interface with the R language.
- Based on the resumes I’ve been reading, the next generation of data miners is flocking to R as their go-to tool. Professors in general are comfortable with R; they tend to use R and Excel as part of their curriculum.
- In short, open-source analytics tools and platforms have arrived.
R hasn’t been widely adopted in the corporate world because it used to be considered (and still is to a large extent) not quite “enterprise ready,” but even that is changing as firms such as Revolution Analytics focus on the enterprise capabilities for R.
Despite some hype associated with the topic of Big Data, it is generally acknowledged that Big Data and distributed computing are rapidly changing the analytics landscape. Leveraging Hadoop and being well-versed in MapReduce jobs is quickly transitioning from a “nice to know” to a “must do” skill. Here again, machine learning practitioners tend to adapt seamlessly, whereas many traditional statistical modelers seem to face a “who moved my cheese” syndrome. Prerequisites such as being well-versed in Python or Java tend to be second nature to those in the ML camp.
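For readers who have not yet written a MapReduce job, the programming model itself is small. The following is a minimal in-memory sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are hypothetical, and a real job would run the same three phases in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical search-log records, used only for illustration.
records = ["hotel chicago", "flight chicago", "hotel boston"]
counts = run_job(records)
print(counts)
```

Because each call to `map_phase` looks at only one record and each call to `reduce_phase` looks at only one key, a framework is free to spread that work over many machines, which is what makes the model a natural fit for Big Data.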
What does this mean for today’s traditional statistical modelers?
Gone are the days when a statistical analyst might have been complacent about a relatively slowly changing world (relative to, say, a computer science or IT professional who had to work harder to stay current with changing languages and new tools). In order to stay competitive, it would behoove traditional statistical modelers to proactively plunge into professional development mode and take a page from the book of our machine-learning friends.
Specifically, the best-in-class analytical organizations of the future will be those that embrace traditional statistical modeling and machine learning approaches along with established and emerging tools and technology associated with Big Data analytics, including R, Hadoop/HDFS, MapReduce, Java/Python, Pig, Hive, etc.
The times they are a changin’….
Sameer Chopra (Sameer.Chopra@orbitz.com) is vice president of Advanced Analytics at Orbitz Worldwide, Inc., a leading global online travel company. He has more than 15 years of experience in applying data mining and predictive analytics across various business domains at both Fortune 500 firms and startups. Before joining Orbitz, Chopra led the Marketing Analytics and Web testing team at Intuit’s Small Business Group and served as director of analytics at eBay. He holds a master’s degree in Operations Research from the Massachusetts Institute of Technology.