The times they are a changin’ for advanced analytics
Statistical modelers urged to embrace machine learning, open-source tools for the road ahead.
By Sameer Chopra
My thesis below addresses the following points:
- While statistical modeling is not going away, analytics groups are advised to leverage machine-learning approaches as well.
- While traditional statistical modeling software packages are not going away, analytics groups need to actively embrace new skill sets in emerging software such as open-source tools (e.g., R, MongoDB) and Big Data tools (e.g., Hadoop). Big Data is just getting bigger, and new tools are emerging that round out the tool suite of analytics groups.
Statistical Modeling vs. Machine Learning
Since the mid-1990s I have used statistical modeling tools such as SAS as my primary tools for advanced analytics. I would place myself squarely in the camp of “statistical modelers” (vs. my machine learning friends – though I realize some might quibble with this distinction). Over the years I have led teams of statistical analysts who have primarily used statistical packages such as SAS, SPSS and S-Plus as their go-to analysis tools.
In my current capacity, I am responsible for advanced analytics at Orbitz Worldwide. Advanced analytics is a strategic lever at Orbitz and has the good fortune of executive support at the highest levels. Competing on analytics is feasible only if there is buy-in at the highest levels.
I lead the traditional statistical modelers as well as the chief scientist and the machine learning (ML) crew. At Orbitz, we have found value in incorporating both types of data mining professionals (machine learners and statistical modelers) because each camp has many problems to which it is well suited. For example, the statistical modelers effectively address areas such as marketing mix analysis, predictive models across online marketing channels, customer lifetime value models, churn models and credit card fraud models. Similarly, the machine learning staff deploys its algorithms in areas that draw on Big Data, where system feedback is used to learn quickly from patterns and self-improve – areas such as the Hotel Recommendation Engine and Hotel Sort on the Orbitz Web site.
Conceptually, both camps are “data mining” professionals, so there is a lot of overlap. For instance, both fields do work with some common methods such as decision trees and clustering algorithms. I also find that the camps often use different jargon for the same basic concepts (“weights” vs. “parameters,” “learning” vs. “fitting,” etc.).
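To make the jargon point concrete, here is a minimal Python sketch (my own illustration with invented toy data, not code from any Orbitz system). It estimates the same straight line twice: once the way a statistical modeler would describe it, “fitting parameters” in closed form by ordinary least squares, and once the way a machine learner would describe it, “learning weights” by gradient descent. The function names, learning rate and step count are assumptions chosen for the example.

```python
# Toy data, invented for illustration: y is roughly 2*x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]


def fit_ols(xs, ys):
    """Statistical-modeling vocabulary: fit parameters in closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def learn_gd(xs, ys, rate=0.05, steps=5000):
    """ML vocabulary: learn weights by gradient descent on mean squared error."""
    bias, weight = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [bias + weight * x - y for x, y in zip(xs, ys)]
        grad_bias = 2 * sum(errors) / n
        grad_weight = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        bias -= rate * grad_bias
        weight -= rate * grad_weight
    return bias, weight


intercept, slope = fit_ols(xs, ys)
bias, weight = learn_gd(xs, ys)
print("fitted parameters:", intercept, slope)
print("learned weights:  ", bias, weight)
```

On this toy data the two routines agree to within numerical tolerance: the “intercept” and “slope” of one camp are the “bias” and “weight” of the other.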
However, I find the machine learning area to be clearly cut from a different cloth – the contrast in tools and approaches between ML and statistical modelers is rather stark. The following are but a few examples to illustrate some differences between the two sides:
- Apart from cosmetic differences in the labels used, statistical modeling takes a probabilistic approach with a strong emphasis on parametric assumptions, regression diagnostics, inference, hypothesis testing, model interpretability and so on – areas that are not a priority in the ML world.
- On the flip side, ML practitioners regularly use tools such as support vector machines (SVM) that are not commonly used by statistical modelers. ML focuses on predictive accuracy and not much on interpretation of models. Note that ML has its roots in artificial intelligence (AI), and practitioners of machine learning tend to have a strong computer science background – another key difference.
The comparison sparked the following question: “Which side of this analytics fence lends itself better to the road ahead?” My (likely controversial) response: “At this point in time, machine learning!” In fact, never before has the need for machine-learning skills been as urgent as it is today. I am not implying that statistical modeling is going away, but I am stating that machine learning is rapidly increasing in relevance and prominence. It makes sense for analytical teams to complement their skill sets by incorporating machine-learning approaches in order to be better positioned for the road ahead.
Not surprisingly, general interest in machine learning has exploded in the past year. Late last year, Stanford University offered a free online course in ML/AI that went viral to the point of having well over 100,000 students register from around the world in a matter of weeks! (This speaks to both the growing interest in ML as well as to a fundamental paradigm shift in the making vis-à-vis the educational method/framework.)
Big Data & Open Source Analytics
Machine learning lends itself well to situations in which algorithms must be designed and developed for high-dimensional data and computational issues are very important. The Big Data paradigm shift, along with open source tools, is ideally suited for ML to leverage.
The open source language R has become the data-mining tool of choice for machine learners for the following reasons:
- R has very good integration with Hadoop, an area where established commercial statistical tools have frankly been playing catch-up over the past year. (Note: At the time of this writing, some established statistical solution providers were announcing an access interface to Hadoop.)
- Many startups and smaller firms do not have deep pockets and are embracing open source tools such as the R programming language and NoSQL database systems such as MongoDB.
- R is a leading language for developing new statistical methods, and it is a platform for statistical innovation and collaboration across both the corporate world and academia. In my opinion, for the first time in years, the stronghold of established commercial players seems to be potentially threatened; open source tools are better suited for Big Data and will slowly but surely continue to take share away from commercialized statistical packages. In fact, traditional statistical vendors have recognized that R is a force to be reckoned with. In response, many of these vendors have developed hooks into R so users can interface with the R language.
- Based on the resumes I’ve been reading, the next generation of data miners is flocking to R as their go-to tool. Professors in general are comfortable with R; they tend to use R and Excel as part of their curriculum.
- In short, open-source analytics tools and platforms have arrived.
R hasn’t been widely adopted in the corporate world because it used to be considered (and still is to a large extent) not quite “enterprise ready,” but even that is changing as firms such as Revolution Analytics focus on the enterprise capabilities for R.
Despite some hype associated with the topic of Big Data, it is generally acknowledged that Big Data and distributed computing are rapidly changing the analytics landscape. Leveraging Hadoop and being well-versed in writing MapReduce jobs is quickly transitioning from a “nice to know” to a “must do” skill. Here again, machine learning practitioners tend to adapt seamlessly, whereas many traditional statistical modelers seem to face a “who moved my cheese” syndrome. Prerequisites such as being well-versed in Python or Java tend to be second nature to those in the ML camp.
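For readers who have not yet written a MapReduce job, the pattern itself is simple. The sketch below is a single-process Python illustration of the map, shuffle and reduce steps; it is not Hadoop's API, and the click records, hotel identifiers and function names are all hypothetical. On a real cluster the framework distributes the same three steps across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (key, value) pair per input record."""
    hotel_id, _event = record
    yield hotel_id, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence within one process."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical click log: (hotel id, event type).
clicks = [
    ("h1", "view"),
    ("h2", "view"),
    ("h1", "book"),
    ("h3", "view"),
    ("h1", "view"),
]
counts = run_job(clicks)
print(counts)
```

The job counts events per hotel. Because each map call sees only one record and each reduce call sees only one key, both steps can be run in parallel, which is what makes the pattern attractive for Big Data.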
What does this mean for today’s traditional statistical modelers?
Gone are the days when a statistical analyst might have been complacent about a relatively slowly changing world (relative to, say, a computer science or IT professional who had to strive more to stay current with changing languages and new tools). In order to stay competitive, it would behoove traditional statistical modelers to proactively plunge into professional development mode and take a page from the book of our machine-learning friends.
Specifically, the best-in-class analytical organizations of the future will be those that embrace traditional statistical modeling and machine learning approaches along with established and emerging tools and technology associated with Big Data analytics, including R, Hadoop/HDFS, MapReduce, Java/Python, Pig, Hive, etc.
The times they are a changin’….
Sameer Chopra (Sameer.Chopra@orbitz.com) is vice president of Advanced Analytics at Orbitz Worldwide, Inc., a leading global online travel company. He has more than 15 years of experience in applying data mining and predictive analytics across various business domains at both Fortune 500 firms and startups. Before joining Orbitz, Chopra led the Marketing Analytics and Web testing team at Intuit’s Small Business Group and served as director of analytics at eBay. He holds a master’s degree in Operations Research from the Massachusetts Institute of Technology.