November 9-12, 2014
2014 INFORMS Annual Meeting
San Francisco, CA
Special ArticlesIBM announces $3 billion research initiative to tackle chip challenges
IBM recently announced it is investing $3 billion over the next five years in two broad research and early stage development programs to push the limits of chip technology needed to meet the emerging demands of cloud computing and big data systems. These investments will push IBM’s semiconductor innovations from today’s breakthroughs into the advanced technology leadership required for the future.Read More
Special ArticlesWSC 2014: Exploring big data through simulation
The Winter Simulation Conference (WSC) has been the premier international forum for disseminating recent advances in the field of system simulation for more than 40 years, with the principal focus being discrete-event simulation and combined discrete-continuous simulation. In addition to a technical program of unsurpassed scope and high quality, WSC provides the central meeting place for simulation researchers, practitioners and vendors working in all disciplines and in industrial, governmental, military, service and academic sectors. WSC 2014 will be held Dec. 7-10 in Savannah, Ga., at the Westin Savannah Harbor Golf Resort & Spa and the adjacent Savannah International Trade & Convention Center.Read More
Special ArticlesINFORMS Annual Meeting in S.F. to bridge data, decisions
The 2014 INFORMS Annual Meeting in San Francisco, set for Nov. 9-12 and whose theme is “Bridging Data and Decisions,” promises to be one of the largest ever, with more than 5,000 technical presentations. Whether you are interested in pressing societal needs, including healthcare, energy and climate change, new developments in supply chain management and logistics, cutting-edge methodologies for optimization, or advances in stochastic processes and risk analysis, you will find hundreds of presentations to match your interests.Read More
The times they are a changin’ for advanced analytics
Statistical modelers urged to embrace machine learning, open-source tools for the road ahead.
By Sameer Chopra
My thesis below addresses the following points:
- While statistical modeling is not going away, analytics groups are advised to leverage machine-learning approaches as well.
- While traditional statistical modeling software packages are not going away, analytics groups need to actively embrace new skill-sets in emerging software such as open-source tools (e.g., R, MangoDB) and Big Data tools (e.g., Hadoop). Big Data is just getting bigger, and new tools are emerging that round out the tool suite of analytics groups.
Statistical Modeling vs. Machine Learning
Since the mid-1990s I have used statistical modeling tools such as SAS as the primary tool for advanced analytics. I would place myself squarely in the camp of “statistical modelers” (vs. my machine learning friends – though I realize some might quibble with this distinction). Over the years I have led teams of statistical analysts who have primarily used such statistical packages as SAS/SPSS/S-Plus, etc. as their go-to analysis tool.
In my current capacity, I am responsible for advanced analytics at Orbitz Worldwide. Advanced analytics is a strategic lever at Orbitz and has the good fortune of executive support at the highest levels. Competing on analytics is feasible only if there is buy-in at the highest levels.
I lead the traditional statistical modelers as well as the chief scientist and the machine learning (ML) crew. At Orbitz, we have found value in incorporating both types of data mining professionals (machine learners and statistical modelers) because many problems are well-suited for both camps. For example, the statistical modelers effectively address areas such as marketing mix analysis, predictive models across online marketing channels, customer lifetime value models, churn models, credit card fraud models, etc. Similarly, the machine learning staff deploys their algorithms in areas leveraging Big Data, where system feedback is leveraged to quickly learn from patterns in order to self-improve – areas such as the Hotel Recommendation Engine and Hotel Sort on the Orbitz Web site.
Conceptually, both camps are “data mining” professionals, so there is a lot of overlap. For instance, both fields do work with some common methods such as decision trees and clustering algorithms. I also find that the camps often use different jargon for the same basic concepts (“weights” vs. “parameters,” “learning” vs. “fitting,” etc.).
However, I find the machine learning area to clearly be of a different cloth – the contrast in tools and approaches between ML and statistical modelers is rather stark. The following are but a few examples to illustrate some differences between the two sides:
- Apart from cosmetic differences in labels used, statistical modeling has a probabilistic approach with a strong emphasis on parametric assumptions, regression diagnostics, inference, hypothesis testing, interpretability of model and so on – areas not important in the ML world.
- On the flip side, ML practitioners regularly use tools such as support vector machines (SVM), tools that are not commonly used by statistical modelers. ML focuses on predictive accuracy and not much on interpretation of models. Note that ML has its roots in artificial intelligence (AI), and practitioners of machine learning usually tend to have a strong computer science background – another key difference.
The comparison sparked the following question: “Which side of this analytics fence lends itself better to the road ahead?” My (likely controversial) response: “At this point in time, machine learning!” In fact, never before has the need for this been as forceful and urgent as it is today. I am not implying that statistical modeling is going away, but I am stating that machine learning is rapidly increasing in relevance and prominence. It makes sense for analytical teams to complement their skill sets by incorporating machine-learning approaches in order to be better positioned for the road ahead.
Not surprisingly, general interest in machine learning has exploded in the past year. Late last year, Stanford University offered a free online course in ML/AI that went viral to the point of having well over 100,000 students register from around the world in a matter of weeks! (This speaks to both the growing interest in ML as well as to a fundamental paradigm shift in the making vis-à-vis the educational method/framework.)
Big Data & Open Source Analytics
Machine learning lends itself well to situations where the design and development of algorithms is against high dimensional data where computational issues are very important – and the Big Data paradigm shift, along with open source tools, is ideally suited for ML to leverage.
The open source language R has become the data-mining tool of choice for machine learners for the following reasons:
- R has very good integration with Hadoop, an area where established commercial statistical tools have frankly been playing catch-up over the past year. (Note: At the time of this writing, some established statistical solution providers were announcing an access interface to Hadoop.)
- Many startups and smaller firms do not have deep pockets and are embracing open source tools such as the R programming language and NoSQL database systems such as MangoDB.
- R is a leading language for developing new statistical methods, and it is a platform for statistical innovation and collaboration across both the corporate world and academia. In my opinion, for the first time in years, the stronghold of established commercial players seems to be potentially threatened; open source tools are better suited for Big Data and will slowly but surely continue to take share away from commercialized statistical packages. In fact, traditional statistical vendors have recognized that R is a force to be reckoned with. In response, many of these vendors have developed hooks into R so users can interface with the R language.
- Based on the resumes I’ve been reading, the next generation of data miners is flocking to R as their go-to tool. Professors in general are comfortable with R; they tend to use R and Excel as part of their curriculum.
- In short, open-source analytics tools and platforms have arrived.
R hasn’t been widely adopted in the corporate world because it used to be considered (and still is to a large extent) not quite “enterprise ready,” but even that is changing as firms such as Revolution Analytics focus on the enterprise capabilities for R.
Despite some hype associated with the topic of Big Data, it is generally acknowledged that Big Data and Distributed Computing are rapidly changing the analytics landscape. Leveraging Hadoop and being well-versed in MapReduce jobs is quickly transitioning from a “nice to know” to a “must do” skill. Here again, machine learning practitioners seamlessly tend to adapt, whereas many traditional statistical modelers seem to face a “who moved my cheese” syndrome. Prerequisites such as being well-versed in Python or Java tend to be second nature to those in the ML camp.
What does this mean for today’s traditional statistical modelers?
Gone are the days when a statistical analyst might have been complacent about a relatively slowly changing world (relative to say a computer science or IT professional who had to strive more to stay current with changing languages and new tools). In order to stay competitive, it would behoove traditional statistical modelers to proactively plunge into professional development mode and take a page from the book of our machine-learning friends.
Specifically, the best-in-class analytical organizations of the future will be those that embrace traditional statistical modeling and machine learning approaches along with established and emerging tools and technology associated with Big Data analytics, including R, Hadoop/HDFS, Map Reduce, Java/Python, Pig, Hive, etc.
The times they are a changin’….
Sameer Chopra (Sameer.Chopra@orbitz.com) is vice president of Advanced Analytics at Orbitz Worldwide, Inc., a leading global online travel company. He has more than 15 years of experience in applying data mining and predictive analytics across various business domains at both Fortune 500 firms and startups. Before joining Orbitz, Chopra led the Marketing Analytics and Web testing team at Intuit’s Small Business Group and served as director of analytics at eBay. He holds a master’s degree in Operations Research from the Massachusetts Institute of Technology.