Exploratory analysis has different meanings depending on the context in which it
is used. For instance, the terms "Exploratory analysis" and "Confirmatory analysis"
are frequently associated with a statistical technique known as factor analysis.
Don't be misled by this common association! When we refer to exploratory
analysis, we are referring to a process used to select combinations
of variables that are likely to be useful in creating a model
that closely reproduces the actual (sales) data.
Here is the problem:
- There are hundreds of variables that might combine to
produce a model that closely reproduces the actual (sales) data.
Variables include weather-related data, time trends,
holiday proximity, day-of-week variables, and 2-way, 3-way, and 4-way
interactions among all of the above (to name just a few!).
- Each variable may be correlated with many other variables. For instance, wind
speed is correlated with wind gust speed, and may also be correlated
with rain and warmer or cooler
temperatures.
- At this point, some statisticians may be thinking,
"Use structural equation modeling or factor analysis to extract
a measure of 'bad weather'." However, this approach
is not appropriate, because...
- While there are hundreds of possible predictor variables,
the number of data points available for analysis is often small (usually in
the hundreds). Therefore, it is of critical importance
that we avoid simply capitalizing on chance. In other words, models
must be logical, meaningful, and likely to contain actual
predictor variables (as opposed to containing a mishmash of
variables that happen to be correlated with sales by chance).
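To make "capitalizing on chance" concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the "sales" series and every "predictor" are pure random noise, and the function name max_chance_correlation is hypothetical. It simply shows that the best-looking correlation tends to grow as more candidate variables are tested against the same small data set.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def max_chance_correlation(n_points, n_predictors, seed=0):
    """Largest |r| between random 'sales' and purely random predictors."""
    rng = random.Random(seed)
    # Assumption: sales are pure noise, so NO predictor is genuinely related.
    sales = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_predictors):
        noise = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(noise, sales)))
    return best

few = max_chance_correlation(n_points=100, n_predictors=5)
many = max_chance_correlation(n_points=100, n_predictors=500)
print(f"best |r| among   5 random predictors: {few:.2f}")
print(f"best |r| among 500 random predictors: {many:.2f}")
```

With only 100 data points, the "winner" among 500 meaningless predictors can look respectably correlated with sales, which is exactly the trap described above.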
Our Solution:
- The exploratory process we use relies on a large amount of
computational power and the computer's ability to quickly
generate semi-random and random permutations of variable subsets
from the full list of possible variables (see diagram below).
- The process we use
is considered intellectual property of BrainTech, LLC and cannot be disclosed (but
it shouldn't be hard to figure out for those who understand mathematics
and statistics).
- After exploratory
analysis, we are left with a frequency distribution of subsets of variables.
- Using the principle of maximum-likelihood, we select subsets of
variables that are likely to combine to produce a model
that closely fits the actual (sales) data.
* Some individuals have suggested using correlations and residuals
to iteratively select the variables that produce a model
that best fits the data. However, this approach would be similar
to performing exploratory factor analysis with hundreds of variables
using a tiny set of data (not a good thing)!
If you're lost or confused, you're probably not alone. Here is an example and a diagram that should (somewhat) clarify the process:
- Perhaps we have the following list of variables: HighTemp, LowTemp, WindSpeed, Rain Amount, Day-of-week (DOW).
Some 2-way interactions: LowTemp*WindSpeed, LowTemp*DOW, HighTemp*DOW, HighTemp*Rain.
Some 3-way interactions: LowTemp*Rain*WindSpeed, LowTemp*Rain*DOW.
- Simplified version: After exploratory analysis, we may be left with a distribution
that looks like the image to the right.
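The example list above can be expanded mechanically, which shows how quickly the pool of candidates grows. The sketch below is only an illustration: the helper candidate_terms is hypothetical, "RainAmount" stands in for Rain Amount, and each variable is treated as a single column (a day-of-week variable coded as several indicator columns would produce even more terms).

```python
from itertools import combinations
from math import comb

# The five example variables from the list above.
base = ["HighTemp", "LowTemp", "WindSpeed", "RainAmount", "DOW"]

def candidate_terms(variables, max_order=4):
    """Return the main effects plus every interaction up to max_order."""
    terms = []
    for order in range(1, max_order + 1):
        for combo in combinations(variables, order):
            terms.append("*".join(combo))
    return terms

terms = candidate_terms(base)
print(len(terms))   # 5 main effects + 10 two-way + 10 three-way + 5 four-way = 30
print(terms[5:8])   # the first few 2-way interactions

# The number of candidates grows quickly with the number of base variables.
for n in (5, 10, 20):
    total = sum(comb(n, k) for k in range(1, 5))
    print(n, "base variables ->", total, "candidate terms")
```

Even a modest list of base variables yields far more candidate terms than there are data points, which is why the subsets are sampled rather than fitted all at once.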
After this point, we select combinations of the most frequently
selected variables and perform
the process again. However, this time the initial weighting
of each variable and the order of insertion and tuning are "seeded" (partially determined,
partially random) by the variable's frequency of occurrence. With fewer variables to consider, we build thousands of "mini-models"
that are fit to a subset of the data and validated on the remaining portion of the data (cross-validation).
Based on the frequency of selection of each variable during this process (specifically, the
algorithm looks for repeated convergence on the same set of variables),
the final output is one reasonably sized set of variables (on a good day)
that is likely to produce a relevant, logical, potentially
predictive model.
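The actual procedure is proprietary, so the following is only a rough sketch of the general idea under stated assumptions, not BrainTech's algorithm. It assumes that "seeding" can be approximated by drawing variables with probability proportional to their first-stage frequency, that each mini-model is an ordinary least-squares fit on a random 70% of the data scored on the remaining 30%, and that "convergence" means the same subset keeps appearing among the best-scoring mini-models. The data, the frequency counts, and all function names are invented for illustration.

```python
import random
from collections import Counter

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def holdout_mse(data, y, subset, train, test):
    """Fit on the training rows, then score on the held-out rows."""
    def design(i):
        return [1.0] + [data[v][i] for v in subset]  # intercept + chosen columns
    beta = fit([design(i) for i in train], [y[i] for i in train])
    errors = [(y[i] - sum(b * x for b, x in zip(beta, design(i)))) ** 2 for i in test]
    return sum(errors) / len(errors)

def refine(data, y, frequency, subset_size=2, n_models=300, keep=0.1, seed=0):
    """Count which subsets recur among the best-validating mini-models."""
    rng = random.Random(seed)
    names = list(frequency)
    weights = [frequency[v] for v in names]
    n = len(y)
    scored = []
    for _ in range(n_models):
        # "Seeded" draw: frequent first-stage variables are picked more often.
        chosen = set()
        while len(chosen) < subset_size:
            chosen.add(rng.choices(names, weights=weights)[0])
        subset = tuple(sorted(chosen))
        # Random split: fit on 70% of the rows, validate on the other 30%.
        rows = list(range(n))
        rng.shuffle(rows)
        cut = int(0.7 * n)
        scored.append((holdout_mse(data, y, subset, rows[:cut], rows[cut:]), subset))
    scored.sort()
    best = scored[: max(1, int(keep * n_models))]
    return Counter(subset for _, subset in best)

# Synthetic data (an assumption): sales depend only on HighTemp and Rain.
rng = random.Random(1)
n = 120
data = {v: [rng.gauss(0, 1) for _ in range(n)]
        for v in ("HighTemp", "LowTemp", "WindSpeed", "Rain")}
sales = [2 * h - 3 * r + rng.gauss(0, 0.5)
         for h, r in zip(data["HighTemp"], data["Rain"])]
# Hypothetical first-stage selection counts, invented for this example.
frequency = {"HighTemp": 5, "LowTemp": 3, "WindSpeed": 2, "Rain": 4}

result = refine(data, sales, frequency)
print(result.most_common(3))
```

In this toy setup the subset that truly drives the synthetic sales dominates the top-scoring mini-models. Real data is rarely this clean, which is presumably why the process is repeated many times.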
The offline version of our exploratory analysis tool repeats this process several hundred more times. (The online tool repeats the process only three times,
due to CPU constraints.) Each
repetition yields a single potential set of variables. These sets are
passed on to the next stage: Fine-Tuning Potential Models.