Machine learning algorithms are designed to detect and describe complex patterns in massive datasets.
By taking the guesswork out of constructing these analytical tools, automated machine learning (AutoML) aims to put them in the hands of anyone interested in large-scale data research.
These tools are known as "computational analysis pipelines."
While there is still a lot of work to be done in automated
machine learning, early achievements show that it will be an important tool in
the arsenal of computer and data scientists.
It will be critical to tailor these software packages to novice users, enabling them to undertake difficult machine learning tasks in a user-friendly way while still allowing for the integration of domain-specific knowledge, model interpretation, and action.
These latter objectives have received less attention, but they will need to be addressed in future research before AutoML can tackle complicated real-world problems.
Automated machine learning is a relatively young field of
research that has risen in popularity in the past ten years as a consequence of
the widespread availability of strong open-source machine learning frameworks
and high-performance computers.
AutoML software packages are currently available in both
open-source and commercial versions.
Many of these packages allow for the exploration of machine learning pipelines, which can include feature transformation algorithms such as discretization (which converts continuous variables into a set of discrete bins or categories), feature engineering algorithms such as principal component analysis (which discards dimensions of "less important" data while keeping a smaller set of "more important" derived variables), and so on.
Bayesian optimization, ensemble techniques, and genetic
programming are examples of stochastic search strategies utilized in AutoML.
Stochastic search techniques can be applied to deterministic problems that contain random noise, as well as to deterministic problems into which randomness has been deliberately injected.
New methods for extracting "signal from noise" in
datasets, as well as finding insights and making predictions, are currently
being developed and tested.
One of the difficulties with machine learning is that each
algorithm examines data in a unique manner.
That is, each algorithm recognizes and classifies various
patterns.
Linear support vector machines are excellent at detecting linear patterns, whereas k-nearest neighbor methods are effective at detecting nonlinear patterns.
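A small, assumed illustration of this difference: on a dataset with a deliberately nonlinear boundary, a linear support vector machine struggles while k-nearest neighbors does well. The dataset and settings below are chosen only to make the contrast visible.

```python
# Different algorithms detect different patterns: compare a linear SVM and
# k-nearest neighbors on a nonlinear dataset (illustrative setup).
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Two interleaving half-circles: a deliberately nonlinear decision boundary.
X, y = make_moons(n_samples=500, noise=0.25, random_state=0)

linear_svm = LinearSVC(max_iter=10000)
knn = KNeighborsClassifier(n_neighbors=5)

print("Linear SVM accuracy:", cross_val_score(linear_svm, X, y, cv=5).mean())
print("k-NN accuracy:", cross_val_score(knn, X, y, cv=5).mean())
```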
The problem is that scientists don't know which algorithm(s) to employ when they begin their work, since they don't yet know what patterns they're looking for in the data.
The majority of users select an algorithm that they are
acquainted with or that seems to operate well across a variety of datasets.
Some people may choose an algorithm because the models it
generates are simple to compare.
There are a variety of reasons why various algorithms are
used for data analysis.
Nonetheless, the approach selected may not be optimal for a
particular data set.
This task is especially tough for a new user who may not be aware of the strengths and weaknesses of each algorithm.
A grid search is one way to address this issue.
Multiple machine learning algorithms and parameter settings
are applied to a dataset in a systematic manner, with the results compared to
determine which approach is the best.
This is a common strategy that can produce good results.
The grid search's drawback is that it may be computationally
demanding when a large number of methods, each with several parameter values,
need to be examined.
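The sketch below shows what a grid search looks like in practice, using scikit-learn's GridSearchCV; the algorithm and the parameter values are assumptions chosen to keep the example small.

```python
# A small grid search: every combination of the listed parameter values is
# fitted and compared via cross-validation (illustrative algorithm and grid).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

# 5 x 2 = 10 candidate settings, each evaluated with 5-fold cross-validation,
# so 50 model fits in total; the cost grows multiplicatively with every
# additional parameter, which is the drawback noted above.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```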
Random forests are classification algorithms composed of numerous decision trees, with a number of commonly used parameters that must be fine-tuned for best results on a specific dataset.
In standard machine learning practice, these parameters are configuration variables that adjust how the algorithm fits the data.
A typical parameter is the maximum number of features that may be considered in each of the decision trees that are constructed and evaluated.
Automated machine learning can help manage the complicated, computationally costly combinatorial explosion that occurs when such analyses are carried out.
A single parameter might have ten distinct settings, for example.
Another parameter might be the number of decision trees to include in the forest, which could also take ten possible values.
The minimum number of samples permitted in the "leaves" of the decision trees might likewise have ten possible values.
Considering just these three parameters, this example yields 1,000 distinct parameter configurations.
A data scientist looking at ten different machine learning methods, each with 1,000 different parameter configurations, would have to undertake 10,000 different analyses.
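Counting the configurations in this example makes the explosion concrete. Only the counts (ten values for each of three random forest parameters, and ten candidate algorithms) come from the text; the specific values below are illustrative assumptions.

```python
# Counting the parameter configurations from the random forest example.
from itertools import product

max_features = list(range(1, 11))                          # 10 possible values
n_estimators = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]   # 10 possible values
min_samples_leaf = list(range(1, 11))                      # 10 possible values

configurations = list(product(max_features, n_estimators, min_samples_leaf))
print(len(configurations))        # 10 * 10 * 10 = 1,000 configurations

# Ten candidate algorithms, each with a grid of this size:
print(10 * len(configurations))   # 10,000 distinct analyses
```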
Hyperparameters, which are characteristics of the analyses that are established ahead of time and hence not learned from the data, are added on top of these analyses.
They are often set by the data scientist using a variety of rules of thumb or values drawn from previous problems.
Comparisons of numerous alternative cross-validation procedures
or the influence of sample size on findings are examples of hyperparameter
setups.
Hundreds of hyperparameter combinations may need to be
assessed in a typical case.
With 10,000 algorithm-and-parameter analyses and on the order of a hundred hyperparameter settings for each, the data scientist would have to execute a total of one million analyses spanning machine learning algorithms, parameter settings, and hyperparameter settings.
Depending on the sample size of the data to be examined, the number of features, and the kinds of machine learning algorithms used, so many distinct analyses might be prohibitive given the computing resources available to the user.
Using a stochastic search to approximate the optimum mix of
machine learning algorithms, parameter settings, and hyperparameter settings is
an alternate technique.
Until a computational limit is reached, a random number
generator is employed to sample from all potential possibilities.
Before making a final decision, the user manually explores
various parameter and hyperparameter settings around the optimal technique.
This has the virtue of being computationally manageable, but it has the disadvantage of being stochastic: chance may mean that the best combinations are never explored.
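A toy version of this random-search strategy is sketched below: algorithm-and-parameter combinations are drawn at random and scored by cross-validation until a fixed budget is exhausted, with no guarantee that the best combinations are ever visited. The candidate algorithms, ranges, and budget are assumptions for illustration, not a prescribed AutoML interface.

```python
# Random search over algorithms and parameters until a computational limit
# is reached (illustrative candidates, ranges, and budget).
import random

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = random.Random(0)

def sample_candidate():
    """Draw one algorithm together with a random parameter setting."""
    if rng.random() < 0.5:
        return RandomForestClassifier(
            n_estimators=rng.choice([10, 50, 100]),
            min_samples_leaf=rng.randint(1, 10),
            random_state=0,
        )
    return KNeighborsClassifier(n_neighbors=rng.randint(1, 15))

budget = 20                        # the computational limit
best_score, best_model = -1.0, None
for _ in range(budget):
    model = sample_candidate()
    score = cross_val_score(model, X, y, cv=3).mean()
    if score > best_score:
        best_score, best_model = score, model

print(best_model, best_score)
```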
To address this, a stochastic search algorithm with a heuristic
element—a practical technique, guide, or rule—may be created that can
adaptively explore algorithms and settings while improving over time.
Because they automate the search for optimum machine
learning algorithms and parameters, approaches that combine stochastic searches
with heuristics are referred to as automated machine learning.
A stochastic search could begin by creating a variety of
machine learning algorithm, parameter setting, and hyperparameter setting
combinations at random and then evaluate each one using cross-validation, a
method for evaluating the effectiveness of a machine learning model.
The best of these is chosen, modified at random, and
assessed once again.
This procedure is continued until a computational limit or a
performance goal has been met.
This stochastic search is guided by the heuristic algorithm.
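A toy sketch of that loop, under the same illustrative assumptions as above (a single algorithm family, two parameters, and simple mutation rules chosen only to show the shape of the procedure):

```python
# Heuristic stochastic search: random initial candidates, cross-validated
# scoring, then repeated "keep the best, mutate, re-evaluate" until a limit.
import random

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rng = random.Random(0)

def evaluate(params):
    """Score one parameter setting with cross-validation."""
    model = RandomForestClassifier(random_state=0, **params)
    return cross_val_score(model, X, y, cv=3).mean()

def mutate(params):
    """Randomly perturb one parameter of the current best candidate."""
    new = dict(params)
    key = rng.choice(list(new))
    if key == "n_estimators":
        new[key] = max(10, new[key] + rng.choice([-20, 20]))
    else:  # min_samples_leaf
        new[key] = max(1, new[key] + rng.choice([-1, 1]))
    return new

# Start from a handful of random candidates and keep the best one.
candidates = [
    {"n_estimators": rng.choice([10, 50, 100]),
     "min_samples_leaf": rng.randint(1, 10)}
    for _ in range(5)
]
scored = [(evaluate(p), p) for p in candidates]
best_score, best = max(scored, key=lambda sp: sp[0])

for _ in range(15):                   # computational limit
    challenger = mutate(best)
    score = evaluate(challenger)
    if score > best_score:            # keep whichever does better
        best_score, best = score, challenger

print(best, best_score)
```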
Optimal search strategy development is a hot topic in
academia right now.
There are various benefits to using AutoML.
To begin with, it has the potential to be more computationally
efficient than the exhaustive grid search method.
Second, it makes machine learning more accessible by
removing some of the guesswork involved in choosing the best machine learning
algorithm and its many parameters for a particular dataset.
This allows even the most inexperienced user to benefit from
machine learning.
Third, if generalizability measures are incorporated into the heuristic being used, it may produce more reproducible results.
Fourth, incorporating complexity metrics into the heuristic might yield more interpretable results.
Fifth, if expert knowledge is incorporated into the heuristic, it may produce more actionable findings.
AutoML techniques do, however, present certain difficulties.
The first is the risk of overfitting, which occurs when numerous distinct methods are evaluated, resulting in an analysis that matches the existing data too closely but does not fit or predict unseen or new data.
The more analytical techniques that are tried on a dataset, the more likely the resulting model is to learn the data's noise, producing a model that is hard to generalize to new data.
With any AutoML technique, this must be thoroughly handled.
Second, AutoML is computationally demanding in and of
itself.
Third, AutoML approaches may create very complicated
pipelines including several machine learning algorithms.
This may make interpretation considerably more challenging
than just selecting a single analytic method.
Fourth, this is a very new field.
Despite some promising early instances, ideal AutoML solutions
may not have yet been devised.
~ Jai Krishna Ponnappan
See also: Deep Learning.
Further Reading
Feurer, Matthias, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. “Efficient and Robust Automated Machine Learning.” In Advances in Neural Information Processing Systems, 28. Montreal, Canada: Neural Information Processing Systems. http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.
Hutter, Frank, Lars Kotthoff, and Joaquin Vanschoren, eds. 2019. Automated Machine Learning: Methods, Systems, Challenges. New York: Springer.