Optimizing random forests: Spark implementations of random genetic forests
Abstract
The Random Forest (RF) algorithm, originally proposed by Breiman et al. (1), is a widely used machine learning
algorithm valued for its fast learning speed and high classification accuracy. Despite its widespread use, the
mechanisms at work in Breiman’s RF are not yet fully understood, and several aspects of optimizing the algorithm
remain under active research, especially in big data environments. This work builds new ensembles that optimize
the random portions of the RF algorithm
using genetic algorithms, yielding Random Genetic Forests (RGF), Negatively Correlated RGF (NC-RGF), and
Preemptive RGF (PFS-RGF). These ensembles are compared with Breiman’s classic RF algorithm in Hadoop’s
big data framework using Spark on a large, high-dimensional network intrusion dataset, UNSW-NB15.
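
As a rough illustration of replacing a random choice in RF with a genetic search, the sketch below evolves a feature-subset bitmask instead of drawing the subset at random. It is not the RGF, NC-RGF, or PFS-RGF procedure: the abstract does not specify the encoding, fitness, or operators, so the bitmask encoding, truncation selection, one-point crossover, and the nearest-centroid stand-in for a tree learner are all assumptions. The function names (`make_data`, `fitness`, `evolve`) are hypothetical, the data is synthetic rather than UNSW-NB15, and the code runs on a single machine rather than on Spark.

```python
import random

def make_data(rng, n=200, n_features=8, informative=(0, 1)):
    """Synthetic two-class data; only the `informative` columns carry signal."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in informative:
            row[j] += 3.0 * label  # shift informative features for class 1
        X.append(row)
        y.append(label)
    return X, y

def fitness(mask, X, y):
    """Training accuracy of a nearest-centroid classifier on the masked features.

    Assumption: this cheap classifier stands in for a decision tree.
    """
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    centroids = {}
    for c in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, lab in zip(X, y):
        dist = {c: sum((x[j] - m) ** 2 for j, m in zip(cols, centroids[c]))
                for c in (0, 1)}
        correct += min(dist, key=dist.get) == lab
    return correct / len(y)

def evolve(X, y, rng, n_features, pop_size=20, generations=15, mutation_rate=0.1):
    """Evolve a feature bitmask with elitism, truncation selection,
    one-point crossover, and bit-flip mutation."""
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda m: fitness(m, X, y), reverse=True)
        parents = ranked[: pop_size // 2]   # keep the fitter half as parents
        children = [ranked[0][:]]           # elitism: carry the best mask over
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]       # one-point crossover
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = children
    return max(pop, key=lambda m: fitness(m, X, y))

if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_data(rng)
    best = evolve(X, y, rng, n_features=8)
    print("selected features:", [j for j, bit in enumerate(best) if bit])
    print("training accuracy:", fitness(best, X, y))
```

In an ensemble setting, one such search could be run per tree (or per node) to choose the features that Breiman's RF would otherwise sample uniformly; how the ensembles compared here actually do this is described in the full article, not in this sketch.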