The curse of dimensionality is a well-known concept in machine learning: adding more features must be matched by more data, or the model may fail to generalize. The usual advice is therefore to avoid adding features unless you are sure they help, and to be wary of feature explosion, for example when one-hot encoding categorical features with pd.get_dummies.
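To make the feature-explosion point concrete, here is a minimal sketch (the column names and values are illustrative, not from the post) of how pd.get_dummies multiplies the column count:

```python
import pandas as pd

# A tiny frame with two categorical columns (hypothetical example data).
df = pd.DataFrame({
    "city": ["NY", "SF", "LA", "SF"],
    "plan": ["free", "pro", "free", "pro"],
})

# One-hot encoding: each category level becomes its own column.
encoded = pd.get_dummies(df, columns=["city", "plan"])
print(encoded.columns.tolist())
# 2 original columns become 5 dummy columns (3 cities + 2 plans)
```

With high-cardinality categoricals (say, a zip-code column), the same call can easily produce thousands of columns, which is exactly the explosion the conventional advice warns about.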
Jeremy Howard of fast.ai disagrees, and puts it bluntly:
“The curse of dimensionality is a stupid concept.” – Jeremy Howard
This runs counter to what most of us have taken for granted about feature engineering.
After watching the video and thinking it over, I arrived at this intuition, specific to random forests: if a feature carries information, it will be used for splitting; if it is just noise, no split will be made on it. A few splits on useless features may still happen by coincidence. Those splits are wasted, so the tree may need to grow deeper to compensate. Adding more trees then averages away the noise introduced by useless features, since a larger ensemble generalizes better.
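This intuition is easy to check on synthetic data. The sketch below (my own illustration, not from the post) trains a forest on a dataset where only the first 5 of 25 features are informative, then compares the impurity-based importances of the informative block against the noise block:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 informative features followed by 20 pure-noise
# features (shuffle=False keeps the informative ones first).
X, y = make_classification(
    n_samples=2000, n_features=25, n_informative=5,
    n_redundant=0, shuffle=False, random_state=0,
)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances sum to 1; the informative block should dominate.
informative = rf.feature_importances_[:5].sum()
noise = rf.feature_importances_[5:].sum()
print(f"informative: {informative:.2f}, noise: {noise:.2f}")
```

In runs like this the informative features capture the large majority of the total importance, while the 20 noise features split the small remainder, which is what the "no split on noise, except by coincidence" argument predicts.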
So, all in all: create as many features as you want, and tune n_estimators, max_depth, and min_samples_split. Bear in mind, though, that more features also means more computation per tree, and hence longer training times.
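A tuning pass over those three hyperparameters can be sketched with scikit-learn's GridSearchCV (the grid values and dataset here are illustrative placeholders, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder dataset; substitute your own feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Cross-validated search over the three hyperparameters named above.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [50, 100],
        "max_depth": [None, 10],
        "min_samples_split": [2, 10],
    },
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```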
P.S. Morningstar Quantitative Fund Rating is powered by Random Forests.