Feature selection, also called variable selection or attribute selection, is the process of selecting a subset of relevant features for use in model construction.
Feature selection is different from dimensionality reduction. Both seek to reduce the number of features in the dataset, but dimensionality reduction methods do so by creating new combinations of features, whereas feature selection methods include and exclude features present in the data without changing them.
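The difference can be made concrete with a small sketch. The data, the function names `select_features` and `combine_features`, and the weights below are all made up for illustration; real dimensionality reduction methods learn their weights from the data rather than taking them as input.

```python
# Toy dataset: 3 samples, 3 features (values invented for illustration).
X = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
]

def select_features(rows, keep):
    """Feature selection: keep a subset of the original columns, values unchanged."""
    return [[row[j] for j in keep] for row in rows]

def combine_features(rows, weights):
    """Linear dimensionality reduction: every new feature is a weighted sum of
    the original features, so the original columns no longer appear as-is."""
    return [[sum(w * x for w, x in zip(ws, row)) for ws in weights] for row in rows]

# Keep columns 0 and 2 exactly as they are.
selected = select_features(X, keep=[0, 2])

# Build 2 new features, each a mix of two original columns (hypothetical weights).
reduced = combine_features(X, weights=[[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]])

print(selected)
print(reduced)
```

Both results have two columns, but only `selected` still contains values that can be read directly as the original features.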

Problems that feature selection solves

Feature selection methods help by choosing features that give you as good or better accuracy while requiring less data. They can be used to identify and remove irrelevant and redundant attributes that do not contribute to the accuracy of a predictive model.
Fewer features are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.

What is ANOVA (Analysis of Variance)?

Analysis of Variance (ANOVA) is a statistical method used to check whether the means of two or more groups are significantly different from each other. Differences between group means are inferred by analyzing variances.
ANOVA uses a variance-based F-test, also called an omnibus test, to check the equality of group means. The F statistic is the ratio of the between-group variance to the within-group variance. The F-test tests a non-specific null hypothesis, i.e. that all group means are equal; rejecting it tells us that at least one group mean differs, but not which one.
Main types: one-way and two-way ANOVA (a "way" or factor is an independent variable).
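The ratio described above can be written out explicitly for the one-way case. The notation is an assumption introduced here for illustration: k groups, n_i observations in group i, N observations in total, y_ij the j-th observation in group i, \bar{y}_i the mean of group i, and \bar{y} the grand mean.

```latex
F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}}
  = \frac{\displaystyle \sum_{i=1}^{k} n_i \,(\bar{y}_i - \bar{y})^2 \,/\, (k-1)}
         {\displaystyle \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \,/\, (N-k)}
```

When the group means are far apart relative to the spread inside each group, the numerator dominates and F is large, which is evidence against the null hypothesis of equal means.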

  • One Way ANOVA
    • One-way ANOVA tests the relationship between a categorical predictor and a continuous response. Here we check whether the mean of the continuous response is equal across the groups defined by the categorical feature. If the group means are equal, the feature has no measurable impact on the response and need not be considered for model training.
  • Two Way ANOVA
    • If we have two predictors, we use two-way ANOVA.
    • Example: a dataset has two factors (independent variables), genotype and year, and the response is plant yield. Genotype and year have six and three levels respectively. Because this experimental design has two factors to evaluate, two-way ANOVA is a suitable method of analysis. Using it, we can simultaneously evaluate how genotype and year affect the yield of the plants.
    • From two-way ANOVA, we can test three hypotheses:
    • 1 effect of genotype on yield.
    • 2 effect of time (years) on yield.
    • 3 effect of the genotype and time (years) interaction on yield.
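The three hypotheses of the genotype and year example can be illustrated with a minimal sketch. It assumes a balanced design (the same number of replicates in every genotype and year combination) and uses a smaller layout than the example above, two genotypes by three years, with yield values invented purely to show the arithmetic. The function name `two_way_anova` is hypothetical.

```python
from statistics import mean

# Hypothetical yields: 2 genotypes x 3 years, 2 replicates per cell (balanced).
yields = {
    ("A", 1): [10.0, 12.0], ("A", 2): [14.0, 16.0], ("A", 3): [18.0, 20.0],
    ("B", 1): [11.0, 13.0], ("B", 2): [19.0, 21.0], ("B", 3): [27.0, 29.0],
}

def two_way_anova(cells):
    """F statistics for a balanced two-factor design with replication.

    `cells` maps (level_of_factor_a, level_of_factor_b) to a list of responses.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    na, nb = len(a_levels), len(b_levels)
    r = len(next(iter(cells.values())))  # replicates per cell
    if len(cells) != na * nb or r < 2 or any(len(v) != r for v in cells.values()):
        raise ValueError("sketch needs every cell present with the same r >= 2 replicates")

    grand = mean(y for v in cells.values() for y in v)
    a_mean = {a: mean(y for b in b_levels for y in cells[(a, b)]) for a in a_levels}
    b_mean = {b: mean(y for a in a_levels for y in cells[(a, b)]) for b in b_levels}
    cell_mean = {key: mean(v) for key, v in cells.items()}

    # Sums of squares for each main effect, the interaction, and the error.
    ss_a = nb * r * sum((a_mean[a] - grand) ** 2 for a in a_levels)
    ss_b = na * r * sum((b_mean[b] - grand) ** 2 for b in b_levels)
    ss_ab = r * sum((cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
                    for a in a_levels for b in b_levels)
    ss_err = sum((y - cell_mean[key]) ** 2 for key, v in cells.items() for y in v)

    df_a, df_b = na - 1, nb - 1
    df_ab, df_err = df_a * df_b, na * nb * (r - 1)
    ms_err = ss_err / df_err

    # One F statistic per hypothesis: each mean square divided by the error mean square.
    return {
        "genotype": (ss_a / df_a) / ms_err,
        "year": (ss_b / df_b) / ms_err,
        "genotype:year": (ss_ab / df_ab) / ms_err,
    }

f_values = two_way_anova(yields)
for effect, f in f_values.items():
    print(effect, round(f, 3))
```

Each F value would then be compared against an F distribution with the matching degrees of freedom to obtain a p value. Unbalanced designs need a more general approach than this sketch.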

ANOVA Assumptions

  • Assumption of normality:
    Residuals are approximately normally distributed. Non-normality of the dependent variable can cause non-normality of residuals in ANOVA.
  • Homogeneity of variance:
    Variances of the dependent variable are roughly equal between treatment groups (this can be tested using Levene’s or Bartlett’s test).
  • Assumption of independence:
    Observations are sampled independently of each other (no relation between observations within or between groups), i.e. each subject should have only one response.
  • The dependent variable should be continuous. If the dependent variable is ordinal or ranked, it is more likely to violate the assumptions of normality and homogeneity of variances. If these assumptions are violated, consider non-parametric tests (e.g. the Mann-Whitney U test for two groups, or the Kruskal-Wallis test for more than two).
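As an illustration of how the homogeneity-of-variance assumption can be checked, here is a minimal sketch of a Levene-type statistic. It is simply a one-way ANOVA F value computed on the absolute deviations of each observation from its group center; using the median as the center gives the Brown-Forsythe variant. The function name `levene_statistic` and the sample data are made up for illustration. In practice a library routine such as `scipy.stats.levene` would also return the p value.

```python
from statistics import mean, median

def levene_statistic(groups, center=median):
    """Levene-type statistic W for a list of groups of numbers.

    W is a one-way ANOVA F value computed on |y - center(group)|.
    Large values suggest the group variances are not equal.
    """
    # Absolute deviation of every observation from its own group's center.
    z = []
    for g in groups:
        c = center(g)
        z.append([abs(y - c) for y in g])

    k = len(z)
    n_total = sum(len(g) for g in z)
    z_grand = mean(v for g in z for v in g)

    between = sum(len(g) * (mean(g) - z_grand) ** 2 for g in z)
    within = sum((v - mean(g)) ** 2 for g in z for v in g)
    return ((n_total - k) / (k - 1)) * between / within

# Hypothetical groups: same center, different spread.
narrow = [1.0, 2.0, 3.0]
wide = [0.0, 2.0, 4.0]
w = levene_statistic([narrow, wide])
print(round(w, 3))
```

Under the null hypothesis of equal variances, W is approximately F-distributed with k - 1 and N - k degrees of freedom, where k is the number of groups and N the total number of observations.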

How ANOVA works

  1. Check the sample sizes: ideally there is an equal number of observations in each group.
  2. Calculate the between-group mean square (MS of group).
  3. Calculate the within-group mean square, i.e. the mean square error (MSE).
  4. Calculate the F value (MS of group / MSE).
  5. Calculate the p value from the F value and the degrees of freedom.
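Steps 2 to 4 above can be sketched in plain Python and applied to feature selection by computing one F value per candidate categorical feature. The helper names `group_response` and `one_way_anova_f`, the feature names, and the data are hypothetical. Step 5 needs the F distribution, which the standard library does not provide; if SciPy is available, `scipy.stats.f.sf(f, df_between, df_within)` gives the p value.

```python
from statistics import mean

def group_response(feature, response):
    """Split the continuous response into groups, one per level of the feature."""
    groups = {}
    for level, y in zip(feature, response):
        groups.setdefault(level, []).append(y)
    return list(groups.values())

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(y for g in groups for y in g)

    # Step 2: between-group mean square.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    df_between = k - 1
    ms_between = ss_between / df_between

    # Step 3: within-group mean square (MSE).
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    df_within = n_total - k
    mse = ss_within / df_within

    # Step 4: F value.
    return ms_between / mse, df_between, df_within

# Hypothetical data: one continuous response and two categorical features.
response = [5.0, 6.0, 7.0, 15.0, 16.0, 17.0]
feature_a = ["x", "x", "x", "y", "y", "y"]
feature_b = ["p", "q", "p", "q", "p", "q"]

f_a, df1_a, df2_a = one_way_anova_f(group_response(feature_a, response))
f_b, df1_b, df2_b = one_way_anova_f(group_response(feature_b, response))
print("feature_a F:", round(f_a, 3))
print("feature_b F:", round(f_b, 3))
```

Here `feature_a` splits the response into two clearly separated groups, so its F value is far larger than that of `feature_b`, whose groups overlap. Ranking features by F value (or by p value) is one simple way to decide which categorical features to keep.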
