Feature Selection Phase

I am trying to predict the overall age of an opportunity (closing date minus creation date); this is my response variable.

Let's say an opportunity passes through 3 stages before it closes.

For example: Opp x stayed in

  • stage 1 : 30 days
  • stage 2 : 10 days
  • stage 3 : 20 days

At stage 3 I might close it on the same date or wait some time.

If I wait some time before closing, it will be createdon: 22/11/2018, closedon: 9/2/2019.

There is also opp y, which I close on the same date that stage 3 ends, so createdon: 22/11/2018 and closedon: 21/1/2019.

Summary

+---------+--------+--------+--------+--------+
| OppName | oppAge | stage1 | stage2 | stage3 |
+---------+--------+--------+--------+--------+
| x       |     79 |     30 |     10 |     20 |
| y       |     60 |     30 |     10 |     20 |
+---------+--------+--------+--------+--------+
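
The ages in the table follow directly from the dates quoted above. A minimal sketch of that arithmetic, assuming the dates are in day/month/year order; the helper name opp_age and the variable names are made up for illustration:

```python
from datetime import date

def opp_age(created_on: date, closed_on: date) -> int:
    """Age of an opportunity in days: closing date minus creation date."""
    return (closed_on - created_on).days

# Days spent in stage 1, stage 2 and stage 3 (the same for x and y)
stage_days = [30, 10, 20]

age_x = opp_age(date(2018, 11, 22), date(2019, 2, 9))   # closed some time after stage 3
age_y = opp_age(date(2018, 11, 22), date(2019, 1, 21))  # closed the day stage 3 ended

# Whatever is left after the stages is the waiting time before closing
wait_x = age_x - sum(stage_days)
wait_y = age_y - sum(stage_days)

print(age_x, age_y, wait_x, wait_y)
```

So the response is the sum of the three stage durations plus whatever waiting time comes after stage 3.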

My question is:

  1. Can I include stage1, stage2 and stage3 as independent variables in a regression model?
  2. They seem to make the model nearly ideal, so is it better to include only stage 1, without stages 2 and 3?
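
To see why the three stages together look nearly ideal, here is a rough sketch on synthetic data (not the real opportunities). It assumes the age is the sum of the stage durations plus a random waiting time, and compares the R² of a plain least-squares fit with all three stages against stage 1 alone. The value ranges and the helper r_squared are assumptions made up for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Made-up stage durations (days) and a made-up waiting time before closing
stages = rng.integers(5, 60, size=(n, 3)).astype(float)
wait = rng.integers(0, 20, size=n).astype(float)
opp_age = stages.sum(axis=1) + wait

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

r2_all = r_squared(stages, opp_age)            # stage1 + stage2 + stage3
r2_stage1 = r_squared(stages[:, :1], opp_age)  # stage1 only

print(f"all stages: {r2_all:.3f}, stage 1 only: {r2_stage1:.3f}")
```

With all three stages the only thing left for the model to explain is the waiting time, which is why the fit looks almost perfect under these assumptions.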

Work Done, Added Edits

  • I transformed the stage durations into categories: for example, 30 days became 1month, 30-60 days became 1~2months, and so on

    +-----------+
    |  stage1   |
    +-----------+
    | 1month    |
    | 1~2months |
    | 6~7months |
    +-----------+
    
  • Then I one-hot encoded the stages, for example stage 1

  • Then I stopped, because I wasn't sure whether to include everything or not
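
The binning and one-hot steps above could be sketched with pandas roughly as follows. The 30-day bin edges, the intermediate labels and the sample values are assumptions chosen to match the labels shown in the table, not the actual data:

```python
import pandas as pd

# Made-up stage 1 durations in days
df = pd.DataFrame({"stage1": [30, 45, 200]})

# Assumed 30-day bins; with the default right=True, 30 days falls in (0, 30]
edges = [0, 30, 60, 90, 120, 150, 180, 210]
labels = ["1month", "1~2months", "2~3months", "3~4months",
          "4~5months", "5~6months", "6~7months"]

df["stage1_bin"] = pd.cut(df["stage1"], bins=edges, labels=labels)

# One indicator column per category, e.g. stage1_1month, stage1_1~2months, ...
encoded = pd.get_dummies(df["stage1_bin"], prefix="stage1")

print(df)
print(encoded)
```

The same two calls would be repeated for stage2 and stage3 if they are kept.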

Topic regression feature-selection

Category Data Science


One-hot encode the three stage variables instead of including all three of them as raw values. And include any of them at all ONLY if they are actually relevant to the response, so running a PCA beforehand would not be a bad idea.
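
A PCA pass over the stage columns could look roughly like the sketch below. It uses made-up durations and plain NumPy (SVD of the centered data) rather than any particular library's PCA class. Note that PCA only summarizes the variance among the features themselves and does not look at the response, so it is a first check rather than a relevance test:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up durations (days) for stage1, stage2, stage3
stages = rng.integers(5, 60, size=(200, 3)).astype(float)

# PCA via SVD of the mean-centered feature matrix
centered = stages - stages.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Share of the total variance carried by each principal component
explained_variance = singular_values**2 / (len(stages) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Data projected onto the principal components
scores = centered @ components.T

print(explained_ratio)
```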
