sklearn pipeline ValueError: Found input variables with inconsistent numbers of samples

I am receiving the error shown below. I have checked the shapes of X and y and did not find a mismatch.

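from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.utils import check_consistent_length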

labels = ['non-role','role']
X = df[[POS, NER, DEF, SYN]]
y = df[Label]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2, shuffle=True)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

print(check_consistent_length(X_train, y_train))

And here is the output:

(25238, 4)
(25238,)
(6310, 4)
(6310,)
None

I then tried to fit the model:

NB_pipeline = Pipeline([('tfidf-vect', TfidfVectorizer()),('clf', RandomForestClassifier())])
NB_pipeline.fit(X_train, y_train)

But I received the following error:

ValueError: Found input variables with inconsistent numbers of samples: [4, 25238]
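
The 4 in that message matches the number of columns in X, not the number of rows. A likely explanation, offered as a sketch rather than a confirmed diagnosis: TfidfVectorizer expects an iterable of strings, and iterating over a DataFrame yields its column labels, so the vectorizer sees 4 "documents". The example below reproduces that on hypothetical toy data (X_toy, y_toy and the string column names are stand-ins, and it assumes all four columns hold raw text) and shows one possible workaround, applying a separate TfidfVectorizer to each column through ColumnTransformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for df; the real columns are assumed to be text.
text_cols = ["POS", "NER", "DEF", "SYN"]
X_toy = pd.DataFrame({
    "POS": ["noun verb", "verb noun", "adj noun", "noun noun", "verb adj", "adj verb"],
    "NER": ["person org", "org place", "person place", "place org", "org person", "place person"],
    "DEF": ["kind of job", "kind of place", "kind of person", "kind of thing", "kind of task", "kind of site"],
    "SYN": ["worker staff", "site area", "human being", "object item", "duty chore", "spot zone"],
})
y_toy = ["role", "non-role", "role", "non-role", "role", "non-role"]

# Iterating a DataFrame yields its column labels, so the vectorizer is fitted
# on the 4 column names rather than on the 6 rows.
n_docs_seen = TfidfVectorizer().fit_transform(X_toy).shape[0]
print(n_docs_seen, len(X_toy))

# Possible workaround: one TfidfVectorizer per text column. Passing the column
# name as a plain string hands each vectorizer a 1-D Series of strings.
preprocess = ColumnTransformer(
    [(f"tfidf_{col}", TfidfVectorizer(), col) for col in text_cols]
)
pipeline = Pipeline([
    ("features", preprocess),
    ("clf", RandomForestClassifier(random_state=0)),
])
pipeline.fit(X_toy, y_toy)
preds = pipeline.predict(X_toy)
print(len(preds))
```

If the columns are categorical tags rather than free text, a different encoder per column may fit better; the per-column structure stays the same.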
