PicklingError: Could not serialize object: TypeError: can't pickle fasttext_pybind.fasttext objects

I built a fastText classification model for sentiment analysis of Facebook comments (using PySpark 2.4.1 on Windows). When I call my prediction function on a single sentence, it returns a list containing one tuple: the three labels followed by their three probabilities, in the form below:

[('__label__positif', '__label__négatif', '__label__neutre', 0.8947999477386475, 0.08174632489681244, 0.023483742028474808)]

To apply it to the "text" column of my DataFrame, I did this:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.sql.functions import udf, col
import fasttext

schema = StructType([
    StructField("pos", StringType(), False),
    StructField("neg", StringType(), False),
    StructField("ntr", StringType(), False),
    StructField("pr_pos", DoubleType(), False),
    StructField("pr_neg", DoubleType(), False),
    StructField("pr_ntr", DoubleType(), False)
])

# predictClass (not shown) is my prediction function; it uses the loaded fastText model
udf_label = udf(lambda words : predictClass(words), schema)
df = df.withColumn("classe", udf_label(col('text')))

df.select('classe').show()

I get this error: PicklingError: Could not serialize object: TypeError: can't pickle fasttext_pybind.fasttext objects



On the FastText Users Facebook page, Maksym Kysylov answered: "It's not a FastText problem. It's a Spark problem :) When you apply a function to a DataFrame (or RDD), Spark needs to serialize it and send it to all executors. It's not really possible to serialize FastText's code, because part of it is native (C++). A possible solution would be to save the model to disk, then, for each Spark partition, load the model from disk and apply it to the data. Something like df.rdd.mapPartitions(func), where func should: 1. load the model; 2. for each record in the partition, yield ft.predict(record['text'])." This worked for me, and I thank him very much!
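
For anyone landing here with the same error, the suggestion can be sketched roughly as follows. This is a minimal sketch, not the exact code I ran: the helper name make_partition_predictor, its k and load_model parameters, and the model path are all made up for illustration. It assumes the model was saved beforehand (for example with model.save_model) to a path that every executor can read.

```python
def make_partition_predictor(model_path, k=3, load_model=None):
    """Build a function suitable for df.rdd.mapPartitions.

    Hypothetical helper: only the path string is captured in the closure,
    so Spark never has to pickle the native fastText object.
    load_model is an optional hook (an assumption of this sketch) that lets
    you swap in a stub loader for testing without fastText installed.
    """
    def predict_partition(rows):
        # Load the model once per partition, on the executor itself.
        if load_model is None:
            import fasttext  # imported lazily, on the executor
            model = fasttext.load_model(model_path)
        else:
            model = load_model(model_path)
        for row in rows:
            text = row["text"]
            # fastText predicts one line at a time, so drop embedded newlines.
            labels, probs = model.predict(text.replace("\n", " "), k=k)
            # Convert to plain Python types so Spark can serialize the results.
            yield (text, [str(l) for l in labels], [float(p) for p in probs])
    return predict_partition


# Usage inside a Spark job (the path is a placeholder):
# predicted = df.rdd.mapPartitions(make_partition_predictor("/path/to/model.bin"))
# result = predicted.toDF(["text", "labels", "probs"])
```

The model is loaded once per partition rather than once per row, so the loading cost is spread over all the records in that partition.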
