I want to update my code of pyspark. In the pyspark, it must put the base model in a pipeline, the office demo of pipeline use the LogistictRegression as an base model. However,
There is no XGBoost classifier in Apache Spark ML (as of version 2.3). Available models are listed here : https://spark.apache.org/docs/2.3.0/ml-classification-regression.html
If you want to use XGBoost you should do it without pyspark (convert your spark dataframe to a pandas dataframe with .toPandas()
) or use another algorithm (https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#module-pyspark.ml.classification).
But if you really want to use XGBoost with pyspark, you'll have to dive into pyspark to implement a distributed XGBoost yourself. Here is an article where they do so : http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html