I want to update my PySpark code. In PySpark, the base model has to go into a Pipeline; the official Pipeline demo uses LogisticRegression as the base model. However, I want to use XGBoost as the base model instead.
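For context, my current code follows the official Pipeline demo roughly like this (the column names and DataFrames `f1`, `f2`, `f3`, `train_df`, `test_df` are just placeholders):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Assemble raw columns into a single feature vector, then plug the
# estimator (here LogisticRegression) in as the final pipeline stage.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(train_df)          # train_df: a Spark DataFrame with f1, f2, f3, label
predictions = model.transform(test_df)  # test_df: same schema as train_df
```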
There is an XGBoost implementation for Spark 2.4 and above here:
https://xgboost.readthedocs.io
Note that this is an external library, but it should work easily with Spark.
There is no XGBoost classifier in Apache Spark ML (as of version 2.3). The available models are listed here: https://spark.apache.org/docs/2.3.0/ml-classification-regression.html
If you want to use XGBoost, you should do it without PySpark (convert your Spark DataFrame to a pandas DataFrame with .toPandas()) or use another algorithm (https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#module-pyspark.ml.classification).
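A minimal sketch of the .toPandas() route, assuming the data fits in driver memory and the target column is named `label` (both are assumptions, adjust to your schema):

```python
import xgboost as xgb

# Collect the Spark DataFrame to the driver as a pandas DataFrame,
# then train a plain, non-distributed XGBoost model on it.
pdf = spark_df.toPandas()                      # spark_df: your Spark DataFrame
X, y = pdf.drop("label", axis=1), pdf["label"]

clf = xgb.XGBClassifier()                      # regular Python xgboost, runs on the driver only
clf.fit(X, y)
```

Keep in mind this loses Spark's distribution entirely, so it only makes sense for data small enough to fit on one machine.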
But if you really want to use XGBoost with PySpark, you'll have to dive into PySpark and implement a distributed XGBoost yourself. Here is an article where they do so: http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html
There is a maintained distributed XGBoost library (used in production by several companies), as mentioned above: https://github.com/dmlc/xgboost. However, using it from PySpark is a bit tricky; someone made a working PySpark wrapper for version 0.72 of the library, with 0.8 support in progress.
See https://medium.com/@bogdan.cojocar/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb, and https://github.com/dmlc/xgboost/issues/1698 for the full discussion.
Make sure the XGBoost jars are on your PySpark jar path.
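Roughly, the wiring looks like the sketch below. The jar/zip file names and the `sparkxgb.XGBoostEstimator` class and its parameters come from the wrapper used in the linked article, not from Spark itself, and may differ between versions:

```python
# Launch PySpark with the XGBoost jars and the Python wrapper on the path, e.g.:
#   pyspark --jars xgboost4j-spark-0.72.jar,xgboost4j-0.72.jar --py-files sparkxgb.zip
# (file names follow the linked article; adjust to the versions you downloaded)

from pyspark.ml import Pipeline
from sparkxgb import XGBoostEstimator  # wrapper module name as used in the article

xgboost_estimator = XGBoostEstimator(
    featuresCol="features",
    labelCol="label",
    predictionCol="prediction",
)
pipeline = Pipeline(stages=[xgboost_estimator])  # drops into a pyspark.ml Pipeline like any other estimator
model = pipeline.fit(train_df)                   # train_df: a Spark DataFrame with features/label columns
```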