Jayita Gulati
2025-06-03 08:00:00
www.kdnuggets.com

Image by Editor (Kanwal Mehreen) | Canva
Apache Spark is a tool for working with big data. It is open source, free to use, and very fast. Spark can process datasets that are too large to fit in a single machine’s memory by spreading the work across a cluster. A machine learning pipeline is a series of steps to prepare data and train models. These steps include collecting data, cleaning it, selecting important features, training the model, and checking how well it works.
Spark makes it easy to build these pipelines. With Spark, companies can quickly analyze large amounts of data and create machine learning models. This helps them make better decisions based on the information they have. In this article, we will explain how to set up and use machine learning pipelines in Spark.
Components of a Machine Learning Pipeline in Spark
Spark’s MLlib library has many built-in tools. These tools can be linked together to build a complete machine learning process.
Transformers
Transformers change data in some way. They take a DataFrame and return a modified version of it, leaving the original DataFrame untouched. They are used for tasks like assembling feature vectors or encoding categorical data. Examples include VectorAssembler (which combines columns into a single feature vector) and the fitted models produced by StringIndexer and StandardScaler; those two are technically estimators that must be fit before they can transform data.
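As a quick illustration, here is a minimal sketch of using a pure transformer. It assumes a DataFrame named df that already has numeric age and balance columns, similar to the churn data used later in this article.

from pyspark.ml.feature import VectorAssembler

# VectorAssembler is a pure transformer: no fitting step is required
assembler = VectorAssembler(inputCols=["age", "balance"], outputCol="features")
assembled_df = assembler.transform(df)  # returns a new DataFrame; df itself is unchanged
assembled_df.select("features").show(3)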
Estimators
Estimators learn from data to create models. They include algorithms like LogisticRegression and RandomForestClassifier. Estimators use a fit method to train on data, and they output a Model object that can make predictions.
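The sketch below shows the fit-then-transform pattern. The DataFrames train_df and test_df and their features and churn columns are placeholders for illustration, not part of the dataset code that follows.

from pyspark.ml.classification import LogisticRegression

# An estimator: fit() learns from the data and returns a fitted model
lr = LogisticRegression(featuresCol="features", labelCol="churn")
lr_model = lr.fit(train_df)                 # LogisticRegressionModel (a transformer)
predictions = lr_model.transform(test_df)   # adds prediction and probability columns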
Pipeline
A Pipeline is a tool to connect transformers and estimators into a single workflow. By organizing them in sequence, data flows smoothly from one step to the next. Pipelines make it easy to retrain models, repeat processes, and adjust parameters.
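Conceptually, a pipeline just chains these pieces together. The short sketch below uses placeholder column names and DataFrames (train_df, test_df) to show the shape of the API; the full worked example follows.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
assembler = VectorAssembler(inputCols=["gender_index", "age"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="churn")

pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(train_df)            # fits each stage in order
predictions = model.transform(test_df)    # applies each fitted stage in order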
Let’s go through a basic example of building a classification pipeline to predict customer churn. In this pipeline, we’ll:
- Load the Data: Import the dataset into Spark for processing.
- Preprocess the Data: Clean and prepare the data for modeling.
- Set Up the Model: Prepare the logistic regression model.
- Train the Model: Fit a machine learning model to the data.
- Evaluate the Model: Check how well the model performs.
Initialize Spark Session and Load Dataset
First, we use SparkSession.builder to set up the session. Then, we load the customer churn dataset. This dataset describes bank customers, and the churn label records whether each customer closed their account.
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("MLPipeline").getOrCreate()
# Load dataset
data = spark.read.csv("/content/Customer Churn.csv", header=True, inferSchema=True)
# Show the first few rows of the dataset
data.show(5)
Data Preprocessing
First, we check the data for any missing values. If there are missing values, we remove those rows to make sure the data is complete. Next, we convert categorical data into numerical format so that the computer can understand it. We do this using methods like StringIndexer and OneHotEncoder. Finally, we combine all the features into a single vector and scale the data.
from pyspark.sql import functions as F
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
# Count missing (null) values in each column
# (F.isnan only applies to numeric columns, so we check for nulls across all columns)
missing_values = data.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in data.columns])
missing_values.show()
# Drop rows with any missing values
data = data.na.drop()
# Identify categorical columns
categorical_columns = ['country', 'gender', 'credit_card', 'active_member']
# Create a list to hold the stages of the pipeline
stages = []
# Apply StringIndexer and OneHotEncoder to each categorical column
for column in categorical_columns:
    # Convert the categorical column to numerical indices
    indexer = StringIndexer(inputCol=column, outputCol=column + "_index")
    stages.append(indexer)

    # One-hot encode the indexed column
    encoder = OneHotEncoder(inputCols=[column + "_index"], outputCols=[column + "_ohe"])
    stages.append(encoder)
label_column = 'churn' # The label column
feature_columns = [column + "_ohe" for column in categorical_columns]
# Add numerical columns to the features list
numerical_columns = ['credit_score', 'age', 'tenure', 'balance', 'products_number', 'estimated_salary']
feature_columns += numerical_columns
# Create VectorAssembler to combine all feature columns
vector_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
stages.append(vector_assembler)
# Scale the features using StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
stages.append(scaler)
Logistic Regression Model Setup
We import LogisticRegression from pyspark.ml.classification and create a logistic regression model that uses the scaled features and the churn label. We add it as the final stage, then combine all the stages into a Pipeline.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
# Logistic Regression Model
lr = LogisticRegression(featuresCol="scaled_features", labelCol=label_column)
stages.append(lr)
# Create the pipeline from all stages
pipeline = Pipeline(stages=stages)
Model Training and Predictions
We split the dataset into training and testing sets. Then, we fit the pipeline model to the training data and make predictions on the test data.
# Split data into training and testing sets
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)
# Fit the model
pipeline_model = pipeline.fit(train_data)
# Make Predictions
predictions = pipeline_model.transform(test_data)
# Show the predictions
predictions.select("prediction", label_column, "scaled_features").show(10)
Model Evaluation
We import MulticlassClassificationEvaluator from pyspark.ml.evaluation to evaluate the model’s performance. We compute accuracy, weighted precision, weighted recall, and the F1 score from the model’s predictions. Finally, we stop the Spark session to free up resources.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Accuracy
evaluator_accuracy = MulticlassClassificationEvaluator(labelCol=label_column, predictionCol="prediction", metricName="accuracy")
accuracy = evaluator_accuracy.evaluate(predictions)
print(f"Accuracy: {accuracy}")
# Precision
evaluator_precision = MulticlassClassificationEvaluator(labelCol=label_column, predictionCol="prediction", metricName="weightedPrecision")
precision = evaluator_precision.evaluate(predictions)
print(f"Precision: {precision}")
# Recall
evaluator_recall = MulticlassClassificationEvaluator(labelCol=label_column, predictionCol="prediction", metricName="weightedRecall")
recall = evaluator_recall.evaluate(predictions)
print(f"Recall: {recall}")
# F1 Score
evaluator_f1 = MulticlassClassificationEvaluator(labelCol=label_column, predictionCol="prediction", metricName="f1")
f1_score = evaluator_f1.evaluate(predictions)
print(f"F1 Score: {f1_score}")
# Stop Spark session
spark.stop()
Conclusion
In this article, we learned about machine learning pipelines in Apache Spark. Pipelines help organize each step of the ML process. We started by loading and cleaning the customer churn dataset. Then, we transformed the data and created a logistic regression model. After training the model, we made predictions on the held-out test data. Finally, we evaluated the model’s performance using accuracy, weighted precision, weighted recall, and the F1 score.
Jayita Gulati is a machine learning enthusiast and technical writer driven by her passion for building machine learning models. She holds a Master’s degree in Computer Science from the University of Liverpool.