Kanwal Mehreen
2025-02-25 08:00:00
www.kdnuggets.com

Docker is basically a tool that helps data engineers package, distribute, and run applications in a consistent environment. Instead of manually installing stuff (and praying it works everywhere), you wrap your entire project (code, tools, and dependencies) into lightweight, portable, self-sufficient environments called containers. These containers can run your code anywhere: on your laptop, a server, or the cloud. For example, if your project needs Python, Spark, and a bunch of specific libraries, instead of manually installing them on every machine, you can just spin up a Docker container with everything pre-configured. Share it with your team, and they’ll have the exact same setup running in no time. Before we discuss the essential commands, let’s go over some key Docker terminology to make sure we’re all on the same page.
- Docker Image: A snapshot of an environment with all dependencies installed.
- Docker Container: A running instance of a Docker image.
- Dockerfile: A script that defines how a Docker image should be built.
- Docker Hub: A public registry where you can find and share Docker images.
Before using Docker, you’ll need to install:
- Docker Desktop: Download and install it from Docker’s official website. You can check that it is installed correctly by running docker --version in your terminal (a quick check is shown after this list).
- Visual Studio Code: Install it from the official website and add the Docker extension for easy container management.
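If the installation succeeded, a quick sanity check (a minimal sketch using only standard Docker commands) is to print the version and run the official test image:
docker --version # Prints the installed Docker version
docker run hello-world # Pulls and runs a tiny test image to confirm containers can start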
Here are the essential Docker commands that every data engineer should know:
1. docker run
What It Does: Creates and starts a container from an image.
docker run -d --name postgres -e POSTGRES_PASSWORD=secret -p 5432:5432 -v pgdata:/var/lib/postgresql/data postgres:15
Why It’s Important: Data engineers frequently launch databases, processing engines, or API services. The docker run command’s flags are critical:
- -d: Runs the container in the background (so your terminal isn’t locked).
- --name: Names your container. Stop guessing which random ID is your Postgres instance.
- -e: Sets environment variables (like passwords or configs).
- -p: Maps ports (e.g., exposing PostgreSQL’s port 5432).
- -v: Mounts volumes to persist data beyond the container’s lifecycle.
Without volumes, database data would vanish when the container is removed, a disaster for production pipelines.
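As a quick check that the container actually came up (a small sketch using standard Docker commands, with the container name from the example above):
docker ps --filter name=postgres # List running containers whose name matches "postgres"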
2. docker build
What It Does: Turns your Dockerfile into a reusable image.
# Dockerfile
FROM python:3.9-slim
RUN pip install pandas numpy apache-airflow

# Build the image from the directory containing the Dockerfile
docker build -t custom_airflow:latest .
Why It’s Important: Data engineers often need custom images preloaded with tools like Airflow, PySpark, or machine learning libraries. The docker build
command ensures teams use identical environments, eliminating “works on my machine” issues.
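As a rough sanity check (a sketch, not part of the original snippet), you can run the freshly built image once and confirm the preinstalled libraries import cleanly:
docker run --rm custom_airflow:latest python -c "import pandas, numpy; print(pandas.__version__)" # Should print the pandas version baked into the image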
3. docker exec
What It Does: Executes a command inside a running container.
docker exec -it postgres psql -U postgres # Access the PostgreSQL shell
Why It’s Important: Data engineers use this to inspect databases, run ad-hoc queries, or test scripts without restarting containers. The -it flags let you type commands interactively (without them, no interactive terminal is attached to the container).
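docker exec also works non-interactively, which is handy in scripts; in this sketch the table name my_table is purely a placeholder:
docker exec postgres psql -U postgres -c "SELECT count(*) FROM my_table;" # Run a one-off query without opening a shell (my_table is hypothetical)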
4. docker logs
What It Does: Displays logs from a container.
docker logs --tail 100 -f airflow_scheduler # Show the last 100 lines and keep streaming new ones
Why It’s Important: Debugging failed tasks (e.g., Airflow DAGs or Spark jobs) relies on logs. The -f
flag streams logs in real-time, helping diagnose runtime issues.
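Combined with ordinary shell tools, logs become searchable; this sketch (using the scheduler container from the example above) narrows the output to recent errors:
docker logs --since 30m airflow_scheduler 2>&1 | grep -i error # Surface error lines from the last 30 minutes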
5. docker stats
What It Does: Live dashboard for CPU, memory, and network usage of containers.
docker stats postgres spark_master
Why It’s Important: Efficient resource monitoring is important for maintaining optimal performance in data pipelines. For example, if a data pipeline experiences slow processing, checking docker stats
can reveal whether PostgreSQL is overutilizing CPU resources or if a Spark worker is consuming excessive memory, allowing for timely optimization.
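If you only need a one-off snapshot rather than a live stream, the standard --no-stream and --format flags help; the format string here is just one possible layout:
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" # Print a single snapshot of name, CPU, and memory usage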
6. docker-compose up
What It Does: Start multi-container applications using a docker-compose.yml
file.
# docker-compose.yml
services:
  airflow:
    image: apache/airflow:2.6.0
    ports:
      - "8080:8080"
  postgres:
    image: postgres:14
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata: # Named volumes must be declared at the top level
Why It’s Important: Data pipelines often involve interconnected services (e.g., Airflow + PostgreSQL + Redis). Compose simplifies defining and managing these dependencies in a single declarative file so you don’t run 10 commands manually. The -d flag runs the containers in the background (detached mode).
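A few companion commands round out the typical Compose workflow (the service names match the compose file above; this is a sketch, not an exhaustive list):
docker-compose up -d # Start all services in the background
docker-compose ps # Check the status of each service
docker-compose logs -f airflow # Follow logs for a single service
docker-compose down # Stop and remove the whole stack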
7. docker volume
What It Does: Manages persistent storage for containers.
docker volume create etl_data
docker run -v etl_data:/data -d my_etl_tool
Why It’s Important: Volumes preserve critical data (e.g., CSV files, database tables) even if containers crash. They’re also used to share data between containers (e.g., Spark and Hadoop).
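Two related subcommands (both standard docker volume subcommands) help you keep track of what exists and where it lives on the host:
docker volume ls # List all volumes on the host
docker volume inspect etl_data # Show the mountpoint and metadata for the volume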
8. docker pull
What It Does: Download an image from Docker Hub (or another registry).
docker pull apache/spark:3.4.1 # Pre-built Spark image
Why It’s Important: Pre-built images save hours of setup time. Official images for tools like Spark, Kafka, or Jupyter are regularly updated and optimized.
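Pinning an explicit tag and checking what landed locally is a good habit; for example (a sketch reusing the Postgres image from earlier):
docker pull postgres:15 # Pull a specific, reproducible tag instead of "latest"
docker images postgres # Confirm the image and tag are now available locally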
9. docker stop / docker rm
What It Does: Stop and remove containers.
docker stop airflow_worker && docker rm airflow_worker # Cleanup
Why It’s Important: Data engineers test pipelines iteratively. Stopping and removing old containers prevents resource leaks and keeps environments clean.
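For bulk cleanup during iterative testing, these standard shortcuts help (use them with care, since removal is permanent):
docker rm -f airflow_worker # Force-stop and remove a container in a single step
docker container prune # Remove all stopped containers at once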
10. docker system prune
What It Does: Clean up unused containers, images, and volumes to free resources.
docker system prune -a --volumes
Why It’s Important: Over time, Docker environments accumulate unused images, stopped containers, and dangling volumes (volumes no longer associated with any container), which eat disk space and slow down performance. This command can reclaim gigabytes after weeks of testing.
- -a: Removes all unused images.
- --volumes: Deletes volumes too (careful, this can delete data!).
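Before pruning, it can be worth checking what is actually consuming space, and you can prune images alone if you want to keep volumes; for example:
docker system df # Show disk usage broken down by images, containers, and volumes
docker image prune -a # Remove unused images only, leaving volumes untouched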
Mastering these Docker commands empowers data engineers to deploy reproducible pipelines, streamline collaboration, and troubleshoot effectively. Do you have a favorite Docker command that you use in your daily workflow? Let us know in the comments!
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.