Homework 1: Docker, SQL and Terraform for Data Engineering Zoomcamp 2026
Question 1.
Understanding Docker images
Run docker with the python:3.13 image. Use an entrypoint bash to interact with the container.
What’s the version of pip in the image?
- 25.3
- 24.3.1
- 24.2.1
- 23.3.1
Solution
You can run python:3.13 with docker and interactive terminal flag (-it)
docker run -it --rm -v $(pwd):/app python:3.13 pip -V
Result: pip 25.3 from /usr/local/lib/python3.13/site-packages/pip (python 3.13)
so the answer is 25.3
Answer
- 25.3
Question 2
Given the docker-compose.yaml, what hostname and port should pgAdmin use to connect to the Postgres database? (1 point)
- postgres:5433
- localhost:5432
- db:5433
- postgres:5432
- db:5432
Solution
Take a look at the docker-compose.yaml file:
...
image: postgres:18
environment:
POSTGRES_USER: "root"
POSTGRES_PASSWORD: "root"
POSTGRES_DB: "ny_taxi"
volumes:
- "ny_taxi_postgres_data:/var/lib/postgresql"
ports:
- "5432:5432"
...
You can see that the hostname is localhost and the port is 5432.
So, the answer is 2. localhost:5432.
Answer
- localhost:5432
Question 3.
For the trips in November 2025, how many trips had a trip_distance of less than or equal to 1 mile? (1 point)
- 7,853
- 8,007
- 8,254
- 8,421
Solution
First, open Jupyter Notebook or Google Colab to run the code.
Use the following code to download the Parquet dataset file:
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-11.parquet
Then, read the .parquet file and display it as a DataFrame:
import pandas as pd
df = pd.read_parquet("green_tripdata_2025-11.parquet")
df
This is a preview of the data, it consists of 21 columns.
| VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge | cbd_congestion_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2025-11-01 00:34:48 | 2025-11-01 00:41:39 | N | 1.0 | 74 | 42 | 1.0 | 0.74 | 7.20 | 1.00 | 0.5 | 1.94 | 0.0 | NaN | 1.0 | 11.64 | 1.0 | 1.0 | 0.00 | 0.00 |
| 2 | 2025-11-01 00:18:52 | 2025-11-01 00:24:27 | N | 1.0 | 74 | 42 | 2.0 | 0.95 | 7.20 | 1.00 | 0.5 | 0.00 | 0.0 | NaN | 1.0 | 9.70 | 2.0 | 1.0 | 0.00 | 0.00 |
Before we work with the data, it’s important to convert the column we’re using to the correct data type. In this case, we’ll convert lpep_pickup_datetime to a datetime column.
df["lpep_pickup_datetime"] = pd.to_datetime(df["lpep_pickup_datetime"])
To filter the date range, we can use the between() method with two parameters: the start date and the end date (see pandas.Series.between). Save the filtered result into a new variable called df_result.
df_result = df[df["lpep_pickup_datetime"].between("2025-11-01","2025-12-01")]
Because the question asks for trips with a trip_distance of less than or equal to 1 mile, we can filter like this:
df_result[df_result["trip_distance"] <= 1].reset_index()
This filters rows where trip_distance is less than or equal to 1.
We can count the total number of matching trips by using len():
len(df_result[df_result["trip_distance"] <= 1].reset_index())
Answer
8007 Trips
Question 4.
Which pickup day had the longest total trip distance? Only consider trips with a trip_distance less than 100 miles. (1 point)
- 2025-11-14
- 2025-11-20
- 2025-11-23
- 2025-11-25
Solution
First, open Jupyter Notebook or Google Colab to run the code.
Use the following code to download the Parquet dataset file:
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-11.parquet
Next, read the .parquet file and display it as a DataFrame:
import pandas as pd
df = pd.read_parquet("green_tripdata_2025-11.parquet")
There are several ways to complete this task, but I prefer to extract the month, day, and year from the pickup datetime column and store them in separate variables.
df["month"] = df["lpep_pickup_datetime"].dt.month
df["day"] = df["lpep_pickup_datetime"].dt.day
df["year"] = df["lpep_pickup_datetime"].dt.year
Let’s aggregate the data by month and day, summing the trip_distance column, so we get the total trip distance for each date.
df[df["trip_distance"] < 100].groupby(["month", "day"]).agg(total=("trip_distance", "sum")).sort_values("total", ascending=False)
Here is the resulting table of total trip distances (less than 100 miles), grouped by month and day:
| Month | Day | Total Trip Distance |
|---|---|---|
| 11 | 20 | 6377.03 |
| 19 | 6031.56 | |
| 18 | 5976.12 | |
| 6 | 5973.34 | |
| 25 | 5954.80 | |
| 5 | 5841.01 |
The day with the longest total trip distance is 20 November 2025.
Answer
- 2025-11-20
Question 5.
Which was the pickup zone with the largest total_amount (sum of all trips) on November 18th, 2025? (1 point)
- East Harlem North
- East Harlem South
- Morningside Heights
- Forest Hills
Solution
We need one more file for this task. Please download it using the code below:
!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv
Example of the data:
| LocationID | Borough | Zone | service_zone |
|---|---|---|---|
| 1 | EWR | Newark Airport | EWR |
| 2 | Queens | Jamaica Bay | Boro Zone |
| 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
Let’s import both files:
import pandas as pd
df = pd.read_parquet("./green_tripdata_2025-11.parquet")
df_zone = pd.read_csv("./taxi_zone_lookup.csv")
df_zone = df_zone.dropna()
Next, merge the two tables using PULocationID and LocationID as the join keys:
df_joined = pd.merge(df,df_zone,left_on="PULocationID",right_on="LocationID")
df_joined
This will create one combined table (the main df and df_zone) and store it in the df_joined variable.
Now, let’s filter trips for November 18th, 2025, then group by zone and aggregate the sum of total_amount:
df_joined[
(df_joined["day"] == 18) & (df_joined["months"] == 11) & (df_joined["year"] == 2025)
].groupby("Zone").agg(total=("total_amount", "sum")).sort_values("total", ascending=False)
Lastly, we can sort from the highest to the lowest total_amount.
| Zone | total |
|---|---|
| East Harlem North | 9281.92 |
| East Harlem South | 6696.13 |
| Central Park | 2378.79 |
| Washington Heights South | 2139.05 |
We can see that East Harlem North is the pickup zone that have the highest total amount that user paying.
Answer
- East Harlem North
Question 7.
Which of the following sequences describes the Terraform workflow for:
- Downloading plugins and setting up backend
- Generating and executing changes
- Removing all resources? (1 point)
- terraform import, terraform apply -y, terraform destroy
- teraform init, terraform plan -auto-apply, terraform rm
- terraform init, terraform run -auto-approve, terraform destroy
- terraform init, terraform apply -auto-approve, terraform destroy
- terraform import, terraform apply -y, terraform rm
Solution
Check the Terraform module in this GitHub repo: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/01-docker-terraform/terraform/1_terraform_overview.md#execution-steps
From this guideline, we can see that:
terraform initis used to initialize the backend, install plugins, and check the configuration.terraform applyis used to apply the plan in the cloud, with the-auto-approveflag.- Lastly, we can use
terraform destroyto remove all resources and stacks from the cloud.
So the answer is 4.
Answer
- terraform init, terraform apply -auto-approve, terraform destroy