
Homework 1: Docker, SQL and Terraform for Data Engineering Zoomcamp 2026

Question 1.

Understanding Docker images

Run docker with the python:3.13 image. Use an entrypoint bash to interact with the container.

What’s the version of pip in the image?

  1. 25.3
  2. 24.3.1
  3. 24.2.1
  4. 23.3.1

Solution

You can run python:3.13 with Docker using the interactive terminal flag (-it); --rm removes the container when it exits:

docker run -it --rm python:3.13 pip -V

Result: pip 25.3 from /usr/local/lib/python3.13/site-packages/pip (python 3.13)

so the answer is 25.3
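The homework wording asks for bash as the entrypoint. That route gives the same result; a sketch, assuming Docker is installed locally:

```shell
# Start an interactive bash shell inside the python:3.13 image
docker run -it --rm --entrypoint bash python:3.13

# Then, inside the container, check the pip version:
pip --version
# pip 25.3 from /usr/local/lib/python3.13/site-packages/pip (python 3.13)
```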

Answer

  1. 25.3

Question 2

Given the docker-compose.yaml, what hostname and port should pgAdmin use to connect to the Postgres database? (1 point)

  1. postgres:5433
  2. localhost:5432
  3. db:5433
  4. postgres:5432
  5. db:5432

Solution

Take a look at the docker-compose.yaml file:

...
    image: postgres:18
    environment:
      POSTGRES_USER: "root"
      POSTGRES_PASSWORD: "root"
      POSTGRES_DB: "ny_taxi"
    volumes:
      - "ny_taxi_postgres_data:/var/lib/postgresql"
    ports:
      - "5432:5432"
...

The ports entry "5432:5432" publishes the container's port 5432 on the host's port 5432, so pgAdmin running on the host machine connects with hostname localhost and port 5432. So, the answer is 2. localhost:5432.
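As a small illustrative sketch (not part of the homework), Docker Compose port entries follow the "HOST_PORT:CONTAINER_PORT" convention, which is why the host side of the mapping is what pgAdmin needs:

```python
# Docker Compose publishes ports as "HOST_PORT:CONTAINER_PORT".
mapping = "5432:5432"
host_port, container_port = mapping.split(":")

# pgAdmin running on the host machine connects to localhost:<HOST_PORT>.
print(f"connect via localhost:{host_port}")  # connect via localhost:5432
```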

Answer

  2. localhost:5432

Question 3.

For the trips in November 2025, how many trips had a trip_distance of less than or equal to 1 mile? (1 point)

  1. 7,853
  2. 8,007
  3. 8,254
  4. 8,421

Solution

First, open Jupyter Notebook or Google Colab to run the code.

Use the following code to download the Parquet dataset file:

!wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-11.parquet

Then, read the .parquet file and display it as a DataFrame:

import pandas as pd
df = pd.read_parquet("green_tripdata_2025-11.parquet")
df

This is a preview of the data; it consists of 21 columns (two sample rows shown, transposed for readability):

Column                 Row 1                Row 2
VendorID               2                    2
lpep_pickup_datetime   2025-11-01 00:34:48  2025-11-01 00:18:52
lpep_dropoff_datetime  2025-11-01 00:41:39  2025-11-01 00:24:27
store_and_fwd_flag     N                    N
RatecodeID             1.0                  1.0
PULocationID           74                   74
DOLocationID           42                   42
passenger_count        1.0                  2.0
trip_distance          0.74                 0.95
fare_amount            7.20                 7.20
extra                  1.00                 1.00
mta_tax                0.5                  0.5
tip_amount             1.94                 0.00
tolls_amount           0.0                  0.0
ehail_fee              NaN                  NaN
improvement_surcharge  1.0                  1.0
total_amount           11.64                9.70
payment_type           1.0                  2.0
trip_type              1.0                  1.0
congestion_surcharge   0.00                 0.00
cbd_congestion_fee     0.00                 0.00

Before we work with the data, it’s important to convert the column we’re using to the correct data type. In this case, we’ll convert lpep_pickup_datetime to a datetime column.

df["lpep_pickup_datetime"] = pd.to_datetime(df["lpep_pickup_datetime"])

To filter the date range, we can use the between() method with a start date and an end date (see pandas.Series.between). By default both endpoints are inclusive, so pass inclusive="left" to keep November pickups while excluding any trip that starts exactly at midnight on December 1st. Save the filtered result into a new variable called df_result.

df_result = df[df["lpep_pickup_datetime"].between("2025-11-01", "2025-12-01", inclusive="left")]

Because the question asks for trips with a trip_distance of less than or equal to 1 mile, we can filter like this:

df_result[df_result["trip_distance"] <= 1]

This filters rows where trip_distance is less than or equal to 1.

We can count the total number of matching trips by using len():

len(df_result[df_result["trip_distance"] <= 1])
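To see why the endpoints of the date filter matter, here is a self-contained sketch with toy timestamps (made up for illustration) showing that between() keeps both endpoints by default, while inclusive="left" drops a pickup at exactly midnight on December 1st:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2025-10-31 23:59:59",  # before the range
    "2025-11-15 12:00:00",  # inside the range
    "2025-12-01 00:00:00",  # exactly the right endpoint
]))

# Default inclusive="both": the Dec 1 midnight row is kept.
both = ts.between("2025-11-01", "2025-12-01")

# inclusive="left": the Dec 1 midnight row is excluded.
left = ts.between("2025-11-01", "2025-12-01", inclusive="left")

print(both.sum(), left.sum())  # 2 1
```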

Answer

  2. 8,007

Question 4.

Which pickup day had the longest total trip distance? Only consider trips with a trip_distance less than 100 miles. (1 point)

  1. 2025-11-14
  2. 2025-11-20
  3. 2025-11-23
  4. 2025-11-25

Solution

First, open Jupyter Notebook or Google Colab to run the code.

Use the following code to download the Parquet dataset file:

!wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-11.parquet

Next, read the .parquet file and display it as a DataFrame:

import pandas as pd
df = pd.read_parquet("green_tripdata_2025-11.parquet")

There are several ways to complete this task, but I prefer to extract the month, day, and year from the pickup datetime column and store them in separate columns.

df["month"] = df["lpep_pickup_datetime"].dt.month
df["day"] = df["lpep_pickup_datetime"].dt.day
df["year"] = df["lpep_pickup_datetime"].dt.year

Let’s aggregate the data by month and day, summing the trip_distance column, so we get the total trip distance for each date.

df[df["trip_distance"] < 100].groupby(["month", "day"]).agg(
    total=("trip_distance", "sum")
).sort_values("total", ascending=False)

Here is the resulting table of total trip distances (less than 100 miles), grouped by month and day:

Month  Day  Total Trip Distance
11     20   6377.03
11     19   6031.56
11     18   5976.12
11     6    5973.34
11     25   5954.80
11     5    5841.01

The day with the longest total trip distance is 20 November 2025.
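The same group-and-rank logic can be checked on a tiny synthetic DataFrame (toy dates and distances, not the real dataset); grouping by the calendar date and calling idxmax() picks the top day directly:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime([
        "2025-11-19 08:00", "2025-11-20 09:00", "2025-11-20 10:00",
    ]),
    "trip_distance": [4.0, 3.0, 3.5],  # toy values, all under 100 miles
})

# Total distance per pickup date; idxmax() returns the date with the largest sum.
daily = df_toy.groupby(df_toy["lpep_pickup_datetime"].dt.date)["trip_distance"].sum()
print(daily.idxmax())  # 2025-11-20
```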

Answer

  2. 2025-11-20

Question 5.

Which was the pickup zone with the largest total_amount (sum of all trips) on November 18th, 2025? (1 point)

  1. East Harlem North
  2. East Harlem South
  3. Morningside Heights
  4. Forest Hills

Solution

We need one more file for this task. Please download it using the code below:

!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv

Example of the data:

LocationID  Borough  Zone                     service_zone
1           EWR      Newark Airport           EWR
2           Queens   Jamaica Bay              Boro Zone
3           Bronx    Allerton/Pelham Gardens  Boro Zone

Let’s import both files:

import pandas as pd
df = pd.read_parquet("./green_tripdata_2025-11.parquet")
df_zone = pd.read_csv("./taxi_zone_lookup.csv")
df_zone = df_zone.dropna()

Next, merge the two tables using PULocationID and LocationID as the join keys:

df_joined = pd.merge(df, df_zone, left_on="PULocationID", right_on="LocationID")
df_joined

This will create one combined table (the main df and df_zone) and store it in the df_joined variable.

Now, let’s filter trips for November 18th, 2025, then group by zone and aggregate the sum of total_amount. Since df was read fresh from the Parquet file, the month, day, and year columns from Question 4 need to be created again first:

df_joined["month"] = df_joined["lpep_pickup_datetime"].dt.month
df_joined["day"] = df_joined["lpep_pickup_datetime"].dt.day
df_joined["year"] = df_joined["lpep_pickup_datetime"].dt.year

df_joined[
    (df_joined["day"] == 18) & (df_joined["month"] == 11) & (df_joined["year"] == 2025)
].groupby("Zone").agg(total=("total_amount", "sum")).sort_values("total", ascending=False)

The result is sorted from the highest to the lowest total_amount:

Zone                      total
East Harlem North         9281.92
East Harlem South         6696.13
Central Park              2378.79
Washington Heights South  2139.05

We can see that East Harlem North is the pickup zone with the largest total_amount.
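The join-then-aggregate pattern is easy to verify on toy data (made-up location IDs and amounts, loosely echoing the zone names above):

```python
import pandas as pd

trips = pd.DataFrame({
    "PULocationID": [74, 74, 75],
    "total_amount": [10.0, 5.0, 7.0],
})
zones = pd.DataFrame({
    "LocationID": [74, 75],
    "Zone": ["East Harlem North", "East Harlem South"],
})

# Inner join on the pickup location, then sum per zone.
joined = trips.merge(zones, left_on="PULocationID", right_on="LocationID")
totals = joined.groupby("Zone")["total_amount"].sum().sort_values(ascending=False)
print(totals.idxmax())  # East Harlem North
```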

Answer

  1. East Harlem North

Question 7.

Which of the following sequences describes the Terraform workflow for:

  • Downloading plugins and setting up backend
  • Generating and executing changes
  • Removing all resources? (1 point)
  1. terraform import, terraform apply -y, terraform destroy
  2. teraform init, terraform plan -auto-apply, terraform rm
  3. terraform init, terraform run -auto-approve, terraform destroy
  4. terraform init, terraform apply -auto-approve, terraform destroy
  5. terraform import, terraform apply -y, terraform rm

Solution

Check the Terraform module in this GitHub repo: https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/01-docker-terraform/terraform/1_terraform_overview.md#execution-steps

From this guideline, we can see that:

  1. terraform init is used to initialize the backend, install provider plugins, and check the configuration.
  2. terraform apply is used to generate and execute the changes; the -auto-approve flag skips the interactive confirmation.
  3. Lastly, we can use terraform destroy to remove all resources managed by the configuration.
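Put together, the workflow looks like this (illustrative only; it assumes Terraform is installed and a configuration is present in the working directory):

```shell
terraform init                  # download provider plugins, set up the backend
terraform plan                  # preview the changes (optional but recommended)
terraform apply -auto-approve   # generate and execute changes without prompting
terraform destroy               # remove all resources managed by this config
```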

So the answer is 4.

Answer

  4. terraform init, terraform apply -auto-approve, terraform destroy