Quick Facts
- Category: Data Science
- Published: 2026-05-04 02:55:31
Overview
Data pipelines have traditionally been the domain of software engineers wielding PySpark or Python scripts. However, a new stack — dlt (data load tool), dbt (data build tool), and Trino — allows analysts to build and maintain pipelines using nothing more than YAML configuration files. This guide walks you through replacing complex PySpark pipelines with four YAML files, cutting delivery time from weeks to a single day. By the end, you’ll understand how to set up a pipeline that extracts, loads, transforms, and queries data without writing a single line of Python or Spark code.

Prerequisites
Before diving in, ensure you have:
- Basic familiarity with SQL – dbt relies on SQL for transformations.
- Access to a running Trino cluster and the data store it queries (e.g., Snowflake, BigQuery, Postgres via Trino connectors) – Trino will serve as the query engine.
- Python 3.8+ installed (only for installing dlt and dbt; no coding required beyond setup).
- YAML editor – any text editor works.
- A source of data – API, database, or flat files you want to ingest.
This guide assumes you are comfortable running terminal commands and editing configuration files.
Step-by-Step Instructions
1. Setting Up the Tools
Install dlt, dbt with its Trino adapter, and the Trino Python client using pip (or conda):
pip install dlt dbt-core dbt-trino trino
Verify the installations:
dlt --version
dbt --version
(The trino package installed above is a Python client library rather than a command-line tool, so there is no trino --version to run; the Trino server, and the optional Trino CLI, are distributed separately.)
Create a project directory:
mkdir my_pipeline
cd my_pipeline
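If you don't already have a Trino coordinator to point at, one convenient option for local experimentation is the official Docker image; this is an optional shortcut rather than part of the stack itself:

# Optional: start a single-node Trino coordinator locally for testing.
# Uses the official trinodb/trino image and exposes the default port 8080.
docker run -d --name trino -p 8080:8080 trinodb/trino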
2. Configuring the Source – dlt YAML
dlt extracts data from sources and loads it into a destination. Create a file sources.yml:
# sources.yml
sources:
  my_api:
    type: rest_api
    config:
      base_url: "https://api.example.com/v1"
      endpoint: /data
      pagination: true
      # Add authentication if needed
      auth:
        api_key: "${API_KEY}"
This YAML tells dlt to fetch data from an API endpoint with pagination. Replace the URL and API key with your own. dlt supports many source types (databases, cloud storage, etc.).
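As a purely illustrative sketch, a database-backed source can follow the same declarative pattern; the my_postgres name, the sql_database type, and the config keys below are assumptions, so check the dlt source documentation for the exact schema your version expects:

# sources.yml (hypothetical variant): pulling tables from Postgres instead of a REST API.
# Names and keys here are illustrative, not canonical.
sources:
  my_postgres:
    type: sql_database
    config:
      connection_string: "postgresql://analyst:${DB_PASSWORD}@localhost:5432/app_db"
      tables:
        - users
        - orders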
3. Loading Data – dlt Destination YAML
Create destinations.yml to specify where data goes:
# destinations.yml
destinations:
  my_trino:
    type: trino
    config:
      host: localhost
      port: 8080
      database: my_db
      user: analyst
      password: "${TRINO_PASSWORD}"
Now define a pipeline in pipeline.yml that links the source and destination:
# pipeline.yml
pipeline:
  name: my_first_pipeline
  source: my_api
  destination: my_trino
  tables:
    - name: raw_data
      primary_key: id
      incremental: true
Run the pipeline with a single command:
dlt pipeline run pipeline.yml
Data is now loaded into Trino under the raw_data table.
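A quick sanity check from any Trino client confirms the load; the schema name below assumes dlt created a raw_data schema in the my_db catalog, so adjust it to wherever the table actually landed:

-- Sanity check (adjust the schema name to match your deployment)
SHOW TABLES FROM my_db.raw_data;
SELECT count(*) AS row_count FROM my_db.raw_data.raw_data;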
4. Transforming with dbt
dbt allows analysts to write SQL models. Initialize a dbt project inside your directory:
dbt init my_dbt_project
Edit profiles.yml (dbt looks for it in ~/.dbt/ by default) to point to your Trino instance:
# profiles.yml
my_dbt_project:
  outputs:
    dev:
      type: trino
      method: none          # switch to ldap (over HTTPS) if your cluster requires the password below
      host: localhost
      port: 8080
      database: my_db
      schema: analytics
      user: analyst
      password: "{{ env_var('TRINO_PASSWORD') }}"   # dbt reads env vars via env_var(), not ${VAR}
  target: dev
Create a transformation model in models/ – for example, aggregated_data.sql:

-- models/aggregated_data.sql
SELECT
  EXTRACT(YEAR FROM event_date) AS year,
  EXTRACT(MONTH FROM event_date) AS month,
  category,
  SUM(revenue) AS total_revenue
FROM {{ source('raw_data', 'raw_data') }}
GROUP BY 1, 2, 3
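One detail to note: dbt's source() macro only resolves if the source is declared in a YAML file under models/. A minimal sketch, assuming the raw_data schema and table produced by the dlt step above (adjust the names to match what actually landed in Trino):

# models/sources.yml
version: 2
sources:
  - name: raw_data
    database: my_db
    schema: raw_data
    tables:
      - name: raw_data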
Run dbt to apply transformations:
dbt run
This creates a table or view in Trino’s analytics schema.
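By default dbt builds models as views; if you want a physical table in Trino instead, a standard dbt config block at the top of the model does it:

-- Optional: add at the top of models/aggregated_data.sql to materialize
-- the model as a table rather than the default view.
{{ config(materialized='table') }}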
5. Querying with Trino
Now you can query the transformed data using any SQL client connected to Trino. For example:
-- Query from Trino CLI or your BI tool
SELECT * FROM my_db.analytics.aggregated_data
WHERE total_revenue > 100000
ORDER BY year, month;
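If you prefer the Trino CLI (distributed separately from the Python client), a connection using the catalog and schema names from this guide might look like this:

# Connect with the Trino CLI, assuming the names used throughout this guide.
trino --server http://localhost:8080 --catalog my_db --schema analytics --user analyst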
That’s it – a complete pipeline defined in just four YAML files (sources.yml, destinations.yml, pipeline.yml, and dbt’s profiles.yml plus one SQL model).
Common Mistakes
- Incorrect indentation in YAML – YAML is space-sensitive. Use 2 spaces per level, not tabs.
- Missing environment variables – Never hardcode secrets; use ${VAR} placeholders and export the variables before running, as shown in the sketch after this list.
- Pagination not enabled – dlt defaults to single-page fetches. If your API returns many records, enable pagination: true or specify a cursor.
- Database schema issues – Ensure the target schema (raw_data) exists in Trino before running the dlt pipeline. dlt may create it automatically, but not always.
- Trino user permissions – The user must have write access to the destination schema and read access to any sources.
- dbt model referencing the wrong source – Verify that the source name in source() matches the table created by dlt. Use dbt docs generate to check lineage.
- Ignoring incremental loading – Without incremental: true in pipeline.yml, dlt will overwrite the entire table on every run.
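A minimal sketch of the environment-variable pattern mentioned above, assuming the variable names used in the configuration files from earlier steps:

# Export secrets in the shell (or your scheduler's secret store) before running.
export API_KEY="your-api-key"
export TRINO_PASSWORD="your-trino-password"
dlt pipeline run pipeline.yml
dbt run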
Summary
By replacing PySpark with a stack of dlt, dbt, and Trino, organizations empower analysts to build and maintain data pipelines using YAML and SQL alone. The process reduces delivery time from weeks to one day, eliminates the need for dedicated engineering support, and keeps pipelines version-controlled and auditable. This guide demonstrated a complete end-to-end pipeline with four configuration files, covering extraction, loading, transformation, and querying. Start with a single use case, and scale from there.