
Polars Python Tutorial 2026: Faster Than Pandas with Lazy Evaluation

Polars has become one of the most talked-about data libraries in the Python ecosystem. Built in Rust, designed for parallelism, and free from the Global Interpreter Lock (GIL), it consistently outperforms pandas on benchmark after benchmark. This tutorial covers everything you need to start using Polars effectively in 2026 — from basic DataFrames to lazy evaluation, streaming, and ML integration.

TL;DR

  • Polars is a Rust-powered DataFrame library for Python that is consistently 2x–10x faster than pandas on large datasets.
  • It uses a lazy evaluation API that builds a query plan and optimizes it before execution — similar to Spark, but for single-machine workloads.
  • The expression system (pl.col(), pl.lit(), method chaining) is expressive and composable.
  • Polars can stream datasets larger than RAM using scan_csv / scan_parquet with .collect(engine="streaming").
  • Migration from pandas is straightforward: most operations have direct equivalents.
  • Polars integrates cleanly with Apache Arrow, DuckDB, and major ML frameworks (scikit-learn, PyTorch, XGBoost).
  • Install with pip install polars. No C compiler required.

Why Polars in 2026?

Polars crossed 30 million monthly PyPI downloads in early 2026, up from roughly 7.5 million at the start of 2024, a fourfold increase in two years. That number reflects a genuine shift in how Python data practitioners approach performance-critical pipelines.

What Makes Polars Fast

Rust core. Polars is implemented in Rust and exposed to Python via PyO3. The Rust execution engine avoids Python overhead for the hot path of data processing, and the compiled code benefits from LLVM-level optimizations.

No GIL. Python's Global Interpreter Lock prevents true multi-threading in CPython. Polars bypasses this entirely for its computation layer, allowing full use of all available CPU cores for operations like group-by, join, and filtering.
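
You can check the size of the thread pool Polars will use (it defaults to the number of logical CPUs and can be capped with the POLARS_MAX_THREADS environment variable):

import polars as pl

print(pl.thread_pool_size())  # e.g., 8 on an 8-core machine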

Apache Arrow memory layout. Polars stores data in columnar Arrow format. This layout is cache-friendly for analytical queries (which typically operate on one or two columns at a time), enables SIMD vectorization, and makes zero-copy data sharing with other Arrow-native tools (DuckDB, PyArrow, Datafusion) straightforward.

Lazy query optimizer. The lazy API builds a logical query plan that Polars rewrites before executing. Predicates are pushed as close to the data source as possible (predicate pushdown), only the columns that are actually needed are loaded (projection pushdown), and common sub-expressions are computed once.

Streaming execution. For datasets that exceed available RAM, Polars can execute queries in batches without loading everything into memory.


Polars vs. Pandas: Benchmark Comparison

The table below shows representative timings on a 10-million-row synthetic dataset (integer/string/float mix) on a 2025 laptop with 8 cores and 32 GB RAM. Results are wall-clock seconds; lower is better.

Operation                  pandas 2.x   Polars (eager)   Polars (lazy)   Speedup (lazy)
Filter single column       0.41 s       0.09 s           0.07 s          ~6x
Group-by + aggregation     3.12 s       0.38 s           0.31 s          ~10x
Inner join (two tables)    5.87 s       0.61 s           0.48 s          ~12x
Sort (multi-column)        2.44 s       0.29 s           0.25 s          ~10x
String contains filter     1.93 s       0.22 s           0.18 s          ~11x
Rolling mean (window=30)   4.10 s       0.51 s           0.44 s          ~9x

Pandas performs well on small datasets (< 100k rows) where overhead dominates, and it retains an advantage in certain operations that Polars has not yet fully optimized (e.g., complex multi-index operations). The section "When to Use Polars vs. Pandas" below discusses when to keep using pandas.


Installation

pip install polars

That is the minimal install. Polars ships as a self-contained wheel with no system-level dependencies — no C compiler, no BLAS, nothing extra.

Optional Dependencies

# Read/write Excel files
pip install polars[xlsx2csv,openpyxl]

# Full cloud storage support (S3, GCS, ABS)
pip install polars[fsspec]

# Delta Lake support
pip install polars[deltalake]

# All extras at once
pip install polars[all]

Verify the install:

import polars as pl
print(pl.__version__)  # e.g., 1.x.x

Creating DataFrames

From a Python Dictionary

import polars as pl

df = pl.DataFrame({
    "product_id": [101, 102, 103, 104],
    "name": ["Widget A", "Widget B", "Gadget X", "Gadget Y"],
    "price": [9.99, 14.99, 49.99, 59.99],
    "in_stock": [True, False, True, True],
})

print(df)

Output:

shape: (4, 4)
┌────────────┬──────────┬───────┬──────────┐
│ product_id ┆ name     ┆ price ┆ in_stock │
│ ---        ┆ ---      ┆ ---   ┆ ---      │
│ i64        ┆ str      ┆ f64   ┆ bool     │
╞════════════╪══════════╪═══════╪══════════╡
│ 101        ┆ Widget A ┆ 9.99  ┆ true     │
│ 102        ┆ Widget B ┆ 14.99 ┆ false    │
│ 103        ┆ Gadget X ┆ 49.99 ┆ true     │
│ 104        ┆ Gadget Y ┆ 59.99 ┆ true     │
└────────────┴──────────┴───────┴──────────┘

From CSV

# Eager read — entire file loaded into memory
df = pl.read_csv("sales_data.csv")

# Lazy read — file is scanned, not loaded
lf = pl.scan_csv("sales_data.csv")

From Parquet

df = pl.read_parquet("transactions.parquet")
lf = pl.scan_parquet("transactions.parquet")

# Read multiple files matching a glob pattern
lf = pl.scan_parquet("data/year=2025/month=*/*.parquet")

From JSON and NDJSON

# Standard JSON (array of objects)
df = pl.read_json("records.json")

# Newline-delimited JSON (one record per line) — more memory efficient
df = pl.read_ndjson("events.ndjson")
lf = pl.scan_ndjson("events.ndjson")

From a Pandas DataFrame

import pandas as pd

pandas_df = pd.read_csv("legacy_data.csv")
polars_df = pl.from_pandas(pandas_df)

Eager API Basics

The eager API executes operations immediately, similar to how pandas works. It is straightforward for interactive exploration and small-to-medium datasets.

select — Choose Columns

# Select specific columns
result = df.select(["product_id", "price"])

# Select with expressions
result = df.select([
    pl.col("product_id"),
    pl.col("price") * 1.1,           # apply 10% markup
    pl.col("name").str.to_uppercase(),
])

filter — Row Filtering

# Simple filter
cheap = df.filter(pl.col("price") < 20.0)

# Compound filter
available_widgets = df.filter(
    pl.col("in_stock") & pl.col("name").str.starts_with("Widget")
)

# Filter using isin
target_ids = [101, 103]
subset = df.filter(pl.col("product_id").is_in(target_ids))

with_columns — Add or Transform Columns

df = df.with_columns([
    (pl.col("price") * 1.2).alias("price_with_tax"),
    pl.col("name").str.split(" ").list.first().alias("category"),
    pl.when(pl.col("price") > 30).then(pl.lit("premium")).otherwise(pl.lit("standard")).alias("tier"),
])

group_by — Aggregation

import polars as pl

orders = pl.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East"],
    "product": ["A", "B", "A", "C", "A", "B"],
    "quantity": [10, 5, 7, 12, 3, 8],
    "revenue": [100.0, 75.0, 70.0, 180.0, 45.0, 120.0],
})

summary = orders.group_by("region").agg([
    pl.col("quantity").sum().alias("total_quantity"),
    pl.col("revenue").sum().alias("total_revenue"),
    pl.col("revenue").mean().alias("avg_revenue"),
    pl.col("product").n_unique().alias("distinct_products"),
])

print(summary.sort("total_revenue", descending=True))

join — Combining DataFrames

products = pl.DataFrame({
    "product_id": [1, 2, 3],
    "name": ["Alpha", "Beta", "Gamma"],
    "category": ["electronics", "clothing", "electronics"],
})

sales = pl.DataFrame({
    "product_id": [1, 1, 2, 3, 3, 3],
    "units_sold": [50, 30, 120, 10, 45, 20],
    "sale_date": ["2026-01-01", "2026-01-15", "2026-01-03", "2026-01-10", "2026-01-20", "2026-01-25"],
})

# Inner join
joined = products.join(sales, on="product_id", how="inner")

# Left join
full = products.join(sales, on="product_id", how="left")

# Anti-join: products with no sales
no_sales = products.join(sales, on="product_id", how="anti")

The Expression System

Polars expressions are the core building block of both the eager and lazy APIs. An expression describes a transformation — it is not evaluated until it is placed inside a select, filter, with_columns, or similar context.
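
Because an expression is only a description, you can build it once and reuse it in several contexts. A minimal sketch, reusing the products DataFrame from earlier:

# Nothing is evaluated at this point
high_price = pl.col("price") > 30

expensive = df.filter(high_price)                          # used as a filter
flagged = df.with_columns(high_price.alias("is_premium"))  # used as a new column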

pl.col() and pl.lit()

# Reference a column
expr = pl.col("price")

# Reference all columns
expr = pl.col("*")

# Reference columns by data type
expr = pl.col(pl.Float64)

# A literal value
expr = pl.lit(42)
expr = pl.lit("constant_string")

Method Chaining

Expressions are designed for chaining. Each method returns a new expression:

result = df.select(
    pl.col("revenue")
    .fill_null(0.0)
    .log1p()
    .round(4)
    .alias("log_revenue")
)

Conditional Logic with pl.when()

df = df.with_columns(
    pl.when(pl.col("score") >= 90)
    .then(pl.lit("A"))
    .when(pl.col("score") >= 80)
    .then(pl.lit("B"))
    .when(pl.col("score") >= 70)
    .then(pl.lit("C"))
    .otherwise(pl.lit("F"))
    .alias("grade")
)

String Operations

The .str namespace exposes vectorized string operations:

df = pl.DataFrame({
    "email": ["[email protected]", "[email protected]", "[email protected]"],
    "description": ["  hello world  ", "fast python   ", "data science"],
})

result = df.with_columns([
    pl.col("email").str.to_lowercase().alias("email_lower"),
    pl.col("email").str.split("@").list.last().alias("domain"),
    pl.col("email").str.contains(r"\.com$").alias("is_dotcom"),
    pl.col("description").str.strip_chars().alias("description_clean"),
    pl.col("description").str.replace_all(r"\s+", "_").alias("description_slug"),
])

Datetime Operations

The .dt namespace covers date and time arithmetic:

import polars as pl
from datetime import date

events = pl.DataFrame({
    "event_id": [1, 2, 3, 4],
    "occurred_at": pl.Series([
        "2026-01-15 08:30:00",
        "2026-03-22 14:15:00",
        "2026-05-01 09:00:00",
        "2026-05-10 17:45:00",
    ]).str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"),
})

result = events.with_columns([
    pl.col("occurred_at").dt.year().alias("year"),
    pl.col("occurred_at").dt.month().alias("month"),
    pl.col("occurred_at").dt.weekday().alias("weekday"),     # 0 = Monday
    pl.col("occurred_at").dt.hour().alias("hour"),
    pl.col("occurred_at").dt.truncate("1d").alias("date"),  # floor to day
    (pl.col("occurred_at") + pl.duration(days=7)).alias("one_week_later"),
])

List Operations

Polars supports nested list columns natively via the .list namespace:

df = pl.DataFrame({
    "user_id": [1, 2, 3],
    "tags": [["python", "data"], ["rust", "systems", "fast"], ["python", "ml"]],
})

result = df.with_columns([
    pl.col("tags").list.len().alias("tag_count"),
    pl.col("tags").list.first().alias("primary_tag"),
    pl.col("tags").list.contains("python").alias("is_python_user"),
    pl.col("tags").list.sort().alias("tags_sorted"),
])

# Explode list into individual rows
exploded = df.explode("tags")

Lazy API: Deferred Execution with Query Optimization

The lazy API is what separates Polars from most DataFrame libraries. Instead of executing operations immediately, you build a logical plan. Polars optimizes that plan before any data is read or computed.

Building a Lazy Query

import polars as pl

result = (
    pl.scan_csv("large_transactions.csv")          # no data loaded yet
    .filter(pl.col("status") == "completed")       # predicate registered
    .filter(pl.col("amount") > 100.0)
    .select(["user_id", "amount", "timestamp"])    # projection registered
    .group_by("user_id")
    .agg(
        pl.col("amount").sum().alias("total_spend"),
        pl.col("amount").count().alias("txn_count"),
    )
    .sort("total_spend", descending=True)
    .limit(1000)
    .collect()                                     # execute NOW
)

Polars sees the entire chain before touching the CSV. It will:

  • Push the two filter conditions into the CSV scan, skipping rows early.
  • Load only user_id, amount, and timestamp (projection pushdown), ignoring all other columns.
  • Determine that sort + limit can be merged into a partial sort.

Inspecting the Query Plan

lf = (
    pl.scan_parquet("data/*.parquet")
    .filter(pl.col("country") == "US")
    .group_by("category")
    .agg(pl.col("revenue").sum())
)

# Unoptimized logical plan
print(lf.explain(optimized=False))

# Optimized plan (what Polars actually runs)
print(lf.explain(optimized=True))

Query Optimization Deep Dive

Predicate Pushdown

Without pushdown, all rows are loaded then filtered. With pushdown, the filter is applied at the scan level. For a 10M-row Parquet file where only 5% of rows match, predicate pushdown can reduce IO by 95% before any computation starts.

# Polars automatically pushes this filter into the Parquet scan
result = (
    pl.scan_parquet("events.parquet")
    .filter(pl.col("event_type") == "purchase")  # pushed to scan
    .select(["user_id", "event_type", "value"])  # only these columns loaded
    .collect()
)

Projection Pushdown

Only the columns referenced downstream are loaded. If your Parquet file has 80 columns but your query references 4, Polars reads only those 4.
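
A minimal sketch (the file name and column count here are hypothetical; the optimized plan reports how many columns the scan will read):

lf = (
    pl.scan_parquet("wide_table.parquet")   # suppose this file has 80 columns
    .group_by("user_id")
    .agg(pl.col("revenue").sum())
)
print(lf.explain())  # the scan node shows only user_id and revenue being read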

Common Sub-expression Elimination (CSE)

If the same expression appears multiple times in a query, Polars computes it once and reuses the result:

result = df.lazy().select([
    (pl.col("a") + pl.col("b")).alias("sum_ab"),
    (pl.col("a") + pl.col("b")).pow(2).alias("sum_ab_squared"),
    (pl.col("a") + pl.col("b")) / pl.col("c").alias("ratio"),
]).collect()
# `pl.col("a") + pl.col("b")` is computed once, not three times

Streaming Large Datasets

When a dataset is larger than available RAM, use .collect(engine="streaming"):

result = (
    pl.scan_parquet("huge_dataset/*.parquet")  # could be hundreds of GBs
    .filter(pl.col("year") == 2025)
    .group_by("country")
    .agg(pl.col("revenue").sum())
    .collect(engine="streaming")               # process in memory-bounded batches
)

Streaming works with most Polars operations. A few complex operations (e.g., full sort of the entire dataset) temporarily buffer more data. For pipelines that must be strictly memory-bounded, structure the query to filter aggressively before sorting.
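
As a sketch (file and column names are made up), filtering first keeps the sort's buffer small:

result = (
    pl.scan_parquet("huge/*.parquet")
    .filter(pl.col("score") > 0.9)     # shrink the data before the sort buffers it
    .sort("score", descending=True)
    .collect(engine="streaming")
)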

Sink to File Without Collecting

For ETL pipelines where the output is a file rather than an in-memory DataFrame:

(
    pl.scan_csv("input_data/*.csv")
    .filter(pl.col("valid") == True)
    .with_columns(pl.col("amount").cast(pl.Float64))
    .sink_parquet("output/cleaned.parquet")   # streams directly to disk
)

sink_parquet and sink_csv write results incrementally, so peak memory is bounded by the batch size rather than the full dataset.


Window Functions and Rolling Calculations

Window Functions (over)

sales = pl.DataFrame({
    "date": ["2026-01-01", "2026-01-02", "2026-01-03", "2026-01-01", "2026-01-02"],
    "region": ["North", "North", "North", "South", "South"],
    "revenue": [100.0, 150.0, 120.0, 80.0, 95.0],
})

result = sales.with_columns([
    pl.col("revenue").sum().over("region").alias("region_total"),
    pl.col("revenue").rank(descending=True).over("region").alias("rank_in_region"),
    (pl.col("revenue") / pl.col("revenue").sum().over("region")).alias("pct_of_region"),
])

Rolling / Moving Window Calculations

import polars as pl
from datetime import date

time_series = pl.DataFrame({
    "date": pl.date_range(
        start=date(2026, 1, 1),
        end=date(2026, 3, 31),
        interval="1d",
        eager=True,
    ),
    "value": [float(i) + (i % 7) * 2.5 for i in range(90)],
})

result = time_series.with_columns([
    pl.col("value").rolling_mean(window_size=7).alias("ma_7"),
    pl.col("value").rolling_mean(window_size=30).alias("ma_30"),
    pl.col("value").rolling_std(window_size=7).alias("std_7"),
    pl.col("value").rolling_min(window_size=7).alias("min_7"),
    pl.col("value").rolling_max(window_size=7).alias("max_7"),
])

Cumulative Operations

df = df.with_columns([
    pl.col("revenue").cum_sum().alias("cumulative_revenue"),
    pl.col("quantity").cum_prod().alias("cumulative_quantity"),
    pl.col("revenue").diff().alias("revenue_delta"),
    pl.col("revenue").pct_change().alias("revenue_pct_change"),
])

Migrating from Pandas

Most pandas operations have a direct Polars equivalent. The main conceptual shift is moving from index-based access to expression-based access — Polars DataFrames have no row index.

Common Operation Equivalents

Task               pandas                          Polars
Load CSV           pd.read_csv("f.csv")            pl.read_csv("f.csv")
Select columns     df[["a", "b"]]                  df.select(["a", "b"])
Filter rows        df[df["a"] > 5]                 df.filter(pl.col("a") > 5)
Add column         df["c"] = df["a"] + 1           df.with_columns((pl.col("a") + 1).alias("c"))
Rename column      df.rename(columns={"a": "x"})   df.rename({"a": "x"})
Drop column        df.drop("a", axis=1)            df.drop("a")
Sort               df.sort_values("a")             df.sort("a")
Group-by mean      df.groupby("g")["v"].mean()     df.group_by("g").agg(pl.col("v").mean())
Pivot              df.pivot_table(...)             df.pivot(...)
Melt / unpivot     df.melt(...)                    df.unpivot(...)
Null check         df["a"].isna()                  pl.col("a").is_null()
Fill nulls         df["a"].fillna(0)               pl.col("a").fill_null(0)
Apply function     df["a"].apply(fn)               pl.col("a").map_elements(fn) (avoid if possible)
String contains    df["s"].str.contains("x")       pl.col("s").str.contains("x")
To numpy           df["a"].values                  df["a"].to_numpy()
To pandas          n/a                             df.to_pandas()

Key Differences to Remember

No in-place modification. Polars DataFrames are immutable. Methods return new DataFrames:

# pandas (in-place)
df["new_col"] = df["a"] * 2

# Polars (returns new DataFrame)
df = df.with_columns((pl.col("a") * 2).alias("new_col"))

No row index. Polars has no .loc/.iloc index. Use .filter() for row selection and .row() or slicing for positional access:

# Get row 5 as a tuple
row = df.row(5)

# Get 10 rows starting at row 10 (rows 10–19)
subset = df.slice(10, 10)

map_elements is an escape hatch, not a pattern. In pandas, .apply() is common. In Polars, map_elements drops to Python and loses all parallelism benefits. Prefer built-in expressions:

# Slow: drops to Python interpreter
df.with_columns(pl.col("text").map_elements(lambda s: s.upper()))

# Fast: vectorized Rust
df.with_columns(pl.col("text").str.to_uppercase())

When to Use Polars vs. Pandas

Polars is not a drop-in replacement for every pandas use case. Here is an honest assessment:

Choose Polars when:

  • Your dataset has more than ~500k rows and performance matters.
  • You need lazy evaluation or streaming for out-of-core data.
  • You want multi-core utilization without writing multiprocessing boilerplate.
  • You are building production data pipelines where query plan optimization reduces IO.
  • You work with Parquet, Arrow IPC, or NDJSON natively.

Pandas still wins when:

  • You rely on a library that requires pandas DataFrames as input and does not accept Arrow (some legacy ML or visualization libraries).
  • You heavily use MultiIndex, which Polars does not support.
  • Your team is deeply familiar with pandas and the dataset is small enough that performance is not a concern.
  • You need .plot() backed by matplotlib without extra setup (pandas has built-in plot integration).
  • You are doing exploratory work in Jupyter where interactive index-based inspection (df.loc[label]) is convenient.

In practice, many teams run both: Polars for heavy ETL and preprocessing pipelines, pandas (or polars .to_pandas()) for the final handoff to visualization or legacy model-fitting code.
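
A typical handoff looks something like this (a sketch; file and column names are made up):

import polars as pl

# Heavy ETL in Polars
monthly = (
    pl.scan_parquet("raw/*.parquet")
    .filter(pl.col("amount") > 0)
    .group_by("month")
    .agg(pl.col("amount").sum().alias("total"))
    .sort("month")
    .collect()
)

# Final handoff to pandas for plotting
monthly.to_pandas().plot(x="month", y="total")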


Integration with Arrow, DuckDB, and ML Libraries

Apache Arrow

Polars is natively Arrow-compatible. Zero-copy conversion between Polars and PyArrow:

import pyarrow as pa

# Polars DataFrame → Arrow Table (zero-copy)
arrow_table = df.to_arrow()

# Arrow Table → Polars DataFrame (zero-copy)
polars_df = pl.from_arrow(arrow_table)

# Write Arrow IPC (feather v2)
df.write_ipc("data.arrow")
df_back = pl.read_ipc("data.arrow")

DuckDB

DuckDB can query Polars DataFrames and LazyFrames directly. The integration is zero-copy via Arrow:

import duckdb

df = pl.read_parquet("sales.parquet")

# DuckDB can reference the Polars DataFrame by name
result = duckdb.sql("""
    SELECT
        region,
        SUM(revenue)    AS total_revenue,
        COUNT(*)        AS num_transactions
    FROM df
    WHERE revenue > 100
    GROUP BY region
    ORDER BY total_revenue DESC
""").pl()  # .pl() returns a Polars DataFrame

This combination is powerful for ad-hoc SQL on in-memory or on-disk data — DuckDB handles the SQL interface, Polars handles the DataFrame operations.
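
Because .pl() hands back a regular Polars DataFrame, you can keep chaining Polars expressions after the SQL step:

avg_txn = result.with_columns(
    (pl.col("total_revenue") / pl.col("num_transactions")).alias("avg_transaction")
)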

Scikit-learn

Polars integrates with scikit-learn via numpy conversion. As of scikit-learn 1.4+, set_output(transform="polars") is supported in many transformers:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Convert features to numpy for fitting
X = df.select(["feature_1", "feature_2", "feature_3"]).to_numpy()
y = df["label"].to_numpy()

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Or use polars output (sklearn 1.4+)
scaler.set_output(transform="polars")

PyTorch and XGBoost

import torch
import xgboost as xgb

feature_cols = ["f1", "f2", "f3"]
X = df.select(feature_cols).to_numpy()
y = df["target"].to_numpy()

# PyTorch tensor
tensor = torch.from_numpy(X)

# XGBoost DMatrix
dtrain = xgb.DMatrix(X, label=y)

For large training datasets, use scan_parquet with lazy evaluation and .collect() in batches to feed a training loop without loading everything into GPU memory at once.
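
A minimal sketch of that pattern (the file path, feature_cols, and target column are hypothetical):

lf = pl.scan_parquet("train/*.parquet")
n_rows = lf.select(pl.len()).collect().item()
batch_size = 100_000

for offset in range(0, n_rows, batch_size):
    batch = lf.slice(offset, batch_size).collect()  # one bounded batch at a time
    X = batch.select(feature_cols).to_numpy()
    y = batch["target"].to_numpy()
    # ... run one training step on X and y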


FAQ

Q: Is Polars stable enough for production in 2026? A: Yes. Polars 1.0 was released in 2024, signaling API stability. The library is used in production at multiple data-intensive companies. The API surface has been stable since 1.0, and breaking changes are announced well in advance.

Q: Does Polars work with AWS S3, GCS, and Azure Blob Storage? A: Yes, via the fsspec extra (pip install polars[fsspec]). scan_parquet("s3://bucket/path/*.parquet") works out of the box once credentials are configured in your environment.
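
For example (bucket, path, and column name are placeholders; credentials come from the environment):

lf = pl.scan_parquet("s3://my-bucket/events/*.parquet")
df = lf.filter(pl.col("event_type") == "purchase").collect()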

Q: Can I use Polars in a Jupyter notebook? A: Yes. Polars DataFrames render with HTML formatting in Jupyter out of the box, similar to pandas.

Q: How does Polars handle missing data? A: Polars uses null (not NaN) for missing values across all data types, including floats. NaN is a distinct floating-point value in Polars (not treated as missing). Use is_null() / fill_null() for missing data and is_nan() / fill_nan() for IEEE NaN floats.
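
A quick illustration of the distinction:

s = pl.Series([1.0, None, float("nan")])
print(s.is_null())  # [false, true, false]: only None is missing
print(s.is_nan())   # [false, null, true]: NaN is a float value, null propagates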

Q: Does Polars support GPU acceleration? A: As of mid-2026, Polars is experimenting with a GPU engine (NVIDIA cuDF integration) in the lazy API. Pass engine="gpu" to .collect() if you have a compatible CUDA device. The feature is opt-in and not yet considered stable for all operations.

Q: How do I read a very large CSV that does not fit in RAM? A: Use scan_csv and call .collect(engine="streaming"), or use .sink_parquet() / .sink_csv() to write the result directly to disk without collecting into memory.

Q: What is the difference between group_by and group_by_dynamic? A: group_by groups by discrete column values (like pandas groupby). group_by_dynamic groups a datetime column into time windows (e.g., every 1 hour, every 7 days) and is designed for time-series resampling.
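
A small sketch of group_by_dynamic, assuming a frame with a datetime column "ts" and a numeric column "value":

hourly = (
    df.sort("ts")                          # group_by_dynamic expects sorted input
    .group_by_dynamic("ts", every="1h")
    .agg(pl.col("value").sum().alias("hourly_total"))
)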

