Polars has become one of the most talked-about data libraries in the Python ecosystem. Built in Rust, designed for parallelism, and free from the Global Interpreter Lock (GIL), it consistently outperforms pandas on benchmark after benchmark. This tutorial covers everything you need to start using Polars effectively in 2026 — from basic DataFrames to lazy evaluation, streaming, and ML integration.
TL;DR
- Polars is a Rust-powered DataFrame library for Python that is consistently 2x–10x faster than pandas on large datasets.
- It uses a lazy evaluation API that builds a query plan and optimizes it before execution — similar to Spark, but for single-machine workloads.
- The expression system (pl.col(), pl.lit(), method chaining) is expressive and composable.
- Polars can stream datasets larger than RAM using scan_csv/scan_parquet with .collect(engine="streaming").
- Migration from pandas is straightforward: most operations have direct equivalents.
- Polars integrates cleanly with Apache Arrow, DuckDB, and major ML frameworks (scikit-learn, PyTorch, XGBoost).
- Install with pip install polars. No C compiler required.
Why Polars in 2026?
Polars crossed 30 million monthly PyPI downloads in early 2026, up from roughly 7.5 million at the start of 2024, a fourfold increase in two years. That number reflects a genuine shift in how Python data practitioners approach performance-critical pipelines.
What Makes Polars Fast
Rust core. Polars is implemented in Rust and exposed to Python via PyO3. The Rust execution engine avoids Python overhead for the hot path of data processing, and the compiled code benefits from LLVM-level optimizations.
No GIL. Python's Global Interpreter Lock prevents true multi-threading in CPython. Polars bypasses this entirely for its computation layer, allowing full use of all available CPU cores for operations like group-by, join, and filtering.
Apache Arrow memory layout. Polars stores data in columnar Arrow format. This layout is cache-friendly for analytical queries (which typically operate on one or two columns at a time), enables SIMD vectorization, and makes zero-copy data sharing with other Arrow-native tools (DuckDB, PyArrow, Datafusion) straightforward.
Lazy query optimizer. The lazy API builds a logical query plan that Polars rewrites before executing. Predicates are pushed as close to the data source as possible (predicate pushdown), only the columns that are actually needed are loaded (projection pushdown), and common sub-expressions are computed once.
Streaming execution. For datasets that exceed available RAM, Polars can execute queries in batches without loading everything into memory.
Polars vs. Pandas: Benchmark Comparison
The table below shows representative timings on a 10-million-row synthetic dataset (integer/string/float mix) on a 2025 laptop with 8 cores and 32 GB RAM. Results are wall-clock seconds, lower is better.
| Operation | pandas 2.x | Polars (eager) | Polars (lazy) | Speedup (lazy) |
|---|---|---|---|---|
| Filter single column | 0.41 s | 0.09 s | 0.07 s | ~6x |
| Group-by + aggregation | 3.12 s | 0.38 s | 0.31 s | ~10x |
| Inner join (two tables) | 5.87 s | 0.61 s | 0.48 s | ~12x |
| Sort (multi-column) | 2.44 s | 0.29 s | 0.25 s | ~10x |
| String contains filter | 1.93 s | 0.22 s | 0.18 s | ~11x |
| Rolling mean (window=30) | 4.10 s | 0.51 s | 0.44 s | ~9x |
Pandas performs well on small datasets (< 100k rows) where fixed overhead dominates, and it retains an advantage in certain operations that Polars has not yet fully optimized (e.g., complex multi-index operations). The "When to Use Polars vs. Pandas" section below discusses when to keep using pandas.
Installation
pip install polars
That is the minimal install. Polars ships as a self-contained wheel with no system-level dependencies — no C compiler, no BLAS, nothing extra.
Optional Dependencies
# Read/write Excel files
pip install polars[excel]
# Full cloud storage support (S3, GCS, ABS)
pip install polars[fsspec]
# Delta Lake support
pip install polars[deltalake]
# All extras at once
pip install polars[all]
Verify the install:
import polars as pl
print(pl.__version__) # e.g., 1.x.x
Creating DataFrames
From a Python Dictionary
import polars as pl
df = pl.DataFrame({
"product_id": [101, 102, 103, 104],
"name": ["Widget A", "Widget B", "Gadget X", "Gadget Y"],
"price": [9.99, 14.99, 49.99, 59.99],
"in_stock": [True, False, True, True],
})
print(df)
Output:
shape: (4, 4)
┌────────────┬──────────┬───────┬──────────┐
│ product_id ┆ name ┆ price ┆ in_stock │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ bool │
╞════════════╪══════════╪═══════╪══════════╡
│ 101 ┆ Widget A ┆ 9.99 ┆ true │
│ 102 ┆ Widget B ┆ 14.99 ┆ false │
│ 103 ┆ Gadget X ┆ 49.99 ┆ true │
│ 104 ┆ Gadget Y ┆ 59.99 ┆ true │
└────────────┴──────────┴───────┴──────────┘
From CSV
# Eager read — entire file loaded into memory
df = pl.read_csv("sales_data.csv")
# Lazy read — file is scanned, not loaded
lf = pl.scan_csv("sales_data.csv")
From Parquet
df = pl.read_parquet("transactions.parquet")
lf = pl.scan_parquet("transactions.parquet")
# Read multiple files matching a glob pattern
lf = pl.scan_parquet("data/year=2025/month=*/*.parquet")
From JSON and NDJSON
# Standard JSON (array of objects)
df = pl.read_json("records.json")
# Newline-delimited JSON (one record per line) — more memory efficient
df = pl.read_ndjson("events.ndjson")
lf = pl.scan_ndjson("events.ndjson")
From a Pandas DataFrame
import pandas as pd
pandas_df = pd.read_csv("legacy_data.csv")
polars_df = pl.from_pandas(pandas_df)
Eager API Basics
The eager API executes operations immediately, similar to how pandas works. It is straightforward for interactive exploration and small-to-medium datasets.
select — Choose Columns
# Select specific columns
result = df.select(["product_id", "price"])
# Select with expressions
result = df.select([
pl.col("product_id"),
pl.col("price") * 1.1, # apply 10% markup
pl.col("name").str.to_uppercase(),
])
filter — Row Filtering
# Simple filter
cheap = df.filter(pl.col("price") < 20.0)
# Compound filter
available_widgets = df.filter(
    pl.col("in_stock") & pl.col("name").str.starts_with("Widget")
)
# Filter using isin
target_ids = [101, 103]
subset = df.filter(pl.col("product_id").is_in(target_ids))
with_columns — Add or Transform Columns
df = df.with_columns([
(pl.col("price") * 1.2).alias("price_with_tax"),
pl.col("name").str.split(" ").list.first().alias("category"),
pl.when(pl.col("price") > 30).then(pl.lit("premium")).otherwise(pl.lit("standard")).alias("tier"),
])
group_by — Aggregation
import polars as pl
orders = pl.DataFrame({
"region": ["North", "South", "North", "East", "South", "East"],
"product": ["A", "B", "A", "C", "A", "B"],
"quantity": [10, 5, 7, 12, 3, 8],
"revenue": [100.0, 75.0, 70.0, 180.0, 45.0, 120.0],
})
summary = orders.group_by("region").agg([
pl.col("quantity").sum().alias("total_quantity"),
pl.col("revenue").sum().alias("total_revenue"),
pl.col("revenue").mean().alias("avg_revenue"),
pl.col("product").n_unique().alias("distinct_products"),
])
print(summary.sort("total_revenue", descending=True))
join — Combining DataFrames
products = pl.DataFrame({
"product_id": [1, 2, 3],
"name": ["Alpha", "Beta", "Gamma"],
"category": ["electronics", "clothing", "electronics"],
})
sales = pl.DataFrame({
"product_id": [1, 1, 2, 3, 3, 3],
"units_sold": [50, 30, 120, 10, 45, 20],
"sale_date": ["2026-01-01", "2026-01-15", "2026-01-03", "2026-01-10", "2026-01-20", "2026-01-25"],
})
# Inner join
joined = products.join(sales, on="product_id", how="inner")
# Left join
full = products.join(sales, on="product_id", how="left")
# Anti-join: products with no sales
no_sales = products.join(sales, on="product_id", how="anti")
The Expression System
Polars expressions are the core building block of both the eager and lazy APIs. An expression describes a transformation — it is not evaluated until it is placed inside a select, filter, with_columns, or similar context.
pl.col() and pl.lit()
# Reference a column
expr = pl.col("price")
# Reference all columns
expr = pl.col("*")
# Reference columns by data type
expr = pl.col(pl.Float64)
# A literal value
expr = pl.lit(42)
expr = pl.lit("constant_string")
Method Chaining
Expressions are designed for chaining. Each method returns a new expression:
result = df.select(
pl.col("revenue")
.fill_null(0.0)
.log1p()
.round(4)
.alias("log_revenue")
)
Conditional Logic with pl.when()
df = df.with_columns(
pl.when(pl.col("score") >= 90)
.then(pl.lit("A"))
.when(pl.col("score") >= 80)
.then(pl.lit("B"))
.when(pl.col("score") >= 70)
.then(pl.lit("C"))
.otherwise(pl.lit("F"))
.alias("grade")
)
String Operations
The .str namespace exposes vectorized string operations:
df = pl.DataFrame({
"email": ["[email protected]", "[email protected]", "[email protected]"],
"description": [" hello world ", "fast python ", "data science"],
})
result = df.with_columns([
pl.col("email").str.to_lowercase().alias("email_lower"),
pl.col("email").str.split("@").list.last().alias("domain"),
pl.col("email").str.contains(r"\.com$").alias("is_dotcom"),
pl.col("description").str.strip_chars().alias("description_clean"),
pl.col("description").str.replace_all(r"\s+", "_").alias("description_slug"),
])
Datetime Operations
The .dt namespace covers date and time arithmetic:
import polars as pl
from datetime import date
events = pl.DataFrame({
"event_id": [1, 2, 3, 4],
"occurred_at": pl.Series([
"2026-01-15 08:30:00",
"2026-03-22 14:15:00",
"2026-05-01 09:00:00",
"2026-05-10 17:45:00",
]).str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"),
})
result = events.with_columns([
pl.col("occurred_at").dt.year().alias("year"),
pl.col("occurred_at").dt.month().alias("month"),
pl.col("occurred_at").dt.weekday().alias("weekday"),  # ISO weekday: 1 = Monday, 7 = Sunday
pl.col("occurred_at").dt.hour().alias("hour"),
pl.col("occurred_at").dt.truncate("1d").alias("date"), # floor to day
(pl.col("occurred_at") + pl.duration(days=7)).alias("one_week_later"),
])
List Operations
Polars supports nested list columns natively via the .list namespace:
df = pl.DataFrame({
"user_id": [1, 2, 3],
"tags": [["python", "data"], ["rust", "systems", "fast"], ["python", "ml"]],
})
result = df.with_columns([
pl.col("tags").list.len().alias("tag_count"),
pl.col("tags").list.first().alias("primary_tag"),
pl.col("tags").list.contains("python").alias("is_python_user"),
pl.col("tags").list.sort().alias("tags_sorted"),
])
# Explode list into individual rows
exploded = df.explode("tags")
Lazy API: Deferred Execution with Query Optimization
The lazy API is what separates Polars from most DataFrame libraries. Instead of executing operations immediately, you build a logical plan. Polars optimizes that plan before any data is read or computed.
Building a Lazy Query
import polars as pl
result = (
pl.scan_csv("large_transactions.csv") # no data loaded yet
.filter(pl.col("status") == "completed") # predicate registered
.filter(pl.col("amount") > 100.0)
.select(["user_id", "amount", "timestamp"]) # projection registered
.group_by("user_id")
.agg(
pl.col("amount").sum().alias("total_spend"),
pl.col("amount").count().alias("txn_count"),
)
.sort("total_spend", descending=True)
.limit(1000)
.collect() # execute NOW
)
Polars sees the entire chain before touching the CSV. It will:
- Push the two filter conditions into the CSV scan, skipping rows early.
- Load only user_id, amount, and timestamp (projection pushdown), ignoring all other columns.
- Determine that sort + limit can be merged into a partial (top-k) sort.
Inspecting the Query Plan
lf = (
pl.scan_parquet("data/*.parquet")
.filter(pl.col("country") == "US")
.group_by("category")
.agg(pl.col("revenue").sum())
)
# Unoptimized logical plan
print(lf.explain(optimized=False))
# Optimized plan (what Polars actually runs)
print(lf.explain(optimized=True))
Query Optimization Deep Dive
Predicate Pushdown
Without pushdown, all rows are loaded then filtered. With pushdown, the filter is applied at the scan level. For a 10M-row Parquet file where only 5% of rows match, predicate pushdown can reduce IO by 95% before any computation starts.
# Polars automatically pushes this filter into the Parquet scan
lf = (
pl.scan_parquet("events.parquet")
.filter(pl.col("event_type") == "purchase") # pushed to scan
.select(["user_id", "event_type", "value"]) # only these columns loaded
.collect()
)
Projection Pushdown
Only the columns referenced downstream are loaded. If your Parquet file has 80 columns but your query references 4, Polars reads only those 4.
Common Sub-expression Elimination (CSE)
If the same expression appears multiple times in a query, Polars computes it once and reuses the result:
result = df.lazy().select([
(pl.col("a") + pl.col("b")).alias("sum_ab"),
(pl.col("a") + pl.col("b")).pow(2).alias("sum_ab_squared"),
    ((pl.col("a") + pl.col("b")) / pl.col("c")).alias("ratio"),
]).collect()
# `pl.col("a") + pl.col("b")` is computed once, not three times
Streaming Large Datasets
When a dataset is larger than available RAM, use .collect(engine="streaming"):
result = (
pl.scan_parquet("huge_dataset/*.parquet") # could be hundreds of GBs
.filter(pl.col("year") == 2025)
.group_by("country")
.agg(pl.col("revenue").sum())
.collect(engine="streaming") # process in memory-bounded batches
)
Streaming works with most Polars operations. A few complex operations (e.g., full sort of the entire dataset) temporarily buffer more data. For pipelines that must be strictly memory-bounded, structure the query to filter aggressively before sorting.
Sink to File Without Collecting
For ETL pipelines where the output is a file rather than an in-memory DataFrame:
(
pl.scan_csv("input_data/*.csv")
    .filter(pl.col("valid"))
.with_columns(pl.col("amount").cast(pl.Float64))
.sink_parquet("output/cleaned.parquet") # streams directly to disk
)
sink_parquet and sink_csv write results incrementally, so peak memory is bounded by the batch size rather than the full dataset.
Window Functions and Rolling Calculations
Window Functions (over)
sales = pl.DataFrame({
"date": ["2026-01-01", "2026-01-02", "2026-01-03", "2026-01-01", "2026-01-02"],
"region": ["North", "North", "North", "South", "South"],
"revenue": [100.0, 150.0, 120.0, 80.0, 95.0],
})
result = sales.with_columns([
pl.col("revenue").sum().over("region").alias("region_total"),
pl.col("revenue").rank(descending=True).over("region").alias("rank_in_region"),
(pl.col("revenue") / pl.col("revenue").sum().over("region")).alias("pct_of_region"),
])
Rolling / Moving Window Calculations
from datetime import date

time_series = pl.DataFrame({
"date": pl.date_range(
start=date(2026, 1, 1),
end=date(2026, 3, 31),
interval="1d",
eager=True,
),
"value": [float(i) + (i % 7) * 2.5 for i in range(90)],
})
result = time_series.with_columns([
pl.col("value").rolling_mean(window_size=7).alias("ma_7"),
pl.col("value").rolling_mean(window_size=30).alias("ma_30"),
pl.col("value").rolling_std(window_size=7).alias("std_7"),
pl.col("value").rolling_min(window_size=7).alias("min_7"),
pl.col("value").rolling_max(window_size=7).alias("max_7"),
])
Cumulative Operations
df = df.with_columns([
pl.col("revenue").cum_sum().alias("cumulative_revenue"),
    pl.col("quantity").cum_prod().alias("quantity_cum_prod"),
pl.col("revenue").diff().alias("revenue_delta"),
pl.col("revenue").pct_change().alias("revenue_pct_change"),
])
Migrating from Pandas
Most pandas operations have a direct Polars equivalent. The main conceptual shift is moving from index-based access to expression-based access — Polars DataFrames have no row index.
Common Operation Equivalents
| Task | pandas | Polars |
|---|---|---|
| Load CSV | pd.read_csv("f.csv") | pl.read_csv("f.csv") |
| Select columns | df[["a", "b"]] | df.select(["a", "b"]) |
| Filter rows | df[df["a"] > 5] | df.filter(pl.col("a") > 5) |
| Add column | df["c"] = df["a"] + 1 | df.with_columns((pl.col("a") + 1).alias("c")) |
| Rename column | df.rename({"a": "x"}) | df.rename({"a": "x"}) |
| Drop column | df.drop("a", axis=1) | df.drop("a") |
| Sort | df.sort_values("a") | df.sort("a") |
| Group-by mean | df.groupby("g")["v"].mean() | df.group_by("g").agg(pl.col("v").mean()) |
| Pivot | df.pivot_table(...) | df.pivot(...) |
| Melt / unpivot | df.melt(...) | df.unpivot(...) |
| Null check | df["a"].isna() | pl.col("a").is_null() |
| Fill nulls | df["a"].fillna(0) | pl.col("a").fill_null(0) |
| Apply function | df["a"].apply(fn) | pl.col("a").map_elements(fn) (avoid if possible) |
| String contains | df["s"].str.contains("x") | pl.col("s").str.contains("x") |
| To numpy | df["a"].values | df["a"].to_numpy() |
| To pandas | — | df.to_pandas() |
Key Differences to Remember
No in-place modification. Polars DataFrames are immutable. Methods return new DataFrames:
# pandas (in-place)
df["new_col"] = df["a"] * 2
# Polars (returns new DataFrame)
df = df.with_columns((pl.col("a") * 2).alias("new_col"))
No row index. Polars has no .loc/.iloc index. Use .filter() for row selection and .row() or slicing for positional access:
# Get row 5 as a tuple
row = df.row(5)
# Get rows 10–20
subset = df.slice(10, 10)
map_elements is an escape hatch, not a pattern. In pandas, .apply() is common. In Polars, map_elements drops to Python and loses all parallelism benefits. Prefer built-in expressions:
# Slow: drops to Python interpreter
df.with_columns(pl.col("text").map_elements(lambda s: s.upper()))
# Fast: vectorized Rust
df.with_columns(pl.col("text").str.to_uppercase())
When to Use Polars vs. Pandas
Polars is not a drop-in replacement for every pandas use case. Here is an honest assessment:
Choose Polars when:
- Your dataset has more than ~500k rows and performance matters.
- You need lazy evaluation or streaming for out-of-core data.
- You want multi-core utilization without writing multiprocessing boilerplate.
- You are building production data pipelines where query plan optimization reduces IO.
- You work with Parquet, Arrow IPC, or NDJSON natively.
Pandas still wins when:
- You rely on a library that requires pandas DataFrames as input and does not accept Arrow (some legacy ML or visualization libraries).
- You heavily use MultiIndex — Polars does not support it.
- Your team is deeply familiar with pandas and the dataset is small enough that performance is not a concern.
- You need .plot() backed by matplotlib without extra setup (pandas has built-in plot integration).
- You are doing exploratory work in Jupyter where interactive index-based inspection (df.loc[label]) is convenient.
In practice, many teams run both: Polars for heavy ETL and preprocessing pipelines, pandas (or polars .to_pandas()) for the final handoff to visualization or legacy model-fitting code.
Integration with Arrow, DuckDB, and ML Libraries
Apache Arrow
Polars is natively Arrow-compatible. Zero-copy conversion between Polars and PyArrow:
import pyarrow as pa
# Polars DataFrame → Arrow Table (zero-copy)
arrow_table = df.to_arrow()
# Arrow Table → Polars DataFrame (zero-copy)
polars_df = pl.from_arrow(arrow_table)
# Write Arrow IPC (feather v2)
df.write_ipc("data.arrow")
df_back = pl.read_ipc("data.arrow")
DuckDB
DuckDB can query Polars DataFrames and LazyFrames directly. The integration is zero-copy via Arrow:
import duckdb
df = pl.read_parquet("sales.parquet")
# DuckDB can reference the Polars DataFrame by name
result = duckdb.sql("""
SELECT
region,
SUM(revenue) AS total_revenue,
COUNT(*) AS num_transactions
FROM df
WHERE revenue > 100
GROUP BY region
ORDER BY total_revenue DESC
""").pl() # .pl() returns a Polars DataFrame
This combination is powerful for ad-hoc SQL on in-memory or on-disk data — DuckDB handles the SQL interface, Polars handles the DataFrame operations.
Scikit-learn
Polars integrates with scikit-learn via numpy conversion. As of scikit-learn 1.4+, set_output(transform="polars") is supported in many transformers:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Convert features to numpy for fitting
X = df.select(["feature_1", "feature_2", "feature_3"]).to_numpy()
y = df["label"].to_numpy()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Or use polars output (sklearn 1.4+)
scaler.set_output(transform="polars")
PyTorch and XGBoost
import torch
import xgboost as xgb
feature_cols = ["f1", "f2", "f3"]
X = df.select(feature_cols).to_numpy()
y = df["target"].to_numpy()
# PyTorch tensor
tensor = torch.from_numpy(X)
# XGBoost DMatrix
dtrain = xgb.DMatrix(X, label=y)
For large training datasets, use scan_parquet with lazy evaluation and .collect() in batches to feed a training loop without loading everything into GPU memory at once.
FAQ
Q: Is Polars stable enough for production in 2026? A: Yes. Polars 1.0 was released in 2024, signaling API stability. The library is used in production at multiple data-intensive companies. The API surface has been stable since 1.0, and breaking changes are announced well in advance.
Q: Does Polars work with AWS S3, GCS, and Azure Blob Storage? A: Yes, via the fsspec extra (pip install polars[fsspec]). scan_parquet("s3://bucket/path/*.parquet") works out of the box once credentials are configured in your environment.
Q: Can I use Polars in a Jupyter notebook? A: Yes. Polars DataFrames render as HTML tables in Jupyter out of the box, similar to pandas; no extra dependencies are required.
Q: How does Polars handle missing data? A: Polars uses null (not NaN) for missing values across all data types, including floats. NaN is a distinct floating-point value in Polars (not treated as missing). Use is_null() / fill_null() for missing data and is_nan() / fill_nan() for IEEE NaN floats.
Q: Does Polars support GPU acceleration? A: As of mid-2026, Polars is experimenting with a GPU engine (NVIDIA cuDF integration) in the lazy API. Pass engine="gpu" to .collect() if you have a compatible CUDA device. The feature is opt-in and not yet considered stable for all operations.
Q: How do I read a very large CSV that does not fit in RAM? A: Use scan_csv and call .collect(engine="streaming"), or use .sink_parquet() / .sink_csv() to write the result directly to disk without collecting into memory.
Q: What is the difference between group_by and group_by_dynamic? A: group_by groups by discrete column values (like pandas groupby). group_by_dynamic groups a datetime column into time windows (e.g., every 1 hour, every 7 days) and is designed for time-series resampling.
Sources
- Polars User Guide (official): https://docs.pola.rs
- Polars GitHub repository: https://github.com/pola-rs/polars
- Polars 1.0 release blog post: https://pola.rs/posts/polars-1.0/
- Apache Arrow columnar format spec: https://arrow.apache.org/docs/format/Columnar.html
- DuckDB + Polars integration guide: https://duckdb.org/docs/guides/python/polars
- PyPI download statistics (pypistats.org): https://pypistats.org/packages/polars
- Scikit-learn set_output API docs: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html
- Ritchie Vink (Polars creator) blog: https://www.ritchievink.com