Polars vs Pandas Benchmark in AWS Lambda#
Keywords: Polars, Pandas, AWS, Lambda, Glue, ETL, Demo
Overview#
Polars is a lightning-fast DataFrame library written in Rust, with a Python SDK that calls the compiled Rust code under the hood. It has the following advantages over pandas:

- Thanks to the underlying arrow2 implementation, it is multi-thread friendly and uses multiple CPU cores by default. It is usually 2-5x faster than pandas when reading a columnar data format (Parquet), and 8-10x faster when reading row-oriented data formats (CSV, JSON).
- Its in-memory data model is more compact and efficient. As a result, the memory consumption of a Polars DataFrame is usually about the same as the raw data size (sometimes even less, due to compression). In comparison, pandas usually uses about double the raw data size when string columns are present.
- Most column-oriented transformations use vectorized operations and multiple CPU cores out of the box, which makes them 4-8x faster than pandas.
- Polars uses lazy evaluation and zero-copy techniques heavily. Like the transformation and action concepts in Spark, it only executes a transformation when necessary, so there is no need to copy the data at intermediate steps. Compared to pandas, it usually uses 1/4 to 1/10 the memory for a transformation, depending on the number of intermediate steps.
AWS Glue is a service that lets developers run Spark ETL jobs without provisioning any infrastructure. However, the Glue development experience is not great. By default, the development environment is not interactive: you usually have to wait 2-5 minutes to execute even a single line of code. Glue does offer a Jupyter Notebook option for development, but it requires some setup, and the Glue job runtime and the Jupyter Notebook runtime are not exactly the same, which can cause unexpected behavior in production. The last challenge is deployment and DevOps. Most modern applications support immutable deployment, version control, blue/green deployment, and canary deployment, but AWS Glue ETL jobs don't support any of these out of the box.
AWS Lambda is a service that lets developers run code in a container runtime without provisioning any infrastructure or setting up a language runtime. However, it has hard limits: at most 10GB of memory and 15 minutes of execution time.

Usually, AWS Lambda is not a good option for a big data ETL process. However, in my career experience, most ETL jobs need less than 1M rows of data. For example, your data lake may have 1B rows, but your ETL job may only need 1M of them for the calculation.

In this document, I would like to explore the possibility of using AWS Lambda plus the Polars Python library to perform medium-sized dataset ETL jobs.
Experimental Design#
Data Schema:

- Has 25 columns.
- 5 columns are int64, values between 1 and 1,000,000, example: 397647.
- 5 columns are float, values between 0 and 1, example: 0.44934012731611805.
- 5 columns are short strings, values are UUID strings, example: daa03354-c777-4b4a-b649-30998f7bd9e3.
- 5 columns are long strings, values are lorem ipsum text of 3-6 sentences, example: Picture wait add environment PM weight music. Type tax chair friend. Data might read value three involve.
- 5 columns are timestamps, values are random datetimes at microsecond precision from 2000-01-01 to 2023-01-01, example: 2008-11-08T14:37:77.638096Z.
Data Files:

- We create 100 files.
- Each file has 100,000 rows.
- The file format is Parquet with snappy compression.
- Each file is about 60MB snappy-compressed, 85MB uncompressed.
Lambda Function:

- Runtime: Python 3.9
- Memory: 10238 MB (the cap is 10GB)
- Architecture: x86_64

Python Libraries (released near 2023-01-01):

- polars == 1.7.15
- pandas == 1.5.3, pyarrow == 9.0.0
We try to read as many files as we can. If we can fit 10M rows in memory, we can usually handle a 1M-row (1/10) dataset in production, because we may have to create copies of the data during transformation.
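That rule of thumb can be written as a one-liner; the factor of 10 is a rough, conservative assumption, not a law:

```python
def usable_rows(rows_that_fit_in_memory: int, copy_factor: int = 10) -> int:
    """Estimate how many rows an ETL job can safely process, assuming
    transformations may materialize up to copy_factor copies of the data."""
    return rows_that_fit_in_memory // copy_factor

print(usable_rows(10_000_000))  # 1000000
```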
Experiment Result#
Result sheet description (all measurements are the average of 10 Lambda invocations):

- engine: polars or pandas
- n_files: how many files we read
- n_rows: how many rows we read (we can handle 1/10 of this number in production)
- raw_size: the total raw Parquet file size (uncompressed)
- polars_time: how long it takes to read all files with polars
- pandas_time: how long it takes to read all files with pandas
- polars_mem: polars memory usage
- pandas_mem: pandas memory usage
| n_files | n_rows | raw_size (MB) | polars_time (sec) | pandas_time (sec) | polars_mem (MB) | pandas_mem (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 100,000 | 85 | 0.8 | 1.3 | 325 | 600 |
| 10 | 1,000,000 | 850 | 8 | 13 | 1,200 | 1,900 |
| 20 | 2,000,000 | 1,700 | 16 | 25 | 2,200 | 3,450 |
| 30 | 3,000,000 | 2,550 | 23 | 40 | 3,050 | 4,900 |
| 40 | 4,000,000 | 3,400 | 32 | 51 | 4,000 | 6,400 |
| 50 | 5,000,000 | 4,250 | 42 | 63 | 5,000 | 7,900 |
| 60 | 6,000,000 | 5,100 | 53 | 85 | 5,950 | 9,400 |
| 70 | 7,000,000 | 5,950 | … | OOM | … | OOM |
| 80 | 8,000,000 | 6,800 | … | OOM | … | OOM |
| 90 | 9,000,000 | 7,650 | … | OOM | … | OOM |
| 100 | 10,000,000 | 8,500 | 110 | OOM | 9,500 | OOM |
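For reference, memory numbers like those above can be sampled from inside a Lambda function with the standard library alone. This is one possible way to measure peak usage, not necessarily how this table was produced:

```python
import resource

def peak_memory_mb() -> float:
    # On Linux (the Lambda runtime), ru_maxrss is reported in kilobytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(f"peak memory: {peak_memory_mb():.1f} MB")
```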
Conclusion#
- Thanks to its compact in-memory data structures, the DataFrame size in Polars is similar to the raw data size, while pandas uses about 2x the raw data size.
- Including S3 file IO time, Polars reads Parquet files 1.5-2x faster than pandas when the dataset has lots of strings. In the official benchmark, Polars is 8-10x faster than pandas when reading CSV / JSON.
- With Polars, we can process a 1M-row dataset in a Lambda function, and potentially more: because Polars uses lazy evaluation and zero-copy techniques to reduce memory usage, it is less likely that you really need to copy the data.
- With pandas, we can process a 650K-row dataset in a Lambda function.
- If your ETL job's source dataset is less than 1M rows (decrease this number if your average row is larger than in this experiment, and vice versa), and your job doesn't require a special write engine like Delta Lake, Hudi, or Iceberg, you can consider using Lambda + Polars for ETL jobs that would otherwise be done in AWS Glue. You get these goodies:
  - a better development experience in Lambda
  - easy testing: you can fully cover your code with unit tests
  - better deployment strategies (versioned deployment, blue/green, canary, all out of the box)
  - easy orchestration
  - easy integration with other services
Additional Thought#
If your input data is not a list of files but the result of a SQL query, you can use AWS Athena to run the query (there is a 200 concurrent query limit) and load the result into the Lambda function.
Code Example#
Data generator script: creates 100 Parquet files (plus CSV and JSON copies) and uploads them to S3 in parallel.

```python
# -*- coding: utf-8 -*-

import os
import uuid
import random

import numpy as np
import pandas as pd
import polars as pl
from faker import Faker
from mpire import WorkerPool
from fixa.timer import DateTimeTimer
from s3pathlib import S3Path, context
from boto_session_manager import BotoSesManager


bsm = BotoSesManager(profile_name="awshsh_app_dev_us_east_1")
context.attach_boto_session(bsm.boto_ses)
fake = Faker()

s3dir_root = S3Path(
    f"s3://{bsm.aws_account_id}-{bsm.aws_region}-data"
    "/projects/polars_benchmark_in_aws_lambda/"
).to_dir()
print(f"preview at: {s3dir_root.console_url}")


n_files = 100
n_rows = 100000


def generate_one_file(ith_file: int):
    print(f"working on the {ith_file}th file")

    df = pl.DataFrame()
    # col_1 .. col_5: int64
    for id in range(1, 1 + 5):
        col = f"col_{id}"
        df = df.with_columns(
            pl.Series(
                name=col,
                values=np.random.randint(1, 1000000, size=n_rows),
            )
        )

    # col_6 .. col_10: float
    for id in range(6, 6 + 5):
        col = f"col_{id}"
        df = df.with_columns(
            pl.Series(
                name=col,
                values=np.random.rand(n_rows),
            )
        )

    # col_11 .. col_15: short string (uuid)
    for id in range(11, 11 + 5):
        col = f"col_{id}"
        df = df.with_columns(
            pl.Series(
                name=col,
                values=[str(uuid.uuid4()) for _ in range(n_rows)],
            )
        )

    # col_16 .. col_20: long string (lorem ipsum)
    for id in range(16, 16 + 5):
        col = f"col_{id}"
        df = df.with_columns(
            pl.Series(
                name=col,
                values=[" ".join(fake.sentences()) for _ in range(n_rows)],
            )
        )

    # col_21 .. col_25: timestamp
    for id in range(21, 21 + 5):
        col = f"col_{id}"
        start = "{}-{}-{}".format(
            random.randint(2001, 2020),
            random.randint(1, 12),
            random.randint(1, 28),
        )
        df = df.with_columns(
            pl.Series(
                name=col,
                values=pd.date_range(start=start, periods=n_rows, freq="S"),
            )
        )

    s3path = s3dir_root.joinpath(
        "parquet",
        f"{str(ith_file).zfill(9)}.snappy.parquet",
    )
    with s3path.open("wb") as f:
        df.write_parquet(
            f,
            compression="snappy",  # 60MB
            # compression="uncompressed",  # 85MB
        )

    s3path = s3dir_root.joinpath(
        "csv",
        f"{str(ith_file).zfill(9)}.csv",
    )
    with s3path.open("wb") as f:
        df.write_csv(f, has_header=True)

    s3path = s3dir_root.joinpath(
        "json",
        f"{str(ith_file).zfill(9)}.json",
    )
    with s3path.open("wb") as f:
        df.write_ndjson(f)


kwargs = [{"ith_file": ith_file} for ith_file in range(1, 1 + n_files)]
with DateTimeTimer():
    with WorkerPool(n_jobs=os.cpu_count()) as pool:
        results = pool.map(generate_one_file, kwargs)
```
Lambda function that reads the Parquet files with Polars (downloading each file to `/tmp` first):

```python
# -*- coding: utf-8 -*-

from datetime import datetime
from pathlib import Path

import boto3
import polars as pl
from s3pathlib import S3Path, context

boto_ses = boto3.session.Session()
context.attach_boto_session(boto_ses)
s3_client = boto_ses.client("s3")

path_tmp_parquet = Path("/tmp/temp.parquet")


def lambda_handler(event, context):
    df_list = list()
    n = 1  # adjust the number of files per benchmark run
    start = datetime.utcnow()
    for ith_file in range(1, 1 + n):
        print(f"read {ith_file} file")
        s3path = S3Path(
            f"s3://807388292768-us-east-1-data"
            f"/projects/polars_benchmark_in_aws_lambda/parquet"
            f"/{str(ith_file).zfill(9)}.snappy.parquet"
        )
        path_tmp_parquet.unlink(missing_ok=True)
        s3_client.download_file(
            s3path.bucket,
            s3path.key,
            str(path_tmp_parquet),
        )
        df = pl.read_parquet(str(path_tmp_parquet))

        # alternative: stream directly from S3 without the /tmp file
        # with s3path.open("rb") as f:
        #     df = pl.read_parquet(f)

        df_list.append(df)

    elapsed = int((datetime.utcnow() - start).total_seconds())
    print(f" done, elapsed {elapsed} seconds")
    # print(df.shape)


# lambda_handler(None, None)
```
Lambda function that reads the Parquet files with pandas via awswrangler:

```python
# -*- coding: utf-8 -*-

from datetime import datetime

import awswrangler as wr


def lambda_handler(event, context):
    df_list = list()
    n = 1  # adjust the number of files per benchmark run
    start = datetime.utcnow()
    for ith_file in range(1, 1 + n):
        print(f"read {ith_file} file")

        uri = (
            f"s3://807388292768-us-east-1-data"
            f"/projects/polars_benchmark_in_aws_lambda/parquet"
            f"/{str(ith_file).zfill(9)}.snappy.parquet"
        )
        df = wr.s3.read_parquet(uri)
        df_list.append(df)

    elapsed = int((datetime.utcnow() - start).total_seconds())
    print(f" done, elapsed {elapsed} seconds")
    print(df.shape)


lambda_handler(None, None)
```