update notebooks (#287)
MarcoGorelli authored Jun 11, 2024
1 parent 1958fa7 commit 9229d5b
Showing 9 changed files with 854 additions and 31 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -65,7 +65,7 @@ Like Ibis, Narwhals aims to enable dataframe-agnostic code. However, Narwhals co
is about as lightweight as it gets, and is aimed at library developers rather than at end users. It also does
not aim to support as many backends, instead preferring to focus on dataframes. So, which should you use?

-- If you need a SQL frontend: Ibis!
+- If you need a SQL frontend in Python: Ibis!
- If you're a library maintainer and want a lightweight and minimal-overhead layer to get cross-dataframe library support: Narwhals!

Here is the package size increase which would result from installing each tool in a non-pandas
5 changes: 0 additions & 5 deletions docs/overhead.md
@@ -18,8 +18,3 @@ the data sources.
On some runs, the Narwhals code makes things marginally faster, on others
marginally slower. The overall picture is clear: with Narwhals, you
can support both Polars and pandas APIs with little to no impact on either.
-
-A fairly common question we receive is "why not just use Ibis". We believe
-that Ibis works well as a SQL frontend, but find [its overhead when translating
-dataframe APIs](https://github.com/ibis-project/ibis/issues/9345) to be unacceptably high -
-that's why we created something new.
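One way to read the "marginally faster on some runs, marginally slower on others" remark in the overhead note is to compare repeated timings of a direct call against the same call routed through a thin wrapper. The sketch below uses only the standard library; `direct_sum`, `wrapped_sum`, and `compare` are hypothetical stand-ins for illustration, not Narwhals code.

```python
import statistics
import timeit


def direct_sum(values):
    # Hypothetical stand-in for calling a dataframe library directly.
    return sum(values)


def wrapped_sum(values):
    # Hypothetical stand-in for a thin compatibility layer: one extra call.
    return direct_sum(values)


def compare(repeat=5, number=200):
    """Return per-run timings (seconds) for the direct and wrapped calls."""
    data = list(range(1_000))
    direct = timeit.repeat(lambda: direct_sum(data), repeat=repeat, number=number)
    wrapped = timeit.repeat(lambda: wrapped_sum(data), repeat=repeat, number=number)
    return direct, wrapped


direct_runs, wrapped_runs = compare()
# A ratio close to 1.0 means the wrapper adds little overhead; the exact
# value varies from run to run, which is why several runs are compared.
ratio = statistics.median(wrapped_runs) / statistics.median(direct_runs)
print(f"median wrapped/direct ratio: {ratio:.3f}")
```

Comparing medians over several runs, rather than a single measurement, keeps one noisy run from deciding the outcome.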
125 changes: 124 additions & 1 deletion tpch/notebooks/q1/execute.ipynb
@@ -47,7 +47,7 @@
}
],
"source": [
"!pip uninstall apache-beam -y && pip install -U pandas polars pyarrow narwhals>=0.9.5 "
"!pip uninstall apache-beam -y && pip install -U pandas polars pyarrow narwhals>=0.9.5 ibis-framework"
]
},
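A note on the install cell: in a POSIX shell, an unquoted `>` starts an output redirection, so a version specifier such as `narwhals>=0.9.5` is only passed to pip intact when it is quoted. A minimal sketch of building a shell-safe command with the standard library follows; the helper name `pip_install_command` is hypothetical.

```python
import shlex

# Packages from the install cell; one version specifier contains ">".
packages = ["pandas", "polars", "pyarrow", "narwhals>=0.9.5", "ibis-framework"]


def pip_install_command(pkgs):
    """Build a `pip install -U ...` command line with shell-safe quoting."""
    # shlex.join quotes only the arguments that contain shell metacharacters.
    return shlex.join(["pip", "install", "-U", *pkgs])


command = pip_install_command(packages)
print(command)
# prints: pip install -U pandas polars pyarrow 'narwhals>=0.9.5' ibis-framework
```

Splitting the command again with `shlex.split` returns the original package names, which is a quick way to check that nothing was lost to the shell.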
{
@@ -179,6 +179,46 @@
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2aeb714e",
"metadata": {},
"outputs": [],
"source": [
"def q1_ibis(lineitem: Any, *, tool):\n",
" var1 = datetime(1998, 9, 2)\n",
" lineitem = lineitem.filter(lineitem[\"l_shipdate\"] <= var1)\n",
" lineitem = lineitem.mutate(\n",
" disc_price=lineitem[\"l_extendedprice\"] * (1 - lineitem[\"l_discount\"]),\n",
" charge=(\n",
" lineitem[\"l_extendedprice\"]\n",
" * (1.0 - lineitem[\"l_discount\"])\n",
" * (1.0 + lineitem[\"l_tax\"])\n",
" ),\n",
" )\n",
" q_final = (\n",
" lineitem\n",
" .group_by([\"l_returnflag\", \"l_linestatus\"])\n",
" .aggregate(\n",
" sum_qty=lineitem[\"l_quantity\"].sum(),\n",
" sum_base_price=lineitem[\"l_extendedprice\"].sum(),\n",
" sum_disc_price=(lineitem['disc_price'].sum()),\n",
" sum_charge=(lineitem['charge'].sum()),\n",
" avg_qty=lineitem[\"l_quantity\"].mean(),\n",
" avg_price=lineitem[\"l_extendedprice\"].mean(),\n",
" avg_disc=lineitem[\"l_discount\"].mean(),\n",
" count_order=lambda lineitem: lineitem.count(),\n",
" )\n",
" .order_by([\"l_returnflag\", \"l_linestatus\"])\n",
" )\n",
" if tool == 'pandas':\n",
" return q_final.to_pandas()\n",
" if tool == 'polars':\n",
" return q_final.to_polars()\n",
" raise ValueError(\"expected pandas or polars\")"
]
},
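For readers unfamiliar with TPC-H Q1, the aggregation that `q1_ibis` expresses can be sketched in plain Python on a handful of rows. The sample data and the function name `q1_plain` are hypothetical and exist only to show what each output column means; this is not how the benchmark itself is run.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows; the real TPC-H `lineitem` table has more columns.
LINEITEM = [
    {"l_returnflag": "A", "l_linestatus": "F", "l_quantity": 10.0,
     "l_extendedprice": 100.0, "l_discount": 0.5, "l_tax": 0.25,
     "l_shipdate": date(1995, 1, 1)},
    {"l_returnflag": "A", "l_linestatus": "F", "l_quantity": 20.0,
     "l_extendedprice": 200.0, "l_discount": 0.25, "l_tax": 0.0,
     "l_shipdate": date(1996, 6, 15)},
    {"l_returnflag": "N", "l_linestatus": "O", "l_quantity": 5.0,
     "l_extendedprice": 80.0, "l_discount": 0.0, "l_tax": 0.5,
     "l_shipdate": date(1998, 9, 2)},
    {"l_returnflag": "N", "l_linestatus": "O", "l_quantity": 7.0,
     "l_extendedprice": 70.0, "l_discount": 0.0, "l_tax": 0.0,
     "l_shipdate": date(1998, 12, 1)},  # shipped after the cutoff, dropped
]


def q1_plain(rows, cutoff=date(1998, 9, 2)):
    """Filter by ship date, group by (returnflag, linestatus), aggregate."""
    groups = defaultdict(list)
    for row in rows:
        if row["l_shipdate"] <= cutoff:
            groups[(row["l_returnflag"], row["l_linestatus"])].append(row)

    out = []
    for (flag, status), members in sorted(groups.items()):
        n = len(members)
        disc_price = [r["l_extendedprice"] * (1 - r["l_discount"]) for r in members]
        charge = [d * (1 + r["l_tax"]) for d, r in zip(disc_price, members)]
        out.append({
            "l_returnflag": flag,
            "l_linestatus": status,
            "sum_qty": sum(r["l_quantity"] for r in members),
            "sum_base_price": sum(r["l_extendedprice"] for r in members),
            "sum_disc_price": sum(disc_price),
            "sum_charge": sum(charge),
            "avg_qty": sum(r["l_quantity"] for r in members) / n,
            "avg_price": sum(r["l_extendedprice"] for r in members) / n,
            "avg_disc": sum(r["l_discount"] for r in members) / n,
            "count_order": n,
        })
    return out


result = q1_plain(LINEITEM)
for row in result:
    print(row)
```

The sort on the group key mirrors the final `order_by` on `l_returnflag` and `l_linestatus`.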
{
"cell_type": "code",
"execution_count": 5,
@@ -234,11 +274,18 @@
},
"outputs": [],
"source": [
"import ibis\n",
"\n",
"con_pd = ibis.pandas.connect()\n",
"con_pl = ibis.polars.connect()\n",
"\n",
"IO_FUNCS = {\n",
" 'pandas': lambda x: pd.read_parquet(x, engine='pyarrow'),\n",
" 'pandas[pyarrow]': lambda x: pd.read_parquet(x, engine='pyarrow', dtype_backend='pyarrow'),\n",
" 'pandas[pyarrow][ibis]': lambda x: con_pd.read_parquet(x, engine='pyarrow', dtype_backend='pyarrow'),\n",
" 'polars[eager]': lambda x: pl.read_parquet(x),\n",
" 'polars[lazy]': lambda x: pl.scan_parquet(x),\n",
" 'polars[lazy][ibis]': lambda x: con_pl.read_parquet(x),\n",
"}"
]
},
@@ -252,6 +299,44 @@
"results = {}"
]
},
{
"cell_type": "markdown",
"id": "b2dc4c17",
"metadata": {},
"source": [
"## pandas, pyarrow dtypes, ibis"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d342ce97",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"24 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
},
{
"data": {
"text/plain": [
"23.841894793999984"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tool = 'pandas[pyarrow][ibis]'\n",
"fn = IO_FUNCS[tool]\n",
"timings = %timeit -o q1_ibis(fn(lineitem), tool='pandas')\n",
"results[tool] = timings.all_runs"
]
},
{
"cell_type": "markdown",
"id": "64b20949",
@@ -542,6 +627,44 @@
"results[tool] = timings.all_runs"
]
},
{
"cell_type": "markdown",
"id": "aa0a2882",
"metadata": {},
"source": [
"## Polars scan_parquet ibis"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c00e7434",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"24 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
},
{
"data": {
"text/plain": [
"23.841894793999984"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tool = 'polars[lazy][ibis]'\n",
"fn = IO_FUNCS[tool]\n",
"timings = %timeit -o q1_ibis(fn(lineitem), tool='polars')\n",
"results[tool] = timings.all_runs"
]
},
{
"cell_type": "markdown",
"id": "37ce6bf3",
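Each benchmark cell stores `timings.all_runs` under the tool's name in `results`. A minimal sketch of how such a dictionary might be summarised and ranked is shown below. The timing values are hypothetical, one loop per run is assumed (as in the outputs above), and the use of the population standard deviation is a choice made for this sketch.

```python
import statistics

# Hypothetical per-run timings in seconds, shaped like the notebook's `results`.
results = {
    "tool_a": [1.0, 1.5, 2.0],
    "tool_b": [0.5, 0.5, 0.5],
}


def summarize(timings_by_tool):
    """Return (tool, mean, stdev, best) tuples, fastest mean first."""
    rows = []
    for tool, runs in timings_by_tool.items():
        rows.append((
            tool,
            statistics.mean(runs),
            statistics.pstdev(runs),  # population standard deviation
            min(runs),
        ))
    return sorted(rows, key=lambda row: row[1])


summary = summarize(results)
for tool, mean, stdev, best in summary:
    print(f"{tool}: {mean:.3f} s ± {stdev:.3f} s (best {best:.3f} s)")
```

Keeping every run, rather than only the mean, makes it possible to report the spread alongside the ranking.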
143 changes: 142 additions & 1 deletion tpch/notebooks/q2/execute.ipynb
@@ -47,7 +47,7 @@
}
],
"source": [
"!pip uninstall apache-beam -y && pip install -U pandas polars pyarrow narwhals>=0.9.5 "
"!pip uninstall apache-beam -y && pip install -U pandas polars pyarrow narwhals>=0.9.5 ibis-framework "
]
},
{
@@ -222,6 +222,64 @@
" return nw.to_native(q_final)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5823bdfe",
"metadata": {},
"outputs": [],
"source": [
"from typing import Any\n",
"from datetime import datetime\n",
"import ibis\n",
"\n",
"def q2_ibis(\n",
" region: Any,\n",
" nation: Any,\n",
" supplier: Any,\n",
" part: Any,\n",
" partsupp: Any,\n",
" *,\n",
" tool: str,\n",
") -> Any:\n",
" var1 = 15\n",
" var2 = \"BRASS\"\n",
" var3 = \"EUROPE\"\n",
"\n",
" q2 = (\n",
" part.join(partsupp, part[\"p_partkey\"] == partsupp[\"ps_partkey\"])\n",
" .join(supplier, partsupp[\"ps_suppkey\"] == supplier[\"s_suppkey\"])\n",
" .join(nation, supplier[\"s_nationkey\"] == nation[\"n_nationkey\"])\n",
" .join(region, nation[\"n_regionkey\"] == region[\"r_regionkey\"])\n",
" .filter(ibis._[\"p_size\"] == var1)\n",
" .filter(ibis._[\"p_type\"].endswith(var2))\n",
" .filter(ibis._[\"r_name\"] == var3)\n",
" )\n",
"\n",
" q_final = (\n",
" q2.group_by(\"p_partkey\")\n",
" .agg(ps_supplycost=ibis._[\"ps_supplycost\"].min())\n",
" .join(q2, [\"p_partkey\"])\n",
" .select(\n",
" \"s_acctbal\",\n",
" \"s_name\",\n",
" \"n_name\",\n",
" \"p_partkey\",\n",
" \"p_mfgr\",\n",
" \"s_address\",\n",
" \"s_phone\",\n",
" \"s_comment\",\n",
" )\n",
" .order_by(ibis.desc(\"s_acctbal\"), \"n_name\", \"s_name\", \"p_partkey\")\n",
" .limit(100)\n",
" )\n",
" if tool == 'pandas':\n",
" return q_final.to_pandas()\n",
" if tool == 'polars':\n",
" return q_final.to_polars()\n",
" raise ValueError(\"expected pandas or polars\")"
]
},
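TPC-H Q2 keeps, for each part, only the supplier rows whose `ps_supplycost` equals the minimum for that part, then orders by account balance. A plain-Python sketch of that selection step on already-joined rows follows. The sample rows and the name `cheapest_suppliers` are hypothetical, and the ordering is simplified because the sample has no `n_name` column.

```python
# Hypothetical, already-joined rows (part x partsupp x supplier).
ROWS = [
    {"p_partkey": 1, "s_name": "S1", "ps_supplycost": 10.0, "s_acctbal": 500.0},
    {"p_partkey": 1, "s_name": "S2", "ps_supplycost": 7.0, "s_acctbal": 100.0},
    {"p_partkey": 2, "s_name": "S3", "ps_supplycost": 3.0, "s_acctbal": 900.0},
    {"p_partkey": 2, "s_name": "S4", "ps_supplycost": 3.0, "s_acctbal": 50.0},
]


def cheapest_suppliers(rows, limit=100):
    """Keep rows at the per-part minimum supply cost, richest supplier first."""
    # Step 1: minimum supply cost per part (the group_by / min aggregation).
    min_cost = {}
    for row in rows:
        key = row["p_partkey"]
        min_cost[key] = min(row["ps_supplycost"], min_cost.get(key, row["ps_supplycost"]))

    # Step 2: keep rows matching on both part key and minimum cost (the join).
    kept = [r for r in rows if r["ps_supplycost"] == min_cost[r["p_partkey"]]]

    # Step 3: descending account balance, then name and part key; cap the size.
    kept.sort(key=lambda r: (-r["s_acctbal"], r["s_name"], r["p_partkey"]))
    return kept[:limit]


selected = cheapest_suppliers(ROWS)
print([r["s_name"] for r in selected])
# prints: ['S3', 'S2', 'S4']
```

Ties at the minimum cost (S3 and S4 here) are all retained, which is why the match has to be on the cost value and not on a single "best" row.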
{
"cell_type": "code",
"execution_count": 5,
@@ -277,11 +335,18 @@
},
"outputs": [],
"source": [
"import ibis\n",
"\n",
"con_pd = ibis.pandas.connect()\n",
"con_pl = ibis.polars.connect()\n",
"\n",
"IO_FUNCS = {\n",
" 'pandas': lambda x: pd.read_parquet(x, engine='pyarrow'),\n",
" 'pandas[pyarrow]': lambda x: pd.read_parquet(x, engine='pyarrow', dtype_backend='pyarrow'),\n",
" 'pandas[pyarrow][ibis]': lambda x: con_pd.read_parquet(x, engine='pyarrow', dtype_backend='pyarrow'),\n",
" 'polars[eager]': lambda x: pl.read_parquet(x),\n",
" 'polars[lazy]': lambda x: pl.scan_parquet(x),\n",
" 'polars[lazy][ibis]': lambda x: con_pl.read_parquet(x),\n",
"}"
]
},
@@ -295,6 +360,82 @@
"results = {}"
]
},
{
"cell_type": "markdown",
"id": "526a038b",
"metadata": {},
"source": [
"## pandas, pyarrow dtypes, via ibis"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f8b42fe",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"24 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
},
{
"data": {
"text/plain": [
"23.841894793999984"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tool = 'pandas[pyarrow][ibis]'\n",
"fn = IO_FUNCS[tool]\n",
"timings = %timeit -o q2_ibis(fn(region), fn(nation), fn(supplier), fn(part), fn(partsupp), tool='pandas')\n",
"results[tool] = timings.all_runs"
]
},
{
"cell_type": "markdown",
"id": "13c5e9be",
"metadata": {},
"source": [
"## Polars scan_parquet via ibis"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d15d742",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"595 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
},
{
"data": {
"text/plain": [
"0.5674880569999914"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tool = 'polars[lazy][ibis]'\n",
"fn = IO_FUNCS[tool]\n",
"timings = %timeit -o q2_ibis(fn(region), fn(nation), fn(supplier), fn(part), fn(partsupp), tool='polars')\n",
"results[tool] = timings.all_runs"
]
},
{
"cell_type": "markdown",
"id": "eb2f8fd9",