polars support for xlsx_table #1357

Merged on Jun 3, 2024
53 commits
862c7dd
add make_clean_names function that can be applied to polars
Apr 19, 2024
01531cc
add examples for make_clean_names
Apr 20, 2024
0fb440e
changelog
Apr 20, 2024
5e944b2
limit import location for polars
Apr 20, 2024
501d9c6
limit import location for polars
Apr 20, 2024
9506832
fix polars in environment-dev.yml
Apr 20, 2024
1ae8edd
install polars in doctest
Apr 20, 2024
3b1829b
limit polars imports - user should have polars already installed
Apr 20, 2024
52fd80c
use subprocess.run
Apr 20, 2024
2dce78b
add subprocess.devnull to docstrings
Apr 20, 2024
37b3feb
add subprocess.devnull to docstrings
Apr 20, 2024
0953f2d
add subprocess.devnull to docstrings
Apr 20, 2024
d7c71b6
add subprocess.devnull to docstrings
Apr 20, 2024
40b8502
add os.devnull
Apr 20, 2024
4f11d09
add polars as requirement for docs
Apr 20, 2024
54b179c
add polars to tests requirements
Apr 20, 2024
25b39b9
delete irrelevant folder
Apr 20, 2024
a09f34b
changelog
Apr 20, 2024
1b375f8
create submodule for polars
Apr 21, 2024
799532f
fix doctests
Apr 21, 2024
dbce4b9
fix tests; add polars to documentation
Apr 21, 2024
1c642e6
fix tests; add polars to documentation
Apr 21, 2024
407d21b
import janitor.polars
Apr 21, 2024
aedfc65
control docs output for polars submodule
Apr 21, 2024
db9b486
exclude functions in docs rendering
Apr 21, 2024
6a91e67
exclude functions in docs rendering
Apr 21, 2024
7a88078
show_submodules=true
Apr 21, 2024
6d7885e
fix docstring rendering for polars
Apr 21, 2024
944fa02
Expression -> expression
Apr 21, 2024
b9aefaa
Merge dev into samukweku/polars_clean_names
ericmjl Apr 23, 2024
e9c370a
rename functions.py
Apr 23, 2024
ee66d2a
pivot_longer implemented for polars
Apr 29, 2024
959b082
changelog
Apr 30, 2024
3177503
keep changes related only to pivot_longer
Apr 30, 2024
ee899b2
pd -> pl
Apr 30, 2024
8ea9b71
pd -> pl
Apr 30, 2024
d12ae1a
df.pivot_longer -> df.janitor.pivot_longer
Apr 30, 2024
652f3e3
df.pivot_longer -> df.janitor.pivot_longer
Apr 30, 2024
9b9c1a9
pd -> pl
Apr 30, 2024
69c273f
pd -> pl
Apr 30, 2024
b3391e8
add >>> df
Apr 30, 2024
4ffaac5
add >>> df
Apr 30, 2024
1de57bb
keep changes related only to polars pivot_longer
Apr 30, 2024
e495790
add polars support to read_commandline
May 1, 2024
a5c331a
remove irrelevant files
May 1, 2024
4d9c35f
minor edit to docs
May 1, 2024
3b781c1
xlsx_table now supports polars
May 1, 2024
197e619
Merge dev into samukweku/polars_xlsx_tables
ericmjl May 6, 2024
5981268
Merge dev into samukweku/polars_xlsx_tables
ericmjl May 10, 2024
0484715
Merge dev into samukweku/polars_xlsx_tables
ericmjl May 19, 2024
4b99a5f
Merge dev into samukweku/polars_xlsx_tables
ericmjl May 23, 2024
40e8af6
Merge dev into samukweku/polars_xlsx_tables
ericmjl May 27, 2024
5d32cc5
Merge dev into samukweku/polars_xlsx_tables
ericmjl Jun 2, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -1,6 +1,7 @@
 # Changelog

 ## [Unreleased]
+- [ENH] `xlsx_table` function now supports polars - Issue #1352

 - [ENH] Improved performance for non-equi joins when using numba - @samukweku PR #1341
 - [ENH] Added a `clean_names` method for polars - it can be used to clean the column names, or clean column values . Issue #1343
50 changes: 43 additions & 7 deletions janitor/io.py
@@ -8,7 +8,7 @@
 from glob import glob
 from io import StringIO
 from itertools import chain
-from typing import IO, TYPE_CHECKING, Any, Iterable, Union
+from typing import IO, TYPE_CHECKING, Any, Iterable, Mapping, Union

 import pandas as pd

@@ -142,21 +142,23 @@ def xlsx_table(
     path: Union[str, IO, Workbook],
     sheetname: str = None,
     table: Union[str, list, tuple] = None,
-) -> Union[pd.DataFrame, dict]:
+    engine: str = "pandas",
+) -> Mapping:
     """Returns a DataFrame of values in a table in the Excel file.

     This applies to an Excel file, where the data range is explicitly
     specified as a Microsoft Excel table.

     If there is a single table in the sheet, or a string is provided
-    as an argument to the `table` parameter, a pandas DataFrame is returned;
+    as an argument to the `table` parameter, a DataFrame is returned;
     if there is more than one table in the sheet,
     and the `table` argument is `None`, or a list/tuple of names,
     a dictionary of DataFrames is returned, where the keys of the dictionary
     are the table names.

     Examples:
         >>> import pandas as pd
+        >>> import polars as pl
         >>> from janitor import xlsx_table
         >>> filename="../pyjanitor/tests/test_data/016-MSPTDA-Excel.xlsx"

@@ -170,6 +172,20 @@ def xlsx_table(
         3           4    Competition
         4           5  Long Distance

+        >>> xlsx_table(filename, table='dCategory', engine='polars')
+        shape: (5, 2)
+        ┌────────────┬───────────────┐
+        │ CategoryID ┆ Category      │
+        │ ---        ┆ ---           │
+        │ i64        ┆ str           │
+        ╞════════════╪═══════════════╡
+        │ 1          ┆ Beginner      │
+        │ 2          ┆ Advanced      │
+        │ 3          ┆ Freestyle     │
+        │ 4          ┆ Competition   │
+        │ 5          ┆ Long Distance │
+        └────────────┴───────────────┘
+
         Multiple tables:

         >>> out=xlsx_table(filename, table=["dCategory", "dSalesReps"])
@@ -189,14 +205,16 @@ def xlsx_table(
     Args:
         path: Path to the Excel File. It can also be an openpyxl Workbook.
         table: Name of a table, or list of tables in the sheet.
+        engine: DataFrame engine. Should be either pandas or polars.
+            Defaults to pandas

     Raises:
         AttributeError: If a workbook is provided, and is a ReadOnlyWorksheet.
         ValueError: If there are no tables in the sheet.
         KeyError: If the provided table does not exist in the sheet.

     Returns:
-        A pandas DataFrame, or a dictionary of DataFrames,
+        A DataFrame, or a dictionary of DataFrames,
         if there are multiple arguments for the `table` parameter,
         or the argument to `table` is `None`.
     """ # noqa : E501
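Because the docstring above describes two possible return shapes (a single DataFrame, or a dictionary keyed by table name), calling code may want to normalize the result before iterating. The following is a minimal sketch only: `as_table_dict` and its `default_name` parameter are hypothetical names introduced for illustration and are not part of janitor.

```python
from typing import Any, Dict


def as_table_dict(result: Any, default_name: str = "table") -> Dict[str, Any]:
    """Wrap a single frame in a dict so callers can always iterate by name.

    Hypothetical helper for illustration; it is not part of janitor.
    """
    if isinstance(result, dict):
        # Multiple tables: the result is already keyed by table name.
        return result
    # Single table: key it under a placeholder name chosen by the caller.
    return {default_name: result}
```

A caller could then write `for name, frame in as_table_dict(out).items(): ...` regardless of how many tables were requested.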
@@ -219,6 +237,22 @@ def xlsx_table(
             DeprecationWarning,
             stacklevel=find_stack_level(),
         )
+    if engine not in {"pandas", "polars"}:
+        raise ValueError("engine should be one of pandas or polars.")
+    base_engine = pd
+    if engine == "polars":
+        try:
+            import polars as pl
+
+            base_engine = pl
+        except ImportError:
+            import_message(
+                submodule="polars",
+                package="polars",
+                conda_channel="conda-forge",
+                pip_install=True,
+            )
+
     if table is not None:
         check("table", table, [str, list, tuple])
         if isinstance(table, (list, tuple)):
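The hunk above validates `engine` and imports polars only when it is requested. The same pattern can be sketched in isolation as follows. This is an illustration under assumptions, not the code in this PR: `resolve_engine` is a hypothetical name, and the sketch raises on a missing package, whereas the diff calls janitor's `import_message`, whose behavior is not shown here.

```python
import importlib
from types import ModuleType


def resolve_engine(engine: str = "pandas") -> ModuleType:
    """Return the DataFrame library module named by `engine`.

    Hypothetical helper for illustration; it is not part of janitor.
    """
    if engine not in {"pandas", "polars"}:
        raise ValueError("engine should be one of pandas or polars.")
    try:
        # Import lazily so that polars is only required when requested.
        return importlib.import_module(engine)
    except ImportError as exc:
        # Assumption: fail loudly rather than fall back to another engine.
        raise ImportError(
            f"engine={engine!r} requires the {engine} package to be installed."
        ) from exc
```

With this shape, `resolve_engine("polars").DataFrame(...)` either uses polars or fails with an explicit error, so a caller never silently receives a different frame type than the one requested.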
@@ -245,13 +279,15 @@ def _create_dataframe_or_dictionary_from_table(
             header_exist = contents.headerRowCount
             coordinates = contents.ref
             data = worksheet[coordinates]
-            data = [[entry.value for entry in cell] for cell in data]
             if header_exist:
                 header, *data = data
+                header = [cell.value for cell in header]
             else:
                 header = [f"C{num}" for num in range(len(data[0]))]
-            data = pd.DataFrame(data, columns=header)
-            dictionary[table_name] = data
+            data = zip(*data)
+            data = ([entry.value for entry in cell] for cell in data)
+            data = dict(zip(header, data))
+            dictionary[table_name] = base_engine.DataFrame(data)
         return dictionary

     worksheets = [worksheet for worksheet in ws if worksheet.tables.items()]
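The last hunk switches from building a row-oriented list to building a column-oriented dictionary, a shape that both the pandas and polars `DataFrame` constructors accept. The transposition step can be sketched on its own with plain tuples standing in for openpyxl cells. `rows_to_columns` and `has_header` are hypothetical names used only for illustration, and the sketch assumes at least one row is present.

```python
from typing import Any, Dict, Iterable, List, Sequence


def rows_to_columns(
    rows: Iterable[Sequence[Any]], has_header: bool = True
) -> Dict[str, List[Any]]:
    """Turn row-wise records into a mapping of column name to values.

    Hypothetical helper for illustration; it is not part of janitor.
    """
    rows = list(rows)
    if has_header:
        # The first row carries the column names.
        header, *body = rows
    else:
        # No header row: generate names C0, C1, ... as in the diff.
        body = rows
        header = [f"C{num}" for num in range(len(rows[0]))]
    # zip(*body) transposes the rows into columns.
    columns = zip(*body)
    return dict(zip(header, (list(column) for column in columns)))


table = rows_to_columns(
    [("CategoryID", "Category"), (1, "Beginner"), (2, "Advanced")]
)
# table == {"CategoryID": [1, 2], "Category": ["Beginner", "Advanced"]}
```

One property of this sketch to keep in mind: when there is a header row but no body rows, `zip(*body)` yields nothing, so the result is an empty mapping rather than a set of empty columns.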