Add read_fwf (fixed-width file) #8312
FYI, my current solution is the following, but any suggestions would be appreciated:

```python
import polars as pl
from polars import col

# Read each line as a single string column, then slice out the fixed-width fields.
df = pl.read_csv(file, new_columns=["header"]).select(
    col("header").str.slice(col_offset, col_len).cast(col_type).alias(col_name)
    for col_offset, col_len, col_type, col_name in column_information
)
```

Obviously reading everything in as str and converting/parsing afterwards is non-optimal, but I can't think of a better way here.
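For context, `column_information` in the snippet above is a list of `(offset, length, dtype, name)` tuples. A hypothetical setup might look like the following; the field positions, names, and file path are illustrative, not from the original comment:

```python
import polars as pl

# Hypothetical (offset, length, dtype, name) tuples for a three-field layout.
column_information = [
    (0, 11, pl.Utf8, "station_id"),
    (12, 8, pl.Float64, "latitude"),
    (21, 8, pl.Float64, "longitude"),
]
file = "stations.txt"  # path to the fixed-width file
```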
Interesting to see this workaround. Did you test it against Dask to see if it's quicker?
Yes, adding `read_fwf` would be a great addition.
PR #10660 causes issues for FWF processing with sporadic commas:

```
AJ000037661 41.2170 47.1670 641.0 SHAKI 37661
AJ000037668 41.1000 47.5000 583.0 VARTASHEN 37668
AJ000037673 41.5000 48.8000 27.0 HACMAS 37673
AJ000037674 41.2000 48.2000 999.0 KRIC 37674
AJ000037675 41.3670 48.5170 552.0 GUBA 37675
AJ000037679 41.1000 49.2000 -26.0 SIASAN' 37679
AJ000037729 40.7000 45.8000 1476.0 KEDABEK 37729
AJ000037734 40.8000 46.0000 404.0 SHAMHOR 37734
AJ000037735 40.7167 46.4167 311.0 GANCA 37735
AJ000037740 40.9830 47.8670 682.0 QABALA 37740
AJ000037742 40.9000 47.3000 313.0 ORDJONIKIDZE,ZERNOSOVHOZ 37742
```

Right now I'm declaring an unlikely character as the separator and processing similarly to how @mcrumiller is doing it, but that has the obvious vulnerability that a single occurrence of the character mucks everything up. It would be a big help to get this feature off the ground.

An idea for the function:

```python
import polars as pl

widths = [(0, 2), (4, 45), (47, 64)]
columns = ["iso_code", "iso_country_name", "date_of_accession"]

pl.read_fwf("countries.txt", widths=widths, columns=columns, has_header=True, dtypes=...)
```

Some points to consider:
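For concreteness, the "declare an unlikely character" approach mentioned earlier in this comment could look roughly like the minimal sketch below. The `"\x01"` separator byte, the file name, and the slice positions are assumptions rather than code from the thread, and the approach fails if that byte ever occurs in the data, which is exactly the vulnerability described:

```python
import polars as pl

# Use a control byte that (hopefully) never occurs in the file, so every
# physical line survives as a single string field regardless of commas or spaces.
df = (
    pl.read_csv("stations.txt", separator="\x01", has_header=False, new_columns=["line"])
    .select(
        pl.col("line").str.slice(0, 11).alias("id"),
        pl.col("line").str.slice(12, 8).str.strip_chars().cast(pl.Float64).alias("lat"),
    )
)
```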
@dominictarro for now you can use:

```python
from io import StringIO

import polars as pl
from polars import col


def read_fwf(filename, widths, columns, dtypes):
    # Pair each (offset, length) with its column name and dtype.
    column_information = [(*x, y, z) for x, y, z in zip(widths, columns, dtypes)]
    # Treat each line as a single string column, then slice fixed-width fields out of it.
    return pl.read_csv(filename, separator="\n", new_columns=["header"], has_header=False).select(
        col("header").str.slice(col_offset, col_len).cast(col_type).alias(col_name)
        for col_offset, col_len, col_name, col_type in column_information
    )


text = (
    "AJ000037661 41.2170 47.1670 641.0, SHAKI 37661\n"
    "AJ000037668 41.1000 47.5000 583.0 VARTASHEN 37668\n"
    "AJ000037673 41.5000 48.8000 27.0 HACMAS 37673\n"
    "AJ000037674 41.2000 48.2000 999.0 KRIC 37674\n"
    "AJ000037675 41.3670 48.5170 552.0 GUBA 37675\n"
    "AJ000037679 41.1000 49.2000 -26.0 SIASAN' 37679\n"
    "AJ000037729 40.7000 45.8000 1476.0 KEDABEK 37729\n"
    "AJ000037734 40.8000 46.0000 404.0, SHAMHOR 37734\n"
    "AJ000037735 40.7167 46.4167 311.0 GANCA 37735\n"
    "AJ000037740 40.9830 47.8670 682.0 QABALA 37740\n"
    "AJ000037742 40.9000 47.3000 313.0 ORDJONIKIDZE,ZERNOSOVHOZ 37742\n"
)

x = StringIO(text)
widths = [(0, 1), (13, 7), (43, 39)]
columns = ["ID", "Price", "Name"]
dtypes = [pl.Utf8, pl.Float32, pl.Utf8]

df = read_fwf(x, widths, columns, dtypes)
print(df)
```
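Note that `widths` here are `(offset, length)` pairs, whereas pandas' `read_fwf` takes half-open `(start, end)` colspecs. A small converter makes it easy to reuse pandas-style specs with the workaround above; this is a hypothetical convenience helper, not part of any proposal in the thread:

```python
def colspecs_to_widths(colspecs):
    """Convert pandas-style (start, end) colspecs to (offset, length) pairs."""
    return [(start, end - start) for start, end in colspecs]

# e.g. pandas colspecs [(0, 11), (12, 20), (21, 29)] become
# [(0, 11), (12, 8), (21, 8)] in the (offset, length) form used above.
widths = colspecs_to_widths([(0, 11), (12, 20), (21, 29)])
```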
This is a duplicate of #3151. Still a much-needed feature.
Thanks @DeflateAwning, I'll close as duplicate.
Problem description

Polars is currently missing a good method for reading fixed-width files, i.e. `read_fwf`. I'm trying to figure out how to utilize `read_csv` to make this work. The problem is that, if you set `separator=' '`, then in a file with multiple spaces between fields, every extra space is read in as a new column.

For example, calling `pl.read_csv()` with default settings on such a file returns a single-column dataframe, and requires using `.str.slice()` afterwards to manually split it into columns.

`pandas.read_fwf` allows specifying a `(start, end)` tuple for each field (its `colspecs` argument). Something like this would be a great addition.
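To make the two failure modes concrete, here is a small demonstration; the toy data and the field positions are made up for illustration, not taken from the issue:

```python
from io import StringIO

import pandas as pd
import polars as pl

data = "aa  1\nbb  2\n"  # two fixed-width fields separated by two spaces

# Default separator (","): the data contains no commas, so each line
# becomes a single string column.
print(pl.read_csv(StringIO(data), has_header=False).shape)  # (2, 1)

# separator=" ": the second space in each gap becomes an extra empty column.
print(pl.read_csv(StringIO(data), separator=" ", has_header=False).shape)  # (2, 3)

# pandas.read_fwf, for comparison, takes half-open (start, end) colspecs.
print(pd.read_fwf(StringIO(data), colspecs=[(0, 2), (4, 5)], names=["id", "value"]))
```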