Add read_fwf (fixed-width file) #8312
FYI, my current solution is the following, but any suggestions would be appreciated:

```python
import polars as pl
from polars import col

# Read each line as a single string column, then slice out the fixed-width fields.
df = pl.read_csv(file, new_columns=["header"]).select(
    col("header").str.slice(col_offset, col_len).cast(col_type).alias(col_name)
    for col_offset, col_len, col_type, col_name in column_information
)
```

Obviously reading everything in as str and converting/parsing afterwards is non-optimal, but I can't think of a better way here.
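For context, `column_information` in the snippet above is a list of `(offset, length, dtype, name)` tuples. A hypothetical setup might look like the following; the field positions, names, and file path are illustrative, not from the original comment:

```python
import polars as pl

# Hypothetical (offset, length, dtype, name) tuples for a three-field layout.
column_information = [
    (0, 11, pl.Utf8, "station_id"),
    (12, 8, pl.Float64, "latitude"),
    (21, 8, pl.Float64, "longitude"),
]
file = "stations.txt"  # path to the fixed-width file
```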
Interesting to see this workaround. Did you test it against Dask to see if it's quicker?
Yes, adding `read_fwf` would be a great addition.
PR #10660 causes issues for FWF processing with sporadic commas:

```
AJ000037661 41.2170 47.1670 641.0 SHAKI 37661
AJ000037668 41.1000 47.5000 583.0 VARTASHEN 37668
AJ000037673 41.5000 48.8000 27.0 HACMAS 37673
AJ000037674 41.2000 48.2000 999.0 KRIC 37674
AJ000037675 41.3670 48.5170 552.0 GUBA 37675
AJ000037679 41.1000 49.2000 -26.0 SIASAN' 37679
AJ000037729 40.7000 45.8000 1476.0 KEDABEK 37729
AJ000037734 40.8000 46.0000 404.0 SHAMHOR 37734
AJ000037735 40.7167 46.4167 311.0 GANCA 37735
AJ000037740 40.9830 47.8670 682.0 QABALA 37740
AJ000037742 40.9000 47.3000 313.0 ORDJONIKIDZE,ZERNOSOVHOZ 37742
```

Right now I'm declaring an unlikely character as the separator and processing similarly to how @mcrumiller is doing it, but that has the obvious vulnerability that a single occurrence of the character mucks everything up. It would be a big help to get this feature off the ground.

An idea for the function:

```python
import polars as pl

widths = [(0, 2), (4, 45), (47, 64)]
columns = ["iso_code", "iso_country_name", "date_of_accession"]

pl.read_fwf("countries.txt", widths=widths, columns=columns, has_header=True, dtypes=...)
```

Some points to consider:
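For concreteness, the "declare an unlikely character" approach mentioned earlier in this comment could look roughly like the minimal sketch below. The `"\x01"` separator byte, the file name, and the slice positions are assumptions rather than code from the thread, and the approach fails if that byte ever occurs in the data, which is exactly the vulnerability described:

```python
import polars as pl

# Use a control byte that (hopefully) never occurs in the file, so every
# physical line survives as a single string field regardless of commas or spaces.
df = (
    pl.read_csv("stations.txt", separator="\x01", has_header=False, new_columns=["line"])
    .select(
        pl.col("line").str.slice(0, 11).alias("id"),
        pl.col("line").str.slice(12, 8).str.strip_chars().cast(pl.Float64).alias("lat"),
    )
)
```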
@dominictarro for now you can use:

```python
from io import StringIO

import polars as pl
from polars import col


def read_fwf(filename, widths, columns, dtypes):
    # Pair each (offset, length) with its column name and dtype.
    column_information = [(*x, y, z) for x, y, z in zip(widths, columns, dtypes)]
    # Treat each line as a single string column, then slice fixed-width fields out of it.
    return pl.read_csv(filename, separator="\n", new_columns=["header"], has_header=False).select(
        col("header").str.slice(col_offset, col_len).cast(col_type).alias(col_name)
        for col_offset, col_len, col_name, col_type in column_information
    )


text = (
    "AJ000037661 41.2170 47.1670 641.0, SHAKI 37661\n"
    "AJ000037668 41.1000 47.5000 583.0 VARTASHEN 37668\n"
    "AJ000037673 41.5000 48.8000 27.0 HACMAS 37673\n"
    "AJ000037674 41.2000 48.2000 999.0 KRIC 37674\n"
    "AJ000037675 41.3670 48.5170 552.0 GUBA 37675\n"
    "AJ000037679 41.1000 49.2000 -26.0 SIASAN' 37679\n"
    "AJ000037729 40.7000 45.8000 1476.0 KEDABEK 37729\n"
    "AJ000037734 40.8000 46.0000 404.0, SHAMHOR 37734\n"
    "AJ000037735 40.7167 46.4167 311.0 GANCA 37735\n"
    "AJ000037740 40.9830 47.8670 682.0 QABALA 37740\n"
    "AJ000037742 40.9000 47.3000 313.0 ORDJONIKIDZE,ZERNOSOVHOZ 37742\n"
)

x = StringIO(text)
widths = [(0, 1), (13, 7), (43, 39)]
columns = ["ID", "Price", "Name"]
dtypes = [pl.Utf8, pl.Float32, pl.Utf8]

df = read_fwf(x, widths, columns, dtypes)
print(df)
```
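Note that `widths` here are `(offset, length)` pairs, whereas pandas' `read_fwf` takes half-open `(start, end)` colspecs. A small converter makes it easy to reuse pandas-style specs with the workaround above; this is a hypothetical convenience helper, not part of any proposal in the thread:

```python
def colspecs_to_widths(colspecs):
    """Convert pandas-style (start, end) colspecs to (offset, length) pairs."""
    return [(start, end - start) for start, end in colspecs]

# e.g. pandas colspecs [(0, 11), (12, 20), (21, 29)] become
# [(0, 11), (12, 8), (21, 8)] in the (offset, length) form used above.
widths = colspecs_to_widths([(0, 11), (12, 20), (21, 29)])
```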
This is a duplicate of #3151. Still a much-needed feature.
Thanks @DeflateAwning, I'll close as duplicate.
Problem description

Polars is currently missing a good method for reading fixed-width files, i.e. `read_fwf`. I'm trying to figure out how to utilize `read_csv` to make this work. The problem is that, if you set `separator=' '`, then in a file with multiple spaces between fields, every extra space is read in as a new column.

For example, calling `pl.read_csv()` with default settings on such a file returns a single-column dataframe, and requires using `.str.slice()` afterwards to manually split it into columns.

`pandas.read_fwf` allows specifying a `(start, end)` tuple for each field (its `colspecs` argument). Something like this would be a great addition.
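To make the two failure modes concrete, here is a small demonstration; the toy data and the field positions are made up for illustration, not taken from the issue:

```python
from io import StringIO

import pandas as pd
import polars as pl

data = "aa  1\nbb  2\n"  # two fixed-width fields separated by two spaces

# Default separator (","): the data contains no commas, so each line
# becomes a single string column.
print(pl.read_csv(StringIO(data), has_header=False).shape)  # (2, 1)

# separator=" ": the second space in each gap becomes an extra empty column.
print(pl.read_csv(StringIO(data), separator=" ", has_header=False).shape)  # (2, 3)

# pandas.read_fwf, for comparison, takes half-open (start, end) colspecs.
print(pd.read_fwf(StringIO(data), colspecs=[(0, 2), (4, 5)], names=["id", "value"]))
```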