Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add read_fwf (fixed-width file) #8312

Closed
mcrumiller opened this issue Apr 17, 2023 · 7 comments
Closed

Add read_fwf (fixed-width file) #8312

mcrumiller opened this issue Apr 17, 2023 · 7 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@mcrumiller
Copy link
Contributor

Problem description

Polars is currently missing a good method for reading fixed-width files, i.e. read_fwf. I'm trying to figure out how to utilize read_csv to make this work. The problem is that, if you add set separator=' ', then in a file with multiple spaces between fields, every other space is read in as a new column.

For example, calling pl.read_csv() on the following

Name         Birthdate        Age
Frank        1980-05-10       43
Jess         1965-02-16       68

Returns a single-column dataframe, and requires using .str.slice() afterwards to manually split into columns.

pandas.read_fwf allows a specification of tuples supplying (offset, length) for each field. Something like this would be a great addition.

@mcrumiller mcrumiller added the enhancement New feature or an improvement of an existing feature label Apr 17, 2023
@mcrumiller
Copy link
Contributor Author

FYI, my current solution is the following, but any suggestions would be appreciated:

df = pl.read_csv(file, new_columns=["header"]).select(
    col("header").str.slice(col_offset, col_len).cast(col_type).alias(col_name)
    for col_offset, col_len, col_type, col_name in (column_information)
)

Obviously reading everything in as str and converting/parsing is non-optimal, but I can't think of a better way here.

@Fbrufino
Copy link

Fbrufino commented Jun 6, 2023

FYI, my current solution is the following, but any suggestions would be appreciated:

df = pl.read_csv(file, new_columns=["header"]).select(
    col("header").str.slice(col_offset, col_len).cast(col_type).alias(col_name)
    for col_offset, col_len, col_type, col_name in (column_information)
)

Obviously reading everything in as str and converting/parsing is non-optimal, but I can't think of a better way here.

Interesting to see this workaround. Did you test it against Dask to see if it's quicker?

@si1entnight
Copy link

Yes adding read fwf would be a great addition

@dominictarro
Copy link

The PR #10660 causes issues for FWF processing with sporadic commas.

AJ000037661  41.2170   47.1670  641.0    SHAKI                                  37661
AJ000037668  41.1000   47.5000  583.0    VARTASHEN                              37668
AJ000037673  41.5000   48.8000   27.0    HACMAS                                 37673
AJ000037674  41.2000   48.2000  999.0    KRIC                                   37674
AJ000037675  41.3670   48.5170  552.0    GUBA                                   37675
AJ000037679  41.1000   49.2000  -26.0    SIASAN'                                37679
AJ000037729  40.7000   45.8000 1476.0    KEDABEK                                37729
AJ000037734  40.8000   46.0000  404.0    SHAMHOR                                37734
AJ000037735  40.7167   46.4167  311.0    GANCA                                  37735
AJ000037740  40.9830   47.8670  682.0    QABALA                                 37740
AJ000037742  40.9000   47.3000  313.0    ORDJONIKIDZE,ZERNOSOVHOZ               37742

Right now I'm declaring an unlikely character, and processing similarly to how @mcrumiller is doing so, but that has the obvious vulnerability of occurring even once and mucking things up. Would be a big help to get this feature off the ground. An idea for the function

import polars as pl

widths = [(0, 2), (4, 45), (47, 64)]

columns = ["iso_code", "iso_country_name", "date_of_accession"]

pl.read_fwf("countries.txt", widths=widths, columns=columns, has_header=True, dtypes=...)

Some points to consider

  • Extra whitespace
    • Option to automatically strip
  • File has a header
    • Option to ignore header line and use provided field name
  • Not parsing all ranges
    • e.g. only take iso_code and date_of_accession
    • Would cause confusion if the user is declaring columns to select by number. Need to be clear about how this works in docs. Or just don't let the user select by number, not sure how attached people are to the feature.
    • Could require the number of widths provided to equal the number of columns selected. Would also make it obvious what the positions are.

@mcrumiller
Copy link
Contributor Author

mcrumiller commented Sep 12, 2023

@dominictarro for now you can use separator="\n" which seems to work on your case above pretty well:

from io import StringIO

import polars as pl
from polars import col


def read_fwf(filename, widths, columns, dtypes):
    column_information = [(*x, y, z) for x, y, z in zip(widths, columns, dtypes)]

    return pl.read_csv(filename, separator="\n", new_columns=["header"], has_header=False).select(
        col("header").str.slice(col_offset, col_len).cast(col_type).alias(col_name)
        for col_offset, col_len, col_name, col_type in (column_information)
    )


text = (
    "AJ000037661  41.2170   47.1670  641.0,    SHAKI                                  37661\n"
    "AJ000037668  41.1000   47.5000  583.0    VARTASHEN                              37668\n"
    "AJ000037673  41.5000   48.8000   27.0    HACMAS                                 37673\n"
    "AJ000037674  41.2000   48.2000  999.0    KRIC                                   37674\n"
    "AJ000037675  41.3670   48.5170  552.0    GUBA                                   37675\n"
    "AJ000037679  41.1000   49.2000  -26.0    SIASAN'                                37679\n"
    "AJ000037729  40.7000   45.8000 1476.0    KEDABEK                                37729\n"
    "AJ000037734  40.8000   46.0000  404.0,    SHAMHOR                                37734\n"
    "AJ000037735  40.7167   46.4167  311.0    GANCA                                  37735\n"
    "AJ000037740  40.9830   47.8670  682.0    QABALA                                 37740\n"
    "AJ000037742  40.9000   47.3000  313.0    ORDJONIKIDZE,ZERNOSOVHOZ               37742\n"
)

x = StringIO(text)

widths = [(0, 1), (13, 7), (43, 39)]
columns = ["ID", "Price", "Name"]
dtypes = [pl.Utf8, pl.Float32, pl.Utf8]

df = read_fwf(x, widths, columns, dtypes)

print(df)
shape: (11, 3)
┌─────┬───────────┬───────────────────────────────────┐
│ ID  ┆ Price     ┆ Name                              │
│ --- ┆ ---       ┆ ---                               │
│ str ┆ f32       ┆ str                               │
╞═════╪═══════════╪═══════════════════════════════════╡
│ A   ┆ 41.216999 ┆ HAKI                            … │
│ A   ┆ 41.099998 ┆ RTASHEN                         … │
│ A   ┆ 41.5      ┆ CMAS                            … │
│ A   ┆ 41.200001 ┆ IC                              … │
│ …   ┆ …         ┆ …                                 │
│ A   ┆ 40.799999 ┆ HAMHOR                          … │
│ A   ┆ 40.716702 ┆ NCA                             … │
│ A   ┆ 40.983002 ┆ BALA                            … │
│ A   ┆ 40.900002 ┆ DJONIKIDZE,ZERNOSOVHOZ          … │
└─────┴───────────┴───────────────────────────────────┘

@DeflateAwning
Copy link
Contributor

This is a duplicate of #3151. Still a much-needed feature.

@mcrumiller
Copy link
Contributor Author

Thanks @DeflateAwning, I'll close as duplicate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

5 participants