Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add field delimiter detection #44

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

PoshAlpaca
Copy link

@PoshAlpaca PoshAlpaca commented Mar 4, 2022

Description

Adds the ability to automatically infer the CSV file's field delimiter. To use field delimiter detection, specify nil for the field delimiter when configuring CSVReader:

let configuration = CSVReader.Configuration()
configuration.delimiters = (field: nil, row: "\n")

Checklist

The following list must only be fulfilled by code-changing PRs. If you are making changes on the documentation, ignore these.

  • Include in-code documentation at the top of the property/function/structure/class (if necessary).
  • Merge to develop.
  • Add to existing tests or create new tests (if necessary).

Progress

  • Change API of Delimiter.Field to more explicitly support inference and allow for a custom list of possible delimiters.
    One option could look like this:
    // infer field delimiter from a list of common delimiters, e.g. ",", ";", "\t"
    configuration.delimiters = (field: .infer, row: "\n")
    // infer field delimiter from a list of custom delimiters
    configuration.delimiters = (field: .infer(options: [",", ":"]), row: "\n")
  • Read CSV data incrementally until delimiters can be inferred with a high degree of confidence.

@PoshAlpaca PoshAlpaca changed the title Add field delimiter detection using row patterns Add field delimiter detection Mar 4, 2022
@PoshAlpaca
Copy link
Author

Hi @dehesa,

I finally got around to opening the PR. This is just a draft at this point but I wanted to hear your thoughts before getting too carried away ;)

The detection method comes from CleverCSV and is also described in this paper. It consists of two parts:

  • Analysis of row patterns (implemented): the correct delimiter will produce the same pattern for every row. In these row patterns C represents a cell (field) and D stands for a field delimiter.
                         ","         ";"
    1,foo,bar            CDCDC       C
    2,funny ;),baz       CDCDC       CDC
    
  • Analysis of field types (not implemented yet): the correct delimiter will produce more fields with sensible, recognizable types, e.g. dates, integers, UUIDs etc., resulting in a higher ratio of recognized types to total fields. This is particularly useful when two different delimiters both produce reasonable row patterns, like in the example below.
                         ","      ";"
    1,funny ;),bar       1 0 0    0 0
    2,funny ;),baz       1 0 0    0 0
                         = 2/6    = 0/4
    

I think the analysis of field types could also be useful for header detection (the header row, if present, should contain only string fields while other rows could contain other types of data).

A few questions that came up:

  • Should we support detecting a limited set of field and row delimiters? This would obviously make the task much easier.
  • How much of the CSV file should we look at to make the guess (I currently hard-coded 50 Unicode scalars)? We could also dynamically read more and more until a good guess can be made. Also, when the user specifies a row delimiter we could say we read a certain number of rows.

@dehesa
Copy link
Owner

dehesa commented Mar 4, 2022

Hey @PoshAlpaca,

Great stuff! I am enjoying reading the paper and skimming through the CleverCSV library. Thank you!

For what I can gather, you will focus on inferring the field delimiters, but not the row delimiters (or both), right? Given those constraints, I quite like your dialect detection and pattern score calculations (although I should check it with more detail). Here go some thoughts:

  • Row pattern analysis seems great and already worth to have it by itself.
  • Field type analysis would be awesome to have, but it can be an iterative addition to further refine/validate the results of the row pattern analysis. It seems hard to get it right.
  • We can start supporting a limited set of fields delimiters and later on add full inferral. Maybe we can rework the Delimiter.Field configuration type to, instead of holding an array of scalars, contain an enum specifying whether the delimiter is given or it is supposed to be inferred. If it is supposed to be inferred, we can give a further option to choose on a user selected possible delimiters, a library given common delimiters, or full-inferral.
  • 50 unicode scalars feels low. It would be better to start reading till we have some level of confidence that the inferral was successfull or we reach a maximum never of characters in which case we throw an error (maybe 1000 characters).
  • All the characters that are read during the inferral should be stored in a buffer. Some input sources are read-once sequences (such as the stdin).
  • Also keep in mind that some CSV contain multiple header files or white lines to separate headers from content.

Small code nitpicks:

  • The library uses 2 spaces indentation.
  • Please call self explicitly.

I know this are a lot of cases/situations to handle. So you can start small and grow from there. The code already looks great!

@PoshAlpaca
Copy link
Author

Hi @dehesa,

Thanks for the feedback and the thoughts!
I wanted to start with detecting field delimiters because that's what the CleverCSV library also does. However I'd definitely also want to add row delimiter detection and header detection as it all goes more or less hand-in-hand.

I've started working on field type analysis already because I was curious to try it out. I was thinking it would actually make sense to do row patterns and field types all in one pass where the CSV data is converted into an abstract representation. And then one can do different calculations and inference on that abstraction, e.g. delimiter detection, header detection etc.
The CleverCSV library does two passes, one for row patterns and one for field types, which seems inefficient.

I like the idea of reworking the API as you've suggested, having the user provide possible delimiters and otherwise working with sensible defaults.

The incremental reading definitely makes sense! I'll have a look at how this could work in a bit.
I'm storing all the read data in the CSVReader's _scalarBuffer property which is passed into CSVReader.inferDelimiters. That's how it's meant to be used, right?

For the multiple header lines and empty lines I'll think about how to best handle these. Currently, when there are multiple header lines, the user has to choose one of them using the new lineNumber header strategy, right?

@PoshAlpaca
Copy link
Author

I've added a suggestion for the API to the PR description. What do you think of it?
I'm not quite sure about the parameter name in .infer(options:). Maybe .infer(choices:) or something else could be better 🤔
Also, what should we do with the nil-literal API? We could interpret it as .infer. But maybe it should be deprecated?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants