Add field delimiter detection #44
base: develop
Conversation
Hi @dehesa, I finally got around to opening the PR. This is just a draft at this point but I wanted to hear your thoughts before getting too carried away ;) The detection method comes from CleverCSV and is also described in this paper. It consists of two parts:
I think the analysis of field types could also be useful for header detection (the header row, if present, should contain only string fields while other rows could contain other types of data). A few questions that came up:
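The header heuristic mentioned above (an all-string first row followed by rows with other types) could be sketched roughly as follows. This is only an illustration: `FieldType`, `classify`, and `looksLikeHeader` are hypothetical names that are not part of the PR, and the type inference here is far cruder than what CleverCSV does.

```swift
// Hypothetical field categories; the names are illustrative only.
enum FieldType: Equatable {
    case empty, integer, decimal, text
}

/// Classifies one field by trying progressively looser numeric parses.
func classify(_ field: String) -> FieldType {
    if field.isEmpty { return .empty }
    if Int(field) != nil { return .integer }
    if Double(field) != nil { return .decimal }
    return .text
}

/// Guesses that the first row is a header when every field in it is text
/// and at least one later row contains a numeric field.
func looksLikeHeader(_ rows: [[String]]) -> Bool {
    guard let first = rows.first, !first.isEmpty, rows.count > 1 else { return false }
    let firstIsAllText = first.allSatisfy { classify($0) == .text }
    let bodyHasNumbers = rows.dropFirst().contains { row in
        row.contains { field in
            let type = classify(field)
            return type == .integer || type == .decimal
        }
    }
    return firstIsAllText && bodyHasNumbers
}
```

A file whose rows are all text cannot be decided by this check alone, so it reports `false` in that case rather than guessing.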
Hey @PoshAlpaca, Great stuff! I am enjoying reading the paper and skimming through the CleverCSV library. Thank you! From what I can gather, you will focus on inferring the field delimiters, but not the row delimiters (or both), right? Given those constraints, I quite like your dialect detection and pattern score calculations (although I should check them in more detail). Here are some thoughts:
Small code nitpicks:
I know these are a lot of cases/situations to handle, so you can start small and grow from there. The code already looks great!
Hi @dehesa, Thanks for the feedback and the thoughts! I've started working on field type analysis already because I was curious to try it out. I was thinking it would actually make sense to do row patterns and field types all in one pass where the CSV data is converted into an abstract representation. And then one can do different calculations and inference on that abstraction, e.g. delimiter detection, header detection etc. I like the idea of reworking the API as you've suggested, having the user provide possible delimiters and otherwise working with sensible defaults. The incremental reading definitely makes sense! I'll have a look at how this could work in a bit. For the multiple header lines and empty lines I'll think about how to best handle these. Currently, when there are multiple header lines, the user has to choose one of them using the new
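To make the "abstract first, then infer" idea concrete, here is a minimal sketch that reduces the data to per-row field counts for each candidate delimiter and then scores how consistent those counts are. It is a simplified stand-in, not CleverCSV's pattern score and not the code in this PR: the function names are hypothetical, and quoting and escaping are deliberately ignored.

```swift
/// Number of fields each non-empty line would have if split on `delimiter`.
/// Naive split: quoted fields and escapes are not handled in this sketch.
func fieldCounts(in text: String, delimiter: Character) -> [Int] {
    text.split(whereSeparator: { $0.isNewline })
        .map { $0.split(separator: delimiter, omittingEmptySubsequences: false).count }
}

/// Fraction of rows that share the most common field count.
/// Returns 0 when that count is 1, i.e. the delimiter never splits a typical row.
func consistency(of counts: [Int]) -> Double {
    guard !counts.isEmpty else { return 0 }
    var histogram: [Int: Int] = [:]
    for count in counts { histogram[count, default: 0] += 1 }
    // Most frequent field count; ties go to the larger field count.
    let best = histogram.max { a, b in (a.value, a.key) < (b.value, b.key) }!
    guard best.key > 1 else { return 0 }
    return Double(best.value) / Double(counts.count)
}

/// Picks the candidate with the highest consistency, or nil if none ever splits the rows.
func inferDelimiter(in text: String, candidates: [Character]) -> Character? {
    var best: (delimiter: Character, score: Double)? = nil
    for candidate in candidates {
        let score = consistency(of: fieldCounts(in: text, delimiter: candidate))
        if score > (best?.score ?? 0) { best = (candidate, score) }
    }
    return best?.delimiter
}
```

Because every inference works on the same intermediate counts, a header check or a field-type score could reuse them without reading the input again.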
I've added a suggestion for the API to the PR description. What do you think of it?
Description
Adds the ability to automatically infer the CSV file's field delimiter. To use field delimiter detection, specify nil for the field delimiter when configuring CSVReader:
Checklist
The following list only needs to be fulfilled by code-changing PRs. If you are making changes to the documentation, ignore these.
develop.
Progress
Delimiter.Field to more explicitly support inference and allow for a custom list of possible delimiters.
One option could look like this:
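As a purely hypothetical illustration of what "explicit inference with a custom candidate list" can look like in Swift (this is not the API proposed in this PR, and `FieldDelimiterSetting` and its members are invented names, not CodableCSV types):

```swift
// Hypothetical setting type; not part of CodableCSV.
enum FieldDelimiterSetting: Equatable {
    /// Use exactly this delimiter; no inference takes place.
    case use(Character)
    /// Infer the delimiter from the given candidates.
    case infer(candidates: [Character])

    /// Assumed default candidates, used when the caller supplies none.
    static let defaultCandidates: [Character] = [",", ";", "\t", "|"]

    /// Convenience value for inference with the default candidates.
    static let inferDefault = FieldDelimiterSetting.infer(candidates: defaultCandidates)

    /// The delimiters a reader would have to consider for this setting.
    var candidates: [Character] {
        switch self {
        case .use(let delimiter):
            return [delimiter]
        case .infer(let list):
            return list.isEmpty ? Self.defaultCandidates : list
        }
    }
}
```

Compared with overloading `nil` to mean "infer", an explicit case makes the intent visible at the call site and gives the candidate list a natural home.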