A Python utility for loading, analyzing, and grouping CSV files based on common columns. This tool is particularly useful when you need to process multiple CSV files and group their data based on a specific column while maintaining the original column structure.
- Load CSV files from a directory or specific file paths
- Group data by any common column across CSV files
- Preserve original column names without modification
- Track source files in the output
- Handle both matched and unmatched files separately
- Export results to organized CSV files
- Python 3.6+
- pandas
- pathlib (part of the Python standard library, no separate install needed)
- Clone this repository or copy the csv_analyzer.py file to your project
- Install required dependencies:
pip install pandas
from csv_analyzer import CSVAnalyzerGrouping
# Initialize the analyzer
analyzer = CSVAnalyzerGrouping()
# Load CSV files from a directory
analyzer.load_from_directory("path/to/your/csvs")
# Or load specific CSV files
analyzer.load_from_files(["file1.csv", "file2.csv"])
# Group data by a specific column
result = analyzer.grouped_data_by_column("category")
# Export the results
output_dir = ".tmp"
analyzer.export_matched_data(output_dir, result, "grouped_by_category")
analyzer.export_unmatched_data(output_dir, result)
The tool will create a combined CSV file containing all grouped data, with the original columns plus a source_file column. Files that do not contain the grouping column are exported to separate CSV files. In the combined file:
- The source_file column is always the last column in the output
- Original column names are preserved without any aggregation suffixes
For input CSV files containing columns: name,category,link,tag,label,id,x_path
The output will maintain the same structure with source_file added as the last column:
name,category,link,tag,label,id,x_path,source_file
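The output layout described above can be approximated with plain pandas. The following is a minimal sketch, not the actual implementation in csv_analyzer.py: the helper name combine_with_source and the in-memory sample data are hypothetical, and treating "grouping" as a stable sort on the chosen column is an assumption.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for two input CSV files.
SAMPLE_FILES = {
    "file1.csv": "name,category,link\nalpha,news,a.html\nbeta,blog,b.html\n",
    "file2.csv": "name,category,link\ngamma,news,c.html\n",
}


def combine_with_source(files, group_column):
    """Concatenate all frames, tag each row with its file, order by group_column."""
    frames = []
    for filename, text in files.items():
        df = pd.read_csv(io.StringIO(text))
        df["source_file"] = filename  # record where each row came from
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)
    # A stable sort keeps rows that share a group value in their load order.
    combined = combined.sort_values(group_column, kind="mergesort")
    combined = combined.reset_index(drop=True)

    # Force source_file to be the last column, leaving the other names untouched.
    ordered = [c for c in combined.columns if c != "source_file"] + ["source_file"]
    return combined[ordered]


combined = combine_with_source(SAMPLE_FILES, "category")
print(combined.to_csv(index=False))
```

Because the original columns are only concatenated and reordered, never aggregated, no suffixes such as _x or _sum appear in the header.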
load_from_directory: Loads all CSV files from the specified directory.
load_from_files: Loads specific CSV files from the provided file paths.
grouped_data_by_column: Groups data by the specified column for files that contain it.
export_matched_data: Exports matched (grouped) data to a single combined CSV file.
export_unmatched_data: Exports unmatched data to separate CSV files.
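To illustrate the matched/unmatched distinction, here is a small sketch of how loaded files could be partitioned by whether they contain the grouping column. The function split_by_column and the sample file names are hypothetical, and the real class may organize this differently.

```python
import io

import pandas as pd


def split_by_column(frames, column):
    """Partition {filename: DataFrame} into files with and without `column`."""
    matched = {name: df for name, df in frames.items() if column in df.columns}
    unmatched = {name: df for name, df in frames.items() if column not in df.columns}
    return matched, unmatched


# Hypothetical sample data: only the first file has a "category" column.
frames = {
    "products.csv": pd.read_csv(io.StringIO("name,category\nalpha,news\n")),
    "notes.csv": pd.read_csv(io.StringIO("id,label\n1,misc\n")),
}

matched, unmatched = split_by_column(frames, "category")
print("matched:", list(matched))
print("unmatched:", list(unmatched))
```

Matched frames would then feed the combined export, while each unmatched frame would be written to its own file.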
The tool includes error handling for:
- Invalid directory paths
- File reading errors
- Grouping operation failures
- Export errors
Each operation provides clear feedback through console messages.
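As a rough illustration of the first two cases, the sketch below validates the directory and skips unreadable files while printing a message for each outcome. The function load_csvs_safely and its message wording are assumptions for illustration, not the tool's actual code or output.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csvs_safely(directory):
    """Load every *.csv in `directory`, reporting problems instead of raising."""
    path = Path(directory)
    if not path.is_dir():
        print(f"Error: '{directory}' is not a valid directory")
        return {}

    loaded = {}
    for csv_path in sorted(path.glob("*.csv")):
        try:
            loaded[csv_path.name] = pd.read_csv(csv_path)
            print(f"Loaded {csv_path.name}")
        except (OSError, UnicodeDecodeError,
                pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
            # One bad file should not abort the whole load.
            print(f"Skipped {csv_path.name}: {exc}")
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.csv").write_text("name,category\nalpha,news\n")
    Path(tmp, "empty.csv").write_text("")  # an empty file cannot be parsed
    result = load_csvs_safely(tmp)
    missing = load_csvs_safely(Path(tmp) / "does_not_exist")
```

Catching specific exception types rather than a bare Exception keeps genuine programming errors visible while still reporting the expected failure modes.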
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
This project is licensed under the MIT License - see the LICENSE file for details.