This application processes time-series data from a CSV file, applies a trend scoring algorithm, and outputs the results to a new CSV file. The process involves normalization and smoothing of weekly interest level changes to produce trend scores ranging from 0 to 100.
The script is a modified version of the Jupyter notebook code. It performs several key functions:
- Data Loading: It starts by loading the input CSV file specified by the user. This file should contain columns for `company_id`, `topic_id`, `week`, and `interest_level` (it can contain other columns too).
- Data Integrity Checks: To ensure the quality and consistency of the input data, the script uses assertions to validate:
  - Each `topic_id` maps to only one `category_id`.
  - Each `company_id` maps to only one `industry_id`.
  - There are no missing values in critical columns such as `interest_level` and `week`.
  - The `interest_level` values fall within the expected range (0 to 100).

  These checks are crucial for preventing errors during the data processing phase and ensuring that the results are reliable.
- Data Processing:
  - Fills in missing weeks for each `company_id`, `topic_id` pair with zeroes to maintain consistency in the time series.
  - Calculates the weekly change in interest levels.
  - Applies Exponentially Weighted Moving Average (EWMA) smoothing to the weekly changes to identify trends.
  - Removes the filled missing weeks values (this can be changed according to the task).
  - Normalizes these trends into scores ranging from 0 to 100, based on the criteria that scores below 50 indicate a decreasing trend, and scores above 50 indicate an increasing trend.
- Output Generation: Finally, the script outputs the processed data with trend scores to a new CSV file at the specified output path.
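The checks and processing steps above can be sketched in pandas roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual code: it assumes `week` is an integer week index with at most one row per `company_id`, `topic_id` pair per week, treats the first weekly change of each series as zero, and uses a hypothetical normalization `50 + trend / 2` (weekly changes lie between -100 and 100, so a flat trend maps to 50). The names `check_integrity`, `score_trends`, `trend_score`, and the default `halflife=2.0` are illustrative only.

```python
import pandas as pd

KEYS = ["company_id", "topic_id"]


def check_integrity(df: pd.DataFrame) -> None:
    """Assertion-based checks mirroring the integrity list above."""
    assert (df.groupby("topic_id")["category_id"].nunique() <= 1).all(), \
        "a topic_id maps to more than one category_id"
    assert (df.groupby("company_id")["industry_id"].nunique() <= 1).all(), \
        "a company_id maps to more than one industry_id"
    assert df[["interest_level", "week"]].notna().all().all(), \
        "missing values in interest_level or week"
    assert df["interest_level"].between(0, 100).all(), \
        "interest_level outside the range 0 to 100"


def score_trends(df: pd.DataFrame, halflife: float = 2.0) -> pd.DataFrame:
    """Return a copy of df with a trend_score column (hypothetical name)."""
    df = df.copy()
    df["trend_score"] = float("nan")

    for _, group in df.groupby(KEYS, sort=False):
        series = group.set_index("week")["interest_level"].astype(float)

        # Fill missing weeks between the first and last observed week with zero.
        first, last = int(series.index.min()), int(series.index.max())
        weeks = pd.Index(range(first, last + 1), name="week")
        filled = series.reindex(weeks, fill_value=0.0)

        # Weekly change; the first week has no predecessor, so assume 0.
        change = filled.diff().fillna(0.0)

        # EWMA smoothing of the weekly changes.
        trend = change.ewm(halflife=halflife).mean()

        # Assumed normalization: a trend of 0 maps to 50, +/-100 maps to 100/0.
        score = (50 + trend / 2).clip(0, 100)

        # Keep scores only for the weeks that were present in the input.
        observed_weeks = group["week"].to_numpy()
        df.loc[group.index, "trend_score"] = score.loc[observed_weeks].to_numpy()

    return df
```

Because the scores are written back onto the original rows, the zero-filled weeks never appear in the output, and any extra input columns pass through unchanged.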
While the current implementation serves the fundamental requirements of trend scoring, there are several areas where the script could be enhanced in future versions:
- Dynamic Parameter Tuning: Introduce command-line arguments or a configuration file to allow users to adjust parameters such as the EWMA `halflife` or `span` without modifying the script. This flexibility would enable users to tailor the analysis to their specific needs more easily.
- Parallel Processing: For large datasets, the script could leverage parallel processing techniques to improve performance. This might involve using libraries such as Dask to distribute the workload across multiple cores or even multiple machines.
- Using a Different Library: The script currently uses pandas for data processing; it could be rewritten in Polars or NumPy for higher efficiency.
- Enhanced Logging: While print statements currently provide basic logging functionality, integrating a more sophisticated logging framework would offer greater control over logging levels and outputs. This improvement would facilitate debugging and monitoring the script's execution in production environments.
- Feature Handling: Give users options for handling columns that the algorithm does not need.
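As one possible shape for the parameter tuning and logging ideas above, the sketch below uses the standard-library `argparse` and `logging` modules. The flag names (`--halflife`, `--span`, `--log-level`), the helper names, and the fallback `halflife` of 2.0 are assumptions made for illustration; the positional input and output paths mirror the way the container is invoked in this README.

```python
import argparse
import logging

logger = logging.getLogger("trendscore")  # hypothetical logger name


def parse_args(argv=None):
    """Parse the input/output paths plus optional EWMA and logging settings."""
    parser = argparse.ArgumentParser(description="Compute trend scores from a CSV file.")
    parser.add_argument("input_path", help="path to the input CSV")
    parser.add_argument("output_path", help="path for the output CSV")

    # pandas' ewm() accepts only one decay parameter at a time,
    # so the two options are made mutually exclusive.
    decay = parser.add_mutually_exclusive_group()
    decay.add_argument("--halflife", type=float, help="EWMA half-life in weeks")
    decay.add_argument("--span", type=float, help="EWMA span in weeks")

    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
    )
    return parser.parse_args(argv)


def ewm_kwargs(args):
    """Translate parsed arguments into keyword arguments for Series.ewm()."""
    if args.span is not None:
        return {"span": args.span}
    # Assumed fallback when neither flag is given.
    return {"halflife": args.halflife if args.halflife is not None else 2.0}


def configure_logging(args):
    """Replace print statements with level-controlled log output."""
    logging.basicConfig(
        level=args.log_level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    logger.info("Reading %s, writing %s", args.input_path, args.output_path)
```

With this in place, a call such as `series.ewm(**ewm_kwargs(args)).mean()` would pick up whichever decay parameter the user supplied.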
These instructions will cover the usage of the Docker application including building the Docker image and running the container to process your data.
- Docker installed on your machine. Visit the official Docker documentation to get Docker for your specific operating system.
- Clone or download this repository to your local machine.
- Navigate to the directory containing the Dockerfile and your scripts.
- Build the Docker image with the following command, replacing `trendscore-processor` with your preferred image name: `docker build -t trendscore-processor .`
  This command reads the Dockerfile in the current directory and builds an image named `trendscore-processor`.
To process your CSV file, run the Docker container with the paths to your input and output files as arguments. Ensure these paths are accessible to the Docker container by mounting the directories containing your files.
To process `folder_one/input.csv` and save the output to `folder_two/output.csv`, use the following command format:
docker run -v /absolute/path/to/folder_one:/data -v /absolute/path/to/folder_two:/output trendscore-processor /data/input.csv /output/output.csv
To process multiple input files with the same image, for example `folder_one/input.csv` and `folder_two/input.csv`, run a separate `docker run` command for each file. Make sure that the commands use different output paths; otherwise, the output of one run may be overwritten by another.
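If there are many input files, the repeated `docker run` calls can be driven from a small Python helper. The sketch below is only one way to do it: the function names are hypothetical, the image name defaults to `trendscore-processor` as in the build step, and it assumes Docker is available on the `PATH` of a Unix-like host. It builds the same command shown above for each input and output pair and refuses to start if two jobs share an output path.

```python
import subprocess
from pathlib import Path


def docker_run_command(input_csv, output_csv, image="trendscore-processor"):
    """Build the docker run argument list for one input/output pair."""
    input_csv = Path(input_csv).resolve()
    output_csv = Path(output_csv).resolve()
    return [
        "docker", "run",
        "-v", f"{input_csv.parent}:/data",      # mount the input folder
        "-v", f"{output_csv.parent}:/output",   # mount the output folder
        image,
        f"/data/{input_csv.name}",
        f"/output/{output_csv.name}",
    ]


def run_all(jobs, image="trendscore-processor"):
    """Run one container per (input, output) pair, one after another."""
    outputs = [Path(out).resolve() for _, out in jobs]
    if len(set(outputs)) != len(outputs):
        raise ValueError("output paths must be distinct to avoid overwriting")
    for input_csv, output_csv in jobs:
        # check=True raises CalledProcessError if a container exits non-zero.
        subprocess.run(docker_run_command(input_csv, output_csv, image), check=True)
```

For example, `run_all([("folder_one/input.csv", "results/one.csv"), ("folder_two/input.csv", "results/two.csv")])` would process both files into separate outputs.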
If you encounter any issues while running the Docker container, here are a few things you can try:
- Check the Docker logs for any error messages. You can do this by running the command `docker logs <container_id>`, replacing `<container_id>` with the ID of your Docker container.
- Ensure that the paths to your input and output files are correct and that they are accessible to the Docker container.
- Make sure that your Docker image was built successfully. If not, you may need to rebuild it.