The current environment file is incomplete as it does not install the tensorflow-text
package required for model training (which should be installed through .[training]).
To run tests locally then you must install tensorflow-text
manually:
pip install tensorflow-text==2.9.0
as version 2.5.0
does not install properly due to python version dependencies
If you observe the following error when running pytest locally then you must update python to >=3.8 for the environment:
E ValueError: Cell is empty
../../../opt/miniconda3/envs/categories_model_test7/lib/python3.7/site-packages/dill/_dill.py:1226: ValueError
Updating the package versions for these two packages in environment.yaml
and training_requirements.txt
does install the environment successfully and all tests also pass.
The categories models use information about a transaction (description, amount, transaction_type (debit/credit), and internal_transaction) to map a category to a transaction.
- Categories models
- 1. Categories models overview
- 2. Preprocessing
- 3. Training algorithm
- 4. Inference (prediction)
- 5. Development guide
Categories Model consists of two models aka use-cases:
- Categories Model: This is the retail categorisation model with consumer (retail) categories
- SME Categories Model: This is the SME categorisation model with Small Medium Enterprise oriented categories The aim of this categorisation is to get a clear and comprehensive view on the main types of expenses and income. We're not trying to set up an exact match to (legal) accounting categories as they differ throughout the different countries where YTS is active.
- Housing: Spending towards housing interior, kitchen, and garden.
- Personal care: Spending towards health (such as pharmacies, drug stores, optician, and dentist) and beauty (such as beauty stores and hairdresser).
- Groceries: Spending towards daily groceries.
- Eating Out: Spending towards restaurants.
- Shopping: Spending towards clothing stores and online marketplaces.
- Travel: Spending towards flights, long distance trains, and hotels.
- Transport: Spending towards public transport, taxi services, rental cars, and parking.
- Bills: Spending towards insurances, subscriptions, interest payments, account fees, and transfer fees.
- Transfers: Money transferred to unknown or private parties and international money transfers.
- Cash: ATM Cash Withdrawals.
- Leisure: Spending towards gaming, streaming services, betting, museums, concert tickets, and sports.
- Internal: Money transferred to an account owned by the sender.
- Income: Money received in the form of salary, refunds, or interest earned.
- Charity: Money transferred to charity organizations, political parties, and religious institutes.
- Coffee: Spending towards coffee shops.
- Drinks: Spending towards bars, pubs, and clubs.
- Education: Spending towards tuition fees, personal development, and book shops.
- Expenses: Business expenses.
- Investments: Payments to and earnings from investments.
- Lunch: Spending at lunch rooms, delis, and cafeterias.
- Gifts: Spending towards flower shops, postcard shops, chocolateries, and gift shops.
- Kids: Spending towards toy stores, pocket money, child care, and schools.
- Takeaway: Spending towards fast food restaurants and food delivery services.
- Petrol: Spending towards gas stations and electric car charging stations.
- Rent: Money paid to land lords and housing associations.
- Utilities: Spending towards energy providers, mobile network operators, and tv providers.
- Vehicle: Spending towards car insurers, car financiers, and car repair companies.
- Pets: Spending towards pet insurers, veterinarians, and pet stores.
- Savings: Money transferred towards saving accounts.
- Mortgage: Payment to a lender who financed real estate.
- General: Transactions that do not fit in any other category.
Cashflow direction level | Category group level | CFA aggregation level | Category in Dutch | SME Category Name | Description |
---|---|---|---|---|---|
Expenses | Financing Expenses | Financing Expenses | Rente en aflossingen | Interest and Repayments | Interest & paid on overdrafts, loans and hire purchase in relation to your business, as well as leasing payments |
Investeringen | Investments | An investment is an asset or item acquired with the goal of generating income or appreciation. | |||
Operating Activities | Other Operating Costs | Eten en drinken | Food and drinks | Expenses related to meals and drinks during eg office lunches/drinks, business travels or meetings. | |
Autokosten | Vehicles and Driving Expenses | Expenses for the use of vehicles for business purposes. Includes business owners driving and employee driving. | |||
Huisvestingskosten | Rent and Facilities | Renting, leasing or purchasing a space to conduct your business is an ongoing expense that you pay monthly, quarterly or annually, depending on the arrangements you make. | |||
Reiskosten | Travel Expenses | Costs associated with traveling for the purpose of conducting business-related activities. | |||
Verkoopkosten | Marketing and Promotion | Marketing, advertisement, or sales of a product or brand. | |||
Overige Operationele Kosten | Other Operating Costs | Operating costs that do not fall in any off the specified categories in the Operating Activities category group (column 2 in this table). | |||
Utilities | Nutsvoorzieningen | Utilities | Water, electricity, internet, telephony, heat etc. For business purposes. | ||
Collection Costs | Incassokosten | Collection Costs | Costs being charged through a collection agency: Lijst met incassobureaus of deurwaarders | ||
Salaries and Labour Costs | Pensioenbijdrage werkgever | Pension Payments | Employers contribution to pension payments | ||
Personeelskosten | Salaries | All compensation costs incurred by an employer in connection with the employment by such employer of applicable personnel, including all payroll and benefits excluding pension payments. | |||
Corporate Savings Deposit | Corporate Savings Deposit | Storting Bedrijfsgerelateerde Spaargelden | Corporate Savings Deposit | Deposit funds in corporate savings account. | |
Equity Withdrawal | Equity Withdrawal | Onttrekking Vermogen | Equity Withdrawal | Withdrawal if funds from the business. This includes private savings deposits, dividend payment and other owners draws. | |
Tax Expenses | Sales Tax | Omzetbelasting | Sales Tax | Tax paid to a governing body for the sales of certain goods and services. | |
Payroll Tax | Loonbelasting | Payroll Tax | Tax that has been withheld from wage and paid to the Tax and Customs Administration. | ||
Corporate Income Tax | Vennootschapsbelasting | Corporate Income Tax | Public and private companies pay corporate income tax on their profits. | ||
Unspecified Tax | Overige belastingen | Unspecified Tax | Tax expenses that do not fall in any other tax category or that could not be identified in any other tax category. | ||
Uncategorised Cost | Other Expenses | Overige uitgaven | Other Expenses | All expenses that do not fall into any other category. | |
Income | Tax Income | Tax Returns | Belasting Ontvangsten | Tax Returns | Funds returned by tax authorities |
Financing Income | Equity Financing | Kapitaal | Equity Financing | Transfer personal funds to a business bank account. | |
Loans | Leningen | Loans | Funds obtained by loans. | ||
Operational Income | Revenue | Omzet | Revenue | The total amount of income generated by the sale of goods or services related to the company's primary operations. | |
Uncategorised Income | Other Income | Overige Ontvangsten | Other Income | All income that does not fall in another category. |
- Preprocessing: Aggregate and convert data sources to training data
- Training: Train StarSpace model core
- Inference (Predict): Predict Python module that includes additional business logic
The picture below represents the solution architecture of the original categories model. The current situation is still valid up to the following differences:
- We do not use sagemaker-baseimages anymore, but use the docker images from datascience-model-commons
- Model artifacts consist of tar files that include
- Tensorflow saved model
- Training metadata
- Predict (inference) code
- The code in the model artifact can be loaded as Python model in categories-serving.
![]() |
---|
Figure 1: solution architecture |
To prepare training data in general five data sources are aggregated (see #2.1):
- users (
country_code
), accounts (account_type
) and transactions tables are joined into one view - test_users (
user_id
) are discarded for non NL users. All NL users were test users within Yolt, and as the app was never publically available in the Netherlands all NL users are test users and all test users are NL users. - feedback (
category
) is joined into the view to have a training target. This comes both from synthetic (generated feedback) and feedback (categories) from users/other parties.
From this dataset a sample is taken that will be split in a train, test and validation dataset.
Both SME and retail categorisation models use the following datasets and data fields:
- YoltApp:
- users:
s3a://yolt-dp-prd-data/cassandra/full_dump/users/user
- accounts:
s3a://yolt-dp-prd-data/cassandra/full_dump/accounts/account
- transactions:
s3a://yolt-dp-prd-data/cassandra/full_dump/datascience/transactions
- test_users:
s3a://yolt-dp-prd-data/cassandra/views/experimental_users
- feedback:
- user_multiple_feedback_applied:
s3a://yolt-dp-prd-data/kafka/views/datascienceEvents__flat_hashed/source__source=categories/type__name=UserMultipleFeedbackApplied
- user_multiple_feedback_created:
s3a://yolt-dp-prd-data/kafka/views/datascienceEvents__flat_hashed/source__source=categories/type__name=UserMultipleFeedbackCreated
- user_single_feedback_created:
s3a://yolt-dp-prd-data/kafka/views/datascienceEvents__flat_hashed/source__source=categories/type__name=UserSingleFeedbackCreated
- user_multiple_feedback_applied:
- historical_feedback:
s3a://yolt-dp-prd-data/static/categories_model/historical_feedback/*
- users:
- YTS:
- users:
s3a://yolt-dp-prd-data-yts/cassandra/full_dump/ycs_users/user
- accounts:
s3a://yolt-dp-prd-data-yts/cassandra/full_dump/ycs_accounts/account
- transactions:
s3a://yolt-dp-prd-data-yts/cassandra/full_dump/ycs_datascience/transactions
- users:
- users (YoltApp + YTS):
COLUMN | VALUE | DESCRIPTION |
---|---|---|
id |
String | Unique Yolt user ID |
country_code |
String | Country |
client_id |
String | ClientID, identifier of client that users belong to e.g. YoltApp |
- accounts (YoltApp + YTS):
COLUMN | VALUE | DESCRIPTION |
---|---|---|
id |
String | Account identifier; join key with users table |
deleted |
Boolean | Indicator whether an account is deleted |
user_id |
String | Unique Yolt user ID |
- test_users (YoltApp):
COLUMN | VALUE | DESCRIPTION |
---|---|---|
user_id |
String | Unique Yolt user ID that belongs to Yolt test users |
- transactions (YoltApp + YTS):
COLUMN | VALUE | DESCRIPTION |
---|---|---|
user_id |
String | Unique Yolt user ID |
account_id |
String | Unique Yolt account ID |
transaction_id |
String | Unique Yolt transaction ID |
pending |
Integer | Referring to "status", 1=REGULAR, 2=PENDING (see PENDING or REGULAR) |
date |
String | Date of the transaction |
description |
String | Transaction description |
bank_counterparty_name |
String | Description part that contains counterparty (this is not the Merchant model output!) |
transaction_type |
String | Transaction type: debit or credit |
internal_transaction |
String | ID for internal transactions |
- historical_feedback (YoltApp):
COLUMN | VALUE | DESCRIPTION |
---|---|---|
user_id |
String | Unique Yolt user ID |
account_id |
String | Unique Yolt account ID |
transaction_id |
String | Unique Yolt transaction ID |
pending |
Integer | Referring to "status", 1=REGULAR, 2=PENDING (see PENDING or REGULAR) |
date |
String | Date of the transaction |
feedback_time |
String | Date when user feedback was given |
category |
String | Category provided by the user |
- user_feedback:
- user_multiple_feedback_applied
- user_multiple_feedback_created
- user_single_feedback_created
COLUMN | VALUE | DESCRIPTION |
---|---|---|
id.userId |
String | Unique Yolt user ID |
id.accountId |
String | Unique Yolt account ID |
id.transactionId |
String | Unique Yolt transaction ID |
id.pendingType |
Integer | Referring to "status", 1=REGULAR, 2=PENDING (see PENDING or REGULAR) |
id.localDate |
String | Date of the transaction |
time |
String | Date when user feedback was given |
fact.category |
String | Category provided by the user |
The /preprocessing
within categories_model
, is both responsible for
- Training Data Generation with rules, defined in
create_training_data.py
,domain_data.py
, andyts_rules.py
. - Preprocessing as ussual, like any other model.
For each preprocessing, the training data is regenerated.
The core is based on StarSpace. An overview of the architecture is in this Miro board and is further detailed in the original paper. An interesting blog providing a simpler overview is located here.
Starspace is a model that embeds concepts in an n-dimensional embedding space. This embedding space can then be used to store transactions and evaluate the (cosine) similarity of concepts; transaction categories in this case.
Our implementation of Starspace for transaction categorization is mostly in categories_model/training/model.py
. The internal method _build_model()
defines the model architecture, which is made up of the following components:
![]() |
---|
Figure 2 StarSpace Core |
- An input layer that takes 1- and 2-grams of the transactions as generated by a HashingVectorizer
- An Embedding layer from Keras, which is where the actual embedding happens. The input dimension of this layer is 60k, meaning that we use a vocabulary of 60k words with a maximum input length of 60. The embedding layer converts this to a dense embedding with a dimension of (32, 1).
- We apply a zero-masked average to the output of this embedding layer and concatenate it with the numeric features like transaction amount, etc.
- These concatenated outputs are fed into a fully connected layer of 32 neurons.
- We apply the custom-written method
LabelEmbeddingSimilarity()
to this. This method computes the similarity of the output to the pre-defined label embeddings by taking the dot product of the two matrices.
After the StarSpace core has trained a set of TensorFlow postprocessing layers is added to the model. The trained model is saved in TensorFlow saved-model format including the postprocessing layers. Each model (Retail/SME) is saved with the corresponding postprocessing layer.
![]() |
---|
Figure 3 Tensorflow Post-processing layers |
When the similarity of two classes as computed by LabelEmbeddingSimilarity
exceeds a pre-defined threshold that differs for retail and SME categories (see table), the corresponding category is assigned as such if there is no category assigned by the business rules in postprocessing.
These are some of the general training parameters as defined in ~/categories_model/config/
.
Parameter | Value |
---|---|
Validation set % | 10% |
Test set % | 10% |
Batch size | 512 |
Max epochs | 40 |
Learning rate | 0.01 |
We employ two different models for retail and SME categories. The differences are mostly in a few parameters and a different set of business rules. The different categories are defined in training/retail.py
and training/sme.py
along with explanation in 4.3 Postprocessing logic. The config/
-folder contains additional parameters that differ between Retail and SME, the most important of which is the threshold that determine the minimal similarity score for a category to be assigned:
Retail | SME |
---|---|
0.65 | 0.45 |
A trained model is stored as a model.tar.gz artifact on s3. This artifact includes:
- The trained StarSpace mode (TensorFlow SavedModel)
- Training metadata (
training_metadata.yaml
) file (the prediction categories can be extracted from that data) - Prediction Python code as Python model.
As the model is offered in a Python module, the model can be loaded and used in categories-model-serving (production inference) and in Jupyter Notebook (analysis).
Currently, the prediction package consists of two layers:
- The prediction code accepts a Pandas Dataframe and offers that to
the Tensorflow model
- For this purpose Pandas is converted into Tenforflow Format
- The tensorflow model includes Tensorflow Based postprocessing logic.
- The TensorFlow predictions are converted to Pandas columns (see Model Output).
categories
category_source
- Additional Pandas based business rules are applied on the TensorFlow predictions including:
- Tax business rules.
![]() |
---|
Figure 4: Inference/model prediction steps |
The model artifact offers a Python predict module that can determine categories from transaction input.
COLUMN | VALUE | DESCRIPTION |
---|---|---|
description |
String | Raw transaction description |
amount |
Float | Raw transaction amount |
internal_transaction |
String | ID for internal transactions |
transaction_type |
String | 'credit' or 'debit' |
account_type |
String | Account type of the transactions account. Values: 'CURRENT_ACCOUNT' or 'SAVINGS_ACCOUNT' |
bank_specific__paylabels |
String | Pay labels extracted from the raw bank_specific field; the extraction is done in categories POD |
bank_specific__mcc |
String | mcc extracted from the raw bank_specific field; the extraction is done in categories POD |
counter_account_type |
String | Account type of transactions counter account. Values: 'CURRENT_ACCOUNT' or 'SAVINGS_ACCOUNT' |
bank_counterparty_name |
String | Part of the transaction description that contains counterparty name (applicable to YTS data) |
bank_counterparty_iban |
String | Used in tax business rules (match Belastingdienst) in SME cagtegories model |
COLUMN | VALUE | DESCRIPTION |
---|---|---|
categories |
String | Array of category predictions of the model |
category_source |
String | Category source: 'ModelPrediction' or 'ModelFallback' |
The model outputs are post-processed with business logic. Overview (in order):
These rules are implemented in the Retail Tensorflow postprocessing rules:
- Transactions on SAVINGS account type are considered Internal
- Transactions on CURRENT_ACCOUNT with empty internal_transaction cannot be Internal
- Debit transactions cannot be Income
- If pay label is present, then category is dictated by Money League team. See: https://yolt.atlassian.net/wiki/spaces/LOV/pages/3910832/Transactions+descriptions+and+labels Note that this table is not necessarily in sync with the code in production!
- If a transaction happens on a savings account (both credit and debit), the category is Internal
- If we know the counter account is a savings account, then the category is Savings
- For a current account, if we do not know the counter account type:
- If predicted category is Savings, then category is Savings
- Else, if internal transaction ID, then category is Internal
- If transaction contains a known MCC code then use that category
- If predicted category model confidence is below threshold, then category is General
These rules are implemented in the SME Tensorflow postprocessing rules:
- Credit transactions cannot have debit categories
- Debit transaction cannor have credit categories
- if the account_type equals "CURRENT_ACCOUNT" and counter_account_type equals "SAVINGS_ACCOUNT" then category should be Savings Deposit for debit transactions
- If the account_type equals "CURRENT_ACCOUNT" and counter_account_type equals "SAVINGS_ACCOUNT" then category should be Savings Withdrawal for credit transactions
- If transaction type equals "debit" and cleaned description is empty then category should be Miscellaneous expenses
- If transaction type equals "credit" and cleaned description is empty then category should be Other Income
Currently the Python based business logic includes Tax business rules.
For the SME categories model we extract tax labels using the datascience_model_commons tax rules library. The resulting labels are mapped using tax_config.yaml:
- Debit-transaction label mapping:
Tax Label | SME Category |
---|---|
Acceptgiro | Unspecified Tax |
Dividendbelasting | Unspecified Tax |
Fijnstoftoeslag | Unspecified Tax |
Inkomsten_belasting | Unspecified Tax |
Inkomstenbelasting | Unspecified Tax |
Douanebelasting | Unspecified Tax |
Overige | Unspecified Tax |
Inkomstenbelasting_gemoedbezwaarden | Unspecified Tax |
Inkomstenbelasting_premievolksverzekering | Unspecified Tax |
Loon_belasting | Payroll Tax |
Loonbelasting | Payroll Tax |
Loonbelasting_naheffing | Payroll Tax |
Motorrijtuigenbelasting | Unspecified Tax |
Motorrijtuigenbelasting_naheffing | Unspecified Tax |
Omzetbelasting | Sales Tax |
Omzetbelasting_naheffing | Sales Tax |
Other | Unspecified Tax |
Toeslag | Unspecified Tax |
Venootschapsbelasting | Corporate Income Tax |
Voorlopige_belasting | Unspecified Tax |
Zorgverzekeringswet | Unspecified Tax |
- Credit-transaction label mapping:
Tax Label | SME Category |
---|---|
Dividendbelasting | Tax Returns |
Loon_belasting | Tax Returns |
Loonbelasting | Tax Returns |
Loonbelasting_naheffing | Tax Returns |
Motorrijtuigenbelasting | Tax Returns |
Omzetbelasting | Tax Returns |
Other | Other Income |
Toeslag | Other Income |
Venootschapsbelasting | Tax Returns |
Voorlopige_belasting | Tax Returns |
Zorgverzekeringswet | Other Income |
Douanebelasting | Tax Returns |
Overige | Other Income |
Inkomstenbelasting | Other Income |
Milieu_investeringsaftrek | Other Income |
- Debit-transaction label mapping:
Tax Label | SME Category |
---|---|
Self_assessment_tax | Unspecified Tax |
Corporate_tax | Corporate Income Tax |
VAT | Sales Tax |
Vehicle_tax | Unspecified Tax |
Council_tax | Unspecified Tax |
- Credit-transaction label mapping:
Tax Label | SME Category |
---|---|
Child_benefit | Other Income |
Child_tax_credit | Unspecified Tax |
Working_tax_credit | Other Income |
Kamaji is a module that is applied after prediction postprocessing business rules. This is applied to any transactions that have a null
category after the probabilistic model and other business rules have been applied.
![]() |
---|
Figure 5: Kamaji from Spirited Away |
Like the multi-armed Kamaji, the code implemented in arms_of_kamaji.py uses the data map defined in mapping_dictionary.py to manually map counterparties to the correct categories. The mapping dictionary contains cherry-picked counterparties (the top most important/frequent counterparties found in the transactions data) to map to categories (e.g. "Eating Out": ["McDonalds"]
).
/
: Project rootyds_categories.yaml
: Retail Categories model YDS cli configyds_sme_categories.yaml
: SME Categories model YDS cli config/categories_model
: This is the main Python package code for categories_model/config
: This contains the configuration files for/predict
: This code is deployed in model serving and contains serving code for- Retail categories model:
retail.py
- SME categories model:
sme.py
- Tax labels to categories mapping:
tax_config.yaml
- Tax labels to categories mapping:
- Retail categories model:
/preprocessing
: All code for preprocessing data into training data/training
: All code for training the tensorflow model
/dags
: The Airflow DAG definition for 'categories_all_train'/resources
: Additional resources connected to categories models and images used in this readme file/notebooks
: for static training data generation
/test
: All pytest unit test code for the categories models/categories_model
: Tests for retail categories model/sme_categories_model
: Test for sme categories model/resources
: all test data used for unit tests and for training the model on dta- Any changes in this data should be synchronized to
s3://yolt-dp-dta-datascience-categories-model/input/
andyolt-dp-dta-datascience-yoltapp-yts-sme-categories-model/input/
:
(!) Please note that currently both retail and sme categories model use the same test data!aws s3 sync /test/resources s3://yolt-dp-dta-datascience-categories-model/input/ aws s3 sync /test/resources s3://yolt-dp-dta-datascience-yoltapp-yts-sme-categories-model/input/
- Any changes in this data should be synchronized to
This project runs in a conda environment. To start developing on your local machine you need to:
- clone this repo to your local computer
- create a conda environment by running
conda env create -f environment.yaml
conda activate categories_model
You should now be able to run pytest:
pytest
To test and run the model on aws please read the yds client documentation on
- Retail categories model uses yds_categories.yaml as configuration file
- SME categories model uses yds_sme_categories.yaml as configuration file
- First prepare/update your conda environment:
- Create (for the first time):
conda env create -f environment.yaml
- Update:
conda env update -f environment.yaml
- For Mac OS X: Prepare Spark & Java
- For testing run
pytest
in your development ide
To test on airflow-dta simply create a branch with a green pipeline. After this pipeline has completed airflow-dta will pick up and deploy code from the latest successful pipeline (any branch).
- Create a git branch and push to gitlab.
- Wait for the pipeline to become green.
- Wait for approx 1 minute to allow airflow-dta to pickup the latest code
- Trigger the model training DAG on Airflow.
Use the yds-cli from your local terminal for aws testing
- Retail categories:
- First run preprocessing:
yds -f -c dta yds_categories_model.yaml run -s preprocessing
this will launch the preprocessing job on aws and will return a six character deploy id - Next, after preprocessing completed run training:
yds -f -c dta yds_categories_model.yaml run -s training -did <deploy id from preprocessing job>
- First run preprocessing:
model_tag
- branch name from which the code is deployeddeploy_id
- pipeline id from which the code is deployedgit_user
- git username of user executing (ggitlab if executed by gitlab+airflow)
├── code
│ └─── <model_tag>
│ └─── <git_user>
│ └─── <deploy_id>
│ ├─── preprocessing
│ │ └─── code files
│ └─── training
│ └─── code files
└── run
├─── preprocessing
│ └─── <model_tag>
│ └─── <git_user>
│ └─── <deploy_id>
│ └─── output files
└─── training
└─── <model_tag>
└─── <git_user>
└─── <deploy_id>
└─── output files
Note that when training a model, we output:
model.tar.gz
- model artifacts together with the metadata:- tensorflow savedmodel
- predict python package code
- metadata yaml file