Commit
docs: Simplification of the DFtoVW tutorial (#4693)
* first version of the simplified tutorial
* fix typo + rm dedicated section for df creation
* rename title
* use black linting
* use default kernel

---------

Co-authored-by: Griffin Bassman <[email protected]>
1 parent: 5c77d72; commit: f204897
Showing 1 changed file with 307 additions and 0 deletions.
307 changes: 307 additions & 0 deletions
python/docs/source/tutorials/python_simplified_dftovw_tuto.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,307 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "51f41eaf-f24f-44fc-8178-3270efa46ec4", | ||
"metadata": {}, | ||
"source": [ | ||
"# Simple pandas to vowpalwabbit conversion tutorial" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "b9a21a43-39ad-4213-9c7f-814bbafd8a54", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import pandas as pd\n", | ||
"from vowpalwabbit.dftovw import DFtoVW\n", | ||
"from vowpalwabbit import Workspace" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "fc831353-b5aa-4bb0-a928-c47b340397a5", | ||
"metadata": {}, | ||
"source": [ | ||
"### Building simple examples using `DFtoVW.from_column_names`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "c60089f1-ce41-49ee-a3a9-74f0fb2cb34f", | ||
"metadata": {}, | ||
"source": [ | ||
"Let's create the following pandas dataframe:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "a31118c2-b315-4129-b28a-2ea37d2dae50", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"df = pd.DataFrame(\n", | ||
" [\n", | ||
" {\n", | ||
" \"income\": 0,\n", | ||
" \"age\": 27,\n", | ||
" \"marital-status\": \"Separated\",\n", | ||
" \"education\": \"HS-grad\",\n", | ||
" \"occupation\": \"Handlers-cleaners\",\n", | ||
" \"hours-per-week\": 25,\n", | ||
" },\n", | ||
" {\n", | ||
" \"income\": 1,\n", | ||
" \"age\": 34,\n", | ||
" \"marital-status\": \"Married-civ-spouse\",\n", | ||
" \"education\": \"Bachelors\",\n", | ||
" \"occupation\": \"Prof-specialty\",\n", | ||
" \"hours-per-week\": 40,\n", | ||
" },\n", | ||
" {\n", | ||
" \"income\": 0,\n", | ||
" \"age\": 44,\n", | ||
" \"marital-status\": \"Never-married\",\n", | ||
" \"education\": \"Assoc-voc\",\n", | ||
" \"occupation\": \"Priv-house-serv\",\n", | ||
" \"hours-per-week\": 25,\n", | ||
" },\n", | ||
" {\n", | ||
" \"income\": 1,\n", | ||
" \"age\": 38,\n", | ||
" \"marital-status\": \"Married-civ-spouse\",\n", | ||
" \"education\": \"Bachelors\",\n", | ||
" \"occupation\": \"Prof-specialty\",\n", | ||
" \"hours-per-week\": 60,\n", | ||
" },\n", | ||
" {\n", | ||
" \"income\": 0,\n", | ||
" \"age\": 34,\n", | ||
" \"marital-status\": \"Married-civ-spouse\",\n", | ||
" \"education\": \"HS-grad\",\n", | ||
" \"occupation\": \"Other-service\",\n", | ||
" \"hours-per-week\": 36,\n", | ||
" },\n", | ||
" ]\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "473e5c72-ab6c-4d72-a466-7352ec604393", | ||
"metadata": {}, | ||
"source": [ | ||
"The user builds the examples using the class method `DFtoVW.from_column_names`. The method is called with the dataframe object (`df`) and the names of its label and feature columns. The conversion to Vowpal Wabbit examples is then performed by calling the `convert_df` method:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "2be83f6c-ecaa-45cb-bb3f-2f47827d6016", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"['0 | age:27 marital-status=Separated education=HS-grad occupation=Handlers-cleaners hours-per-week:25',\n", | ||
" '1 | age:34 marital-status=Married-civ-spouse education=Bachelors occupation=Prof-specialty hours-per-week:40',\n", | ||
" '0 | age:44 marital-status=Never-married education=Assoc-voc occupation=Priv-house-serv hours-per-week:25',\n", | ||
" '1 | age:38 marital-status=Married-civ-spouse education=Bachelors occupation=Prof-specialty hours-per-week:60',\n", | ||
" '0 | age:34 marital-status=Married-civ-spouse education=HS-grad occupation=Other-service hours-per-week:36']" | ||
] | ||
}, | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"converter = DFtoVW.from_column_names(\n", | ||
" df=df,\n", | ||
" y=\"income\",\n", | ||
" x=[\"age\", \"marital-status\", \"education\", \"occupation\", \"hours-per-week\"],\n", | ||
")\n", | ||
"examples = converter.convert_df()\n", | ||
"examples" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "6109f95e-cd17-485b-947d-8c2c33a5843a", | ||
"metadata": {}, | ||
"source": [ | ||
"Note that the Vowpal Wabbit format for categorical features is `feature_name=feature_value`, whereas for numerical features the format is `feature_name:feature_value`. When using the `DFtoVW` class, the appropriate format is inferred from the dataframe column types.\n", | ||
"\n", | ||
"We then train the model on these examples:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "c0269980-78b3-4123-84eb-27e0fba929b4", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"model = Workspace(P=1, enable_logging=True)\n", | ||
"\n", | ||
"for ex in examples:\n", | ||
" model.learn(ex)\n", | ||
"model.finish()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "50470ca2-f33d-495e-a3f9-46ae1a618e6d", | ||
"metadata": {}, | ||
"source": [ | ||
"### Building more complex examples" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "30a526a6-7f8f-48e4-8dca-f9058a0d87fb", | ||
"metadata": {}, | ||
"source": [ | ||
"The class method `DFtoVW.from_column_names` is a quick and simple way to build the examples. For more control over how the examples are created, the user can build features with the `Feature` and `Namespace` classes and pick one of the available label classes (see below) according to the nature of the task. \n", | ||
"\n", | ||
"- When using the `Namespace` class (see https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Namespaces for the meaning), the user specifies the name of the namespace with the `name` field and passes one `Feature` object or a list of `Feature` objects to the `features` field.\n", | ||
"\n", | ||
"- The `Feature` class has a `value` field, which is the name of the column. The user can also rename the feature using the `rename_feature` field, or enforce a specific type (`\"numerical\"` or `\"categorical\"`) using the `as_type` field.\n", | ||
"\n", | ||
"Regarding the labels, multiple classes are available:\n", | ||
"- `SimpleLabel` for regression\n", | ||
"- `MulticlassLabel` and `Multilabel` for classification\n", | ||
"- `ContextualbanditLabel`.\n", | ||
"\n", | ||
"In the following example, we build two namespaces: one for the socio-demographic features and one for the job features." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"id": "90a69d90-a0a6-42d4-8867-5d1b0e73f4ec", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"['0 |ns_sociodemo age:27 marital-status=Separated education=HS-grad |ns_job occupation=Handlers-cleaners hours-per-week:25',\n", | ||
" '1 |ns_sociodemo age:34 marital-status=Married-civ-spouse education=Bachelors |ns_job occupation=Prof-specialty hours-per-week:40',\n", | ||
" '0 |ns_sociodemo age:44 marital-status=Never-married education=Assoc-voc |ns_job occupation=Priv-house-serv hours-per-week:25',\n", | ||
" '1 |ns_sociodemo age:38 marital-status=Married-civ-spouse education=Bachelors |ns_job occupation=Prof-specialty hours-per-week:60',\n", | ||
" '0 |ns_sociodemo age:34 marital-status=Married-civ-spouse education=HS-grad |ns_job occupation=Other-service hours-per-week:36']" | ||
] | ||
}, | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"from vowpalwabbit.dftovw import SimpleLabel, Namespace, Feature\n", | ||
"\n", | ||
"ns_sociodemo = Namespace(\n", | ||
" features=[Feature(col) for col in [\"age\", \"marital-status\", \"education\"]],\n", | ||
" name=\"ns_sociodemo\",\n", | ||
")\n", | ||
"ns_job = Namespace(\n", | ||
" features=[Feature(col) for col in [\"occupation\", \"hours-per-week\"]], name=\"ns_job\"\n", | ||
")\n", | ||
"label = SimpleLabel(\"income\")\n", | ||
"\n", | ||
"converter_advanced = DFtoVW(df=df, namespaces=[ns_sociodemo, ns_job], label=label)\n", | ||
"examples_advanced = converter_advanced.convert_df()\n", | ||
"examples_advanced[:5]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "071326d7-f969-4db1-a73e-3cee225921f4", | ||
"metadata": {}, | ||
"source": [ | ||
"We then train the model, this time including interactions between the features of the two namespaces:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"id": "f0ed661f-d9a0-4ebb-93b8-f5747347c7b4", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"model_advanced = Workspace(\n", | ||
" # arg_str=\"--interactions ns_sociodemo:ns_job\", P=1, enable_logging=True\n", | ||
" arg_str=\"--redefine a:=ns_job b:=ns_sociodemo -q ab \",\n", | ||
" P=1,\n", | ||
" enable_logging=True,\n", | ||
")\n", | ||
"\n", | ||
"for ex in examples_advanced:\n", | ||
" model_advanced.learn(ex)\n", | ||
"\n", | ||
"model_advanced.finish()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "5bb2208e-9d0e-44ef-8d91-faccedf41ac0", | ||
"metadata": {}, | ||
"source": [ | ||
"Finally, we can get the estimated weights associated with each namespace and feature:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"id": "06aabeab-2365-4f86-bf60-7043b0e59190", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"[('ns_job', 'occupation', 0.0),\n", | ||
" ('ns_job', 'hours-per-week', 0.0019117757910862565),\n", | ||
" ('ns_sociodemo', 'age', 0.001858704723417759),\n", | ||
" ('ns_sociodemo', 'marital-status', 0.0),\n", | ||
" ('ns_sociodemo', 'education', 0.0)]" | ||
] | ||
}, | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"[\n", | ||
" (ns.name, feature.name, model_advanced.get_weight_from_name(feature.name, ns.name))\n", | ||
" for ns in [ns_job, ns_sociodemo]\n", | ||
" for feature in ns.features\n", | ||
"]" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
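
The notebook states that categorical features are written as `feature_name=feature_value` and numerical features as `feature_name:feature_value`, optionally grouped under namespaces. The sketch below reproduces that text format in plain Python for a single row, to make the rule concrete. It is only an illustration and not how `DFtoVW` is implemented: the helpers `format_feature` and `to_vw_line` are hypothetical names, and the type check on Python values stands in for the inference that `DFtoVW` performs on dataframe column types.

```python
from numbers import Number


def format_feature(name, value):
    """Format one feature: numbers as name:value, anything else as name=value.

    Hypothetical helper; a stand-in for the column-type inference done by DFtoVW.
    """
    if isinstance(value, Number) and not isinstance(value, bool):
        return f"{name}:{value}"
    return f"{name}={value}"


def to_vw_line(row, label, namespaces):
    """Build one example line from a dict-like row.

    `namespaces` maps a namespace name to a list of column names;
    the empty string stands for the default (unnamed) namespace.
    """
    parts = [str(row[label])]
    for ns_name, columns in namespaces.items():
        features = " ".join(format_feature(col, row[col]) for col in columns)
        parts.append(f"|{ns_name} {features}")
    return " ".join(parts)


# First row of the dataframe used in the notebook.
row = {
    "income": 0,
    "age": 27,
    "marital-status": "Separated",
    "education": "HS-grad",
    "occupation": "Handlers-cleaners",
    "hours-per-week": 25,
}

# All features in the default namespace.
flat = to_vw_line(
    row,
    "income",
    {"": ["age", "marital-status", "education", "occupation", "hours-per-week"]},
)

# Features split across two named namespaces.
namespaced = to_vw_line(
    row,
    "income",
    {
        "ns_sociodemo": ["age", "marital-status", "education"],
        "ns_job": ["occupation", "hours-per-week"],
    },
)

print(flat)
print(namespaced)
```

For this row, the two printed lines match the first entries of `examples` and `examples_advanced` shown in the notebook outputs.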