Commit
(quickstart) improved the tutorial introduction with general audience analogy to spicing food, which should clarify concepts of preventing data and removal of elements; also added analogies throughout the tutorial; per

#2 (comment)
amkrajewski committed Aug 29, 2024
1 parent 7511dd7 commit 3b607c1
Showing 1 changed file with 17 additions and 6 deletions.
23 changes: 17 additions & 6 deletions examples/quickstart.ipynb
@@ -6,7 +6,11 @@
"source": [
"## Quick Start Guide\n",
"\n",
"The purpose of this guide is to demonstrate some common use cases of `nimCSO` and go in a bit more into the details of how it could be used, but it is not by any means extensive. If something is not covered but you would like to see it here, please do not heasitate to open an issue on GitHub and let use know!"
"`nimCSO` is a high-performance tool implementing several methods for selecting components (data dimensions) in compositional datasets, which optimize the data availability and density for applications such as machine learning. Making said choice is a combinatorially hard problem for complex compositions existing in high-dimensional spaces due to the interdependency of the components present. \n",
"\n",
"It is an interdisciplinary tool applicable to any field where data is composed of a large number of independent components and their interaction is of interest in a modeling effort, including some everyday contexts. For instance, *\"Given 100 spices at the supermarket, which 20, 30, or 40 should I stock in my pantry to maximize the number of unique dishes I can spice according to recipe?\"*. Critically, this is not as simple as frequency-based selection because, e.g., *removing* less common nutmeg and cinnamon from your shopping list will *prevent* many recipes with the frequent vanilla, but won't affect those using black pepper.\n",
"\n",
"The purpose of this guide is to demonstrate some common use cases of `nimCSO` and go a bit more into the details of how it could be used, but it is not by any means extensive. If something is not covered but you would like to see it here, please do not hesitate to open an issue on GitHub and let us know!"
]
},
{
@@ -18,7 +22,7 @@
"\n",
"**1.** Install nim and dependencies, but **that's already done for you if you are in the Codespace!**. You can see what was run to get the environment set up in the [`Dockerfile`](../.devcontainer/Dockerfile).\n",
"\n",
"**2.** Create the dataset. For now, let's just use the default one (based on ULTERA Database) that comes with the package. Relative to this notebook, the dataset is located at `../dataList.txt`. Let's have a look at the first few lines of the file to see what it looks like."
"**2.** Create the dataset. For now, let's just use the example one (based on the [ULTERA Database](https://ultera.org)) that comes with the package. Relative to this notebook, the dataset is located at `../dataList.txt`. Let's have a look at the first few lines of the file to see what it looks like. For details, please consult the main `README` or online documentation."
]
},
{
@@ -187,7 +191,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You should have seen a neat `help` message that tells you how to use `nimCSO`. Let's start with a \"coverage\" benchmark to see how fast can we check how many datapoints will be removed from the dataset if we remove the first 5 elements of `elementOrder`."
"You should have seen a neat, concise `help` message that tells you how to use `nimCSO`. Let's start with a \"coverage\" benchmark to see how fast we can check how many datapoints will be prevented (will have to be removed from the original dataset) if we remove the first 5 elements of `elementOrder`. By our earlier analogy to selecting spices in the supermarket, this corresponds to checking how many recipes you won't be able to properly spice if you decide not to buy the first 5 spices on your shopping list."
]
},
{
@@ -346,6 +350,8 @@
"source": [
"Let's look at the result! As expected, `N`, `Y`, `C`, and `Re` are removed first (0-4) and then the trend follows for a bit to `Hf`. **The first break is `V`: you can notice that it's better to remove either or both of `Ta` and `Zr` first, despite the fact that they are nearly 50% more common than `V`!** That's because they often co-occur with `Re` and `Hf`, which are not common.\n",
"\n",
"By our earlier analogy to selecting spices in the supermarket, we already removed nutmeg and cinnamon from our shopping list, so if we want to maximize the number of recipes we can fulfill, it's better to buy black pepper than the more common vanilla.\n",
"\n",
"We can test exactly how much more data we will have if we remove `Ta` instead of `V` by using the `--singleSolution` / `-ss` routine."
]
},
@@ -487,9 +493,7 @@
"source": [
"### Genetic Search\n",
"\n",
"For cases where the dimensionality of the problem is too high to either brute-force or use the algorithm search, we can still use the `--geneticSearch`/`-gs` routine to find the solution in a reasonable time. Let's try it now!\n",
"\n",
"Please note that the results are stochastic, so you might get different results than ones shown below if you run the command again."
"For cases where the dimensionality of the problem is too high to either brute-force or use the algorithm search, we can still use the `--geneticSearch`/`-gs` routine to find the solution in a reasonable time. Let's try it now!"
]
},
{
@@ -535,6 +539,13 @@
"!./nimcso -gs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please note that the results are stochastic, so you might get different results than the ones shown above if you run the command again."
]
},
{
"cell_type": "markdown",
"metadata": {},
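
The spice analogy added in this commit can be sketched with a few lines of Python. This is only a toy illustration: the recipe list and the `prevented` helper are hypothetical and are not part of `nimCSO` or its dataset. It counts how many recipes become impossible ("prevented") for a given set of removed spices, and shows why plain frequency is a poor guide once nutmeg and cinnamon are already off the shopping list.

```python
from collections import Counter

# Hypothetical toy data: each recipe is the set of spices it requires.
recipes = [
    {"vanilla", "cinnamon"},
    {"vanilla", "nutmeg"},
    {"vanilla", "cinnamon", "nutmeg"},
    {"vanilla"},
    {"black pepper"},
    {"black pepper"},
    {"black pepper", "nutmeg"},
]


def prevented(recipes, removed):
    """Count recipes that need at least one of the removed spices."""
    removed = set(removed)
    return sum(1 for recipe in recipes if recipe & removed)


# How often each spice appears across all recipes.
frequency = Counter(spice for recipe in recipes for spice in recipe)

# Recipes already lost once nutmeg and cinnamon are dropped.
base = prevented(recipes, {"nutmeg", "cinnamon"})

# Additional recipes lost by also dropping one more spice.
extra_vanilla = prevented(recipes, {"nutmeg", "cinnamon", "vanilla"}) - base
extra_pepper = prevented(recipes, {"nutmeg", "cinnamon", "black pepper"}) - base

print("frequency:", dict(frequency))
print("prevented without nutmeg and cinnamon:", base)
print("extra loss if vanilla is also dropped:", extra_vanilla)
print("extra loss if black pepper is also dropped:", extra_pepper)
```

In this toy set vanilla appears in more recipes than black pepper, yet most vanilla recipes are already prevented by the missing nutmeg and cinnamon, so dropping vanilla costs fewer additional recipes than dropping black pepper.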
