diff --git a/modules/distributing/distributing.md b/modules/distributing/distributing.md new file mode 100644 index 00000000..2633d02b --- /dev/null +++ b/modules/distributing/distributing.md @@ -0,0 +1,45 @@ +--- +title: Distributing Software +type: reading +order: 4 +--- + +# Distributing software (10 minutes) + +How do you make it easy for someone else to obtain a copy of your software and get it set up on their computer so that they can use it? + +Modern software consists of an often large collection of components (libraries, packages) that are combined to form an application. This whole collection needs to be reproduced on the computer of the user for things to work. There are two ways of doing that: 1) combining them all together on the computer of the developer, and then wrapping everything up into a package, installer, container image, or VM image that is sent to the user, or 2) putting the components that you made yourself on the Internet (as a package), and relying on the user to download the other components (packages) and assemble it all together into a working application. + +## Monolithic applications + +Option 1) works for applications, which are more or less independent. If they're used together, then it's by saving a file from one and opening it in another application. Each application contains all the bits it needs, and is installed on the user's computer in a separate folder, away from everything else. That means that different applications don't get in each other's way, but it's also rather inefficient if many applications use the same component, because you end up with many copies of that component. + +If you do choose option 1), then you still have a choice between making a package, an installer, a container image, or a virtual machine image. A package is an archive (think a ZIP-file, which it often literally is) that contains, in this case, all the components needed by the application. 
Since it's just a file, a package needs to be installed by a special program called a package manager. The App Store or Play Store on your phone is such a program. + +An installer is itself a computer program that also contains all the components needed by the application. It gets downloaded by the user, who then runs it, after which it copies all the components from within itself onto the user's computer. The application can then run there just like one installed from a package using a package manager. + +A container image is a special kind of package. It also contains all the parts needed to run a program, but it is run in a special isolated environment called a container. A normal application can access everything else on the computer, including files and parts of other applications. It's set up to use its own components of course, but it could access other things if it wanted or needed to. An application that runs in a container can't do this: it's isolated from everything else except for the operating system. This is an advantage for example if the software runs on a server that is accessible from the Internet, because it provides some security. It also makes it easy to run many copies of the software on many servers, so that you can serve many users. + +Finally, a virtual machine is even more isolated. It contains its own operating system together with the application, so that the running application cannot even access the operating system on the user's computer. This has similar advantages to a container, being more secure, but it's also slower than using containers. + +So these are the different ways option 1), distributing a monolithic application with everything included, can be implemented. As said, this reduces potential compatibility problems, but isn't very efficient because you end up with many copies of everything. 
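
To make the idea of a package as an archive concrete, here is a minimal sketch in Python that bundles a few files into a ZIP archive and lists them again. The file names, the `METADATA` content, and the helper functions `build_package` and `list_package` are hypothetical and only serve as illustration; real package formats add their own rules about layout and metadata.

```python
import io
import zipfile

# Hypothetical components of a small application; all names are made up.
components = {
    "myapp/main.py": "print('hello from myapp')\n",
    "myapp/libs/helper.py": "def helper():\n    return 42\n",
    "METADATA": "Name: myapp\nVersion: 1.0\n",
}


def build_package(files):
    """Bundle all files into one in-memory ZIP archive and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path, content in files.items():
            archive.writestr(path, content)
    return buffer.getvalue()


def list_package(package_bytes):
    """Return the sorted paths stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        return sorted(archive.namelist())


package = build_package(components)
print(list_package(package))
```

The result is a single blob of bytes that could be sent to a user. Something on the user's side (a package manager, in the terms used above) still has to unpack it and put the files in the right place.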
+ +## Separate packages + +Option 2) is more efficient than option 1), because the user can just install each component once, and then every other component that needs it can use it. There are drawbacks here as well though. First, the user needs to figure out which components are needed for a particular application, and then install them one by one. This puts them in an unpleasant place called "dependency hell". + +Dependency hell was mostly solved by the invention of package managers, which automate the process of downloading and installing the required components. Examples are pip, conda, apt, and Homebrew. If each component is put into a package with some metadata that describes which other packages it needs, then the package manager can do all that automatically, at least assuming that everything is Open Source and freely available online, because it cannot go to the shop to buy a license for everything. Still, often everything is Open Source and then this saves a huge amount of work. Dependency hell is not the only problem, however. + +Software is continuously developed, and that means that it changes over time. Those changes sometimes alter how a component is used by other components, which then need to be updated too. So the user may end up with an older program that only works with an older version of component X, while they also want to use a different newer program that works only with a newer version of X. A good package manager will give an error message in that case, but that doesn't solve the problem. Which version do you install? + +There are again two common solutions to this: distributions and environments. A distribution, like Ubuntu, is made by a group of people who create a collection of packages that are all compatible with each other, meaning that every package in it that uses package X works with the same version of package X, namely the one that's included in the distribution. 
This takes a significant amount of work, but it's very nice because you only have one version of everything, and maximal space efficiency. Of course there are still updates, but they happen once every six months or several years, and then everything is updated at once. That does mean that you don't get the latest version right away, but also that things just work and don't suddenly break. (Cathedral!) + +Another way to fix the multiple-versions-of-X problem is to use environments. An environment is a separate part of the computer into which packages can be installed, in such a way that only packages within the environment are combined. So now you can install one application in one environment with one version of X, and the other application in another environment with another version of X. That costs more disk space, but it's easier to get the latest stuff, and it doesn't require all the work of constantly ensuring everything is compatible. So this makes option 2) look a bit more like option 1) again, although you can still have fewer environments than you have applications. (Bazaar!) + +## Which option to choose when + +Scientific software is often a script, which is basically the topmost component in the whole collection of components. Scripts mostly just tell other components what to do. Since the script isn't used by other components, it can be packaged as an application in either of the above-mentioned ways. Users can then install and run it to *reproduce* the results, but not easily use it in their own script or modify it to do something different but related. + +Sometimes, scientists (or Research Software Engineers!) develop components that are intended for use by others in their scripts, or even in other components. Those need to be packaged as packages for a package manager, because they need to be combined with other packages on the user's computer. (The user is a programmer, in this case!) This allows the software to be *reused* by others in their scripts. 
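
Combining packages on the user's computer is where the version conflict described earlier shows up. The toy model below sketches that conflict and how separate environments avoid it. It is an illustration only: the application names, the version numbers, and the functions `install_shared` and `install_in_environments` are hypothetical, and real package managers resolve version ranges rather than exact pins.

```python
# Toy model: each application pins the one version of component X it works with.
# Application names and version numbers are hypothetical.
requirements = {
    "old_app": {"X": "1.0"},
    "new_app": {"X": "2.0"},
}


def install_shared(reqs):
    """Install everything into one shared location; fail on conflicting versions."""
    installed = {}
    for app, deps in reqs.items():
        for package, version in deps.items():
            # A package that is already present must have the same version.
            if installed.get(package, version) != version:
                raise RuntimeError(
                    f"{app} needs {package} {version}, "
                    f"but {package} {installed[package]} is already installed"
                )
            installed[package] = version
    return installed


def install_in_environments(reqs):
    """Give each application its own environment, so versions never meet."""
    return {app: dict(deps) for app, deps in reqs.items()}


try:
    install_shared(requirements)
except RuntimeError as error:
    print("shared install failed:", error)

environments = install_in_environments(requirements)
print(environments)
```

The shared install stops with an error, which matches the behaviour of a good package manager: it reports the conflict but cannot decide for you. The per-environment install succeeds at the cost of storing X twice.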
+ +Finally, for others to be able to modify the software and perhaps contribute new features or fixes back to it, the source code of the software needs to be available through a public repository. Package managers and installers don't normally install software in a way that makes it easy to modify, as that's not what they're designed for. To be able to modify the software, you need the source code, in a version control system. So besides publishing to a package or container repository, don't forget to make a public git repository too! \ No newline at end of file diff --git a/modules/distributing/exercise-tracking.md b/modules/distributing/exercise-tracking.md new file mode 100644 index 00000000..8e12c82e --- /dev/null +++ b/modules/distributing/exercise-tracking.md @@ -0,0 +1,19 @@ +--- +title: Dependency tracking +type: exercise +order: 3 +--- + +## Dependency tracking (10 minutes) + +A common place to specify dependencies is in a file called `requirements.txt`, `pyproject.toml` or `environment.yml`. + +Go into a source code repository of a piece of software you know and try to track down dependencies. Try to also find the source code of one of the dependencies and see if you can find the dependencies of this dependency. How many layers of this "dependency tree" can you follow? 
+ +You can also use one of the following projects: + +- [ESMValTool](https://research-software-directory.org/software/esmvaltool) +- [LitStudy](https://research-software-directory.org/software/litstudy) +- [Haddock](https://research-software-directory.org/software/haddock3) +- [worcs](https://cjvanlissa.github.io/worcs/index.html) +- [democracy-topic-modelling](https://research-software-directory.org/software/democracy-topic-modelling) \ No newline at end of file diff --git a/modules/distributing/further-reading.md b/modules/distributing/further-reading.md new file mode 100644 index 00000000..5ee47319 --- /dev/null +++ b/modules/distributing/further-reading.md @@ -0,0 +1,7 @@ +--- +title: Further reading +type: reading +order: 5 +--- + +- Blogpost: [Understanding the “Why” of VM’s, Containers, & Virtual Environments](https://medium.com/kitchen-sink-data-science/software-fundamentals-for-machine-learning-series-understanding-the-why-of-vms-containers-89621cf66d23) Blogpost on the difference between \ No newline at end of file diff --git a/modules/distributing/index.md b/modules/distributing/index.md new file mode 100644 index 00000000..4165b359 --- /dev/null +++ b/modules/distributing/index.md @@ -0,0 +1,13 @@ +--- +title: Distributing Software +category: Good Practices +order: 15 +abstract: Software needs to be distributed to be used by others. What are environments, packages and containers and how do they help? 
+author: eScience Center +thumbnail: "thumbnail-containers.jpg" +visibility: visible +--- + + +Photo by frank mckenna on Unsplash + \ No newline at end of file diff --git a/modules/distributing/info.md b/modules/distributing/info.md new file mode 100644 index 00000000..44aac77f --- /dev/null +++ b/modules/distributing/info.md @@ -0,0 +1,10 @@ +--- +title: Learning objectives +type: info +order: 0 +--- + +Obtain the skills and knowledge necessary to address the following questions: +- What is software distribution and what aspects of it are important for research software? +- Why is it important to think about dependency management? +- What are environments, dependencies, packages and containers? \ No newline at end of file diff --git a/modules/distributing/media/distributing-software-layers.png b/modules/distributing/media/distributing-software-layers.png new file mode 100644 index 00000000..84f36c17 Binary files /dev/null and b/modules/distributing/media/distributing-software-layers.png differ diff --git a/modules/distributing/media/fire.png b/modules/distributing/media/fire.png new file mode 100644 index 00000000..80e9104a Binary files /dev/null and b/modules/distributing/media/fire.png differ diff --git a/modules/distributing/media/shopping-list.png b/modules/distributing/media/shopping-list.png new file mode 100644 index 00000000..6ec6fb66 Binary files /dev/null and b/modules/distributing/media/shopping-list.png differ diff --git a/modules/distributing/media/thumbnail-containers.jpg b/modules/distributing/media/thumbnail-containers.jpg new file mode 100644 index 00000000..db40ca01 Binary files /dev/null and b/modules/distributing/media/thumbnail-containers.jpg differ diff --git a/modules/distributing/slides-distributing.md b/modules/distributing/slides-distributing.md new file mode 100644 index 00000000..cfb3321e --- /dev/null +++ b/modules/distributing/slides-distributing.md @@ -0,0 +1,111 @@ +--- +title: Distributing Software +type: slides +order: 1 +author: 
Jaro Camphuijsen, Lourens Veen +--- + + + +# Distributing Software + +=== + + + +## Why distribute? + +- For your future self +- For others that might be interested +- For reproducibility +- For reusability + +note: +There are many reasons why you would want to distribute your software. + +=== + + + +## Why can't I just publish and be done? + +- A piece of software never operates in isolation. +- Depends on other software (third-party packages, libraries) +- Depends on system software (operating system, drivers, firmware) +- Depends on hardware (your computer and the chips inside, display or printer) +- The world (hardware, software, people) around your software is constantly evolving + +note: +Software by nature always depends on other software and hardware. + +=== + + + +note: Sometimes you enter dependency hell + +=== + + + +## What issues may arise? + +- Many dependencies +- Long chains of dependencies +- Conflicting dependencies +- Circular dependencies +- Package manager dependencies +- Diamond dependencies + +... and all of these are changing. + + +=== + + + +## What solutions exist? + +Isolation or specification + +=== + + + +## Isolation + +![Layers of isolation](media/distributing-software-layers.png) + +=== + + + +## Specification + +Let the user (or some tool) solve the problem... + +- requirements.txt +- environment.yml +- pyproject.toml +- package.json +etc... + +note: +Specify the dependencies in a file and let the user build their own environment, container or VM. + +=== + +## Considerations + +- A large amount of isolation enhances reproducibility but decreases flexibility. +- Leaving it up to the user can be done for simple scripts (most research software) + +=== + +## Rules of thumb + +- Simple scripts can use a simple dependency specification +- If other software might depend on this software, package it +- To archive a specific software version and its environment, you could use a container + +=== \ No newline at end of file