docs: describe all data collection #3216

Merged
merged 6 commits into from
Jul 25, 2024
Changes from 4 commits
2 changes: 1 addition & 1 deletion docs/explanations.rst
@@ -73,4 +73,4 @@ Other Pro features explained
explanations/what_is_the_daemon.md
explanations/errors_explained.md
explanations/deprecation_policy.rst

explanations/data_collection.rst
89 changes: 89 additions & 0 deletions docs/explanations/data_collection.rst
@@ -0,0 +1,89 @@
What data does Canonical collect from Ubuntu Pro machines?
Contributor

I'd probably suggest changing this header to "what data does Canonical collect through the Ubuntu Pro Client"

The reason being that the contents of the page are relevant to people who haven't attached yet (and want the info to know if they want to attach), or who have detached and aren't using Pro anymore (can we consider their machine an Ubuntu Pro machine?)

Collaborator Author

Two of the sections are data not collected directly through the Ubuntu Pro Client, but the collection happens because you used the pro-client to attach to Pro. So I'm not sure "collect through the Ubuntu Pro Client" is quite accurate.

A detached machine shouldn't be considered an Ubuntu Pro machine - they are in the same bucket as never attached machines for the purpose of this doc.
None of this data is collected for unattached machines, and I don't think the current title would prevent someone wondering about data collection from looking at it. We could rename it to "What data does Canonical collect from Ubuntu machines that are attached to an Ubuntu Pro subscription", but that felt unnecessarily verbose.

IDK though, I don't have a strong opinion on the title here, just wanted to list my thoughts before changing it. What do you think?

Contributor

I think that depends on how long we keep data for. Is machine data purged when the machine is detached?

Collaborator Author

Data is not purged on detach - so a detached machine would have had data collected while it was Ubuntu Pro, and after it is detached, no more data will be collected, but data will exist on the backend for some time.

Contributor

If data isn't purged on detach, then I think there is a difference between detached and never-attached machines since users who care about this stuff want to know what we collect, why, and how long we expect to keep that data for.

I think tweaking the title to say "What data is collected from active Ubuntu Pro machines?" would be enough to satisfy the distinction, especially if we can also provide info on how long it takes before collected info is purged (although I wouldn't consider that a blocker).

Collaborator Author

I think that makes sense!

**********************************************************

Some system data is sent to Canonical servers for the purpose of delivering
Ubuntu Pro services in compliance with the terms of the Ubuntu Pro subscriptio
Contributor

s/subscriptio/subscription

Collaborator Author

whoops 🙃

. This data is sent via a few different methods, depending on the service and
the purpose of that particular data element.

This document categorises data collection by method of collection.

APT package downloads
Contributor

I could be wrong here, but the APT package downloads and Livepatch downloads don't seem to be directly tied to data collection per se. They look more like data each service needs in order to work than a collection of some sort.

I think there is still value in those sections, but I would not put them under data collection. We could create something like "service data needs" for them.

Contributor

I think that's a good idea :)

Collaborator Author

Yeah that's a good point, this is data that is sent to canonical servers for the purposes of using the services. I'll rework the structure of this a bit to make that more clear.

Collaborator Author

@s-makin I've restructured the document and headers around this. All the content is the same, but I'm not sure about the new header "Data sent in order to provide service" - do you have any better ideas?

=====================

If you have any of the following services enabled, then the data collection
method described below will be in use whenever downloading packages for one of
these services.

- ``anbox-cloud``
- ``cc-eal``
- ``cis``
- ``esm-apps``
- ``esm-infra``
- ``fips``
- ``fips-preview``
- ``fips-updates``
- ``realtime-kernel``
- ``ros``
- ``ros-updates``
- ``usg``
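
As a rough illustration of the list above, the following sketch checks which of a machine's enabled services fall into the APT-backed group. The function name and the input list are hypothetical; only the service names come from the list itself.

```python
# Service names copied from the list above.
APT_BACKED_SERVICES = frozenset({
    "anbox-cloud", "cc-eal", "cis", "esm-apps", "esm-infra", "fips",
    "fips-preview", "fips-updates", "realtime-kernel", "ros",
    "ros-updates", "usg",
})


def apt_backed_enabled(enabled_services):
    """Return the sorted subset of enabled services that download packages via APT.

    `enabled_services` is a hypothetical list of service names, for example
    gathered by hand from the machine's Pro status output.
    """
    return sorted(APT_BACKED_SERVICES.intersection(enabled_services))


# "livepatch" is not in the list above, so it is filtered out here.
print(apt_backed_enabled(["livepatch", "esm-infra", "esm-apps"]))
```

If the result is empty, the APT download requests described in this section would not be expected for that machine.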

Whenever you ``apt install`` a package from a Pro service (or ``apt upgrade``
to a version of a package from a Pro service), ``apt`` will make a GET request
to ``esm.ubuntu.com`` that includes the package name and version, and HTTP
basic auth credentials that are tied to the Ubuntu Pro subscription.

For example, installing the ``hello`` package from ``esm-apps`` will result in
a request that looks something like this:

.. code-block:: text

   https://bearer:<token>@esm.ubuntu.com/apps/ubuntu/pool/main/h/hello/hello_2.10-2ubuntu4+esm1_amd64.deb

This request is necessary to download the Pro update and includes the
following data.

- Ubuntu codename (e.g. "Jammy")
- Package name (e.g. "hello")
- Package version (e.g. "2.10-2ubuntu4+esm1")
- Package architecture (e.g. "amd64")

Because this request needs to be authenticated and the authentication token is
tied to a particular Ubuntu Pro subscription, this data is inherently tied to
the Ubuntu Pro subscription that authenticated access to the package.
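
To make the point concrete, here is a minimal sketch of how the package name, version and architecture can be read straight out of such a download URL. It assumes the usual Debian ``name_version_arch.deb`` filename layout; the function name and the ``TOKEN`` placeholder are illustrative and not taken from the client.

```python
import posixpath
from urllib.parse import urlsplit


def describe_download(url):
    """Split a .deb download URL into the data elements it exposes."""
    parts = urlsplit(url)
    filename = posixpath.basename(parts.path)
    # Drop the ".deb" suffix, then split on "_" (name, version, architecture).
    stem = filename[:-len(".deb")] if filename.endswith(".deb") else filename
    name, version, arch = stem.split("_")
    return {
        "host": parts.hostname,
        "has_credentials": parts.username is not None,
        "package": name,
        "version": version,
        "architecture": arch,
    }


# TOKEN is a placeholder, not a real credential.
example = (
    "https://bearer:TOKEN@esm.ubuntu.com/apps/ubuntu/pool/main/h/hello/"
    "hello_2.10-2ubuntu4+esm1_amd64.deb"
)
print(describe_download(example))
```

Anything the sketch can recover from the URL is equally visible to the server that receives the request, alongside the credentials that identify the subscription.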

Livepatch downloads
===================

If you have ``livepatch`` enabled, then the following data is sent in order to
download the correct kernel patches:

- Kernel version (e.g. "6.8.0-38.38-generic")
- Machine architecture (e.g. "amd64")

Similarly to APT package downloads, because this request needs to be
authenticated and the authentication token is tied to a particular Ubuntu Pro
subscription, this data is inherently tied to the Ubuntu Pro subscription that
authenticated access to the package.
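
For readers who want to see comparable values on their own machine, the sketch below reads the running kernel release and the machine architecture using Python's standard library. This is only an approximation: the exact strings Livepatch sends may be formatted differently (the kernel example above has a different shape from a typical ``uname -r`` value), and the architecture mapping covers just two common cases.

```python
import platform

# Map kernel-style machine names to Debian/Ubuntu architecture names.
# Only two common cases are listed; anything else is passed through as-is.
DEB_ARCH = {"x86_64": "amd64", "aarch64": "arm64"}


def local_kernel_info():
    """Return the local kernel release and architecture as plain strings."""
    machine = platform.machine()
    return {
        "kernel_release": platform.release(),
        "architecture": DEB_ARCH.get(machine, machine),
    }


print(local_kernel_info())
```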


Machine activity checks
=======================

Regardless of which services you have enabled, if a machine is attached to an
Ubuntu Pro subscription, the following data is collected and updated regularly
(default: every 6 hours).

- Distribution (e.g. "Ubuntu")
- Release codename (e.g. "Noble")
- Kernel version (e.g. "6.8.0-38.38-generic")
- Machine architecture (e.g. "amd64")
- Is the machine a desktop? (e.g. "true")
- Virtualisation type (e.g. "Docker")
- Services enabled (e.g. "ros" and "realtime-kernel generic variant")
- When the machine was attached (e.g. "2024-07-24T13:54:07+00:00")
- Version of ``ubuntu-pro-client`` (e.g. "33.2~24.04")

These data elements are collected to ensure machines that are attached to a
particular Ubuntu Pro contract are compliant with the terms of that particular
contract.
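
The data elements listed above can be pictured as a single record. The sketch below models them as a Python dataclass populated with the example values from the list; the class name, field names and JSON shape are assumptions for illustration and do not describe the actual wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

# Default update interval stated above.
DEFAULT_INTERVAL_HOURS = 6


@dataclass
class ActivityReport:
    """Hypothetical container for the data elements listed above."""
    distribution: str
    release_codename: str
    kernel_version: str
    architecture: str
    is_desktop: bool
    virtualisation_type: str
    services_enabled: List[str]
    attached_at: str
    client_version: str


report = ActivityReport(
    distribution="Ubuntu",
    release_codename="Noble",
    kernel_version="6.8.0-38.38-generic",
    architecture="amd64",
    is_desktop=True,
    virtualisation_type="Docker",
    services_enabled=["ros", "realtime-kernel"],
    attached_at="2024-07-24T13:54:07+00:00",
    client_version="33.2~24.04",
)

print(json.dumps(asdict(report), indent=2))
```

Laying the fields out this way shows that the record describes the machine and its Pro configuration rather than its user.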
Contributor

Could any of this info/data be considered as personally identifiable?

Do we know roughly how long is data kept for?

Collaborator Author

I don't think any of this counts as personally identifiable on its own, but it is connected to an Ubuntu Pro account on the backend via a machine id.

> Do we know roughly how long is data kept for?

I don't know the answer to this one. Tagging @pandrey2003 and @alnvdl-work

Contributor

I won't consider this as a blocker to getting this merged. We can always add a section later at the bottom here about data retention.
