Skip to content

Latest commit

 

History

History
354 lines (245 loc) · 12.5 KB

analyst_guide.md

File metadata and controls

354 lines (245 loc) · 12.5 KB

Analyst guide

This document describes features and restrictions of pg_diffix for users with anonymized access to a database. The banking notebook provides a walkthrough with examples and explanations of various mechanisms that Diffix uses to protect personal data.

Table of Contents

Access levels

Users can have one of the following access levels to a database:

  • direct - Direct (non-anonymized) access to data. Restrictions listed in this document do not apply in direct mode.
  • anonymized_trusted - Anonymized access to data. Prevents accidental release of personal data.
  • anonymized_untrusted - Anonymized access to data. Prevents intentional release of personal data.

Use SELECT diffix.access_level() to see the current access level.

Anonymizing queries

In anonymized access level, queries targeting personal tables are restricted to a limited subset of SQL. A personal table is a relation that contains data of individuals or other protected entities. Administrators identify and mark such tables during configuration.

When selecting data from personal tables, the following SQL features are allowed:

Explicit grouping

Grouping queries have the following form:

SELECT col1, col2, ..., count(...)
FROM personal_table
GROUP BY col1, col2, ...

Zero or more table columns col1, col2, ... may be specified. Numeric and string columns may optionally be generalized.

count(...) is any of the supported aggregate variants. Any number of aggregates may be specified (including none).

Example:

SELECT city, year_of_birth, count(*), diffix.count_noise(*)
FROM customers
GROUP BY city, year_of_birth

Example:

SELECT count(*), count(DISTINCT city)
FROM customers

Implicit grouping

The GROUP BY clause may be omitted:

SELECT col1, col2, ..., colX
FROM personal_table

In this case, the results are grouped implicitly by the selected columns. Bins with insufficient individuals are filtered out, and the remaining bins are repeated for (anonymized) count(*) times.

Semantically this is equivalent to:

SELECT col1, col2, ..., colX, generate_series(1, anon_count)
FROM (
  SELECT col1, col2, ..., colX, count(*) AS anon_count
  FROM personal_table
  GROUP BY col1, col2, ..., colX
) x

Example:

SELECT city, year_of_birth
FROM customers

Input filtering

The WHERE clause may specify one or more simple equality conditions which all have to match simultaneously for a row to be included in the analysis.

A simple equality condition consists of an equality between a raw or generalized column, on the left side, and a constant value, on the right side.

Combining conditions using OR or specifying oher types of conditions is not permitted.

Example:

SELECT count(*)
FROM customers
WHERE city = 'Berlin' AND gender = 'F'

Example:

SELECT city, count(*)
FROM customers
WHERE substring(date_of_birth, 1, 4) = 2000
GROUP BY city

AID columns are not allowed in filtering conditions. Administrators can additionally restrict the usage of certain columns for data filtering by untrusted users.

Unrestricted queries

Queries against personal tables are limited to the SQL subset described in the anonymized queries section. However, queries that do not target personal tables are unrestricted, meaning any SQL can be executed.

If pg_diffix.treat_unmarked_tables_as_public is set to false, then tables not marked as personal or public are not queryable in any way. Use diffix.show_labels() to see which tables are marked as personal or public.

Post processing

Any SQL may be executed on the output of anonymizing subqueries as their output is no longer considered personal.

Example: The following query selects the average number of individuals in each city.

SELECT avg(x.num_individuals)
FROM (
  SELECT city, count(*) as num_individuals
  FROM customers
  GROUP BY city
) x

Projections of grouping expressions in SELECT and the HAVING clause of anonymizing queries are also considered post processing and are accepted.

Example: The following query post processes the grouping labels and filters the output bins.

SELECT 'City: ' || upper(city), count(*)
FROM customers
GROUP BY city
HAVING count(*) > 10

Utility statements

Other SQL statements

Most utility statements are disallowed in anonymized access level. Allowed SQL statements include:

  • EXPLAIN - explain is partially allowed. Options COSTS and ANALYZE are rejected.
  • SET - a subset of parameters can be changed by non-superusers, even in anonymized mode. Note that most pg_diffix parameters require superuser access.

Any statement that alters the database (INSERT, DELETE, UPDATE, DROP, ...) or causes any other persistent state change is forbidden.

COPY statements and other indirect methods of reading data from the database are forbidden.

diffix.show_labels()

Run SELECT * FROM diffix.show_labels() to see the complete list of labels on database objects.

diffix.access_level()

Run SELECT diffix.access_level() to see the current access level of the session. Possible values are direct, anonymized_trusted, and anonymized_untrusted.

Suppress bin

When performing grouping in queries that target personal tables, bins that pertain to too few individuals are suppressed.

Diffix combines the counts of all suppressed rows to a special suppress bin. The suppress bin is the first row of the anonymized query result, unless sorted or post processed by other means. The suppress bin may itself be suppressed. All grouping columns of the suppress bin are set to * for text-typed columns and NULL for other types.

The configuration parameter pg_diffix.compute_suppress_bin controls whether the suppress bin is computed and emitted.

The configuration parameter pg_diffix.text_label_for_suppress_bin specifies the value to use for text-typed grouping labels. Default value is *.

The aggregate diffix.is_suppress_bin(*) returns true for the suppress bin, and false for any other bin. This can be used to identify the suppress bin if NULL values are ambiguous for a grouping label.

Supported functions

Aggregates

The following versions of aggregates are supported:

  • count(*) - count all rows.
  • count(col) - counts non-null occurrences of the given column.
  • count(distinct col) - counts distinct values of the given column.
  • sum(col) - sums values in the given column.
  • avg(col) - calculates the average of the given column.
  • diffix.count_histogram(aid, bin_size=1) - computes a histogram that describes the distribution of rows among entities. See below for details.

Results of these aggregates are anonymized by applying noise as described in the specification.

Each of the count(...), sum(...), avg(...) has an accompanying aggregate, which returns the approximate magnitude of noise added during anonymization (in terms of its standard deviation). These are: diffix.count_noise(...), diffix.sum_noise(...), diffix.avg_noise(...), respectively.

diffix.count_histogram(aid, bin_size)

Returns a 2-dimensional array of shape bigint[][2], where each entry is a pair of [row_count, num_entities]. The row_count represents the number of rows contributed by num_entities distinct protected entities.

Example:

SELECT diffix.count_histogram(account)
FROM transactions;

        count_histogram
--------------------------------
 {{NULL,7},{1,15},{2,13},{4,6}}
(1 row)

The result of the above query can be interpreted as: 15 accounts have made a single transaction (1 row in result bucket), 13 accounts have made 2 transactions (2 rows), 6 accounts have made 4 transactions, and 7 accounts have made some other number of transactions (identified by the NULL count).

The reported num_entities is a noisy value, but not the row_count itself. Bins with insufficient num_entities are merged to a suppress bin of shape {NULL, num_entities} where num_entities is also noisy. The suppress bin may itself be suppressed.

The optional bin_size parameter allows generalizing the bins' row_count to minimize suppression. It acts identically to the diffix.floor_by() function.

The histogram array can be unwrapped to a set of pairs by using diffix.unnest_histogram().

Restrictions: The aid parameter must be a reference to a column tagged as an AID (identifier of a protected entity).

In untrusted mode, bin_size is restricted to a money style number: 1, 2, or 5 preceeded by or followed by zeros ⟨... 0.1, 0.2, 0.5, 1, 2, 5, 10, ...⟩.

Numeric generalization functions

diffix.floor_by(col, K)

Rounds column to width K and aligns to the lower edge of the interval. Equivalent to floor(col / K) * K.

Restrictions: In untrusted mode, K is restricted to a money style number: 1, 2, or 5 preceeded by or followed by zeros ⟨... 0.1, 0.2, 0.5, 1, 2, 5, 10, ...⟩.

diffix.round_by(col, K)

Rounds column to width K and aligns to the closest edge of the interval. Equivalent to round(col / K) * K.

Restrictions: Not allowed in untrusted mode.

diffix.ceil_by(col, K)

Rounds column to width K and aligns to the upper edge of the interval. Equivalent to ceil(col / K) * K.

Restrictions: Not allowed in untrusted mode.

width_bucket(operand, low, high, count)`

Default Postgres function that returns the bucket number of the column value in an equal-width histogram.

Restrictions: Not allowed in untrusted mode.

String generalization functions

substring(text_column, start, count)

Default Postgres function.

Restrictions: In untrusted mode, only start = 1 is allowed.

Type casts

When selecting or generalizing columns of personal tables, the following type conversions are allowed:

  • All numeric types may be converted to other numeric types. Numeric types are: smallint, integer, bigint, float, double, and numeric (decimal). When converting from a real type to an integer type, implicit rounding occurs (away from zero). Such casts count as generalizing expressions and no further generalizations are allowed.
  • Date/time types may be converted to text. Date/time types are: date, time, timestamp, timetz, and timestamptz. The DateStyle configuration parameter determines the output format of the string.

Example:

SELECT substring(date_of_birth::text, 1, 4) AS year, count(*)
FROM customers
GROUP BY 1

Utility functions

diffix.is_suppress_bin(*)

Aggregate that returns true only for the suppress bin, false otherwise.

diffix.unnest_histogram(histogram)

Unnests a 2-dimensional array into a result set of 1-dimensional arrays.

Example:

SELECT diffix.unnest_histogram(diffix.count_histogram(account)) AS bins
FROM transactions;

   bins
----------
 {NULL,7}
 {1,15}
 {2,13}
 {4,6}
(4 rows)