-
Notifications
You must be signed in to change notification settings - Fork 0
/
demoConfig.py
2023 lines (2019 loc) · 56.6 KB
/
demoConfig.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# We use python here instead of JSON so that we have the convenience of writing queries verbatim.
exampleList = [
{
"heading": "Welcome",
"description": '''
<p class="desc">
Welcome to the Diffix for PostgreSQL Training App. Use this app to understand the capabilities and limitations of pg_diffix, the PostgreSQL extension that implements Diffix Fir. Follow the prepared queries in the topics on the left, or write your own queries.
<p class="desc">
For more information, visit the Open Diffix project website at <a target=_blank href="https:/open-diffix.org">open-diffix.org</a>, or contact us at [email protected].''',
"dbname": "",
"mode": "trusted",
"diffix": {
"sql": ""
},
"native": {
"sql": ""
}
},
{
"heading": "Using the app",
"description": '''
<p class="desc">
A series of examples are listed on the left. Each example provides SQL queries for both pg_diffix and the native data. The blue SQL window below on the left is for pg_diffix, while the green one on the right is for the raw (non-anonymized) output. You may modify the queries or write new ones. Note that the SQL syntax for pg_diffix and native SQL can be different.
<p class="desc">
For users new to the system, it is useful to take the examples in the order provided.
<p class="desc">
The app displays the results of a cached query. Click "Run" to re-execute the query for both pg_diffix and native, or to execute any changes you make to the SQL.
<p class="desc">
This app has access to several different databases; <a target=_blank href="https://www.gda-score.org/resources/databases/czech-banking-data/">banking0</a>, <a target=_blank href="https://www.gda-score.org/resources/databases/usa-census-database/">census</a>, <a target=_blank href="https://www.gda-score.org/resources/databases/database-2/">scihub</a>, and <a target=_blank href="https://www.gda-score.org/resources/databases/database-1/">taxi</a>. You must select the appropriate database from the pull-down menu if you write a query.
<p class="desc">
The app indicates how many rows are in each answer, and the query execution time for each. However, the app displays only the first 100 rows of data
<p class="desc">
In addition to the query results for both pg_diffix and native queries, the app usually displays the absolute and relative error between the noisy pg_diffix and correct native answers. The error is not displayed in cases where there is no matching column value between the cloak and the native output for the displayed rows.
''',
"dbname": "",
"mode": "trusted",
"diffix": {
"sql": ""
},
"native": {
"sql": ""
}
},
{
"heading": "<u>PART 1: BASICS</u>",
"description": '''
<p class="desc">
This first set of examples introduces the basic pg_diffix features, and the concepts of data distortion (noise and suppression). This is enough to get started with pg_diffix and produce simple but useful anonymized data.
''',
"dbname": "",
"mode": "trusted",
"diffix": {
"sql": ""
},
"native": {
"sql": ""
}
},
{
"heading": "Schema exploration",
"description": '''<p class="desc">The PostgreSQL commands for listing tables and columns work in pg_diffix.''',
"dbname": "",
"mode": "trusted",
"diffix": {
"sql": ""
},
"native": {
"sql": ""
}
},
{
"heading": "Tables",
"description": '''
<p class="desc">
pg_diffix accepts common PostgreSQL methods of listing tables. Note that '\d' works in psql.
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT tablename
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
schemaname != 'information_schema';
'''
},
"native": {
"sql": '''
SELECT tablename
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
schemaname != 'information_schema';
'''
}
},
{
"heading": "Columns",
"description": '''
<p class="desc">
pg_diffix accepts common PostgreSQL methods of listing columns. Note that '\d table_name' works in psql.
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'accounts'
'''
},
"native": {
"sql": '''
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'accounts' '''
}
},
{
"heading": "Basic queries",
"description": '''<p class="desc">
pg_diffix allows a tiny but useful subset of SQL. pg_diffix allows analysts to build multi-column histograms of counts with generalization.
''',
"dbname": "",
"mode": "trusted",
"diffix": {
"sql": ""
},
"native": {
"sql": ""
}
},
{
"heading": "Counting distinct values",
"description": '''
<p class="desc">Count the number of bank accounts.
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT count(DISTINCT account_id)
FROM accounts'''
},
"native": {
"sql": '''
SELECT count(DISTINCT account_id)
FROM accounts'''
}
},
{
"heading": "Counting events (in time series data)",
"description": '''<p class="desc">Count the number of rides in the taxi database (one ride per row).''',
"dbname": "taxi",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT count(*)
FROM jan08'''
},
"native": {
"sql": '''
SELECT count(*)
FROM jan08'''
}
},
{
"heading": "Histogram of values",
"description": '''<p class="desc">Count the number of clients in each Client District.''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT acct_district_id AS district,
count(DISTINCT client_id1)
FROM accounts
GROUP BY 1
ORDER BY 1'''
},
"native": {
"sql": '''
SELECT acct_district_id AS district,
count(DISTINCT client_id1)
FROM accounts
GROUP BY 1
ORDER BY 1'''
}
},
{
"heading": "2D Histogram (heat map)",
"description": '''
<p class="desc">
Histogram of counts of individuals by number of marriages per 5-year age group.
<p class="desc">
Note the use of floor() to generalize the age column.
<p class="desc">
<font color="red">Note query takes around 1/2 minute</font>
''',
"dbname": "census0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT diffix.floor_by(age,5) AS age,
marrno AS marriages,
count(*)
FROM uidperhousehold
GROUP BY 1,2
ORDER BY 1,2
'''
},
"native": {
"sql": '''
SELECT floor(age/5)*5 as age,
marrno AS marriages,
count(*)
FROM uidperhousehold
GROUP BY 1,2
ORDER BY 1,2
'''
}
},
{
"heading": "Sums",
"description": '''
<p class="desc">
Histogram over type of operation of sum of transactions amounts.
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT operation,
sum(amount)
FROM transactions
GROUP BY 1
ORDER BY 2
'''
},
"native": {
"sql": '''
SELECT operation,
sum(amount)
FROM transactions
GROUP BY 1
ORDER BY 2
'''
}
},
{
"heading": "Averages",
"description": '''
<p class="desc">
Histogram over type of operation of average of transactions amounts.
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT operation,
avg(amount)
FROM transactions
GROUP BY 1
ORDER BY 2
'''
},
"native": {
"sql": '''
SELECT operation,
avg(amount)
FROM transactions
GROUP BY 1
ORDER BY 2
'''
}
},
{
"heading": "A common mistake",
"description": '''
<p class="desc">
SELECT * FROM table LIMIT X
''',
"dbname": "",
"mode": "trusted",
"diffix": {
"sql": ""
},
"native": {
"sql": ""
}
},
{
"skip": False,
"expectErr": True,
"heading": "SELECT * ... LIMIT X",
"description": '''
<p class="desc">
TODO: One of the first things an analyst may do when presented with a new database is:
<p class="desc">
<span style="font-family:'Courier New'">SELECT * ... LIMIT X</span>
<p class="desc">
This gives the analyst an immediate impression of what data he or she is dealing with.
<p class="desc">
pg_diffix cannot return any useful information with this query, because it filters out any information related to one or a few users. Rather than attempt to run the query (which would take a very long time), pg_diffix recognizes that the answer would contain nothing and returns an error message to that effect.
<p class="desc">
''',
"dbname": "scihub",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT *
FROM sep2015
ORDER BY uid
LIMIT 10
'''
},
"native": {
"sql": '''
SELECT *
FROM sep2015
LIMIT 10
'''
}
},
{
"heading": "Distortion",
"description": '''
<p class="desc">
Diffix Fir distorts data in two ways. First, it adds noise to counts. Second, it suppresses output bins that pertain to too few protected entities (e.g. individual persons).
<p class="desc">
Read more
<a target=_blank href="
https://www.open-diffix.org/blog/diffix-elm-automates-what-statistics-offices-have-been-doing-for-decades
">here</a>.
''',
"dbname": "",
"mode": "trusted",
"diffix": {
"sql": ""
},
"native": {
"sql": ""
}
},
{
"heading": "Sticky noise",
"description": '''
<p class="desc">
pg_diffix has a unique way of adding noise which we call "sticky noise". Sticky means that the same query produces the same noise. Try re-running this query, and you will see that you get the same noisy answer every time.
<p class="desc">
Note that the absolute noise (the "abs" column in red) is relatively small; rarely more than plus or minus 5. This is always the case when counting persons (the protected entity).
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT acct_district_id,
count(DISTINCT client_id1)
FROM accounts
GROUP BY 1
ORDER BY 1
'''
},
"native": {
"sql": '''
SELECT acct_district_id,
count(DISTINCT client_id1)
FROM accounts
GROUP BY 1
ORDER BY 1
'''
}
},
{
"heading": "Proportional noise",
"description": '''
<p class="desc">
pg_diffix adds enough noise to hide the influence of individual users. When counting the number of distinct persons (or whatever the protected entity is), then each person contributes exactly one to the count, and so the amount noise is both small and predictable.
<p class="desc">
When counting the number of rows for time-series data, then some persons contribute more than others. The amount of noise inserted by pg_diffix increases to effectively hide heavy contributors.
<p class="desc">
In this query, noise levels are much higher; absolute error is easily plus or minus 1500. Relative error varies substantially from less than a percent to 10% or more.
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT cli_district_id1, count(*)
FROM transactions
GROUP BY 1
ORDER BY 1
'''
},
"native": {
"sql": '''
SELECT cli_district_id1, count(*)
FROM transactions
GROUP BY 1
ORDER BY 1
'''
}
},
{
"heading": "Suppression (text)",
"description": '''
<p class="desc">
This example queries for the number of clients with each last name and displays them in descending order.
<p class="desc">
Simply adding noise to an answer is not enough to preserve anonymity. If there is only one user in the database with a given last name, then merely displaying this last name would single out that person and therefore be considered personal data by GDPR criteria.
<p class="desc">
The native answer here shows that there are over 3000 distinct last names in the database. pg_diffix, however, reveals only a fraction of these names: those that are shared by enough clients. The remaining names are hidden.
<p class="desc">
To inform the analyst that last names have been suppressed, and to give an indication of how much data has been suppressed, pg_diffix places all of the suppressed rows in a bucket labeled
 <span style="font-family:'Courier New'">*</span> 
and then displays the anonymized aggregate for that bucket.
<p class="desc">
For this query, essentially what happens is that all suppressed last names are replaced with the value
 <span style="font-family:'Courier New'">*</span> 
and then displayed as though
 <span style="font-family:'Courier New'">*</span> 
is a last name. From this we see that there are nearly 4000 clients whose last names have been suppressed.
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT lastname1,
count(DISTINCT client_id1)
FROM accounts
GROUP BY 1
ORDER BY 2 DESC
'''
},
"native": {
"sql": '''
SELECT lastname1,
count(DISTINCT client_id1)
FROM accounts
GROUP BY 1
ORDER BY 2 DESC
'''
}
},
{
"heading": "Suppression (numbers)",
"description": '''
<p class="desc">
This query similarly has substantial suppression, but this time displaying numbers instead of text.
<p class="desc">
In this case, rather than return
 <span style="font-family:'Courier New'">*</span> 
as the default symbol for identifying the suppression bucket, pg_diffix returns
 <span style="font-family:'Courier New'">NULL</span> 
(which here is displayed as 'None' because of the python implementation of this training program). pg_diffix can't return
 <span style="font-family:'Courier New'">*</span> 
for numbers because
 <span style="font-family:'Courier New'">*</span> 
is a string and therefore the wrong type.
<p class="desc">
Note however that NULL values may represent both suppressed data and true NULL entries in the data.
''',
"dbname": "taxi",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT pickup_latitude,
count(*)
FROM jan08
GROUP BY 1
ORDER BY 2 DESC
'''
},
"native": {
"sql": '''
SELECT pickup_latitude,
count(*)
FROM jan08
GROUP BY 1
ORDER BY 2 DESC
'''
}
},
{
"heading": "Generalization",
"description": '''
<p class="desc">
The key to avoiding excessive suppression or noise is to generalize data (or select fewer columns). Larger bins are less likely to be suppressed, and the relative noise is also less.
<p class="desc">
Fundamentally, Diffix Fir lets the analyst trade-off precision for distortion. Data with less generalization or more columns is more precise, but suffers from increased relative noise and more suppression. Data with more generalization or fewer columns has less distortion, but is also less precise.
<p class="desc">
pg_diffix supports text and numeric columns.
<p class="desc">
The generalization function for text is:
<p class="desc">- substring(col,offset,length)</p>
<p class="desc">
The generalization functions for numeric are:
<p class="desc">- diffix.floor_by(col,bin_size)</p>
<p class="desc">- diffix.round_by(col,bin_size)</p>
<p class="desc">- diffix.ceil_by(col,bin_size)</p>
<p class="desc">- width_bucket(col,low,high,count)</p>
''',
"dbname": '',
"mode": "trusted",
"diffix": {
"sql": ''
},
"native": {
"sql": ''
}
},
{
"heading": "Heatmap example",
"description": '''
<p class="desc">
This query counts the number of taxi rides in each hour in a square grid of roughly 1km by 1km. This query could be used to build a set of space/time heatmaps showing how the number of taxi rides differs in different locations, and changes over the course of the day. (See
<a target=_blank href="
https://taxi-heatmap.open-diffix.org/
">here</a> for a demo of just this kind of heatmap.)
<p class="desc">
This example demonstrates the use of both text and numeric generalization.
<p class="desc">
The native column type for the pickup_datetime column is datetime, so it must be cast as text for the substring() function. (The native query could have used date_trunc(), but we use substring() here as well to produce identical values.)
<p class="desc">
The amount of suppression is relatively small (slightly over 5000 of 440K rows, or roughly 1%)
<p class="desc">
The 99% of the original data that is not suppressed constitutes less than half of the native output bins (around 2000 output bins from Diffix versus over 4500 from the native query). In other words, the 1% of suppressed data is spread over a long tail of bins with very small counts.
''',
"dbname": "taxi",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT substring(cast(pickup_datetime
AS text),1,13) AS hour,
diffix.round_by(pickup_latitude,0.01)
AS lat,
diffix.round_by(pickup_longitude,0.01)
AS lon,
count(*)
FROM jan08
GROUP BY 1,2,3
ORDER BY 1,2,3
'''
},
"native": {
"sql": '''
SELECT substring(cast(pickup_datetime
AS text),1,13) AS hour,
diffix.round_by(pickup_latitude,0.01)
AS lat,
diffix.round_by(pickup_longitude,0.01)
AS lon,
count(*)
FROM jan08
GROUP BY 1,2,3
ORDER BY 1,2,3
'''
}
},
{
"heading": "Example 2",
"description": '''
<p class="desc">
This query counts the number of transactions for 11 equal-sized buckets for the column amount in the range from 0 to 80000.
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT width_bucket(amount,0,80000,10),
count(*)
FROM transactions
GROUP BY 1
ORDER BY 1
'''
},
"native": {
"sql": '''
SELECT width_bucket(amount,0,80000,10),
count(*)
FROM transactions
GROUP BY 1
ORDER BY 1
'''
}
},
{
"heading": "<u>PART 2: ADVANCED</u>",
"description": '''
<p class="desc">
This part of the training application introduces and number of additional query features and shows how those features can be used to run a variety of PostgreSQL statistical operations.
<p class="desc">
This part also digs further into the effects of noise and suppression. Finally, it shows how a user can view a variety of Diffix configuration parameters, and describes other aspects of pg_diffix operation.
<p class="desc">
These examples complete the description of the features and limitations of pg_diffix.
''',
"dbname": "",
"mode": "trusted",
"diffix": {
"sql": ""
},
"native": {
"sql": ""
}
},
{
"heading": "Post-processing",
"description": '''
<p class="desc">
Nested queries are possible with pg_diffix, but only the inner-most query is anonymized. Correspondly, the inner-most query is very limited in its SQL syntax, mainly what is described in Part 1, plus a few additional functions described in Part 2.
<p class="desc">
The outer query or queries, however, have no SQL restrictions. These outer queries can act as a kind of post-processing for the purpose of formatting data, computing aggregates, or applying logic filters. Examples using post-processing are sprinkled throughout this part.
<p class="desc">
Note that there are a number of functions that are effectively post-processing even without a nested query. These include ORDER BY, LIMIT, and HAVING.
''',
"dbname": "",
"mode": "trusted",
"diffix": {
"sql": ""
},
"native": {
"sql": ""
}
},
{ "heading": "Data formatting",
"description": '''
<p class="desc">
Here we give a simple data formatting example. In this query, the age column is post-processed to make it clear what the age ranges are.
<p class="desc">
<font color="red">Note query takes several 10s of seconds.</font>
''',
"dbname": 'census0',
"mode": "trusted",
"diffix": {
"sql": '''
SELECT concat(age,'-',age+4) AS age,
num_persons FROM
( SELECT diffix.floor_by(age,5) AS age,
count(*) AS num_persons
FROM uidperhousehold
GROUP BY 1
ORDER BY 1
) t
'''
},
"native": {
"sql": '''
SELECT concat(age,'-',age+4) AS age,
num_persons FROM
( SELECT diffix.floor_by(age,5) AS age,
count(*) AS num_persons
FROM uidperhousehold
GROUP BY 1
ORDER BY 1
) t
'''
}
},
{
"heading": "WHERE filter",
"description": '''
<p class="desc">
The queries shown in Part 1 of this training app do not filter data: histograms from all of the data are built from each query. This can lead to very slow queries for large tables. To improve performance, pg_diffix supports WHERE clauses with AND logic and simple expressions.
<p class="desc">
WHERE clauses in pg_diffix are implemented as a pre-processing step (that is, it occurs before anonymization). The rows of data that pass the WHERE filter are presented to Diffix as a table over which Diffix operates.
''',
"dbname": "",
"mode": "trusted",
"diffix": {
"sql": ""
},
"native": {
"sql": ""
}
},
{
"skip": False,
"expectErr": True,
"heading": "Without WHERE",
"description": '''
<p class="desc">
Suppose we would like to count the total number of distinct IP addresses for SciHub queries originating in the city of Aachen, Germany. We can get that without the WHERE clause by requesting a histogram over all cities, and post-processing for just Aachen.
<p class="desc">
This takes several 10s of seconds.
<p class="desc">
(Note that HAVING is a post-processing filter. It operates on the anonymized output of the city histogram.)
''',
"dbname": "scihub",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT city, count(DISTINCT uid)
FROM sep2015
GROUP BY 1
HAVING city = 'Aachen'
'''
},
"native": {
"sql": '''
SELECT city, count(DISTINCT uid)
FROM sep2015
GROUP BY 1
HAVING city = 'Aachen'
'''
}
},
{
"skip": False,
"expectErr": True,
"heading": "With WHERE",
"description": '''
<p class="desc">
Here is the same query, but this time pre-filtering for city. Note that the query runs much faster (less than 2 seconds, versus several 10s of seconds without the WHERE).
''',
"dbname": "scihub",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT city, count(DISTINCT uid)
FROM sep2015
WHERE city = 'Aachen'
GROUP BY 1
'''
},
"native": {
"sql": '''
SELECT city, count(DISTINCT uid)
FROM sep2015
WHERE city = 'Aachen'
GROUP BY 1
'''
}
},
{
"skip": False,
"expectErr": True,
"heading": "With generalization",
"description": '''
<p class="desc">
WHERE expressions are limited to the forms 'col = val' or 'gen_func() = val', where 'gen_func()' is a generalization function. This example shows the latter case using 'diffix.floor_by()'.
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT count(*)
FROM loans
WHERE diffix.floor_by(amount,10000) = 20000
'''
},
"native": {
"sql": '''
SELECT count(*)
FROM loans
WHERE diffix.floor_by(amount,10000) = 20000
'''
}
},
{
"heading": "Histograms of Counts",
"description": '''
<p class="desc">
Suppose a query showed that 100 clients had 100000 transactions. We would like to know how those transactions were distributed among the 100 clients. Did all 100 clients have 1000 transactions each? Did 10 clients have 9100 transactions each while the other 90 had 100 each?
<p class="desc">
In normal SQL, one way to generate a count distribution would be with a nested query:
<pre>
SELECT cnts, count(*) FROM (
SELECT client, count(*) AS cnts
FROM table GROUP BY 1 ) t
</pre>
In Diffix Fir, this query would produce garbage, because in anonymizing the inner query, everything would be suppressed. While a future version of Diffix may handle this properly by postponing anonymization to the outer query, Diffix Fir has a special function, 'diffix.count_histogram()', to compute a histogram of counts.
''',
"dbname": "",
"mode": "trusted",
"diffix": {
"sql": ""
},
"native": {
"sql": ""
}
},
{
"heading": "count_histogram()",
"description": '''
<p class="desc">
This query gives the number of accounts that have had so many loans. From this, we see that about 20% of the accounts have had two loans, while the rest only one (unless a very small number of accounts have had more than two loans and have been suppressed, though the native result shows us that this is not the case).
<p class="desc">
The Diffix SQL on the left implements the Native SQL on the right. The Diffix SQL is rather complex and needs explanation.
<p class="desc">
The innermost query executes the 'diffix.count_histogram()' function. It produces an anonymized result in the form of an ARRAY (actually, an ARRAY of ARRAYs). The special function 'diffix.unnest_histogram()' returns a table of ARRAYs each with two values. The first value is the count of events, and the second value is the number of protected entities (e.g. individuals) with the corresponding count.
<p class="desc">
Note that 'diffix.count_histogram()' only takes columns tagged as AID columns as its argument (i.e. columns that identify the protected entity).
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT unnested[1] AS num_loans,
unnested[2] AS num_accounts
FROM (
SELECT
diffix.unnest_histogram(cnt_hist)
AS unnested
FROM (
SELECT
diffix.count_histogram(account_id)
AS cnt_hist
FROM loans
) t1
) t2
ORDER BY num_loans
'''
},
"native": {
"sql": '''
SELECT num_loans,
count(*) as num_accounts
FROM (
SELECT account_id,
count(*) AS num_loans
FROM loans
GROUP BY 1
) t
GROUP BY 1
ORDER BY num_loans
'''
}
},
{
"heading": "With generalization",
"description": '''
<p class="desc">
The 'diffix.count_histogram()' function can also generalize counts. This is done with a second argument which acts identically to the `diffix.floor_by()` generalization function.
<p class="desc">
This query gives the number of accounts that have had so many transactions, generalized as bins of 100 transactions.
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT unnested[1] AS num_trans,
unnested[2] AS num_accounts
FROM (
SELECT
diffix.unnest_histogram(cnt_hist)
AS unnested
FROM (
SELECT
diffix.count_histogram(account_id,100)
AS cnt_hist
FROM transactions
) t1
) t2
ORDER BY num_trans
'''
},
"native": {
"sql": '''
SELECT (floor(num_trans/100)*100)::integer
AS num_trans,
count(*) as num_accounts
FROM (
SELECT account_id,
count(*) AS num_trans
FROM transactions
GROUP BY 1
) t
GROUP BY 1
ORDER BY num_trans
'''
}
},
{
"heading": "As 2D histogram",
"description": '''
<p class="desc">
As expected, the 'diffix.count_histogram()' function also works in conjunction with other selected columns.
<p class="desc">
This query returns a histogram of the number of rides taken by drivers for each of the two taxi vendors.
''',
"dbname": "taxi",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT vendor_id,
unnested[1] AS num_rides,
unnested[2] AS num_drivers
FROM (
SELECT vendor_id,
diffix.unnest_histogram(cnt_hist)
AS unnested
FROM (
SELECT vendor_id,
diffix.count_histogram(hack,5)
AS cnt_hist
FROM jan08
GROUP BY 1
) t1
) t2
ORDER BY 1,2
'''
},
"native": {
"sql": '''
SELECT vendor_id,
(floor(num_rides/5)*5)::integer
AS num_rides,
count(*) as num_drivers
FROM (
SELECT vendor_id, hack,
count(*) AS num_rides
FROM jan08
GROUP BY 1,2
) t
GROUP BY 1,2
ORDER BY 1,2
'''
}
},
{
"heading": "Microdata (synthetic data)",
"description": '''
<p class="desc">
While Diffix Fir can and should be thought of as producing aggregates, technically it is capable of producing microdata (i.e. one output row per input row).
If a query has no aggregate function and corresponding GROUP BY, pg_diffix automatically assumes an aggregate of 'count(*)', and then simply expands each implied aggregate into its corresponding number of rows by repeating the set of values 'count(*)' times.
<p class="desc">
Note that this microdata is also a form of synthetic data.
''',
"dbname": "",
"mode": "trusted",
"diffix": {
"sql": ""
},
"native": {
"sql": ""
}
},
{
"heading": "With count(*)",
"description": '''
<p class="desc">
By way of example, this query has the count(*) aggregate.
<p class="desc">
Note that for clarity the queries are filtered to have only three output bins, each with small counts.
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT acct_district_id AS act,
duration AS dur,
count(*)
FROM loans
GROUP BY 1,2
HAVING acct_district_id IN (1,52,54)
AND duration = 60
ORDER BY 3
'''
},
"native": {
"sql": '''
SELECT acct_district_id AS act,
duration AS dur,
count(*)
FROM loans
GROUP BY 1,2
HAVING acct_district_id IN (1,52,54)
AND duration = 60
ORDER BY 3
'''
}
},
{
"heading": "Without count(*)",
"description": '''
<p class="desc">
This is the same query, but this time without the 'count(*)' aggregate and corresponding GROUP BY.
<p class="desc">
The microdata produced by pg_diffix tracks the original data very closely. The only difference is that the pg_diffix microdata may have a few more or less rows for each combination of values than the original data.
''',
"dbname": "banking0",
"mode": "trusted",
"diffix": {
"sql": '''
SELECT act_d, dur_d
FROM (
SELECT acct_district_id AS act_d,
duration AS dur_d
FROM loans
) t
WHERE act_d IN (1,52,54) AND dur_d = 60
ORDER BY 1,2
'''
},
"native": {
"sql": '''
SELECT act_n, dur_n
FROM (
SELECT acct_district_id AS act_n,
duration AS dur_n