\chapter{Design Principles}
\label{chap:design_principles}
This chapter presents the design
principles of SDS using real-world observations of contemporary
system-of-systems applications.
It describes the components that make up an SDS system
and shows how they work together to enforce end-to-end semantics while
preserving organizational autonomy.
\section{Overview}
The need for SDS systems is grounded in the real-world requirements of three sets of stakeholders in
wide-area applications today: their users, their organizations, and their developers.
\subsection{Users}
The \textbf{users} are the authoritative origins of all data in the application.
Data is produced by users for other users to consume. This is unsurprising at
first glance, since the point of having wide-area applications at all is so users can
collaborate without having to be physically present
(i.e. by communicating data to one another across the Internet). However, the
key insight here is that conventional wide-area applications such as Web
applications do not treat users as authoritative data origins at the protocol
level. At the protocol level, application servers are the authoritative data origins for
all user data.
It is only by \emph{social convention} that users are led to believe and
expect that they are the authoritative data origins. This is reflected in how
users talk about the data they create---for example, a user would say ``my
Facebook profile'' when referring to the profile data Facebook hosts, instead of the more accurate
statement ``the downstream replica of my profile data that I stored in Facebook's servers and
expect Facebook's servers to faithfully share on my behalf.''
This thesis argues for enforcing this
social convention at the protocol layer (i.e. programmatically, beneath the application)
by separating the responsibility for hosting
and serving a user's data from the responsibility of hosting and running
application code.
The fact that users assume that they are both
the data's authoritative origins and the data consumers
means that users have certain expectations regarding how applications store
their data. These expectations can be arbitrarily specific to the data, the application, and the
computer(s) through which they read and write it. For example, a user would
expect an online tax-filing application to prevent their tax form data from being read by anyone
besides themselves and the government, and would expect it to retain copies of their
filings for at least three years. As another example, a user would expect a
ride-hailing application to be accessible only through their mobile phone, and
would expect that their travel history and driver ratings would be inaccessible
to the driver.
\subsubsection{Data-hosting Policies}
Successful applications empower users to convey their expectations to
applications and other users in the form of a \textbf{data-hosting policy}.
The data-hosting policy is a machine-readable
description of how the user expects the application and other users to interact
with her data. Successful applications provide the means for users to translate their
expectations into data-hosting policies, and enforce the users'
policies on their behalf.
The data-hosting policy can take many forms, depending on the application.
For example, a social media application like Facebook allows
users to encode some aspects of a data-hosting policy in a privacy settings page.
The settings are stored in Facebook, and Facebook (ostensibly) enforces them.
As another example, a cloud administration tool like the Google Cloud
Console~\cite{google-cloud} gives its users the ability to define programmatic
hooks and scripts for hosting, retaining, and deleting log data.
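As a concrete but purely hypothetical illustration, a data-hosting policy could be
encoded as a small structured object that an organization's computers evaluate. The
field names below are illustrative assumptions, not a schema prescribed by SDS or by
any particular application.
\begin{verbatim}
# Hypothetical sketch of a machine-readable data-hosting policy.
# Field names and values are illustrative only; SDS prescribes no schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataHostingPolicy:
    # Who may read the data besides the owner.
    allowed_readers: List[str] = field(default_factory=list)
    # Storage providers permitted to hold authoritative replicas.
    allowed_providers: List[str] = field(default_factory=list)
    # Minimum number of days the data must be retained after a write.
    min_retention_days: int = 0
    # Whether data must be encrypted before leaving the owner's organization.
    encrypt_before_store: bool = True

# Example: a tax-filing user's expectations expressed as a policy object.
tax_policy = DataHostingPolicy(
    allowed_readers=["self", "gov-tax-authority"],
    allowed_providers=["trusted-cloud-storage"],
    min_retention_days=3 * 365,
)
\end{verbatim}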
This thesis is concerned with the enforcement of data-hosting policies.
Users need to be able to translate their expectations on data storage into
policies that they can enforce without having to rely on applications or storage
providers. Today, users have no technical recourse if the application
simply decides to ignore their policies; they are instead left with external
remediation options like boycotting the application or taking legal action
against the developers. Specific to system-of-systems applications,
developers are not in a position where they can plausibly enforce a user's data-hosting
policies end-to-end. This is because in order to do so,
both the application and the storage
providers must recognize and enforce the users' policies. However, in practice storage
providers are not even guaranteed to be aware that the policies exist,
since users do not interact with storage providers and do not have a direct business
relationship with them.
If users cannot rely on the application or the developer's chosen storage providers
to enforce their data-hosting policies, then they are left with three
(non-exclusive) options:
\begin{enumerate}
\item Do not use the application. This is not a feasible option for
most users.
\item Only use the application if it will store the user's data on the user's
chosen storage providers, instead of the developer's. That
way, the user can ostensibly select storage providers that will enforce
their policies alongside the application.
\item Carry out policy enforcement on a trusted computer or computers
independent of the application and storage providers.
\end{enumerate}
This thesis argues for taking the second and third options in system-of-systems
application design. Users should be able to select which storage providers
host their application-specific data, and choose which computers to trust
with enforcing their data-hosting policies.
Applications and storage providers should not be in a position to make either
decision for the user, unless the user explicitly allows them to do so.
\subsection{Organizations}
An \textbf{organization} is the set of computers that enforce a user's
data-hosting policy. Each organization adheres to a single
policy, and uses it to constrain how the application and other users are allowed to
interact with the user's data.
The fact that policies are application-specific means that organizational boundaries
are also specific to the application, since they pertain to the types of data being loaded
and stored in the application. For example:
\begin{itemize}
\item A user's personal devices constitute a single organization in the context of a
social media application. This is a single organization because all
devices adhere to the same data-hosting policy: they load, manipulate,
and store the user's account and profile data. Organizations do not
overlap in this application---a user Alice's devices are a wholly
separate organization from a user Bob's devices.
\item A lab's workstations constitute a single organization in the context of
a Web BLAST~\cite{web-blast} deployment. This is a single organization
because all workstations adhere to the same access controls: only lab members
can access unpublished data, and only lab members and site
administrators can access user-specific state like home directories.
Workstations additionally retain BLAST computation results for their users
in a shared directory accessible to all lab members, so expensive results
can be reused.
\item The set of personal and work devices belonging to a team of programmers
at a software company constitute a single organization in
the context of a shared version control system (VCS).
Each programmer can access the VCS from any of their devices, but only the
devices belonging to programmers on the same team can commit new changes.
Data is never overwritten or deleted---the commit history is preserved
forever. Devices outside of the company are forbidden from reading and writing.
\end{itemize}
Users choose which organization(s) to trust with policy enforcement when they
use the application. The organization mediates all of its user's interactions
with their data in order to apply the user's policy on the data before the data
is received by the application or other users.
\subsection{Developers}
The \textbf{developers} create and maintain the application code. They have to
keep it running despite any breaking changes in the underlying storage
providers, and they have to ensure that each user's data-hosting policy is
enforced.
The fact that developers lack control over the storage infrastructure motivates this thesis's problem statement. Developers
put users in the position of having to trust third-party infrastructure
to adhere to their data-hosting policies (even though the infrastructure is not
guaranteed to be aware of this), and developers put themselves in the difficult
position of having to trust that their underlying services will not change their
storage semantics in a way that breaks the application.
Neither of these
positions has proved tenable in practice. User data gets misappropriated by the storage
providers through breaches of trust like data leaks or data loss. Developers
find themselves having to patch their applications over and over whenever they
change storage providers or the storage providers change their APIs.
This thesis argues that this problem can be solved by creating a data storage
protocol layer (i.e. SDS) in-between applications and storage services. It is sufficient
for the layer to do the following:
\begin{itemize}
\item Treat users as the authoritative origins for all data in a protocol
layer beneath the applications. Then, each application and each user
can identify which application data originated from which user.
\item Identify and enforce organizational boundaries and policies in a protocol layer
beneath the applications. Then, organizations can take unilateral action
in specifying and enforcing their policies without cooperation from the
application.
\item Give developers a way to specify their desired end-to-end semantics in
a protocol layer beneath the applications, but above the storage services.
Then, the developers can adapt the \emph{entire ecosystem} of applications
to changes in a single storage provider with a
single patch on the protocol layer,
instead of having to patch each application separately.
\end{itemize}
The design principles for wide-area software-defined storage are rooted in
observations of three ``tussle spaces''~\cite{david-clark-tussle-spaces}.
These are (1) the cloud services that host and serve the raw
bytes, (2) the end-to-end storage semantics, and (3) the trust
relationships between organizations, their users, and cloud services.
A well-designed SDS system helps application developers efficiently accommodate tussles
in all three of these domains.
\subsection{Semantic Tussle Spaces}
It may not be obvious that end-to-end storage semantics warrant their own tussle
space, distinct from the cloud services and applications. Why not simply
design applications to be portable? Is there a system-of-systems application
development methodology that allows applications to be written once, and be made
to run on any services with only a small amount of work?
This thesis argues that focusing only on application portability is
inefficient---it takes a lot of work to build portable system-of-systems
applications with today's methodologies.
Today, porting $m$ applications to $n$ services
requires $O(mn)$ patches. This is true even if developers share their
patches, since getting a patch to work with one application can require completely
re-writing it to work with another application.
It is unlikely that this situation will improve on its own,
since developers are incentivized to ship code that \emph{works
today} as opposed to code that is portable to unspecified systems at unspecified
times in the future. Moreover, the business models of cloud services
depend on customers continuously paying for the service, which removes the
incentive to help make applications portable to their competitors.
Even if portability were a desirable and achievable design goal from the outset,
getting $m$ applications to adopt a new service's behavior would still at best
require $O(m)$ man-hours, since each application would need to be modified.
SDS reframes the problem of portability as a problem with isolating
both the individual service's semantics and the desired end-to-end storage
semantics from the application. By treating the set of application
end-to-end semantics as their own tussle space, SDS frees the developer from
having to port the application to each service. Instead, a developer simply
ports the service to the SDS system, and the SDS system overlays the desired
end-to-end semantics ``on top'' of them. Then, all current and future SDS
applications would be able to use the service \emph{without} modification.
The amount of work to port $m$ applications to $n$ services with a SDS system is reduced to
$O(m+n)$.
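As an illustration with hypothetical numbers, consider $m = 20$ applications and
$n = 5$ services. Porting each application to each service directly requires on the
order of $20 \times 5 = 100$ patches, whereas the SDS approach requires only
$20 + 5 = 25$ drivers (one aggregation driver per application and one service driver
per service).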
\subsection{Trust Tussle Spaces}
Trust relationships are not static, and system-of-systems applications need a
way to accommodate changes in trust. However, the application needs a way to do
so without compromising \emph{any} organization's autonomy.
The two approaches to managing trust relationships today---federations and open-membership
architectures---do not fully accommodate trust tussles. They
either sacrifice organizational autonomy (federations) or sacrifice
the flexibility needed to accommodate new trust models (open-membership
architectures).
In federations, each organization promises to adhere to a
``common ground'' data-hosting policy that allows them to interoperate.
This way, users that trust one organization can trust other member organizations
and their users to preserve their policies.
For example, the operators of a set of organizations may agree to use a
single-sign-on (SSO) system to authorize computers from different organizations
to access sensitive data. As another example, a set of organizations may agree
to use a common data format and API for sharing data with one another (such as
putting their data servers behind an API endpoint that emulates a widely-used
storage provider like Amazon S3).
While federations help organizations accommodate
tussles in trust relationships, they impose high and unfair coordination costs
that impinge on one or more organizations' decision-making. The problem is
that organization administrators must regularly coordinate to adapt
to changing trust relationships (imposing a high cost), and do so in a way that
favors certain organizations over others (removing fairness). For example,
federations governed by in-person meetings exclude individuals who cannot
travel easily or who live in different time zones.
As another example, federations whose coordination occurs in English
penalize non-English-speaking participants. The unfairness of the
coordination cost distribution is fundamentally a social problem, and is beyond
the scope of this thesis to address.
Open-membership architectures attempt to accommodate tussles in trust
relationships in a more fair way by embedding all of the coordination logic to do so in the
application protocol itself. The rationale is
that this reduces the need for organization administrators to coordinate
out-of-band. Instead, the act of participating in the system gives
each organization the ability to set its own policies for interacting with other
nodes. Example systems that follow this architecture include peer-to-peer file sharing
(like BitTorrent~\cite{bittorrent}, Shark~\cite{shark}, and Vanish~\cite{vanish})
and cryptocurrencies (like Bitcoin~\cite{bitcoin} and Ethereum~\cite{ethereum}).
In both examples, peers have the power to unilaterally choose which other peers to contact, and
unilaterally decide which messages to send and receive from other peers. For
example, BitTorrent allows users to whitelist other peers when sharing a file to
ensure that it only reaches the desired users.
The difficulty with the open-membership approach to accommodating trust tussles is that
it becomes hard to upgrade the application beyond the scope of the
protocol, which in turn makes it hard to accommodate new types of trust relationships.
For example, the BitTorrent protocol does not provide a mechanism for helping
users identify peers who will continue to seed their files, even if the user is
willing to compensate the peers for doing so. A user who wants to identify
and pay peers to seed their files cannot use BitTorrent alone---they must use
some out-of-band mechanism to find, select, and compensate seeders. In order to
accommodate this use-case in-band, the BitTorrent protocol would need to be
upgraded.
The requirement that trust management be performed in-band in the application
protocol means that developers forgo the ability to significantly change the
protocol once deployed. Attempting to introduce a backwards-incompatible change to the
application is tantamount to creating a whole new application. For example,
the Bitcoin Cash cryptocurrency~\cite{bcash} split off from the Bitcoin
cryptocurrency due to a disagreement in the system's block size (a one-line code
change) after over two years of infighting.
The SDS approach to accommodating tussles in trust relationships is to leverage
an open-membership system to \emph{bootstrap} trust between
users and organizations (Section~\ref{sec:chap2-ssi}). Users and
organizations leverage the open-membership system to exchange public keys, and
establish end-to-end confidential and authenticated communication channels.
This lets users and organizations establish trust relationships unilaterally while
avoiding the high-overhead coordination problems of a
federation (i.e. in order to preserve organizational autonomy).
It also helps developers avoid getting locked into an un-upgradeable platform,
since the nature of the trust relationships is decoupled from the
open-membership system used to establish them.
\subsection{Design Objectives}
Applications not only need to work with existing cloud services, but also with
any \emph{future} cloud services that may be developed after the application is
built and deployed. The developer must be able to use any services they want,
with minimal switching costs. This leads to the first design objective for a SDS
system:
\\
\\
\noindent{\textbf{Objective 1}}: \emph{Once developed, an application must be
able to use any current or future
cloud service to host data without changing its end-to-end storage semantics.}
\\
\\
At the same time, a developer may want to stop using a storage system that was
previously in use. The data must nevertheless remain accessible under the
terms of the data-hosting policies of the user(s) that wrote it.
For example, the application developer may discover that the business logic needs stronger
consistency guarantees than the cloud services can offer. The developer cannot
simply move to a different service on a whim, since all of the data is hosted
on the current services. At the same time, the developer cannot be expected
to rewrite the application to keep using it with its weak consistency model.
This leads to the second design objective for a SDS system:
\\
\\
\noindent{\textbf{Objective 2}}: \emph{Once chosen to host data, a cloud service
must remain usable by the application regardless of any future changes to the
application's end-to-end storage semantics.}
\\
\\
All the while, the trust relationships between users and
their chosen cloud services determine how applications are permitted to
interact with each user's data. If users' organizations can communicate securely,
it can be shown that users only need to trust cloud services with
keeping their data available. Other policies can be enforced
in software outside of the services (Section~\ref{sec:aggregation-driver-model}).
However, this leaves open the question of how users establish
trust in one another in the first place. They must
establish trust relationships \emph{outside} of the application, since they need
their organizations to trust one another
before any cross-organization data interactions can occur. The developer
cannot expect organizations to read or write data from untrusted services or
organizations, since this infringes on their autonomy.
This leads to the third SDS design objective:
\\
\\
\noindent{\textbf{Objective 3}}: \emph{Users and their organizations
must be able to establish trust in one another independent of the
applications and cloud services that host user data.}
\\
\\
If this objective is met, then it becomes possible for organizations to
securely identify with whom they will share data. Once they can do this,
each organization's users can programmatically define non-trivial data-hosting policies
for the organization to enforce.
Organizations do not need the application developer to be aware
of their trust relationships. Organizations only need the developer to
ensure that their programmatic data-hosting policies (which encompass their trust
relationships) get enforced.
Identifying and authenticating other organizations and their users
is the first step to implementing policy-enforcement mechanisms.
The second step is to ensure that the organization can unilaterally
designate which organization(s) can be trusted to run them.
Once these preconditions are met, then it is up to the SDS system to ensure that the right
policy enforcement mechanisms are invoked by the right organizations during a read or write.
This leads to the final SDS design objective:
\\
\\
\noindent{\textbf{Objective 4}}: \emph{An organization's data-hosting policies
must be enforced independently of applications and cloud services.}
\\
\\
The remainder of this chapter shows how these objectives sculpt the design space
for SDS systems. It concludes by distilling the design space into a set of
design principles for SDS system design and implementation.
\section{Requirements}
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=2]{figures/dissertation-figures}
\caption{Logical representation of a wide-area SDS system. The SDS system
is a cross-organization intermediate layer that connects services to
applications via distinct interfaces.}
\label{fig:chap2-sds-overview}
\end{figure}
At a high-level, a SDS system is a logical ``hub'' between applications and
services that spans multiple organizations (Figure~\ref{fig:chap2-sds-overview}).
The hub takes reads and writes from the application, processes them
according to application-defined semantics and user policies, and loads and stores the resulting
data to the underlying storage systems.
It necessarily offers two interfaces: a \emph{service interface} through which it interacts with
services on the applications' behalf, and an \emph{application interface} through
which applications interact with data and define their desired storage
semantics.
\subsection{Service Interface}
Fundamentally, a storage service can be read-only, read/write, or write-only.
CDNs and public datasets are read-only storage services, and cloud storage is a
read/write storage service. Write-only services are of little concern to the
users of system-of-systems applications, since they do not provide a way to
interact with the data once written.
This means SDS systems concern themselves with read-only and read/write
services. Cloud services can be further distinguished by whether
or not they can host authoritative replicas of user data---that is, replicas
that the user explicitly places and designates as originating from themselves.
Public datasets and cloud storage are capable of hosting authoritative
replicas. However, CDNs are not---they can only host copies of authoritative
replicas.
The user can leverage any combination of services to host their data. However,
the application developer cannot be expected to anticipate every possible
combination. The SDS system must instead provide some way to automatically
``aggregate'' the user's services, so applications can read and
write user data regardless of their configuration.
Aggregating services is not trivial, since different services that fulfill
similar roles can have different semantics. Depending on the combination of
services in use, the aggregate configuration can have different end-to-end semantics than
any individual service provides. For example, a user that uses a CDN to read
copies of data from cloud storage will observe weaker data consistency than
had she simply read directly from cloud storage.
What this means is that the SDS system needs a \emph{minimum viable model} for each
kind of service. The more minimal the model is, the more diverse the set of
supported storage systems can be. In order to help aggregate services for the
application, the SDS system must take all necessary steps to make each of the
user's services conform to the model.
For cloud storage, the minimum viable model must account for the fact that
different cloud storage providers have different consistency models.
Fortunately, every cloud storage provider in existence promises that if the user
writes data once, they and other users will eventually be able to read it.
This implies that the SDS system can safely assume that \textbf{cloud storage is
at least a write-once read-many medium}. Even if it supports multiple writes to the
same record (most do), no assumptions can be safely made about how readers will
observe these writes.
Regarding datasets, data can be removed from a dataset by the provider, in which
case eventually all subsequent reads will fail. Data can be added to a dataset,
and eventually all subsequent reads to the new data will succeed. Users cannot modify the
dataset, since they do not have write access to the dataset provider's servers.
Therefore, the minimum viable model is that \textbf{datasets are a
read-only medium to users}.
Using CDNs poses a challenge to applications because doing so alters the
end-to-end consistency guarantees of the application. Writes to upstream
authoritative replicas may not be immediately reflected in the CDN's replicas.
Moreover, the user cannot control the CDN's schedule of cache evictions---the CDN can cache data as long
as it wants. However, the minimum viable model for cloud storage means that the
SDS system can ``trick'' the CDN into fetching and serving fresh data. This is
possible because when the application executes a logical write to an existing
record, the SDS will create a new authoritative data replica in cloud storage. A subsequent read on
that data through the CDN will result in a cache miss, since as far as the CDN
can tell it has been asked to fetch new data (instead of a modification to an
existing record). This means that the minimum viable model for CDNs is that
\textbf{CDNs are a write-through cache for users}.
These minimum viable models suggest an aggregation strategy for the SDS system:
\begin{itemize}
\item \textbf{Treat all cloud storage as a write-once read-many medium}. The SDS
system must make it so that the user's set of cloud storage
services will appear to the application as a single write-once read-many
storage medium. The SDS system must ensure that a given record is written
no more than once, and the SDS system must handle the details of routing
the application's reads and writes to the correct underlying storage system.
\item \textbf{Treat all datasets as a read-only medium}. The SDS system must
make it so that all of the user's datasets appear to the application as a
single read-only storage medium. The SDS system must route the
application's reads to the correct dataset.
\item \textbf{Treat CDNs as a write-through cache}. The SDS system must make
it so that the set of the user's CDNs appear as a write-coherent
cache. A write from the application must always be considered ``fresh''
by the CDN, regardless of its caching policy.
\end{itemize}
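A minimal sketch of how these three models could be expressed as service-driver
interfaces appears below. The class and method names are assumptions made for
illustration; they are not part of any SDS implementation's actual API.
\begin{verbatim}
# Hypothetical sketch of the three minimum viable service models as
# driver interfaces. Names are illustrative, not the SDS API.
from abc import ABC, abstractmethod

class ReadOnlyDriver(ABC):
    """Datasets: a read-only medium to users."""
    @abstractmethod
    def read(self, chunk_id: str) -> bytes: ...

class WriteOnceReadManyDriver(ReadOnlyDriver):
    """Cloud storage: each chunk is written at most once, read many times."""
    @abstractmethod
    def create(self, chunk_id: str, data: bytes) -> None: ...
    @abstractmethod
    def delete(self, chunk_id: str) -> None: ...

class WriteThroughCacheDriver(ReadOnlyDriver):
    """CDNs: reads may be served from cache; a logical overwrite is realized
    upstream as a brand-new chunk, so the CDN treats it as a cache miss."""
    pass
\end{verbatim}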
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=3]{figures/dissertation-figures}
\caption{Service and aggregation drivers in an SDS system. Aggregation
drivers span multiple organizations and route application reads and writes to
one or more service drivers.}
\label{fig:chap2-driver-overview}
\end{figure}
To interact with services and aggregate them on behalf of applications, the SDS
system would realize these models by means of a \emph{service driver}.
Logically speaking, service drivers run at the service-facing ``bottom'' of the
SDS ``hub'' (Figure~\ref{fig:chap2-driver-overview}).
They handle only the data meant to be hosted on the service. The SDS system may
instantiate multiple copies of the service drivers in order to handle higher
load or keep applications isolated from one another.
\subsection{Application Interface}
Developers need to be able to preserve their application's
end-to-end storage semantics across an aggregation of services
in a multi-user setting. When an application reads or writes, the SDS system
must use the developer's prescribed rules to handle it. The SDS system will handle reads by
translating an application-level read into requests for data from its service
drivers, and it will handle writes by translating the application-given write
request and write data into requests to store data via its service drivers.
Since each user chooses their own service providers,
the only opportunity to apply end-to-end semantics is in this
application-to-service-driver translation step.
What kinds of semantics should a SDS system support? Since storage semantics
are application-specific, the SDS system must support arbitrary rule sets
supplied by the developer. This implies that SDS systems must be
programmable---the developer must be able to give the SDS system a program that
is evaluated on each read and write to carry out the sequence of steps to
transform application-given requests into requests to service drivers.
To enable this, SDS offers a separate type of driver called an ``aggregation
driver.''
Since each application has its own storage semantics,
there is one aggregation driver per application. Logically speaking, it runs at
the ``top'' of the SDS ``hub'' (Figure~\ref{fig:chap2-driver-overview})
and mediates all requests between users and
service drivers. Note that this thesis does not
distinguish between users and the application clients
they run.
The aggregation driver is executed to handle each
read and write. Since reads and writes to a particular piece of data are
subject to a particular data-hosting policy,
the SDS system executes reads and writes in terms of \emph{which user} issues the
interaction, \emph{which operation} is requested, \emph{which data
record} is affected, and \emph{which network host} is originating the request
(the network host being indicative of which organization originated the
request).
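A minimal sketch of this programming model appears below, assuming hypothetical
hook and field names (\texttt{on\_read}, \texttt{on\_write},
\texttt{RequestContext}); the actual interface is implementation-specific.
\begin{verbatim}
# Hypothetical sketch of an aggregation-driver hook and its request context.
from dataclasses import dataclass

@dataclass
class RequestContext:
    user_id: str      # which user issued the request
    operation: str    # "read" or "write"
    record_id: str    # which data record is affected
    origin_host: str  # which host (and thus organization) originated it

class ExampleAggregationDriver:
    def on_write(self, ctx: RequestContext, data: bytes) -> bytes:
        # Apply end-to-end semantics and the owner's policy before the
        # data reaches any service driver.
        if not self.allowed(ctx):
            raise PermissionError(f"{ctx.user_id} may not write {ctx.record_id}")
        return data  # possibly transformed (e.g., encrypted) before storage

    def on_read(self, ctx: RequestContext, data: bytes) -> bytes:
        if not self.allowed(ctx):
            raise PermissionError(f"{ctx.user_id} may not read {ctx.record_id}")
        return data

    def allowed(self, ctx: RequestContext) -> bool:
        return True  # placeholder policy check
\end{verbatim}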
The high-level idea behind having two driver classes is that once a service has an appropriate service driver,
it can be ``plugged into'' the SDS system such that existing aggregation drivers
can use it immediately. An aggregation driver implements the application's desired end-to-end storage
semantics by translating
application-level requests into requests understood by the service driver. These
requests are issued such that their execution
by service drivers delivers the desired end-to-end behavior. This reframes the
costs of porting applications to services:
\begin{itemize}
\item For the cost of writing only the application-specific
aggregation drivers, a new application can be made
compatible with all existing and future services with no modification.
\item For the cost of writing only the service-specific SDS driver, a new
service can be made compatible with all existing and future applications.
\end{itemize}
In other words, the cost of porting $m$ applications to $n$ services can be
reduced from $O(mn)$ to $O(m+n)$.
To realize these cost savings, many applications will share an SDS system. Aggregation and service drivers
will be \emph{decoupled} from the applications---they will be
developed independently of one another, and independently of the
application itself. Both types of drivers can be re-used by new applications.
\subsection{Data and Control Planes}
This thesis intentionally uses the term ``routing'' to describe the act of
translating an application-given read or write from the wide-area (i.e. a user's
client) into requests to service providers. This is because one facet of
processing reads and writes is that the SDS system needs to ensure that
the user's data-hosting policies are enforced when they execute.
As argued earlier, the user cannot rely solely on
the storage providers to do this, nor can the user rely solely on the
application.
The user must instead be able to unilaterally choose which organizations
will process their reads and writes, since only the user is in a position to
determine which organizations will enforce their data-hosting policies.
When a user reads or writes, the request and
associated data must pass through the user's trusted organizations. This way,
the organizations mediate the reads and writes, and apply the user's policies
to constrain how their data will be processed. For example, a user may require
that the photos they share in an SDS-powered photo-sharing application pass
through their personal server en route to cloud storage, where they will be
encrypted before being stored. As another example, a user may require other
users to pay to read the content they produce.
Trusting organizations to enforce data-hosting policies introduces a routing
concern that SDS systems must fulfill. Reads and writes to a user's data
must be routed through the sequence of organizations that the user trusts,
before reaching the storage providers (on write) or other users (on read).
What this means for SDS systems is that they must empower the user to determine
which routes the reads and writes to their data are allowed to take. Users
must be able to early-bind their routing decisions to their data, since their
routing decisions must continue to apply to their
data long after they create it. The SDS system must execute a
source routing protocol when processing reads and writes to a user's data, since
the SDS system must honor the user's routing decisions instead of making routing
decisions on its own (i.e. in order to ensure that the user's data-hosting
policy is enforced by the right organizations).
The fact that the SDS system is concerned with both sharing data between users
and applying user-given routing decisions on how the data is delivered implies
that SDS systems have both a control plane and a data plane.
The \emph{data plane}'s job
is to ensure all-to-all connectivity between users and services.
The SDS data plane handles two distinct responsibilities. First, it moves the
raw bytes between users and services, but without concerning itself with application-specific semantics or
user-specific hosting policies. It does so via the
service drivers, and handles tasks such as
on-the-wire data formatting, data serialization and deserialization, data
transmission, and so on.
The other data plane responsibility is to maintain an inventory of the
set of records, the set of organizations, and the set of services that a SDS
user can ostensibly interact with. Users rely on this service to
make source routing decisions and discover available data.
This is implemented by a SDS data plane
subsystem called the \emph{metadata service}
(Section~\ref{sec:metadata-service}).
The \emph{control plane} implements each application's
storage semantics and user-given policies by acting as a governor for the data plane.
It runs an application's aggregation driver
to mediate all users' interactions with the data plane (including the data
inventory in the metadata service) in such a way that
users decide which network paths reads and writes take without
affecting the end-to-end storage semantics the driver enforces.
Because each user expects to share data with other users (subject to some
policy), the data plane is effectively shared by all applications and all
services, and must implement a common data-sharing interface via a fully-connected
bidirectional communication graph.
Every node in an SDS-powered application must be able to send and receive data-plane
messages to every other node, since ostensibly each user must be able to share
data with each other user (whether or not they actually do so in the application
is another matter). The control-plane defines the behavior of the
system insofar as what messages get sent while processing application I/O, and how they are
transformed and routed to and from the underlying services and other users.
\section{Data Plane}
User data can be arbitrarily large. However, data gets cached in CDNs, and
large singular records can cause cache thrashing. To contend with this,
the SDS data plane organizes data into units called \emph{chunks}. Chunks form
the basis of all data within SDS, and constitute a ``data plane narrow waist'' between
a multitude of service drivers below and a multitude of aggregation drivers
above. Chunks have the following properties in SDS:
\begin{itemize}
\item Every piece of data in SDS is made of one or more chunks.
\item Each chunk is immutable.
\item Each chunk has a globally-unique identifier.
\end{itemize}
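As an illustration, the sketch below models a chunk as an immutable object whose
identifier is derived from a content hash. Hashing is an assumption made here for
concreteness; the data plane only requires immutability and global uniqueness.
\begin{verbatim}
# Hypothetical chunk representation with a content-derived identifier.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)   # frozen: the chunk is immutable once created
class Chunk:
    chunk_id: str
    data: bytes

def make_chunk(data: bytes) -> Chunk:
    return Chunk(chunk_id=hashlib.sha256(data).hexdigest(), data=data)
\end{verbatim}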
In order to achieve all-to-all data availability, the
data plane must ensure that each chunk belonging to a particular application
is addressable and ostensibly resolvable by every user connected to it.
If the aggregation driver logic allows it, each user
can potentially resolve and download chunks created by each other user.
As will be shown, the aggregation driver and the users' trust relationships with
each other constrain which users resolve which data.
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=4]{figures/dissertation-figures}
\caption{The narrow waist in the SDS data plane. The aggregation driver
translates application-level storage requests into operations on manifests
and chunks, and service drivers implement simple \textit{create},
\textit{read}, and \textit{delete} operations on chunks using existing
service interfaces.}
\label{fig:chap2-narrow-waist}
\end{figure}
At the service driver level, the SDS
system provides operations to \texttt{create}, \texttt{read}, and
\texttt{delete} chunks. Service drivers execute the requisite protocols
and data transformations to
marshal chunks back and forth to their respective services. CDN and dataset
service drivers only implement \texttt{read}, while cloud storage drivers
implement all three.
The data the application stores for a user can take any structure, but at the
end of the day the application will store user data as a set of one or more
named sequences of bytes (called \emph{records} in this thesis). Since records
can be arbitrarily large and must be able to be resolved by any user, SDS
systems must implement an addressing scheme that resolves a record identifier to
its sequence of chunks.
The minimally viable way to address records is to introduce one layer of
indirection---the data plane identifies which chunks belong to the same record,
in addition to identifying each chunk.
At a layer above the service drivers but beneath aggregation drivers, SDS
groups chunks that belong to the same record by using two specialized
chunk types: a \emph{block} and a \emph{manifest}. A block is simply a data
container with a known length. A manifest identifies a sequence of blocks, and
in doing so represents the entire record. Together, blocks and manifests
constitute the ``narrow waist'' of an SDS system's data plane
(Figure~\ref{fig:chap2-narrow-waist}), since they serve as the common
interchange format for a user's data. This construction is similar to the
inode and block construction seen in conventional filesystem designs that is
used to represent a user's files.
This record model is minimally viable because blocks
and manifests provide just enough information to define a
set of generic operations for manipulating application data, in a way that
does not mandate a particular data representation or access interface and is
consistent with the minimum viable model for cloud storage, CDNs, and datasets.
Specifically, the block-and-manifest construction allows
the SDS system to define data-plane operations on
application data in terms of the chunks that make them up:
\begin{itemize}
\item \textbf{Reading data}. To read a piece of application data, a SDS node locates
its manifest, fetches it, and then fetches the blocks listed within it.
\item \textbf{Creating data}. To write a new piece of data, a SDS node replicates
its set of chunks and a manifest that contains them.
\item \textbf{Updating data}. Modifying an existing
piece of application data is done by creating blocks with the modified data,
creating a new manifest with the ``latest'' sequence of blocks, and deleting
blocks that contain overwritten data.
\item \textbf{Deleting data}. Deleting the data is done by
deleting its manifest and blocks. Subsequent reads on the manifest and
blocks will fail.
\end{itemize}
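The sketch below illustrates the update operation under these rules, assuming a
hypothetical write-once store object whose \texttt{create} and \texttt{delete}
calls mirror the chunk operations exposed by service drivers.
\begin{verbatim}
# Hypothetical sketch of "update": write new blocks, publish a new manifest
# listing the latest block sequence, then delete overwritten blocks.
import hashlib, json
from typing import List

def chunk_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def update_record(store, old_manifest: dict, new_blocks: List[bytes]) -> dict:
    # 1. Create blocks holding the modified data (each block is a chunk).
    new_block_ids = []
    for block in new_blocks:
        bid = chunk_id(block)
        store.create(bid, block)          # write-once: a brand-new chunk
        new_block_ids.append(bid)

    # 2. Create a new manifest chunk naming the "latest" block sequence.
    manifest = {"record": old_manifest["record"], "blocks": new_block_ids}
    manifest_bytes = json.dumps(manifest, sort_keys=True).encode()
    store.create(chunk_id(manifest_bytes), manifest_bytes)

    # 3. Delete blocks that contained overwritten data.
    for bid in old_manifest["blocks"]:
        if bid not in new_block_ids:
            store.delete(bid)
    return manifest
\end{verbatim}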
These operations are what allow the SDS system to implement end-to-end
guarantees with higher-level aggregation drivers without having to interface
directly with services. Data plane clients (i.e. aggregation drivers) translate data
operations into one or more of these operations.
A key advantage of this protocol is that it gives service drivers insight as to whether a
chunk is a block or a manifest, as well as insight on which
record is being processed. Developers are encouraged to exploit this
information in practice to implement
service drivers to transparently carry out both chunk-level and application
data-level optimizations like de-duplication, compression, batch-writes,
defragmentation, and so on. Users are encouraged to exploit this in practice
because a stream of chunks passing through an organization can be recognized as
belonging to a particular application record, which allows the organization to
apply the correct policy on the request to read or write it.
\subsection{Data Discovery and Indexing}
\label{sec:metadata-service}
Manifests provide a way to resolve a record's data, but application endpoints
still need a way to find users' records' manifests.
This requires the SDS system to build and maintain a global
chunk inventory so other users can discover manifests (and thus records).
Because manifests are chunks and are accessible under write-once read-many semantics,
the SDS system must ensure that any time a user creates, updates, or deletes
data, a new manifest will be created for the record and it will have a globally unique
identifier. This grants each record snapshot consistency---each manifest
uniquely identifies the state of a record in-between writes.
In order to read a record, a reader first
needs to discover the record's ``current'' manifest identifier, where the notion
of ``current'' is defined by the application's storage semantics (i.e. by the
aggregation driver). Once it knows this, the reader must then resolve the
identifier to the manifest, and then resolve each block it needs from the
manifest to the block data. Since both manifests and blocks are chunks,
and since chunks have globally-unique identifiers that any application endpoint
can resolve to chunk data, a SDS
system must provide a system-wide discovery service that maps chunk identifiers to the set of
organization hosts and service providers that can serve its data.
This service is called the metadata service.
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=5]{figures/dissertation-figures}
\caption{SDS Metadata Service. The MS resolves names to their current
manifests, and allows gateways to update the name/manifest binding.
Manifests are stored in the underlying cloud services, and
point to the set of blocks that make up the record.}
\label{fig:chap2-metadata-service}
\end{figure}
The \emph{Metadata Service} (MS) helps users discover the
availability of new records and new chunks. It also helps users
announce the existence of chunks they create, and
identify which organizations and services can serve a chunk
(Figure~\ref{fig:chap2-metadata-service}).
There only needs to be one MS per SDS instance, and
applications can share the MS as part of sharing the SDS deployment (i.e. the MS
can be designed in a way such that it can be multiplexed across applications).
To resolve reads, the MS must implement at least two indexes:
an index over the set of manifests, and an index over the set of organization
hosts and services that can serve blocks. Then, once a reader has obtained the
manifest, it can decode the manifest to find the block IDs and resolve them to
their data by using the host and services index.
Since there can be multiple users in system-of-systems applications that write
to the same records, a key ease-of-programming
feature the MS must provide developers is an immutable record identifier
that resolves to the record's current manifest. This means that the MS's manifest index must be realized as
a naming system---it binds an immutable name to a record's manifest identifier.
Once users learn the record's name, they must be able to resolve it to the
``current'' manifest identifier. In both Syndicate and Gaia, the record
name may be an arbitrary string, but other designs are possible (such as a
DID~\cite{decentralized-identifiers}).
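Putting these pieces together, the read path through the MS can be sketched as
follows. The \texttt{ms} and driver method names are assumptions for illustration,
not a prescribed interface.
\begin{verbatim}
# Hypothetical sketch of the read path: resolve a record name to its current
# manifest identifier, fetch the manifest, then fetch each block from a host
# or service that can serve it.
import json

def read_record(ms, drivers, record_name: str) -> bytes:
    manifest_id = ms.resolve_name(record_name)       # name -> manifest ID
    manifest_host = ms.locate_chunk(manifest_id)      # who can serve it
    manifest = json.loads(drivers[manifest_host].read(manifest_id))

    data = b""
    for block_id in manifest["blocks"]:
        block_host = ms.locate_chunk(block_id)        # host/service index
        data += drivers[block_host].read(block_id)
    return data
\end{verbatim}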
\subsubsection{Name Consistency}
The consistency model of the MS's name/manifest identifier mappings determines the \emph{default}
consistency model for the user's data.
In Syndicate, for example, the MS offers
per-name sequential consistency. Once a writer successfully updates the manifest
identifier for a name, all subsequent reads on the name will return the new
identifier.
In order to support a wide array of application storage semantics, a
SDS system must allow applications to realize different consistency
models by allowing the developer to programmatically determine precisely
when to update the manifest identifier and precisely when to resolve a name to a
manifest identifier as part of an on-going write or read.
This is enabled through the aggregation driver programming model,
described in Section~\ref{sec:aggregation-driver-model}.
\subsubsection{Service Discovery}
The other responsibility of the MS is to provide an index over the set of
organization hosts and storage services that can resolve chunks. This index
must also be visible system-wide in order for application endpoints to query
organizations and services for chunks.
Unlike the record name index, the service
index must be atomic and linearizable with respect to reads and writes.
All reads and writes must occur under the same system-wide view of this
index, and once an index view-change executes, all subsequent reads and writes
execute in the new view. Put another way, each read and write belongs to
exactly one view, and there is at most one view in the system at any point in
time.
Preserving this index's consistency model is necessary to ensure that the user's
data-hosting policies are preserved when the service providers or organizations
change. These changes can happen when the user changes which storage
provider(s) host replicas of their data, and when the user's trust
relationships with other organizations change. The protocols are described in
detail in Section~\ref{sec:view-changes}.
%The MS also plays a role in deploying service and aggregation drivers. The
%developer uploads new code to the MS, and the MS ensures that the new drivers
%are used to service all subsequent read and write requests. This is described
%in detail in Section~\ref{sec:view-changes}.
\subsubsection{Metadata Policy Enforcement}
Due to the roles the MS plays in a SDS system, it is important to consider
which organization or organizations run it. The design of the MS must not infringe on
each organization's autonomy---both it and the underlying infrastructure
running it must respect all data hosting policies.
This requirement allows for two possible MS designs. On the one hand,
the MS can be designed to be distributed across each organization such that each
organization controls the service discovery and naming for its data and
services. In this design, organizational autonomy is preserved because each
organization mediates all access to its metadata and service discovery
information. This is the design strategy taken by
Gaia's MS.
On the other hand, the MS can be designed such that each organization
places no more trust in its ability to enforce data hosting policies
than it already does in its chosen cloud services. In other words, the
MS could run in an external cloud service, and would only be trusted
with data availability. This is the design strategy taken by Syndicate's MS.
\section{Control Plane}
The control plane governs the data plane in two ways: it applies
the application-given rules for processing reads and writes as their data moves
between users and storage providers (i.e. preserving storage semantics),
and it allows each organization to choose which other organizations
are trusted to execute these rules, based on their users' policies
(i.e. preserving organizational autonomy).
The control plane handles these two concerns by deploying the application's
service and aggregation drivers across the organizations that use the
application, and by allowing users
to select the routes reads and writes take through the drivers.
The aggregation driver has so far been characterized as a program running in the
SDS's logical ``hub''
that mediates all interactions with the application's data. The aggregation
driver is on the read and write paths for all of its application's endpoints, including
both ``front-end'' processes on users' computers and
``back-end'' processes running on application servers.
It is tempting to use this logical model as the aggregation driver design by
running it within a developer-chosen organization, such as a cloud computing
provider. This is the approach taken to implementing storage semantics today
in most Web applications---the logic that takes user-initiated reads and writes and
translates them into reads and writes to underlying storage is implemented in
the application's server-side processes. However, since users cannot trust
application servers or storage provider servers with
enforcing their data-hosting policies, this approach must be avoided in SDS
system designs.
The consequence for the SDS control plane design space is that the control plane's
execution is necessarily distributed across the set of organizations. This
implies a distributed aggregation driver model, where each organization runs
one or more service driver instances and one or more aggregation driver
instances which coordinate to execute reads and writes.
The key to preserving both storage semantics and organizational
autonomy is to allow users to select which instances will be used to process
their data: users choose which instances to trust with read and write
processing, and the SDS system ensures that their choices yield a driver
execution trace compatible with the end-to-end storage semantics.
To achieve this, all SDS systems provide two logical control-plane
constructs: the volume and the gateway. Using these two constructs, the control
plane realizes the following properties:
\begin{itemize}
\item \textbf{Scalability}. The control plane can service a scalable number
of concurrent requests by distributing them across the users'
organizations.
\item \textbf{Multiplexability}. The SDS system can be shared across many
applications, organizations, and users. Each application is given the
illusion that it is the only application interacting with the system
(i.e. applications do not interact via SDS).
\item \textbf{User-determined Source Routing}. Users decide which driver
instances process their reads and writes for each record they create.
In doing so, the system recognizes users as the authoritative sources for
their data at the protocol level, instead of by social convention.
\item \textbf{Driver Agility}. Drivers can be replaced and changed at
runtime without affecting ongoing reads and writes. Each user can change
which drivers are used to service reads and writes to their data.
\item \textbf{Fault Tolerance}. Using the user's source-routes for their
data, the SDS system can recover from driver fail-stop conditions by
routing reads and writes to other driver instances that are permitted by
the user's source-routes. In doing so, the user defines how the system
handles faults when processing requests to their data.
\end{itemize}
\subsection{Volumes}
A \emph{volume} is a logical collection of
application data that is accessed through a fixed set of service and aggregation
driver instances. Each driver instance runs within a gateway
(described in the next section), and has a
network address that allows users to send it read and write
requests.
Volumes allow the SDS system to be multiplexed across users, applications, and
organizations. Each record belongs to exactly one volume, and each running
driver instance belongs to exactly one volume.
A volume has a designated ``owner'' user that has the power to unilaterally
add and remove records and driver instances on-the-fly. Volumes can
nevertheless be shared across users, applications and organizations.
Volumes bind their owner's data-hosting policy to their records. This is
achieved by ensuring that the volume owner has the power both to add and remove
service and aggregation driver instances at runtime, and to add and remove users
who can send them requests. Organizations run instances of driver implementations, and the SDS
system executes a view-change protocol (Section~\ref{sec:view-changes}) to
ensure that (1) all of the volume's users know which driver instances to
contact, and (2) all of the volume's driver instances know which users are
allowed to read and/or write to them.
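The per-volume state implied by this arrangement can be sketched as follows; the
field names and the integer view counter are illustrative assumptions.
\begin{verbatim}
# Hypothetical sketch of per-volume state: the owner can unilaterally add or
# remove driver instances and users, and each change bumps the view so the
# view-change protocol can propagate the new membership.
from dataclasses import dataclass, field
from typing import Set

@dataclass
class Volume:
    owner: str
    records: Set[str] = field(default_factory=set)
    driver_instances: Set[str] = field(default_factory=set)  # gateway addresses
    allowed_users: Set[str] = field(default_factory=set)
    view: int = 0   # incremented on every membership change

    def add_user(self, requester: str, user: str) -> None:
        assert requester == self.owner, "only the owner may change membership"
        self.allowed_users.add(user)
        self.view += 1

    def remove_driver(self, requester: str, gateway_addr: str) -> None:
        assert requester == self.owner, "only the owner may change membership"
        self.driver_instances.discard(gateway_addr)
        self.view += 1
\end{verbatim}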
This arrangement means that the volume owner has direct control
over their trust relationships with other organizations and their users.
The application only provides a view of the volume data, and has no
say in which organizations and users the volume owner trusts.
The volume owner only allows a service or aggregation driver instance
to process reads and writes to
volume records if she trusts the organization running the driver to faithfully
execute its code. Similarly, the volume owner only allows a user to
interact with her volume's driver instances if she trusts the user. The SDS
system design may provide her with additional access control mechanisms to
constrain how other users interact with her drivers (and thus the volume data).
For example, a lab's PI may want to store lab data to Amazon S3 and retain an
access log for all requests for a year. She does so by instantiating a service
driver for loading and storing chunks to S3, and an aggregation driver that
accepts reads and writes, logs them, and forwards them to the S3 service driver.
She needs all reads and writes to pass through the aggregation driver, so the
log will be maintained.
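Under the hypothetical hook-style interface sketched earlier, the PI's aggregation
driver could look roughly like the following; the class, method, and log-file names
are illustrative assumptions.
\begin{verbatim}
# Hypothetical sketch of the PI's aggregation driver: log every read and
# write with a timestamp, then forward the request to the S3 service driver.
import logging, time

class LoggingAggregationDriver:
    def __init__(self, s3_driver, log_path="lab_access.log"):
        self.s3 = s3_driver
        self.log = logging.getLogger("lab-access")
        self.log.addHandler(logging.FileHandler(log_path))
        self.log.setLevel(logging.INFO)

    def on_read(self, user_id: str, chunk_id: str) -> bytes:
        self.log.info("%f read %s by %s", time.time(), chunk_id, user_id)
        return self.s3.read(chunk_id)

    def on_write(self, user_id: str, chunk_id: str, data: bytes) -> None:
        self.log.info("%f write %s by %s", time.time(), chunk_id, user_id)
        self.s3.create(chunk_id, data)
\end{verbatim}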