\chapter{Exemplar Systems}
\label{chap:syndicate_sds}
In order to validate the design principles of software-defined storage, this
thesis presents the design and implementation of two separate SDS systems.
Both systems currently enjoy use in production environments.
The first system, called Gaia, implements a global key/value store with programmable semantics for its
\texttt{get} and \texttt{put} operations. It is a minimalist SDS system---it
provides just enough functionality to ensure that applications and
users can interact with their data under a fixed set of storage
semantics running at a layer above the third-party services.
The second system, called Syndicate, implements
a full POSIX filesystem interface with programmable semantics for most
filesystem operations. It is much more featureful than Gaia, and is designed
to port existing scientific computing workloads to commodity third-party
services.
Despite being designed for different use cases, both Syndicate and Gaia allow
developers to preserve end-to-end storage semantics while respecting
organizational autonomy. Both systems achieve this by providing one or more
gateway implementations that run in separate organizations, but
coordinate through a shared untrusted metadata service. While the systems have
different gateway and MS designs and implementations, they nevertheless adhere to
the same design principles.
\section{Gaia: a Key/Value SDS System}
Gaia is a global key/value store designed for users to host their data
for decentralized Web applications. In this thesis, ``decentralized''
applications are applications where all of the business logic runs on the
users' computers. For example, a decentralized to-do list
application~\cite{blockstack-todo} would fetch the user's application state from
the storage providers of their choice, allow the user to interact with the items
once loaded, and would store the resulting state back to the storage providers
when the user is done. Unlike conventional Web applications,
there is no ``application server'' that runs business logic on the user's data.
Gaia is designed for Web programming environments, and offers two
modes of operation. The first mode, called ``single-reader mode,'' offers
behavior similar to what Web developers today expect from HTML5
\texttt{localStorage}~\cite{w3c-localstorage}. The Web code can load a value
given a key and store a \emph{(key, value)} pair, with the expectation that only this
instance of the code will be able to interact with the key. For example,
the aforementioned to-do list application would use this mode to ensure that a user can
only read and write their own data, and other users cannot interact with it at
all.
The second mode, called ``multi-reader mode,'' offers one-writer many-readers
semantics. Only a user may write to their own keys, but any user may discover
and read their keys. For example, a blogging application built on Gaia would
use Gaia's multi-reader mode to allow a user to publish blog posts, and allow
other users to read them.
The main contribution of Gaia is to give Web developers a secure and
reliable way to outsource data-hosting to users. Gaia ensures that each user
securely and automatically discovers every other user's volumes and certificate
graphs for each application they use prior to loading the data.
In doing so, Gaia offers end-to-end
data authenticity and confidentiality while using untrusted commodity cloud infrastructure
to host and serve application data.
Despite being a minimalist SDS system with a simple API, Gaia is used today in
production workloads for Blockstack~\cite{blockstack} applications. Many
non-trivial applications rely on it for storage, including a shared document
editor~\cite{graphite-docs}, a cryptocurrency portfolio
manager~\cite{coins}, a microblogging
platform~\cite{publik}, and
an end-to-end encrypted Web chat application~\cite{stealthy.im}.
\subsection{Motivation}
In conventional Web applications, users and would-be developers are severely
constrained in what they are able to do with their data. This is because in
conventional Web applications, the business logic and the storage logic both run
in the application's servers. As such, all authoritative replicas are hosted
outside of the users' organizations, and any computations that may be performed
over them are mediated by the application's servers. Users and would-be
developers need \emph{permission} to access, modify, and extend their data's
storage.
The motivation for creating Gaia is to allow Web applications to be written in a
way that decouples business logic from storage logic. Users and developers
ought to be able to control where their data is hosted and how reads and writes
on it are carried out. At the same time, making this change should not require
Web application developers to significantly re-think the way they build
applications---at most, they should only have to change the Javascript calls in
their application frontends to direct reads and writes to the users' chosen
storage providers, instead of the application servers. Achieving this would
yield three main benefits: (1) users can keep their data in the
event that its developer stops maintaining the application, (2) multiple
applications can interoperate by reading from each other's volumes, and
(3) developers can avoid the need to host user data, or any ``hard state'' for
their applications.
These benefits are realized by first observing that in many cases,
a Web application's data interaction model is \emph{already} centered around
individual user activity. Users can read and write their own data, but can
only read other users' data.
In applications that present ``shared-write'' views of data, like a comment section on a blog or
a shared Google Document page~\cite{google-docs}, the business logic
attributes each write to a specific user, and then ``merges'' their
writes to present a consistent view. Gaia exploits this property of Web
applications by bundling up all of a user's data into its own volume, and giving
the volume owner the ability to grant other users read access to it.
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=16]{figures/dissertation-figures}
\caption{Gaia versus traditional Web applications. In traditional Web
applications (left), application clients' reads and writes are mediated
through a shared application server. In Gaia, reads and writes are processed
by a sequence of one or more Gaia nodes before being loaded and stored to
commodity storage as chunks. Gaia nodes, in turn, run the users' gateways.}
\label{fig:chap3-gaia-vs-traditional-web}
\end{figure}
Even though a Web application has a single logical database that holds all of
its users' state, this observation about access patterns allows Gaia to reformulate
the global database as a collection of single-writer multi-reader user-specific
databases (Figure~\ref{fig:chap3-gaia-vs-traditional-web}).
It becomes the application client's job to translate a set of reads across
the users' databases into a consistent view, whereas this had traditionally been
done by the application's global database. In SDS terms, both the
Web applications and each user's computers form separate organizations, and each
organization sets a policy that allows writes only from within the organization
and allows reads from zero or more other organizations.
This reformulation of application storage gives Gaia the ability to separate each
user's piece of the Web application's global state into its own volume, thereby
placing it under the control of the user's organization.
The application only needs to be able to read from a user's database piece in order to
present other users with a view of its data. The database need not reside on
servers that the application developer chooses. What this means in SDS terms is
that there is one volume per $(application,user)$ pair, and that only the user may
write to the volume and control its access semantics. Application developers
are simply considered users of their own application.
A user may optionally make a volume readable to another set of users, or to the
world. For example, users of a blogging application would make their volumes
world-readable. As another example, the application developer's volume would
store application assets like images, CSS, and code to be loaded at runtime by
the running application code (thereby allowing the developer to push
updates to their application, much like how they do today on the Web).
The SDS design principles come into play in the following tasks:
\begin{itemize}
\item \textbf{Multiple Storage Systems}. Gaia allows users
to choose the storage systems that will host their data in an
application-agnostic way.
\item \textbf{User Storage Policies}. Gaia allows users to
stipulate programmatic policies pertaining to data availability and
durability, thereby preserving organizational autonomy.
\item \textbf{Application-specific Views}. Gaia uses aggregation
drivers to construct global, consistent views of a set of users'
application state, thereby preserving end-to-end storage semantics.
\end{itemize}
\subsection{Blocks, Manifests, and Volumes}
Gaia organizes a user's data into a set of volumes, where each volume
holds one application's data. The user decides whether or not the volume
operates in single-reader or multi-reader mode, and selects the set of
service drivers to use to replicate data.
A volume in Gaia is a key/value store. The API the Gaia client exposes to
programmers resembles HTML5's \texttt{localStorage}. It offers three methods:
\texttt{get(key): value}, \texttt{put(key, value): bool}, and
\texttt{delete(key): bool}. The writers to the volume are the gateways that run
on the user's trusted devices. The readers are either exclusively the user's
devices (in the single-reader mode), or any device that can discover the
publicly-visible data (in multi-reader mode).
Internally, the keys in a volume are bundled into per-device manifests.
Each value is the associated block, which points to replicas of the raw data.
This means that per-device writes are serialized
across keys, and that writes to a key are atomic. These behaviors were chosen
specifically to emulate the semantics of \texttt{localStorage}.
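To make the volume abstraction concrete, the following Python sketch models the three-method API and a single device's manifest. It is an illustrative model only---the production client is a Web library---and the names \texttt{GaiaVolumeClient} and \texttt{manifest} are hypothetical.
\begin{verbatim}
import time

class GaiaVolumeClient:
    """Illustrative model of the three-method volume API. Writes
    update only this device's manifest; values are pointers to
    replicated blocks rather than the raw data itself."""
    def __init__(self, device_id):
        self.device_id = device_id
        self.manifest = {}          # key -> (timestamp, block pointer)

    def put(self, key, block_ptr):
        self.manifest[key] = (time.time(), block_ptr)
        return True

    def delete(self, key):
        self.manifest[key] = (time.time(), None)   # tombstone the key
        return True

    def get(self, key):
        entry = self.manifest.get(key)
        return entry[1] if entry else None
\end{verbatim}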
\subsection{Gateways}
Users run one or more Gaia nodes to access their application state. Gaia nodes,
in turn, run gateways on behalf of the application. The Gaia node the
application accesses provides a key/value storage abstraction that encompasses
all of the key/value pairs written by all of the user's devices. It forwards
reads and writes to gateways running within this and other Gaia nodes.
Gaia nodes are distinct from gateways in that a single Gaia node can run many
gateways for many users in many applications. Gaia nodes are meant to be easy
to deploy for non-technical users---the non-technical user should only need to
understand that as long as they have a running Gaia node, their policies
will be enforced regardless of which applications they use and regardless of
which other users they share data with.
This design constraint means that a
single Gaia node must be able to instantiate many gateways for many users
and many applications. To achieve this, gateways in Gaia are implemented as
closures within the running Gaia code. This makes instantiating a gateway
inexpensive---the node simply allocates a new internal \texttt{gateway} object
and pairs it with an application-specific, user-specific context.
Gateway instantiation and teardown are driven by application sign-ins.
When the user signs into the application, the Gaia node instantiates gateways to
run access and mutate flows. When the application's Web page writes data,
it asks the user's Gaia node to run a mutate flow through its Build, Push, and
Publish gateways to make the write durable and available.
Each of the user's devices runs a Gaia node locally or on a trusted host to
process writes. The device running the node with the writing gateways
must be trusted, since it runs coordinator gateways and has access to their
private keys.
The read path depends on whether or not the volume is a single-reader or
multi-reader volume. If it is a single-reader volume, then the Web page simply
asks the user's own Gaia node to carry out the access flows. If it is a
multi-reader volume, then the Web page instead contacts the Gaia metadata
service (discussed below) to find the Gaia node that can serve the data.
Once this node is known, the Web page asks the node to
carry out the access flow.
Because a user may have multiple devices, there can be multiple writers to
a single volume. However, Gaia assumes that \emph{application writes are
sequentially consistent}~\cite{sequential-consistency}---there exists a total
ordering of writes issued by the application regardless of which devices
originate them. This is a reasonable assumption in practice, because
(1) a volume may only be written to by the user
that owns it, (2) a user typically does not access the same application from
two different devices simultaneously, and (3) any concurrent writes from the
application on the same device can be serialized by the Gaia node.
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=17]{figures/dissertation-figures}
\caption{Application interface to Gaia. Gaia combines the timestamped
key/value writes in each device's manifest to create a coherent view of the
user's volume.}
\label{fig:chap3-gaia-global-key-value-store}
\end{figure}
This assumption side-steps the need for a user's Gaia nodes to
coordinate to resolve write-conflicts in the key/value abstraction. Since each
device runs the coordinator gateway for the key/value pairs it has
written, the key/value abstraction can be realized simply by
merging each device's key/value pairs into a single key/value space.
The merge function simply accepts the
value with latest timestamp in order to handle cross-device
key/value conflicts (Figure~\ref{fig:chap3-gaia-global-key-value-store}).
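Continuing the illustration above, and assuming each per-device manifest maps keys to \texttt{(timestamp, block pointer)} entries with \texttt{None} marking a delete, the cross-device merge reduces to a last-writer-wins fold over the manifests:
\begin{verbatim}
def merge_view(device_manifests):
    """Last-writer-wins merge of per-device manifests into a single
    key/value view, as sketched in the text. Deleted keys (None
    pointers) are dropped from the merged result."""
    merged = {}
    for manifest in device_manifests:
        for key, (ts, ptr) in manifest.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, ptr)
    return {k: ptr for k, (ts, ptr) in merged.items() if ptr is not None}
\end{verbatim}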
In the rare cases where an application expects a user to read and write from
multiple devices simultaneously, the developer has the opportunity to implement
write serialization in the volume's aggregation driver. The absence of built-in
write/write conflict resolution is a design choice that makes the common case
simple in its implementation and performant in its execution.
\subsection{Metadata Service}
Gaia nodes implement a peer-to-peer metadata service. The Gaia MS is based on prior work
on Blockstack~\cite{blockstack,virtualchain,ali2017}. It enables
Gaia nodes to both ensure that readers do not see stale data, and to ensure that
any Gaia node can discover and read key/value pairs from a given volume for any user.
Gaia uses a blockchain-based SSI system both for bootstrapping trust between
users and for implementing the ``volume discovery'' and ``gateway discovery''
functions of its MS. When users Alice and Bob
register their user names in the SSI, they each include a cryptographic hash within the
blockchain transaction. Each hash is the hash of a DNS zone
file~\cite{rfc-zone-file} that contains routing information for discovering the
user's volumes and Gaia nodes.
Gaia nodes work with the SSI system to build a 100\% replica of all zone files.
They self-organize into an unstructured
peer-to-peer network through which they exchange zone files. They exchange
bit-vectors with one another to announce the availability of their zone files,
based on the sequence of transactions the SSI system has processed that include
new zone file hashes. Peers inspect one another's bit-vectors, and pull
zone files from one another in rarest-first order such that they all eventually
build a 100\% replica.
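As a rough illustration of this exchange, the Python sketch below orders a node's missing zone files rarest-first, given its own ``have'' bit-vector and the bit-vectors announced by its peers. Index \texttt{i} corresponds to the \texttt{i}-th zone file hash the SSI system has processed; the function name and data layout are hypothetical.
\begin{verbatim}
def rarest_first_order(my_have, peer_bitvectors):
    """Return the indices of missing zone files, least-replicated
    first, so the peer network converges on a full replica quickly."""
    counts = []
    for i, have in enumerate(my_have):
        if have:
            continue    # already replicated locally
        replicas = sum(1 for bv in peer_bitvectors if i < len(bv) and bv[i])
        if replicas > 0:
            counts.append((replicas, i))
    return [i for replicas, i in sorted(counts)]

# This node is missing zone files 1 and 3; 3 is rarer, so it is fetched first.
print(rarest_first_order([1, 0, 1, 0],
                         [[1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 0]]))
\end{verbatim}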
Peers arrange themselves into a $K$-regular peer graph. They each choose an
unbiased random sampling of the peer graph as their neighbors using a
Metropolis-Hastings random graph walk with delayed acceptance~\cite{lee-xu-eun}.
The default implementation chooses $K=80$, and when queried for neighbors, will
respond with a random sample of at most 10 peers that
have historically responded to queries at least 50\% of the time. This helps
each peer quickly discover ``healthy'' peers in its neighbor set. In addition,
each peer remembers and periodically pings up to 65536 discovered
peers regardless of their perceived health in order to accommodate churn in both
the set of peers as well as the network links between them. For example, a
healthy peer that goes offline for a time due to a bad network link or
misconfiguration will not be completely forgotten for some time, giving the
operator a chance to correct the issue and quickly rejoin the network.
Since Gaia nodes view the same blockchain, they calculate the same sequence of zone
file hashes. This gives them a ``zone file whitelist'' that grows at a constant
rate, no faster than the blockchain. They use the whitelist to identify only
legitimate zone files, and rely on the blockchain to ensure that not too many
new zone files can be introduced into the system at once. A detailed
description of the peer network can be found in~\cite{ali2017}.
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=18]{figures/dissertation-figures}
\caption{Volume lookups in Gaia. When Alice wants to read Bob's app data, she
(1) looks up his ID in his SSI database, (2) finds his zone file in
Gaia's peer network, (3) finds his certificate graphs, (4) finds his
public-facing Gaia node and volumes, and (5) routes access flows through it
to access his volume data.}
\label{fig:chap3-gaia-volume-lookups}
\end{figure}
The peer network ensures that each Gaia node knows the names, current public
keys and current zone file for each user. Each user's
zone file points to a set of signed volume and gateway configuration data
structures, including the certificate graphs for each volume.
This way, a Gaia node can
look up an application-specific volume for a user given the user's name on the
SSI system (Figure~\ref{fig:chap3-gaia-volume-lookups}). Importantly, the networks and storage
providers hosting zone files and configuration data are \emph{not} part
of the trusted computing base. As long as the user's local SSI server is
trusted, then they can discover authoritative state about other users' volumes.
Discovering a user's data is a matter of first looking up the volume metadata, and then
searching the key space in the metadata for the desired metadata record. After
discovery, the reader caches the key's version number, so subsequent reads do
not return stale data. Publishing data is a matter of uploading a new key/value
pair with a greater version number.
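The version-number check is simple enough to sketch directly. The helpers below (hypothetical names, illustrative only) reject responses older than what the reader has already seen, and bump the version on publish:
\begin{verbatim}
def accept_read(cached_version, fetched_version, value):
    """Reject a response served by a 'read' Gaia node if it is older
    than the version this reader has already observed for the key."""
    if cached_version is not None and fetched_version < cached_version:
        raise ValueError("stale read: saw version %d, expected >= %d"
                         % (fetched_version, cached_version))
    return value

def next_version(cached_version):
    """Publishing uploads a new key/value pair with a greater version."""
    return 1 if cached_version is None else cached_version + 1
\end{verbatim}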
\subsection{Aggregation Drivers}
Users enforce end-to-end storage semantics for multi-reader volumes by standing up and running
publicly-routable Gaia nodes to process reads from other users
(Figure~\ref{fig:chap3-gaia-reads-writes}). By design,
these public read nodes are \emph{not} trusted, and the gateways they run do not
have any write or coordination capabilities. The data they serve is
meant for external consumption.
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=19]{figures/dissertation-figures}
\caption{Gaia read/write configurations. Discover and Acquire happen in the
same node in single-reader configurations, but can happen either on untrusted
nodes or on both trusted and untrusted nodes in multi-reader settings.
Mutate flow stages always run within the trusted computing base, however.}
\label{fig:chap3-gaia-reads-writes}
\end{figure}
When a user Alice creates a volume, she simply lists the Gaia node
in her signed Gaia configuration as the ``read'' endpoint. When other users Bob and Charlie go to
read from her volume, their Gaia nodes issue the request to her ``read'' Gaia node
indicated by her configuration data. Bob and Charlie discover the ``read'' node
by querying the MS.
\subsection{Flow Routing}
Only Alice can change her volume and node configuration. She does so simply by
regenerating the configuration and signing it with the key listed under her
account in the SSI system. If she wants to change the URLs to her signed
configuration, she uploads her configurations to the new locations, generates a
new zone file with URLs that point to them, and announces the new zone file's
hash in the SSI system's blockchain. Once the SSI system processes the
transaction, she broadcasts the new zone file to Gaia's peer network so all
other Gaia nodes can discover her volumes.
When Bob wants to read Alice's data, his node first inspects her volume record
to determine the ``read'' endpoint. When Bob's node runs the Discover and
Acquire stages, the ``read'' endpoint is given to the aggregation driver as part
of its execution context so it can pull data from it.
When Alice wants to write data to a volume, her Gaia node ensures that the
appropriate gateways are instantiated with the Build, Push, and Publish stages
from her configuration. Once they are available, the node processes her write
request. In practice, her gateways are locally available for the duration of an
application session---her node instantiates them as part of an application
``sign-in'' process, and shuts them down when the session ends.
\subsection{Administration}
To minimize coordination between developers and users, Gaia
automates as much of the system administration as possible.
In its day-to-day operation, the only administrative contact a user has with their volumes is
in connecting storage providers, which is handled via a provider-specific Web
UI. In addition, Gaia minimizes the instances where the user directly interacts with cryptographic keys
by ensuring that they only need to do so when they acquire or lose a personal
computing device.
Application developers do not interface directly with storage providers, but
instead with the user's designated Gaia node.
Developers specify the storage requirements the
application needs, and the Gaia node pairs the requirements with storage drivers
when creating its volume.
%A table of storage requirements can be found in
%Table~\ref{tab:gaia-storage-requirements}.
% TODO: table of Gaia storage classes
The application code discovers a user's Gaia node as part of the SSI sign-in
process. The SSI service identifies to the application the network address
of the user's Gaia node. The application then learns the set of Gaia storage
providers, and the set of capabilities they offer (which can be matched to
storage requirements).
The resulting storage administration workflow for users and developers works as
follows:
\begin{enumerate}
\item When the user creates an account in the SSI service, she connects one
or more storage providers to her account.
\item The user loads the application and clicks its ``sign-in'' UI element.
\item The application redirects the user to the SSI service's ``sign-in'' UI,
which prompts the user to authorize the sign-in request. Specifically,
the user is presented with the application's request for either a
``single-reader'' or ``multi-reader'' volume.
\item Once approved, the SSI service redirects the user back to the
application, passing it a session token which identifies the user's Gaia
node.
\item The application requests a volume. If this is the first such request,
the Gaia node creates an application-specific volume. The node then
returns a handle to the volume which the application subsequently uses to
load, store, and delete keys.
\end{enumerate}
At no point are users asked to interact with volume, user, or gateway keys, and at no point
are users asked to perform access controls. At no point are the developers
asked to identify or bootstrap a connection to storage providers, and at no
point are developers required to perform any access controls beyond deciding
whether or not their app-specific volume will be world-readable or private
(enforced internally through encryption). This removes the need for developers
and users to coordinate with one another---Gaia ensures that different applications' storage
interactions never interfere with one another, and ensures that users can only read one another's
data if they choose to interact at all.
Gaia users are self-sufficient---there is no designated third party service that is
responsible for keeping the system alive, since users interact with their data
through device-hosted Gaia nodes. However, users nevertheless need to recover
access to their data in the event they lose their computing devices.
To facilitate this, the configuration state for a user's Gaia node is replicated to \textit{all} of
the user's storage providers. This state includes all app-specific public keys,
as well as all encrypted authentication tokens for their storage providers.
The configuration bundle is signed and encrypted with keys linked to
the user's identity on the SSI system's blockchain, so no matter which device(s) the user
uses to modify their configuration state, they will always be able to
at least authenticate the externally-hosted data (even if they lose all of their
devices). If the user changes their keys (i.e. in order to recover from device
loss), the configuration state is
automatically re-signed and re-encrypted by the Gaia node.
The only time a user directly interacts with a cryptographic key is when they
change the device(s) they use to interact with their data.
The implementation facilitates this by
encoding an encrypted ``master'' ECDSA private key as a 12-word mnemonic phrase, and deriving keys
for signing name updates and for signing app-specific volume data
using a deterministic key-derivation
algorithm~\cite{bip39}. The encrypted private key is backed up to the user's email provider by
default.
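The sketch below shows the shape of this derivation. It is a simplified stand-in (HMAC-SHA256 instead of the BIP32/BIP39-style hierarchical derivation the implementation uses), intended only to show that every signing key is recoverable from the single mnemonic-encoded master key.
\begin{verbatim}
import hashlib, hmac

def derive_key(master_key: bytes, purpose: str) -> bytes:
    """Deterministically derive one child key per purpose, e.g.
    'name-updates' or 'app:<application origin>'. Simplified stand-in
    for the deterministic key-derivation algorithm cited above."""
    return hmac.new(master_key, purpose.encode("utf-8"),
                    hashlib.sha256).digest()

# The mnemonic phrase below is a placeholder, not a real key.
master   = hashlib.sha256(b"twelve word mnemonic phrase ...").digest()
name_key = derive_key(master, "name-updates")
app_key  = derive_key(master, "app:https://example-app.invalid")
\end{verbatim}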
\subsection{User Scalability}
The limiting factor in the system's scalability is how many users it can support.
This is limited not by Gaia itself, but by the rate that transactions can be
written to the underlying blockchain.
Registering a username requires two
transactions---a commitment to the salted hash of the username, and a matching
revelation of the username and the salt. Two transactions are necessary in
order to prevent front-running, whereby an adversary can watch the set of
unconfirmed blockchain transactions (i.e. those that are present in each peer's
local memory but not yet assigned to a block) and race the victim to send out
a transaction that acquires a username. If Bitcoin's blockchain were utilized
solely for registering Gaia users, it could only process 72,000 registration requests per day
(assuming 1~KB transactions, 1~MB blocks, 144 blocks added per day, and two transactions per registration).
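The commit/reveal structure can be sketched as follows (illustrative only; the actual transaction wire format differs). An observer of the first transaction sees only a salted hash, so it cannot race the victim for the name.
\begin{verbatim}
import hashlib, os

def commit(username: str):
    """First transaction: publish only the salted hash of the name."""
    salt = os.urandom(16)
    digest = hashlib.sha256(salt + username.encode("utf-8")).hexdigest()
    return salt, digest

def reveal_is_valid(username: str, salt: bytes, committed_digest: str) -> bool:
    """Second transaction: reveal the name and salt; nodes accept the
    registration only if it matches the earlier on-chain commitment."""
    digest = hashlib.sha256(salt + username.encode("utf-8")).hexdigest()
    return digest == committed_digest

salt, digest = commit("alice")
assert reveal_is_valid("alice", salt, digest)
\end{verbatim}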
Once a name is registered, the owner can update their zone file with a single
transaction. The blockchain provides a linearizable history of all zone file
updates, and thus all zone files.
To scale up the number of users, Gaia provides an alternative way of registering
and updating usernames by packing many such operations into a single Bitcoin
transaction. It does so by packing them into a single on-chain name's zone
file, and then propagating the zone file through Gaia's metadata service.
The on-chain name owner issues the transaction to set the new zone file.
The history of off-chain user name operations is still linearized, since the history of
zone files is linearized by the blockchain. Each off-chain user name has its
own public key and its own zone file.
Names registered this way must not collide with on-chain names. To ensure this,
off-chain names use the on-chain name that propagated their initial
``creation'' operation as a suffix. For example, Alice would create the name
\texttt{alice.personal.id} by asking the owner of \texttt{personal.id} to
propagate a ``creation'' operation for it.
Off-chain names retain the same safety properties as on-chain names. Off-chain
names are owned by separate private keys. Only the owner of an off-chain name's
private key can generate a valid ``update'' or ``re-key'' operation. Gaia nodes
only accept newly-discovered off-chain operations if they are signed by the
right principal. The ``create'' operation sets the initial public key.
However, off-chain names do not have the same liveness properties as on-chain
names. An off-chain name owner needs the cooperation of an on-chain name owner
to propagate a new operation. Specifically, a ``create'' and ``re-key''
operation \emph{must} be propagated by the owner of the corresponding on-chain name
(i.e. only \texttt{personal.id}'s owner can propagate these transactions for
\texttt{alice.personal.id}). In addition, a ``create'' or ``re-key'' operation
will only be accepted if the on-chain name owner has propagated \emph{every}
zone file.
The reason for these constraints is that they ensure that each client receives
the \emph{current} public key for each off-chain name, regardless of the order
in which zone files are replicated. This requires making sure
that the history of non-idempotent off-chain name operations (creation, re-keying) is
linearized. By requiring all zone files
for the on-chain name to be present before processing a subsequent zone file for
the on-chain name, Gaia ensures that no conflicting off-chain name
creation or re-keying operations exist in each off-chain name's history.
Unlike ``create'' and ``re-key'', the ``update'' operation can be propagated by
any on-chain name owner. This is because changing an off-chain name's zone file
is idempotent and commutative. The order in which the off-chain name ``update''
operations are propagated to Gaia's peer network does not affect a correct
Gaia node's ability to resolve the off-chain name to its zone file.
Even though creating off-chain names only improves the system's throughput by
a constant factor, it is sufficient in practice. The Gaia implementation
allows an on-chain name's zone file to be up to 40~KB, which can fit between
100 and 120 off-chain name operations without compression.
At a rate of eight transactions per block (about 3.4\% of Bitcoin's throughput),
the system can accommodate over 115,000 new user registrations
per day (about the rate at which Twitter acquires
new users~\cite{twitter-user-acquisition}). % https://www.statista.com/statistics/303723/twitters-annual-growth-rate-worldwide/
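The arithmetic behind this estimate, assuming roughly 100 name operations per transaction's zone file, is:
\[
8~\mathrm{tx/block} \times 144~\mathrm{blocks/day} \times 100~\mathrm{name~ops/tx} \approx 115{,}200~\mathrm{name~ops/day}.
\]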
\subsection{Global Relational Databases}
Even though users own and control their volumes,
application developers need global insights into how people use their
applications in order to catch bugs and make improvements.
For example, developers often need to answer queries like:
\begin{itemize}
\item Which users are using a given application? Which version?
\item Which users are the ``power users'' who create a lot of content?
\item Which users only try the app out a few times and then abandon it?
\item Which users use a competing application?
\end{itemize}
Users need similar functionality to implement cross-volume queries and search
functionality. In the Web today, this need is fulfilled by a search engine.
Similarly, users should not be expected to discover which other users have
useful public data in Gaia without the aid of a search tool.
Gaia addresses both needs by providing the means to
implement a global, read-only, software-defined SQL database (called a ``Gaia
database'') over the entire set of public data in the system. Application developers and search
engine providers instantiate their own databases for specific
applications, specific users, and/or specific
types of public data. The database services queries by fetching
public data from users' multi-reader Gaia volumes, so each query still
invokes the data owner's aggregation driver (thereby preserving each
organization's ability to control data exposure to such systems).
A Gaia database is straightforward to
implement because the set of usernames, and thus the set of volume certificate
graphs, is \emph{globally consistent} and \emph{enumerable} thanks to the SSI
system. The developer instantiates a Gaia database by first
enumerating all user names (i.e. by instantiating a Gaia node and
an SSI node). They build up the list of all user names, and for each user name,
they fetch the user's list of volumes and then search the user's public data.
This information is then fed into a commodity database of the developer's
choice. This indexing process is very similar to the indexing process of a Web
crawler, but is simpler to implement because it already has the complete list of
all data to index.
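A minimal indexing loop, with hypothetical client handles (\texttt{ssi\_node}, \texttt{gaia\_node}, \texttt{db}) standing in for real SSI, Gaia, and database clients, might look like the following sketch:
\begin{verbatim}
def build_gaia_database(ssi_node, gaia_node, db, app_origin):
    """Index all public data for one application into a commodity
    database by enumerating every user name known to the SSI system."""
    for username in ssi_node.list_all_usernames():
        profile = ssi_node.lookup(username)          # zone file, keys, certs
        for volume in gaia_node.list_volumes(profile):
            if volume.app != app_origin or not volume.multi_reader:
                continue                             # skip private volumes
            for key, value in gaia_node.read_public(volume):
                db.upsert(username=username, key=key, value=value)
\end{verbatim}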
The Gaia database design pattern is frequently seen in Gaia-powered applications,
since it allows developers to solve common problems like building search
engines or aggregating content in application-specific ways. For example,
a photo-sharing application would use
an application-specific Gaia database instantiated in this manner that
aggregates a user's friends' photos on their behalf. The Gaia database
would coordinate with the photo-sharing app clients to implement a write-through view of
the users' photo albums. Whenever the user posts a new photo or likes someone else's
photo, she would both save this new state to her Gaia node and send a hint
to the Gaia database that new content had been written (so other users could
be informed in real time).
It is important to remember that even though a Gaia database gives
developers a similar set of insights into user data that they enjoy today in
conventional Web applications, the key improvement offered by Gaia is that
the Gaia database preserves each
organization's autonomy. The Gaia database is simply a downstream replica of
each organization's state, and is used in SDS applications as a read optimization
(i.e. so clients do not need to fetch data from as many sources).
Each organization can still unilaterally deny queries or service them in
domain-specific ways using their volumes' aggregation drivers.
This relationship is similar to how search engines on the Web today host downstream
replicas of Web content, and rely on the \texttt{robots.txt}
convention~\cite{robots-txt-convention} to control access to site content.
But unlike the \texttt{robots.txt} convention, Gaia gives users the
\emph{unilateral} ability to filter what a Gaia database can query by means of
strong encryption.
A demonstration of the Gaia database design pattern is available online to search the
set of Blockstack user names~\cite{blockstack-explorer}. It allows users to
discover other users' linked social media profiles. In addition, this design
pattern is common to Gaia-powered applications described in the next chapter.
\section{Syndicate: an SDS File System}
Syndicate is a scalable software-defined distributed file system meant for scientific
workloads. Unlike Gaia, Syndicate is designed to provide shared volumes that
efficiently leverage CDNs for read-heavy workloads and
support I/O from a scalable number of concurrent users. It is meant to be
used for sharing data across compute clusters, where the data sources and
sinks reside in different organizations.
\subsection{Motivation}
Science research is increasingly data-driven and increasingly distributed.
Researchers often share large datasets with other labs across the world and
with the public. As storage becomes cheaper, scientists can
afford to generate and retain larger and larger amounts of data for the
indefinite future.
These trends create an interesting set of operational challenges:
\begin{itemize}
\item How do scientists onboard new users and labs that use different
technology stacks than their own?
\item How do scientists keep legacy data-processing workflows running in the face
of changing storage and compute systems?
\item How do scientists take advantage of commodity storage and
compute technologies without having to write a lot of bespoke code
to do so?
\item How do scientists enforce data access and retention policies when the
underlying storage substrate can be changed out from under them?
\end{itemize}
The standard practice today is messy. Each time a lab wants to change its
storage system, it must re-work its workflows to be compatible. This entails
more than patching the code to read and write data. It also means changing their
operational practices for staging data for computation and changing the way they
share data, both internally and with other labs.
The recent ``containerized approach''---using containers, VMs, and
SDNs to preserve the runtime environment for scientific workflows---is
a step in the right direction for preserving end-to-end storage semantics.
However, it still forces scientists to copy their data into the new
runtime and copy results back out, and it additionally forces scientists to
maintain the (virtualized) infrastructure. This puts them in the uncomfortable
position of having to become experts in state-of-the-art devops techniques and
in data management software.
These challenges stem from the fact that scientists increasingly need to share
data across organizations. Organizations include individual
scientists' computers, the computers in the same research group, the computers
across a collaborative set of research groups who
work across multiple labs (including multiple universities, corporations, and
countries), and the general public. Whenever a scientist in one organization
needs data in another organization today, they need to manually copy it out into their
organization's storage. At the same time, whenever a scientist needs to report the results
of their workflow to another organization, they have to manually replicate it to a place
where other organizations can read it.
What is needed instead is a storage system that preserves end-to-end storage
semantics \emph{across} organizations while interoperating with legacy storage.
Scientists should not have to manually access and copy data to move it between
organizations. Instead, the workflow
software ought to be able to do that automatically on an as-needed basis,
while preserving the workflow's expected end-to-end storage semantics.
\subsection{Gateway Types}
Syndicate accommodates cross-organizational data acquisition and
data replication by supplying specially-crafted gateways designed to make it
easy to share and store data. An organization that wishes to share data with
another organization would encode its rules for allowing access into an
\emph{acquisition gateway} that takes care of indexing and exposing the data as
manifests and blocks. An organization that wishes to store the results of
scientific computations would run a \emph{replica gateway} that enforces rules
that govern whether or not (and how) to store manifests and blocks within the
organization's storage systems. Linking the two together are \emph{user
gateways} that expose the Syndicate-formatted data to scientific workflows in a
workflow-defined manner, such as an externally-mounted filesystem within a
container (Figure~\ref{fig:chap3-syndicate-overview}).
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=20]{figures/dissertation-figures}
\caption{Syndicate overview. Syndicate offers three types of gateways (UGs,
RGs, and AGs) for interfacing with CDNs, storage providers, and datasets,
respectively. It is targeted towards science labs.}
\label{fig:chap3-syndicate-overview}
\end{figure}
\textbf{Acquisition gateways} (AGs) are gateways that connect to an externally-hosted
dataset and ``import'' its records into a Syndicate volume in a read-only
fashion. An AG does so by crawling its backend dataset, and publishing metadata
for each (logical) record to the Syndicate MS. Other gateways read the dataset
by first discovering the metadata, and then asking the AG for the manifest and
chunks (which it generates on-the-fly by fetching data from its backend
dataset).
\textbf{Replica gateways} (RGs) are gateways that connect to existing storage
systems. They provide a read/write interface at the chunk granularity.
The prototype Syndicate implementation comes with service drivers for Amazon S3~\cite{s3}, Dropbox~\cite{dropbox},
Google Drive~\cite{gdrive}, Amazon Glacier~\cite{amazon-glacier}, iRODS,
and local disk (for compatibility with NFS~\cite{nfs}, AFS~\cite{afs},
Ceph~\cite{ceph}, and other legacy distributed filesystems used today).
\textbf{User gateways} (UGs) are gateways that connect users and their workflows
to other gateways. Each UG provides a different interface to workflows, subject
to their needs. For example, Syndicate comes with a UG that implements a
FUSE~\cite{fuse} filesystem, a UG that implements a RESTful~\cite{rest}
interface, a UG that implements a suite of UNIX-like shell utilities, and a UG
that implements a Hadoop filesystem~\cite{hadoop} backend.
\textbf{Other types}. Syndicate allows operators to specify new gateway types at runtime, allowing
them to incrementally deploy and adapt the system to changing workloads. Each
gateway's type is embedded in its certificate, so each gateway knows at all
times the network addresses and types of all other gateways in the volume.
This allows the operator to construct complex acquisition and replication
strategies that span multiple hosts and multiple organizations.
This feature is put to use in Chapter~\ref{chap:applications}.
Each organization runs the appropriate gateways on their computers depending on
how they wish to interact with the data. This allows scientific workflows to
run across organizational boundaries in an automated fashion, allowing
scientists to \emph{independently} devise new workflows without incurring the
cost of coordinating with each lab to set them up.
For example, an astronomy lab would
run acquisition gateways to expose telescope images of Earth. The lab could stipulate rules
in its aggregation driver code that ensure that newly-generated images are only
readable to a privileged set of labs for a time (e.g. only labs in the same
country) before releasing them to the public. Similarly, a meteorology
lab would run replica gateways to store data from trusted scientists, and
store it in a time-series fashion. Unbeknownst to either lab, a scientist could
run a user gateway on her laptop and on her VMs that allowed her to read from both the
astronomy and meteorology labs' gateways and write data to both the meteorology
lab and to her Dropbox account to be shared with her collaborators. By having
multiple gateway types running in these specific roles, no coordination is
necessary between the astronomy and meteorology labs.
\subsection{Data Flows}
Syndicate gateways route requests to one another based in part on their
types. In other words, a gateway's type identifies the steps it is guaranteed
to take while handling an access or mutate flow.
The UG, RG, and AG gateway types identify a set of common routing policies that
work well in practice
(Figure~\ref{fig:chap3-syndicate-reads-writes}).
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=21]{figures/dissertation-figures}
\caption{Reads and writes in Syndicate. Writes are initiated by UGs, which
rely on RGs to Push chunks. UGs Publish the files that they coordinate. AGs
crawl datasets and Publish them as files, and both RGs and AGs serve UGs
chunks as part of their Acquire stages.}
\label{fig:chap3-syndicate-reads-writes}
\end{figure}
A UG initiates access flows to AGs and RGs to handle reads, but initiates mutate
flows only to RGs to handle writes. A UG's Discover step will always fetch the
record's manifest metadata from the MS, while optionally caching it for a
user-specified length of time. Once it has the manifest metadata, it identifies
whether or not the record's coordinator is an AG or another UG. If it is an
AG, it will fetch the chunks directly from it in its Acquire step. If it is an
UG, however, it will try to fetch the chunks from each RG. It considers the
read to be successful if it Acquires all of the chunks requested (without regard
to which gateways served them).
A UG initiates mutate flows only to RGs. It executes a logical write by Pushing
its modified chunks to each RG. That is, it starts one mutate flow per RG in
the volume. Once each RG successfully processes the request, the UG Publishes the
new manifest metadata to the MS.
The behavior of the Build and Publish stages depends on whether or not the UG is
the record coordinator. If the UG is the coordinator, it will Build the
manifest, Push the manifest and chunks, and then Publish the metadata. If it is
not the coordinator, it will Push only the blocks, and then
contact the coordinator UG to Build and Publish the manifest.
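The write path can be summarized in the following sketch. All handles (\texttt{ug}, \texttt{volume}, the RG and MS clients) are hypothetical; the point is the split between coordinator and non-coordinator behavior.
\begin{verbatim}
def ug_write(ug, record, modified_chunks, volume):
    """Push modified chunks to every RG (one mutate flow per RG), then
    Build and Publish the manifest locally if this UG coordinates the
    record, or hand off to the coordinator UG otherwise."""
    for rg in volume.replica_gateways():
        rg.push(record, modified_chunks)
    if ug.is_coordinator(record):
        manifest = ug.build_manifest(record, modified_chunks)
        for rg in volume.replica_gateways():
            rg.push(record, [manifest])
        ug.metadata_service.publish(record, manifest)
    else:
        coordinator = ug.lookup_coordinator(record)
        coordinator.build_and_publish(record,
                                      [c.chunk_id for c in modified_chunks])
\end{verbatim}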
AGs and RGs do not initiate any flows of
their own. AGs are always the coordinators for the records they Publish. They mark their
records as read-only, and will not participate in any mutate flows for them.
They will participate in access flows to serve chunks to other UGs in the volume.
RGs load, store, and delete chunks in their underlying storage systems. They do
not serve as coordinators. They react to chunks uploaded by the UG by running
their Push driver stage on each of them.
\subsubsection{Custom Gateways}
The ability to add new gateway types allows operators to define additional
flow-processing policies. Each gateway in the volume can determine the type of
all other gateways, which allows their drivers to make custom routing decisions.
This allows operators to implement stage logic that extends the
routing behavior of existing gateways.
For example, suppose the operator defined a custom gateway
type called a \emph{write-logger gateway} (WLG) for logging all mutate flows.
A WLG is not considered to be
an RG, UG, or AG, so the other gateway types will ignore WLG instances by
default. However, the operator could modify the Push implementation for
her volume's RGs to send a syslog message to each WLG in the volume to record
whether or not the RG completed the write flow successfully. In doing so, the
operator is able to define custom mutate flow routing and processing
logic for the volume by \emph{composing} multiple gateways together in a
pipeline. The UG, RG, and AG implementations do not need to be modified at all;
only the RG driver code needs to be patched (which is facilitated by Syndicate's
view-change facilities).
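As a sketch of this example (with hypothetical handles and type name), the patched RG Push stage might look like:
\begin{verbatim}
def rg_push(rg, chunk, volume):
    """Write-logger example: store the chunk as the RG normally would,
    then notify every write-logger gateway (WLG) in the volume about
    whether the write flow completed successfully."""
    ok = rg.store_chunk(chunk)
    for gw in volume.gateways():
        if gw.type == "WLG":        # custom gateway type defined at runtime
            gw.send_log("push chunk=%s ok=%s" % (chunk.chunk_id, ok))
    return ok
\end{verbatim}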
\subsection{Data Organization}
Unlike Gaia, each record in a Syndicate volume has its own manifest, and is
comprised of a variable number of blocks. The block size is fixed for the
volume, but each volume can have its own block size.
Volumes in Syndicate can have arbitrarily many data records, and each data
record may have an arbitrary size (i.e. it may be made of arbitrarily many blocks).
Manifests, blocks, and certificates are all cacheable for
indefinite amounts of time, since Syndicate ensures that they are all immutable
(that is, they each receive new IDs in the system when their contents change).
Readers construct URLs to manifests, blocks, and certificates using their IDs to
ensure that any intermediate caches serve the right data. Readers learn the IDs
directly from the MS, and use in-band hints to determine when their view of
these IDs is stale (as described in Chapter~\ref{chap:design_principles}).
\subsubsection{Garbage Collection}
A consequence of immutability is that writes to a record will cause
overwritten blocks and manifests to become unreferenced. To prevent storage leaks, Syndicate's gateways
execute a distributed garbage-collection protocol to remove them. The process
is asynchronous and tolerant of gateway failures.
When the coordinator of a record uploads new metadata to the MS, it includes a
vector of block IDs and the old manifest ID. These are appended to a per-record
log in the MS. Once the write completes, the coordinator asynchronously queries
the MS for the first $k$ entries in this log, constructs \texttt{delete}
requests for them, and sends the requests to the volume's replica gateways.
Once all replica gateways successfully acknowledge deletion, the coordinator
instructs the MS to remove the $k$ entries from the log.
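A sketch of the coordinator-side garbage-collection loop (hypothetical handles; the real protocol also distinguishes block IDs from the old manifest ID) is:
\begin{verbatim}
def garbage_collect(ms, volume, record, k=64):
    """Drain the per-record log of unreferenced chunk IDs: delete the
    oldest k entries from every RG, and only then tell the MS to trim
    the log. Repeats until the log is empty; safe to retry on failure."""
    while True:
        chunk_ids = ms.read_gc_log(record, limit=k)
        if not chunk_ids:
            return
        for rg in volume.replica_gateways():
            for chunk_id in chunk_ids:
                rg.delete(record, chunk_id)     # must succeed on every RG
        ms.trim_gc_log(record, count=len(chunk_ids))
\end{verbatim}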
\subsection{Metadata Service}
Syndicate's MS runs on top of a scalable NoSQL database. In practice,
deployments run within Google AppEngine~\cite{google-appengine} and
AppScale~\cite{appscale}, meaning that
Syndicate's metadata is hosted in Cassandra~\cite{cassandra},
HBase~\cite{hbase}, MySQL~\cite{mysql}, Megastore~\cite{megastore}, or
Spanner~\cite{spanner}. In all cases, writes to a single key are atomic, and
multi-key atomic writes are allowed provided that the set of keys is small (e.g.
five or fewer in the implementation).
There are two reasons for building the Syndicate MS on top of a third-party NoSQL store.
The first reason is that it makes it easy to automate the MS operation.
When running on Google AppEngine, for example, deploying an MS from
scratch can be done simply by creating a new AppEngine project and pushing the
code from the user's laptop to Google's servers. No further maintenance or
infrastructure administration is required, beyond setting up billing.
The second reason for building on top of a NoSQL store is that it makes it easy
to parallelize non-conflicting operations. The MS metadata records are
structured such that in the absence of concurrent writes,
metadata reads execute in parallel. Moreover, writes to different metadata
records execute in parallel. The number of operations that can execute in
parallel depends on how many hosts are running the NoSQL store; adding more
hosts increases the number of parallel operations supported. By building on top
of Google AppEngine specifically, Syndicate users can easily acquire more capacity as
their workloads need it.
Syndicate organizes metadata into a filesystem-like directory hierarchy.
The Syndicate MS does not resolve paths for gateways. Instead, the user
gateway iteratively walks the metadata record hierarchy by querying each
directory on the path and searching for the next metadata record along the path.
Because it is designed for read-heavy workloads, Syndicate's MS
directories are log-structured, and meant to be cached by
other gateways. When a user gateway walks a directory for the first time, it
fetches the entire log and caches it indefinitely. When it walks it again, it
only fetches the new log entries from the MS, and replays them locally to obtain
the current state of the directory. Meanwhile, the MS asynchronously
prunes a directory's log as entries are updated or removed, thereby keeping the
``steady state'' size of the directory log at $O(n)$ for $n$ entries.
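A sketch of the client-side read path (with hypothetical \texttt{ms} and \texttt{cache} handles) shows how the log is fetched and replayed:
\begin{verbatim}
def refresh_directory(ms, cache, volume_id, dir_id):
    """Fetch the whole directory log on first access, then only the new
    entries, and replay them into a locally cached hash table."""
    offset, listing = cache.get(dir_id) or (0, {})
    entries = ms.fetch_dir_log(volume_id, dir_id, offset=offset)
    for entry in entries:               # replay creates, updates, deletes
        if entry.deleted:
            listing.pop(entry.name, None)
        else:
            listing[entry.name] = entry.metadata
    cache.put(dir_id, (offset + len(entries), listing))
    return listing
\end{verbatim}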
As a result, a user gateway can expect an $O(n)$ time and space overhead when
fetching a directory for the first time, and $O(k)$ time and space overhead
when synchronizing its log after $k$ Publish operations on the directory's
children have been processed (each metadata record
has a constant size, takes constant time to parse, and takes a constant amount
of time to cache if the user gateway implements directories with a hash table).
The worst-case time and space bound for fetching a directory occurs
when the MS is receiving many Publish requests to the directory's children
at the time of read. In this case, the bound is $O(n+k)$ for $k$ additional
(uncompressed) Publishes.
In practice, $k$ is small relative to $n$ since Syndicate is used with
read-heavy workloads. Oftentimes, datasets are write-once, read-many,
so $k$ can be zero. In addition, a volume's directory structure is expected
to be broad and shallow. With the datasets explored in this thesis,
scientific workflows tend to store their data as many files
within a single directory, and scientific workflows sharing a volume tend to
create their own top-level directories to store their data.
\subsection{Programming Model}
Because Syndicate's gateways are already designed to fulfill special roles in
the system, each gateway has its own programming model. This helps developers
avoid re-implementing boilerplate logic, and instead focus on helping the
gateway fulfill its designated role.
However, some commonalities exist.
Syndicate gateways implement HTTP servers to serve chunks to one
another, in order to remain compatible with existing CDNs. Similarly, they
implement HTTP clients to pull chunks from the CDNs, and from existing
Web-accessible storage services and datasets.
Each gateway's programming model is inspired by the FastCGI protocol~\cite{fastcgi},
whereby the server spins up one or more long-lived ``worker'' subprocesses to
handle a particular kind of request (e.g. defined by a canonical HTTP path).
This gives developers a way to implement long-lived stateful aggregation driver
logic. The aggregation drivers run as \emph{stateful} FastCGI workers, and
the service drivers run as \emph{stateless} libraries that are loaded by the
aggregation drivers as needed.
Syndicate's driver model distinguishes between the \emph{logical} representation of a record,
the \emph{application} representation of the record, and the
\emph{on-the-wire} representation of the record. The logical representation is
simply a flat byte array (i.e. a file), with
additional metadata describing the block boundaries and ownership information
contained within the manifest.
Application-facing gateways (i.e. UGs in Syndicate) are free to represent data
to the application in any way they want. For example, a UG implementation
may represent a data record as a
SQL database. Such a UG would require applications to interact with the data
via SQL commands. The implementation would translate the commands into
reads and writes on the record's bytes
at the logical layer. Syndicate implements a UG programming library and
SDK to allow developers to provide application-specific interfaces.
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=22]{figures/dissertation-figures}
\caption{Syndicate driver model overview. Each gateway controls how it
serializes and deserializes its chunks, but otherwise each type of gateway has a
unique driver profile.}
\label{fig:chap3-syndicate-driver-model}
\end{figure}
Syndicate's aggregation driver model also gives gateways the ability to
control a record's chunks' on-the-wire representation. This allows the
developer to control how the networks that connect gateways view the data. For
example, the developer can implement end-to-end encryption by encrypting and
decrypting chunks as they are transmitted and received, thereby hiding
data from the networks. As another example, the developer can buffer and send
batches of chunks between gateways on-the-wire independently of the logical and
application representations. An overview of the driver model is seen in
Figure~\ref{fig:chap3-syndicate-driver-model}.
\subsubsection{On-the-wire Processing}
All gateway drivers implement a \texttt{serialize()} and \texttt{deserialize()}
method to translate a logical block or manifest to its on-the-wire
representation and back. The \texttt{serialize()} method is called whenever the gateway sends
data or caches it to disk, and the \texttt{deserialize()} method is called
whenever the gateway receives data or loads it from its on-disk cache.
Unlike the remainder of a gateway's methods, these methods are \emph{always}
invoked whenever a chunk is loaded or stored by the gateway.
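For illustration, the driver sketch below compresses chunks on the wire; an end-to-end encryption driver would hook in at exactly the same two points. The class name is hypothetical and only the Python standard library is assumed.
\begin{verbatim}
import zlib

class CompressingDriver:
    """Illustrative on-the-wire transform for a gateway driver."""
    def serialize(self, chunk: bytes) -> bytes:
        # Called whenever the gateway sends a chunk or caches it to disk.
        return zlib.compress(chunk)

    def deserialize(self, wire: bytes) -> bytes:
        # Called whenever the gateway receives a chunk or loads it from
        # its on-disk cache.
        return zlib.decompress(wire)
\end{verbatim}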
\subsubsection{Acquisition Gateway Service Drivers}
The AG driver model is designed to handle datasets that can change due to
external modifications. For example, the iRODS AG driver
subscribes to the iRODS event queue, which allows it to get notified
when files it indexes change. This allows it to push updates for them to the MS
in order to ensure that the state of the backend dataset is accurately reflected
by the volume.
\noindent{\textbf{Aggregation Driver:}} An AG only needs to implement the Publish stage of
the aggregation driver model, since it will never initiate access or mutate
flows. Its Publish stage is implemented as a method that the AG repeatedly
calls. It takes nothing as input, but outputs new record metadata and a
hint as to whether or not to create, update, or delete the record on the MS. The
implementation is allowed to block the AG in the event that the backend dataset
has not changed.
\noindent{\textbf{Service Driver:}} When another gateway asks for a block from
one of the records, the AG forwards it to its service driver in order to fetch
the bytes from the dataset (the \texttt{read()} method). The AG will
automatically generate manifests on request.
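A hypothetical AG driver for a local directory tree illustrates the two halves of this profile: a \texttt{publish()} method that is polled repeatedly and may block until the backend dataset changes, and a \texttt{read()} method that fetches the bytes backing one block. This is a sketch under those assumptions, not the shipped driver interface.
\begin{verbatim}
import os, time

class DirectoryAGDriver:
    """Illustrative AG driver that indexes a local directory tree."""
    def __init__(self, root, block_size):
        self.root = root
        self.block_size = block_size
        self.seen = {}                  # path -> last observed mtime

    def publish(self):
        # Aggregation driver stage: return (record, metadata, hint),
        # where hint is 'create' or 'update'; blocks until a change.
        while True:
            for name in sorted(os.listdir(self.root)):
                path = os.path.join(self.root, name)
                mtime = os.path.getmtime(path)
                if self.seen.get(path) != mtime:
                    hint = "create" if path not in self.seen else "update"
                    self.seen[path] = mtime
                    return path, {"size": os.path.getsize(path)}, hint
            time.sleep(5)

    def read(self, path, block_id):
        # Service driver stage: fetch the bytes backing one block.
        with open(path, "rb") as f:
            f.seek(block_id * self.block_size)
            return f.read(self.block_size)
\end{verbatim}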
\subsubsection{User Gateway Drivers}
The UG driver model is designed to pull chunks from RGs and AGs, and push new
chunks to RGs. Unlike the other gateways, the UG driver model gives developers
a chance to have the UG connect to one or more CDNs to fetch chunks.