document issues with relationships as entities #7

Closed
carpentermp opened this issue May 31, 2011 · 16 comments

Comments

@carpentermp

The “QName role” in RelationshipReference is “denormalized” information, right? The role is defined in the Relationship, so storing it in the reference is redundant and opens the possibility of it not being in agreement with the relationship. If we are putting denormalized information in the RelationshipReference then we ought to be explicit about why we are doing it. Is it to help the client know which relationships to dereference? For example, suppose we want to identify all the children of a person, we would only need to dereference the RelationshipReferences where the role is “Parent”. Is this the reason for the role on the reference?
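To make the question concrete, here is a minimal sketch of how a client might use the denormalized role to decide which references to dereference. The `RelationshipReference` fields (`href`, `role`) and the helper name are assumptions made for illustration, not the actual GedcomX model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationshipReference:
    href: str  # hypothetical link to the full Relationship resource
    role: str  # denormalized copy of the role this Person plays


def references_to_dereference(refs, wanted_role):
    """Return only the references whose denormalized role matches."""
    return [ref for ref in refs if ref.role == wanted_role]


# Hypothetical references found on one Person.
refs = [
    RelationshipReference("/relationships/1", "Parent"),
    RelationshipReference("/relationships/2", "Child"),
    RelationshipReference("/relationships/3", "Parent"),
]

# To find the children, only the "Parent" references need to be fetched.
to_fetch = references_to_dereference(refs, "Parent")
```

The saving is one request per skipped reference. The cost is that the role now lives in two places and can drift out of agreement.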

If we are putting this kind of denormalized information on a person we need to be clear how we plan to keep everything self-consistent. My suggestion would be that we don’t allow Relationships to be created without two Persons and a “type”, and that these three pieces of information are immutable. (I know this flies in the face of Rontel’s opinions on the subject). This gets into all the life-cycle questions between Person and Relationship that need to be clearly defined and documented.

Though we have not yet been very explicit about this, I believe that the current state of the model implies the following:

• When a Relationship is created or deleted, the Persons involved are “modified” with the addition/deletion of a “RelationshipReference”. While I personally agree with this, I believe it precludes the “git-like” editing model that John Sumsion et al. are wanting, since their model requires a “directed-acyclic-graph”. It may be worth exploring their ideas some more before we shut the door on them, I don’t know. We would also need to explore what it would mean to the model if Persons didn’t have any explicit knowledge of the Relationships they participate in. In other words, what if Persons were NOT modified when a Relationship is created that refers to them?
• When an Event or Characteristic is added to, or removed from, a Relationship, the Persons involved are NOT modified. (This seems rather unfortunate, from my point of view, since someone “watching” a Person would probably consider a new “marriage date” to be the kind of thing they would like to be notified of. I believe this is one of the side-effects of having Relationships as entities and now opens up the need for users to explicitly “watch” Relationships as well as Persons. To me, it would seem much more simple and natural for a user to “watch” a Person and be notified of any change to any Relationship that the Person participates in.)
• When a Person is deleted, the Relationships the Person is involved in are also deleted.
• When a Person is merged with another Person, what happens to the Relationships? (Is the “uniqueness constraint” part of the model? If so, then some merging of Relationships would seem to be implied. If not, then systems are free to do this differently. Unfortunately, this becomes an impediment to interoperability since one system may allow multiple relationships of the same type between the same two people, and another may insist upon uniqueness. My opinion is that this is the type of thing that a well-defined model is supposed to guard against and ought to be clearly specified.)

@stoicflame
Member

I think the bulk of this conversation is implementation-specific. Granted, the questions you raise here are issues that every provider is going to need to deal with in some way or another, and I think we need to address these questions in the form of some documentation that defines best practices and suggestions. I'm going to leave this issue open until that documentation is provided.

@carpentermp
Author

I personally don't think it is sufficient to say that these issues are implementation-specific because of the interoperability problems that arise when implementors make different choices.

@stoicflame
Member

Here is my personal take on the specific questions you deal with here.

The “QName role” in RelationshipReference is “denormalized” information, right? The role is defined in the Relationship, so storing it in the reference is redundant and opens the possibility of it not being in agreement with the relationship. If we are putting denormalized information in the RelationshipReference then we ought to be explicit about why we are doing it. Is it to help the client know which relationships to dereference? For example, suppose we want to identify all the children of a person, we would only need to dereference the RelationshipReferences where the role is “Parent”. Is this the reason for the role on the reference?

Yes, that's the intent of having the "role" on the relationship reference, and we need to better document that it's there as a convenience for knowing which relationships to dereference, and that inconsistent data is possible if a provider has a buggy implementation.

My suggestion would be that we don’t allow Relationships to be created without two Persons and a “type”, and that these three pieces of information are immutable.

I agree with that implementation, but (again) this is implementation-specific and there's no way for the model to actually enforce these rules.

I believe that the current state of the model implies the following:
When a Relationship is created or deleted, the Persons involved are “modified” with the addition/deletion of a “RelationshipReference”. While I personally agree with this, I believe it precludes the “git-like” editing model ...

The git-like model for conclusion data is obviously not proven out yet. But I don't believe that relationship references preclude someone who wants to implement a git-like editing model from doing so. In their implementation, they would simply have persons that don't have any relationship references and work out for themselves all of the implications of that.

When an Event or Characteristic is added to, or removed from, a Relationship, the Persons involved are NOT modified. (This seems rather unfortunate, from my point of view, since someone “watching” a Person would probably consider a new “marriage date” to be the kind of thing they would like to be notified of...)

I absolutely agree that the notion of "watching" a person should include notification of an event/characteristic being added to a relevant relationship. But (again) this is implementation-specific. It's all about how a "watch" is implemented. The model doesn't impose that a watch on a person can't include notification of marriage events.

When a Person is deleted, the Relationships the Person is involved in are also deleted.

I would hope so, but (again) implementation-specific.

When a Person is merged with another Person, what happens to the Relationships?

Implementation-specific. Maybe some implementations don't even have the concept of "merge". FamilySearch certainly does, and they have to figure out what that means.

@carpentermp
Author

I have some follow-up comments on your comments...on my comments :)

that the current state of the model implies the following:
When a Relationship is created or deleted, the Persons involved are “modified” with the addition/deletion of a “RelationshipReference”. While I personally agree with this, I believe it precludes the “git-like” editing model ...

The git-like model for conclusion data is obviously not proven out yet. But I don't believe that relationship references preclude someone who wants to implement a git-like editing model from doing so. In their implementation, they would simply have persons that don't have any relationship references and work out for themselves all of the implications of that.

I don't personally see this as an option for them. Suppose they respond to a GET request on a Person with a Person that has no RelationshipReferences? To the client, it will seem that this person is not related to anyone, when in fact, if they only knew how to ask, they could see the list of relationships.

If they return the RelationshipReferences but don't consider the Person modified when a Relationship involving the Person is created or deleted, then that destroys the caching model. Clients will have no idea what "not modified" means.

When an Event or Characteristic is added to, or removed from, a Relationship, the Persons involved are NOT modified. (This seems rather unfortunate, from my point of view, since someone “watching” a Person would probably consider a new “marriage date” to be the kind of thing they would like to be notified of...)

I absolutely agree that the notion of "watching" a person should include notification of an event/characteristic being added to a relevant relationship. But (again) this is implementation-specific. It's all about how a "watch" is implemented. The model doesn't impose that a watch on a person can't include notification of marriage events.

Again, I believe this has to be spelled out for proper interoperability. If one system considers the Person modified and another doesn't, users will see very unpredictable behavior. Imagine a client displaying a given person making a HEAD request or some such thing to see if the person's data needs to be updated. The server says the Person is not modified, but some of the data being shown is now different. That's not going to be a good experience.
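Here is a rough sketch of the divergence I am worried about. It is not any real server's behavior: the version tag is just a hash, and the `include_relationship_facts` switch stands in for the two policies a provider might choose. All names are hypothetical.

```python
import hashlib


def etag(*parts):
    """Hash the given strings into a short opaque version tag."""
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]


def person_etag(person, relationships, include_relationship_facts):
    """Compute a Person's version tag under one of two provider policies."""
    parts = [person["name"]]
    for rel in relationships:
        parts.append(rel["type"])
        if include_relationship_facts:
            parts.extend(rel["facts"])
    return etag(*parts)


person = {"name": "Joe"}
rels = [{"type": "Couple", "facts": []}]

before_a = person_etag(person, rels, include_relationship_facts=False)
before_b = person_etag(person, rels, include_relationship_facts=True)

# Someone adds a marriage Fact to the Relationship.
rels[0]["facts"].append("Marriage 1799")

after_a = person_etag(person, rels, include_relationship_facts=False)
after_b = person_etag(person, rels, include_relationship_facts=True)

# Provider A answers "not modified"; provider B answers "modified".
not_modified_on_a = before_a == after_a
modified_on_b = before_b != after_b
```

A generic client revalidating its cached Person would refresh the display against provider B and keep stale data against provider A, even though the same edit happened on both.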

When a Person is deleted, the Relationships the Person is involved in are also deleted.

I would hope so, but (again) implementation-specific.

You are suggesting that one system may delete the relationship, where another may just delete a person from the relationship. This implies that one system may allow relationships to be created with a single person, and another may not. A generic client, operating against both systems will have a difficult time knowing what is going to happen when a given write operation is chosen. This difficulty will likely be passed on to users of the client. Again, if you don't specify the correct system behavior, it will impair interoperability of services with clients and it will be a bad experience for users.

When a Person is merged with another Person, what happens to the Relationships?

Implementation-specific. Maybe some implementations don't even have the concept of "merge". FamilySearch certainly does, and they have to figure out what that means.

It appears that you aren't really trying to achieve "write interoperability", but are content with "read interoperability". Is this so?

@stoicflame
Member

I don't personally see this as an option for them. Suppose they respond to a GET request on a Person with a Person that has no RelationshipReferences? To the client, it will seem that this person is not related to anyone, when in fact, if they only knew how to ask, they could see the list of relationships. If they return the RelationshipReferences but don't consider the Person modified when a Relationship involving the Person is created or deleted, then that destroys the caching model. Clients will have no idea what "not modified" means.

You're assuming the git model has a REST interface, but the git-like interface is much different from a REST interface. You don't just "get" a person like you would in a REST interface. You "get" a repository of data.

I believe this has to be spelled out for proper interoperability.

Agreed. Spelled out in the definition of the watch endpoint.

@stoicflame
Member

It appears that you aren't really trying to achieve "write interoperability", but are content with "read interoperability". Is this so?

Umm... I'm not sure what you mean.

@carpentermp
Author

When a Person is merged with another Person, what happens to the Relationships?

Implementation-specific. Maybe some implementations don't even have the concept of "merge". FamilySearch certainly does, and they have to figure out what that means.

It appears that you aren't really trying to achieve "write interoperability", but are content with "read interoperability". Is this so?

Umm... I'm not sure what you mean.

My question came about because of the large number of things that you are saying are "implementation specific."

"Read interoperability" is achieved with the existence of a standard model and protocol which allows clients reading the data to all understand the data in a common way and to predictably navigate the connected resources. It also allows a single client to read data equally well from different providers.

"Write interoperability" is achieved with the existence of a standard model and protocol for modifying the data, specifying all the different ways that the data may be modified and what is the expected result. This allows generic clients to be developed that can let users do things like "add a conclusion to a Person" or "add an interpreted value to a field on a Record".

Most of my comments on this thread have been with the underlying assumption that we would be attempting to achieve both read and write interoperability. You wrote that some systems may not support "merge". That limits write interoperability since merge is a modify operation. Realizing this, it finally occurred to me that you may not have write interoperability as one of the goals of GedcomX.

While I can see that we may want to go for read interoperability as a first step, if we are ever hoping for write interoperability I believe we need to keep it in mind now, because it will affect modeling decisions. Let's take an example. For read interoperability, you have to decide whether or not multiple Relationships of a given type are allowed between the same two persons (what I commonly call the "uniqueness constraint" for Relationships). It has to be specified one way or the other so that clients can know whether or not they have to deal with it when reading relationships. Considering only the read implications it may seem like an o.k. thing to dispense with the uniqueness constraint. Now suppose "System A" does not enforce the constraint and "System B" does. Later, we decide to go for write interoperability. System A sends data to System B and System B is either forced to reject some relationships or merge them because of its uniqueness constraint. Compatibility suffers.
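The System A to System B scenario can be sketched as follows. This is only an illustration under my own assumptions: relationships are plain tuples of (id, person, person, type), and System B's policy here is to reject duplicates rather than merge them.

```python
def import_with_uniqueness(incoming):
    """System B: accept at most one Relationship per (person, person, type)."""
    accepted, rejected = {}, []
    for rel_id, person1, person2, rel_type in incoming:
        key = (person1, person2, rel_type)
        if key in accepted:
            rejected.append(rel_id)  # violates B's uniqueness constraint
        else:
            accepted[key] = rel_id
    return accepted, rejected


# System A does not enforce uniqueness, so it can export two "Couple"
# Relationships between the same two Persons (hypothetical data).
system_a_export = [
    ("r1", "joe", "sue", "Couple"),
    ("r2", "joe", "sue", "Couple"),
    ("r3", "joe", "bob", "ParentChild"),
]

accepted, rejected = import_with_uniqueness(system_a_export)
```

Whatever was attached to `r2` in System A has nowhere to go in System B unless B invents a merge policy that the model never specified.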

I can appreciate your desire to let some things be implementation specific, rather than specifying everything. On the other hand, I'm on the lookout for those things that are going to adversely affect interoperability and I'm trying to get them tied down. I would much rather come out with something that is too constrained and let the community urge us to relax the constraints than come up with something that is too loose so that the community suffers from poor interoperability ever after.

@carpentermp
Author

I don't personally see this as an option for them. Suppose they respond to a GET request on a Person with a Person that has no RelationshipReferences? To the client, it will seem that this person is not related to anyone, when in fact, if they only knew how to ask, they could see the list of relationships. If they return the RelationshipReferences but don't consider the Person modified when a Relationship involving the Person is created or deleted, then that destroys the caching model. Clients will have no idea what "not modified" means.

You're assuming the git model has a REST interface, but the git-like interface is much different from a REST interface. You don't just "get" a person like you would in a REST interface. You "get" a repository of data.

Yes, but I assume we are still talking about resources that have REST endpoints (as well as possibly GIT-like service endpoints)? For example, they want this GIT-style API to work for CP data (soon to be called CT). CP data is still hosted online, and still has REST endpoints for Persons. Do those REST endpoints have RelationshipReferences? When a RelationshipReference is added to a Person, is the Person "modified"? (E.g., if I GET it with a conditional GET, does it come back?) If the answer to these questions is "yes" then I don't see how that is compatible with the GIT-like service that is coexisting on the same data.

I believe this has to be spelled out for proper interoperability.

Agreed. Spelled out in the definition of the watch endpoint.

Watch endpoint? I'm talking about basic REST behavior: versioning, modified date, caching, conditional GET, stuff like that.

@carpentermp
Author

In order to revive this thread, and to illustrate unresolved issues with regard to Relationships in the Conclusion model, I thought it would be helpful to contrast two approaches at opposite sides of the spectrum. As I do this, the reader will note that positions in the middle between the two approaches are possible.

I will characterize the two approaches thus:

  • Relationships as independent entities (RAIE).
  • Relationships defined by the people involved and the type (RDBPAT).

Also please be careful to note, as you go along, the following distinctions:

  • Person (an object in the model) vs. person (someone who lived on the earth)
  • Relationship (an object in the model) vs. relationship (a "real" relationship between "real people")

Definition of "independent entity"

For clarity, let me begin with a definition. I'm defining "independent entity" as an object where the identity of the object is independent of the object's attributes. Our usual approach with independent entities is to let the system assign a unique identifier. So what is the distinction between "independent entity" and "plain old entity"? Domain Driven Design by Eric Evans has a chapter on entity modeling. He defines "entity" this way:

Some objects are not defined primarily by their attributes. They represent a thread of identity that runs through time and often across distinct representations. Sometimes such an object must be matched with another object even though attributes differ. An object must be distinguished from other objects even though they might have the same attributes. Mistaken identity can lead to data corruption (p91).

In this definition Eric seems to leave open the door for some entities to be "partly" defined by their attributes. In that case, a Relationship that is defined by the people involved and the relationship type (position 2) might be called an "entity", but would still not be considered an "independent entity" as I have defined it.

Persons as independent entities

In the Conclusion profile of GedcomX, a Person is meant to represent a real person who lived on the earth. Real people have identity independent of their name, or anything else we know about them. Regardless of what we come to believe about the person (as represented by the Person), the identity of the person (little p) remains constant. Two systems may each have a representation of the Person (with different attributes) and they may want to match them up (perhaps to let a user do a comparison, or perhaps for synchronization). For these reasons (and several more that could be given), Person makes an excellent "independent entity". (Of course, a Person's identity can always be "hijacked", meaning that the data belonging to one real person can be replaced with data belonging to another real person. I'll get into how hijacking relates to relationships a little later on.)

Now a few words about how Persons behave in the system will be useful in preparation for considering the two approaches to Relationships.

Person
identified by:
  • a system-assigned unique identifier
meant to represent:
  • a real flesh-and-blood person
has:
  • names, gender, facts
  • sources
  • change history
participates in:
  • Relationships. (In RDBPAT, it might be said that a Person has Relationships, but it will take some explaining to justify this, so this is sufficiently clear for the moment.)
merge:
  • When users discover that two Persons represent the same "real person", they "merge" them. It is tempting to think of merge as a "wizard" that assists the user to add all the stuff from Person A to Person B, then deletes Person A. If this were all we did, then references to Person A would break. Because of this, we also add an entry into a forward-id table that forwards all Person A references to Person B, and we logically merge the two change histories. This effectively merges the two identities.
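The merge behavior just described can be sketched like this. It is a toy in-memory store written for illustration. The class and method names are hypothetical, and a real system would also need to handle conflicts between the two sets of data.

```python
class PersonStore:
    def __init__(self):
        self.persons = {}  # id -> list of facts
        self.history = {}  # id -> list of change entries
        self.forward = {}  # merged-away id -> surviving id

    def add(self, pid, facts):
        self.persons[pid] = list(facts)
        self.history[pid] = [f"created {pid}"]

    def resolve(self, pid):
        """Follow forwarding entries until a live id is reached."""
        while pid in self.forward:
            pid = self.forward[pid]
        return pid

    def merge(self, loser, survivor):
        loser, survivor = self.resolve(loser), self.resolve(survivor)
        if loser == survivor:
            return
        # Move the data and logically merge the two change histories.
        self.persons[survivor].extend(self.persons.pop(loser))
        self.history[survivor].extend(self.history.pop(loser))
        self.history[survivor].append(f"merged {loser} into {survivor}")
        # Keep old references to the loser working.
        self.forward[loser] = survivor

    def get(self, pid):
        return self.persons[self.resolve(pid)]


store = PersonStore()
store.add("A", ["born 1800"])
store.add("B", ["died 1870"])
store.merge("A", "B")

facts_via_old_id = store.get("A")  # the reference to Person A still resolves
```

The forwarding entry is what turns "copy then delete" into a real merge of identities: anything that saved a reference to Person A keeps working.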

Two Different Approaches to Relationship

Now let's contrast the two approaches to Relationship.

Relationship as independent entity (RAIE) vs. Relationship defined by persons and type (RDBPAT)
identified by:
  • RAIE: a system-assigned unique identifier
  • RDBPAT: the two Persons and type (none of which may be null)
meant to represent:
  • RAIE: A "chapter" in an actual relationship between two real people. Chapters begin with an event that caused the chapter to commence, and end with an event that caused it to cease (e.g. marriage, divorce).
  • RDBPAT: An actual relationship between two real people, spanning all the times, places, and events pertinent to that relationship
has:
  • RAIE: a "left-side" Person (may be null); a "right-side" Person (may be null); a type; facts; sources; change history
  • RDBPAT: facts; sources; change history. (The left and right-side Persons, and the type, form the identifier, so they are not listed here.)
mutability of Person participation:
  • RAIE: "left-side" and "right-side" Persons are changeable. The Persons involved in the Relationship, as well as the "type", are "conclusionary". Like Facts, they have a "Contributor", a "reason", etc.
  • RDBPAT: "left-side" and "right-side" Persons and "type" are immutable. To change them would be to change the identity of the Relationship.
Relationship uniqueness constraint:
  • RAIE: No "uniqueness constraint". More than one Relationship between the same two people, of the same type, may be created.
  • RDBPAT: The Relationship uniqueness constraint is axiomatic in this model because of the way Relationships are identified.
merging:
  • RAIE: Relationships are "USER merged." As with Person, when users discover that two Relationships represent the same "real relationship", they merge them. When this happens, an entry is placed in an id-forwarding table and the two change histories are logically merged.
  • RDBPAT: Relationships are "SYSTEM merged." Users are never asked to merge Relationships. Relationships are automatically merged during a Person merge when the merge result would create Relationships that effectively violate the uniqueness constraint.
Source attachment:
  • RAIE: Users explicitly attach sources to Relationships, as they do with Persons.
  • RDBPAT: Users generally needn't explicitly attach sources to Relationships. Relationship sources will be attached automatically by the system when related "source Personas" are attached to related Conclusion Persons.
Person "awareness" of participation in Relationships:
  • RAIE: Persons are NOT "aware" of Relationships they participate in. The "timestamp" or "version" of the Person is not affected by the Person's being named in a Relationship. When a Person is removed from a Relationship the Person is similarly unaffected--only the Relationship is affected.
  • RDBPAT: Persons ARE "aware" of Relationships they participate in. When a Relationship is created naming a Person, the named Person is logically modified, and a "change" is recorded in the change history. Similarly when a Relationship is deleted.

From this table it is easy to see that there are a significant number of fundamental differences between the two approaches. Let's consider each of these differences:

Discussion on mutability of Person participation

hijacking and genealogical soundness

A genealogically sound system provides a clear, unambiguous way for users to make conclusions about things of genealogical significance. The system tracks what they say, who says so, when they said so, why they believe it, and where they got their information. We have already mentioned that the "identity" of a Person may be "hijacked" by replacing the data belonging to one "real person" with data belonging to another "real person". Person hijacking tends to be fairly rare, but is pernicious when it occurs. Why? Primarily, it is because it modifies the meaning of conclusions, making it appear that contributors believed something that they never believed.

Let's consider an example that illustrates this. Suppose, through my research, I determine that Bob, son of Joe and Sue, was born in 1800 and I conclude this by putting a "Birth Fact" on Bob. Suppose someone else hijacks Bob and uses him to represent Tim, Bob's little brother, who was born in 1802. (Suppose he does this by noticing that Tim has Joe and Sue as parents and so he changes the name from "Bob" to "Tim" and voila! job done.) Tim, (from when he was Bob) still has a Birth Fact where I state that Tim was born in 1800! Of course, I never believed this, but there it is in black and white that I do. This example illustrates that, when a Person is hijacked, everything leftover from before the hijacking now no longer applies and instead, lies. This includes Names, Gender, Facts, Sources, Relationships, everything. Some of it might still be true, but none of it still says what it was intended to say. (As in the example, Joe and Sue were Tim's parents, but when they were created they stated that they were Bob's parents.)

Fortunately, as I said, Person hijacking is rare, and now that we will be tracking the what, who, when, why, and where of every conclusion (the 5 W's), it is likely to become even rarer. In the past, Persons would gradually accumulate information from more than one real person (because of a lack of the 5 W's, it has often been hard to accurately determine which real person the Person is meant to represent) until the "real person" eventually migrated from one person to someone else.

So how does all this relate to "mutability of Person participation" in Relationships? When you allow users to change who participates in a Relationship, you have done nothing more than give them a way to hijack the Relationship! Another example illustrates this. Suppose there exists a "couple" Relationship between Joe and Sue. From my research, I determine that Joe and Sue were married in 1799 and I add a "Marriage Fact" to the existing Relationship. Suppose someone else decides that it wasn't Joe and Sue that were a couple, but Joe and Ann, so he changes the "woman" in the couple Relationship to point to Ann. When I come back, I see that Joe and Ann were married in 1799--and I'm the one who says so!

Relationship as "conclusion"

The existence of a Relationship is, in and of itself, a genealogical conclusion. If I say "Joe and Sue were a couple", that carries genealogical information. I don't need to know when (or if) they were married for the information to be worthy of capturing the 5 W's. Similarly with parent-child relationships. If I say, "Bob was Joe's son," that's a significant genealogical statement even without knowing for sure if the relationship was biological or adoptive.

What of "Relationship Facts"? How do they relate to the "Relationship conclusion"? When I put a "Marriage Fact" on a Relationship I am "fleshing out" the original conclusion e.g.

  • Original conclusion: Joe and Sue were a couple.
  • Marriage conclusion: Joe was married to Sue on 1 Jan 1799 in Liverpool, England (ergo, Joe and Sue were a couple).

From this we can see that the marriage conclusion is subordinate to the original conclusion. If the original conclusion (Joe and Sue were a couple) is no longer believed, it is impossible to continue to believe the marriage conclusion (Joe was married to Sue...). Thus, if you change the original conclusion, you destroy all subordinate conclusions. (The astute reader will notice that this is just another way of describing "Relationship hijacking.")

Because of the problems cited with mutable Person participation, immutability is the only genealogically sound option. Relationships should be created specifying the Persons involved and the type. These three pieces of information form the original conclusion of the Relationship. Relationships may be fleshed out with subordinate conclusions that are in agreement with the original conclusion. When the original conclusion is no longer believed, the only genealogically sound option is to delete the Relationship and create a new one according to what is now believed to be true.
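As a small sketch of what immutability buys us, consider the Joe/Sue/Ann example with the participants and type frozen. The `RelationshipKey` class and the `facts` dictionary are illustrative assumptions, not part of the model.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationshipKey:
    person1: str
    person2: str
    rel_type: str


# RelationshipKey -> list of (contributor, statement) subordinate conclusions
facts = {}

joe_sue = RelationshipKey("Joe", "Sue", "Couple")
facts[joe_sue] = [("me", "Married 1799")]

# Attempting to repoint the Relationship at Ann is refused outright.
try:
    joe_sue.person2 = "Ann"
    hijacked = True
except dataclasses.FrozenInstanceError:
    hijacked = False

# The sound alternative: delete the old Relationship (its subordinate
# conclusions go with it) and create a new one with no inherited facts.
archived = facts.pop(joe_sue)
joe_ann = RelationshipKey("Joe", "Ann", "Couple")
facts[joe_ann] = []
```

My 1799 marriage conclusion stays attached to the statement "Joe and Sue were a couple" and never shows up under Joe and Ann.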

Discussion on the Relationship uniqueness constraint (RUC)

The uniqueness constraint on Relationships goes hand-in-hand with the definition of what Relationships are meant to represent. Without the RUC you are forced to define each relationship as one of possibly many "chapters" in the actual human relationship. As a concept, this is sort of defensible with couple relationships where people can get married, divorced, and remarried. It's a real stretch, however, for parent-child relationships. Also, the concept seems to serve no purpose in the model. All that can be understood from multiple "relationship chapters" can be understood equally well by multiple marriage and divorce events on a single Relationship object.

Furthermore, the concept brings additional complexity for the system and for users:

  • The system has to devise a way of showing multiple "Relationship chapters" to users that doesn't confuse them e.g. "Did this couple really have two children named John, or is this just two relationships to the same John child? If the latter, why is the system broken?"
  • Instead of automatic Relationship merging by the system, users are now forced to decide whether or not to merge Relationships, e.g. "Are these two couple relationships really 'two chapters', or is this just a discrepancy between when/where people think they were married?" Gag.

I really can see no value to the "relationship chapter" definition of Relationship and a huge downside. Truly, the Relationship uniqueness constraint is our friend.
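To illustrate the single-Relationship alternative, here is a minimal sketch in which the persons-and-type tuple is the dictionary key, so uniqueness falls out of the data structure. The 1805 divorce and 1810 remarriage are made-up dates for the example.

```python
from collections import defaultdict

# (person1, person2, type) -> list of (year, event).
# Using the tuple as the key makes the uniqueness constraint automatic.
relationships = defaultdict(list)


def add_event(person1, person2, rel_type, year, event):
    relationships[(person1, person2, rel_type)].append((year, event))


# Marriage, divorce, and remarriage are events on ONE Relationship,
# not three separate "chapters".
add_event("Joe", "Sue", "Couple", 1799, "Marriage")
add_event("Joe", "Sue", "Couple", 1805, "Divorce")
add_event("Joe", "Sue", "Couple", 1810, "Marriage")

timeline = sorted(relationships[("Joe", "Sue", "Couple")])
```

Everything a set of "chapters" could express is still readable from the timeline, and there is nothing for a user to merge.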

Discussion on system assigned unique identifiers vs. identifying by persons and type.

A question that might be asked is, "Why is it better to identify Relationships by the Persons involved and the type instead of an arbitrary system-assigned unique identifier?" The simple answer is, "It's the longest-lived, most robust identifier that can be used." An example will illustrate this.

Suppose I add a "marriage Fact" to the Couple Relationship between Joe and Sue. After a week I come back and see that the marriage Fact I added is not there. I check the change history and there is no record of the marriage Fact I added. I think, "the system is broken." What happened? Well, someone came along and decided that Joe and Sue were not a couple after all so they deleted the Relationship (the one that had my marriage Fact on it). Someone else came along and decided that, yes, they were a couple after all, so they re-created the Relationship--only this is a "new never seen before" Relationship, with a new change history.

This is one of the fundamental problems with "Relationships as independent entities." Without the Relationship uniqueness constraint, there is no way around this problem. Even with RUC, the problem can happen as I described. However, it must be admitted that, with RUC, there is a way around the problem. Whenever a Relationship is created, the system checks for the existence of a "tombstoned" Relationship and "resurrects" the Relationship, when found (thus preserving the change history). Once you are doing this, however, you are just using the system-assigned identifier as a pseudonym for the RDBPAT identifier. You've accepted that Relationships are defined by the Persons involved and the type. Unfortunately, you haven't solved all the problems that come with the system-assigned identifier. Here are a couple more:

  • Since external systems will be saving references to Relationships with system-assigned unique identifiers, when Relationships merge you are forced to have a forward-id table for Relationships as well as for Persons. With RDBPAT, no such table is necessary--merged Relationships just work.
  • When synchronizing data in 2 systems, each system needs a way of making a correspondence between the Relationship identifier in one system with the Relationship identifier in the other system. For Persons, this correspondence is done via the "AlternateId" table, and in RAIE the same is necessary for Relationships. In RDBPAT, no such table is necessary--correspondence is automatic once the Person correspondence is made.

The net effect of these two differences is that in RDBPAT, Relationships are much simpler to deal with.
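To make the Joe and Sue example concrete, here is a minimal sketch of a store that identifies a Relationship by its type and the two Persons, and resurrects a tombstoned Relationship on re-creation so the change history survives. The names (`RelationshipStore`, `relationship_key`) and the assumption that only "couple" is a symmetric type are hypothetical and not part of GedcomX.

```python
from dataclasses import dataclass, field


def relationship_key(rel_type, person_a, person_b):
    """Identify a Relationship by its type and the two Persons involved.

    Assumption: "couple" is symmetric, so the person order is normalized.
    For directed types such as "parent-child" the order is preserved.
    """
    if rel_type == "couple":
        person_a, person_b = sorted((person_a, person_b))
    return (rel_type, person_a, person_b)


@dataclass
class Relationship:
    key: tuple
    facts: list = field(default_factory=list)
    history: list = field(default_factory=list)  # survives delete and re-create
    deleted: bool = False  # tombstone flag


class RelationshipStore:
    """Hypothetical in-memory store enforcing the uniqueness constraint."""

    def __init__(self):
        self._by_key = {}

    def create(self, rel_type, person_a, person_b, who):
        key = relationship_key(rel_type, person_a, person_b)
        rel = self._by_key.get(key)
        if rel is None:
            rel = Relationship(key)
            self._by_key[key] = rel
            rel.history.append((who, "created"))
        elif rel.deleted:
            # Resurrect the tombstone instead of creating a "never seen" entity.
            rel.deleted = False
            rel.history.append((who, "resurrected"))
        return rel

    def delete(self, rel_type, person_a, person_b, who):
        rel = self._by_key[relationship_key(rel_type, person_a, person_b)]
        rel.deleted = True
        rel.history.append((who, "deleted"))

    def add_fact(self, rel, fact, who):
        rel.facts.append(fact)
        rel.history.append((who, "added fact: " + fact))


store = RelationshipStore()
mine = store.create("couple", "joe", "sue", who="me")
store.add_fact(mine, "marriage", who="me")
store.delete("couple", "sue", "joe", who="someone")
again = store.create("couple", "joe", "sue", who="someone else")
print(again is mine, again.history)
```

Because the key is derived from the Persons and the type, no separate identifier has to be forwarded or mapped; whether facts on a tombstoned Relationship should also be restored is a policy choice this sketch does not settle.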

Discussion on Relationship sources

Another area of difference between the RAIE and RDBPAT models is in attachment of sources. In RAIE, sources are attached explicitly by users; in RDBPAT, attachment is done automatically by the system. To ask users to attach sources to Relationships is to needlessly complicate their lives.

To illustrate my meaning, let me give an example. Suppose I find a birth record that shows that Joe and Bob are father and son. Suppose I also find Joe and Bob in my conclusion tree and decide to attach the sources. I attach source-Bob to conclusion-Bob and source-Joe to conclusion-Joe. Having done this, the system is able to infer that source-parent-child-relationship supports the conclusion-parent-child-relationship. It was not necessary to ask the user to do this.

In fact, asking the user to do it will probably create chaos. Suppose I attach the source-relationship to a different conclusion-relationship--what does that even mean if the source Persons don't agree? When they don't agree, it's just a mess--and one that we will have to ask our users to clean up. When they do agree, it's redundant, and so pointless.
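The inference described above can be sketched as follows. This is only an illustration under assumed data shapes: Relationships are plain `(type, person1, person2)` tuples, and `attachments` maps a source Person id to the conclusion Person id it was attached to. None of these names come from the model.

```python
def supported_relationships(source_relationships, attachments, conclusion_relationships):
    """Pair each source Relationship with the conclusion Relationship it supports.

    source_relationships: list of (type, source_person1, source_person2)
    attachments: dict of source person id -> conclusion person id
    conclusion_relationships: set of (type, person1, person2)
    """
    pairs = []
    for rel_type, src_a, src_b in source_relationships:
        # Both ends must be attached before anything can be inferred.
        if src_a not in attachments or src_b not in attachments:
            continue
        candidate = (rel_type, attachments[src_a], attachments[src_b])
        if candidate in conclusion_relationships:
            pairs.append(((rel_type, src_a, src_b), candidate))
    return pairs


birth_record = [("parent-child", "source-Joe", "source-Bob")]
tree = {("parent-child", "conclusion-Joe", "conclusion-Bob")}
attachments = {"source-Joe": "conclusion-Joe", "source-Bob": "conclusion-Bob"}
print(supported_relationships(birth_record, attachments, tree))
```

With only one of the two Persons attached the function returns nothing, which matches the point that the user never has to attach the Relationship itself.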

Discussion on Person "awareness" of participation in Relationships

From the outset it must be noted that, where up to now most of the discussion has been primarily GedcomX model focused, this is primarily a GedcomX webservice discussion.

To begin this discussion I will describe a use case that I feel the web service should support without difficulty: suppose a client wishes to download a portion of the tree (set of Persons) and thereafter "stay in sync" with changes made to that portion. To support this, the webservice will, at minimum, need to be able to answer these two questions:

  1. Tell me everything you know about a given person
  2. Tell me what's changed about a given person since last I asked

Now one question that must be answered is, "when fetching a Person, what is returned?"

Consider the following in the abstract: for any given person (little p), if I say, "tell me what you know about that person" you would certainly tell me the person's name, gender, birth date, and such. If the person had been married, would you tell me when and where and to whom, if it was known? Of course you would. Obituaries are a great example of this. They are all about a single person--recently deceased--and are filled with relationship information. Suppose yesterday I asked you, "tell me what you know about Bob" and you told me. Suppose later that day you discovered that Bob had another child that you didn't tell me about when I asked. Suppose I ask you today, "has anything you know about Bob changed since I asked you yesterday?" Would you tell me about the new child? Of course.

So, with respect to question 1 above, "everything you know about a given person" includes the Relationships he participates in, and the Relationship Facts of any of those Relationships.

With this as a backdrop we are ready to define "Person awareness of participation in Relationships"; then we can consider what "awareness" means for our synchronization use case. To say that a Person is aware of involvement in Relationships is to say that, whenever a Person is added to, or removed from, a Relationship:

  • the Person is logically modified
  • a change is recorded in the change history

There is an additional level of awareness where the Person is modified, and a change is recorded, whenever a change is made to the contents of any Relationship the Person participates in e.g. the Person is logically modified when a Marriage Fact is added to a couple Relationship he participates in.
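A minimal sketch of the two levels of awareness, assuming a hypothetical `Person` with a version counter and a change list (neither is defined by the model). First-level awareness records a change when the Person joins a Relationship; second-level awareness also records a change when a Fact is added to that Relationship.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    person_id: str
    version: int = 0
    changes: list = field(default_factory=list)

    def record(self, description):
        # The Person is "logically modified" and the change is recorded.
        self.version += 1
        self.changes.append((self.version, description))


def add_relationship(rel_type, person1, person2):
    """First level: both participants are modified when the Relationship is added."""
    for person in (person1, person2):
        person.record("added to " + rel_type + " relationship")
    return {"type": rel_type, "persons": (person1, person2), "facts": []}


def add_relationship_fact(relationship, fact, second_level=True):
    """Second level: a Fact change on the Relationship also modifies both Persons."""
    relationship["facts"].append(fact)
    if second_level:
        for person in relationship["persons"]:
            person.record(fact + " fact added to " + relationship["type"] + " relationship")


bob = Person("bob")
sue = Person("sue")
couple = add_relationship("couple", bob, sue)
add_relationship_fact(couple, "Marriage")
print(bob.changes)
```

Passing `second_level=False` models a system that stops at first-level awareness: the Fact is stored but neither Person's change history moves.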

From our table on the two approaches to Relationship above, we remember that RAIE has neither type of "awareness", and RDBPAT has both. Now let's consider the implications of this for our synchronization use case and the 2 questions that the webservice has to be able to answer:

1. Tell me everything you know about a given person:

Relationship as independent entity (RAIE):
HTTP GET on Person link.
Since, in this model, Persons are unaware of Relationships, when fetching a Person, only the Person is returned. Still, there must be a way to discover what Relationships a Person participates in, so the Person also returns a link to how to ask the system the question, "what Relationships does this Person participate in."
HTTP GET on Relationships query link
returns a list of Relationships that the Person participates in
for (every Relationship returned)
  HTTP GET on Relationship
end for
These fetches would really be done in a multithreaded fashion.

Relationship defined by persons and type (RDBPAT):
HTTP GET on Person link.
When fetching a Person, Relationship and Relationship Fact information are part of the payload. The effect of this is that, to clients of the webservice, Relationships, and Relationship Facts behave as though "part of" the Person. Of course, they are shared between two Persons, so they behave as though logically part of *both* Persons.

2. Tell me what's changed about a given person since last I asked:

Relationship as independent entity (RAIE):
HTTP GET on Person-changes link w/ "since" parameter.
Since, in this model, Persons are unaware of Relationships, fetching Person changes isn't enough. We still need to fetch all the changes to all the Relationships.
HTTP GET on Relationships query link
returns a list of Relationships that the Person participates in
for (every Relationship returned)
  HTTP GET on Relationship changes link
end for
At first blush it might seem that this will do the job, but there is something missing: any Relationships that have been deleted won't be in the list. Supposing we solve this, we still have the problem that the Relationship is *deleted*--there isn't any change history where a "relationship deleted" change is stored. This is a fundamental problem with not having "Relationship awareness" in Persons.

Relationship defined by persons and type (RDBPAT):
HTTP GET on Person-changes link w/ "since" parameter.
The set of changes since the "since" information are returned. These changes include any Relationships added or removed, or any changes to any Relationships, such as the creation of a new Marriage Fact on a couple Relationship.

Person awareness of participation in Relationships would seem to be an important requirement. Not only does it save clients a great deal of trouble, but having a "deleted Relationship" change in the change history gives the client a handle to deleted Relationships so the delete can be inspected and/or undone. Without such a change entry, that functionality is very awkward, if not impossible.

As for 2nd-level awareness (where change entries are put into the Person change history for Relationship Fact changes) it's probably not as hard a requirement as basic Relationship awareness, but it sure simplifies the client's life. Instead of fetching a Person and then doing a multithreaded fetch of all the Relationships he participates in, a single Person fetch suffices. Also, when asking for "changes since", in RDBPAT a single Person-change fetch suffices.
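The difference for the "changes since" question can be sketched with made-up data. Everything here is hypothetical: the histories, the timestamps, and the function names. The assumed scenario is that Bob's couple Relationship with Sue was deleted at time 5, so it no longer appears among the live Relationships.

```python
# Hypothetical data: only the parent-child Relationship is still live.
PERSON_ONLY_HISTORY = [{"at": 1, "what": "name changed"}]
LIVE_RELATIONSHIPS = {
    ("parent-child", "joe", "bob"): [{"at": 2, "what": "relationship created"}],
}
AWARE_PERSON_HISTORY = [
    {"at": 1, "what": "name changed"},
    {"at": 2, "what": "added to parent-child relationship with joe"},
    {"at": 3, "what": "added to couple relationship with sue"},
    {"at": 5, "what": "removed from couple relationship with sue"},
]


def changes_since(history, since):
    return [change for change in history if change["at"] > since]


def raie_sync(person_id, since):
    """Person-changes GET, Relationships query GET, then one GET per live Relationship."""
    requests = 2
    changes = changes_since(PERSON_ONLY_HISTORY, since)
    for (_rel_type, person1, person2), history in LIVE_RELATIONSHIPS.items():
        if person_id in (person1, person2):
            requests += 1
            changes += changes_since(history, since)
    return changes, requests


def rdbpat_sync(since):
    """A single Person-changes GET whose history already covers Relationships."""
    return changes_since(AWARE_PERSON_HISTORY, since), 1


print(raie_sync("bob", since=4))
print(rdbpat_sync(since=4))
```

In this sketch the RAIE-style client makes three requests and still never learns that the couple Relationship was removed, while the RDBPAT-style client sees the removal in one request.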

Discussion of "implementation-specific"

Another idea that has been proffered is that many of these issues can be left as details for each system implementer to choose. I have been operating under the assumption that we want to define GedcomX such that two systems, built by different parties, could potentially synchronize changes with each other in both directions. Interoperability of this kind is not possible without these aspects of the model being part of the specification. For this kind of interoperability, it is necessary to specify all the ways that the data may be modified, and what the results of those operations are expected to be. Also, to be genealogically sound, each change has to record the 5 W's.

For example, it is not possible to leave the question of mutable Person participation as implementation-specific. If these changes are to be allowed, then both systems have to support them in order to synchronize, and there has to be a place in the model to record the 5 W's for these changes.

As another example, consider RUC. As we have already explained, if one system enforces the RUC and another does not, the two systems actually have a different model for Relationships. Or stated another way, Relationship means something different in each system. With such different Relationship models, it is impossible to round-trip Relationships between the two systems.

Conclusion

I have been uneasy for a long time that there are issues with the Relationship model that have needed addressing. As a catalyst to addressing them, I have described two very different models and I have tried to describe the consequences of these differences to users and to system implementers. I hope this will foster the kind of dialogue that will ultimately result in a resolution of the issues.

@stoicflame
Member

Wow. That's an impressive essay. Thanks for all that work.

I see some holes, though. It's going to take me some time to sift through and pull together a worthy response.

I think it would be helpful to summarize all the changes that would be applied to the model if your argument were wholly accepted without dispute. I think it looks like this (correct me if I'm wrong):

  • Relationship would not extend GenealogicalEntity.
  • Person would contain a list of Relationships.
  • Relationship would just refer to one Person since the "source" Person is identified by the context of the Relationship.

Anything else?

@carpentermp
Author

Thanks for reading it!

I was deliberately vague about specific model changes because I wanted to get conceptual understanding first, and I figured that if I started with the changes implied by my arguments, that might produce an emotional response in some people that would cause them to reject the arguments without first hearing them out.

Your summary of the model changes I would propose is pretty accurate, but it should be noted that I made arguments in favor of several semi-independent things, and some of them don't require model changes, per se. Here are the main points and the model changes that would be inferred by them:

  • immutable Person participation in Relationships. I don't think a model change is required for this--the model seems to imply this already since there is a single Attribution for the Relationship, and not separate Attributions for "person1" and "person2". I stressed the issue of immutability because I have heard people espouse allowing changes to the people involved in Relationships, and I would like a clear statement that the model does not allow this.
  • Relationship uniqueness constraint. It is possible to enforce the RUC with the model as it exists today, but the constraint is not implied by the existing model. If no model change is to be made, then I would argue for plain and prominent documentation that explains RUC and makes it a MUST to implementers who want to achieve full compatibility. Of course, if the model changes you described above are made, no such documentation is necessary since the model dictates it.
  • Person awareness of participation in Relationships. This is the topic that implies the most model changes. From my previous post, you'll remember that I wrote of 2 "levels" of awareness. In the first level, Persons are aware only of the Relationships they participate in, but not aware of changes to those Relationships. This would imply that Person would contain a list of Relationship "references" (of type ResourceReference?). The 2nd level of awareness (which you correctly understood to be what I favor) implies the changes you listed:
    • Relationship would extend Conclusion or GenealogicalResource instead of GenealogicalEntity.
    • Person would contain a list of Relationships.
    • Relationship would just refer to one Person since the "source" Person is identified by the context of the Relationship (and there must be a way of distinguishing parent-child Relationships where the Person is a Parent from those where he is a Child).

The last main point had to do with "system-attached" vs. "user-attached" Relationship sources. As I considered the model changes implied by my arguments on this topic I realized that the light treatment I gave the topic will really not suffice to explain the changes I would like to see. I would like to correct that now with a general discussion of source-linking.

Discussion on linking to sources

Let's start with the list of source links on Person. With any model it is important to be able to talk about the meaning of each of the model's parts. What, then, does it mean when a Person links to a source? My answer would depend slightly on what is being linked to. I would characterize the meaning this way:

  • Person linked to generic web resource (image, web page, etc.): the web resource has something to do with this Person (but the system doesn't know what).
  • Person linked to a Persona: the Person and Persona represent the same "real person".
  • Person linked to Person (in another tree): both Persons represent the same "real person".

The last two meanings in this list are the essence of "source-centric" genealogy that @ranbo has been promoting lo these many years.

Now I noticed that, in the model today, "Conclusion" also has a list of sources, so we have to answer: what does it mean when a Fact, for example, links to a source? I suppose:

  • Fact linked to generic web resource: the web resource has something to do with this Fact.

Is it ever the case that a source has "something to do with" a Fact without also having "something to do with" the Person where the Fact resides? No, that makes no sense: anything having to do with Bob's birth fact, by definition, must also have something to do with Bob. Thus, it makes no sense for a Conclusion to list a source that is not also listed in the Person. The current model, however, with its independent lists of sources on Person and on Conclusion, would seem to be perfectly comfortable with this logical contradiction. So this is the first thing about the source model that I would like to refine.

Secondly, I view attaching a source as a "genealogical statement" akin to a "Conclusion" in the model. If I say, "this resource has something to do with Bob," or if I say, "the person mentioned in this source is the same real person as the real person represented by my tree Person Bob", I am saying something of genealogical significance, something that could be disagreed with, and something that could take some justification. Thus, it is important to capture the 5 W's of that statement. The current model's simple list of ResourceReferences gives me no way to do this.

Thirdly, you may have noticed a lack of symmetry in my definitions of what it means to attach sources-to-Persons vs. what it means to attach sources-to-Facts (or other Conclusions). For Persons, we attached special significance to attaching Personas and other-tree-Persons. Is there not an analogue with Facts? Would it not be logical to give the same kind of status to attaching source-Facts and other-tree-Facts to Facts? Perhaps I want to say, "the Birth Fact in this source represents the same real-life-birth-event as the birth Fact in this Person?" Is that not logical? Yes, but it turns out that by attaching Person-Bob to Persona-Bob you have already made this statement. If Person-Bob and Persona-Bob represent the same real person, then any birth Facts on either object must be talking about the same real-life-birth-event--Bob's birth.

Thus, we see that, when attaching Personas or other-tree-Persons to Persons, the system is able to understand everything the source believes to be true about the real person, and this information can easily be compared with anything the tree-Person says about the real person. For the most part, it is not necessary or helpful for users to explicitly attach source-Facts to conclusion-Facts (or source-Relationships to conclusion-Relationships, which was my original point in my earlier post).

Having said this, however, it must be noted that we still need a kind of "genealogical statement" that we have not yet provided for. It is often the case that, while we believe the Persona and Person represent the same real person, we disagree with a piece of information in the source. For example, it's fairly common for a death record to give a date of birth and occasionally the birth date is wrong. We may have other information about what the correct birth day is, and so we disbelieve what the death record says about the birth date. In order to avoid countless re-evaluations of the discrepancy between the conclusion-birth-Fact and the source-birth-Fact (implied by attaching the Persona to the Person) it will be important to provide a way for users to acknowledge such discrepancies when making conclusions so that they can be "put to rest."
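A sketch of what "putting a discrepancy to rest" could look like, assuming Facts are reduced to a simple mapping of fact type to value and that acknowledgements are stored as `(fact type, source value)` pairs. These shapes are assumptions for illustration only.

```python
def open_discrepancies(person_facts, persona_facts, acknowledged):
    """List source values that disagree with the conclusion and are not yet acknowledged.

    person_facts / persona_facts: dict of fact type -> value
    acknowledged: set of (fact type, persona value) the user has already reviewed
    """
    result = []
    for fact_type, persona_value in sorted(persona_facts.items()):
        person_value = person_facts.get(fact_type)
        if person_value is None or person_value == persona_value:
            continue  # nothing to compare, or the source agrees
        if (fact_type, persona_value) in acknowledged:
            continue  # already reviewed and put to rest
        result.append((fact_type, person_value, persona_value))
    return result


person_bob = {"birth": "1930-01-01", "death": "2001-05-05"}
death_record_bob = {"birth": "1931-01-01", "death": "2001-05-05"}

print(open_discrepancies(person_bob, death_record_bob, acknowledged=set()))
print(open_discrepancies(person_bob, death_record_bob, acknowledged={("birth", "1931-01-01")}))
```

The comparison itself falls out of attaching the Persona to the Person; only the acknowledgement needs to be stored as a new kind of statement.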

So this brings us at last to how I would propose changing the model for source-linking. To solve my second concern (Relationship as conclusion), I would:

  • create a new class "SourceReference" that extends GenealogicalResource and includes a ResourceReference. (SourceReferences may be thought of as ResourceReferences that are being used as the source of something, and so need the 5 W's, aka Attribution)
  • Person, instead of having a list of ResourceReferences, has a list of SourceReferences.
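As a rough sketch of the two bullets above, the classes could look like this. The class names follow the discussion, but the field names on `Attribution` (`contributor`, `modified`, `reason`) and the example URI are placeholders chosen here, not taken from the model.

```python
from dataclasses import dataclass, field


@dataclass
class Attribution:
    # Placeholder fields standing in for the "5 W's".
    contributor: str
    modified: str
    reason: str


@dataclass
class ResourceReference:
    resource: str  # URI of the thing being referenced


@dataclass
class GenealogicalResource:
    attribution: Attribution


@dataclass
class SourceReference(GenealogicalResource):
    # A ResourceReference used as a source, so it carries an Attribution.
    reference: ResourceReference


@dataclass
class Person:
    person_id: str
    sources: list = field(default_factory=list)  # list of SourceReference


bob = Person("bob")
bob.sources.append(
    SourceReference(
        attribution=Attribution("carpentermp", "2012-01-10", "same real person"),
        reference=ResourceReference("http://example.org/personas/bob"),
    )
)
print(bob.sources[0].attribution.reason)
```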

Then, to solve my first concern (independent lists of sources in Person and Conclusion), I would:

  • remove the list of sources from Conclusion. Every other option I can think of is problematic. We could put on Conclusion a list of references to the SourceReferences attached to the Person, but those references would need the 5 W's and what happens to those references when the SourceReference is removed from the Person? They would need to be deleted and users would have to be somehow made aware of what's happening. It's all very messy and of limited genealogical utility. It seems to me that we ought to avoid these complexities in the first version of the standard while we are learning the vagaries of linking sources to Persons. I believe we can get a great deal of benefit out of linking Persons to Personas (and other-tree-Persons)--benefit that few people have appreciated. After we have done this for a while, it will become clearer what enhancements to the source-linking model are most needed.

I would probably stop here, and leave my third concern (accounting for source data that disagrees with my conclusions) to another version of the standard. However, to give an idea of what I have in mind, let me say that, if I had to try to solve my third concern today, I would:

  • allow negative genealogical statements. The model currently provides statements like, "this person was born in Provo, Utah on January 1, 1930." The simplest way I can think of to account for data that disagrees with what I believe to be true is to let me say that it is not true, and why, e.g. "this person was NOT born on January 1, 1930, because..." I know that this suggestion is not original to me--others have suggested it. I would like to point out, however, that I would include Relationships and SourceReferences in the list of "genealogical statements" that could be negated. It is useful to be able to say, "Bob and Sue were NOT a couple", as well as to say, "that Persona-Bob does NOT represent the same real person as this Person-Bob".
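One possible shape for a negatable statement, offered only as a sketch: a hypothetical `Statement` with a `negated` flag and a reason, usable for Facts, Relationships, and source attachments alike. Nothing here is part of the current model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str          # what the statement is about
    claim: str            # the assertion being made or denied
    negated: bool = False
    reason: str = ""      # the "why" behind the statement


def contradicts(first, second):
    """True when one statement asserts exactly what the other denies."""
    return (
        first.subject == second.subject
        and first.claim == second.claim
        and first.negated != second.negated
    )


asserted = Statement("bob+sue", "couple")
denied = Statement("bob+sue", "couple", negated=True, reason="different Sue")
print(contradicts(asserted, denied))
```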

@ranbo
Contributor

ranbo commented Jan 10, 2012

There are some excellent points in the above proposals. One that I want to throw my entire weight behind is this:

A Relationship should represent everything we know about the relationship of that type between those two people.

This implies the relationship uniqueness constraint, which in turn implies that relationships must be automatically merged when the people on both ends are merged. This is what has been done in new FamilySearch for years, and is one thing that has been working just fine. Yes, merging and relationships are both complex areas, but having multiple relationships of the same type between the same people makes it more complex without adding value.
I also agree that the model may not need to change to support RUC, but the documentation should explain clearly that this is what a Relationship means. The concept of having a separate relationship for each "chapter" (time span), or each event, or each source is a horrible complication for clients and users.

@stoicflame
Member

I continue to seek a big block of time to formulate a point-by-point response to @carpentermp's essay, but I just can't find it. So I'm afraid my response here is going to lack the detail that is deserved by such a great essay. But the essay deserves some timeliness, too, so here we go...

Personal Opinions

  • I think RDBPAT is a Good idea.
  • I think "Relationship as independent entity (RAIE)" is a Bad name for the concept that was described and labeled such. I think RDBPAT implies that relationships are implemented as independent entities even if they're not formally represented as such.
  • I think RUC is a Good idea.
  • I think relationship as a conclusion is a Good idea.
  • I think it's really important for a webservice to provide the functionality described under the "discussion on person awareness".
  • I think the best way to model RDBPAT is with a relationship entity, separate and distinct from the person entity (i.e. just like it is modeled today).

Entity Boundaries != Resource Boundaries

The whole "discussion on person awareness" seems to be based on an erroneous assumption that I think needs to be clarified. Entity boundaries as defined by the model are NOT the same as the boundaries for web service resources.

Your use case about a client needing to stay in sync with changes made to a person simply argues for defining appropriate web service resource boundaries so that you can do that conveniently. Great. I totally agree. For the case that you're most interested in, we'd want to define a "person with relationships" resource that includes a person and all relevant relationships, and we'd make sure that the cache validation (e.g. Last-Modified) was appropriate for that resource. Just because the entity boundaries are defined as they are today doesn't mean that we can't define a "resource" with wider boundaries.
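A minimal sketch of such a "person with relationships" resource, assuming persons and relationships are plain dicts with a timezone-aware `modified` timestamp (an assumption made here, not something the model defines). The point is that the validator for the composite resource is the latest modification time of anything it includes.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def person_with_relationships(person, relationships):
    """Build the composite resource and a Last-Modified header covering all of it."""
    included = [rel for rel in relationships if person["id"] in rel["persons"]]
    last_modified = max([person["modified"]] + [rel["modified"] for rel in included])
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    body = {"person": person, "relationships": included}
    return headers, body


bob = {"id": "bob", "modified": datetime(2011, 5, 31, tzinfo=timezone.utc)}
relationships = [
    {"type": "couple", "persons": ("bob", "sue"),
     "modified": datetime(2012, 1, 10, tzinfo=timezone.utc)},
    {"type": "couple", "persons": ("joe", "ann"),
     "modified": datetime(2012, 6, 1, tzinfo=timezone.utc)},
]
headers, body = person_with_relationships(bob, relationships)
print(headers["Last-Modified"], len(body["relationships"]))
```

This sketch does not cover a deleted relationship, which would need some record of the deletion time to move the validator forward.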

Replies to Some Selected Points

  • Relationship would extend Conclusion or GenealogicalResource instead of GenealogicalEntity.
  • Person would contain a list of Relationships.
  • Relationship would just refer to one Person since the "source" Person is identified by the context of the Relationship (and there must be a way of distinguishing parent-child Relationships where the Person is a Parent from those where he is a Child).

This is totally baffling to me. I can't see how this would be an improvement to the model, even given all your arguments for RDBPAT, which I generally agree with. I see nothing but trouble with this suggestion. We'd have to first write up all the extra rules that say that relationships on the left-side person match the relationships on the right-side person, and then we'd have to write up all the rules for what to do when the relationships on the left-side person don't match the relationships on the right-side person. And in many cases, there's no good way to resolve the differences, such as when a relationship fact exists on one side but not the other (do I remove it? do I add it to the other?). Yuck!
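The kind of rule this would require can be illustrated with a small check, assuming each Person carried its own embedded copy of the Relationship's facts as a plain list (a hypothetical representation):

```python
def mismatched_facts(left_copy, right_copy):
    """Facts present in exactly one of the two embedded copies of a Relationship."""
    return sorted(set(left_copy) ^ set(right_copy))


left = ["Marriage", "Divorce"]   # copy embedded in the left-side person
right = ["Marriage"]             # copy embedded in the right-side person
print(mismatched_facts(left, right))
```

Detecting the mismatch is the easy part; the check cannot say whether "Divorce" should be removed from one side or added to the other.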

We'd be imposing that developers implement relationships as entities in their back-end, but the model would require denormalization. This is just bad modeling practice.

The current model, however, with its independent lists of sources on Person and on Conclusion, would seem to be perfectly comfortable with this logical contradiction. So this is the first thing about the source model that I would like to refine.

Actually, the existing model intends that conclusion source references point to source references on the person. That needs to be more explicitly clarified, though.

Thus, it is important to capture the 5 W's of that statement. The current model's simple list of ResourceReferences gives me no way to do this.

You're absolutely right. There needs to be a SourceReference object that extends ResourceReference that contains all the "5 W's" stuff. Thanks for pointing that out. It was always intended to be there, and I'm pretty sure it was there at one point, but I don't know where it went. Disturbing...

allow negative genealogical statements.

I agree. Let's open up an issue to track that.

Proposed Action Items

  • Open up a new issue that can be tracked for defining the "person with relationships" resource.
  • Enhance the set of GEDCOM X application profiles with a profile that formally defines the RDBPAT concept (including the RUC), and describes the constraints for full compatibility. @carpentermp's essays should be the basis for that definition.
  • Clarify how the source references on the conclusion should really reference sources on the person (we'll probably need to open an issue to track this).
  • Open up an issue to track creating SourceReference again.
  • Open up an issue to track "negative genealogical statements"

@stoicflame
Member

New issues have been spawned as children of this issue at #120, #125, #126, and #127 so their progress can be tracked independently.

Obviously, this issue is still active. We'll use it to track the formal definition of the RDBPAT concept.

@carpentermp
Author

I have been working on this response for some time, and it's not really done, but after yesterday's meeting I thought it would be a good idea to post it, as is, for the benefit of the community since it contains much of what was discussed. I don't expect it will be very influential (it certainly wasn't yesterday), but here goes nothing...

As I consider my efforts to persuade on this issue I realize that some of our differences of opinion may come from different design objectives, or from a different emphasis on them. Perhaps making those design objectives explicit will help.

Discussion on GedcomX web service design objectives

The following were my top design objectives:

  • RESTful. This almost goes without saying, but I mention it here to be explicit. Being truly RESTful brings key advantages, some of which, among others, I list here as additional design objectives.
  • Generality. The GedcomX web service is meant to be implemented by many data providers and consumed by many clients. Any client should be able to successfully consume the web service from any data provider. This implies two kinds of generality:
    • Generality for clients: the GedcomX web service must be able to support a variety of different clients, without knowing ahead-of-time what they all will be. It must avoid, as much as possible, assumptions about specific client implementations. At the same time, it must support well the kinds of things that we expect most clients will need.
    • Generality for data providers: the GedcomX web service should be as general as possible to allow for potentially differing back-end implementations. At the same time, these differing implementations must be compatible. Thus, the web service must fully specify all possible operations and expected behaviors.
  • Simplicity. The greatest power of GedcomX will be achieved only when it becomes a standard in fact, rather than in aspiration. This can only happen if there are many implementers, both clients and data providers. The simpler the web service is, the wider and earlier adoption it is likely to have. In addition, a simpler web service will tend to have fewer interoperability problems (which also inhibit adoption). All of this suggests that simplicity should be a strong design objective. From this, we might adopt, in our modeling, a couple of rules of thumb:
    • two ways of doing something should not be provided when one will do.
    • Occam's razor. When evaluating two solutions, the burden of proof should rest with the more complex of the two.
  • Cacheability. The benefits of caching are well known: scalability, improved speed, reduced server load, etc. Caching is a requirement of REST, but it is easy to create a RESTful API that doesn't cache well, if you are not careful. We have created many so-called RESTful APIs over the years and I am not aware of any where we've done a proper job in this area. To rectify this, cacheability should be a key design objective and the cacheability of each of the resources defined by the web service must be considered. When weighing the caching benefits there are at least three aspects worth considering:
    • costs/savings at the server.
    • load on the cache (whether client-cached or proxy-cached).
    • over-the-wire costs/savings.
  • Synchronization. Using the GedcomX web services of two systems, it should be possible to bi-directionally synchronize all or part of the data between them. One example of the kind of synchronization I mean is the "desktop record manager" case, where users will want to synchronize their locally stored tree with a portion of CT (or other online tree). Many design decisions can affect how difficult it is to perform such synchronization. By keeping this requirement in mind early in the design process, we can optimize for it.

I did not intend the order of these objectives to be significant--the desire is to achieve them all. Of course, there are always tradeoffs. Achieving the optimal balance of all the objectives is the ultimate goal. This is admittedly difficult, but it should be possible to compare and contrast any set of options with respect to these design objectives.

Now that I have listed my key design objectives, I am ready to respond to your comments. Let me begin with this from your post:

Entity Boundaries != Resource Boundaries

The whole "discussion on person awareness" seems to be based on an erroneous assumption that I think needs to be clarified. Entity boundaries as defined by the model are NOT the same as the boundaries for web service resources.

I'm really glad you brought up this point because there may be a difference in our thinking here that has been leading us down different paths. By exploring it perhaps our paths can be brought closer together.

The client of a web service experiences the model through the web service. The resources of the web service, and the methods of manipulating them are the only model the client knows. When you define a resource in a RESTful web service, from the point of view of REST (and the client), you have created an entity. Why do I say this? You gave it a unique identifier (URI). You may like to consider these REST entities as a different kind of entity than model entities, but they are the only kind of entity that REST naturally understands. Whenever you assign a URI to something, you are forced to answer all the usual REST entity questions:

  • what is the lifecycle of this thing?
  • what operations are allowable?
  • how may it be cached?
  • how is it connected to other resources?
  • what is the longevity of the identifier?

So, why have I taken such pains to characterize resources as "REST entities"? Bear with me. I am trying to lay the groundwork that will allow issues to be brought into focus.

Having defined these two kinds of entities, we could rephrase your initial statement this way:

Model entity boundaries != REST entity boundaries

Given that we have agreed that there needn't be a one-to-one correspondence between model entities and REST entities, it would seem that we differ only on how to deviate from this to best advantage.

For example, with RDBPAT I really wasn't trying to abolish Relationship as a model entity--I just didn't want it expressed as a REST entity. I wanted to avoid creating a system-assigned unique identifier (which, when expressing Relationship as a REST entity, is the URI). Instead, I wanted to be sure that, whenever a Relationship is identified, it is identified by the two people involved and the type. The advantages I cited for identifying Relationships this way were:

  • no need for persistent id or forward-id logic
  • no need for alternate-ids (which leads to a simpler synchronization model)
  • natural continuity of identity and change history across relationship deletes and re-creations.
  • simpler client interaction model. (You may want to contest this claim, and there is a lot more I could say in support of this, but I'll have to resist the temptation for the moment.)
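To make the identity argument concrete, here is a minimal Python sketch of a Relationship identified only by its two persons and its type. The names `RelationshipKey` and `RelationshipStore`, the person ids, and the `lineage` fact are all hypothetical, made up for illustration; none of them come from the GedcomX model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationshipKey:
    """Hypothetical natural key: the two persons plus the relationship type."""
    person1: str
    person2: str
    type: str


class RelationshipStore:
    """Toy in-memory store that keeps change history per natural key."""

    def __init__(self):
        self.current = {}   # key -> facts of the live relationship
        self.history = {}   # key -> list of (action, facts) entries

    def create(self, key, facts):
        self.current[key] = facts
        self.history.setdefault(key, []).append(("create", facts))

    def delete(self, key):
        facts = self.current.pop(key)
        self.history[key].append(("delete", facts))


store = RelationshipStore()
key = RelationshipKey("person-A", "person-B", "parent-child")

store.create(key, {"lineage": "biological"})
store.delete(key)
# Re-creating with the same two persons and type lands on the same identity,
# so no forwarding id or alternate id is needed to connect the history.
store.create(RelationshipKey("person-A", "person-B", "parent-child"), {"lineage": "adoptive"})

actions = [action for action, _ in store.history[key]]
print(actions)
```

Because the key is a frozen dataclass, two keys built from the same three values compare equal and hash alike, which is what lets the history survive a delete and re-creation.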

I proposed a REST entity like this (pseudo-xml):

<person persistentId="...">
  <facts>...</facts>
  <sources>...</sources>
  <relationships>
    <parent ref="...">
      <facts>...</facts>
    </parent>
    ...
  </relationships>
</person>

As I said before, with this proposal I wasn't trying to change the model entities. I just wanted to change how they are expressed through the web service. (For example, when a client consumes an entire tree at once, even I can see that having Relationship outside of Person is better, since the data is then fully normalized.)

You countered by suggesting a "person with relationships" resource. I suppose it would be logically something like this:

<person-with-relationships>
  <person persistentId="...">
    <facts>...</facts>
    <sources>...</sources>
  </person>
  <relationship type="parent-child">
    <person1 ref="..."/>
    <person2 ref="..."/>
    <facts>...</facts>
  </relationship>
  ...
</person-with-relationships>

In the above example, note that the Relationship has "person1" and "person2", which have URIs to their respective Persons. If I do an HTTP GET on these URIs, what comes back: the simple <person> or <person-with-relationships>? If the latter, and if the web service doesn't provide a URI to individual Relationships, then we are free to have Relationship extend GenealogicalResource instead of GenealogicalEntity, and there is no need to embed Relationships inside of Person. Are you more comfortable with this approach?

For the purposes of illustration, let's explore for a minute the other option, where Person URIs return simple <person>. In that case, to be fully connected I suppose we'll need a link to the <person-with-relationships> in <person>, e.g.:

<person persistentId="...">
  <facts>...</facts>
  <sources>...</sources>
  <links>
    <link rel='person-with-rels' href="..."/>
  </links>
</person>

With this approach, to get to the Relationships of a Person, I'll end up fetching the <person>, then fetching the <person-with-relationships>. Once I have done this I have fetched the contents of <person> twice. This affects all three aspects of cacheability:

  • load on the server is increased because the server is forced to assemble <person> information twice.
  • load over the wire is increased because <person> information is sent twice
  • load on the cache (whether client-side or proxy) is increased because <person> has to be cached twice. This will cause the cache to reach its size limits sooner, forcing out more objects that have not yet gone stale, which will further impact server and over-the-wire costs.
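The duplication can be made concrete with a small accounting sketch. The URIs, the payload strings, and the `TinyCache` class are all hypothetical; real payloads will be larger, and the only point is that the <person> bytes end up stored under two keys.

```python
class TinyCache:
    """Toy cache keyed by URI; counts total bytes held."""

    def __init__(self):
        self.entries = {}

    def put(self, uri, body):
        self.entries[uri] = body

    def bytes_held(self):
        return sum(len(body.encode("utf-8")) for body in self.entries.values())


# Hypothetical payloads standing in for the pseudo-XML above.
person_xml = "<person><facts>f</facts><sources>s</sources></person>"
relationship_xml = "<relationship type='parent-child'><facts>f</facts></relationship>"
bag_xml = (
    "<person-with-relationships>"
    + person_xml
    + relationship_xml
    + "</person-with-relationships>"
)

cache = TinyCache()
cache.put("/persons/1", person_xml)                  # first fetch: simple <person>
cache.put("/persons/1/with-relationships", bag_xml)  # second fetch: the bag

# Every byte of <person> is now held twice: once alone, once inside the bag.
duplicated = len(person_xml.encode("utf-8"))
print(cache.bytes_held(), "bytes cached, of which", duplicated, "are a second copy of <person>")
```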

This is a small illustration of a problem with what I call "bags". I define "bag" this way:

bag: a REST entity that aggregates other REST entities.

<person-with-relationships> is a small bag because it aggregates <person>.

Discussion on bags and "summary objects"

The small bag example cited above is only one of a variety of bags that I have seen discussed, some of which are in the CT API that was presented at RootsTech this year. For example, in the CT API each "conclusion" has its own URI and can be operated upon (HTTP GET, DELETE) independently. By giving each conclusion its own URI, even simple <person> becomes a bag.

A couple of other common bags I have seen contemplated are the "bowtie" bag (give me the person and his 1-hop relatives) and the "n-generations" bag (give me the ancestors/descendants of a given person to "n" generations). These bags may contain whole persons or just "person summaries", but either way they have huge implications for caching.

Let's consider the caching implications of the "bowtie bag". Suppose a man has three children, two parents, and a wife (who also has two parents). In this simple example, the "bowtie bag" person information for that man (whether a complete Person, or a SummaryPerson) is available in 9 different bowtie bags! Over time, as a user navigates this tree, he is likely to cache each of these bowties, as well as the individual objects (and potentially even each individual conclusion of each person, if that way of fetching information is kept). Thus we see that the amount of "wasted load" upon the cache, "wasted bandwidth" on the network, and "wasted load" on the server is a function of the number of different ways the client has for getting the same piece of information.

Let's review the problems associated with "bags" and "summary objects":

Problems with bags

  • caching
    • stuff gets assembled, sent, cached multiple times. (At the meeting it was suggested that this problem is ameliorated by simply marking bags as "uncacheable". This is a little like finding that you have a broken finger and cutting off your hand to "solve" the problem. It is true that stuff doesn't get cached multiple times, but now, since bag resources are not cached, they will have to be assembled and sent countless additional times.)
    • deletes, puts, and posts invalidate many cached items, so caching transparency is reduced.
  • complexity. Complexity is increased for both data providers and clients. Services have to support all the different ways of fetching the same data. Clients have to learn to consume data in all its forms.
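To illustrate the invalidation point, here is a rough Python sketch of a cache that has to track which cached URIs embed which person. The class name `BagCache`, the URIs, and the membership data are assumptions for illustration only; they do not describe any existing API.

```python
from collections import defaultdict


class BagCache:
    """Toy cache that remembers which cached URIs embed which person."""

    def __init__(self):
        self.cached = set()
        self.uris_by_person = defaultdict(set)

    def store(self, uri, person_ids):
        self.cached.add(uri)
        for pid in person_ids:
            self.uris_by_person[pid].add(uri)

    def invalidate_person(self, pid):
        """A PUT/POST/DELETE on one person makes every URI embedding it stale."""
        stale = self.uris_by_person.pop(pid, set()) & self.cached
        self.cached -= stale
        return stale


cache = BagCache()
cache.store("/persons/man", ["man"])
cache.store("/bowtie/man", ["man", "wife", "child1"])
cache.store("/bowtie/wife", ["wife", "man", "child1"])
cache.store("/bowtie/child1", ["child1", "man", "wife"])

stale = cache.invalidate_person("man")
print(sorted(stale))
```

In this toy setup a single edit to one person drops all four cached entries. If each piece of information had only one URI, the same edit would drop one.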

Problems with "summary" objects (in or outside of bags)

  • summary data is always redundant (bag-like), and so has the same caching and complexity drawbacks as bags.
  • impossible to predict the client's needs. Regardless of what is chosen as the appropriate "summary" data for any given entity, my experience is that some new requirement is inevitably added to the client that requires a piece of data that has not been provided in the summary. When this happens, there will be a strong temptation to add the new piece of information to the summary. This temptation is natural because the alternative will be that the client will have to fetch all the full objects after fetching the bag of summary objects. In this scenario, the client pays the bag-like costs without any summary-object benefits. In addition, our design objectives state that there will be many clients of this data, with many differing needs. Many of them will find themselves in this dilemma of either adding their particular needed information to the summary, or paying bag-like costs with no benefit. This could end up splintering the standard into a variety of differing "extension" implementations.

This leads us to the following rule of thumb:

In general, each piece of information should have a single way of being retrieved/modified. Where a compelling need can be clearly demonstrated, this can be relaxed, but there is a strong burden of proof placed upon anyone wanting to recommend relaxing this rule.

@stoicflame
Member

Bug scrub bump.

A lot of water has gone under the bridge since this issue was opened. These concerns are valid, but this project is no longer the place to get them addressed. They are, generally, concerns of the application, so we need to pick up these issues over at FamilySearch/gedcomx-rs and hash out many of them within our FamilySearch implementation teams.
