Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test sample of UK PSC streaming API results against PSC snapshots #264

Open
StephenAbbott opened this issue May 3, 2024 · 21 comments
Open
Assignees

Comments

@StephenAbbott
Copy link
Member

No description provided.

@tiredpixel
Copy link
Contributor

tiredpixel commented May 23, 2024

There are a few things to consider, here:

  1. What were the differences between psc.2024-05-03T10:53:43+00:00.jsonl.gz and psc.2024-04-08T07:06:44+00:00.jsonl.gz bulk data exports? The latter was without bulk Ingester PSC being run, but with streaming Ingester PSC being run. If all worked well with streaming Ingester PSC, this should have been roughly equivalent to the monthly bulk data import of PSC snapshots.
  2. What were the differences between psc.2024-05-05T10:53:40+00:00.jsonl.gz and psc.2024-05-03T10:53:43+00:00.jsonl.gz bulk data exports? The latter was after bulk Ingester PSC was run, and with streaming Ingester PSC continuing to run meanwhile. If all worked well, these changes from this should have been minimal-to-none, since those snapshots would already have been processed (but note certain differences in order processing statement chains could be an issue, here).
  3. For statements written to oo-register-v2 S3 bucket raw_data directory by streaming Ingester PSC, how did those statements compare to what was written to the same place by bulk Ingester PSC, which downloaded the PSC datasource snapshots? That effectively compares the raw data coming over the streaming link via the raw data offered for download in daily (we import it monthly) PSC snapshots.

@tiredpixel
Copy link
Contributor

tiredpixel commented May 23, 2024

(1), (2)

Trying to analyse these files directly could be problematic, since they're 3.5G compressed files. However, there is no need to consider using Athena or similar, here, since the newer method of combining Register files (#213) fully indexes snapshots and generates them by appending to previous snapshots. In addition, comprehensive log files are written. This makes it possible to generate deltas, and analyse those instead.

Considering the sizes of the snapshots in bytes:

-rw-r--r--. 1 core core 3676957170 Apr  8 08:20 psc.2024-04-08T07:06:44+00:00.jsonl.gz
-rw-r--r--. 1 core core 3679878624 May  3 11:14 psc.2024-05-03T10:53:43+00:00.jsonl.gz
-rw-r--r--. 1 core core 3691953735 May  5 11:36 psc.2024-05-05T10:53:40+00:00.jsonl.gz

So, psc.2024-04-08T07:06:44+00:00.jsonl.gz is 3676957170 bytes. We can cut that from the head of the next snapshot, and compute the delta:

dd if='psc.2024-05-03T10:53:43+00:00.jsonl.gz' of='psc.2024-05-03T10:53:43+00:00.jsonl.gz.delta.jsonl.gz' bs=3676957170 skip=1

There, skip=3676957170 and bs=1 would make a lot more sense, but because of the way dd works, doing it the other way round results in the same output in this case, whilst being orders of magnitude faster.

Similarly, psc.2024-05-03T10:53:43+00:00.jsonl.gz is 3679878624 bytes. Cutting that from the head of the next snapshot results in the delta:

dd if='psc.2024-05-05T10:53:40+00:00.jsonl.gz' of='psc.2024-05-05T10:53:40+00:00.jsonl.gz.delta.jsonl.gz' bs=3679878624 skip=1

This results in delta compressed files:

-rw-r--r--. 1 core core  2921454 May 22 16:24 psc.2024-05-03T10:53:43+00:00.jsonl.gz.delta.jsonl.gz
-rw-r--r--. 1 core core 12075111 May 22 16:24 psc.2024-05-05T10:53:40+00:00.jsonl.gz.delta.jsonl.gz
3676957170 +  2921454 = 3679878624
           + 12075111 = 3691953735

so the sizes of these match. But do the numbers of statements?

Since these delta compressed files are only 2.8M and 12M in size, we can easily decompress them and work with that. These are JSONL files, and there is one statement per line. Counting the lines shows:

    24584 psc.2024-05-03T10:53:43+00:00.jsonl.gz.delta.jsonl
   105932 psc.2024-05-05T10:53:40+00:00.jsonl.gz.delta.jsonl

Looking at the logs written by the Register files combiner when these snapshots were generated, we see that psc.2024-04-08T07:06:44+00:00.jsonl.gz has 32808029 statements, psc.2024-05-03T10:53:43+00:00.jsonl.gz has 32832613 statements, and psc.2024-05-05T10:53:40+00:00.jsonl.gz has 32938545 statements.

32832613 - 32808029 =  24584
32938545 - 32832613 = 105932

This matches the number of lines in the delta files, so this method seems reasonable.

@tiredpixel
Copy link
Contributor

tiredpixel commented May 23, 2024

Here, we can already spot a potential issue: psc.2024-05-05T10:53:40+00:00.jsonl.gz resulted in 105932 extra statements which psc.2024-05-03T10:53:43+00:00.jsonl.gz didn't have. That in turn had only 24584 statements, but that's less of a concern because bulk PSC Ingester was run during the month a couple of times, as part of the deployment of streaming PSC Ingester (#248). But what are these additional statements?

The first statement in psc.2024-05-05T10:53:40+00:00.jsonl.gz.delta.jsonl is:

{"interestedParty":{"describedByPersonStatement":"1058695637818751597"},"interests":[{"details":"ownership-of-shares-75-to-100-percent","share":{"exclusiveMaximum":false,"exclusiveMinimum":false,"maximum":100,"minimum":75},"startDate":"2024-05-03","type":"shareholding"},{"details":"voting-rights-75-to-100-percent","share":{"exclusiveMaximum":false,"exclusiveMinimum":false,"maximum":100,"minimum":75},"startDate":"2024-05-03","type":"voting-rights"},{"details":"right-to-appoint-and-remove-directors","startDate":"2024-05-03","type":"appointment-of-board"}],"isComponent":false,"publicationDetails":{"bodsVersion":"0.2","license":"https://register.openownership.org/terms-and-conditions","publicationDate":"2024-05-04","publisher":{"name":"OpenOwnership Register","url":"https://register.openownership.org"}},"source":{"assertedBy":null,"description":"GB Persons Of Significant Control Register","retrievedAt":"2024-05-04","type":"officialRegister","url":"https://api.company-information.service.gov.uk/company/15702115/persons-with-significant-control/individual/GIB_iQS1N4Spi5_swikegKrlT3g"},"statementDate":"2024-05-03","statementID":"10017005497377867478","statementType":"ownershipOrControlStatement","subject":{"describedByEntityStatement":"14003992553979663845"}}

That has "startDate":"2024-05-03", so perhaps it was added to the registry after psc.2024-05-03T10:53:43+00:00.jsonl.gz. Instead, let's try an older statement from 2024-05-01.

{"interestedParty":{"describedByPersonStatement":"12124130482618967232"},"interests":[{"details":"ownership-of-shares-75-to-100-percent","share":{"exclusiveMaximum":false,"exclusiveMinimum":false,"maximum":100,"minimum":75},"startDate":"2024-05-01","type":"shareholding"},{"details":"voting-rights-75-to-100-percent","share":{"exclusiveMaximum":false,"exclusiveMinimum":false,"maximum":100,"minimum":75},"startDate":"2024-05-01","type":"voting-rights"},{"details":"right-to-appoint-and-remove-directors","startDate":"2024-05-01","type":"appointment-of-board"}],"isComponent":false,"publicationDetails":{"bodsVersion":"0.2","license":"https://register.openownership.org/terms-and-conditions","publicationDate":"2024-05-04","publisher":{"name":"OpenOwnership Register","url":"https://register.openownership.org"}},"source":{"assertedBy":null,"description":"GB Persons Of Significant Control Register","retrievedAt":"2024-05-04","type":"officialRegister","url":"https://api.company-information.service.gov.uk/company/15699683/persons-with-significant-control/individual/mpocV74edYjZnyeCoWQmETQZniI"},"statementDate":"2024-05-01","statementID":"10030505153561054686","statementType":"ownershipOrControlStatement","subject":{"describedByEntityStatement":"17878820742214078800"}}

Searching the deltas for https://api.company-information.service.gov.uk/company/15699683/persons-with-significant-control/individual/mpocV74edYjZnyeCoWQmETQZniI yields matches:

psc.2024-05-03T10:53:43+00:00.jsonl.gz.delta.jsonl
10432:{"addresses":[],"birthDate":"1997-11-01","identifiers":[{"id":"/entities/7523673282719851566","schemeName":"OpenOwnership Register","uri":"/entities/7523673282719851566"}],"isComponent":false,"names":[{"familyName":"Veliai","fullName":"Lentia Veliai","givenName":"Lentia","type":"individual"}],"nationalities":[{"code":"AL","name":"Albania"}],"personType":"knownPerson","publicationDetails":{"bodsVersion":"0.2","license":"https://register.openownership.org/terms-and-conditions","publicationDate":"2024-05-02","publisher":{"name":"OpenOwnership Register","url":"https://register.openownership.org"}},"replacesStatements":[],"source":{"assertedBy":null,"description":"GB Persons Of Significant Control Register","retrievedAt":"2024-05-02","type":"officialRegister","url":"https://api.company-information.service.gov.uk/company/15699683/persons-with-significant-control/individual/mpocV74edYjZnyeCoWQmETQZniI"},"statementDate":"2024-05-02","statementID":"1211871035483748167","statementType":"personStatement"}
10810:{"interestedParty":{"describedByPersonStatement":"1211871035483748167"},"interests":[{"details":"ownership-of-shares-75-to-100-percent","share":{"exclusiveMaximum":false,"exclusiveMinimum":false,"maximum":100,"minimum":75},"startDate":"2024-05-01","type":"shareholding"},{"details":"voting-rights-75-to-100-percent","share":{"exclusiveMaximum":false,"exclusiveMinimum":false,"maximum":100,"minimum":75},"startDate":"2024-05-01","type":"voting-rights"},{"details":"right-to-appoint-and-remove-directors","startDate":"2024-05-01","type":"appointment-of-board"}],"isComponent":false,"publicationDetails":{"bodsVersion":"0.2","license":"https://register.openownership.org/terms-and-conditions","publicationDate":"2024-05-02","publisher":{"name":"OpenOwnership Register","url":"https://register.openownership.org"}},"source":{"assertedBy":null,"description":"GB Persons Of Significant Control Register","retrievedAt":"2024-05-02","type":"officialRegister","url":"https://api.company-information.service.gov.uk/company/15699683/persons-with-significant-control/individual/mpocV74edYjZnyeCoWQmETQZniI"},"statementDate":"2024-05-01","statementID":"4270567247707214438","statementType":"ownershipOrControlStatement","subject":{"describedByEntityStatement":"17878820742214078800"}}

psc.2024-05-05T10:53:40+00:00.jsonl.gz.delta.jsonl
2:{"interestedParty":{"describedByPersonStatement":"12124130482618967232"},"interests":[{"details":"ownership-of-shares-75-to-100-percent","share":{"exclusiveMaximum":false,"exclusiveMinimum":false,"maximum":100,"minimum":75},"startDate":"2024-05-01","type":"shareholding"},{"details":"voting-rights-75-to-100-percent","share":{"exclusiveMaximum":false,"exclusiveMinimum":false,"maximum":100,"minimum":75},"startDate":"2024-05-01","type":"voting-rights"},{"details":"right-to-appoint-and-remove-directors","startDate":"2024-05-01","type":"appointment-of-board"}],"isComponent":false,"publicationDetails":{"bodsVersion":"0.2","license":"https://register.openownership.org/terms-and-conditions","publicationDate":"2024-05-04","publisher":{"name":"OpenOwnership Register","url":"https://register.openownership.org"}},"source":{"assertedBy":null,"description":"GB Persons Of Significant Control Register","retrievedAt":"2024-05-04","type":"officialRegister","url":"https://api.company-information.service.gov.uk/company/15699683/persons-with-significant-control/individual/mpocV74edYjZnyeCoWQmETQZniI"},"statementDate":"2024-05-01","statementID":"10030505153561054686","statementType":"ownershipOrControlStatement","subject":{"describedByEntityStatement":"17878820742214078800"}}
79:{"addresses":[],"birthDate":"1997-11-01","identifiers":[{"id":"/entities/7523673282719851566","schemeName":"OpenOwnership Register","uri":"/entities/7523673282719851566"}],"isComponent":false,"names":[{"familyName":"Veliai","fullName":"Lentia Veliai","givenName":"Lentia","type":"individual"}],"nationalities":[{"code":"GR","name":"Greece"}],"personType":"knownPerson","publicationDetails":{"bodsVersion":"0.2","license":"https://register.openownership.org/terms-and-conditions","publicationDate":"2024-05-04","publisher":{"name":"OpenOwnership Register","url":"https://register.openownership.org"}},"replacesStatements":["1211871035483748167"],"source":{"assertedBy":null,"description":"GB Persons Of Significant Control Register","retrievedAt":"2024-05-04","type":"officialRegister","url":"https://api.company-information.service.gov.uk/company/15699683/persons-with-significant-control/individual/mpocV74edYjZnyeCoWQmETQZniI"},"statementDate":"2024-05-04","statementID":"12124130482618967232","statementType":"personStatement"}

At a glance, these seem to contain mostly the same data, but in a different order. retrievedAt is also different, as is statementID. This isn't what we hoped. Perhaps there's something to do with the order in which things were run during the deployment of streamed Ingester PSC, or perhaps it's not possible to import PSC snapshots on top of the already-ingested data, since those will refer to previous moments in time.

Let us park this line of investigation here, for the moment; next, we will consider the raw data coming from PSC.

@tiredpixel
Copy link
Contributor

(3)

In order to examine this, we need to look at a sample of data written to the oo-register-v2 S3 bucket raw_data directory. Since streaming Ingester PSC is running constantly, the directory is updated with the original data coming over the PSC streaming link. However, when bulk Ingester PSC was run, that data also got written to the same directory, with the data obtained from downloading the PSC snapshots available daily. Thus, a set of files needs to be chosen such that streaming Ingester PSC was running during that time but bulk Ingester PSC was not, yet running at a point in time prior to the subsequent bulk Ingester PSC run. Then, that data can be matched to data received subsequently, and the matched lines examined for differences.

2024-04-30 seems a good day to pick. The streaming Ingester PSC files were written to oo-register-v2/raw_data/source=PSC/year=2024/month=04/day=30. There were 228 files written on that day, totalling 9130 lines (the files are not compressed).

First, we would like to develop a method for comparing the data. Let us pick the first file, psc-prod-2-2024-04-30-00-00-03-abe59265-3c0c-4633-8062-128767baba05, which contains only 2 lines. The first line is:

{"company_number":"15692709","data":{"address":{"address_line_1":"Vines Cross Way","country":"England","locality":"Skelmersdale","postal_code":"WN8 6HP","premises":"40"},"country_of_residence":"England","date_of_birth":{"month":4,"year":2000},"etag":"18f63af65e21104d901ffc760f0f3324a628ae06","kind":"individual-person-with-significant-control","links":{"self":"/company/15692709/persons-with-significant-control/individual/fPjFLkE9nnly_aKoDopjs6x6jDk"},"name":"Mr Josh Peter Reeb","name_elements":{"forename":"Josh","surname":"Reeb","title":"Mr"},"nationality":"English","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors"],"notified_on":"2024-04-30"}}

This has etag 18f63af65e21104d901ffc760f0f3324a628ae06, so we search oo-register-v2/raw_data/source=PSC/year=2024 for this. But there are no additional matches found. Hm.

That record also has links.self /company/15692709/persons-with-significant-control/individual/fPjFLkE9nnly_aKoDopjs6x6jDk. Searching by that shows a match in month=04/day=30/psc-prod-2-2024-04-30-00-59-03-d587088d-9ace-4540-bd76-58dd6c0278a7. This is earlier than we would expect, since it is prior to bulk Ingester PSC being run. Searching based on company_number also doesn't return any hits. Hm.

Let us instead pick this later file, psc-prod-2-2024-04-30-00-59-03-d587088d-9ace-4540-bd76-58dd6c0278a7. This contains 31 lines. The first line is:

{"company_number":"15692730","data":{"address":{"address_line_1":"Canterbury Avenue","country":"England","locality":"Ilford","postal_code":"IG1 3NG","premises":"82"},"country_of_residence":"England","date_of_birth":{"month":4,"year":1995},"etag":"3055b641a603f780633289dd50855844a6139177","kind":"individual-person-with-significant-control","links":{"self":"/company/15692730/persons-with-significant-control/individual/XIasrLbXq7EQ-MuvC0n_IWkBL-0"},"name":"Mr Altaf Hyder","name_elements":{"forename":"Altaf","surname":"Hyder","title":"Mr"},"nationality":"Indian","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors"],"notified_on":"2024-04-30"}}

Again, etag 3055b641a603f780633289dd50855844a6139177 and links.self /company/15692730/persons-with-significant-control/individual/XIasrLbXq7EQ-MuvC0n_IWkBL-0 has no matches. This is not as expected.

Can we find a line which actually has a match? Yes, in psc-prod-2-2024-04-30-23-04-02-ec15dca6-56f6-4386-b979-08edba26e042. This contains 4 lines, one of which is:

{"company_number":"11451665","data":{"address":{"address_line_1":"Shire Hill","country":"United Kingdom","locality":"Saffron Walden","postal_code":"CB11 3AQ","premises":"Business & Technology Centre"},"country_of_residence":"United Kingdom","date_of_birth":{"month":7,"year":1972},"etag":"82c6adc78824b3fade3ea2ac6bf58a4d62b4d428","kind":"individual-person-with-significant-control","links":{"self":"/company/11451665/persons-with-significant-control/individual/kMt3JLrqsv8_0nqrvIFtbhM2Bwg"},"name":"Mr Garry Mark Reed","name_elements":{"forename":"Garry","surname":"Reed","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent"],"notified_on":"2018-07-06"}}

Again, etag 82c6adc78824b3fade3ea2ac6bf58a4d62b4d428 has no matches. But links.self /company/11451665/persons-with-significant-control/individual/kMt3JLrqsv8_0nqrvIFtbhM2Bwg does: psc-prod-2-2024-05-03-12-34-56-be098ce0-c067-4f83-8aa2-82f7877656bb. This contains 101 lines. The matching line is:

{"company_number":"11451665","data":{"address":{"address_line_1":"Shire Hill","country":"United Kingdom","locality":"Saffron Walden","postal_code":"CB11 3AQ","premises":"Business & Technology Centre"},"country_of_residence":"United Kingdom","date_of_birth":{"month":7,"year":1972},"etag":"8d4d3ee8f26b2ad575305c57f74a728fa1e29438","kind":"individual-person-with-significant-control","links":{"self":"/company/11451665/persons-with-significant-control/individual/kMt3JLrqsv8_0nqrvIFtbhM2Bwg"},"name":"Mr Garry Mark Reed","name_elements":{"forename":"Garry","surname":"Reed","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent"],"notified_on":"2018-07-06"}}

We can see that etag is indeed different. Comparing the fields, we see nothing else is different:

16c16
<     "etag": "82c6adc78824b3fade3ea2ac6bf58a4d62b4d428",
---
>     "etag": "8d4d3ee8f26b2ad575305c57f74a728fa1e29438",

This gives us our first major findings:

  1. Some records published through PSC stream appear in PSC bulk data a few days later. (As expected.)
  2. However, these records have a different etag! (Unexpected.)
  3. Some records published through PSC stream do not appear in PSC bulk data. (Also unexpected.)

(2) is a pain. It means we cannot simply match that record reliably, nor could we deduplicate the record based on that even if we had such functionality.

(3) is curious. Perhaps records changed, and then got changed back prior to bulk data export? Or perhaps there's some other reason why entire records are missing.

What we need is some way to match records based on links.self. But this is not foolproof, because it's possible to have different updates with the same links.self. There isn't anything else reasonable to match on, however—except possibly company_number, but this would be even more likely to match multiple records.

@tiredpixel
Copy link
Contributor

Examining a large number of files through a filter, data.links.self can be one of these values:

corporate-entity-person-with-significant-control
individual-person-with-significant-control
legal-person-person-with-significant-control
persons-with-significant-control-statement
super-secure-person-with-significant-control

For each of these possible values, examining a line of raw data shows that data.links.self is set. So this seems to be the best available field to match on.

@tiredpixel
Copy link
Contributor

In order to examine the number of matches for each data.links.self, we write a small script. This accepts a source directory to analyse and a destination directory to perform the search in (which should usually be up a couple of levels in the directory tree). It loops through each file, then through each line, extracts data.links.self, searches for all matches, and reports the result. Note that by this method, the number of matches will usually be at least 1, since that match is the line in the source file.

#!/usr/bin/env bash
set -Eeuo pipefail

src_d=$1 # e.g. year\=2024/month\=04/day\=30/
dst_d=$2 # e.g. year\=2024/

mapfile -d '' src_fs < <(find "$src_d" \
    \( -name '.*' -prune \) -o -type f -print0 | sort -z)

for f in "${src_fs[@]}" ; do
    while read -r l ; do
        links_self=$(echo "$l" | jq -r '.data.links.self')
        matches=$( (grep -r "$links_self" "$dst_d" || true) | wc -l)
        echo -e "$f\t$links_self\t$matches"
    done < "$f"
done

psc-prod-2-2024-04-30-09-13-03-05e60595-c03a-43b0-83bc-acc9f4b24826 has 127 lines. This is much more than many of the files, so we use it for a benchmark.

Grep takes 23s. Replacing Grep with Ack:

matches=$( (ack "$links_self" "$dst_d" || true) | wc -l)

Ack takes even longer: 48s. Replacing Ack with Ag (The Silver Searcher):

matches=$( (ag "$links_self" "$dst_d" || true) | wc -l)

This takes less than 10s, so we'll use that instead (portability isn't a concern, here).

Using Ag, we can do slightly better, and use --ignore to ignore the source file. We could otherwise subtract 1 from the results, but this is more accurate since it's possible the same data.links.self could recur later in the same source file, whereas we're mostly only interested in the files which are written days later. This isn't necessarily 100% accurate, since it's possible something could be published to the stream immediately prior to the bulk snapshot generation, and that downloaded and run within the same period of time—but that would have to all happen within a 5-minute interval (based on Kinesis Firehose configuration), which is pretty much impossible. So it's reasonable.

That gives us the following final script:

#!/usr/bin/env bash
set -Eeuo pipefail

src_d=$1 # e.g. year\=2024/month\=04/day\=30/
dst_d=$2 # e.g. year\=2024/

mapfile -d '' src_fs < <(find "$src_d" \
    \( -name '.*' -prune \) -o -type f -print0 | sort -z)

for f in "${src_fs[@]}" ; do
    f2=$(basename "$f")
    while read -r l ; do
        links_self=$(echo "$l" | jq -r '.data.links.self')
        matches=$( (ag --ignore "$f2" "$links_self" "$dst_d" || true) | wc -l)
        echo -e "$f\t$links_self\t$matches"
    done < "$f"
done

@tiredpixel
Copy link
Contributor

Using this script, we can identify the number of matches for each data.links.self in the sample of 2024-04-30:

sample.2024-04-30.matches.log

This results in 9130 lines, which is what we expect from the raw data. But there are lots of lines with 0 matches: 6678 lines, in fact!

Here, it's worth sanity-checking. Picking some of these lines supposedly without matches randomly:

ag '\t0$' sample.2024-04-30.matches.log | shuf | head -n10
4135:year=2024/month=04/day=30/psc-prod-2-2024-04-30-11-31-03-39b4ac98-76e9-471e-8fad-f0f3e2db9af1      /company/12559415/persons-with-significant-control/individual/2JPt0xjtsaE1rLyVzdsQ3eBD8gU       0
3620:year=2024/month=04/day=30/psc-prod-2-2024-04-30-11-04-04-968769fd-0c87-46f9-aa6f-8d466166cfcb      /company/NI616749/persons-with-significant-control/individual/OhPEXRJ1BVaRznCiL8LKRKL3_9Q       0
6935:year=2024/month=04/day=30/psc-prod-2-2024-04-30-15-18-02-de6e5b7c-14cb-4cb7-983a-51beb5efaabf      /company/12004626/persons-with-significant-control/individual/FczKfVDqWWxJEVLo5klTweFLg0E       0
7858:year=2024/month=04/day=30/psc-prod-2-2024-04-30-16-11-03-45cb2676-b93a-435f-a4c3-e355c51a4bda      /company/15697048/persons-with-significant-control/individual/oKAmHZh9AnsfP1xoBHWT9i0_qNg       0
4312:year=2024/month=04/day=30/psc-prod-2-2024-04-30-11-42-03-31a90b2e-5152-416f-b295-8aaad8a89317      /company/SC808975/persons-with-significant-control/individual/3BrDBOnNM4TEjPmH-bcSwmOTRaI       0
6147:year=2024/month=04/day=30/psc-prod-2-2024-04-30-14-19-03-58fd785c-3fa7-42f2-8a0c-12c766f3b511      /company/14072606/persons-with-significant-control/individual/LN7qz4FKxvCWeOBNY1jIK__D4PM       0
7334:year=2024/month=04/day=30/psc-prod-2-2024-04-30-15-39-04-917515d2-599f-4fca-b3d3-8ba9b4d7aeed      /company/15696696/persons-with-significant-control/individual/UK0N-eipqzXSPCFNTrCl7-uGIQA       0
1620:year=2024/month=04/day=30/psc-prod-2-2024-04-30-09-03-02-94cff627-4632-45a0-bc46-05d4ddcb7cf0      /company/01487374/persons-with-significant-control/individual/yxISNNEuOkRNg7sYbQsAO_6zLFk       0
7169:year=2024/month=04/day=30/psc-prod-2-2024-04-30-15-29-02-54337120-39d3-47ca-ae25-c525b6c7d388      /company/13165896/persons-with-significant-control/individual/NARS-jPFJcGQTZ5vJTO5XAKd3L4       0
7792:year=2024/month=04/day=30/psc-prod-2-2024-04-30-16-06-02-6650c699-c11b-4d35-89c3-243a77e9560a      /company/11916352/persons-with-significant-control/individual/MVWnm_hyjlZVhsvCL7M3oDq6pmY       0

Manually searching for each of these data.links.self in the entire PSC dataset (oo-register-v2/raw_data/source=PSC) shows that there are indeed no matches after 2024-04-30. Some have matches from 2023, but that's perfectly acceptable. At least one has a match within the same source file on that day with the same payload, which is a little unusual, but also possible—e.g. perhaps the stream was interrupted (which happens daily) and streaming Ingester PSC was restarted.

There are 1057 lines with more than 1 match. This is entirely possible, data.links.self may occur more than once if there are multiple updates within the registry. Given that etag doesn't match, sorting through these potential matches would be tedious. However, there are 1395 lines with precisely 1 match, so we can reduce the sample to this, and analyse that in more detail instead:

sample.2024-04-30.matches=1.log

@tiredpixel
Copy link
Contributor

tiredpixel commented May 24, 2024

Looking into sample.2024-04-30.matches.1.log, an issue with the process becomes clear: some of the matches are from prior to the bulk Ingester PSC. Rather than searching the whole of 2024, it would be better to search just 2024-05. This is possible because the original sample is 2024-04-30 (the last day of the month), and bulk Ingester PSC was run only during 2024-05.

Rerunning the script with different parameters, we can identify the number of matches for each data.links.self in the sample of 2024-04-30, where those matches occur only sometime in 2024-05:

sample.2024-04-30.2024-05.matches.log

Again, this has 9130 lines. But the number of matches is even worse: 7938 data.links.self have no matches within 2024-05. And this reduces the next sample to only 880 lines with precisely 1 match in 2024-05.

Another sanity-check is merited. A random sample of lines is:

1236:year=2024/month=04/day=30/psc-prod-2-2024-04-30-08-30-02-6bd3f460-5691-480f-8624-4aede520eb8b      /company/15693659/persons-with-significant-control/individual/D5INTYHwHkEFUG4iv8_Tbvos0a8       0
6826:year=2024/month=04/day=30/psc-prod-2-2024-04-30-15-07-02-0090bb94-66f4-441e-8d0a-3f1c0e3e728d      /company/15696393/persons-with-significant-control/individual/fRUzlOBkjeGWtNMxm3uE9SAsN4Q       0
6060:year=2024/month=04/day=30/psc-prod-2-2024-04-30-14-08-05-bd24f4b9-7faf-412e-bee9-368c747b9e12      /company/06911744/persons-with-significant-control/individual/tZx9vWQNhaJ08kazW7kecO9_heA       0
3390:year=2024/month=04/day=30/psc-prod-2-2024-04-30-10-49-02-731acfde-e108-465a-9638-8ca28ac59d97      /company/15694677/persons-with-significant-control/individual/PwIy3LVbmxB4Orw_sO8HQwLU87g       0
2418:year=2024/month=04/day=30/psc-prod-2-2024-04-30-09-55-05-acfe7eeb-cf3e-402b-811b-7ba7e96d507e      /company/12941146/persons-with-significant-control/individual/-1oPefDxXX6tQ8p-CpgPWXLyBLw       0
3290:year=2024/month=04/day=30/psc-prod-2-2024-04-30-10-43-03-5fde2797-2fa2-4846-a7fc-7746f224df1b      /company/15324434/persons-with-significant-control/individual/c1us1ZVCgsuroUnP449c3q_Wfus       0
1510:year=2024/month=04/day=30/psc-prod-2-2024-04-30-08-57-02-7c037879-d712-47ec-9877-51293a798ebd      /company/14895215/persons-with-significant-control/individual/Es_cutCL_jO6FFXocUgfDKYMspU       0
7821:year=2024/month=04/day=30/psc-prod-2-2024-04-30-16-11-03-45cb2676-b93a-435f-a4c3-e355c51a4bda      /company/15697010/persons-with-significant-control/individual/ygVdeFoF0aadXq0O8HKwhqeHKhI       0
8782:year=2024/month=04/day=30/psc-prod-2-2024-04-30-18-45-02-5a5ae755-3468-4980-86c7-4ac8c146f0bd      /company/15697849/persons-with-significant-control/individual/MWBDfUsWvUaI5d-jVD7dz4DF3FM       0
4232:year=2024/month=04/day=30/psc-prod-2-2024-04-30-11-42-03-31a90b2e-5152-416f-b295-8aaad8a89317      /company/13871346/persons-with-significant-control/legal-person/8uLJ6b7aCzuk3s2H1YiDflDiGd4     0

Another manual search confirms no matches for these within 2024-05 raw data. This is highly unexpected.

At this point, we check the original source data contained in the PSC snapshots files downloaded when bulk Ingester PSC was run—that is, prior to any processing or restreaming whatsoever. These files are contained in oo-register-bods S3 bucket psc_snapshots/import_id=2024_05_03 directory. That is, bulk Ingester PSC was run on 2024-05-03, meaning that most (all?) of the records from our 2024-04-30 sample should be present in that PSC snapshot.

However, comparing the format of these files, it's clear that in fact these are after the first level of processing. It shouldn't make a difference since that should just be splitting the files, but it turns out that the original PSC snapshots are still available at the old URLs, even though they are no longer listed on the PSC snapshots page. We iterate through these and download them:

for i in $(seq 1 26); do axel "https://download.companieshouse.gov.uk/psc-snapshot-2024-05-03_${i}of26.zip"; done

This yields the following ZIP files:

total 1.6G
-rw-r--r--. 1 core core 64M May 24 14:42 psc-snapshot-2024-05-03_1of26.zip
-rw-r--r--. 1 core core 65M May 24 14:42 psc-snapshot-2024-05-03_2of26.zip
-rw-r--r--. 1 core core 64M May 24 14:42 psc-snapshot-2024-05-03_3of26.zip
-rw-r--r--. 1 core core 64M May 24 14:42 psc-snapshot-2024-05-03_4of26.zip
-rw-r--r--. 1 core core 64M May 24 14:42 psc-snapshot-2024-05-03_5of26.zip
-rw-r--r--. 1 core core 64M May 24 14:42 psc-snapshot-2024-05-03_6of26.zip
-rw-r--r--. 1 core core 64M May 24 14:42 psc-snapshot-2024-05-03_7of26.zip
-rw-r--r--. 1 core core 64M May 24 14:42 psc-snapshot-2024-05-03_8of26.zip
-rw-r--r--. 1 core core 65M May 24 14:42 psc-snapshot-2024-05-03_9of26.zip
-rw-r--r--. 1 core core 64M May 24 14:42 psc-snapshot-2024-05-03_10of26.zip
-rw-r--r--. 1 core core 63M May 24 14:42 psc-snapshot-2024-05-03_11of26.zip
-rw-r--r--. 1 core core 63M May 24 14:42 psc-snapshot-2024-05-03_12of26.zip
-rw-r--r--. 1 core core 63M May 24 14:42 psc-snapshot-2024-05-03_13of26.zip
-rw-r--r--. 1 core core 63M May 24 14:43 psc-snapshot-2024-05-03_14of26.zip
-rw-r--r--. 1 core core 62M May 24 14:43 psc-snapshot-2024-05-03_15of26.zip
-rw-r--r--. 1 core core 62M May 24 14:43 psc-snapshot-2024-05-03_16of26.zip
-rw-r--r--. 1 core core 62M May 24 14:43 psc-snapshot-2024-05-03_17of26.zip
-rw-r--r--. 1 core core 62M May 24 14:43 psc-snapshot-2024-05-03_18of26.zip
-rw-r--r--. 1 core core 62M May 24 14:43 psc-snapshot-2024-05-03_19of26.zip
-rw-r--r--. 1 core core 63M May 24 14:43 psc-snapshot-2024-05-03_20of26.zip
-rw-r--r--. 1 core core 63M May 24 14:43 psc-snapshot-2024-05-03_21of26.zip
-rw-r--r--. 1 core core 63M May 24 14:43 psc-snapshot-2024-05-03_22of26.zip
-rw-r--r--. 1 core core 62M May 24 14:44 psc-snapshot-2024-05-03_23of26.zip
-rw-r--r--. 1 core core 62M May 24 14:44 psc-snapshot-2024-05-03_24of26.zip
-rw-r--r--. 1 core core 42M May 24 14:44 psc-snapshot-2024-05-03_25of26.zip
-rw-r--r--. 1 core core 26M May 24 14:44 psc-snapshot-2024-05-03_26of26.zip

But is this the right PSC snapshot? Comparing the S3 split files to the PSC snapshot just obtained:

zcat import_id\=2024_05_03/url_index\=*/part\=part*/file-*.csv.gz | wc -l
12902790
unzip -p psc-snapshot-2024-05-03/\*.zip | wc -l
12902790

So both methods contain exactly 12902790 records. Thus, this is the right snapshot.

Searching the S3 snapshot for the missing data.links.selfs:

zcat import_id\=2024_05_03/url_index\=*/part\=part*/file-*.csv.gz | grep -E '(/company/15693659/persons-with-significant-control/individual/D5INTYHwHkEFUG4iv8_Tbvos0a8|/company/15696393/persons-with-significant-control/individual/fRUzlOBkjeGWtNMxm3uE9SAsN4Q|/company/06911744/persons-with-significant-control/individual/tZx9vWQNhaJ08kazW7kecO9_heA|/company/15694677/persons-with-significant-control/individual/PwIy3LVbmxB4Orw_sO8HQwLU87g|/company/12941146/persons-with-significant-control/individual/-1oPefDxXX6tQ8p-CpgPWXLyBLw|/company/15324434/persons-with-significant-control/individual/c1us1ZVCgsuroUnP449c3q_Wfus|/company/14895215/persons-with-significant-control/individual/Es_cutCL_jO6FFXocUgfDKYMspU|/company/15697010/persons-with-significant-control/individual/ygVdeFoF0aadXq0O8HKwhqeHKhI|/company/15697849/persons-with-significant-control/individual/MWBDfUsWvUaI5d-jVD7dz4DF3FM|/company/13871346/persons-with-significant-control/legal-person/8uLJ6b7aCzuk3s2H1YiDflDiGd4)'

This returns results, so some of these at least are present in the bulk snapshot, after all. But of course, these could be previous records from anywhere in the past. Examining data.notified_on shows some matches from 2024-04-30, which makes sense, since if the change was notified on that day, it would have been published to the stream immediately (hopefully). So using a filter of grep '"notified_on":"2024-04-30"' helps with this.

One example resulting from this is:

{"company_number":"15693659","data":{"address":{"address_line_1":"Cossington Road","country":"United Kingdom","locality":"Loughborough","postal_code":"LE12 7RS","premises":"72","region":"Leicestershire"},"country_of_residence":"United Kingdom","date_of_birth":{"month":7,"year":1980},"etag":"902c47bbc00939fc618b48b73a434fc4d3747b40","kind":"individual-person-with-significant-control","links":{"self":"/company/15693659/persons-with-significant-control/individual/D5INTYHwHkEFUG4iv8_Tbvos0a8"},"name":"Mr Paul Jonathon Barker","name_elements":{"forename":"Paul","middle_name":"Jonathon","surname":"Barker","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors"],"notified_on":"2024-04-30"}}

sample.2024-04-30.2024-05.matches.log said /company/15693659/persons-with-significant-control/individual/D5INTYHwHkEFUG4iv8_Tbvos0a8 had no matches in 2024-05, yet here it is in the PSC snapshot for 2024-05-03! That seems wrong. But what of the etag? It is 902c47bbc00939fc618b48b73a434fc4d3747b40. Searching for that indeed yields exactly one match: year=2024/month=04/day=30/psc-prod-2-2024-04-30-08-30-02-6bd3f460-5691-480f-8624-4aede520eb8b. This is indeed not in 2024-05, but is in 2024-04-30…

Looking at the code for Ingester PSC, we find:
https://github.com/openownership/register-ingester-psc/blob/main/lib/register_ingester_psc/records_handler.rb#L42
That is, records are always excluded if repository.get yields a result. Looking at the code for Sources PSC, we find:
https://github.com/openownership/register-sources-psc/blob/main/lib/register_sources_psc/repository.rb#L32
That is, records are searched by etag.

What does this mean? It means that if a record were encountered during the bulk Ingester PSC with the same etag, it would be skipped, and would not appear again in the raw data. But we already found above that some records are republished with different etags—even if they otherwise contains exactly the same data.

Without investigating too much further (in the interests of time), this suggests the following:

  1. Some records published through PSC stream do not appear in PSC bulk data. (Also unexpected.)

So we have found that this is possible, if the etag in fact matched (which is what we would ideally hope). But if the etag did not match, the record would appear again in PSC bulk data. So what are the precise differences in this data, where the etag was different, but a match was found in the bulk (rather than streamed) data, with only one match (to make sampling easier)? We can filter the sample above, like before, to result in these lines only:

sample.2024-04-30.2024-05.matches=1.log

We now have 880 lines we can use as a sample.

@tiredpixel
Copy link
Contributor

What we would like is 2 files, only containing our sample from 2024-04-30, and one containing its match from 2024-05.

Data for 2024-04-30 can be found by modifying the previous script:

#!/usr/bin/env bash
set -Eeuo pipefail

src_d=$1 # e.g. year\=2024/month\=04/day\=30/
dst_d=$2 # e.g. year\=2024/

mapfile -d '' src_fs < <(find "$src_d" \
    \( -name '.*' -prune \) -o -type f -print0 | sort -z)

for f in "${src_fs[@]}" ; do
    f2=$(basename "$f")
    while read -r l ; do
        links_self=$(echo "$l" | jq -r '.data.links.self')
        matches=$( (ag --ignore "$f2" "$links_self" "$dst_d" || true) | wc -l)
        if [ "$matches" -eq 1 ]; then
            echo "$l"
        fi
    done < "$f"
done

Data for 2024-05 can be found by reprocessing the log file, using different parameters:

#!/usr/bin/env bash
set -Eeuo pipefail

log_f=$1 # e.g. sample.2024-04-30.matches=1.log
dst_d=$2 # e.g. year\=2024/month\=05/

while read -r l ; do
    src_f=$(echo "$l" | cut -f1)
    src_f2=$(basename "$src_f")
    links_self=$(echo "$l" | cut -f2)
    ag --ignore "$src_f2" "$links_self" "$dst_d" || true
done < "$log_f"

sample-880.2024-04-30.log
sample-880.2024-05.log

Each of these are in order, and contain 880 lines.

@tiredpixel
Copy link
Contributor

We'd like to compare the 2 files line-by-line. But there are a lot of differences. Using Vimdiff, we can get a general sense of the types of differences which occur:

vimdiff sample-880 2024-04-30 log sample-880 2024-05 log

From this, we can see that some lines differ only by 1 field, but others differ substantially.

Examining these differences, we find that some indeed differ by etag, as found previously. But are the etags always different? We extract data.etag via Jq, add line numbers so the files are in sorted order (without actually changing the sort order), and use Comm to display only those lines in common:

comm -12 <(jq -r '.data.etag' < sample-880.2024-04-30.log | nl) <(jq -r '.data.etag' < sample-880.2024-05.log | nl)
(empty)

That is, etags appear to be always different between streaming PSC and bulk PSC, regardless of the rest of the record!

since the etags are not useful for us, we next eliminate them:

jq -c 'del(.data.etag)' < sample-880.2024-04-30.log > sample-880.2024-04-30.no-etag.jsonl
jq -c 'del(.data.etag)' < sample-880.2024-05.log    > sample-880.2024-05.no-etag.jsonl

sample-880.2024-04-30.no-etag.log
sample-880.2024-05.no-etag.log

Again using Vimdiff to get a sense of the sort of differences:

vimdiff sample-880 2024-04-30 no-etag log sample-880 2024-05 no-etag log

That is, now the etags have been eliminated, there are still lots of differences, but also many lines are identical.

To make comparison easier, we next eliminate those lines which are identical. To do so, we can use Comm. However, Comm requires files to be sorted in order to be compared. We don't want to sort them, since we'd lose the correspondence between the 2 files. So we temporarily add line numbers with Nl, compare them and filter the similarities, and again remove the line numbers:

comm -23 <(nl sample-880.2024-04-30.no-etag.log) <(nl sample-880.2024-05.no-etag.log) | cut -f2 > sample-880.2024-04-30.no-etag.differences.log
comm -13 <(nl sample-880.2024-04-30.no-etag.log) <(nl sample-880.2024-05.no-etag.log) | cut -f2 > sample-880.2024-05.no-etag.differences.log

sample-880.2024-04-30.no-etag.differences.log
sample-880.2024-05.no-etag.differences.log

Each of these files contains 412 lines.

@tiredpixel
Copy link
Contributor

Using Vimdiff:

vimdiff sample-880 2024-04-30 no-etag differences log sample-880 2024-05 no-etag differences log

This is becoming easier to compare. We can spot that in some cases, company_number is null in streaming data, but populated in bulk data. e.g.

{"company_number":null,"data":{"address":{"address_line_1":"Kirkton Avenue","address_line_2":"Blantyre","country":"United Kingdom","locality":"Glasgow","postal_code":"G72 0HR","premises":"140"},"country_of_residence":"Scotland","date_of_birth":{"month":2,"year":1999},"kind":"individual-person-with-significant-control","links":{"self":"/company/SC756576/persons-with-significant-control/individual/Qx9rnxRKgCbmQDJnywF24518QEw"},"name":"Mr Mackenzie Malcolm","name_elements":{"forename":"Mackenzie","surname":"Malcolm","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"],"notified_on":"2023-01-25"}}
{"company_number":"SC756576","data":{"address":{"address_line_1":"Kirkton Avenue","address_line_2":"Blantyre","country":"United Kingdom","locality":"Glasgow","postal_code":"G72 0HR","premises":"140"},"country_of_residence":"Scotland","date_of_birth":{"month":2,"year":1999},"kind":"individual-person-with-significant-control","links":{"self":"/company/SC756576/persons-with-significant-control/individual/Qx9rnxRKgCbmQDJnywF24518QEw"},"name":"Mr Mackenzie Malcolm","name_elements":{"forename":"Mackenzie","surname":"Malcolm","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"],"notified_on":"2023-01-25"}}

This is not a fault in the PSC data itself, but rather a bug in Ingester PSC. It was already fixed (openownership/register-ingester-psc#37). Other than this, all company_number fields match. So, we can eliminate the company_number field, and simplify the sample further:

jq -c 'del(.company_number)' < sample-880.2024-04-30.no-etag.differences.log > sample-880.2024-04-30.no-etag.differences.no-cn.log
jq -c 'del(.company_number)' < sample-880.2024-05.no-etag.differences.log    > sample-880.2024-05.no-etag.differences.no-cn.log

comm -23 <(nl sample-880.2024-04-30.no-etag.differences.no-cn.log) <(nl sample-880.2024-05.no-etag.differences.no-cn.log) | cut -f2 > sample-880.2024-04-30.no-etag.differences.no-cn.differences.log
comm -13 <(nl sample-880.2024-04-30.no-etag.differences.no-cn.log) <(nl sample-880.2024-05.no-etag.differences.no-cn.log) | cut -f2 > sample-880.2024-05.no-etag.differences.no-cn.differences.log

sample-880.2024-04-30.no-etag.differences.no-cn.differences.log
sample-880.2024-05.no-etag.differences.no-cn.differences.log

This leaves 402 lines in each file, with relatively minor differences:

vimdiff sample-880 2024-{04-30,05} no-etag differences no-cn differences log

@tiredpixel
Copy link
Contributor

tiredpixel commented May 25, 2024

At this point, it's likely easier comparing fields in expanded, not compact, JSON form:

jq < sample-880.2024-04-30.no-etag.differences.no-cn.differences.log > sample-402.2024-04-30.log
jq < sample-880.2024-05.no-etag.differences.no-cn.differences.log    > sample-402.2024-05.log

diff sample-402.2024-{04-30,05}.log > sample-402.log

sample-402.log

There are 2984 lines in this diff.

@tiredpixel
Copy link
Contributor

The remaining differences can be broadly grouped into categories. Those are:

ceased on additions

9a10
>     "ceased_on": "2024-05-01",
4589a4512
>     "ceased_on": "2023-10-13",

natures of control changes

60,64c61
<       "ownership-of-shares-25-to-50-percent-as-trust",
<       "ownership-of-shares-25-to-50-percent-as-firm",
<       "voting-rights-25-to-50-percent",
<       "voting-rights-25-to-50-percent-as-trust",
<       "voting-rights-25-to-50-percent-as-firm"
---
>       "voting-rights-25-to-50-percent"

address country or district changes

73d69
<       "country": "England",
1681c1657
<     "country_of_residence": "England",
---
>     "country_of_residence": "United Kingdom",

address specific changes

76c72
<       "premises": "25"
---
>       "premises": "15"

nationality changes

227c217
<     "nationality": "British",
---
>     "nationality": "Portuguese",
2714c2677
<     "nationality": "English,Nigerian",
---
>     "nationality": "British",

minor typographical changes

327c314
<       "address_line_1": "The Winning Box, 27-37 Station Road",
---
>       "address_line_1": "The Winning Box 27-37 Station Road",
12587c12505
<       "title": "Mr,"
---
>       "title": "Mr"

title changes

409c396
<     "name": "Ms Asma Naaz",
---
>     "name": "Miss Asma Naaz",
413c400
<       "title": "Ms"
---
>       "title": "Miss"

forename vs surname reversals

475c463
<     "name": "Mr. Ahmadzai Pachakhan",
---
>     "name": "Mr. Pachakhan Ahmadzai",
477,478c465,466
<       "forename": "Ahmadzai",
<       "surname": "Pachakhan",
---
>       "forename": "Pachakhan",
>       "surname": "Ahmadzai",

name changes

540c528
<     "name": "Mr Benjamin James",
---
>     "name": "Benjamin James Dew",
543,544c531
<       "surname": "James",
<       "title": "Mr"
---
>       "surname": "Dew"
5354c5268
<     "name": "Mr Adrian Iulian Cipcigan",
---
>     "name": "Mr Adrian-Iulian Cipcigan",
5356c5270
<       "forename": "Adrian",
---
>       "forename": "Adrian-Iulian",

postcode changes

561,562c548,549
<       "postal_code": "B60 2AB",
<       "premises": "Maple House"
---
>       "postal_code": "B60 2BG",
>       "premises": "Maple Tree House"

address reformatting

786c770
<       "address_line_1": "Islington Studios,159-163 Marlborough Road",
---
>       "address_line_1": "159-163 Marlborough Road",
791c775
<       "premises": "Islington Studios,159-163 Marlborough Road"
---
>       "premises": "Islington Studios"

address changes to Companies House default

1378,1382c1360,1362
<       "address_line_1": "Hamilton Street",
<       "country": "England",
<       "locality": "Worksop",
<       "postal_code": "S81 7DD",
<       "premises": "30"
---
>       "locality": "Cardiff",
>       "postal_code": "CF14 8LH",
>       "premises": "15538793 - Companies House Default Address"

name spelling changes

2379c2345
<     "name": "Mr Tahir Najib Lone",
---
>     "name": "Mr Tahir Nagib Lone",
5848c5758
<     "name": "Mr Ahmad Jumir",
---
>     "name": "Mr Ahmed Jumir",
5850c5760
<       "forename": "Ahmad",
---
>       "forename": "Ahmed",

address spelling changes

3080c3032
<       "address_line_1": "Longrigg Road",
---
>       "address_line_1": "Long Rigg Road",

address deletions

3634,3638c3579,3582
<       "address_line_1": "Vining Street",
<       "country": "England",
<       "locality": "London",
<       "postal_code": "SW9 8QA",
<       "premises": "18"
---
>       "address_line_1": "..",
>       "locality": "..",
>       "postal_code": "..",
>       "premises": ".."

@tiredpixel
Copy link
Contributor

tiredpixel commented May 25, 2024

Questions

There's a bit of a complexity, here, in that it's not clear in the case of amendments whether the change came from an update via the stream, or only in the bulk data snapshot. In order to understand this better, some of these specific examples given above will be explored manually in order to try to answer these questions:

  1. Is ceased_on present within the stream at any point? Or is it only present in bulk data?
  2. For the records found to differ only by etag, were those matches definitely in the bulk data, rather than the stream?
  3. Where corrections were received, such as to name or address, did these ever come via the stream, or only in bulk data?
  4. Is our transformed BODS data actually up-to-date, properly taking into account name changes, etc.?
  5. What are the consequences of etags not matching between the stream and bulk data on PSC-STM-B6: Add tracking of which records have been transformed #254 and PSC-STM-B7: Switch over to live transformation #255 ?

@tiredpixel
Copy link
Contributor

1. Is ceased_on present within the stream at any point? Or is it only present in bulk data?

ag -c '"ceased_on":"2024-05-25"'
month=05/day=25/psc-prod-2-2024-05-25-05-57-03-ae8f6ea6-4225-49f2-b682-016f3b255e8e:1
month=05/day=25/psc-prod-2-2024-05-25-06-25-03-3ca73436-b28e-4ba7-9fca-57b12d1c6f41:1
month=05/day=25/psc-prod-2-2024-05-25-09-03-02-a2348fb1-ea96-4085-b633-f0fd72ceaf19:1
month=05/day=25/psc-prod-2-2024-05-25-08-57-03-cf5f3b96-53b2-433c-8b84-5df6ac0596df:1
month=05/day=25/psc-prod-2-2024-05-25-09-13-03-958ee50e-0358-4fd3-b9c3-c777f1659848:2
month=05/day=25/psc-prod-2-2024-05-25-10-20-03-be8301d6-ce31-417b-a4c9-d08a2c90611a:1
month=05/day=25/psc-prod-2-2024-05-25-11-15-03-880e74a5-d98b-4015-b7cc-c14ebd1f5276:1

ceased_on is being received through the stream, as recently as today. Additionally, this is not only historical dates, but also includes specifying today (2024-05-25). So everything is fine with this being received via the stream.

@tiredpixel
Copy link
Contributor

2. For the records found to differ only by etag, were those matches definitely in the bulk data, rather than the stream?

comm -12 <(nl sample-880.2024-04-30.no-etag.log) <(nl sample-880.2024-05.no-etag.log) | cut -f2 | jq -r '.data.links.self' | shuf | head -n10
/company/06647587/persons-with-significant-control/individual/UQSTHFiGZJnofCUgULeZKOXWpI0
/company/12751591/persons-with-significant-control/individual/Ks4NZ5gbnbfMkCRIKTGZht7C8AA
/company/15694981/persons-with-significant-control/individual/gOMHoIkyWEs7CEp8peQ59OcwEhI
/company/15127834/persons-with-significant-control/individual/m3G849Vk6p-miVvpbkeMZwVOmaU
/company/15506348/persons-with-significant-control/individual/l9UJH3Hz67Am_AeoNejHpoort-8
/company/15607167/persons-with-significant-control/individual/8nOKF2D4-sobjnLmHg6rYvD3VDA
/company/15601950/persons-with-significant-control/individual/Jh50jGfQxIDGQHsFqZ3VLsnL2t0
/company/15692722/persons-with-significant-control/individual/6Ey9nchMQbHpCvEmlVzodc_8g-k
/company/15695197/persons-with-significant-control/individual/ayvLQFuC9xXCASo7n1PYWZ9GKLo
/company/15607167/persons-with-significant-control/individual/8nOKF2D4-sobjnLmHg6rYvD3VDA

Most of these matched with 2024-05-03. Some, however, matched with very different dates, even as recently as 2024-05-20. e.g /company/15695197/persons-with-significant-control/individual/ayvLQFuC9xXCASo7n1PYWZ9GKLo:

month=04/day=30/psc-prod-2-2024-04-30-11-59-02-dac8c73b-37e1-4682-83aa-0916e41f132b
 6:{"company_number":"15695197","data":{"address":{"address_line_1":"Bracken Close","country":"England","locality":"Burntwood","postal_code":"WS7 9BD","premises":"38"},"country_of_residence":"England","date_of_birth":{"month":7,"year":1993},"etag":"7c2ea9b3e7172c13b2f0a196cbd305c2ff56925e","kind":"individual-person-with-significant-control","links":{"self":"/company/15695197/persons-with-significant-control/individual/ayvLQFuC9xXCASo7n1PYWZ9GKLo"},"name":"Mr Yu Hongwei","name_elements":{"forename":"Yu","surname":"Hongwei","title":"Mr"},"nationality":"Chinese","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors"],"notified_on":"2024-04-30"}}

month=05/day=20/psc-prod-2-2024-05-20-08-21-03-0a9cea0a-82c4-4bfe-b834-58bc8122a1ba
12:{"company_number":"15695197","data":{"address":{"address_line_1":"Bracken Close","country":"England","locality":"Burntwood","postal_code":"WS7 9BD","premises":"38"},"country_of_residence":"England","date_of_birth":{"month":7,"year":1993},"etag":"a0110502947b5aa1d9f2877ceb1e0ef7fcd8623a","kind":"individual-person-with-significant-control","links":{"self":"/company/15695197/persons-with-significant-control/individual/ayvLQFuC9xXCASo7n1PYWZ9GKLo"},"name":"Mr Yu Hongwei","name_elements":{"forename":"Yu","surname":"Hongwei","title":"Mr"},"nationality":"Chinese","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors"],"notified_on":"2024-04-30"}}

It appears that sometimes, records are republished through the stream at a later date, but with no material changes. However, the etag is different.

Searching through the PSC snapshot for those data.links.self and extracting the data.etags:

zcat import_id\=2024_05_03/url_index\=*/part\=part*/file-*.csv.gz | grep -E '(/company/06647587/persons-with-significant-control/individual/UQSTHFiGZJnofCUgULeZKOXWpI0|/company/12751591/persons-with-significant-control/individual/Ks4NZ5gbnbfMkCRIKTGZht7C8AA|/company/15694981/persons-with-significant-control/individual/gOMHoIkyWEs7CEp8peQ59OcwEhI|/company/15127834/persons-with-significant-control/individual/m3G849Vk6p-miVvpbkeMZwVOmaU|/company/15506348/persons-with-significant-control/individual/l9UJH3Hz67Am_AeoNejHpoort-8|/company/15607167/persons-with-significant-control/individual/8nOKF2D4-sobjnLmHg6rYvD3VDA|/company/15601950/persons-with-significant-control/individual/Jh50jGfQxIDGQHsFqZ3VLsnL2t0|/company/15692722/persons-with-significant-control/individual/6Ey9nchMQbHpCvEmlVzodc_8g-k|/company/15695197/persons-with-significant-control/individual/ayvLQFuC9xXCASo7n1PYWZ9GKLo|/company/15607167/persons-with-significant-control/individual/8nOKF2D4-sobjnLmHg6rYvD3VDA)' | jq -r '.data.etag'
7b0ed691ac940b4e3b7f8691b85667d8f3e6cfb1
88111106566ba06908a4ee0e1699a11141942e10
37591116a16da6e19848c643d28e71ca956be23a
fd7400e8cdfb55fb161d4e484024a1731a221516
0f79d5bd443d5454e88eed0b96aea1150f0968d9
4a96f7de664bcc43b086ff821f3c0936055160ba
f0c03e0c3717164814e44965fedc0777e2872388
2be1987bbb0d6e2891adbbff1f8f112323bd5964
7c2ea9b3e7172c13b2f0a196cbd305c2ff56925e

So 9 of these were present within the bulk data snapshot.

The first 6 appear in sample-880.2024-05.etag.log, so these were indeed in the bulk data rather than the stream. The last 3 do not appear, appearing instead in sample-880.2024-04-30.etag.log.

It appears that some of the almost-duplicates (the same except for etag) were received within the bulk data, but some were received within the streamed data.

@tiredpixel
Copy link
Contributor

tiredpixel commented May 25, 2024

3. Where corrections were received, such as to name or address, did these ever come via the stream, or only in bulk data?

2714c2677
<     "nationality": "English,Nigerian",
---
>     "nationality": "British",

This maps to /company/14208721/persons-with-significant-control/individual/l_9CIHxTSzQI-mbJkZsKIOFt6hE. The update appears to have come via bulk data, not via the stream.

327c314
<       "address_line_1": "The Winning Box, 27-37 Station Road",
---
>       "address_line_1": "The Winning Box 27-37 Station Road",

This maps to /company/15644900/persons-with-significant-control/individual/Zgs2yBCHelHVU4rugIgiIYyHL4s. The update appears to have come via bulk data, not via the stream.

409c396
<     "name": "Ms Asma Naaz",
---
>     "name": "Miss Asma Naaz",
413c400
<       "title": "Ms"
---
>       "title": "Miss"

This maps to /company/15481998/persons-with-significant-control/individual/aHGoh9WLbrWv5twQWqthTNYkDlI and also to /company/15489841/persons-with-significant-control/individual/UmNcVVG08FbDmBcDKUN3nzNxs0o. Both updates appear to have come via the stream on 2024-05-01, and also via bulk data on 2024-05-03. In both cases, the etag matched, so the records in the bulk data were skipped.

475c463
<     "name": "Mr. Ahmadzai Pachakhan",
---
>     "name": "Mr. Pachakhan Ahmadzai",
477,478c465,466
<       "forename": "Ahmadzai",
<       "surname": "Pachakhan",
---
>       "forename": "Pachakhan",
>       "surname": "Ahmadzai",

This maps to /company/15693176/persons-with-significant-control/individual/8-YvSZ1YKB3ZwXEj6aX9Np4Lr6k. The update appears to have come via the stream, on 2024-05-22.

540c528
<     "name": "Mr Benjamin James",
---
>     "name": "Benjamin James Dew",
543,544c531
<       "surname": "James",
<       "title": "Mr"
---
>       "surname": "Dew"

This maps to /company/15693210/persons-with-significant-control/individual/6Vs1yjGFUn9_BkuRb0pXKEIHJf4. The update appears to have come via the stream, on 2024-05-10.

5354c5268
<     "name": "Mr Adrian Iulian Cipcigan",
---
>     "name": "Mr Adrian-Iulian Cipcigan",
5356c5270
<       "forename": "Adrian",
---
>       "forename": "Adrian-Iulian",

This maps to /company/10135411/persons-with-significant-control/individual/igR9StULsS53jqsVQMDYJp5sdkE. This update appears to have come via the bulk data, on 2024-05-03.

561,562c548,549
<       "postal_code": "B60 2AB",
<       "premises": "Maple House"
---
>       "postal_code": "B60 2BG",
>       "premises": "Maple Tree House"

This maps to /company/06649403/persons-with-significant-control/individual/L8ABigsMaGylgu4DfYM02RS_kfE. This update appears to have come via the bulk data.

786c770
<       "address_line_1": "Islington Studios,159-163 Marlborough Road",
---
>       "address_line_1": "159-163 Marlborough Road",
791c775
<       "premises": "Islington Studios,159-163 Marlborough Road"
---
>       "premises": "Islington Studios"

This maps to /company/07418222/persons-with-significant-control/individual/rIU21Q2H4aIosqNEZ1Pc-nqjHwo. This update appears to have been via bulk data.

1378,1382c1360,1362
<       "address_line_1": "Hamilton Street",
<       "country": "England",
<       "locality": "Worksop",
<       "postal_code": "S81 7DD",
<       "premises": "30"
---
>       "locality": "Cardiff",
>       "postal_code": "CF14 8LH",
>       "premises": "15538793 - Companies House Default Address"

This maps to /company/15538793/persons-with-significant-control/individual/qnQ2xW1ApJWDoOyvuev78IlFowk. This update appears to have occurred via bulk data.

2379c2345
<     "name": "Mr Tahir Najib Lone",
---
>     "name": "Mr Tahir Nagib Lone",

This maps to /company/15693726/persons-with-significant-control/individual/hzCAh1VXqZUnDOTlM5TYo1N1RcU. This update appears to have occurred via streaming, on 2024-05-10.

5848c5758
<     "name": "Mr Ahmad Jumir",
---
>     "name": "Mr Ahmed Jumir",
5850c5760
<       "forename": "Ahmad",
---
>       "forename": "Ahmed",

This maps to /company/SC808958/persons-with-significant-control/individual/0zqvZBxoBPHSypOeNeBFCdxm1RM. This update appears to have occurred via streaming, on 2024-05-22.

3080c3032
<       "address_line_1": "Longrigg Road",
---
>       "address_line_1": "Long Rigg Road",

This maps to /company/04085243/persons-with-significant-control/individual/eG5mpB-Zkt4HTkYO3N28N6Zzdf8. This update appears to have occurred via streaming, yet on 2024-05-03. Presumably, it was published before the PSC snapshot.

3634,3638c3579,3582
<       "address_line_1": "Vining Street",
<       "country": "England",
<       "locality": "London",
<       "postal_code": "SW9 8QA",
<       "premises": "18"
---
>       "address_line_1": "..",
>       "locality": "..",
>       "postal_code": "..",
>       "premises": ".."

This maps to /company/15546131/persons-with-significant-control/individual/38iJa7Bpn7lHo0oYeVX8lJf8yPs. This update appears to have occurred via bulk data.

So, it appears that such updates to names and addresses indeed also came via the stream, not just via bulk data.

@tiredpixel
Copy link
Contributor

4. Is our transformed BODS data actually up-to-date, properly taking into account name changes, etc.?

2714c2677
<     "nationality": "English,Nigerian",
---
>     "nationality": "British",

https://register.openownership.org/entities/2537715391486389948
Up-to-date in Register.

327c314
<       "address_line_1": "The Winning Box, 27-37 Station Road",
---
>       "address_line_1": "The Winning Box 27-37 Station Road",

https://register.openownership.org/entities/16099118349770729486
Appears to be out-of-date in Register, but it's not entirely clear.

409c396
<     "name": "Ms Asma Naaz",
---
>     "name": "Miss Asma Naaz",
413c400
<       "title": "Ms"
---
>       "title": "Miss"

https://register.openownership.org/entities/2814195974146911352
Not clear whether up-to-date in Register, since titles aren't reported.

475c463
<     "name": "Mr. Ahmadzai Pachakhan",
---
>     "name": "Mr. Pachakhan Ahmadzai",
477,478c465,466
<       "forename": "Ahmadzai",
<       "surname": "Pachakhan",
---
>       "forename": "Pachakhan",
>       "surname": "Ahmadzai",

https://register.openownership.org/entities/7201651826472414285
Appears to be up-to-date in Register, but company number cannot be found.

540c528
<     "name": "Mr Benjamin James",
---
>     "name": "Benjamin James Dew",
543,544c531
<       "surname": "James",
<       "title": "Mr"
---
>       "surname": "Dew"

https://register.openownership.org/entities/14955650631393783568
Up-to-date in Register, but company number cannot be found.

5354c5268
<     "name": "Mr Adrian Iulian Cipcigan",
---
>     "name": "Mr Adrian-Iulian Cipcigan",
5356c5270
<       "forename": "Adrian",
---
>       "forename": "Adrian-Iulian",

https://register.openownership.org/entities/3688790089275359800
Up-to-date in Register.

561,562c548,549
<       "postal_code": "B60 2AB",
<       "premises": "Maple House"
---
>       "postal_code": "B60 2BG",
>       "premises": "Maple Tree House"

Not clear whether up-to-date in Register.

786c770
<       "address_line_1": "Islington Studios,159-163 Marlborough Road",
---
>       "address_line_1": "159-163 Marlborough Road",
791c775
<       "premises": "Islington Studios,159-163 Marlborough Road"
---
>       "premises": "Islington Studios"

https://register.openownership.org/entities/12140554810766935525
Up-to-date in Register.

1378,1382c1360,1362
<       "address_line_1": "Hamilton Street",
<       "country": "England",
<       "locality": "Worksop",
<       "postal_code": "S81 7DD",
<       "premises": "30"
---
>       "locality": "Cardiff",
>       "postal_code": "CF14 8LH",
>       "premises": "15538793 - Companies House Default Address"

https://register.openownership.org/entities/8354898573207053504
Up-to-date in Register, but a curious name: MMJGKDFGHJFG LIMITED.

2379c2345
<     "name": "Mr Tahir Najib Lone",
---
>     "name": "Mr Tahir Nagib Lone",

https://register.openownership.org/entities/235508562390378449
Not up-to-date in Register; however, the update occurred on 2024-05-10, so this is as expected. Company number not found.

5848c5758
<     "name": "Mr Ahmad Jumir",
---
>     "name": "Mr Ahmed Jumir",
5850c5760
<       "forename": "Ahmad",
---
>       "forename": "Ahmed",

https://register.openownership.org/entities/3669122985577466274
Not up-to-date in Register; however, the update occurred on 2024-05-22, so this is as expected. Company number not found.

3080c3032
<       "address_line_1": "Longrigg Road",
---
>       "address_line_1": "Long Rigg Road",

https://register.openownership.org/entities/16250831102261150185
Not up-to-date in Register on HTML page, but seemingly up-to-date in Register in JSON download.

3634,3638c3579,3582
<       "address_line_1": "Vining Street",
<       "country": "England",
<       "locality": "London",
<       "postal_code": "SW9 8QA",
<       "premises": "18"
---
>       "address_line_1": "..",
>       "locality": "..",
>       "postal_code": "..",
>       "premises": ".."

https://register.openownership.org/entities/1544374556635860030
Not up-to-date in Register. But it's also not clear what the address should be.

Overall, most of the above appears to be up-to-date in Register. A couple of cases weren't, but that's as expected, since the update occurred via streaming after the last time bulk Transformer PSC was run (and streaming Transformer PSC, #255, isn't yet live). In a couple of minor cases, Register doesn't appear to be up-to-date—but it's also not entirely clear what the data should be. Various company numbers did not return results via Register search.

@tiredpixel
Copy link
Contributor

5. What are the consequences of etags not matching between the stream and bulk data on #254 and #255 ?

If etags don't match, the update will be processed, rather than skipped. If the rest of the fields in the record are the same, this shouldn't lead to any changes, since the statement ID should be the same (or contain no changes other than metadata, otherwise). So, although etags sometimes varying even with the same underlying records causes issues for testing samples or investigating any issues, it likely won't lead to issues during the deployment of streaming Transformer PSC.

There is, potentially, some considering of event ordering—but this is the same consideration throughout, including when deploying streaming Ingester PSC. So whilst it could well be an issue in certain scenarios, it's a similar type of complication we've been working with already.

@tiredpixel tiredpixel moved this from In Progress to In Testing in Open Ownership Register and BODS pipelines May 25, 2024
@tiredpixel
Copy link
Contributor

tiredpixel commented May 25, 2024

Fields sample

Another sample was taken, without going into as much depth. For that a small file (over 100 lines) ingested via streaming was considered, it's self-link identifiers extracted, and those used to match directly in a PSC snapshot (ignoring the data written to S3 via bulk Ingester PSC). A minimal comparison of fields, confirmed via searching the stream S3 files, showed nothing for the following fields:

  • data.address.po_box
  • data.name_elements.middle_name

I didn't find any other missing fields in the sample I examined, but it's possible that a different sample would highlight more missing fields.

However, it's strange that nothing was found for these in the S3 files, even for data imported via bulk Ingester PSC; that makes me wonder whether those fields are being missed despite being present in the PSC snapshots. If so, that issue would predate the streaming Ingester PSC.

e.g. There is nothing mentioning po_box here:
https://github.com/openownership/register-sources-psc/blob/f8ef872a4b69ab1905bdb5ecf42c052517fa2af4/lib/register_sources_psc/structs/address.rb#L6

e.g. There is nothing mentioning middle_name here:
https://github.com/openownership/register-sources-psc/blob/f8ef872a4b69ab1905bdb5ecf42c052517fa2af4/lib/register_sources_psc/structs/name_elements.rb#L6

Perhaps the data schema used in PSC snapshots changed at some point?

So, it seems that these fields are indeed missing from the stream—but also that we probably weren't using them anyway…

@StephenAbbott
Copy link
Member Author

Issues resolved via #270

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants