Test sample of UK PSC streaming API results against PSC snapshots #264
There are a few things to consider, here:
|
(1), (2) Analysing these files directly could be problematic, since each is a 3.5G compressed file. However, there is no need to consider Athena or similar here: the newer method of combining Register files (#213) fully indexes snapshots and generates each one by appending to the previous snapshot. In addition, comprehensive log files are written. This makes it possible to generate deltas and analyse those instead. Considering the sizes of the snapshots in bytes:
So, dd if='psc.2024-05-03T10:53:43+00:00.jsonl.gz' of='psc.2024-05-03T10:53:43+00:00.jsonl.gz.delta.jsonl.gz' bs=3676957170 skip=1 There, Similarly, dd if='psc.2024-05-05T10:53:40+00:00.jsonl.gz' of='psc.2024-05-05T10:53:40+00:00.jsonl.gz.delta.jsonl.gz' bs=3679878624 skip=1 This results in delta compressed files:
So the sizes of these match. But do the numbers of statements? Since these delta compressed files are only 2.8M and 12M in size, we can easily decompress them and work with the results. These are JSONL files, with one statement per line. Counting the lines shows:
Looking at the logs written by the Register files combiner when these snapshots were generated, we see that
This matches the number of lines in the delta files, so this method seems reasonable. |
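As a minimal sketch of the delta approach described above, assuming each snapshot is the previous snapshot with one or more gzip members appended: the bytes after the previous snapshot's size can be cut out with `tail -c`, which avoids the very large `bs` block size passed to `dd`. The helper name `extract_delta` and the sample data are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: write the bytes of a new snapshot that follow the
# first PREV_SIZE bytes (i.e. everything appended since the previous snapshot).
# $1 = new snapshot, $2 = size of previous snapshot in bytes, $3 = output file
extract_delta() {
  tail -c "+$(( $2 + 1 ))" "$1" > "$3"
}

# Tiny demonstration with made-up data standing in for real snapshots.
tmp=$(mktemp -d)
printf 'a\nb\n' | gzip > "$tmp/old.jsonl.gz"
cp "$tmp/old.jsonl.gz" "$tmp/new.jsonl.gz"
# Appending a second gzip member keeps the file a valid gzip stream.
printf 'c\nd\ne\n' | gzip >> "$tmp/new.jsonl.gz"

old_size=$(wc -c < "$tmp/old.jsonl.gz")
extract_delta "$tmp/new.jsonl.gz" "$old_size" "$tmp/delta.jsonl.gz"

# One statement per line, so the line count is the number of new statements.
delta_lines=$(gzip -dc "$tmp/delta.jsonl.gz" | wc -l)
echo "delta statements: $delta_lines"
```

The delta is only decompressible on its own if the cut falls exactly on a gzip member boundary, which is what the matching file sizes above suggest.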
Here, we can already spot a potential issue: The first statement in {"interestedParty":{"describedByPersonStatement":"1058695637818751597"},"interests":[{"details":"ownership-of-shares-75-to-100-percent","share":{"exclusiveMaximum":false,"exclusiveMinimum":false,"maximum":100,"minimum":75},"startDate":"2024-05-03","type":"shareholding"},{"details":"voting-rights-75-to-100-percent","share":{"exclusiveMaximum":false,"exclusiveMinimum":false,"maximum":100,"minimum":75},"startDate":"2024-05-03","type":"voting-rights"},{"details":"right-to-appoint-and-remove-directors","startDate":"2024-05-03","type":"appointment-of-board"}],"isComponent":false,"publicationDetails":{"bodsVersion":"0.2","license":"https://register.openownership.org/terms-and-conditions","publicationDate":"2024-05-04","publisher":{"name":"OpenOwnership Register","url":"https://register.openownership.org"}},"source":{"assertedBy":null,"description":"GB Persons Of Significant Control Register","retrievedAt":"2024-05-04","type":"officialRegister","url":"https://api.company-information.service.gov.uk/company/15702115/persons-with-significant-control/individual/GIB_iQS1N4Spi5_swikegKrlT3g"},"statementDate":"2024-05-03","statementID":"10017005497377867478","statementType":"ownershipOrControlStatement","subject":{"describedByEntityStatement":"14003992553979663845"}} That has 
{"interestedParty":{"describedByPersonStatement":"12124130482618967232"},"interests":[{"details":"ownership-of-shares-75-to-100-percent","share":{"exclusiveMaximum":false,"exclusiveMinimum":false,"maximum":100,"minimum":75},"startDate":"2024-05-01","type":"shareholding"},{"details":"voting-rights-75-to-100-percent","share":{"exclusiveMaximum":false,"exclusiveMinimum":false,"maximum":100,"minimum":75},"startDate":"2024-05-01","type":"voting-rights"},{"details":"right-to-appoint-and-remove-directors","startDate":"2024-05-01","type":"appointment-of-board"}],"isComponent":false,"publicationDetails":{"bodsVersion":"0.2","license":"https://register.openownership.org/terms-and-conditions","publicationDate":"2024-05-04","publisher":{"name":"OpenOwnership Register","url":"https://register.openownership.org"}},"source":{"assertedBy":null,"description":"GB Persons Of Significant Control Register","retrievedAt":"2024-05-04","type":"officialRegister","url":"https://api.company-information.service.gov.uk/company/15699683/persons-with-significant-control/individual/mpocV74edYjZnyeCoWQmETQZniI"},"statementDate":"2024-05-01","statementID":"10030505153561054686","statementType":"ownershipOrControlStatement","subject":{"describedByEntityStatement":"17878820742214078800"}} Searching the deltas for
At a glance, these seem to contain mostly the same data, but in a different order. Let us park this line of investigation here, for the moment; next, we will consider the raw data coming from PSC. |
(3)In order to examine this, we need to look at a sample of data written to the 2024-04-30 seems a good day to pick. The streaming Ingester PSC files were written to First, we would like to develop a method for comparing the data. Let us pick the first file, {"company_number":"15692709","data":{"address":{"address_line_1":"Vines Cross Way","country":"England","locality":"Skelmersdale","postal_code":"WN8 6HP","premises":"40"},"country_of_residence":"England","date_of_birth":{"month":4,"year":2000},"etag":"18f63af65e21104d901ffc760f0f3324a628ae06","kind":"individual-person-with-significant-control","links":{"self":"/company/15692709/persons-with-significant-control/individual/fPjFLkE9nnly_aKoDopjs6x6jDk"},"name":"Mr Josh Peter Reeb","name_elements":{"forename":"Josh","surname":"Reeb","title":"Mr"},"nationality":"English","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors"],"notified_on":"2024-04-30"}} This has That record also has Let us instead pick this later file, {"company_number":"15692730","data":{"address":{"address_line_1":"Canterbury Avenue","country":"England","locality":"Ilford","postal_code":"IG1 3NG","premises":"82"},"country_of_residence":"England","date_of_birth":{"month":4,"year":1995},"etag":"3055b641a603f780633289dd50855844a6139177","kind":"individual-person-with-significant-control","links":{"self":"/company/15692730/persons-with-significant-control/individual/XIasrLbXq7EQ-MuvC0n_IWkBL-0"},"name":"Mr Altaf Hyder","name_elements":{"forename":"Altaf","surname":"Hyder","title":"Mr"},"nationality":"Indian","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors"],"notified_on":"2024-04-30"}} Again, Can we find a line which actually has a match? 
Yes, in {"company_number":"11451665","data":{"address":{"address_line_1":"Shire Hill","country":"United Kingdom","locality":"Saffron Walden","postal_code":"CB11 3AQ","premises":"Business & Technology Centre"},"country_of_residence":"United Kingdom","date_of_birth":{"month":7,"year":1972},"etag":"82c6adc78824b3fade3ea2ac6bf58a4d62b4d428","kind":"individual-person-with-significant-control","links":{"self":"/company/11451665/persons-with-significant-control/individual/kMt3JLrqsv8_0nqrvIFtbhM2Bwg"},"name":"Mr Garry Mark Reed","name_elements":{"forename":"Garry","surname":"Reed","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent"],"notified_on":"2018-07-06"}} Again, {"company_number":"11451665","data":{"address":{"address_line_1":"Shire Hill","country":"United Kingdom","locality":"Saffron Walden","postal_code":"CB11 3AQ","premises":"Business & Technology Centre"},"country_of_residence":"United Kingdom","date_of_birth":{"month":7,"year":1972},"etag":"8d4d3ee8f26b2ad575305c57f74a728fa1e29438","kind":"individual-person-with-significant-control","links":{"self":"/company/11451665/persons-with-significant-control/individual/kMt3JLrqsv8_0nqrvIFtbhM2Bwg"},"name":"Mr Garry Mark Reed","name_elements":{"forename":"Garry","surname":"Reed","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent"],"notified_on":"2018-07-06"}} We can see that 16c16
< "etag": "82c6adc78824b3fade3ea2ac6bf58a4d62b4d428",
---
> "etag": "8d4d3ee8f26b2ad575305c57f74a728fa1e29438", This gives us our first major findings:
(2) is a pain. It means we cannot simply match that record reliably, nor could we deduplicate the record based on that even if we had such functionality. (3) is curious. Perhaps records changed, and then got changed back prior to bulk data export? Or perhaps there's some other reason why entire records are missing. What we need is some way to match records based on |
Examining a large number of files through a filter,
For each of these possible values, examining a line of raw data shows that |
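In order to examine the number of matches for each
#!/usr/bin/env bash
set -Eeuo pipefail
src_d=$1 # e.g. year\=2024/month\=04/day\=30/
dst_d=$2 # e.g. year\=2024/
# Collect the source files (skipping dotfiles) in a stable, sorted order.
mapfile -d '' src_fs < <(find "$src_d" \
  \( -name '.*' -prune \) -o -type f -print0 | sort -z)
for f in "${src_fs[@]}" ; do
  # Each line of a source file is one raw PSC record.
  while read -r l ; do
    links_self=$(echo "$l" | jq -r '.data.links.self')
    # Count the lines under dst_d which mention this self link.
    # grep exits non-zero when nothing matches, hence the "|| true".
    matches=$( (grep -r "$links_self" "$dst_d" || true) | wc -l)
    echo -e "$f\t$links_self\t$matches"
  done < "$f"
done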
Grep takes 23s. Replacing Grep with Ack:
matches=$( (ack "$links_self" "$dst_d" || true) | wc -l)
Ack takes even longer: 48s. Replacing Ack with Ag (The Silver Searcher):
matches=$( (ag "$links_self" "$dst_d" || true) | wc -l)
This takes less than 10s, so we'll use that instead (portability isn't a concern, here). Using Ag, we can do slightly better, and use That gives us the following final script:
#!/usr/bin/env bash
set -Eeuo pipefail
src_d=$1 # e.g. year\=2024/month\=04/day\=30/
dst_d=$2 # e.g. year\=2024/
mapfile -d '' src_fs < <(find "$src_d" \
  \( -name '.*' -prune \) -o -type f -print0 | sort -z)
for f in "${src_fs[@]}" ; do
  f2=$(basename "$f")
  while read -r l ; do
    links_self=$(echo "$l" | jq -r '.data.links.self')
    matches=$( (ag --ignore "$f2" "$links_self" "$dst_d" || true) | wc -l)
    echo -e "$f\t$links_self\t$matches"
  done < "$f"
done
|
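Since the script above launches one search per record, a single-pass variant may be worth considering. The following is only a sketch of a different technique, not the method used in this issue: it assumes the raw files are compact JSON in which each record contains `"self":"..."` exactly as in the examples above, and the helper name `count_self_links` is hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print "<count><TAB><self link>" for every self link
# found anywhere under the directory given as $1, in a single pass.
count_self_links() {
  { grep -rhoE '"self":"[^"]+"' "$1" || true; } \
    | cut -d'"' -f4 \
    | sort \
    | uniq -c \
    | awk '{ print $1 "\t" $2 }'
}

# Tiny demonstration with made-up records.
tmp=$(mktemp -d)
printf '%s\n' \
  '{"data":{"links":{"self":"/company/1/x/A"}}}' \
  '{"data":{"links":{"self":"/company/2/x/B"}}}' > "$tmp/f1"
printf '%s\n' \
  '{"data":{"links":{"self":"/company/1/x/A"}}}' > "$tmp/f2"

counts=$(count_self_links "$tmp")
printf '%s\n' "$counts"
```

Note the semantics differ from the script above: this counts every occurrence in the tree, including the source record itself when the source directory sits under the searched directory, so a count of 1 there corresponds to 0 matches in the original log.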
Using this script, we can identify the number of matches for each This results in 9130 lines, which is what we expect from the raw data. But there are lots of lines with 0 matches: 6678 lines, in fact! Here, it's worth sanity-checking. Picking some of these lines supposedly without matches randomly: ag '\t0$' sample.2024-04-30.matches.log | shuf | head -n10
Manually searching for each of these There are 1057 lines with more than 1 match. This is entirely possible, |
Looking into Rerunning the script with different parameters, we can identify the number of matches for each sample.2024-04-30.2024-05.matches.log Again, this has 9130 lines. But the number of matches is even worse: 7938 Another sanity-check is merited. A random sample of lines is:
Another manual search confirms no matches for these within 2024-05 raw data. This is highly unexpected. At this point, we check the original source data contained in the PSC snapshots files downloaded when bulk Ingester PSC was run—that is, prior to any processing or restreaming whatsoever. These files are contained in However, comparing the format of these files, it's clear that in fact these are after the first level of processing. It shouldn't make a difference since that should just be splitting the files, but it turns out that the original PSC snapshots are still available at the old URLs, even though they are no longer listed on the PSC snapshots page. We iterate through these and download them: for i in $(seq 1 26); do axel "https://download.companieshouse.gov.uk/psc-snapshot-2024-05-03_${i}of26.zip"; done This yields the following ZIP files:
But is this the right PSC snapshot? Comparing the S3 split files to the PSC snapshot just obtained: zcat import_id\=2024_05_03/url_index\=*/part\=part*/file-*.csv.gz | wc -l
unzip -p psc-snapshot-2024-05-03/\*.zip | wc -l
So both methods contain exactly 12902790 records. Thus, this is the right snapshot. Searching the S3 snapshot for the missing zcat import_id\=2024_05_03/url_index\=*/part\=part*/file-*.csv.gz | grep -E '(/company/15693659/persons-with-significant-control/individual/D5INTYHwHkEFUG4iv8_Tbvos0a8|/company/15696393/persons-with-significant-control/individual/fRUzlOBkjeGWtNMxm3uE9SAsN4Q|/company/06911744/persons-with-significant-control/individual/tZx9vWQNhaJ08kazW7kecO9_heA|/company/15694677/persons-with-significant-control/individual/PwIy3LVbmxB4Orw_sO8HQwLU87g|/company/12941146/persons-with-significant-control/individual/-1oPefDxXX6tQ8p-CpgPWXLyBLw|/company/15324434/persons-with-significant-control/individual/c1us1ZVCgsuroUnP449c3q_Wfus|/company/14895215/persons-with-significant-control/individual/Es_cutCL_jO6FFXocUgfDKYMspU|/company/15697010/persons-with-significant-control/individual/ygVdeFoF0aadXq0O8HKwhqeHKhI|/company/15697849/persons-with-significant-control/individual/MWBDfUsWvUaI5d-jVD7dz4DF3FM|/company/13871346/persons-with-significant-control/legal-person/8uLJ6b7aCzuk3s2H1YiDflDiGd4)' This returns results, so some of these at least are present in the bulk snapshot, after all. But of course, these could be previous records from anywhere in the past. 
Examining One example resulting from this is: {"company_number":"15693659","data":{"address":{"address_line_1":"Cossington Road","country":"United Kingdom","locality":"Loughborough","postal_code":"LE12 7RS","premises":"72","region":"Leicestershire"},"country_of_residence":"United Kingdom","date_of_birth":{"month":7,"year":1980},"etag":"902c47bbc00939fc618b48b73a434fc4d3747b40","kind":"individual-person-with-significant-control","links":{"self":"/company/15693659/persons-with-significant-control/individual/D5INTYHwHkEFUG4iv8_Tbvos0a8"},"name":"Mr Paul Jonathon Barker","name_elements":{"forename":"Paul","middle_name":"Jonathon","surname":"Barker","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors"],"notified_on":"2024-04-30"}}
Looking at the code for Ingester PSC, we find: What does this mean? It means that if a record were encountered during the bulk Ingester PSC with the same Without investigating too much further (in the interests of time), this suggests the following:
So we have found that this is possible, if the sample.2024-04-30.2024-05.matches=1.log We now have 880 lines we can use as a sample. |
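The command used to produce the matches=1 log is not shown here. One way to filter a matches log by its count column, assuming the tab-separated `file`, `self link`, `matches` layout written by the script earlier, is sketched below; the helper name `filter_by_matches` is hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: keep only the log lines whose third tab-separated
# field (the match count) equals the number given as $2.
filter_by_matches() {
  awk -F'\t' -v n="$2" '$3 == n' "$1"
}

# Tiny demonstration with a made-up matches log.
tmp=$(mktemp -d)
printf 'f1\t/company/1/x/A\t0\nf1\t/company/2/x/B\t1\nf2\t/company/3/x/C\t2\n' \
  > "$tmp/matches.log"

exact=$(filter_by_matches "$tmp/matches.log" 1)
printf '%s\n' "$exact"
```

The comparison is numeric, so any padding that `wc -l` adds to the count on some platforms does not affect the result.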
What we would like is 2 files: one containing only our sample from 2024-04-30, and one containing its matches from 2024-05. Data for 2024-04-30 can be found by modifying the previous script:
#!/usr/bin/env bash
set -Eeuo pipefail
src_d=$1 # e.g. year\=2024/month\=04/day\=30/
dst_d=$2 # e.g. year\=2024/
mapfile -d '' src_fs < <(find "$src_d" \
  \( -name '.*' -prune \) -o -type f -print0 | sort -z)
for f in "${src_fs[@]}" ; do
  f2=$(basename "$f")
  while read -r l ; do
    links_self=$(echo "$l" | jq -r '.data.links.self')
    matches=$( (ag --ignore "$f2" "$links_self" "$dst_d" || true) | wc -l)
    # Print the raw record only when it has exactly one match elsewhere.
    if [ "$matches" -eq 1 ]; then
      echo "$l"
    fi
  done < "$f"
done
Data for 2024-05 can be found by reprocessing the log file, using different parameters:
#!/usr/bin/env bash
set -Eeuo pipefail
log_f=$1 # e.g. sample.2024-04-30.matches=1.log
dst_d=$2 # e.g. year\=2024/month\=05/
while read -r l ; do
  # Log lines are tab-separated: source file, self link, match count.
  src_f=$(echo "$l" | cut -f1)
  src_f2=$(basename "$src_f")
  links_self=$(echo "$l" | cut -f2)
  ag --ignore "$src_f2" "$links_self" "$dst_d" || true
done < "$log_f"
sample-880.2024-04-30.log
Each of these is in order, and contains 880 lines. |
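Because the later comparison relies on line N of one file corresponding to line N of the other, it may be worth checking that alignment explicitly. A rough sketch follows, assuming compact JSONL where each line contains one `"self":"..."` value; the helper names `self_links` and `check_alignment` are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print the self link found on each line of a compact JSONL file.
self_links() {
  sed -E 's/.*"self":"([^"]+)".*/\1/' "$1"
}

# Hypothetical helper: print the numbers of the lines whose self links differ
# between the two files given as $1 and $2 (no output means fully aligned).
check_alignment() {
  a=$(mktemp)
  b=$(mktemp)
  self_links "$1" > "$a"
  self_links "$2" > "$b"
  paste "$a" "$b" | awk -F'\t' '$1 != $2 { print NR }'
  rm -f "$a" "$b"
}

# Tiny demonstration with made-up records; line 2 is deliberately misaligned.
tmp=$(mktemp -d)
printf '%s\n' \
  '{"data":{"etag":"a1","links":{"self":"/company/1/x/A"}}}' \
  '{"data":{"etag":"b1","links":{"self":"/company/2/x/B"}}}' \
  '{"data":{"etag":"c1","links":{"self":"/company/3/x/C"}}}' > "$tmp/left.jsonl"
printf '%s\n' \
  '{"data":{"etag":"a2","links":{"self":"/company/1/x/A"}}}' \
  '{"data":{"etag":"z9","links":{"self":"/company/9/x/Z"}}}' \
  '{"data":{"etag":"c2","links":{"self":"/company/3/x/C"}}}' > "$tmp/right.jsonl"

mismatches=$(check_alignment "$tmp/left.jsonl" "$tmp/right.jsonl")
echo "misaligned lines: $mismatches"
```

If the two files have different lengths, `paste` pads the shorter one with empty fields, so the surplus lines are reported as mismatches too.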
We'd like to compare the 2 files line-by-line. But there are a lot of differences. Using Vimdiff, we can get a general sense of the types of differences which occur: From this, we can see that some lines differ only by 1 field, but others differ substantially. Examining these differences, we find that some indeed differ by comm -12 <(jq -r '.data.etag' < sample-880.2024-04-30.log | nl) <(jq -r '.data.etag' < sample-880.2024-05.log | nl)
That is, since the
jq -c 'del(.data.etag)' < sample-880.2024-04-30.log > sample-880.2024-04-30.no-etag.log
jq -c 'del(.data.etag)' < sample-880.2024-05.log > sample-880.2024-05.no-etag.log
sample-880.2024-04-30.no-etag.log
Again using Vimdiff to get a sense of the sort of differences: That is, now the
To make comparison easier, we next eliminate those lines which are identical. To do so, we can use Comm. However, Comm requires files to be sorted in order to be compared. We don't want to sort them, since we'd lose the correspondence between the 2 files. So we temporarily add line numbers with Nl, compare the numbered files and filter out the identical lines, and then remove the line numbers again:
comm -23 <(nl sample-880.2024-04-30.no-etag.log) <(nl sample-880.2024-05.no-etag.log) | cut -f2 > sample-880.2024-04-30.no-etag.differences.log
comm -13 <(nl sample-880.2024-04-30.no-etag.log) <(nl sample-880.2024-05.no-etag.log) | cut -f2 > sample-880.2024-05.no-etag.differences.log
sample-880.2024-04-30.no-etag.differences.log
Each of these files contains 412 lines. |
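The Nl and Comm trick can be illustrated on a tiny made-up pair of files. This is a sketch only: `LC_ALL=C` is set as a precaution, on the assumption that some locales collate the space-padded numbers from `nl` differently and make `comm` complain about unsorted input. Temporary files stand in for the process substitutions used above.

```shell
#!/usr/bin/env bash
set -eu
# Byte-wise collation, so the padded line numbers written by nl count as sorted.
export LC_ALL=C

# Made-up files: identical except for line 2.
tmp=$(mktemp -d)
printf 'same-1\nold-2\nsame-3\n' > "$tmp/a.log"
printf 'same-1\nnew-2\nsame-3\n' > "$tmp/b.log"

# Prefix every line (-ba numbers blank lines too) with "<number><TAB>".
nl -ba "$tmp/a.log" > "$tmp/a.nl"
nl -ba "$tmp/b.log" > "$tmp/b.nl"

# Keep the lines unique to each side, then strip the line numbers again.
only_a=$(comm -23 "$tmp/a.nl" "$tmp/b.nl" | cut -f2)
only_b=$(comm -13 "$tmp/a.nl" "$tmp/b.nl" | cut -f2)
echo "only in a: $only_a"
echo "only in b: $only_b"
```

Because a line only counts as common when both its number and its content agree, the two outputs stay in step: the Nth line of one is the counterpart of the Nth line of the other. `cut -f2` keeps just the first field after the number, which is safe as long as the records contain no raw tab characters.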
Using Vimdiff: This is becoming easier to compare. We can spot that in some cases, {"company_number":null,"data":{"address":{"address_line_1":"Kirkton Avenue","address_line_2":"Blantyre","country":"United Kingdom","locality":"Glasgow","postal_code":"G72 0HR","premises":"140"},"country_of_residence":"Scotland","date_of_birth":{"month":2,"year":1999},"kind":"individual-person-with-significant-control","links":{"self":"/company/SC756576/persons-with-significant-control/individual/Qx9rnxRKgCbmQDJnywF24518QEw"},"name":"Mr Mackenzie Malcolm","name_elements":{"forename":"Mackenzie","surname":"Malcolm","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"],"notified_on":"2023-01-25"}} {"company_number":"SC756576","data":{"address":{"address_line_1":"Kirkton Avenue","address_line_2":"Blantyre","country":"United Kingdom","locality":"Glasgow","postal_code":"G72 0HR","premises":"140"},"country_of_residence":"Scotland","date_of_birth":{"month":2,"year":1999},"kind":"individual-person-with-significant-control","links":{"self":"/company/SC756576/persons-with-significant-control/individual/Qx9rnxRKgCbmQDJnywF24518QEw"},"name":"Mr Mackenzie Malcolm","name_elements":{"forename":"Mackenzie","surname":"Malcolm","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"],"notified_on":"2023-01-25"}} This is not a fault in the PSC data itself, but rather a bug in Ingester PSC. It was already fixed (openownership/register-ingester-psc#37). Other than this, all jq -c 'del(.company_number)' < sample-880.2024-04-30.no-etag.differences.log > sample-880.2024-04-30.no-etag.differences.no-cn.log
jq -c 'del(.company_number)' < sample-880.2024-05.no-etag.differences.log > sample-880.2024-05.no-etag.differences.no-cn.log
comm -23 <(nl sample-880.2024-04-30.no-etag.differences.no-cn.log) <(nl sample-880.2024-05.no-etag.differences.no-cn.log) | cut -f2 > sample-880.2024-04-30.no-etag.differences.no-cn.differences.log
comm -13 <(nl sample-880.2024-04-30.no-etag.differences.no-cn.log) <(nl sample-880.2024-05.no-etag.differences.no-cn.log) | cut -f2 > sample-880.2024-05.no-etag.differences.no-cn.differences.log sample-880.2024-04-30.no-etag.differences.no-cn.differences.log This leaves 402 lines in each file, with relatively minor differences: |
At this point, it's likely easier to compare fields in expanded, not compact, JSON form:
jq . < sample-880.2024-04-30.no-etag.differences.no-cn.differences.log > sample-402.2024-04-30.log
jq . < sample-880.2024-05.no-etag.differences.no-cn.differences.log > sample-402.2024-05.log
diff sample-402.2024-{04-30,05}.log > sample-402.log
There are 2984 lines in this diff. |
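A diff of this size is easier to survey with a rough tally of which JSON keys appear on changed lines. The sketch below assumes the default `diff` output format over pretty-printed JSON (one `"key": value` per line); the helper name `tally_changed_fields` and the sample diff are made up for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print "<count><TAB><key>" for each JSON key that
# appears on a changed ("<" or ">") line of the diff file given as $1.
tally_changed_fields() {
  sed -nE 's/^[<>] +"([^"]+)":.*/\1/p' "$1" \
    | sort \
    | uniq -c \
    | awk '{ print $1 "\t" $2 }'
}

# Tiny demonstration with a made-up diff.
tmp=$(mktemp -d)
cat > "$tmp/sample.diff" <<'EOF'
9a10
>   "ceased_on": "2024-05-01",
227c217
<     "nationality": "British",
---
>     "nationality": "Portuguese",
60,61c61
<       "voting-rights-25-to-50-percent",
<       "voting-rights-25-to-50-percent-as-firm"
---
>       "voting-rights-25-to-50-percent"
EOF

tally=$(tally_changed_fields "$tmp/sample.diff")
printf '%s\n' "$tally"
```

Array elements such as the natures of control have no key on their own line, so they are not counted; this gives a first impression only and does not replace reading the diff.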
The remaining differences can be broadly grouped into categories. Those are:

ceased on additions
9a10
> "ceased_on": "2024-05-01",
4589a4512
> "ceased_on": "2023-10-13",

natures of control changes
60,64c61
< "ownership-of-shares-25-to-50-percent-as-trust",
< "ownership-of-shares-25-to-50-percent-as-firm",
< "voting-rights-25-to-50-percent",
< "voting-rights-25-to-50-percent-as-trust",
< "voting-rights-25-to-50-percent-as-firm"
---
> "voting-rights-25-to-50-percent"

address country or district changes
73d69
< "country": "England",
1681c1657
< "country_of_residence": "England",
---
> "country_of_residence": "United Kingdom",

address specific changes
76c72
< "premises": "25"
---
> "premises": "15"

nationality changes
227c217
< "nationality": "British",
---
> "nationality": "Portuguese",
2714c2677
< "nationality": "English,Nigerian",
---
> "nationality": "British",

minor typographical changes
327c314
< "address_line_1": "The Winning Box, 27-37 Station Road",
---
> "address_line_1": "The Winning Box 27-37 Station Road",
12587c12505
< "title": "Mr,"
---
> "title": "Mr"

title changes
409c396
< "name": "Ms Asma Naaz",
---
> "name": "Miss Asma Naaz",
413c400
< "title": "Ms"
---
> "title": "Miss"

forename vs surname reversals
475c463
< "name": "Mr. Ahmadzai Pachakhan",
---
> "name": "Mr. Pachakhan Ahmadzai",
477,478c465,466
< "forename": "Ahmadzai",
< "surname": "Pachakhan",
---
> "forename": "Pachakhan",
> "surname": "Ahmadzai",

name changes
540c528
< "name": "Mr Benjamin James",
---
> "name": "Benjamin James Dew",
543,544c531
< "surname": "James",
< "title": "Mr"
---
> "surname": "Dew"
5354c5268
< "name": "Mr Adrian Iulian Cipcigan",
---
> "name": "Mr Adrian-Iulian Cipcigan",
5356c5270
< "forename": "Adrian",
---
> "forename": "Adrian-Iulian",

postcode changes
561,562c548,549
< "postal_code": "B60 2AB",
< "premises": "Maple House"
---
> "postal_code": "B60 2BG",
> "premises": "Maple Tree House"

address reformatting
786c770
< "address_line_1": "Islington Studios,159-163 Marlborough Road",
---
> "address_line_1": "159-163 Marlborough Road",
791c775
< "premises": "Islington Studios,159-163 Marlborough Road"
---
> "premises": "Islington Studios"

address changes to Companies House default
1378,1382c1360,1362
< "address_line_1": "Hamilton Street",
< "country": "England",
< "locality": "Worksop",
< "postal_code": "S81 7DD",
< "premises": "30"
---
> "locality": "Cardiff",
> "postal_code": "CF14 8LH",
> "premises": "15538793 - Companies House Default Address"

name spelling changes
2379c2345
< "name": "Mr Tahir Najib Lone",
---
> "name": "Mr Tahir Nagib Lone",
5848c5758
< "name": "Mr Ahmad Jumir",
---
> "name": "Mr Ahmed Jumir",
5850c5760
< "forename": "Ahmad",
---
> "forename": "Ahmed",

address spelling changes
3080c3032
< "address_line_1": "Longrigg Road",
---
> "address_line_1": "Long Rigg Road",

address deletions
3634,3638c3579,3582
< "address_line_1": "Vining Street",
< "country": "England",
< "locality": "London",
< "postal_code": "SW9 8QA",
< "premises": "18"
---
> "address_line_1": "..",
> "locality": "..",
> "postal_code": "..",
> "premises": ".."
|
Questions
There is some complexity here: in the case of amendments, it's not clear whether the change came from an update via the stream, or only appeared in the bulk data snapshot. To understand this better, some of the specific examples given above will be explored manually, to try to answer these questions:
|
1. Is ceased_on present within the stream at any point? Or is it only present in bulk data?
ag -c '"ceased_on":"2024-05-25"'
|
2. For the records found to differ only by etag, were those matches definitely in the bulk data, rather than the stream?
comm -12 <(nl sample-880.2024-04-30.no-etag.log) <(nl sample-880.2024-05.no-etag.log) | cut -f2 | jq -r '.data.links.self' | shuf | head -n10
Most of these matched with 2024-05-03. Some, however, matched with very different dates, even as recently as 2024-05-20, e.g.
It appears that sometimes, records are republished through the stream at a later date, but with no material changes. However, the Searching through the PSC snapshot for those zcat import_id\=2024_05_03/url_index\=*/part\=part*/file-*.csv.gz | grep -E '(/company/06647587/persons-with-significant-control/individual/UQSTHFiGZJnofCUgULeZKOXWpI0|/company/12751591/persons-with-significant-control/individual/Ks4NZ5gbnbfMkCRIKTGZht7C8AA|/company/15694981/persons-with-significant-control/individual/gOMHoIkyWEs7CEp8peQ59OcwEhI|/company/15127834/persons-with-significant-control/individual/m3G849Vk6p-miVvpbkeMZwVOmaU|/company/15506348/persons-with-significant-control/individual/l9UJH3Hz67Am_AeoNejHpoort-8|/company/15607167/persons-with-significant-control/individual/8nOKF2D4-sobjnLmHg6rYvD3VDA|/company/15601950/persons-with-significant-control/individual/Jh50jGfQxIDGQHsFqZ3VLsnL2t0|/company/15692722/persons-with-significant-control/individual/6Ey9nchMQbHpCvEmlVzodc_8g-k|/company/15695197/persons-with-significant-control/individual/ayvLQFuC9xXCASo7n1PYWZ9GKLo|/company/15607167/persons-with-significant-control/individual/8nOKF2D4-sobjnLmHg6rYvD3VDA)' | jq -r '.data.etag'
So 9 of these were present within the bulk data snapshot. The first 6 appear in It appears that some of the almost-duplicates (the same except for |
3. Where corrections were received, such as to name or address, did these ever come via the stream, or only in bulk data?
2714c2677
< "nationality": "English,Nigerian",
---
> "nationality": "British",
This maps to
327c314
< "address_line_1": "The Winning Box, 27-37 Station Road",
---
> "address_line_1": "The Winning Box 27-37 Station Road",
This maps to
409c396
< "name": "Ms Asma Naaz",
---
> "name": "Miss Asma Naaz",
413c400
< "title": "Ms"
---
> "title": "Miss"
This maps to
475c463
< "name": "Mr. Ahmadzai Pachakhan",
---
> "name": "Mr. Pachakhan Ahmadzai",
477,478c465,466
< "forename": "Ahmadzai",
< "surname": "Pachakhan",
---
> "forename": "Pachakhan",
> "surname": "Ahmadzai",
This maps to
540c528
< "name": "Mr Benjamin James",
---
> "name": "Benjamin James Dew",
543,544c531
< "surname": "James",
< "title": "Mr"
---
> "surname": "Dew"
This maps to
5354c5268
< "name": "Mr Adrian Iulian Cipcigan",
---
> "name": "Mr Adrian-Iulian Cipcigan",
5356c5270
< "forename": "Adrian",
---
> "forename": "Adrian-Iulian",
This maps to
561,562c548,549
< "postal_code": "B60 2AB",
< "premises": "Maple House"
---
> "postal_code": "B60 2BG",
> "premises": "Maple Tree House"
This maps to
786c770
< "address_line_1": "Islington Studios,159-163 Marlborough Road",
---
> "address_line_1": "159-163 Marlborough Road",
791c775
< "premises": "Islington Studios,159-163 Marlborough Road"
---
> "premises": "Islington Studios"
This maps to
1378,1382c1360,1362
< "address_line_1": "Hamilton Street",
< "country": "England",
< "locality": "Worksop",
< "postal_code": "S81 7DD",
< "premises": "30"
---
> "locality": "Cardiff",
> "postal_code": "CF14 8LH",
> "premises": "15538793 - Companies House Default Address"
This maps to
2379c2345
< "name": "Mr Tahir Najib Lone",
---
> "name": "Mr Tahir Nagib Lone",
This maps to
5848c5758
< "name": "Mr Ahmad Jumir",
---
> "name": "Mr Ahmed Jumir",
5850c5760
< "forename": "Ahmad",
---
> "forename": "Ahmed",
This maps to
3080c3032
< "address_line_1": "Longrigg Road",
---
> "address_line_1": "Long Rigg Road",
This maps to
3634,3638c3579,3582
< "address_line_1": "Vining Street",
< "country": "England",
< "locality": "London",
< "postal_code": "SW9 8QA",
< "premises": "18"
---
> "address_line_1": "..",
> "locality": "..",
> "postal_code": "..",
> "premises": ".."
This maps to
So, it appears that such updates to names and addresses indeed also came via the stream, not just via bulk data. |
4. Is our transformed BODS data actually up-to-date, properly taking into account name changes, etc.?
2714c2677
< "nationality": "English,Nigerian",
---
> "nationality": "British",
https://register.openownership.org/entities/2537715391486389948
327c314
< "address_line_1": "The Winning Box, 27-37 Station Road",
---
> "address_line_1": "The Winning Box 27-37 Station Road",
https://register.openownership.org/entities/16099118349770729486
409c396
< "name": "Ms Asma Naaz",
---
> "name": "Miss Asma Naaz",
413c400
< "title": "Ms"
---
> "title": "Miss"
https://register.openownership.org/entities/2814195974146911352
475c463
< "name": "Mr. Ahmadzai Pachakhan",
---
> "name": "Mr. Pachakhan Ahmadzai",
477,478c465,466
< "forename": "Ahmadzai",
< "surname": "Pachakhan",
---
> "forename": "Pachakhan",
> "surname": "Ahmadzai",
https://register.openownership.org/entities/7201651826472414285
540c528
< "name": "Mr Benjamin James",
---
> "name": "Benjamin James Dew",
543,544c531
< "surname": "James",
< "title": "Mr"
---
> "surname": "Dew"
https://register.openownership.org/entities/14955650631393783568
5354c5268
< "name": "Mr Adrian Iulian Cipcigan",
---
> "name": "Mr Adrian-Iulian Cipcigan",
5356c5270
< "forename": "Adrian",
---
> "forename": "Adrian-Iulian",
https://register.openownership.org/entities/3688790089275359800
561,562c548,549
< "postal_code": "B60 2AB",
< "premises": "Maple House"
---
> "postal_code": "B60 2BG",
> "premises": "Maple Tree House"
Not clear whether up-to-date in Register.
786c770
< "address_line_1": "Islington Studios,159-163 Marlborough Road",
---
> "address_line_1": "159-163 Marlborough Road",
791c775
< "premises": "Islington Studios,159-163 Marlborough Road"
---
> "premises": "Islington Studios"
https://register.openownership.org/entities/12140554810766935525
1378,1382c1360,1362
< "address_line_1": "Hamilton Street",
< "country": "England",
< "locality": "Worksop",
< "postal_code": "S81 7DD",
< "premises": "30"
---
> "locality": "Cardiff",
> "postal_code": "CF14 8LH",
> "premises": "15538793 - Companies House Default Address"
https://register.openownership.org/entities/8354898573207053504
2379c2345
< "name": "Mr Tahir Najib Lone",
---
> "name": "Mr Tahir Nagib Lone",
https://register.openownership.org/entities/235508562390378449
5848c5758
< "name": "Mr Ahmad Jumir",
---
> "name": "Mr Ahmed Jumir",
5850c5760
< "forename": "Ahmad",
---
> "forename": "Ahmed",
https://register.openownership.org/entities/3669122985577466274
3080c3032
< "address_line_1": "Longrigg Road",
---
> "address_line_1": "Long Rigg Road",
https://register.openownership.org/entities/16250831102261150185
3634,3638c3579,3582
< "address_line_1": "Vining Street",
< "country": "England",
< "locality": "London",
< "postal_code": "SW9 8QA",
< "premises": "18"
---
> "address_line_1": "..",
> "locality": "..",
> "postal_code": "..",
> "premises": ".."
https://register.openownership.org/entities/1544374556635860030
Overall, most of the above appears to be up-to-date in Register. A couple of cases weren't, but that's as expected, since the update occurred via streaming after the last time bulk Transformer PSC was run (and streaming Transformer PSC, #255, isn't yet live). In a couple of minor cases, Register doesn't appear to be up-to-date, but it's also not entirely clear what the data should be. Various company numbers did not return results via Register search. |
5. What are the consequences of etags not matching between the stream and bulk data on #254 and #255?
If etags don't match, the update will be processed, rather than skipped. If the rest of the fields in the record are the same, this shouldn't lead to any changes, since the statement ID should be the same (or, otherwise, contain no changes other than metadata). So, although etags sometimes varying even for the same underlying records causes problems when testing samples or investigating issues, it likely won't lead to problems during the deployment of streaming Transformer PSC. There is, potentially, some consideration of event ordering, but this is the same consideration throughout, including when deploying streaming Ingester PSC. So whilst it could well be an issue in certain scenarios, it's a similar type of complication to those we've been working with already. |
Fields sample
Another sample was taken, without going into as much depth. For that, a small file (over 100 lines) ingested via streaming was considered, its self-link identifiers were extracted, and those were used to match directly in a PSC snapshot (ignoring the data written to S3 via bulk Ingester PSC). A minimal comparison of fields, confirmed via searching the stream S3 files, showed nothing for the following fields:
I didn't find any other missing fields in the sample I examined, but it's possible that a different sample would highlight more missing fields. However, it's strange that nothing was found for these in the S3 files, even for data imported via bulk Ingester PSC; that makes me wonder whether those fields are being missed despite being present in the PSC snapshots. If so, that issue would predate the streaming Ingester PSC. e.g. There is nothing mentioning e.g. There is nothing mentioning Perhaps the data schema used in PSC snapshots changed at some point? So, it seems that these fields are indeed missing from the stream—but also that we probably weren't using them anyway… |
Issues resolved via #270 |