Introducing the ability to add static sources of BODS data #171
See openownership/register-sources-bods#41 for progress on this.
Without this, the Register explodes when the draft AM data is included: Dry::Types::CoercionError ([RegisterSourcesBods::EntityStatement.new] ["«AMP HOLDING»"] (Array) has invalid type for :alternateNames violates constraints (type?(String, ["«AMP HOLDING»"]) failed))
[#171] upgrade common, sources-bods, sources-oc for types fix
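The CoercionError quoted above reports that an Array value was checked against `String` for `alternateNames`. As a rough illustration of the shape the AM records actually carry (the sample rows further down this thread use a list of strings), here is a minimal Python sketch of that check. It is not the dry-types fix made in the upgraded gems; the function name `alternate_names_ok` is hypothetical.

```python
from typing import Any, Dict


def alternate_names_ok(statement: Dict[str, Any]) -> bool:
    """Return True when `alternateNames` is absent or is a list of strings.

    Assumption: the field is an array of strings, as in the AM sample rows
    quoted in this thread.
    """
    names = statement.get("alternateNames", [])
    return isinstance(names, list) and all(isinstance(n, str) for n in names)
```

A record such as `{"alternateNames": ["«AMP HOLDING»"]}` passes this check, whereas a bare string or a nested list does not.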
Most of the work on this is now done. Ingestion and transformation into production using AM data have been a success, apart from known issues with the AM data itself not being fully BODS 0.2 compliant. However, only the local import method has been used so far, not the bulk method connecting to the Kinesis stream. Remaining on this:
Kinesis data streams
The
Default data retention of
Kinesis delivery streams
The S3 destination was selected using
The buffer size was overridden to be
Kinesis delivery streams
I've amended the buffer size to use the AWS recommended defaults of
The buffer settings also make me wonder: is it guaranteed that running the Register Files Combiner in the monthly bulk data import will include all the relevant data? If data was ingested within the previous 15 minutes, the buffer might not yet have been flushed, leaving files missing from S3 before combination. Realistically, this is unlikely to happen given the manual steps necessary, but it's a wondrous thought.
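One way to guard against the unflushed-buffer case raised above is to compare the time of the last ingest against the buffer interval before running the combiner. The following is only a sketch: the function name `buffer_probably_flushed` is hypothetical, and the 15-minute default is taken from the window mentioned in the comment rather than from the actual stream configuration.

```python
from datetime import datetime, timedelta, timezone


def buffer_probably_flushed(
    last_ingest_at: datetime,
    now: datetime,
    buffer_interval: timedelta = timedelta(minutes=15),  # assumed, see lead-in
) -> bool:
    """Return True once at least one full buffer interval has elapsed.

    This is conservative: a size-based flush may happen earlier, in which
    case the data is already in S3 even though this returns False.
    """
    return now - last_ingest_at >= buffer_interval


# Example with fixed timestamps so the result does not depend on the clock.
ingested = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
too_soon = buffer_probably_flushed(ingested, ingested + timedelta(minutes=5))
long_enough = buffer_probably_flushed(ingested, ingested + timedelta(minutes=20))
```

Here `too_soon` is `False` and `long_enough` is `True`. A check like this only covers the time-based flush; it says nothing about delivery failures.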
Kinesis data streams
I also created
Kinesis delivery streams
I also created
It seems this should be used when calling
With the various AWS Kinesis work and patches written as part of this ticket, ingestion and transformation now work using both the local and the bulk method, with and without publishing to Kinesis streams. There would certainly be merit in revisiting these approaches in the future, both in architecture and in implementation. Also of note: Kinesis does not scale well price-wise when used this way.
As part of this, AM data was deleted and recreated. This should have had no consequences, as it hadn't been published anywhere downstream, including in monthly bulk data exports. However, it has likely broken the examples for #229, which will need to be updated before that ticket can be worked on.
The original source AM data had 1286 rows. In the raw index in Elasticsearch, that resulted in 1140 documents. This is concerning, but previous investigation showed that some statement IDs are duplicated, which is invalid according to BODS 0.2, so those rows are ignored. However, the AWS S3 file published via the Kinesis Firehose delivery stream has 1171 rows, and I don't have an explanation for that. Similarly, transformation resulted in 730 BODS documents in Elasticsearch, yet the S3 file created via the Kinesis Firehose delivery stream has 940 rows. This is extremely concerning, but hasn't been looked into further.
AM data can now be included in monthly bulk data exports, at least in theory. In theory only, because this data is not updated during the normal monthly bulk data import cycle, and because the files are small, so they will very probably suffer from the existing issues with small files caused by flaws in the Register Files Combiner, which is due for redesign in #213. We might want to consider whether we even want to include AM data at this point, given that the raw data is not fully valid according to BODS 0.2. However, at present, its inclusion will be attempted, so if that isn't desirable, it will probably need to be removed again (if only temporarily).
As far as I'm aware, this completes the static BODS work in this ticket: everything that was written is now working and, despite various inefficiencies and areas to revisit in the future, it's usable.
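The row-count discrepancies described above could be narrowed down by counting duplicated statement IDs directly in each file. Below is a minimal sketch, assuming each file is JSON Lines with a `statementID` on every row (as in the sample rows quoted later in this thread). The function name `summarize_statement_ids` is hypothetical, and this does not reproduce how the Register itself decides which rows to ignore.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable


def summarize_statement_ids(lines: Iterable[str]) -> Dict[str, Any]:
    """Count rows, distinct statement IDs, and IDs that occur more than once."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        counts[json.loads(line)["statementID"]] += 1
    return {
        "rows": sum(counts.values()),
        "unique_ids": len(counts),
        "duplicated_ids": {sid: n for sid, n in counts.items() if n > 1},
    }
```

Running this over the source file and over each S3 export (for example via `summarize_statement_ids(open(path, encoding="utf-8"))`) would show whether the differing totals are explained by duplicates or by something else.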
Thanks for the summary, @tiredpixel, and for flagging the issues to be aware of. Can you make the necessary changes to block its inclusion if it is likely to interfere with the existing Register Files Combiner process? Then we can revisit this once you move on to #213.
Done, @StephenAbbott. AWS Kinesis
An issue was found during testing where some AM statements have been merged even though they do not appear to be related entities. Investigation led to three candidates for the cause:
In an attempt to eliminate (2) and (3) with regard to (1), the Static BODS process will be extended to allow OpenCorporates to be disabled for specific imports. After that, the data will need reimporting (not migrating, since that's not possible here). Test cases:
I investigated whether disabling resolving via OpenCorporates fixed the issue. It didn't. Rather, it turns out it's (1): the incorrectish (yet valid according to BODS 0.2) identifiers, which contain neither
I managed to isolate the problem and reduce it to simply 2 lines.
am.jq.test-2.jsonl incorrectly merges entities:
{"statementID":"0e3223ba-108f-479a-83c8-8195c869f6ef","statementType":"entityStatement","isComponent":false,"statementDate":"2021-06-17","entityType":"registeredEntity","name":"ԳԼՈԲԱԼ ԳՈԼԴ ՔՈՐՓՈՐԵՅՇՆ","addresses":[{"type":"registered","address":"NEW YORK, NEW YORK, RYE, Theodore Fremd Avenue , 555, Suite C208","country":"US","postCode":"10580"}],"alternateNames":["GLOBAL GOLD CORPORATION"],"identifiers":[{"id":"","scheme":"USA-TAXID"}],"publicationDetails":{"publicationDate":"2021-06-17","bodsVersion":"0.2","publisher":{"name":"---"}}}
{"statementID":"407947c8-f3f1-454c-aad2-4b1c9b1330b7","statementType":"entityStatement","isComponent":false,"statementDate":"2021-06-17","entityType":"registeredEntity","name":"ՖԱՅՐԲԸՐԴ ՄԵՆԵՋՄԵՆՏ ՍՊԸ","addresses":[{"type":"registered","address":"NEW YORK, NEW YORK, NEW YORK, NY, 152 WEST 57th Street 24th floor, 152, 24th floor","country":"US","postCode":"10019"}],"alternateNames":["FIREBIRD MANAGEMENT LLC"],"identifiers":[{"id":"","scheme":"USA-TAXID"}],"publicationDetails":{"publicationDate":"2021-06-17","bodsVersion":"0.2","publisher":{"name":"---"}}}
Whereas manually-edited am.jq.test-2.jsonl doesn't:
{"statementID":"0e3223ba-108f-479a-83c8-8195c869f6ef","statementType":"entityStatement","isComponent":false,"statementDate":"2021-06-17","entityType":"registeredEntity","name":"ԳԼՈԲԱԼ ԳՈԼԴ ՔՈՐՓՈՐԵՅՇՆ","addresses":[{"type":"registered","address":"NEW YORK, NEW YORK, RYE, Theodore Fremd Avenue , 555, Suite C208","country":"US","postCode":"10580"}],"alternateNames":["GLOBAL GOLD CORPORATION"],"publicationDetails":{"publicationDate":"2021-06-17","bodsVersion":"0.2","publisher":{"name":"---"}}}
{"statementID":"407947c8-f3f1-454c-aad2-4b1c9b1330b7","statementType":"entityStatement","isComponent":false,"statementDate":"2021-06-17","entityType":"registeredEntity","name":"ՖԱՅՐԲԸՐԴ ՄԵՆԵՋՄԵՆՏ ՍՊԸ","addresses":[{"type":"registered","address":"NEW YORK, NEW YORK, NEW YORK, NY, 152 WEST 57th Street 24th floor, 152, 24th floor","country":"US","postCode":"10019"}],"alternateNames":["FIREBIRD MANAGEMENT LLC"],"publicationDetails":{"publicationDate":"2021-06-17","bodsVersion":"0.2","publisher":{"name":"---"}}}
It's not clear what's best to do about this, other than make another preprocessing change to strip identifiers which have neither
Noting that after a conversation with @StephenAbbott, it's been decided to strip identifiers having neither |
Identifiers with neither
AM data has been completely reimported. That appears to resolve that particular issue. However, there are now also entities which are not being merged. This might relate to other known issues. It hasn't been investigated at present.
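The preprocessing step discussed above can be sketched as follows. The exact rule agreed in this thread is not reproduced here. This sketch assumes, from the two sample rows, that the offending identifiers are those with a blank `id` (both rows carry `{"id":"","scheme":"USA-TAXID"}`). The helper names `strip_blank_identifiers` and `identifier_keys` are hypothetical, and matching on a `(scheme, id)` pair is only an illustration of how two unrelated entities could end up sharing a key, not a description of the Register's actual resolver.

```python
from typing import Any, Dict, List, Optional, Tuple


def strip_blank_identifiers(statement: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the statement without identifiers whose `id` is blank.

    Assumption: "blank" means missing, None, or whitespace-only. If nothing
    is left, the `identifiers` key is dropped, matching the manually edited rows.
    """
    cleaned = dict(statement)  # shallow copy; the input is not modified
    kept = [
        ident
        for ident in statement.get("identifiers", [])
        if str(ident.get("id") or "").strip()
    ]
    if kept:
        cleaned["identifiers"] = kept
    else:
        cleaned.pop("identifiers", None)
    return cleaned


def identifier_keys(statement: Dict[str, Any]) -> List[Tuple[Optional[str], Optional[str]]]:
    """List (scheme, id) pairs, a hypothetical key for matching entities."""
    return [(i.get("scheme"), i.get("id")) for i in statement.get("identifiers", [])]


# Reduced versions of the two sample rows quoted in this thread.
global_gold = {
    "statementID": "0e3223ba-108f-479a-83c8-8195c869f6ef",
    "identifiers": [{"id": "", "scheme": "USA-TAXID"}],
}
firebird = {
    "statementID": "407947c8-f3f1-454c-aad2-4b1c9b1330b7",
    "identifiers": [{"id": "", "scheme": "USA-TAXID"}],
}
```

Before stripping, both rows yield the same `("USA-TAXID", "")` pair from `identifier_keys`; after `strip_blank_identifiers`, neither row has any identifier left to match on.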
Signed off ✅ |
Incorporate a static source of BODS data into the Open Ownership Register rather than working with data delivered via an API or bulk data products.
See PDF for details and estimates of work required:
Static_BODS.pdf