
Introducing the ability to add static sources of BODS data #171

Closed
StephenAbbott opened this issue Aug 8, 2023 · 13 comments
Labels
enhancement New feature or request

@StephenAbbott
Member

Incorporate a static source of BODS data into the Open Ownership Register rather than working with data delivered via an API or bulk data products.

See PDF for details and estimates of work required:
Static_BODS.pdf

@StephenAbbott StephenAbbott added the enhancement New feature or request label Aug 8, 2023
@StephenAbbott StephenAbbott changed the title Register v2: Introducing the ability to add static sources of BODS data Introducing the ability to add static sources of BODS data Sep 1, 2023
@spacesnottabs spacesnottabs moved this from Todo (highest priority at top) to In Progress in Open Ownership Register and BODS pipelines Sep 17, 2023
@StephenAbbott
Member Author

See openownership/register-sources-bods#41 for progress on this

@tiredpixel tiredpixel moved this from Todo to In Progress in Open Ownership Register and BODS pipelines Nov 27, 2023
tiredpixel added a commit that referenced this issue Nov 28, 2023
Without this, the Register explodes when including the draft AM data:

    Dry::Types::CoercionError ([RegisterSourcesBods::EntityStatement.new] ["«AMP HOLDING»"] (Array) has invalid type for :alternateNames violates constraints (type?(String, ["«AMP HOLDING»"]) failed)):
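The coercion error above indicates an alternateNames element that is itself an array rather than a string. A minimal preprocessing sketch of how such values could be flattened before validation (a hypothetical helper, not the actual patch that was committed):

```python
def normalise_alternate_names(statement: dict) -> dict:
    """Flatten nested alternateNames entries into plain strings.

    BODS 0.2 expects alternateNames to be an array of strings, but some
    draft AM statements appear to contain nested arrays such as
    [["«AMP HOLDING»"]], which trips the type check.
    """
    names = statement.get("alternateNames")
    if names is None:
        return statement
    flattened = []
    for name in names:
        if isinstance(name, list):
            # Unwrap one level of nesting, coercing each element to a string.
            flattened.extend(str(n) for n in name)
        else:
            flattened.append(str(name))
    statement["alternateNames"] = flattened
    return statement
```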
tiredpixel added a commit that referenced this issue Nov 28, 2023
[#171] upgrade common, sources-bods, sources-oc for types fix
@tiredpixel
Contributor

tiredpixel commented Nov 30, 2023

Most of the work on this is now done. Ingestion and transformation into production using AM data have been a success, apart from known issues with the AM data itself not being fully BODS 0.2 compliant. However, only the local import method has been used so far, not the bulk method connecting to the Kinesis stream.

Remaining on this:

  • ingestion using bulk method
  • transformation using bulk method
  • consideration of whether AM data will be included in monthly bulk data export

@tiredpixel
Contributor

Kinesis data streams

The bods-am-prod data stream was created in Kinesis. Having a separate data stream for each datasource doesn't scale all that well pricing-wise, but it's the way everything else currently works, and changing this could be complex; for example, it's not clear how BODS statements could be segmented later, given that the AM data currently supplies publicationDetails.publisher.name = ---.

On-demand capacity mode was considered, but in the end provisioned capacity with shards = 1 was chosen, since that's what was chosen for the existing data streams, and I recall some mention of potential issues if the number of shards is increased beyond 1.

The default data retention of 1 day was chosen.
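For reference, a provisioned single-shard stream like the one described above corresponds roughly to the following boto3 parameters (a sketch only; the stream name comes from this thread, and the actual call is left commented out since it would create real AWS resources):

```python
# import boto3  # uncomment to actually create the stream

# Provisioned capacity with a single shard, matching the existing data streams.
create_stream_params = {
    "StreamName": "bods-am-prod",
    "ShardCount": 1,
    "StreamModeDetails": {"StreamMode": "PROVISIONED"},
}
# boto3.client("kinesis").create_stream(**create_stream_params)
```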

Kinesis delivery streams

The bods-am-prod delivery stream was created in Kinesis using Firehose, and configured to be similar to the existing delivery streams.

S3 destination was selected using oo-register-v2 bucket, with bods_v2/source=AM/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/ bucket prefix, and bods_v2_errors/source=AM/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/!{firehose:error-output-type}/ error output prefix.

The buffer size was overridden to 64 MiB and the buffer interval to 900 s, in keeping with the existing delivery streams.
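The !{timestamp:...} placeholders in the prefixes above are Firehose dynamic prefix expressions, expanded from the record's approximate arrival time. A small illustration of the resulting S3 key prefix (a hypothetical helper, not part of the pipeline):

```python
from datetime import datetime, timezone

def expand_prefix(ts: datetime) -> str:
    """Mimic Firehose's expansion of the bods_v2 AM bucket prefix."""
    return f"bods_v2/source=AM/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"

print(expand_prefix(datetime(2023, 11, 28, tzinfo=timezone.utc)))
# bods_v2/source=AM/year=2023/month=11/day=28/
```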

@tiredpixel
Contributor

Kinesis delivery streams

I've amended the buffer size to use the AWS recommended defaults of 5 MiB and 300 s. This will result in more small files, which causes the Register Files Combiner to explode, but that will be reworked in #213 anyway, and using the default settings should mean less waiting around.

The buffer settings also make me wonder: is it guaranteed that running the Register Files Combiner in the monthly bulk data import will include all the relevant data? If data was ingested within the previous 15 minutes, might the buffer not yet have been flushed, resulting in files missing from S3 prior to combination? Realistically, this is unlikely to happen, given the manual steps necessary, but it's worth bearing in mind.

@tiredpixel
Contributor

Kinesis data streams

I also created the am-prod data stream, since it seems this is the intended place for raw data streamed during ingestion.

Kinesis delivery streams

I also created the am-prod delivery stream, with the oo-register-v2 S3 bucket as destination, raw_data/source=AM/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/ bucket prefix, raw_data_errors/source=AM/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/!{firehose:error-output-type}/ error output prefix, and default buffer settings (5 MiB, 300 s).

It seems this should be used when calling ingest-bulk with a Kinesis stream.

@tiredpixel
Contributor

With the various AWS Kinesis work and patches written as part of this ticket, ingestion and transformation now work using both the local and the bulk method, with and without publishing to Kinesis streams.

There would certainly be merit in revisiting these approaches in the future, in both architecture and implementation. Also of note is that the usage of Kinesis does not scale well price-wise with this approach.

As part of this, AM data was deleted and recreated. This should have had no consequences, as it hadn't been published anywhere else downstream, including in monthly bulk data exports. However, it has likely broken the examples for #229, which will need to be updated before that can be worked on.

In the original source AM data, there were 1286 rows. In the raw index in Elasticsearch, that resulted in 1140 documents. This is concerning, but previous investigation showed that some statement IDs are duplicated, which is invalid according to BODS 0.2, so those rows are ignored. However, in the AWS S3 file published via the Kinesis stream Firehose, that results in 1171 rows; I don't have an explanation for that. Similarly, transformation resulted in 730 BODS documents in Elasticsearch, yet in the S3 file created via the Kinesis stream Firehose there are 940 rows. This is extremely concerning, but hasn't been looked into further.
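The drop from 1286 source rows to 1140 raw documents is consistent with duplicate statementID values being discarded. A quick way to check those counts against a source JSONL file (a hypothetical snippet for investigation, not pipeline code):

```python
import json
from collections import Counter

def statement_id_counts(jsonl_lines):
    """Count occurrences of each statementID in an iterable of JSONL lines."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if line:
            counts[json.loads(line)["statementID"]] += 1
    return counts

# Toy example standing in for the real AM file:
lines = [
    '{"statementID": "a"}',
    '{"statementID": "a"}',
    '{"statementID": "b"}',
]
counts = statement_id_counts(lines)
duplicates = {sid: n for sid, n in counts.items() if n > 1}
print(len(counts), duplicates)  # 2 {'a': 2}
```

Comparing len(counts) with the Elasticsearch document count would confirm whether duplicates alone explain the gap.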

AM data can now, theoretically, be included in monthly bulk data exports. Theoretically, because this data is not updated during the normal monthly bulk data import cycle, and also because the files are small, so it will very probably suffer from the existing small-file issues caused by flaws in the Register Files Combiner, due for redesign in #213. We might want to consider whether we even want to include AM data at this point, given that the raw data is not fully valid according to BODS 0.2. However, at present its inclusion will be attempted, so if that isn't desirable, it will probably need to be removed again (if only temporarily…).

As far as I'm aware, this completes the static BODS work in this ticket: everything that was written is now working, and despite various inefficiencies and areas to revisit in the future, it's usable.

@tiredpixel tiredpixel moved this from In Progress to In Testing in Open Ownership Register and BODS pipelines Dec 8, 2023
@StephenAbbott
Member Author

Thanks for the summary, @tiredpixel - and the issues to be aware of.

Can you make the necessary changes to block its inclusion if it's likely to interfere with the existing Register Files Combiner process? Then we can revisit this once you move on to #213.

@tiredpixel
Contributor

Done, @StephenAbbott. The AWS Kinesis bods-am-prod delivery stream has been temporarily reconfigured to write directly to the oo-register-v2 AWS S3 bucket under the bods_v2_am/ prefix instead of bods_v2/, so it won't be included in monthly bulk data exports.

@tiredpixel
Contributor

tiredpixel commented Dec 19, 2023

An issue was found during testing where some AM statements have been merged, despite seemingly not being related entities. Investigation led to three candidates for the cause:

  1. Incorrect-ish AM source data which has identifiers with neither id nor uri. (This isn't actually strictly incorrect, since BODS 0.2 says 'any of' the fields and doesn't specify that id or uri must be present. But it leads to essentially useless identifiers.)
  2. The Open Corporates API potentially using this or other data and returning an incorrect hit.
  3. The Register implementation taking the first hit from the Open Corporates API, without regard either to the presence of multiple hits or to the returned confidence scores for those hits.

In an attempt to eliminate (2) and (3) with regard to (1), the Static BODS process will be extended to allow Open Corporates resolution to be disabled for specific imports.

After that, data will need reimporting (not migrating, since that's not possible here).

Test cases:

  • ԳԼՈԲԱԼ ԳՈԼԴ ՔՈՐՓՈՐԵՅՇՆ
  • ՖԱՅՐԲԸՐԴ ՄԵՆԵՋՄԵՆՏ ՍՊԸ

@tiredpixel tiredpixel moved this from In Testing to In Progress in Open Ownership Register and BODS pipelines Dec 19, 2023
@tiredpixel
Contributor

tiredpixel commented Dec 19, 2023

I investigated whether disabling resolution via Open Corporates fixed the issue. It didn't. Rather, it turns out to be (1): the incorrect-ish (yet valid according to BODS 0.2) identifiers, which contain neither id nor uri (in fact, they contain an empty string).

I managed to isolate the problem and reduce it to just two lines:

am.jq.test-2.jsonl incorrectly merges entities:

{"statementID":"0e3223ba-108f-479a-83c8-8195c869f6ef","statementType":"entityStatement","isComponent":false,"statementDate":"2021-06-17","entityType":"registeredEntity","name":"ԳԼՈԲԱԼ ԳՈԼԴ ՔՈՐՓՈՐԵՅՇՆ","addresses":[{"type":"registered","address":"NEW YORK, NEW YORK, RYE, Theodore Fremd Avenue , 555, Suite C208","country":"US","postCode":"10580"}],"alternateNames":["GLOBAL GOLD CORPORATION"],"identifiers":[{"id":"","scheme":"USA-TAXID"}],"publicationDetails":{"publicationDate":"2021-06-17","bodsVersion":"0.2","publisher":{"name":"---"}}}
{"statementID":"407947c8-f3f1-454c-aad2-4b1c9b1330b7","statementType":"entityStatement","isComponent":false,"statementDate":"2021-06-17","entityType":"registeredEntity","name":"ՖԱՅՐԲԸՐԴ ՄԵՆԵՋՄԵՆՏ ՍՊԸ","addresses":[{"type":"registered","address":"NEW YORK, NEW YORK, NEW YORK, NY, 152 WEST 57th Street 24th floor, 152, 24th floor","country":"US","postCode":"10019"}],"alternateNames":["FIREBIRD MANAGEMENT LLC"],"identifiers":[{"id":"","scheme":"USA-TAXID"}],"publicationDetails":{"publicationDate":"2021-06-17","bodsVersion":"0.2","publisher":{"name":"---"}}}

am.jq.test-2.jsonl.csv

Whereas manually-edited am.jq.test-2.jsonl doesn't:

{"statementID":"0e3223ba-108f-479a-83c8-8195c869f6ef","statementType":"entityStatement","isComponent":false,"statementDate":"2021-06-17","entityType":"registeredEntity","name":"ԳԼՈԲԱԼ ԳՈԼԴ ՔՈՐՓՈՐԵՅՇՆ","addresses":[{"type":"registered","address":"NEW YORK, NEW YORK, RYE, Theodore Fremd Avenue , 555, Suite C208","country":"US","postCode":"10580"}],"alternateNames":["GLOBAL GOLD CORPORATION"],"publicationDetails":{"publicationDate":"2021-06-17","bodsVersion":"0.2","publisher":{"name":"---"}}}
{"statementID":"407947c8-f3f1-454c-aad2-4b1c9b1330b7","statementType":"entityStatement","isComponent":false,"statementDate":"2021-06-17","entityType":"registeredEntity","name":"ՖԱՅՐԲԸՐԴ ՄԵՆԵՋՄԵՆՏ ՍՊԸ","addresses":[{"type":"registered","address":"NEW YORK, NEW YORK, NEW YORK, NY, 152 WEST 57th Street 24th floor, 152, 24th floor","country":"US","postCode":"10019"}],"alternateNames":["FIREBIRD MANAGEMENT LLC"],"publicationDetails":{"publicationDate":"2021-06-17","bodsVersion":"0.2","publisher":{"name":"---"}}}

am.jq.test-3.jsonl.csv

It's not clear what's best to do about this, other than make another preprocessing change to strip identifiers which have neither id nor uri (treating the empty string as null), even though that's technically valid BODS 0.2.
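The proposed preprocessing change, stripping identifiers with neither id nor uri and treating the empty string as absent, could look roughly like this (a sketch under those assumptions, not the actual patch):

```python
def strip_empty_identifiers(statement: dict) -> dict:
    """Remove identifiers whose id and uri are both missing or empty.

    Such identifiers are technically valid BODS 0.2 ('any of' the fields),
    but carry no information and cause unrelated entities to be merged.
    """
    identifiers = statement.get("identifiers")
    if not identifiers:
        return statement
    kept = [
        i for i in identifiers
        if i.get("id") or i.get("uri")  # empty string treated as absent
    ]
    if kept:
        statement["identifiers"] = kept
    else:
        # Drop the key entirely rather than leave an empty array.
        statement.pop("identifiers", None)
    return statement
```

Applied to the two test statements above, this would drop the empty {"id":"","scheme":"USA-TAXID"} identifiers, matching the manually edited file that no longer merges the entities.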

@tiredpixel
Contributor

Noting that, after a conversation with @StephenAbbott, it's been decided to strip identifiers having neither id nor uri, despite this technically being valid BODS 0.2, and to see whether that fixes the imports.

@tiredpixel
Contributor

Identifiers with neither id nor uri are now stripped.

AM data has been completely reimported.

That appears to resolve that particular issue. However, there are now also entities which are not being merged. This might relate to other known issues; it hasn't been investigated at present.

@StephenAbbott
Member Author

Signed off ✅
