
Introducing the ability to add static sources of BODS data #171

Closed
StephenAbbott opened this issue Aug 8, 2023 · 13 comments
Labels
enhancement New feature or request

@StephenAbbott
Member

Incorporate a static source of BODS data into the Open Ownership Register rather than working with data delivered via an API or bulk data products.

See PDF for details and estimates of work required:
Static_BODS.pdf

@StephenAbbott StephenAbbott added the enhancement New feature or request label Aug 8, 2023
@StephenAbbott StephenAbbott changed the title Register v2: Introducing the ability to add static sources of BODS data Introducing the ability to add static sources of BODS data Sep 1, 2023
@spacesnottabs spacesnottabs moved this from Todo (highest priority at top) to In Progress in Open Ownership Register and BODS pipelines Sep 17, 2023
@StephenAbbott
Member Author

See openownership/register-sources-bods#41 for progress on this

@tiredpixel tiredpixel moved this from Todo to In Progress in Open Ownership Register and BODS pipelines Nov 27, 2023
tiredpixel added a commit that referenced this issue Nov 28, 2023
Without this, the Register explodes when including the draft AM data:

    Dry::Types::CoercionError ([RegisterSourcesBods::EntityStatement.new] ["«AMP HOLDING»"] (Array) has invalid type for :alternateNames violates constraints (type?(String, ["«AMP HOLDING»"]) failed)):
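The coercion error above indicates an alternateNames element that is itself an array rather than a string. A minimal preprocessing sketch of how such values could be flattened before validation (a hypothetical helper, not the actual patch that was committed):

```python
def normalise_alternate_names(statement: dict) -> dict:
    """Flatten nested alternateNames entries into plain strings.

    BODS 0.2 expects alternateNames to be an array of strings, but some
    draft AM statements appear to contain nested arrays such as
    [["«AMP HOLDING»"]], which trips the type check.
    """
    names = statement.get("alternateNames")
    if names is None:
        return statement
    flattened = []
    for name in names:
        if isinstance(name, list):
            # Unwrap one level of nesting, coercing each element to a string.
            flattened.extend(str(n) for n in name)
        else:
            flattened.append(str(name))
    statement["alternateNames"] = flattened
    return statement
```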
tiredpixel added a commit that referenced this issue Nov 28, 2023
[#171] upgrade common, sources-bods, sources-oc for types fix
@tiredpixel
Contributor

tiredpixel commented Nov 30, 2023

Most of the work on this is now done. Ingestion and transformation into production using AM data have been a success, apart from known issues with the AM data itself not being fully BODS 0.2 compliant. However, only the local import method has been used so far, not the bulk method connecting to the Kinesis stream.

Remaining on this:

  • ingestion using bulk method
  • transformation using bulk method
  • consideration of whether AM data will be included in monthly bulk data export

@tiredpixel
Contributor

Kinesis data streams

The bods-am-prod data stream was created in Kinesis. Having a separate data stream for each datasource doesn't scale all that well pricing-wise, but it's the way everything else currently works, and changing this could be complex; for example, it's not clear how BODS statements could be segmented later, given that the AM data currently supplies publicationDetails.publisher.name = ---.

On-demand capacity mode was considered, but in the end provisioned capacity with shards = 1 was chosen, since that's what was chosen for the existing data streams, and I recall some mention of potential issues if the number of shards is increased beyond 1.

The default data retention of 1 day was chosen.
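For reference, a provisioned single-shard stream like the one described above corresponds roughly to the following boto3 parameters (a sketch only; the stream name comes from this thread, and the actual call is left commented out since it would create real AWS resources):

```python
# import boto3  # uncomment to actually create the stream

# Provisioned capacity with a single shard, matching the existing data streams.
create_stream_params = {
    "StreamName": "bods-am-prod",
    "ShardCount": 1,
    "StreamModeDetails": {"StreamMode": "PROVISIONED"},
}
# boto3.client("kinesis").create_stream(**create_stream_params)
```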

Kinesis delivery streams

The bods-am-prod delivery stream was created in Kinesis using Firehose, and configured to be similar to the existing delivery streams.

S3 destination was selected using oo-register-v2 bucket, with bods_v2/source=AM/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/ bucket prefix, and bods_v2_errors/source=AM/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/!{firehose:error-output-type}/ error output prefix.

The buffer size was overridden to 64 MiB and the buffer interval to 900 s, in keeping with the existing delivery streams.
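The !{timestamp:...} placeholders in the prefixes above are Firehose dynamic prefix expressions, expanded from the record's approximate arrival time. A small illustration of the resulting S3 key prefix (a hypothetical helper, not part of the pipeline):

```python
from datetime import datetime, timezone

def expand_prefix(ts: datetime) -> str:
    """Mimic Firehose's expansion of the bods_v2 AM bucket prefix."""
    return f"bods_v2/source=AM/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"

print(expand_prefix(datetime(2023, 11, 28, tzinfo=timezone.utc)))
# bods_v2/source=AM/year=2023/month=11/day=28/
```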

@tiredpixel
Contributor

Kinesis delivery streams

I've amended the buffer size to use the AWS recommended defaults of 5 MiB and 300 s. This will result in more small files, which causes the Register Files Combiner to explode, but that will be reworked in #213 anyway, and using the default settings should mean less waiting around.

The buffer settings also make me wonder: is it guaranteed that running the Register Files Combiner in the monthly bulk data import will include all the relevant data? If data was ingested within the previous 15 minutes, might the buffer not yet have been flushed, resulting in files missing from S3 prior to combination? Realistically, this is unlikely to happen, given the manual steps necessary, but it's worth bearing in mind.

@tiredpixel
Contributor

Kinesis data streams

I also created the am-prod data stream, since it seems this is the intended place for raw data streamed during ingestion.

Kinesis delivery streams

I also created the am-prod delivery stream, with the oo-register-v2 S3 bucket as destination, raw_data/source=AM/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/ bucket prefix, raw_data_errors/source=AM/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/!{firehose:error-output-type}/ error output prefix, and default buffer settings (5 MiB, 300 s).

It seems this should be used when calling ingest-bulk with a Kinesis stream.

@tiredpixel
Contributor

With the various AWS Kinesis work and patches written as part of this ticket, ingestion and transformation now work using both the local and the bulk method, with and without publishing to Kinesis streams.

There would certainly be merit in revisiting these approaches in the future, in both architecture and implementation. Also of note is that the usage of Kinesis does not scale well price-wise with this approach.

As part of this, AM data was deleted and recreated. This should have had no consequences, as it hadn't been published anywhere else downstream, including in monthly bulk data exports. However, it has likely broken the examples for #229, which will need to be updated before that can be worked on.

In the original source AM data, there were 1286 rows. In the raw index in Elasticsearch, that resulted in 1140 documents. This is concerning, but previous investigation showed that some statement IDs are duplicated, which is invalid according to BODS 0.2, so those rows are ignored. However, in the AWS S3 file published via the Kinesis stream Firehose, that results in 1171 rows; I don't have an explanation for that. Similarly, transformation resulted in 730 BODS documents in Elasticsearch, yet in the S3 file created via the Kinesis stream Firehose there are 940 rows. This is extremely concerning, but hasn't been looked into further.
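The drop from 1286 source rows to 1140 raw documents is consistent with duplicate statementID values being discarded. A quick way to check those counts against a source JSONL file (a hypothetical snippet for investigation, not pipeline code):

```python
import json
from collections import Counter

def statement_id_counts(jsonl_lines):
    """Count occurrences of each statementID in an iterable of JSONL lines."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if line:
            counts[json.loads(line)["statementID"]] += 1
    return counts

# Toy example standing in for the real AM file:
lines = [
    '{"statementID": "a"}',
    '{"statementID": "a"}',
    '{"statementID": "b"}',
]
counts = statement_id_counts(lines)
duplicates = {sid: n for sid, n in counts.items() if n > 1}
print(len(counts), duplicates)  # 2 {'a': 2}
```

Comparing len(counts) with the Elasticsearch document count would confirm whether duplicates alone explain the gap.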

AM data can now, theoretically, be included in monthly bulk data exports. Theoretically, because this data is not updated during the normal monthly bulk data import cycle, and also because the files are small, so it will very probably suffer from the existing small-file issues caused by flaws in the Register Files Combiner, due for redesign in #213. We might want to consider whether we even want to include AM data at this point, given that the raw data is not fully valid according to BODS 0.2. However, at present its inclusion will be attempted, so if that isn't desirable, it will probably need to be removed again (if only temporarily…).

As far as I'm aware, this completes the static BODS work in this ticket: everything that was written is now working, and despite various inefficiencies and areas to revisit in the future, it's usable.

@tiredpixel tiredpixel moved this from In Progress to In Testing in Open Ownership Register and BODS pipelines Dec 8, 2023
@StephenAbbott
Member Author

Thanks for the summary, @tiredpixel - and the issues to be aware of.

Can you make the necessary changes to block its inclusion if it's likely to interfere with the existing Register Files Combiner process? Then we can revisit this once you move on to #213.

@tiredpixel
Contributor

Done, @StephenAbbott. The AWS Kinesis bods-am-prod delivery stream has been temporarily reconfigured to write directly to the oo-register-v2 AWS S3 bucket under the bods_v2_am/ prefix instead of bods_v2/, so it won't be included in monthly bulk data exports.

@tiredpixel
Contributor

tiredpixel commented Dec 19, 2023

An issue was found during testing where some AM statements have been merged, despite seemingly not being related entities. Investigation led to three candidates for the cause:

  1. Incorrect-ish AM source data which has identifiers with neither id nor uri. (This isn't actually strictly incorrect, since BODS 0.2 says 'any of' the fields and doesn't specify that id or uri must be present. But it leads to essentially useless identifiers.)
  2. The Open Corporates API potentially using this or other data and returning an incorrect hit.
  3. The Register implementation taking the first hit from the Open Corporates API, without regard either to the presence of multiple hits or to the returned confidence scores for those hits.

In an attempt to eliminate (2) and (3) with regard to (1), the Static BODS process will be extended to allow Open Corporates resolution to be disabled for specific imports.

After that, data will need reimporting (not migrating, since that's not possible here).

Test cases:

  • ԳԼՈԲԱԼ ԳՈԼԴ ՔՈՐՓՈՐԵՅՇՆ
  • ՖԱՅՐԲԸՐԴ ՄԵՆԵՋՄԵՆՏ ՍՊԸ

@tiredpixel tiredpixel moved this from In Testing to In Progress in Open Ownership Register and BODS pipelines Dec 19, 2023
@tiredpixel
Contributor

tiredpixel commented Dec 19, 2023

I investigated whether disabling resolution via Open Corporates fixed the issue. It didn't. Rather, it turns out to be (1): the incorrect-ish (yet valid according to BODS 0.2) identifiers, which contain neither id nor uri (in fact, they contain an empty string).

I managed to isolate the problem and reduce it to just two lines:

am.jq.test-2.jsonl incorrectly merges entities:

{"statementID":"0e3223ba-108f-479a-83c8-8195c869f6ef","statementType":"entityStatement","isComponent":false,"statementDate":"2021-06-17","entityType":"registeredEntity","name":"ԳԼՈԲԱԼ ԳՈԼԴ ՔՈՐՓՈՐԵՅՇՆ","addresses":[{"type":"registered","address":"NEW YORK, NEW YORK, RYE, Theodore Fremd Avenue , 555, Suite C208","country":"US","postCode":"10580"}],"alternateNames":["GLOBAL GOLD CORPORATION"],"identifiers":[{"id":"","scheme":"USA-TAXID"}],"publicationDetails":{"publicationDate":"2021-06-17","bodsVersion":"0.2","publisher":{"name":"---"}}}
{"statementID":"407947c8-f3f1-454c-aad2-4b1c9b1330b7","statementType":"entityStatement","isComponent":false,"statementDate":"2021-06-17","entityType":"registeredEntity","name":"ՖԱՅՐԲԸՐԴ ՄԵՆԵՋՄԵՆՏ ՍՊԸ","addresses":[{"type":"registered","address":"NEW YORK, NEW YORK, NEW YORK, NY, 152 WEST 57th Street 24th floor, 152, 24th floor","country":"US","postCode":"10019"}],"alternateNames":["FIREBIRD MANAGEMENT LLC"],"identifiers":[{"id":"","scheme":"USA-TAXID"}],"publicationDetails":{"publicationDate":"2021-06-17","bodsVersion":"0.2","publisher":{"name":"---"}}}

am.jq.test-2.jsonl.csv

Whereas manually-edited am.jq.test-2.jsonl doesn't:

{"statementID":"0e3223ba-108f-479a-83c8-8195c869f6ef","statementType":"entityStatement","isComponent":false,"statementDate":"2021-06-17","entityType":"registeredEntity","name":"ԳԼՈԲԱԼ ԳՈԼԴ ՔՈՐՓՈՐԵՅՇՆ","addresses":[{"type":"registered","address":"NEW YORK, NEW YORK, RYE, Theodore Fremd Avenue , 555, Suite C208","country":"US","postCode":"10580"}],"alternateNames":["GLOBAL GOLD CORPORATION"],"publicationDetails":{"publicationDate":"2021-06-17","bodsVersion":"0.2","publisher":{"name":"---"}}}
{"statementID":"407947c8-f3f1-454c-aad2-4b1c9b1330b7","statementType":"entityStatement","isComponent":false,"statementDate":"2021-06-17","entityType":"registeredEntity","name":"ՖԱՅՐԲԸՐԴ ՄԵՆԵՋՄԵՆՏ ՍՊԸ","addresses":[{"type":"registered","address":"NEW YORK, NEW YORK, NEW YORK, NY, 152 WEST 57th Street 24th floor, 152, 24th floor","country":"US","postCode":"10019"}],"alternateNames":["FIREBIRD MANAGEMENT LLC"],"publicationDetails":{"publicationDate":"2021-06-17","bodsVersion":"0.2","publisher":{"name":"---"}}}

am.jq.test-3.jsonl.csv

It's not clear what's best to do about this, other than make another preprocessing change to strip identifiers which have neither id nor uri (treating the empty string as null), even though that's technically valid BODS 0.2.
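The proposed preprocessing change, stripping identifiers with neither id nor uri and treating the empty string as absent, could look roughly like this (a sketch under those assumptions, not the actual patch):

```python
def strip_empty_identifiers(statement: dict) -> dict:
    """Remove identifiers whose id and uri are both missing or empty.

    Such identifiers are technically valid BODS 0.2 ('any of' the fields),
    but carry no information and cause unrelated entities to be merged.
    """
    identifiers = statement.get("identifiers")
    if not identifiers:
        return statement
    kept = [
        i for i in identifiers
        if i.get("id") or i.get("uri")  # empty string treated as absent
    ]
    if kept:
        statement["identifiers"] = kept
    else:
        # Drop the key entirely rather than leave an empty array.
        statement.pop("identifiers", None)
    return statement
```

Applied to the two test statements above, this would drop the empty {"id":"","scheme":"USA-TAXID"} identifiers, matching the manually edited file that no longer merges the entities.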

@tiredpixel
Contributor

Noting that, after a conversation with @StephenAbbott, it's been decided to strip identifiers having neither id nor uri, despite this technically being valid BODS 0.2, and to see whether that fixes the imports.

@tiredpixel
Contributor

Identifiers with neither id nor uri are now stripped.

AM data has been completely reimported.

That appears to resolve that particular issue. However, there are now also entities which are not being merged. This might relate to other known issues; it hasn't been investigated at present.

@StephenAbbott
Member Author

Signed off ✅
