
Asynchronous GeoTIFF reader #13

Open
weiji14 opened this issue Jul 19, 2024 · 17 comments

@weiji14
Member

weiji14 commented Jul 19, 2024

Rewrite https://github.com/geospatial-jeff/aiocogeo in Rust!

We'll probably want to leave all the non-filesystem I/O to object_store, and focus on decoding IFDs asynchronously.

Help is most appreciated.

@kylebarron
Member

I started a prototype here: https://github.com/developmentseed/aiocogeo-rs

It's able to read all TIFF and GeoTIFF metadata and decode a JPEG tile, though general read/decompressor support is not yet implemented. I definitely plan to implement full-tile-decoding support, so that we can use it from Python efficiently.

It's not at all reliant on the tiff crate (though it does use some enums from it that I could later copy over).

One question is how to marry both synchronous and async read APIs. One good reference is the parquet crate, which has both sync and async APIs to read Parquet.
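One way the parquet-style split can work is a shared, I/O-free parsing core with thin sync and async wrappers on top. Here is a minimal sketch of that idea; the names (`parse_magic`, `read_magic`) are hypothetical, not from any existing crate:

```rust
use std::io::Read;

/// Pure function over bytes: no I/O, so both sync and async paths can share it.
fn parse_magic(buf: &[u8]) -> Result<bool, String> {
    // TIFF files start with "II" (little-endian) or "MM" (big-endian).
    match (buf.first(), buf.get(1)) {
        (Some(&b'I'), Some(&b'I')) => Ok(true),  // little-endian
        (Some(&b'M'), Some(&b'M')) => Ok(false), // big-endian
        _ => Err("not a TIFF header".to_string()),
    }
}

/// Sync wrapper: performs the I/O, then delegates to the shared core.
fn read_magic<R: Read>(reader: &mut R) -> Result<bool, String> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf).map_err(|e| e.to_string())?;
    parse_magic(&buf)
}

// An async wrapper would be identical except for the `.await` on the read:
// async fn read_magic_async<R: AsyncRead + Unpin>(reader: &mut R) -> ... {
//     reader.read_exact(&mut buf).await...; parse_magic(&buf)
// }

fn main() {
    let mut le = std::io::Cursor::new(b"II*\0".to_vec());
    assert_eq!(read_magic(&mut le), Ok(true));
    let mut be = std::io::Cursor::new(b"MM\0*".to_vec());
    assert_eq!(read_magic(&mut be), Ok(false));
    println!("ok");
}
```

The point of the split is that only the thin wrappers are duplicated; all parsing logic stays in one place.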

@feefladder

feefladder commented Sep 17, 2024

Hmmm.... I wanted to use tiffs in bevy_terrain, so I modified the image/tiff crate to do async.
Some nice discussions on how to merge:

Basically: Put all non-io-related code (working on [u8]s) in a top-level module and then put (async)-io-related code in a submodule. What I did in image/tiff is to create a cursor over the byte array so I modified as little code as possible. I admit it's somewhat arcane, because I didn't know if the image/tiff folks actually wanted to maintain async support 😁. But now I see (busstoptaktik/geodesy#86) that dev efforts are concentrated here rather than georaster, I'll see if I can give it a shot....

@weiji14
Member Author

weiji14 commented Sep 17, 2024

Oh wow, thanks @feefladder for all that work (and the PR at image-rs/image-tiff#245)!

I didn't know if the image/tiff folks actually wanted to maintain async support 😁. But now I see (busstoptaktik/geodesy#86) that dev efforts are concentrated here rather than georaster, I'll see if I can give it a shot....

Not sure either, but I'd definitely like it if most of the async code could be put in image/tiff, especially if there are no 'geo'-specific parts. If I'm not mistaken, image/tiff only supports full reads and not partial reads (e.g. it can't read specific tiles within a Cloud-Optimized GeoTIFF based on a bounding box) right now? There's some experimentation going on at aiocogeo-rs that might support partial reads at some point (based on original Python implementation at https://github.com/geospatial-jeff/aiocogeo/tree/0.3.0?tab=readme-ov-file#partial-read).

Eventually though, it would be great to have less fragmentation on different ways of reading GeoTIFFs in Rust, and have all the core implementations live upstream in image/tiff or here in georust/geotiff. But there might be a period where we'll need to try different styles and see what sticks 🙂

@feefladder

feefladder commented Sep 18, 2024

Actually, I did some work on georaster based on the async stuff... lemme see if I can make that into a PR... EDIT: ah, it turns out my files didn't save properly...
image-tiff does support partial reads; it has a load_chunk(chunk_index: u32) function. Not sure if that is in a release though 😁

@feefladder

My PR over at image-rs/image-tiff#245 got rejected because it was too big and I didn't create an issue first. Then for a proposal I had for more control (image-rs/image-tiff#250), there was no maintainer bandwidth to support my proposed changes (also here). So I don't know if it is feasible in the near future to have async on top of image-tiff.

Anyone here have ideas for what is a good way forward?

@weiji14
Member Author

weiji14 commented Oct 12, 2024

@feefladder, first off, I really wish I had your energy to do such deep technical work in the (Geo)TIFF Rust ecosystem. I can understand both your frustration as an open source contributor wanting to make meaningful changes, but also feel for the image-tiff maintainer who has to review a 1000+ line PR for a crate used by 47k+ downstream projects...

Personally, I'd prefer not to fragment the TIFF ecosystem by having X different implementations in X repos, which is why I've suggested pushing things to image-tiff. That said, my colleagues at DevSeed and I are aware that the image-tiff maintainer is stretched rather thin, which is why the aiocogeo-rs repo Kyle mentioned above was created. The idea is to have an independent implementation of an async GeoTIFF reader in that aiocogeo-rs repo (one that doesn't rely on image-tiff), and once that experimental work is stabilized, we will slowly upstream parts of it to the image-tiff or geotiff crate as needed. This means a lot more duplicate work, but it can be good to have a proof of concept to show and get people excited about (and hopefully the image-tiff maintainers will be more receptive to a contribution that has been battle-tested in the wild).

Of course, I don't expect anyone to trust my obviously biased intentions on getting aiocogeo-rs working since it is currently hosted under the organization I'm working for. I'm aware @feefladder that you'd maybe prefer a different arrangement/license based on georust/meta#34 (comment) and some Discord discussions, and respect it if you'd like to carve out a different space for async GeoTIFF. Keen to hear your ideas on a way forward.

@kylebarron
Member

To add to this, I would love to figure out a good architecture such that we could use some of the decoding parts of image-tiff while managing our own data fetching. This is hard, however, because of image-tiff's architecture, and the fact that it lazily reads tiff tags on access instead of up front (the reading and the decoding of the tags are combined).

But maybe there's a way to use aiocogeo-rs for the data fetching and tag decoding, but reuse image-tiff for the image decoding? That's what I was heading towards before I ran out of time (I'm doing a lot of work on open source rust vector projects)

@feefladder

Thanks for the uplifting responses :)

I think the aiocogeo approach is a good one - having an async tiff reader specifically for COGs. That could have a smaller code/exec footprint than image/tiff without needing to worry about arcane tiffs (since being only for COGs is in the name). What I'm currently thinking of is to put all lessons learned from image-tiff into such a reader, and maybe slap on a writer as well. That would mean yet another forked-and-maybe-rebranded tiff reader, which is exactly the problem we're trying to avoid here.

What would be ideal is to have agreement between here and image-tiff on what direction image-tiff could develop, so that things made here can actually be upstreamed. What could be done as a middle ground between completely forking away and waiting on image-tiff is to create a submodule (very C++ style) for image-tiff that we link to in our Cargo.toml. However, there is quite a big change underway over at image-rs/image-tiff#242

This is hard however because of image-tiff's architecture, and the fact that it lazily reads tiff tags on access instead of up front.

How should it read tags? Read all images, including all tags Dos-style? I think it makes sense to read only the relevant tags for the current overview (thinking COG) and geo tags. Then, I don't really see a way in which one could read the desired part of an image in fewer than 3 sequential requests (for which we know the overview we want to load). Doing some testing, I've found that finding the nth IFD normally fits within the (16 kB) buffer from the first request. Then, when reading tag data (~60 MB, further in the file), the pre-existing buffer is cleared, making further reads into other tag data slow. Now, to speed things up, I would keep the first fetch around, since it contains all IFDs, but then that is reader-implemented.

There is a slight problem also in the design there, since async doesn't mean concurrent. That is, if we make the decoder reader-agnostic (having <R: AsyncRead + AsyncSeek>) then - if I understand correctly - we cannot concurrently read tags without having multiple readers, since poll_read takes a &mut Self. What I did in image-rs/image-tiff#246 is

pub struct ChunkDecoder<R: AsyncRead + AsyncSeek> {
    image: Arc<Image>,
    reader: R,
}

and then "cloning" the reader for concurrently reading multiple chunks - which is actually not really cloning, since it only clones the current buffer and not the pending read/seek -> is that too ugly? That's what ObjectStore circumvents.

At the end of the day, I do still think that image-rs/image-tiff#250 is a way forward, where the decoding functionality is exposed without dependence on a specific reader (or creating a mem_reader) <- I think there's still quite some design needed and would rather build on top of a fork where the needed functionality from image-tiff is carved out/discovered, rather than re-implementing. That would mean putting quite a large maintenance burden here (since image-tiff is likely to change).

  1. Build on top of an image-rs/image-tiff#250 ("Extensibility: mid-level API and harmonizing encoder/decoder structs") fork, making minimal changes
  2. aiocogeo does the async reading from object_store backend
  3. georust/geotiff adds geo functionality which is backend-agnostic
    1. Maybe create a trait TiffDecoder in here for a backend and then implement the reader on top of that trait, so that we can easily switch out backends <- This actually doesn't work so well, since one is async and the other sync

Thanks @weiji14 for reading my meta and discord concerns! Indeed, I would like to be able to direct my (free time/thesis) efforts more in a direction I want to. Then, making a reader that secretly has these license terms that is then used here would seem a bit weird; would such a thing be accepted here (say, I PR a dependency for aiocog, having a goodness-clause)?

@HeroicKatora

How should it read tags? Read all images, including all tags Dos-style? I think it makes sense to read only the relevant tags for the current overview (thinking COG) and geo tags.

Comparing to a recent development in png: it might be simpler for the core data structures in image-tiff to support EAGAIN / ErrorKind::WouldBlock. That is, if the underlying reader doesn't have the necessary buffer filled and cannot fill it synchronously, it signals to the decoder. Then the decoder updates some list of required data ranges, lets the caller schedule some form of progress on it, and then continues. Run this in a loop until decoding progresses completely. We might call this a sans-IO style, but it's not forbidden to do synchronous IO, so it's a little bit of a hybrid. This presents a solution to the buffer problem: we don't need to read tags sequentially, nor do we need to read everything up-front. With a large enough buffer quite a lot of values may complete at the same time, and the decoder could also signal multiple required ranges at once.

The hard design question here is finding the right interface to communicate the needs within the file, and properly control resource usage. (We want to avoid the need for buffers to have arbitrary size; the decoder must work with a constant bound of additional in-memory data besides the requested tags and decoded data, for medical images and GeoTIFF.) The interfaces are a little brittle since the control flow is quite inverted. Maybe there's a better way to write that as a trait with higher-ranked trait bounds now.
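The loop described above can be sketched as follows. This is a toy stand-in using only std, with hypothetical names (`ToyDecoder`, `Progress`): the decoder owns no reader; when it lacks bytes it returns the ranges it needs, the caller fetches them, and the loop resumes until decoding completes.

```rust
use std::collections::HashMap;
use std::ops::Range;

enum Progress {
    NeedRanges(Vec<Range<usize>>), // decoder stalled: fetch these, then retry
    Done(u16),                     // decoded value (here: a fake IFD entry count)
}

struct ToyDecoder {
    cache: HashMap<usize, Vec<u8>>, // offset -> bytes the caller has fed in
}

impl ToyDecoder {
    fn poll_decode(&mut self) -> Progress {
        // Suppose the entry count lives at byte offset 4, 2 bytes long.
        match self.cache.get(&4) {
            Some(bytes) => Progress::Done(u16::from_le_bytes([bytes[0], bytes[1]])),
            None => Progress::NeedRanges(vec![4..6]),
        }
    }
    fn feed(&mut self, offset: usize, bytes: Vec<u8>) {
        self.cache.insert(offset, bytes);
    }
}

fn main() {
    let file = [0x49, 0x49, 0x2A, 0x00, 0x07, 0x00]; // fake header + count 7
    let mut dec = ToyDecoder { cache: HashMap::new() };
    // Drive the decoder: fetch whatever it asks for, then poll again.
    let count = loop {
        match dec.poll_decode() {
            Progress::Done(n) => break n,
            Progress::NeedRanges(ranges) => {
                for r in ranges {
                    dec.feed(r.start, file[r.clone()].to_vec());
                }
            }
        }
    };
    assert_eq!(count, 7);
    println!("entry count = {count}");
}
```

Because the "fetch" step is outside the decoder, the caller is free to satisfy the requested ranges synchronously, asynchronously, or in a batch, which is the hybrid nature noted above.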

@feefladder

So I've mumbled up something over here in the past week:

  • I looked at aiocogeo-rs and thought it didn't "look enough like image-tiff", making possible integration later more difficult
  • I put the basic idea of architecture (which is somewhat similar to image-tiff) in the tiff2 readme
  • the main idea, which is not mentioned in the readme, is of buffers and ranges:
    1. create an appropriately sized buffer
    2. read data into the buffer
    3. minimal processing to be able to proceed:
      • fix endianness if tag data (other operations are not needed)
      • create BTreeMap if ifd (endianness/checking if it's IfdEntry::Offset/::Value is difficult there; we need the IFD for getting any further, and it's not that much processing, I thought)
    4. (synchronously) check from the IFD if it's valid and what is needed to proceed
    5. fetch tag data; building Image removes IFD (map) entries and puts them in Arc/fields for fast, multiple access
    6. fetch chunk data (multiple/async)
    7. decode chunk data (CPU-bound)
    • the underlying idea was that we could pass this buffer to the networking code, but I used @spoutnik1's idea of using BufferedEntry everywhere instead of Value, because it seemed more elegant and I was working on harmonizing decoder/encoder a bit. That ended up being a rather large change which I decided to pull into my own crate, which should match image-tiff in features soon, because I'm stealing all their code 🥷. That would also make it easier for them to steal

That is, if the underlying reader doesn't have the necessary buffer filled and cannot fill it synchronously, it signals to the decoder. Then the decoder updates some list of required data ranges, lets the caller schedule some form of progress on it, and then continues.

If I understand correctly, it's something like this (here in the readme):

// how HeroicKatora would do it if I understand correctly:
#[tokio::test]
async fn test_concurrency_recover() {
    let decoder = CogDecoder::from_url("https://enourmous-cog.com")
        .await
        .expect("Decoder should build");
    decoder
        .read_overviews(vec![0])
        .await
        .expect("decoder should read ifds");
    // get a chunk from the highest resolution image
    let chunk_1 = decoder.get_chunk(42, 0).unwrap(); // the future doesn't hold any reference to `&self`
    // get a chunk from a lower resolution image
    if let OverviewNotLoadedError(chunk_err) = decoder.get_chunk(42, 5).unwrap_err() {
        // read_overviews changes state of the decoder to LoadingIfds
        decoder.read_overviews(chunk_err).await;
        //scope of async `&mut self` ends here 
    }
    let chunk_2 = decoder.get_chunk(42, 5);
    let data = (chunk_1.await, chunk_2.await);
}

So far, I'm not really using the Limits struct from image-tiff, but I think it could well be used. I'm mainly focussing on using as little (additional) memory as possible, with the assumption that metadata will be small enough. Otherwise I'd say that would be downstream, just like io and concurrency :P

@kylebarron
Member

From your readme

Similar in function and planned lifespan as arrow2 crate:

I'm not sure if you intend that, now that the arrow2 crate is deprecated and it kinda splintered the community 😅

@feefladder

ah yes, sort of: The idea was that my crate would be a try-out where the implementation is tested, to be pulled in upstream later and then deprecated. I don't know about arrow2 having splintered the community, that is not what I wanted to do, obviously :)

@feefladder

That's also why I decided not to build it on top of aiocogeo-rs, because I want to be a (rather big) changelog away from image-tiff.

If I understand correctly, some of the async code from parquet2 did get upstreamed into parquet? and then some part of a small-code-footprint-with-limited-features got taken over by some group/community?

@kylebarron
Member

What I was currently thinking of is to put all lessons learned from image-tiff into such a reader, maybe slap on a writer as well.

I think if we can keep the initial scope to COG and ignore strip TIFFs, we should be able to keep the maintenance burden low enough to be stable. Ideally we have one project in the georust sphere (not necessarily in the georust org) that implements this.

Personally, I don't see any progress on the tiff side, and I wouldn't expect or plan for any changes to be upstreamed to the tiff crate.

How should it read tags? Read all images, including all tags Dos-style?

I think it should read all IFDs up front, but of course not the image data. This means that you have all the metadata about your image from the initial load.

Then, when reading tag data (~60MB, further in the file), the pre-existing buffer is cleared, making further reads into other tag data slow.

What do you mean buffer? Are you working with a BufReader?

I want to support COGs, and that means making remote files a first-class citizen. I think the best way to do this is via the object_store crate.

if we make the decoder reader-agnostic (having <R: AsyncRead + AsyncSeek>)

AsyncSeek implies a stateful file. IMO it would be better to not expect the backing reader to be file-like.

We could make wrappers around the ObjectStore to handle buffered reading, especially for the initial IFD parsing. After we've read the metadata, we know all the byte offsets of all the tiles in the file, and have no need for any more buffering.

we cannot concurrently read tags without having multiple readers

Under my proposed usage of object_store, you'd use an Arc<dyn ObjectStore> to represent an S3/GCS/Azure/HTTP resource. That Arc is then easily cloned, so you could concurrently read tags.
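The cheap-clone pattern can be illustrated with only std types (a toy stand-in: `RangeReader` is a hypothetical trait here, and an in-memory Vec plays the role of the remote object; object_store's `ObjectStore::get_range` works analogously, but async):

```rust
use std::ops::Range;
use std::sync::Arc;
use std::thread;

// Stateless "fetch this byte range" interface: no &mut, no seek position.
trait RangeReader: Send + Sync {
    fn get_range(&self, range: Range<usize>) -> Vec<u8>;
}

// In-memory stand-in for an S3/GCS/Azure/HTTP object.
impl RangeReader for Vec<u8> {
    fn get_range(&self, range: Range<usize>) -> Vec<u8> {
        self[range].to_vec()
    }
}

fn main() {
    let store: Arc<dyn RangeReader> = Arc::new((0u8..=9).collect::<Vec<u8>>());
    // "Concurrent tag reads": each task gets its own cheap Arc clone.
    let handles: Vec<_> = [0..2, 4..6, 8..10]
        .into_iter()
        .map(|r| {
            let store = Arc::clone(&store);
            thread::spawn(move || store.get_range(r))
        })
        .collect();
    let results: Vec<Vec<u8>> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert_eq!(results, vec![vec![0, 1], vec![4, 5], vec![8, 9]]);
    println!("ok");
}
```

The key property is that the handle is immutable and shared, so no reader state has to be cloned or coordinated, which is exactly what an AsyncRead + AsyncSeek reader cannot offer.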

  1. Build on top of an image-rs/image-tiff#250 ("Extensibility: mid-level API and harmonizing encoder/decoder structs") fork, making minimal changes
  2. aiocogeo does the async reading from object_store backend
  3. georust/geotiff adds geo functionality which is backend-agnostic

You mean that aiocogeo should be like async-tiff, and delegate geo-related work to geotiff?

  1. Maybe create a trait TiffDecoder in here for a backend and then implement the reader on top of that trait, so that we can easily switch out backends <- This actually doesn't work so well, since one is async and the other sync

You may want to look at ArrowReaderBuilder, which is a wrapper around both a sync and async interface to work with Parquet files.

(say, I PR a dependency for aiocog, having a goodness-clause)?

I think aiocogeo would likely want to stick with a generic, popular, free license like MIT or Apache 2.

@kylebarron
Member

For 1, we'd want to be done with 3 requests, while not loading the complete tag data.

I think the initial implementation for simplicity should load all tag data. I think a later implementation could improve performance by automatically loading all inline tag data and storing references to the byte ranges of values for large tags. But at least in a COG all IFDs should be up front I believe, so I think you gain a lot in simplicity by reading all tags, for not a lot of performance hit.

  • I looked at aiocogeo-rs and thought it didn't "look enough like image-tiff", making possible integration later more difficult

Personally, given the huge apparent lack of maintenance bandwidth in tiff, I'd rather start anew without expectation of ever merging back into tiff.

And aiocogeo-rs isn't totally new in architecture. It's largely a port of https://github.com/geospatial-jeff/aiocogeo, and @geospatial-jeff has been working with COGs for a long time, so I really trust his direction.

  • the main idea which is not mentioned in the readme is of buffers and ranges:

I think it's simpler for the reader itself to have no knowledge of buffers, but rather just to fetch specific byte ranges as needed. A wrapper around the raw byte range reader can provide request buffering for IFDs. Essentially this Cursor object in aiocogeo-rs.
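One way to picture that wrapper (a toy sketch with hypothetical names, not aiocogeo-rs's actual Cursor): the raw reader only knows "fetch this byte range", and a buffered layer on top keeps the initial header prefix around so repeated IFD reads don't refetch it.

```rust
// In-memory stand-in for a remote object that serves byte ranges.
struct RangeStore(Vec<u8>);

impl RangeStore {
    fn get_range(&self, start: usize, len: usize) -> Vec<u8> {
        self.0[start..start + len].to_vec()
    }
}

struct BufferedCursor<'a> {
    store: &'a RangeStore,
    prefix: Vec<u8>, // first fetch, kept around for IFD parsing
    fetches: usize,  // counts remote requests, to show the buffering effect
}

impl<'a> BufferedCursor<'a> {
    fn new(store: &'a RangeStore, prefix_len: usize) -> Self {
        let prefix = store.get_range(0, prefix_len.min(store.0.len()));
        Self { store, prefix, fetches: 1 }
    }
    fn read(&mut self, start: usize, len: usize) -> Vec<u8> {
        if start + len <= self.prefix.len() {
            self.prefix[start..start + len].to_vec() // served from the buffer
        } else {
            self.fetches += 1;
            self.store.get_range(start, len) // tile data: fetch directly
        }
    }
}

fn main() {
    let store = RangeStore((0u8..100).collect());
    let mut cur = BufferedCursor::new(&store, 16); // 16-byte "header" prefix
    let _ifd = cur.read(4, 8);    // inside prefix: no new request
    let _ifd2 = cur.read(0, 12);  // still no new request
    let _tile = cur.read(50, 10); // beyond prefix: one extra request
    assert_eq!(cur.fetches, 2);
    println!("fetches = {}", cur.fetches);
}
```

After the metadata phase, tile offsets are known exactly, so reads past the prefix go straight through with no buffering, matching the point above.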

@kylebarron
Member

kylebarron commented Nov 7, 2024

If I understand correctly, some of the async code from parquet2 did get upstreamed into parquet?

I'm not sure if it ever got upstreamed or if parquet separately implemented async.

@feefladder

feefladder commented Nov 11, 2024

@kylebarron thanks for the in-depth answers. I've grouped them loosely by topic.

image-tiff bandwidth and crate organization

What I was currently thinking of is to put all lessons learned from image-tiff into such a reader, maybe slap on a writer as well.

I think if we can keep the initial scope to COG and ignore strip TIFFs, we should be able to keep the maintenance burden low enough to be stable. Ideally we have one project in the georust sphere (not necessarily in the georust org) that implements this.

Personally, I don't see any progress on the tiff side, and I wouldn't expect or plan for any changes to be upstreamed to the tiff crate.

As far as I can see, the difference between strip and tiled tiffs is quite well covered in having a somewhat generic ChunkOffsets and ChunkType. The decoding steps (creating a decompressing reader for a compressed chunk, reading that into a decompressed buffer) for image vs strips are not that different. I did think of slicing it out - and maybe will - but so far it was simply easier to keep things more or less as they were copied over from image-tiff.

  • I looked at aiocogeo-rs and thought it didn't "look enough like image-tiff", making possible integration later more difficult

Personally, given the huge apparent lack of maintenance bandwidth in tiff, I'd rather start anew without expectation of ever merging back into tiff.

And aiocogeo-rs isn't totally new in architecture. It's largely a port of https://github.com/geospatial-jeff/aiocogeo, and @geospatial-jeff has been working with COGs for a long time, so I really trust his direction.

Ok, I wasn't completely clear, and maybe that deserves its own thread. The main thing aiocogeo lacked, which image-tiff had figured out, was bigtiff support - and I was more acquainted with that crate. Hence why I decided to copy over from image-tiff. But then image-tiff doesn't support transparency masks and indeed has low bandwidth. What I did in the end - which is different from both - is take @spoutnik1's approach of Enum ProcessedEntry{FLOAT(Vec<f32>),...}, because having a byte-array gives alignment issues. I did think this was ugly, since it requires an allocation per entry, but in case endianness matches, it can be created directly from memcpy. Then also, I have fewer entries in the Image struct and keep non-essential entries in the ifd: Directory field. Anyways, I think the architecture is largely similar, and now I will base more of what I'm doing on aiocogeo :)

I think aiocogeo would likely want to stick with a generic, popular, free license like MIT or Apache 2.

Hmmyeah, I get that... ehh... well... Then I think I'll stick with tiff2 for now, but hereby give written permission to copy anything over from tiff2 into aiocogeo-rs. There should be a nice way to figure things out, based on reciprocity, mutual understanding and trust.

  1. Build on top of an image-rs/image-tiff#250 ("Extensibility: mid-level API and harmonizing encoder/decoder structs") fork, making minimal changes
  2. aiocogeo does the async reading from object_store backend
  3. georust/geotiff adds geo functionality which is backend-agnostic

You mean that aiocogeo should be like async-tiff, and delegate geo-related work to geotiff?

Yes, that's what I was thinking of. Or more like tiff2/image-tiff <- geotiff <- aiocogeo, where tiff2/image-tiff provides basic building blocks for reading tiffs (e.g. the Ifd, Entry/Offset, and Image structs and functionality to decode them), then geotiff adds functions for geospatial analysis of the data, and aiocogeo does the specific implementation of object_store.

buffering and IFD reading

How should it read tags? Read all images, including all tags Dos-style?

I think it should read all IFDs up front, but of course not the image data. This means that you have all the metadata about your image from the initial load.

Then, when reading tag data (~60MB, further in the file), the pre-existing buffer is cleared, making further reads into other tag data slow.

What do you mean buffer? Are you working with a BufReader?

two things:

  1. I was initially working with georaster, that used http_range_client which used an internal buffer, which gets cleared.
  2. I do think we should keep an initial buffer around for reading in IFDs (only inline tag data). Then the rest is - indeed - better served with a stateless API, like you said:

I think it's simpler for the reader itself to have no knowledge of buffers, but rather just to fetch specific byte ranges as needed. A wrapper around the raw byte range reader can provide request buffering for IFDs. Essentially this Cursor object in aiocogeo-rs.

I think the initial implementation for simplicity should load all tag data. I think a later implementation could improve performance by automatically loading all inline tag data and storing references to the byte ranges of values for large tags. But at least in a COG all IFDs should be up front I believe, so I think you gain a lot in simplicity by reading all tags, for not a lot of performance hit.

  1. I've already implemented the optimization over inline vs offset tag data, based on @HeroicKatora's comment on inversion of control. I think the distinction between inline vs offset (or at least the offset struct) needs to be stored anyway, so that we can collect all required tags to be loaded and then make a get_ranges request for the tag data. If that is done, then it is equally easy to put it in an enum IfdEntry{Offset,Value}, where Directory := BTreeMap<Tag, IfdEntry>
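That enum could look roughly like this (a sketch; the field shapes and `ranges_to_fetch` helper are illustrative, not tiff2's actual API): small values are decoded inline, large values keep an offset plus byte count, so all out-of-line tag data can be gathered into one get_ranges-style request.

```rust
use std::collections::BTreeMap;

type Tag = u16;

#[derive(Debug, PartialEq)]
enum IfdEntry {
    Value(Vec<u8>),                     // fits inline in the IFD entry
    Offset { offset: u64, bytes: u64 }, // stored elsewhere in the file
}

type Directory = BTreeMap<Tag, IfdEntry>;

/// Collect the byte ranges still needed before the directory is fully usable.
fn ranges_to_fetch(dir: &Directory) -> Vec<(u64, u64)> {
    dir.values()
        .filter_map(|e| match e {
            IfdEntry::Offset { offset, bytes } => Some((*offset, *bytes)),
            IfdEntry::Value(_) => None,
        })
        .collect()
}

fn main() {
    let mut dir = Directory::new();
    dir.insert(256, IfdEntry::Value(vec![0, 4]));                   // ImageWidth, inline
    dir.insert(324, IfdEntry::Offset { offset: 4096, bytes: 800 }); // TileOffsets
    dir.insert(325, IfdEntry::Offset { offset: 4896, bytes: 800 }); // TileByteCounts
    assert_eq!(ranges_to_fetch(&dir), vec![(4096, 800), (4896, 800)]);
    println!("fetch: {:?}", ranges_to_fetch(&dir));
}
```

Walking the map once yields every outstanding range, which can then be fetched in a single batched request before building the Image struct.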

object_store, asyncness

I want to support COGs, and that means making remote files a first-class citizen. I think the best way to do this is via the object_store crate.

Under my proposed usage of object_store, you'd use an Arc<dyn ObjectStore> to represent an S3/GCS/Azure/HTTP resource. That Arc is then easily cloned so you could concurrently read tags.

Yes, I'm fully on board with that now. I made a homebrew trait with a blanket impl for object_store, which I haven't put into an Arc in the decoder yet, since I haven't tested that part yet and the compiler hasn't yelled at me about it.

  1. Maybe create a trait TiffDecoder in here for a backend and then implement the reader on top of that trait, so that we can easily switch out backends <- This actually doesn't work so well, since one is async and the other sync

You may want to look at ArrowReaderBuilder, which is a wrapper around both a sync and async interface to work with Parquet files.

Oh wow, they let type inference precipitate all the way up over build(). Interesting... I think step 1. is to make the two classes that come out of the different build() methods first and then maybe add it as a convenience method.
