
Async I/O possibilities #426

Open
ejona86 opened this issue Feb 8, 2021 · 106 comments

@ejona86 (Collaborator) commented Feb 8, 2021

An async I/O discussion started at #412 (comment) . It was an idea to free up a lot of memory from write_bc, allowing it to be reused for reading. Ideally it would allow us to buffer both sides of an entire track and absorb all write delays, until the track changes.

I've created a basic cooperative scheduler and an fs async API on top. I ported HFE (reading and writing) to async to see how it may look: master...ejona86:io-thread . I implemented HFE in a KISS way. While the HFE code seems to work after some very basic testing, the point is the approach, and not the concrete code organization; lots of the code needs cleaning up.

Based on my flash drives' performance and full-speed USB, I think ED and TD should be considered out of reach for HFE independent of implementation. USB limits reading an ED track to at least 66 ms, assuming no overhead and latency. In my testing I think we could read an ED track in ~150 ms, which leaves no room for writes. Even 100 ms would push it. We'd need more RAM to support them with HFE. ED should be easily supported with IMG instead. And TD would need us to not load both sides of the track.
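For reference, the 66 ms floor appears to come out of back-of-envelope arithmetic like this (my numbers; I'm assuming an HFE ED track side of ~50 kB, i.e. one bit per bitcell at a 2 Mbit/s bitcell rate over a 200 ms revolution):

```
50 kB/side x 2 sides          = 100 kB per track
full-speed USB raw: 12 Mbit/s = 1.5 MB/s
100 kB / 1.5 MB/s             = ~66 ms, before any protocol overhead
```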

I was able to steal enough memory to support 1.44 MB HD floppies with HFE, fully buffering both sides of the track. It is tight as expected, although there are still some options to shift memory around a little, like swapping to 8-bit DMA to save 2 kB and using that memory elsewhere. I don't know whether we should assume we always have enough memory, though. I have an HFE buffering approach in mind that would support HD with HFE without fully buffering the track, but it would need to continually read the track from flash. Is it worth investing much time to support such a thing? The approach would lend itself to also support changing tracks without waiting for writes to finish committing to flash, although write latencies would still impact it.

Main questions:

  1. Does async seem worth the effort?
  2. Does the implementation approach look half-way sound? Or will we want to do it a substantially different way? I'm more concerned about the threading/io than HFE.
  3. Is it fair to limit ED/TD to IMG? How serious would we be about supporting TD?
  4. How concerned are we about the high memory use of HD floppies with HFE? Is it worth letting them work similarly to today, where they constantly read from flash? (e.g., if we see high value in not constantly reading from flash, then we may keep memory available. Or we may encourage IMG for HD even if the HFE support code was there.)
@keirf (Owner) commented Feb 8, 2021

What can I say, that looks pretty neat for a lashup implementation.

The whole point is to be able to buffer at least a whole side in RAM, right, and preferably both sides of a DS image? If we are having to endlessly stream in from USB, there is no safe time to drop in writes without 'suspending' reads and thus delaying index pulses, I believe? In short, I don't understand your proposed not-fully-buffering scheme: Could you expand on it a bit? My current belief is that this should concentrate on what we can fully buffer, while supporting everything else no worse than we do right now.

What I really care about for full buffering is IMG up to ED, and HFE up to DD. I don't think there's much (any?) HD HFE out there that is too esoteric to support easily with IMG? That said your rejigging of memory isn't that bad, although it screws the alt/logfile build a bit -- I think we can probably make async-vs-not dynamically determined based on available memory, and maybe do less async on the alt/logfile build (albeit that makes the build less representative and thus less useful in that way). [EDIT: Or make it all async but with similar behaviour to what we have now when low on memory?]

Anyway it is basically a good idea. A bunch of code probably needs generalising out of HFE if possible, the buffering would be nice to do in the buffercache layer, integrate that with the async layer a bit more... Well I'm sure you know all this! It would be splendid to lean on the read/write_data special-case buffers much less and the buffercache much more, in general in the existing code :)

There are also probably some corner cases to care about, like not killing the io thread while it's yielding during a USB transaction. Again, I don't doubt you know this.

Yeah... This is awesome overall, thanks! This is the first time I've had someone since the Xen project who can contribute useful heavy lifting, and those guys were payrolled by their companies. It is also very healthy to get other eyes on a sizeable project like this because bad design and implementation decisions tend to fester away, unacknowledged or ignored by the original author.

@keirf (Owner) commented Feb 8, 2021

By the way, is the naked attribute sugar for the kind of bare asm function I construct for call_cancellable_fn?

@keirf (Owner) commented Feb 8, 2021

Oh I like the fact that write-ahead-of-read should be rare (line 477, hfe.c). I was worried about that case but you're right: If readahead is continuing during write, it makes sense that readahead easily keeps ahead of the write cursor, just as it does the read cursor, since they are the same cursor. D'oh. :)

@keirf (Owner) commented Feb 8, 2021

One caveat for your particular area of interest: Of course the writes still race seek operations, the writes are occasionally very slow, and you are going to possibly still have trouble issuing timely index pulses while draining writes (since you have no read data to serve up, and maybe potentially no buffer space to stash writes). I think your hard sector scenario is still hard when your bulk data is parked on crappy Flash behind a slow half-duplex link. But I'm a pessimist. ;)

@ejona86 (Collaborator, Author) commented Feb 9, 2021

> In short, I don't understand your proposed not-fully-buffering scheme: Could you expand on it a bit?

> EDIT: Or make it all async but with similar behaviour to what we have now when low on memory

That edit is along the lines I was thinking would be possible, but with aggressive reads.

Basically, the approach of aggressively pre-filling reads still works even if we can't buffer an entire track; it's just less good. Let's say read_data is 36 KB. That's still 150 ms worth of buffer. We could buffer 150 ms of reads, which consumes ~60 ms of time to pull from flash. If there were any writes, we have at least 90 ms of time to write back some of it to open up buffer for reading. I view it as two superimposed circular buffers, a read buffer and a write buffer. As the read advances, that opens up write buffer, and as the write advances it opens up read buffer.

Even with the fully-buffered-track approach, the mental model of the circular buffer is helpful. It allows us to notice that we could actually seek as soon as there are few enough writes remaining to leave a substantial read buffer. And now hard sectors don't look as glum.
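To make the superimposed-buffers picture concrete, here's a minimal sketch of the cursor arithmetic (illustrative names and sizes, not the actual branch code):

```c
#include <stdint.h>

/* One buffer, two logical rings. Cursors only advance (modulo wrap):
 *   wr_cons <= rd_cons <= rd_prod
 * [wr_cons, rd_cons) holds dirty data awaiting writeback to flash;
 * [rd_cons, rd_prod) holds read-ahead not yet consumed by emulation. */
struct ring {
    uint8_t buf[36*1024];
    uint32_t rd_prod;  /* read-ahead from flash has filled up to here */
    uint32_t rd_cons;  /* emulation has consumed up to here */
    uint32_t wr_cons;  /* dirty data has been written back up to here */
};

/* Read-ahead may only claim space already written back: as the write
 * cursor advances it opens read buffer, and vice versa. */
static uint32_t read_space(const struct ring *r)
{
    return sizeof(r->buf) - (r->rd_prod - r->wr_cons);
}
```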

> My current belief is that this should concentrate on what we can fully buffer, while supporting everything else no worse than we do right now.

Sounds good. I'm trying to make this as complicated as necessary, but no more.

> I don't think there's much (any?) HD HFE out there that is too esoteric to support easily with IMG?

My thoughts as well.

> That said your rejigging of memory isn't that bad, although it screws the alt/logfile build a bit

Oh, the log file. Right. That's why it is so large.

There's actually some wiggle room at the moment. write_data is 51076 bytes and HD HFE v2 needs 50176, so there's 900 bytes of slack. I had cut into the console before I moved the async cb handling back onto the main thread. Moving the cb handling made me feel comfortable reducing thread1_stack to 512 bytes. I'm still also not quite sure how big write_bc will need to be in the end.

> the buffering would be nice to do in the buffercache layer, integrate that with the async layer a bit more... Well I'm sure you know all this!

Actually, I hadn't considered using the buffercache layer. I had been pretty happy to avoid it, as it gave me "this is all your memory" "simplicity" and avoided duplicating memory. And that was probably the right thing for the PoC. But now that you mention it, I could see how that could work. Devil's in the details, but definitely worth looking into.

> By the way, is the naked attribute sugar for the kind of bare asm function I construct for call_cancellable_fn?

Yeah. It is "let C handle the declaration and let asm do the rest." More precisely, it means "don't emit the function prologue and epilogue."
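For illustration, a minimal naked function of the sort this enables (a sketch of a context switch, not the actual code from the branch):

```c
#include <stdint.h>

/* With 'naked', GCC emits no prologue/epilogue: the asm body owns the
 * stack and registers entirely, while C callers still see an ordinary
 * declaration (r0 = save_sp, r1 = new_sp per the ARM calling convention). */
void __attribute__((naked)) ctx_switch(uint32_t **save_sp, uint32_t *new_sp)
{
    asm volatile (
        "push {r4-r11, lr}\n"  /* save callee-saved state by hand */
        "mov  r2, sp\n"
        "str  r2, [r0]\n"      /* *save_sp = old stack pointer */
        "mov  sp, r1\n"        /* adopt the other thread's stack */
        "pop  {r4-r11, pc}\n"  /* restore its state and 'return' there */
    );
}
```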

> I think your hard sector scenario is still hard when your bulk data is parked on crappy Flash behind a slow half-duplex link. But I'm a pessimist. ;)

Yeah, this is "phase 1" :). Best not to throw hard sectors into the mix yet. Eric looks at his imaginary chart. Yeah, those are phase 3 or later.

The slow, half-duplex, non-pipelined link is excruciating combined with flash latencies. The part that had me more strongly consider this approach is the high read latencies after a write that can occur; that made me reevaluate. That said, my hard sector PR is pretty fair already; it's just so much effort to fight the blocking to track down what is causing latency blips. Especially since printk consumes a lot of CPU and SWD is disabled.

@keirf (Owner) commented Feb 9, 2021

Okay, let me do an answer to minor points, and I'll follow up with what I think are the biggest up-front decisions to be made.

  1. Aggressive writes: You may have 90ms of slack in which to write, but you cannot guarantee every stick will ack an arbitrary write in 90ms. Latency spikes are common. It does depend on stick. It may depend on write size (larger writes may behave better). But I would probably want to make the aggressiveness configurable, and by default I would writeback only when the whole track is buffered, or we're seeking, as these allow us a default behaviour of don't start a new track until writes are done. Which is still a big improvement on FlashFloppy today.

  2. alt/logfile: I wouldn't let this tail wag the dog. It's not used that often. That said, I'm not sure if we really care about 300RPM 500kbps HFE. See next message, as this is probably a big up-front decision.

  3. naked: That's cool, it lets the C compiler see the decl which is an ugly aspect of direct asm coding, particularly for static functions. I will use this. :)

  4. SWD: Well tbh I've never used SWD except for device programming. Is it useful for extracting logging too? Obviously I can hook up gdb, but I don't really use debuggers much. But if there's a way to extract logging on Linux that would be cool! Yeah SWD gets disabled, but not really for any good reason. One pin is used in the corner case chgrst=pa14 in FF.CFG. Certainly we could do a SWD-enabled build if that's handy? And I'd consider code to log through that channel too.

@keirf (Owner) commented Feb 9, 2021

The big decisions to be made.

  1. Buffer cache or ring: I think this is the Big One. It informs the whole shape of development, really. So my view was that the buffercache would sit on top of the FS stack (it's underneath at the moment, and might still appear there for fs metadata caching) and would be the asynchronous interface, i.e. the IO thread would be a buffercache server.

Upsides: Everything is pretty and unified. One copy of data (maybe with very minimal stream buffering on top for some image handlers). Maybe simplifies image handlers by pulling code into a shared subsystem.

Downsides: The buffercache grows many new appendages and gets much more complicated: dirty buffers, locked buffers (probably), ... And we'd still need a conceptual ring in the floppy layer to track how much of a track is fetched (and possibly, where the buffers are). Maybe something unified in floppy subsystem makes more sense than buffercache subsystem. Metadata overhead per block probably higher than a simple contiguous ring. Probably need either de-fragging or we need iovecs down the IO stack to do multi-block IO (the latter wouldn't actually be that bad to do imo).

  2. Do we care about 300RPM 500kbps HFE: This is the largest/tightest case we could care about, and we'd need to implement with the RAM limitation carefully in mind. Compared with not bothering and falling back to something like what we do now (large write ring, so that format-track operations remain reliable). So, who might care? I think most HD HFE use cases are either scenarios where the writeback time doesn't matter (host leaves it to its FDC to time out, which is based on index pulses, which we suppress), or the disk format is standard (can use IMG, see KORG Triton synthesizer: USB write latency problems? #362).

What about hard sectored images? I suppose the only case that might be 500kbps would be eight-inch MFM. But that would spin at 360RPM and so require only ~42kB of RAM. Much more manageable!

If we don't need to achieve 50kB available, we have a lot more latitude in design and implementation.

@ejona86 (Collaborator, Author) commented Feb 13, 2021

Aggressive writes. I have no problem with the conservative writes being a default. If we did aggressive writes we'd probably want some display feedback in the event of a buffer underrun.

SWD. No, it doesn't do logging, which means it's only for dire situations. I saw the usage of PA14 and hacked around it, but still no dice. It didn't boot with 2<<24 in afio->mapr. There are still plenty of ways I could be messing it up, as I've never truly used the thing, and the datasheet also mentions a handshake starting with JTAG and transitioning to SWD.


Buffer cache or ring. The current buffer cache under the FS could work for buffering writes, but looks ill-suited for reads. It prevents reading from the cache while a readahead is ongoing, and would require larger buffers in the image code just to satisfy fatfs. This approach means copies of the data into/out of the cache. We could hack around fatfs by putting the volume code into a "fake read" mode where it queues the read and returns fake read data. But there'd be no transparency into the cache, and calculating the amount of memory necessary for the cache is harder since metadata is mixed in. We'd probably want iovecs.

I think we really want something above the fs. I think a "ring cache" utility is the most direct approach. Most directly what we need, most memory efficient, and most versatile, with the downside of more complicated API interactions. I do think a sector-based approach could work, but I think it'd require copies to/from cache and it might use a ring internally to avoid fragmentation.

Do we care about 300RPM 500kbps HFE. It sounds like you are leaning toward supporting HFE in this case, but not bothering with the fully-buffered track approach. That sounds fair; I also thought the tight memory was quite restricting without much practical benefit. Since I don't think we want two copies of the HFE code (fully buffered track + current approach), that implies the caching approach will gain some complexity to handle partial tracks. That's fine with me. We can size things such that write_bc + write_data/2 >= track_side_size, which allows buffering a track-side's worth of writes while also extending the amount of read buffer.

@keirf (Owner) commented Feb 15, 2021

SWD: Ah that's a shame. I haven't played with SWD except for programming, so I can't authoritatively say what could be going wrong in setting up SWD. I'm pretty sure that AFIO_MAPR field can only be set once until reset though. So if you touched it twice that could be a problem. Anyhow, if we don't know how to do logging via SWD, it's not of much interest to me personally so I guess it's moot.

Cache vs Ring: This has been a useful discussion as I would have probably gone with the superficial attractiveness of a 'unified buffer cache' and hacked it to work, which of course could be done at the cost of time and complexity! I suppose the advantages are that the cache can remember previous tracks, share with metadata, and thus overall be a little more efficient on average. But the average case is after all not what we're really interested in, but rather hitting real-time goals. And a cache is probably not the right tool for that job. So I have to agree with you, and having thought a while about it now I'm increasingly enthusiastic about that choice. It's going to be a lot simpler code too.

When A Cylinder Doesn't Fit: Ok, so specifically we don't care about fitting HD HFE. I think we agree on that. So we just need to take care with fallback behaviour. Sizing the rings as you suggest makes sense. We'll also probably have to flush writes synchronously, at least in conservative-write mode. On all image types (except HFE) we can buffer both sides of a DS disk where it fits, degrade to buffering one side where only that fits (and treat side change like a seek), and something single-sided and HFE-fallback-like (as you suggested) that is really no worse than we have now for giant tracks (some people are configuring giant IMG images, for example, for better or worse).

Who Codes It Up: Well, I'm figuring you want the okay to pursue this? It's fine with me if so, we can stick it in a branch and it can aim for a new v4.x release series. It'll need carving up into bite-sized pieces for review, to some extent. It will also need to go through FF_TestBed automated tests, which I can push it through since you won't have the setup for that. Anyway, nothing surprising here I don't think... If it's going well perhaps I should give you direct access rights to the repo.

@ejona86 (Collaborator, Author) commented Feb 16, 2021

Cache vs Ring: I've split out the ring logic from HFE into a separate API. It cleans up HFE. I've gotten far enough with the limited memory support that it's clear HFE will be minimally impacted by it.

When A Cylinder Doesn't Fit: My current plan for formats with separate track sides is for the ring reading code to update a "shadow" ring in parallel to the main ring. If the side changes, it can be "swapped in" (by changing pointers or some such; details TBD). If there's not enough memory for both sides, then it just goes back to "single ring only" mode. Having both rings is probably annoying in the ring code, but probably not hard. I'll reevaluate once I see how the code shapes up. (The main alternative to the shadow ring is to interleave the sides.)
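A sketch of the swap-in, under the pointer-exchange interpretation above (hypothetical names; as noted, details TBD):

```c
/* On a side change with both sides buffered, the shadow ring (filled in
 * parallel for the other side) simply becomes the live ring. */
static void ring_swap_sides(struct ring **live, struct ring **shadow)
{
    struct ring *tmp = *live;
    *live = *shadow;
    *shadow = tmp;
}
```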

Who Codes It Up: I've been cleaning up what I had and have been pushing it to my master...ejona86:io-thread branch. It's honestly looking like it'll be an easy review, other than the massive number of places I could introduce bugs. Be aware that the constrained-memory changes are causing me to change essentially every line of the ring cache, as I have it use the ring cursors instead of absolute offsets, and I'm swapping it to tracking bytes instead of sectors. So that change will be quite noisy. I'm very glad you have automated testing.

@ejona86 (Collaborator, Author) commented Feb 16, 2021

FF_TestBed looks very straightforward. If you have an extra PCB that's the cleanest, but this would be trivial with jumper wires. I have some 1x10 and 1x7 housings, so it wouldn't even be that janky. I should be able to set that up this week.

@keirf (Owner) commented Feb 16, 2021

Okay that all sounds good. To be honest you're obviously fine to crack on and ask questions if you need to. If there are any areas you'd specifically like me to comment on, or image handlers you'd like me to port across to the new environment, let me know.

FF_TestBed is very handy, I designed it to be super simple. No IRQs, no heap, very easy to hack in new tests or diagnostics. So it is out of the box a simple set of regression tests, but you can see on branch speed_test it was easy to hack in logging on read-after-write latencies for comparison with "competing Gotek products". That sort of thing could make it into master branch, as could some tougher regression tests -- the current set are pretty vanilla, they don't bring much Pain, but were purely to pick up on the really dumb bugs I was regularly introducing a year or so back during more active development.

I do have... one or two PCBs... Ok maybe 100 actually. I panellised them and got 10x10cm PCB set done, and these are hardly popular with general FF users so they are my go-to for shimming flat-pack furniture and the like. ;) I'm happy to post you a couple if you email me a shipping address. They shouldn't take too long to get to you and it's a super-neat setup then.

@ejona86 (Collaborator, Author) commented Feb 22, 2021

I've implemented track-larger-than-ring support and cleaned up the branch. The ring_io code is more complex than I'd like, but about as complex as I expected.

I put an 'async' flag on image_handler so now the arena allocation is auto-tuned and non-async images are supported. But Direct Access is broken if disk_handler is async because write_bc is too small, which blocks FF_TestBed testing. So I think adding async versions of disk_* I/O and DA async support are the main blockers at this point.

@keirf (Owner) commented Feb 22, 2021

That's great! Maybe you can do a broader reset of the rings when switching to DA, to get into non async mode? On exit from DA we always fully remount so that direction is already covered. It's possibly an alternative to implementing more async stuff. Whatever you think is better, neater, etc.

@ejona86 (Collaborator, Author) commented Feb 22, 2021

> Maybe you can do a broader reset of the rings when switching to DA, to get into non async mode?

I'll give that a try. That'd be more expedient. I've been a little concerned about going into/out of DA mode and losing any async writes. I feel like that's more likely exiting DA mode than entering, so sync I/O is probably a plus. (I've thought I might need a "flush" function or something on image_handler, but haven't really wanted to go down that road unless I have to.)

@keirf (Owner) commented Feb 22, 2021

The entry to DA mode is pretty shoddy really, tucked away in image.c. The aim originally (like, day one of FlashFloppy) iirc was that it would be possible to seamlessly move between DA and non-DA modes. That was killed off because DA mode changes the FAT filesystem and/or selected image file and so we really need a "full reset" on exit. But the entry to DA has forever remained a bit buried. It could at least perhaps be pulled out to floppy.c and may be easier to hack on there?

Also, more broadly, it would be nice to support DA mode even when no stick is inserted, or no image mounted (eg ejected). It feels a bit icky that an image must be mounted to get into DA mode. But that would need a method to force main.c out of navigation mode, and is certainly out of scope of your current work. Still it indicates the direction DA support may go and that what is there currently in terms of integration with the rest of FlashFloppy is not to be greatly respected or tiptoed round.

@ejona86 (Collaborator, Author) commented Feb 24, 2021

I received the FF_TestBed adapter board and attached male+female connectors to it; pretty easy and very clean. It took me a bit to figure out that the test bed wants the DUT to be in Shugart mode and to get the S0/1 jumper right. Testing against master, hfe and adf_test(11) work fine, but dsk hangs, adf_test(22) spams `WARN at amiga.c:61: "index.count >= 2"`, and img_test() spams `Long flux @ dma=454 bc=7749: 46658-46370=288 / 36` and `WARN at floppy_generic.c:53: "TRUE"`. So I still need to play with it a bit before I can test out my changes.

@keirf (Owner) commented Feb 24, 2021

It can be a bit fussy. I will add the jumper settings to the "Running The Testbed" wiki page, where they properly belong. I will also give it a whirl myself and let you know how I get on.

@keirf (Owner) commented Feb 24, 2021

Well, I remember why I don't run it as often as I should: I always find it a pain to set up. The TestBed and DUT can backpower each other via the ribbon, which can be confusing, plus for some reason (perhaps my new PC) I find it really hard to flash via USB-TTL; it takes me loads of attempts. Anyway, I did get it running. Here's a full round:

```
*** ROUND 11 ***

HFE TEST:
Motor-to-RDY: OFF=6us, ON=200ms
We have 4654 4489s

DSK TEST:
Period: 272 ms ;; GAP3: 84
Index delayed by 22 ms
MFM 8192 r/w sector - OK
Period: 342 ms ;; GAP3: 84
Index delayed by 3 ms
MFM 512 r/w sector - OK
Period: 342 ms ;; GAP3: 84
Index delayed by 3 ms
MFM 512 r/w sector - OK

ADF TEST:
Amiga DD - OK

ADF TEST:
Amiga HD - OK

IMG TEST:
Period: 199 ms ;; GAP3: 84
Index delayed by 3 ms
MFM 512 r/w sector - OK
Period: 199 ms ;; GAP3: 84
Index delayed by 3 ms
MFM 512 r/w sector - OK
Period: 199 ms ;; GAP3: 21
Index delayed by 2 ms
FM 256 r/w sector - OK
Period: 199 ms ;; GAP3: 84
Index delayed by 20 ms
MFM 8192 r/w sector - OK
```

I think I'm going to solder in a power feed from the TestBed for the DUT. That would be saner.

EDIT: Also the MOTOR-to-RDY line needs a modded Gotek DUT, to wire through the MOTOR signal. A normal Gotek will say MOTOR ignored.

@keirf (Owner) commented Feb 24, 2021

I've revamped the wiki page https://github.com/keirf/FF_TestBed/wiki/Running-The-TestBed

Please let me know if you think anything is still unclear.

@ejona86 (Collaborator, Author) commented Feb 25, 2021

I got FF_TestBed up and running. I'm seeing one DSK error about expecting 16 sectors, but parts of DSK seem to still be okay and everything else looks fair. This is "good enough" to notice if I break the world. My earlier problem was messing up when preparing the USB drive.

I have no problem programming with my USB-TTL or ST-Link. The only problem I had was re-learning how to unlock the flash. I've seen the backpowering happen with the Greaseweazle and FluxEngine, so I was prepared for that, but it didn't cause any trouble.

Fixing up image_setup_track looks nontrivial, so I'll still need to figure out an approach there. With a huge hack, I am able to test my code and HFE passes. IMG is hanging for MFM 8192, which is surprising; something to look into.

Huge hack:

```diff
diff --git a/src/image/image.c b/src/image/image.c
index 4145296..bcb892d 100644
--- a/src/image/image.c
+++ b/src/image/image.c
@@ -101,6 +101,8 @@ bool_t image_valid(FILINFO *fp)
     return FALSE;
 }
 
+static bool_t da_mode;
+
 static bool_t try_handler(struct image *im, struct slot *slot,
                           DWORD *cltbl,
                           const struct image_handler *handler)
@@ -120,6 +122,10 @@ static bool_t try_handler(struct image *im, struct slot *slot,
     im->stk_per_rev = stk_ms(200);
 
     im->disk_handler = im->track_handler = handler;
+    if (da_mode) {
+        im->track_handler = &da_image_handler;
+        im->nr_sides = 1;
+    }
 
     mode = FA_READ | FA_OPEN_EXISTING;
     if (handler->write_track != NULL)
@@ -259,12 +265,16 @@ bool_t image_setup_track(
 
 #if !defined(QUICKDISK)
     if (!in_da_mode(im, track>>1)) {
+        da_mode = FALSE;
         /* If we are exiting D-A mode then need to re-read the config file. */
         if (h == &da_image_handler)
             return TRUE;
         h = ((track>>1) >= im->nr_cyls) ? &dummy_image_handler
             : im->disk_handler;
     } else {
+        da_mode = TRUE;
+        if (h != &da_image_handler)
+            return TRUE;
         h = &da_image_handler;
         im->nr_sides = 1;
     }
```

@keirf (Owner) commented Feb 25, 2021

Good news on the TestBed. Weird you still have some errors, but it's good enough for now. You can see if you break the world and I can also run tests in due course too. And maybe make some diabolical new ones.

Yes it might be better to teach floppy.c and/or floppy_generic.c a bit more about DA and mode switching. Or use a new return code from image_setup_track to remount things within floppy.c without disturbing main.c? I don't really know what the best answer is, but it's not like the current DA hooks in the rest of the code are anything great.

@ejona86 (Collaborator, Author) commented Feb 28, 2021

I've tested HFE a bit and have resolved the issues I've noticed. But I haven't figured out a good way to give it a good stress test; my cheap USB floppy controller doesn't like FlashFloppy+HFE. I think I'm to a stopping point for this phase. Would you like to review the commits one-by-one, with each having its own PR? Or one large PR? Or just dump it all in a branch? I'll revise my current branch before doing so to make it cleaner.


I reworked the DA handling to have it controlled more by floppy_generic.c. I'm disappointed it wasn't easy to put in floppy.c but it is a bit nicer than it was in image.c. The handling is awkward because DA detection is dependent on how many tracks are in the current image. So I have to load the normal image handler and then replace it with DA when DA is needed. It looks like it'd be just a few lines to enhance it so DA mode can be entered even if no image is mounted.

I noticed that my async vs blocking allocation was very, very wrong and was dereferencing a NULL pointer. This was breaking the IMG test. The allocator now assumes a blocking handler and then reallocates if it turns out the handler is async. Not happy about the circular dependency existing, but the retry is simple.

I noticed switching sides with the ring buffer was really fast when fully buffering and slow otherwise. I tracked it down to read_bc causing the reader to be in the future; when the side changes we need to rewind and re-read what's already been read. read_bc is at least 16 ms (HD), which means the 10 ms grace period isn't enough to cover it. I've solved this by limiting the amount buffered in read_bc (to 2 kB, which doesn't actually reduce its size) and keeping some already-consumed data around (4 kB for HFE).
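(A quick sanity check on that figure, under my assumption of ~1 µs per HD MFM bitcell: 2 kB of read_bc is 16384 bitcells, i.e. ≈16 ms, which matches the "at least 16 ms" above.)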

@ejona86 (Collaborator, Author) commented Mar 1, 2021

SWD: I learned it has a companion SWO (TRACESWO) which does allow printf-style debugging (not a console, because it is one-way). Unfortunately enabling SWO requires pa15 and pb3 (it uses pb3, but the register configuration requires enabling JTDI as well). The cheap st-link v2 clones also don't support SWO unless you do a hardware mod. So overall no real benefit over serial for us, although for a custom board it might make more sense.

@keirf (Owner) commented Mar 1, 2021

I'm inclined to give you access rights and dump it in a branch? The aim here when ready is to branch master into stable-v3 and then merge your branch into master as new v4.x development series. The sooner we get to that point the better imo. (EDIT: Even if we're not ready at that point to immediately call v4.0a)

Yeah the side switch thing is fun eh? Glad you worked out a scheme for the non-fully-buffered case :)

@ejona86 (Collaborator, Author) commented Mar 2, 2021

I'm game for the branch approach. I've cleaned up my branch to be ready to be rebased on top of the branch point. I split out #439, which is trivial and can happen before the branch point. I took a stab at the copyright notice for the new files; I hope it looks fine to you. Mention it if you have any concern.

How much do we want to do in v4.x before releasing? Obviously stabilize. But would you be hoping for async IMG or similar to be done before a 4.0 release? I sort of imagine you do since it seems we want an async IMG and it'll need stabilizing, but I don't know the full scope that you might hope for.

@keirf (Owner) commented Mar 2, 2021

Okay I sent you an invite for direct access. I think once you accept that you're set, though I'm not sure what granularity of access control is provided and whether I'll need to tick some boxes. I should have already asked: I assume you're set up with 2FA on your account?

The copyright/written-by notices are perfectly fine. You can feel free to add your name to any files you substantially modify. We could have an AUTHORS or CREDITS file too, though it feels a bit highfalutin for such a small number of contributors.

I'm not precisely sure where we should place the goalposts for v4.0a. I'd really like to get some testing from users who have had problems with the existing "dumbio", so that's tickets:

I can ask some regular users if they have any HFE scenarios with unreliable writes. All the same, even without IMG, drawing a line for 4.0a and getting wider testing on it is not a bad idea. It is also then a baseline for our own testing, both correctness and perf.

@ejona86 (Collaborator, Author) commented Mar 2, 2021

Okay, so async IMG is on the docket. The read/write paths look pretty stand-alone, but we'll need to figure out how we want to handle the sub-512 byte sector alignment. I should have an easier time stress testing IMG as well.

@keirf (Owner) commented Mar 2, 2021

I had a quick look down the list of collaborator privileges, and it looks like there's everything you could plausibly need, so that's good: https://docs.github.com/en/github/setting-up-and-managing-your-github-user-account/permission-levels-for-a-user-account-repository

@keirf (Owner) commented Mar 2, 2021

As far as I'm concerned, you are now welcome to create and manage any branches you need to get your work ready for master. When you're ready, we can branch stable-v3 and open v4 development in master.

@ejona86 (Collaborator, Author) commented Oct 13, 2021

With 709d77d DSK is now ported to async. That leaves only QD.

@keirf (Owner) commented Oct 13, 2021

That's great, I tagged v4.2a since there are other fixes since v4.1a too.

@ejona86 (Collaborator, Author) commented Oct 18, 2021

With d36bde6 QD is now ported to async. At some point we'll want some cleanup of the non-async codepath, mainly for the different buffer sizes. But I figure we'll let things sit for a bit first.

@keirf (Owner) commented Oct 18, 2021

This is awesome! Thank you very much! Is that tested against Testbed? Should I ask some QD users to give it a spin?

@ejona86 (Collaborator, Author) commented Oct 19, 2021

Yes, I used FF_TestBed. It'd be quite fair to give it a spin, especially since QD is so atypical, but it was actually a pretty straightforward copy of the changes to HFE. In general, I'd appreciate the testing.

I'm honestly a bit more concerned about DSK. DSK was a copy of the changes to IMG, and wasn't that bad. But it has a blocking read in dsk_seek_track that is unfortunate. Technically, I can't see a way the timing is all that different than what we've done already, but it is unique and it would have a smaller write buffer on track change than before.

Oh! I missed one format. Dummy is still sync! 😅

@keirf (Owner) commented Jan 10, 2022

Hi there, perhaps we should do a v4.4a now, what do you think?

@ejona86 (Collaborator, Author) commented Jan 10, 2022

v4.4a SGTM, which would make ADF testable vs the 4.3a very broken state.

@keirf (Owner) commented Jan 10, 2022

Great, I have a couple of things to port up from v3 and then I'll do a release later this week.

@keirf (Owner) commented Feb 8, 2023

FYI: for now I have swapped stable-v3 back to master, and previous master is now experimental-v4. This makes more sense to me as it makes master my primary development branch. I am really only cherry-picking to the v4 branch currently.

@ejona86 (Collaborator, Author) commented Jul 9, 2023

I've worked my way back to flashfloppy, having mostly finished some of my other projects I started while my machine was broken.

A big problem with asyncio was difficulty in building confidence in its vastly different behavior compared to stable-v3. There was also the elephant in the room of at32f435 and making use of its vast RAM, which ring_io couldn't do.

I just pushed a replacement of ring_io called file_cache, for caching a single file. It is a ~~write-through~~ write-back cache abusing the existing cache. There's some docs in the commit message (717bc47) and its header file. I've definitely re-destabilized things from where they were with ring_io, although the code seems to be less error-prone and easier to debug.

With file_cache, ADF, DSK, IMG, and QD have a much cleaner diff compared to stable-v3 (specifically e8c0942). DA remains the same. I forgot to sync dsk_write_track and img_write_track more closely with stable-v3, so I might be able to reduce the diff further there. The point is to make it easier to copy changes between the branches.

HFE is pretty different, not because of I/O so much as the HFEv3 enhancements and hard-sector pulses. I have seen the rework of it on stable-v3 and will look at absorbing the changes into experimental-v4 a bit later. Before I do that, though, I'll cherry-pick the easier commits.

@keirf (Owner) commented Jul 9, 2023

Hi, glad to see you back! A better-integrated file cache, targeting specifically the 435 and its larger RAM, is definitely something I'm interested in. There are two advantages I can see to improving the cache layer:

  1. Simpler image handlers, which currently roll their own buffering
  2. Better-hidden write latency to USB

Actually (2) is the main weakness of the current stable firmware, and it does bite on a few systems, especially with sticks that have a long tail on write latency. But that means we need a writeback file cache, and to defer writes until all data on this cylinder is cached (so that we can write to USB without stalling read emulation) or we are head stepping. Which is what you did for ring_io, iirc. I consider improving (2) to be the essential motivator for churning the stable 435 release code. So the file_cache will need to learn about cylinders, for example, or it will need an API that permits writes to be deferred until signalled to flush. Getting the cache API correct is key, of course.
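To make that deferred-write shape concrete, a sketch of what such an API might look like (all names are hypothetical, not taken from the branch):

```c
#include <stdint.h>

struct file_cache;  /* opaque */

/* Read: pointer into the cache, or NULL while the block is still
 * waiting on prefetch (the caller yields and retries). */
const void *fcache_peek(struct file_cache *fc, uint32_t ofs, uint32_t len);

/* Write: dirty the resident block(s); no USB I/O is issued here. */
void fcache_write(struct file_cache *fc, uint32_t ofs,
                  const void *p, uint32_t len);

/* Writeback point: called once the cylinder is fully cached, or on a
 * head step; drains all dirty blocks out to USB. */
void fcache_sync(struct file_cache *fc);
```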

Cutting down the diff against stable is all good. Maybe some rework in experimental-v4 makes sense in this direction. This will be useful long term because I don't see 415/105 firmware ever moving to this new branch. So two branches will need to be maintained in the long term.

Looking slightly further ahead, to get the new code into a stable release I think we will need to form the new patches into an orderly queue, review them, and admit them into a new branch freshly forked from master. Hard sector support is then gravy at the tail of that queue, as far as I'm concerned for now.

Sorry, that's a bit of a brain dump in response to your reappearance. :) I just want to be clear where my thoughts lie, so that you can decide where to prioritise efforts, depending on your own longer-term aims.

@keirf (Owner) commented Jul 9, 2023

I'm also happy to commit some spare-time cycles here and help/review. There's more than a seed of useful stuff for a future "FlashFloppy Plus"[*] stable branch here.

[*]: That's the name of the 435 build. It would be nice for it to live up to its name and actually deliver more than the regular 415/105 firmware!

@ejona86 (Collaborator, Author) commented Jul 9, 2023

Doh. I said write-through cache above. I should have said it was a write-back cache. I had it right in the header file!

file_cache right now lacks the "shadow" ring of ring_io, which allowed it to cache a full cylinder. It wouldn't be hard to re-add. Such handling would work equally well for stm32f105 as for at32f435. Only HFE at HD rate would behave differently (since QD is already weird).

I would hope that we could move at least stm32f105 to the new branch, and I don't think at32f415 is out of the question. I had file_cache rolling around in my head before at32f435 came around in order to behave more like stable-v3 when desired. But I think it is fair to have that be a "later down the road" thing once you have a better view of how things behave with file_cache.

I'll look into forming a patch queue for review. I'll have to see how bad conflicts get to figure out my approach.

@keirf (Owner) commented Jul 9, 2023

Ah sorry yes, that makes sense really as we already had a WT cache. Is there a particular reason that it makes sense to put this cache above the filesystem rather than below? I suppose it does depend what API you need to the cache, because talking to something that lives the other side of the filesystem could be weird and difficult. But it depends how rich the cache API needs to be. I should probably pull down your branch and have a look around.

EDIT: I suppose to answer that myself, it makes a lot of sense if working from the ring_io codebase with the fs already on the io (server) thread. Whether from a blank slate it is better to call the fs from the client thread or the server thread is unclear to me; along with file-block cache vs raw-block cache. Cache at the raw block level would allow caching metadata, but then we try pretty hard to cache cluster information explicitly anyway (CREATE_LINKMAP). Cache at the file-block level puts the cache API closer to the top-level caller, which may be more convenient than communicating around or through a filesystem. Overall, I'm unclear on the best approach, but I have always believed broadly in the block cache approach as opposed to ring_io.

As for 105/415, it is true we could see benefit there too, actually. Especially being able to decode and buffer writes as logical sectors vs raw MFM in the write bitcell ring, for example. Being careful about lack of RAM will obviously be important. Also crucial though is the lack of Flash -- that could be a real problem as we are super tight on space at 128kB Flash, so much that we can't even do debug builds on the v4 branch. I wonder what proportion of 415 Goteks have 256kB Flash, as that could be a new build target if it's common.

@keirf (Owner) commented Jul 9, 2023

The more I think about it, you're probably right. A cache that can be interrogated directly on the client side (and which should therefore be addressable by (file,offset)), backed by I/O via a filesystem on the server side. Prefetch is more naturally offloaded to the server thread this way, since that's where the filesystem is. Since the cache is accessible directly client-side, we do not need to block on the server thread for most operations.

The cache could easily be multi-file and include metadata if we want. The block tags would need a file-id as well as block-id. We don't even need multi-file (at least not yet). A tag for file vs metadata could be useful, and that only needs a one-bit tag, to distinguish (file,offset) vs (lba,offset), the latter used for filesystem metadata. The same cache could then be used client-side for accessing file data, and by the filesystem on the server side for accessing cached metadata. The lack of preemption makes the concurrency handling very simple.
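A sketch of such a tagged block key (illustrative, not the actual FlashFloppy structures):

```c
#include <stdint.h>

struct cache_key {
    uint32_t block;   /* file block number, or raw LBA */
    uint8_t  is_meta; /* 0: file data, keyed (file,offset);
                       * 1: filesystem metadata, keyed (lba,offset) */
};
```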

@ejona86 (Collaborator, Author) commented Jul 16, 2023

I've rebased experimental-v4 on master at v3.39^^, just before "HFEv3: Fix and simplify opcode handling". That is visible in my file-cache-and-hard-sector-fixes branch. But you will be more interested in the subset of commits in the file-cache branch. I need to spend more time to absorb the HFE change on master.

I know the file-cache branch won't run well, especially on at32f415, until the last commit, "Optimize write_bc size for async", frees up memory for the cache. I was mostly concerned with making it all compile up to this point and need to fiddle with it more.

The DA commit needs to be split in two to show the cost of the reduced write_bc size, and needs a simplification pass. Its code was an adaptation of the ADF+ringio code, so it supports some things it doesn't need to.

Debug builds of file-cache fit within flash, but it is tight: only 260 bytes of headroom using Ubuntu 22.04's GCC. If you remove the "Stream writes to flash" commits, which handle the small write_bc size, then it has 556 free. The compiler makes a 1 KiB difference to the code size; see my most recent commit on experimental-v4: cc8941d. I agree things are tight even if we turn on LTO and use GCC 12.

@keirf (Owner) commented Jul 17, 2023

Looks like you sorted the HFE confusion -- you don't touch hfe_rdata_flux() anyway, which is where the big changes were.

The async I/O control flow will take a bit of getting my head round. I'm also wondering whether some richer threading primitives would be helpful in simplifying the higher levels. I just can't tell at the moment, as I'm too new to the code. But it does feel there's a bit of ad hoc thread nudging going on.

Here's a bit of up front code review on thread.c:

  1. Code style, function open braces on newline
  2. thread_reset definition misses void parameter
  3. thread_start is a bit ugh. Why don't you build the initial stack frame in C? There's nothing in real r7-r11 we care to propagate I believe, so you could zero those?

I remain sorely tempted to leave 105/415 on v3 stable branch forever and only develop for 435. They may be in the tiny minority of Goteks in the field, but they are readily available to anyone who actually cares about any of this stuff. That does happen to be a tiny minority of Gotek users! Most never even update their firmware.

@ejona86 (Collaborator, Author) commented Jul 22, 2023

> But it does feel there's a bit of ad hoc thread nudging going on.

The I/O scheduler in file_cache requires polling to nudge it along, and so I also had it nudge the I/O thread along. That's the most important place for the nudging. I could have done the thread_yield() nearer main(), but I thought it was nicer to be closer to the code that needs it and any "blocking" operations will need the thread_yield() in a loop anyway.
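In this scheme a "blocking" operation on the client side reduces to a poll-and-yield loop, roughly like this sketch (thread_yield() is the real primitive; struct file_cache and fcache_peek() are stand-ins):

```c
#include <stdint.h>

void thread_yield(void);
struct file_cache;
const void *fcache_peek(struct file_cache *fc, uint32_t ofs, uint32_t len);

/* A "blocking" cache read on the client thread: each yield gives the
 * I/O machinery a chance to make progress on the miss. */
static const void *fcache_read_blocking(struct file_cache *fc,
                                        uint32_t ofs, uint32_t len)
{
    const void *p;
    while ((p = fcache_peek(fc, ofs, len)) == NULL)
        thread_yield();
    return p;
}
```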

Based on your other comments, I'm suspecting you might prefer the file_cache I/O scheduler (enqueue_io()) to run as "main" on the I/O thread, getting rid of fs_async and the threading code in floppy_generic.c except for thread_reset(). Interactions with F_call_cancellable() would have been my main concern, but it looks like it'd work.

> Here's a bit of up front code review on thread.c

I've addressed these. Using C in thread_start is a separate commit so you can review; for the other changes I just rewrote the commits.

> I remain sorely tempted to leave 105/415 on v3 stable branch forever and only develop for 435.

If v3 development is intended to essentially stop, then that sounds more attractive. The patching back-and-forth was my biggest issue. Although I have enough units that I'd probably keep a build running for a while, like until the hard-sector stuff lands on the "new v4" branch. The flash issue isn't a huge problem for me with GCC 13 and worst-case I disable SD support. On my machine the headroom of the debug build on the file-cache branch changes from 1256 to 3872 bytes if I disable SD by commenting out the two references to sd_ops.

I think there are advantages to the idea of "don't let the smaller microcontrollers hold back efforts", and we can see where things go.

@keirf (Owner) commented Jul 23, 2023

> The I/O scheduler in file_cache requires polling to nudge it along, and so I also had it nudge the I/O thread along. That's the most important place for the nudging. I could have done the thread_yield() nearer main(), but I thought it was nicer to be closer to the code that needs it and any "blocking" operations will need the thread_yield() in a loop anyway.

> Based on your other comments, I'm suspecting you might prefer the file_cache I/O scheduler (enqueue_io()) to run as "main" on the I/O thread, getting rid of fs_async and the threading code in floppy_generic.c except for thread_reset(). Interactions with F_call_cancellable() would have been my main concern, but it looks like it'd work.

Well, maybe let's hold that thought a while. Conceptually I do sort of see the file cache as the API to the async I/O, and therefore the cache and its scheduler does live in the async I/O "blob" in my head. Ultimately though, the scheduler is just a non-blocking piece of code that will prioritise among pending I/O requests and decide what gets issued next. That could run in either context, with a suitable interface chosen between threads. At the moment it looks like that interface is a ring-based queue, which does mean the I/O thread can run across multiple work items asynchronously. My initial concerns compared with running the scheduler in the I/O thread would be:

  1. Head-of-line blocking if a higher-priority block request is issued by the client while low-priority work (prefetch?) fills the ring
  2. Asynchrony is limited by queue "depth"; beyond that, the scheduler in the client thread must be involved to issue more work

Are these concerns real or important? I'm not sure. Perhaps a brief design doc describing the current separation of duties among threads, the APIs, and an example I/O flow could be useful here, for example how we typically end up getting a whole track or cylinder cached, and what happens on writes? Nothing too onerous, but something a level or two above the C code.

> I've addressed these. Using C in thread_start is a separate commit so you can review; for the other changes I just rewrote the commits.

Damn, that's much nicer, thanks!

> If v3 development is intended to essentially stop, then that sounds more attractive. The patching back-and-forth was my biggest issue. Although I have enough units that I'd probably keep a build running for a while, like until the hard-sector stuff lands on the "new v4" branch. The flash issue isn't a huge problem for me with GCC 13 and worst-case I disable SD support. On my machine the headroom of the debug build on the file-cache branch changes from 1256 to 3872 bytes if I disable SD by commenting out the two references to sd_ops.

Well, v3 in this model would never quite stop. It would carry on much as it is: Bug fixes and incremental features. So there will always be a need for some cherry picking. However, the tricky new stuff would not be going into v3.

I suppose the model would be to bring up new_v4 as an ongoing rebase over master. Then switch new_v4 onto master, and old master becomes stable-v3 (again). All development in master first, with cherry picks back. The only exception might be bugs chased in v3.

Anyway, that's the commitment here: to aim to make development v4-first, rather than leaving it hanging as before. If that happens, the stable-v3 maintenance isn't actually that much hassle. Two people developing long term on two HEADs is bad, though, for sure.

> I think there are advantages to the idea of "don't let the smaller microcontrollers hold back efforts", and we can see where things go.

Agreed. I have no problem telling those (fairly few) users who have trouble especially with 415 Goteks to go buy themselves a better Gotek. Apart from not needing to squeeze into 128kB Flash, we have a lot more design freedom with 384kB SRAM.

@keirf (Owner) commented Jul 23, 2023

Regarding hard sector support, even if that isn't going immediately into a new supported branch, it is sure to be easier for you to handle it as an incremental feature on top of v4-based supported master. If users interested in hard sector support end up needing a 435 based Gotek, that's probably not a big issue, right?

@keirf (Owner) commented Jul 23, 2023

What's the best way forward do you think for getting to new_v4? Should I pick patches across to a new branch which I keep synced with master via cherry picks? Then your branches could be periodically rebased on that.

@ejona86 (Collaborator, Author) commented Jul 23, 2023

> What's the best way forward do you think for getting to new_v4? Should I pick patches across to a new branch which I keep synced with master via cherry picks?

That sounds fine. I have no issue keeping on rebasing stuff until we feel comfortable. Although right now we are discussing more the basics and high-level, so I expect everything up to "Add file_cache" would go in together.

I can open PRs once it makes sense, to put less pressure on this thread. Or we can swap to email-based review.

> how we typically end up getting a whole track or cylinder cached, and what happens on writes

I'll work on some diagrams and docs, but I happened to notice some sketch notes of mine were committed. I've cleaned them up a bit, so "fixup file cache; clean up docs" probably provides some of what you're looking for.

> If users interested in hard sector support end up needing a 435 based Gotek, that's probably not a big issue, right?

Yeah, probably not. Especially with the nice factory screen without ugly (when I do it) modding. It was mostly for myself. I have ~10 goteks and keeping them up-to-date helps to some degree on bug-finding. But once I replace the commonly-used ones with the 435, then it matters less.

> My initial concerns compared with running the scheduler in the I/O thread would be:

Actually, it is easier to move file_cache to the I/O thread, from a queuing perspective. The issues with the move are more tangential, like "make sure to stop the thread when changing images." I'll make some diagrams, but depending on where you are on the learning curve the below might help.

In my imagined move, both threads would share most of the state. So from a multi-threading perspective it would seem horrible. But practically the state sharing is unchanged because the scheduler is asynchronous already. Instead of calling progress_io()/enqueue_io() at various points I'd just call thread_yield(), and the scheduler would use blocking I/O. That'd be the only changes and it'd behave the same. There'd be no queue. E.g., the floppy thread has a cache miss for a sector so it updates cur_sector, the scheduler reads that sector when it feels like it, and the first thread eventually finds the sector in the cache.
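A sketch of that imagined arrangement (illustrative names; cached() and fetch() stand in for real cache operations):

```c
#include <stdbool.h>
#include <stdint.h>

void thread_yield(void);
bool cached(uint32_t sector);  /* hypothetical cache query */
void fetch(uint32_t sector);   /* hypothetical blocking USB read */

/* Shared between the two cooperative threads; no locking needed since
 * neither thread is ever preempted. */
static uint32_t cur_sector;

/* Floppy thread, on a cache miss: publish the need, yield, and retry
 * its cache lookup afterwards. */
void floppy_want(uint32_t sector)
{
    cur_sector = sector;
    thread_yield();
}

/* I/O thread: blocking reads are fine in this context. */
void io_thread_main(void)
{
    for (;;) {
        if (!cached(cur_sector))
            fetch(cur_sector);
        thread_yield();  /* hand control back to emulation */
    }
}
```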

I could go a step further and swap from a threading model to an asymmetric coroutine model, which is essentially the same thing but sets better expectations. I kept not mentioning it because cancel_call() throws a wrench in things, but as of this morning I think I have a way to solve that.

@keirf (Owner) commented Jul 23, 2023

> That sounds fine. I have no issue keeping on rebasing stuff until we feel comfortable. Although right now we are discussing more the basics and high-level, so I expect everything up to "Add file_cache" would go in together.

Okay, probably true since it's not especially a useful branch without the whole lot. I wouldn't expect to be messing with the code until it's all in, certainly. So yeah, I think from your pov it looks like a handoff or submission as one big patch series.

> I'll work on some diagrams and docs, but I happened to notice some sketch notes of mine were committed. I've cleaned them up a bit, so "fixup file cache; clean up docs" probably provides some of what you're looking for.

Useful, though brief :) A doc could still be useful, but please don't waste time polishing it on my account. I can take a rough and ready design overview.

> My initial concerns compared with running the scheduler in the I/O thread would be:
>
> Actually, it is easier to move file_cache to the I/O thread, from a queuing perspective. The issues with the move are more tangential, like "make sure to stop the thread when changing images." I'll make some diagrams, but depending on where you are on the learning curve the below might help.
>
> In my imagined move, both threads would share most of the state. So from a multi-threading perspective it would seem horrible. But practically the state sharing is unchanged because the scheduler is asynchronous already. Instead of calling progress_io()/enqueue_io() at various points I'd just call thread_yield(), and the scheduler would use blocking I/O. That'd be the only changes and it'd behave the same. There'd be no queue. E.g., the floppy thread has a cache miss for a sector so it updates cur_sector, the scheduler reads that sector when it feels like it, and the first thread eventually finds the sector in the cache.

I'm not sure what "share most of the state" means here. The example cache miss seems to then use a fairly straightforward shared-memory API to communicate to the scheduler.

As I understand it so far, I agree with you: Putting the I/O scheduler in the I/O thread makes sense and probably makes things simpler. And we define a file cache API rich enough to represent our desires for various I/O types: prefetches, "synchronous" reads, deferred writes, ... Which we don't have to do all in one go, but we'd have a good idea where we're headed!

> I could go a step further and swap from a threading model to an asymmetric coroutine model, which is essentially the same thing but sets better expectations. I kept not mentioning it because cancel_call() throws a wrench in things, but as of this morning I think I have a way to solve that.

I mean, I dislike the modern tendency to use the coroutine name for any userspace thread runtime. But even taking the narrowest old-timer view of "asymmetric coroutines", what's the difference between that and cooperative multi-tasking between exactly two threads?

If cancel_call(), which is essentially setjmp/longjmp exception handling, is a problem, that's worth bringing up in a design doc and we can hash out a solution.

@keirf (Owner) commented Jul 23, 2023

I actually quite like the thread model. It can be beefed up in various ways should we want to. More threads, preemptive multitasking, etc. Maybe none of that will make sense, but we're not boxed into a more limited view with a threading API.

@ejona86 (Collaborator, Author) commented Jul 23, 2023

> But even taking the narrowest old-timer view of "asymmetric coroutines", what's the difference between that and cooperative multi-tasking between exactly two threads?

I was meaning resume()+yield(). It would make it clearer what you are expecting to run. Instead of the "ad hoc thread nudging," the file_cache code would explicitly say "resume this coroutine". You still have unbound yields in the volume I/O (they don't know what code will resume), but nowhere else.

I'll stick with the threads for now.

@keirf (Owner) commented Jul 23, 2023

Ok I understand.

Let's just stick with yields for now and continue to consider moving the scheduler into the I/O thread.

And I also agree, let's continue to consider branch file_cache as a cohesive series we are working to tidy up and commit as a whole.

@keirf (Owner) commented Jul 23, 2023

And let's consider a shared doc or two. I'd like to consolidate thoughts on not only design decisions for file_cache as far as it goes, but also on next steps, to get to where we need to be for this to be a useful increment in functionality as a release series. Either of us can drive that, with the other also editing or commenting, or whatever. We've probably pretty much run the useful course of this ticket discussion.

@keirf (Owner) commented Jul 25, 2023

I had a further think about this, and a brief look through some later commits. ISTM that we should agree the I/O strategy first, and build the cache API around that. And to me, the strategy is quite simple:

  1. Prefetch the current cylinder. Don't do any other I/O until the entire cyl is prefetched, or drive head is stepped. This prefetch may be optimised by starting at the block(s) containing the sector(s) that the "drive head(s)" is/are currently over.
  2. Buffer writes. This means waiting for the block to be prefetched, and then dirtying the block. For whole-block writes we could of course not wait on the prefetch, and instead throw away or skip the prefetch later. But that's a possibly unimportant optimisation.
  3. When prefetch is complete, or drive head is stepped, flush all dirty blocks before anything else.

I think this changes the API and some of its users at least a bit. If we can assume 435 chip and 384kB SRAM then at least some things are simplified: we can safely assume we can buffer a whole cylinder, for example, and we could allocate space in the I/O thread for "linearising" multi-block I/Os so that we can issue more than one block at a time, without requiring allocation of multi-block chunks in the cache itself.

Eg on seeking to a new cylinder, previous cyl's blocks get flushed as follows: "waiting-for-prefetch" -> free, "dirty" -> clean. Then find/allocate all blocks that will be prefetched for the new cyl. Any newly-allocated blocks marked waiting-for-prefetch. Client I/O can now sync against prefetch based on cache block flags (this might be hidden behind the cache API somehow).
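The per-block states implied by that description might look like this (illustrative names):

```c
enum block_state {
    B_FREE,     /* unallocated */
    B_PREFETCH, /* allocated, waiting for the USB read to land */
    B_CLEAN,    /* valid copy of the file data */
    B_DIRTY,    /* written by emulation, awaiting writeback */
};
/* On stepping to a new cylinder: B_PREFETCH -> B_FREE (abandon),
 * B_DIRTY -> B_CLEAN (flush to USB), then allocate the new cylinder's
 * blocks as B_PREFETCH. */
```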

Prefetch strategy could be a simple block or two at a time from each side of the cylinder in turn, so that the other side can quickly be streamed if the virtual drive head is toggled. Prefetching the blocks closest to the "virtual disk position" first so that we can serve reads/writes asap without blocking on the prefetcher.

@keirf (Owner) commented Jul 25, 2023

Based on these requirements (whole cyl buffering and deferred writeback on a big-SRAM 435 chip), I wonder whether there is a more incremental strategy to get there. If I were designing this from scratch I think I'd keep the LBA block cache that I have, but add state flags per block (dirty, empty, etc). I would probably prefetch from the main thread, interleaved with normal floppy work. Setting up prefetch would mean allocating blocks in the LBA cache, to be filled, and prioritising them. I would perhaps add a simple function to fatfs to help with this, just to wrap the file-offset-to-LBA conversions.

The only reason I can see for async I/O is to hide unpredictable writeback latencies if I decide to write back while still emulating the current cylinder (eg because the whole cyl is now cached, so I can write without getting in the way of read I/O): using async, I can still run main-thread floppy emulation while the write-to-USB happens. I might even use the async thread only for writeback; it could be a "clean dirty blocks and exit" thread. There would be few and obvious places to join or cancel that thread. Hiding read latencies is of much more questionable value, so punting reads across to a second thread might be needlessly complex. In a sense, perhaps the less placed into the secondary thread, the easier it is to reason about and synchronise with.

This definitely sounds a bunch less impactful.

EDIT: I might even have a go at lashing it together. It's incremental plus some limited new tech off your branch, basically.
