Async I/O possibilities #426
What can I say, that looks pretty neat for a lashup implementation. The whole point is to be able to buffer at least a whole side in RAM, right, and preferably both sides of a DS image? If we are having to endlessly stream in from USB, there is no safe time to drop in writes without 'suspending' reads and thus delaying index pulses, I believe? In short, I don't understand your proposed not-fully-buffering scheme: could you expand on it a bit?

My current belief is that this should concentrate on what we can fully buffer, while supporting everything else no worse than we do right now. What I really care about for full buffering is IMG up to ED, and HFE up to DD. I don't think there's much (any?) HD HFE out there that is too esoteric to support easily with IMG.

That said, your rejigging of memory isn't that bad, although it screws the alt/logfile build a bit. I think we can probably make async-vs-not dynamically determined based on available memory, and maybe do less async on the alt/logfile build (although that makes the build less representative and thus less useful in that way). [EDIT: Or make it all async but with similar behaviour to what we have now when low on memory?] Anyway, it is basically a good idea.

A bunch of code probably needs generalising out of HFE if possible, the buffering would be nice to do in the buffercache layer, integrate that with the async layer a bit more... Well, I'm sure you know all this! It would be splendid to lean on the read/write_data special-case buffers much less and the buffercache much more, in general in the existing code :)

There are also probably some corner cases to care about, like not killing the io thread while it's yielding during a USB transaction. Again, I don't doubt you know this.

Yeah... This is awesome overall, thanks! This is the first time since the Xen project that I've had someone who can contribute useful heavy lifting, and those guys were payrolled by their companies.
It is also very healthy to get other eyes on a sizeable project like this, because bad design and implementation decisions tend to fester away, unacknowledged or ignored by the original author.
By the way, is the …
Oh, I like the fact that write-ahead-of-read should be rare (line 477, hfe.c). I was worried about that case, but you're right: if readahead continues during a write, it makes sense that readahead easily keeps ahead of the write cursor, just as it does the read cursor, since they are the same cursor. D'oh. :)
One caveat for your particular area of interest: of course the writes still race seek operations, the writes are occasionally very slow, and you may still have trouble issuing timely index pulses while draining writes (since you have no read data to serve up, and potentially no buffer space to stash writes). I think your hard-sector scenario is still hard when your bulk data is parked on crappy Flash behind a slow half-duplex link. But I'm a pessimist. ;)
That edit is along the lines I was thinking would be possible, but with aggressive reads. Basically, the approach of aggressively pre-filling reads still works even if we can't buffer an entire track; it's just less good. Let's say read_data is 36 KB. That's still 150 ms worth of buffer. We could buffer 150 ms of reads, which consumes ~60 ms of time to pull from flash. If there were any writes, we have at least 90 ms of time to write some of it back, opening up buffer space for reading.

I view it as two superimposed circular buffers, a read buffer and a write buffer. As the read advances, that opens up write buffer, and as the write advances, it opens up read buffer. Even with the fully-buffered-track approach, the mental model of the circular buffer is helpful. It allows us to notice that we could actually seek as soon as there are few enough writes remaining to leave a substantial read buffer. And now hard sectors don't look as glum.
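The two-superimposed-rings picture above can be sketched as one circular buffer with three cursors. This is an illustrative model only, not FlashFloppy's actual code; the struct and function names are invented:

```c
#include <stdint.h>

/* Hypothetical sketch of the "two superimposed rings" idea: a single
 * circular buffer with three monotonically advancing cursors. The span
 * [flush, read) is dirty data awaiting write-back to flash; the span
 * [read, fill) is prefetched read data. Advancing one cursor frees
 * space for the other side, exactly as described in the discussion. */
struct ring {
    uint32_t size;   /* ring capacity in bytes */
    uint32_t fill;   /* absolute offset: next byte to prefetch from flash */
    uint32_t read;   /* absolute offset: next byte served to the drive */
    uint32_t flush;  /* absolute offset: next dirty byte to write back */
};

/* Bytes of prefetched data available to the read head. */
static uint32_t ring_read_avail(const struct ring *r)
{
    return r->fill - r->read;
}

/* Free space the prefetcher may fill: the capacity minus everything
 * between the flush cursor and the fill cursor (dirty + prefetched). */
static uint32_t ring_fill_space(const struct ring *r)
{
    return r->size - (r->fill - r->flush);
}
```

Flushing dirty bytes (advancing `flush`) directly enlarges the space the prefetcher may use, which is the "writes open up read buffer" observation.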
Sounds good. I'm trying to make this as complicated as necessary, but no more.
My thoughts as well.
Oh, the log file. Right. That's why it is so large. There's actually some wiggle room at the moment: write_data is 51076 bytes and HD HFE v2 needs 50176, so there's 900 bytes of slack. I had cut into the console before I moved the async callback handling back onto the main thread. Moving the callback handling made me feel comfortable reducing thread1_stack to 512 bytes. I'm also still not quite sure how big write_bc will need to be in the end.
Actually, I hadn't considered using the buffercache layer. I had been pretty happy to avoid it, as that gave me "this is all your memory" "simplicity" and avoided duplicating memory. And that was probably the right thing for the PoC. But now that you mention it, I can see how that could work. The devil's in the details, but it's definitely worth looking into.
Yeah. It is "let C handle the declaration and let asm do the rest." More precisely, it means "don't emit the function prologue and epilogue."
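For readers unfamiliar with the attribute being discussed: with GCC (and Clang), a "naked" function gets only its label, and the body must be written entirely in asm. A minimal x86-64 illustration (unrelated to FlashFloppy's ARM code; the function name is invented):

```c
/* GCC's "naked" attribute: the compiler emits no prologue/epilogue, so
 * C statements are not allowed in the body -- only asm. x86-64 AT&T
 * syntax assumed here. */
__attribute__((naked)) int forty_two(void)
{
    __asm__ ("movl $42, %eax\n\t"   /* place the return value */
             "ret");                /* return; no epilogue is emitted */
}
```

The C declaration gives the linker a symbol and the compiler a prototype; the asm does everything else, which is exactly the "let C handle the declaration" framing.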
Yeah, this is "phase 1" :). Best not to throw hard sectors into the mix yet. Eric looks at his imaginary chart. Yeah, those are phase 3 or later. The slow, half-duplex, non-pipelined link is excruciating combined with flash latencies. The part that made me consider this approach more strongly is the high read latency that can occur after a write; that made me reevaluate. That said, my hard-sector PR is pretty fair already; it's just so much effort to fight the blocking to track down what is causing latency blips, especially since printk consumes a lot of CPU and SWD is disabled.
Okay, let me answer the minor points first, and I'll follow up with what I think are the biggest up-front decisions to be made.
The big decisions to be made.
Upsides:
- Everything is pretty and unified.
- One copy of data (maybe with very minimal stream buffering on top for some image handlers).
- Maybe simplifies image handlers by pulling code into a shared subsystem.

Downsides:
- The buffercache grows many new appendages and gets much more complicated: dirty buffers, locked buffers (probably), ... And we'd still need a conceptual ring in the floppy layer to track how much of a track is fetched (and possibly where the buffers are). Maybe something unified in the floppy subsystem makes more sense than the buffercache subsystem.
- Metadata overhead per block is probably higher than a simple contiguous ring.
- Probably need either de-fragging, or iovecs down the IO stack to do multi-block IO (the latter wouldn't actually be that bad to do, imo).
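As a sketch of what "iovecs down the IO stack" could look like (names invented, not FlashFloppy's API): a multi-block transfer is described as a scatter list, so discontiguous cache blocks can feed a single I/O without de-fragging them first.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative scatter-list element: one cache block's worth of data. */
struct iov {
    const void *base;
    size_t      len;
};

/* Gather discontiguous blocks into one destination buffer, returning
 * the total number of bytes copied. A real IO stack would hand the
 * list itself down to the transfer layer instead of copying. */
static size_t iov_gather(void *dst, const struct iov *v, int n)
{
    size_t off = 0;
    for (int i = 0; i < n; i++) {
        memcpy((char *)dst + off, v[i].base, v[i].len);
        off += v[i].len;
    }
    return off;
}
```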
What about hard-sectored images? I suppose the only case that might be 500 kbps would be eight-inch MFM. But that would spin at 360 RPM and so require only ~42 kB of RAM. Much more manageable! If we don't need to achieve 50 kB available, we have a lot more latitude in design and implementation.
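The ~42 kB figure checks out with back-of-envelope integer math, assuming HFE stores raw bitcells and that 500 kbps MFM data implies a 1 Mbit/s bitcell rate (the helper name is illustrative):

```c
/* Bytes needed to buffer one full revolution of all sides, given the
 * bitcell rate and spindle speed. Integer math, so slightly rounded. */
static unsigned track_ring_bytes(unsigned bitcell_hz, unsigned rpm,
                                 unsigned sides)
{
    unsigned ms_per_rev = 60000u / rpm;               /* 360 RPM -> 166 ms */
    unsigned bits = (bitcell_hz / 1000u) * ms_per_rev; /* bitcells per rev */
    return bits / 8u * sides;                          /* bytes, all sides */
}
```

So eight-inch MFM at 360 RPM needs about 41.5 kB for both sides, while a 300 RPM HD image needs ~50 kB, in line with the 50176-byte HD HFE v2 figure mentioned earlier in the thread.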
Aggressive writes: I have no problem with conservative writes being the default. If we did aggressive writes, we'd probably want some display feedback in the event of a buffer underrun.

SWD: No, it doesn't do logging, which means it's only for dire situations. I saw the usage of PA14 and hacked around it, but still no dice. It didn't boot with ….

Buffer cache or ring: The current buffer cache under the FS could work for buffering writes, but looks ill suited for reads. It prevents reading from cache while a readahead is ongoing, and would require larger buffers in the image code just to satisfy fatfs. This approach means copies of the data into/out of the cache. We could hack around fatfs by putting the volume code into a "fake read" mode where it queues the read and returns fake read data. But there'd be no transparency into the cache, and calculating the amount of memory necessary for the cache is harder since metadata is mixed in. We'd probably want iovecs. I think we really want something above the fs. I think a "ring cache" utility is the most direct approach: most directly what we need, most memory efficient, and most versatile, with the downside of more complicated API interactions. I do think a sector-based approach could work, but I think it'd require copies to/from cache, and it might use a ring internally to avoid fragmentation.

Do we care about 300 RPM 500 kbps HFE? It sounds like you are leaning toward supporting HFE in this case, but not bothering with the fully-buffered-track approach. That sounds fair; I also thought the tight memory was quite restricting without much practical benefit. Since I don't think we want two copies of the HFE code (fully buffered track + current approach), that implies the caching approach will gain some complexity to handle partial tracks. That's fine with me. We can size things such that …
SWD: Ah, that's a shame. I haven't played with SWD except for programming, so I can't authoritatively say what could be going wrong in setting it up. I'm pretty sure that AFIO_MAPR field can only be set once until reset, though, so if you touched it twice that could be a problem. Anyhow, if we don't know how to do logging via SWD, it's not of much interest to me personally, so I guess it's moot.

Cache vs Ring: This has been a useful discussion, as I would probably have gone with the superficial attractiveness of a 'unified buffer cache' and hacked it to work, which of course could be done at the cost of time and complexity! I suppose the advantages are that the cache can remember previous tracks, share with metadata, and thus overall be a little more efficient on average. But the average case is after all not what we're really interestedted in, but rather hitting real-time goals. And a cache is probably not the right tool for that job. So I have to agree with you, and having thought about it a while now, I'm increasingly enthusiastic about that choice. It's going to be a lot simpler code too.

When A Cylinder Doesn't Fit: Ok, so specifically we don't care about fitting HD HFE; I think we agree on that. So we just need to take care with fallback behaviour. Sizing the rings as you suggest makes sense. We'll also probably have to synchronously flush writes, at least in conservative-write mode. On all image types (except HFE) we can buffer both sides of a DS disk where it fits, degrade to buffering one side where only that fits (and treat side change like a seek), and something single-sided and HFE-fallback-like (as you suggested) that is really no worse than we have now for giant tracks (some people are configuring giant IMG images, for example, for better or worse).

Who Codes It Up: Well, I'm figuring you want the okay to pursue this? It's fine with me if so; we can stick it in a branch and it can aim for a new v4.x release series.
It'll need carving up into bite-sized pieces for review, to some extent. It will also need to go through the FF_TestBed automated tests, which I can push it through since you won't have the setup for that. Anyway, nothing surprising here, I don't think... If it's going well, perhaps I should give you direct access rights to the repo.
Cache vs Ring: I've split the ring logic out of HFE into a separate API. It cleans up HFE. I've gotten far enough with the limited-memory support that it's clear HFE will be minimally impacted by it.

When A Cylinder Doesn't Fit: My current plan for formats with separate track sides is for the ring reading code to update a "shadow" ring in parallel to the main ring. If the side changes, it can be "swapped in" (by changing pointers or some such; details TBD). If there's not enough memory for both sides, then it just goes back to "single ring only" mode. Having both rings is probably annoying in the ring code, but probably not hard. I'll reevaluate once I see how the code shapes up. (The main alternative to the shadow ring is to interleave the sides.)

Who Codes It Up: I've been cleaning up what I had and have been pushing it to my master...ejona86:io-thread branch. It's honestly looking like it'll be an easy review, other than the massive number of places I could introduce bugs. Be aware that the constrained-memory changes are causing me to change essentially every line of the ring cache, as I have it use the ring cursors instead of absolute offsets, and I'm swapping it over to tracking bytes instead of sectors. So that change will be quite noisy. I'm very glad you have automated testing.
FF_TestBed looks very straightforward. If you have an extra PCB, that's the cleanest, but this would be trivial with jumper wires. I have some 1x10 and 1x7 housings, so it wouldn't even be that janky. I should be able to set that up this week.
Okay, that all sounds good. To be honest, you're obviously fine to crack on and ask questions if you need to. If there are any areas you'd specifically like me to comment on, or image handlers you'd like me to port across to the new environment, let me know. FF_TestBed is very handy; I designed it to be super simple: no IRQs, no heap, very easy to hack in new tests or diagnostics. So out of the box it is a simple set of regression tests, but you can see on the … branch. I do have... one or two PCBs... Ok, maybe 100 actually. I panellised them and got a 10x10cm PCB set done, and these are hardly popular with general FF users, so they are my go-to for shimming flat-pack furniture and the like. ;) I'm happy to post you a couple if you email me a shipping address. They shouldn't take too long to get to you, and it's a super-neat setup then.
I've implemented track-larger-than-ring support and cleaned up the branch. The ring_io code is more complex than I'd like, but about as complex as I expected. I put an 'async' flag on image_handler, so now the arena allocation is auto-tuned and non-async images are supported. But Direct Access is broken if disk_handler is async, because write_bc is too small, which blocks FF_TestBed testing. So I think adding async versions of …
That's great! Maybe you can do a broader reset of the rings when switching to DA, to get into non-async mode? On exit from DA we always fully remount, so that direction is already covered. It's possibly an alternative to implementing more async stuff. Whatever you think is better, neater, etc.
I'll give that a try. That'd be more expedient. I've been a little concerned about going into/out of DA mode and losing any async writes. I feel like that is more likely when exiting DA mode than entering, so sync I/O is probably a plus. (I've thought I might need a "flush" function or something on image_handler, but haven't really wanted to go down that road unless I have to.)
The entry to DA mode is pretty shoddy really, tucked away in …. Also, more broadly, it would be nice to support DA mode even when no stick is inserted, or no image mounted (e.g. ejected). It feels a bit icky that an image must be mounted to get into DA mode. But that would need a method to force …
I received the FF_TestBed adapter board and attached male+female connectors to it; pretty easy and very clean. It took me a bit to figure out that the test bed wants the DUT to be in Shugart mode, and to get the S0/1 jumper right. Testing against master, hfe and …
It can be a bit fussy. I will add the jumper settings to the "Running The Testbed" wiki page, where they properly belong. I will also give it a whirl myself and let you know how I get on.
Well, I remember why I don't run it as often as I should: I always find it a pain to set up. The TestBed and DUT can backpower each other via the ribbon, which can be confusing; plus, for some reason (perhaps my new PC), I find it really hard to flash via USB-TTL, and it takes me loads of attempts. Anyway, I did get it running. Here's a full round:
I think I'm going to solder in a power feed from the TestBed for the DUT. That would be saner. EDIT: Also, the MOTOR-to-RDY line needs a modded Gotek DUT, to wire through the MOTOR signal. A normal Gotek will say …
I've revamped the wiki page: https://github.com/keirf/FF_TestBed/wiki/Running-The-TestBed
Please let me know if you think anything is still unclear.
I got FF_TestBed up and running. I'm seeing one DSK error about expecting 16 sectors, but parts of DSK seem to still be okay and everything else looks fair. This is "good enough" to notice if I break the world. My earlier problem was messing up when preparing the USB drive. I have no problem programming with my USB-TTL or ST-Link; the only problem I had was re-learning how to unlock the flash. I've seen the backpowering happen with the Greaseweazle and FluxEngine, so I was prepared for that, but it didn't cause any trouble.

Fixing up image_setup_track looks nontrivial, so I'll still need to figure out an approach there. With a huge hack, I am able to test my code, and HFE passes. IMG is hanging for MFM 8192, which is surprising; something to look into.

Huge hack:

```diff
diff --git a/src/image/image.c b/src/image/image.c
index 4145296..bcb892d 100644
--- a/src/image/image.c
+++ b/src/image/image.c
@@ -101,6 +101,8 @@ bool_t image_valid(FILINFO *fp)
     return FALSE;
 }
 
+static bool_t da_mode;
+
 static bool_t try_handler(struct image *im, struct slot *slot,
                           DWORD *cltbl,
                           const struct image_handler *handler)
@@ -120,6 +122,10 @@ static bool_t try_handler(struct image *im, struct slot *slot,
     im->stk_per_rev = stk_ms(200);
 
     im->disk_handler = im->track_handler = handler;
+    if (da_mode) {
+        im->track_handler = &da_image_handler;
+        im->nr_sides = 1;
+    }
 
     mode = FA_READ | FA_OPEN_EXISTING;
     if (handler->write_track != NULL)
@@ -259,12 +265,16 @@ bool_t image_setup_track(
 
 #if !defined(QUICKDISK)
     if (!in_da_mode(im, track>>1)) {
+        da_mode = FALSE;
         /* If we are exiting D-A mode then need to re-read the config file. */
         if (h == &da_image_handler)
             return TRUE;
         h = ((track>>1) >= im->nr_cyls) ? &dummy_image_handler
             : im->disk_handler;
     } else {
+        da_mode = TRUE;
+        if (h != &da_image_handler)
+            return TRUE;
         h = &da_image_handler;
         im->nr_sides = 1;
     }
```
Good news on the TestBed. Weird that you still have some errors, but it's good enough for now. You can see if you break the world, and I can also run tests in due course. And maybe make some diabolical new ones. Yes, it might be better to teach floppy.c and/or floppy_generic.c a bit more about DA and mode switching. Or use a new return code from image_setup_track to remount things within floppy.c without disturbing main.c? I don't really know what the best answer is, but it's not like the current DA hooks in the rest of the code are anything great.
I've tested HFE a bit and have resolved the issues I've noticed. But I haven't figured out a good way to give it a proper stress test; my cheap USB floppy controller doesn't like FlashFloppy+HFE. I think I'm at a stopping point for this phase. Would you like to review the commits one-by-one, with each having its own PR? Or one large PR? Or just dump it all in a branch? I'll revise my current branch before doing so to make it cleaner.

I reworked the DA handling to have it controlled more by ….

I noticed that my async-vs-blocking allocation was very, very wrong and was dereferencing a NULL pointer. This was breaking the IMG test. The allocator now assumes a blocking handler and then reallocates if it turns out the handler is async. I'm not happy about the circular dependency existing, but the retry is simple.

I noticed switching sides with the ring buffer was really fast when fully buffering, and slow otherwise. I tracked it down to read_bc causing the reader to be in the future; when the side changes we need to rewind and re-read what's already been read. read_bc is at least 16 ms (HD), which means the 10 ms grace period isn't enough to cover it. I've solved this by limiting the amount buffered in read_bc (to 2 KB, which doesn't actually reduce the size) and keeping some already-consumed data around (4 KB for HFE).
SWD: I learned it has a companion, SWO (TRACESWO), which does allow printf-style debugging (not a console, because it is one-way). Unfortunately, enabling SWO requires PA15 and PB3 (it uses PB3, but the register configuration requires enabling JTDI as well). The cheap ST-Link v2 clones also don't support SWO unless you do a hardware mod. So overall there's no real benefit over serial for us, although for a custom board it might make more sense.
I'm inclined to give you access rights and dump it in a branch? The aim here, when ready, is to branch master into ….

Yeah, the side switch thing is fun, eh? Glad you worked out a scheme for the non-fully-buffered case :)
I'm game for the branch approach. I've cleaned up my branch to be ready to be rebased on top of the branch point. I split out #439, which is trivial and can happen before the branch point. I took a stab at the copyright notice for the new files; I hope it looks fine to you. Mention it if you have any concern.

How much do we want to do in v4.x before releasing? Obviously stabilize. But would you be hoping for async IMG or similar to be done before a 4.0 release? I sort of imagine you do, since it seems we want an async IMG and it'll need stabilizing, but I don't know the full scope that you might hope for.
Okay, I sent you an invite for direct access. I think once you accept that you're set, though I'm not sure what granularity of access control is provided and whether I'll need to tick some boxes. I should have already asked: I assume you're set up with 2FA on your account?

The copyright/written-by notices are perfectly fine. You can feel free to add your name to any files you substantially modify. We could have an AUTHORS or CREDITS file too, though it feels a bit highfalutin for such a small number of contributors.

I'm not precisely sure where we should place the goalposts for v4.0a. I'd really like to get some testing from users who have had problems with the existing "dumbio", so that's tickets: …
I can ask some regular users if they have any HFE scenarios with unreliable writes. All the same, even without IMG, drawing a line for 4.0a and getting wider testing on it is not a bad idea. It is also then a baseline for our own testing, both correctness and perf.
Okay, so async IMG is on the docket. The read/write paths look pretty stand-alone, but we'll need to figure out how we want to handle the sub-512-byte sector alignment. I should have an easier time stress-testing IMG as well.
I had a quick look down the list of collaborator privileges, and it looks like there's everything you could plausibly need, so that's good: https://docs.github.com/en/github/setting-up-and-managing-your-github-user-account/permission-levels-for-a-user-account-repository
As far as I'm concerned, you are now welcome to create and manage any branches you need to get your work ready for master. When you're ready, we can branch …
With 709d77d, DSK is now ported to async. That leaves only QD.
That's great! I tagged v4.2a, since there are other fixes since v4.1a too.
With d36bde6, QD is now ported to async. At some point we'll want some cleanup of the non-async codepath, mainly for the different buffer sizes. But I figure we'll let things sit for a bit first.
This is awesome! Thank you very much! Is that tested against the TestBed? Should I ask some QD users to give it a spin?
Yes, I used FF_TestBed. It'd be quite fair to give it a spin, especially since QD is so atypical, but it was actually a pretty straightforward copy of the changes to HFE. In general, I'd appreciate the testing. I'm honestly a bit more concerned about DSK. DSK was a copy of the changes to IMG, and wasn't that bad. But it has a blocking read in dsk_seek_track that is unfortunate. Technically, I can't see a way the timing is all that different from what we've done already, but it is unique, and it would have a smaller write buffer on track change than before. Oh! I missed one format. Dummy is still sync! 😅
Hi there, perhaps we should do a v4.4a now. What do you think?
v4.4a SGTM; that would make ADF testable, versus the very broken 4.3a state.
Great. I have a couple of things to port up from v3, and then I'll do a release later this week.
FYI: for now I have swapped …
I've worked my way back to flashfloppy, having mostly finished some of the other projects I started while my machine was broken. A big problem with asyncio was the difficulty of building confidence in its vastly different behavior compared to ….

I just pushed a replacement of ring_io called file_cache, for caching a single file. It is a ….

With file_cache, ADF, DSK, IMG, and QD have a much cleaner diff compared to ….

HFE is pretty different, not because of I/O so much as the HFEv3 enhancements and hard sector pulses. I have seen the rework of it on …
Hi, glad to see you back! A better-integrated file cache, targeting specifically the 435 and its larger RAM, is definitely something I'm interested in. There are two advantages I can see to improving the cache layer: …
Cutting down the diff against stable is all good. Maybe some rework in experimental-v4 makes sense in this direction. This will be useful long term, because I don't see the 415/105 firmware ever moving to this new branch, so two branches will need to be maintained in the long term.

Looking slightly further ahead, to get the new code into a stable release I think we will need to form the new patches into an orderly queue, review them, and admit them into a new branch freshly forked from master. Hard sector support is then gravy at the tail of that queue, as far as I'm concerned for now.

Sorry, that's a bit of a brain dump in response to your reappearance. :) I just want to be clear where my thoughts lie, so that you can decide where to prioritise efforts, depending on your own longer-term aims.
I'm also happy to commit some spare-time cycles here and help/review. There's more than a seed of useful stuff for a future "FlashFloppy Plus"[*] stable branch here.

[*]: That's the name of the 435 build. It would be nice for it to live up to its name and actually deliver more than the regular 415/105 firmware!
Doh. I said write-through cache above; I should have said it was a write-back cache. I had it right in the header file!

file_cache right now lost the "shadow" ring of ring_io, which allowed it to cache a full cylinder. It wouldn't be hard to re-add. Such handling would work equally well for the stm32f105 as for the at32f435. Only hfe at HD rate would behave differently (since qd is already weird). I would hope that we could move at least stm32f105 to the new branch, and I don't think at32f415 is out of the question. I had file_cache rolling around in my head before the at32f435 came around, in order to behave more like stable-v3 when desired. But I think it is fair to have that be a "later down the road" thing, once you have a better view of how things behave with file_cache.

I'll look into forming a patch queue for review. I'll have to see how bad the conflicts get to figure out my approach.
Ah sorry, yes, that makes sense really, as we already had a WT cache. Is there a particular reason that it makes sense to put this cache above the filesystem rather than below? I suppose it does depend what API you need to the cache, because talking to something that lives on the other side of the filesystem could be weird and difficult. But it depends how rich the cache API needs to be. I should probably pull down your branch and have a look around.

EDIT: I suppose, to answer that myself, it makes a lot of sense if working from the ring_io codebase with the fs already on the io (server) thread. Whether from a blank slate it is better to call the fs from the client thread or the server thread is unclear to me; along with file-block cache vs raw-block cache. A cache at the raw block level would allow caching metadata, but then we try pretty hard to cache cluster information explicitly anyway (…).

As for 105/415, it is true we could see benefit there too, actually. Especially being able to decode and buffer writes as logical sectors vs raw MFM in the write bitcell ring, for example. Being careful about the lack of RAM will obviously be important. Also crucial, though, is the lack of Flash: that could be a real problem, as we are so tight on space at 128kB Flash that we can't even do debug builds on the v4 branch. I wonder what proportion of 415 Goteks have 256kB Flash, as that could be a new build target if it's common.
The more I think about it, you're probably right. A cache that can be interrogated directly on the client side (and which should therefore be addressable by …).

The cache could easily be multi-file and include metadata if we want. The block tags would need a file-id as well as a block-id. We don't even need multi-file (at least not yet). A tag for file vs metadata could be useful, and that only needs a one-bit tag, to distinguish …
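A minimal sketch of what such a block tag could look like; this is purely hypothetical layout, and the field names are not from the codebase:

```c
#include <stdint.h>

/* Hypothetical block tag for a multi-file cache: the low bit of the
 * file-id distinguishes file data from filesystem metadata, so the
 * file-vs-metadata distinction costs exactly one bit of tag space. */
struct cache_tag {
    uint8_t  file_id;   /* bit 0: 0 = file data, 1 = fs metadata */
    uint32_t block_id;  /* block index within that file (or volume) */
};

static int tag_is_metadata(struct cache_tag t)
{
    return t.file_id & 1;
}

/* Two cache entries refer to the same block iff both tag fields match. */
static int tag_match(struct cache_tag a, struct cache_tag b)
{
    return a.file_id == b.file_id && a.block_id == b.block_id;
}
```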
I've rebased experimental-v4 on …. I know the ….

The DA commit needs to be split in two, to show the cost of the reduced write_bc size, and it needs a simplification pass. Its code was an adaptation of the ADF+ringio code, so it supports some things it doesn't need to.

Debug builds of …
Looks like you sorted the HFE confusion -- you don't touch ….

The async I/O control flow will take a bit of getting my head round. I will also be wondering whether some richer threading primitives would help simplify the higher levels. I just can't tell at the moment, as I'm too new to the code. But it does feel like there's a bit of ad hoc thread nudging going on.

Here's a bit of up-front code review on …:
I remain sorely tempted to leave 105/415 on the v3 stable branch forever and only develop for the 435. They may be in the tiny minority of Goteks in the field, but they are readily available to anyone who actually cares about any of this stuff. That does happen to be a tiny minority of Gotek users! Most never even update their firmware.
The I/O scheduler in file_cache requires polling to nudge it along, and so I also had it nudge the I/O thread along. That's the most important place for the nudging. I could have done the thread_yield() nearer main(), but I thought it was nicer to be closer to the code that needs it, and any "blocking" operations will need the thread_yield() in a loop anyway. Based on your other comments, I'm suspecting you might prefer the file_cache I/O scheduler (…
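The "blocking operations need thread_yield() in a loop" pattern can be mocked up as follows. This is a self-contained illustration, not FlashFloppy's code: thread_yield() here is a stand-in that steps a fake I/O state machine rather than the real cooperative scheduler.

```c
/* Fake in-flight I/O: completes after a fixed number of scheduler
 * steps, setting the completion flag the blocking wrapper polls. */
static int io_ticks_left;
static int io_done;

/* Stand-in for thread_yield(): gives the "I/O thread" a chance to run
 * one step of its state machine. */
static void thread_yield(void)
{
    if (io_ticks_left > 0 && --io_ticks_left == 0)
        io_done = 1;
}

/* A synchronous wrapper over an async operation: poll the completion
 * flag, yielding to the I/O thread each time around the loop. */
static void io_wait(void)
{
    while (!io_done)
        thread_yield();
}
```

The real code would check a per-request completion state rather than a global flag, but the control-flow shape (poll, yield, repeat) is the same.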
I've addressed these. Using C in …
If v3 development is intended to essentially stop, then that sounds more attractive. The patching back-and-forth was my biggest issue. Although I have enough units that I'd probably keep a build running for a while, like until the hard-sector stuff lands on the "new v4" branch.

The flash issue isn't a huge problem for me with GCC 13, and worst-case I disable SD support. On my machine, the headroom of the debug build on the file-cache branch changes from 1256 to 3872 bytes if I disable SD by commenting out the two references to ….

I think there are advantages to the idea of "don't let the smaller microcontrollers hold back efforts", and we see where things go.
Well, maybe let's hold that thought a while. Conceptually, I do sort of see the file cache as the API to the async I/O, and therefore the cache and its scheduler live in the async I/O "blob" in my head. Ultimately though, the scheduler is just a non-blocking piece of code that will prioritise among pending I/O requests and decide what gets issued next. That could run in either context, with a suitable interface chosen between threads. At the moment it looks like that interface is a ring-based queue, which does mean the I/O thread can run across multiple work items asynchronously. My initial concerns, compared with running the scheduler in the I/O thread, would be: …
Are these concerns real or important? I'm not sure. Perhaps a brief design doc describing the current separation of duties among threads, the APIs, and an example I/O flow could be useful here, for example how we typically end up getting a whole track or cylinder cached, and what happens on writes? Nothing too onerous, but something a level or two above the C code.
Damn, that's much nicer, thanks!
Well, v3 in this model would never quite stop. It would carry on much as it is: bug fixes and incremental features. So there will always be a need for some cherry-picking. However, the tricky new stuff would not be going into v3. I suppose the model would be to bring up

Anyway, that's the commitment here: to aim at making development v4-first, rather than leaving it hanging as before. If that happens, the
Agreed. I have no problem telling those (fairly few) users who run into trouble, especially with 415 Goteks, to go buy themselves a better Gotek. Apart from not needing to squeeze into 128kB Flash, we have a lot more design freedom with 384kB SRAM.
Regarding hard-sector support: even if it isn't going immediately into a new supported branch, it is sure to be easier for you to handle as an incremental feature on top of a v4-based supported master. If users interested in hard-sector support end up needing a 435-based Gotek, that's probably not a big issue, right?
What do you think is the best way forward for getting to
That sounds fine. I have no issue continuing to rebase stuff until we feel comfortable. Although right now we are discussing more the basics and the high level, so I expect everything up to "Add file_cache" would go in together. I can open PRs once it makes sense, to put less pressure on this thread. Or we can swap to email-based review.
I'll work on some diagrams and docs, but I happened to notice some sketch notes of mine were committed. I've cleaned them up a bit, so "fixup file cache; clean up docs" probably provides some of what you're looking for.
Yeah, probably not. Especially with the nice factory screen, without ugly (when I do it) modding. It was mostly for myself: I have ~10 Goteks, and keeping them up to date helps bug-finding to some degree. But once I replace the commonly used ones with the 435, it matters less.
Actually, it is easier to move file_cache to the I/O thread, from a queuing perspective. The issues with the move are more tangential, like "make sure to stop the thread when changing images." I'll make some diagrams, but depending on where you are on the learning curve the below might help. In my imagined move, both threads would share most of the state, so from a multi-threading perspective it would seem horrible. But practically the state sharing is unchanged, because the scheduler is asynchronous already. Instead of calling

I could go a step further and swap from a threading model to an asymmetric coroutine model, which is essentially the same thing but sets better expectations. I kept not mentioning it because cancel_call() throws a wrench in things, but as of this morning I think I have a way to solve that.
Okay, probably true, since it's not an especially useful branch without the whole lot. I wouldn't expect to be messing with the code until it's all in, certainly. So yeah, I think from your pov it looks like a handoff or submission as one big patch series.
Useful, though brief :) A doc could still be useful, but please don't waste time polishing it on my account. I can take a rough and ready design overview.
I'm not sure what "share most of the state" means here. The example cache miss seems to then use a fairly straightforward shared-memory API to communicate with the scheduler. As I understand it so far, I agree with you: putting the I/O scheduler in the I/O thread makes sense and probably makes things simpler. And we define a file cache API rich enough to represent our desires for various I/O types: prefetches, "synchronous" reads, deferred writes, ... We don't have to do it all in one go, but we'd have a good idea where we're headed!
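To make "rich enough for various I/O types" concrete, here is a toy model of such a cache API: prefetch, "synchronous" read, deferred write, and sync. Everything (names, block size, the in-memory `medium` standing in for the USB drive) is hypothetical, and the real prefetch/read would of course queue work for the I/O thread and yield rather than complete instantly.

```c
#include <stdint.h>
#include <string.h>

#define BLK_SZ   512
#define NR_BLKS  16

static uint8_t medium[NR_BLKS][BLK_SZ];  /* stand-in for the USB drive */
static uint8_t cache[NR_BLKS][BLK_SZ];
static uint8_t dirty[NR_BLKS];

/* Hint: start filling these blocks. In the real system this would
 * queue work for the I/O thread and return immediately. */
void fcache_prefetch(uint32_t lba, unsigned int nr)
{
    while (nr--)
        memcpy(cache[lba + nr], medium[lba + nr], BLK_SZ);
}

/* "Synchronous" read: would yield until the block is resident. */
void fcache_read(uint32_t lba, void *buf)
{
    memcpy(buf, cache[lba], BLK_SZ);
}

/* Deferred write: dirty the cache block; writeback happens at sync
 * (or on cylinder change). */
void fcache_write(uint32_t lba, const void *buf)
{
    memcpy(cache[lba], buf, BLK_SZ);
    dirty[lba] = 1;
}

/* Block until all dirty blocks have reached the medium. */
void fcache_sync(void)
{
    for (uint32_t i = 0; i < NR_BLKS; i++) {
        if (!dirty[i])
            continue;
        memcpy(medium[i], cache[i], BLK_SZ);
        dirty[i] = 0;
    }
}
```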
I mean, I dislike the modern tendency to use the coroutine name for any userspace thread runtime. But even taking the narrowest old-timer view of "asymmetric coroutines", what's the difference between that and cooperative multi-tasking between exactly two threads? If
I actually quite like the thread model. It can be beefed up in various ways should we want to: more threads, preemptive multitasking, etc. Maybe none of that will make sense, but we're not boxed into a more limited view with a threading API.
I was meaning resume()+yield(). It would make it clearer what you are expecting to run: instead of the ad-hoc thread nudging, the file_cache code would explicitly say "resume this coroutine". You still have unbound yields in the volume I/O (they don't know what code will resume them), but nowhere else. I'll stick with the threads for now.
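A protothread-style sketch of what resume()+yield() semantics buy: the caller explicitly resumes the I/O "coroutine", which runs until its next yield point. This is only an illustration of the control flow (the local-continuation trick via __LINE__), not the mechanism either of us would necessarily use; a stackful coroutine would keep its own stack across yields, which this trick cannot.

```c
struct coro { int line; int step; };

/* Local-continuation macros: each resume re-enters at the last yield. */
#define CORO_BEGIN(c)  switch ((c)->line) { case 0:
#define CORO_YIELD(c)  do { (c)->line = __LINE__; return; \
                            case __LINE__:; } while (0)
#define CORO_END(c)    } (c)->line = -1

/* The I/O coroutine: each explicit resume advances it one step. */
void io_coro(struct coro *c)
{
    CORO_BEGIN(c);
    c->step = 1;        /* e.g. issue a USB read */
    CORO_YIELD(c);
    c->step = 2;        /* e.g. reap completion, fill a cache block */
    CORO_YIELD(c);
    c->step = 3;        /* done */
    CORO_END(c);
}
```

Calling `io_coro()` is the resume(); the caller always knows exactly which piece of code it is handing the CPU to.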
Ok, I understand. Let's just stick with yields for now and continue to consider moving the scheduler into the I/O thread. And I also agree: let's continue to consider branch
And let's consider a shared doc or two. I'd like to consolidate thoughts not only on design decisions for
I had a further think about this, and a brief look through some later commits. ISTM that we should agree on the I/O strategy first, and build the cache API around that. And to me, the strategy is quite simple:
I think this changes the API and some of its users at least a bit. If we can assume the 435 chip and 384kB SRAM then at least some things are simplified: we can safely assume we can buffer a whole cylinder, for example, and we could allocate space in the I/O thread for "linearising" multi-block I/Os, so that we can issue more than one block at a time without requiring allocation of multi-block chunks in the cache itself.

E.g. on seeking to a new cylinder, the previous cylinder's blocks get flushed as follows: "waiting-for-prefetch" -> free, "dirty" -> clean. Then find/allocate all blocks that will be prefetched for the new cylinder, with any newly allocated blocks marked waiting-for-prefetch. Client I/O can now sync against prefetch based on cache-block flags (this might be hidden behind the cache API somehow). Prefetch strategy could be a simple block or two at a time from each side of the cylinder in turn, so that the other side can quickly be streamed if the virtual drive head is toggled; prefetching the blocks closest to the "virtual disk position" first, so that we can serve reads/writes asap without blocking on the prefetcher.
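The per-block states and the cylinder-change transitions described above could be sketched as follows. State names are illustrative, not committed API.

```c
/* Per-block cache states, as discussed for whole-cylinder buffering. */
enum blk_state {
    BLK_FREE,       /* unallocated */
    BLK_PREFETCH,   /* allocated, waiting for prefetch data */
    BLK_CLEAN,      /* valid copy of the medium */
    BLK_DIRTY,      /* modified, needs writeback */
};

/* On seeking away from a cylinder: waiting-for-prefetch blocks are
 * dropped (the data never arrived and is no longer wanted); dirty
 * blocks become clean once their writeback completes. */
enum blk_state flush_on_seek(enum blk_state s)
{
    switch (s) {
    case BLK_PREFETCH: return BLK_FREE;
    case BLK_DIRTY:    return BLK_CLEAN;  /* after writeback */
    default:           return s;          /* free/clean unchanged */
    }
}
```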
Based on these requirements (whole-cylinder buffering and deferred writeback on a big-SRAM 435 chip), I wonder whether there is a more incremental strategy to get there. If I were designing this from scratch, I think I'd keep the LBA block cache that I have, but add state flags per block (dirty, empty, etc). I would probably prefetch from the main thread, interleaved with normal floppy work. Setting up prefetch would mean allocating blocks in the LBA cache, to be filled, and prioritising them. I would perhaps add a simple function to fatfs to help with this, just to wrap the file-offset-to-LBA conversions.

The only reason I can see for async I/O is to hide unpredictable writeback latencies if I decide to write back while still emulating the current cylinder (e.g. because the whole cylinder is now cached, so I can write without getting in the way of read I/O): using async, I can still run main-thread floppy emulation while the write-to-USB happens. I might even only use the async thread for writeback; it could be a "clean dirty blocks and exit" thread. There would be few and obvious places to join or cancel that thread. Hiding read latencies is of much more questionable value, so punting reads across to a second thread might be needlessly complex. In a sense, the less placed into the secondary thread, the easier it is to reason about and synchronise with.

This definitely sounds a bunch less impactful. EDIT: I might even have a go at lashing it together. It's incremental plus some limited new tech off your branch, basically.
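The "clean dirty blocks and exit" idea could be as small as the following sketch: the secondary thread's only job is writeback, and it terminates when everything is clean, so there are few places to join or cancel it. Names and the dirty-flag array are made up for illustration; the real write path would yield during each USB transaction.

```c
#include <stdint.h>

#define NR_BLKS 8

static uint8_t dirty[NR_BLKS];
static unsigned int writes_issued;

/* Stand-in for a (yielding) USB block write. */
static void write_block(unsigned int i)
{
    (void)i;           /* real code would use the block index */
    writes_issued++;
}

/* Entire body of the writeback thread: clean everything, then exit. */
void writeback_thread(void)
{
    for (unsigned int i = 0; i < NR_BLKS; i++) {
        if (!dirty[i])
            continue;
        write_block(i);  /* main thread keeps emulating meanwhile */
        dirty[i] = 0;    /* main thread sees the block become clean */
    }
}
```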
An async I/O discussion started at #412 (comment) . It was an idea to free up a lot of memory from write_bc, allowing it to be reused for reading. Ideally it would allow us to buffer both sides of an entire track and absorb all write delays, until the track changes.
I've created a basic cooperative scheduler and an fs async API on top. I ported HFE (reading and writing) to async to see how it might look: master...ejona86:io-thread . I implemented HFE in a KISS way. While the HFE code seems to work after some very basic testing, the point is the approach, not the concrete code organization; lots of the code needs cleaning up.
Based on my flash drives' performance and full-speed USB bandwidth, I think ED and TD should be considered out of reach for HFE, independent of implementation. USB limits read time of an ED track to at least 66 ms, assuming no overhead and no latency. In my testing I think we could read an ED track in ~150 ms, which leaves no room for writes; even 100 ms would push it. We'd need more RAM to support them with HFE. ED should be easily supported with IMG instead, and TD would need us not to load both sides of the track.
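For anyone checking the 66 ms figure, here is the arithmetic I'm working from, with the assumptions spelled out: a 3.5" drive at 300 RPM (200 ms/rev), an ED MFM channel rate of 2 Mbit/s (1 Mbit/s data rate), HFE storing one bit per bitcell, and raw full-speed USB at 12 Mbit/s with zero protocol overhead. Real bulk throughput is well below that, which is why measured times are much worse.

```c
/* Lower bound on the USB transfer time for both sides of an ED HFE
 * track, under the assumptions above. */
int ed_track_ms(void)
{
    int rev_ms = 200;                            /* 300 RPM */
    long bitcells_per_side = 2000000L * rev_ms / 1000;  /* 400,000 */
    long bits_both_sides = 2 * bitcells_per_side;       /* 800,000 */
    return (int)(bits_both_sides / 12000);  /* 12 Mbit/s = 12,000 bits/ms */
}
```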
I was able to steal enough memory to support 1.44 MB HD floppies with HFE, fully buffering both sides of the track. It is tight as expected, although there are still some options to shift memory around a little, like swapping to 8-bit DMA to save 2 kB and using that memory elsewhere. I don't know whether we should assume we always have enough memory, though. I have an HFE buffering approach in mind that would support HD with HFE without fully buffering the track, but it would need to continually read the track from the flash drive. Is it worth investing much time to support such a thing? The approach would also lend itself to changing tracks without waiting for writes to finish committing to the flash drive, although write latencies would still impact it.
Main questions: