We aren't going to submit to ISCA; I wanted to capitalize on this time to do some reflection on what we've learned and make some decisions about how to proceed.
The primary goal of Glenside is to propose a new hardware/software codesign methodology for deep learning accelerators and workloads. The core insight is to make the hardware search and the software search truly simultaneous rather than staged. Other codesign methodologies involve some amount of staging, iterating between a software optimization step and a hardware optimization step; Glenside fuses the stages to achieve "true" codesign. It does this by turning the whole problem of codesign into a program rewriting problem: by capturing both software and hardware design concerns in Glenside's language, we can implement rewrites which do software exploration (by rewriting software to software) and hardware exploration (by rewriting software into hardware, or hardware into other hardware). We leverage egraphs for their efficient rewriting and their ability to store the massive space of rewritten programs compactly. Once rewriting is complete, the egraph captures a massive hardware--software search space, from which a design can be extracted by a procedure that reasons about hardware and software simultaneously. We currently use ILP, but you could imagine many other extraction procedures.
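To make this concrete, here is a minimal sketch of the rewrite-then-extract loop using the egg crate (a Rust egraph library). Everything here is illustrative: the operator names (conv2d, im2col, matmul, systolic-array) are stand-ins rather than Glenside's actual operators, and AstSize stands in for our ILP extraction.

```rust
use egg::{rewrite, AstSize, Extractor, RecExpr, Rewrite, Runner, SymbolLang};

fn main() {
    // A workload as an s-expression. SymbolLang parses arbitrary
    // s-expressions, so we can sketch without defining a full language.
    let workload: RecExpr<SymbolLang> = "(conv2d weights activations)".parse().unwrap();

    let rules: Vec<Rewrite<SymbolLang, ()>> = vec![
        // Software-to-software exploration: expose an alternative algorithm.
        rewrite!("conv-as-im2col"; "(conv2d ?w ?x)" => "(matmul ?w (im2col ?x))"),
        // Software-to-hardware exploration: map a matmul onto a hardware unit.
        rewrite!("map-systolic"; "(matmul ?a ?b)" => "(systolic-array ?a ?b)"),
    ];

    // Equality saturation: the egraph compactly stores every program
    // reachable by rewriting, i.e. the combined hardware/software search space.
    let runner = Runner::default().with_expr(&workload).run(&rules);

    // Extraction: AstSize is a placeholder cost function; the point of the
    // design is that richer procedures (e.g. ILP) can plug in here.
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (cost, best) = extractor.find_best(runner.roots[0]);
    println!("best (cost {}): {}", cost, best);
}
```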
Now to discuss why we're not submitting to ISCA. While I still believe in the core goal of Glenside, I'm not sure its current state embodies it, for two primary reasons. First, our current story for extracting from the egraph is not convincing and needs more work. We want to show that the egraph exposes a rich search space which can be used by many varied extraction procedures of the user's choosing; however, we currently use only a pretty lackluster ILP formulation. Second (and a direct cause of the first), as we ramped up for ISCA, we slowly shifted more and more functionality off of the egraph to get a baseline working. While we did get our baseline working, it doesn't actually show the possibilities of the egraph, as the search space captured within the egraph is not interesting.
If we were only faced with the first issue---that our extraction story is not where I want it to be---we could still have submitted, I think. Ideally, we would build at least one somewhat robust extraction method, if not more than one, to show that the interface for extraction is usable in multiple ways. However, even with just one weaker extraction method we likely could have shown the possibilities of the extraction interface. Thus, I think the second problem is our primary issue, and is where I'd like to spend most of our effort rethinking.
In the rest of this post, I will discuss some of the short- and long-term paths for research on Glenside. In Showing the Full Capabilities of the Egraph, I will discuss the primary reason we're not submitting now, and what needs to change. In Getting More Realistic About What Accelerators Look Like, I'll discuss how Glenside's current model (or lack of model) for accelerators might help or hurt us.
Showing the Full Capabilities of the Egraph
The core of the current problem is that we shifted work off of the egraph in order to get a baseline working, and in doing so stopped showing the egraph's capabilities.
A core example: Glenside used to figure out how to "block up" matrix multiplications automatically, but it was a resource-intensive process that required tuning. To make progress on other parts of Glenside (namely codegen), I decided to lean on the fact that our current accelerator infrastructure can also figure out blocking at runtime, which took much of the work off of Glenside but also shrank the resulting search space. I need to switch back to having Glenside efficiently find matrix blocking schemes; in addition, I need to enable Glenside to explore other design considerations. These two directions---making Glenside's current search more efficient, and expanding Glenside's search to allow control over other components of the hardware and software design---roughly outline where I feel we need to focus our energy.
Making Glenside's Current Search More Efficient
There are problems with the current search as it stands, and as we refocus on the search space, there will be things to redesign. An initial "smoke test" goal was to have Glenside automatically discover how to block up computations on a systolic array. Glenside can do this, but the approach we use is inefficient: the language is such that it takes many rewrite applications to discover a blocking strategy, and worse, the number of iterations needed scales with the size of the multiplication. This is because we represent blocked matrix multiplications essentially as a linked list built from "concatenate" operators, with one block of computation at each position in the list. Each application of one of the rewrites lengthens the list by one, at which point we can run another rewrite to insert a systolic array invocation. This is just one place where it's become clear to me that the design of Glenside needs reworking. In some sense, the old design allowed extremely fine-grained control over how operations are blocked up: a single matrix multiplication could easily be represented as blocked across multiple systolic arrays of different sizes. But, perhaps unsurprisingly in retrospect, that level of design control is unnecessary, so we end up wasting our search budget expanding a useless design space.
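To illustrate the shape of the problem (with hypothetical operator names, not Glenside's real ones): blocking a 64-row matmul into 16-row tiles takes one rewrite application per block peeled off, so the concatenate chain, and with it the number of search iterations, grows linearly with the matrix size.

```rust
use egg::{RecExpr, SymbolLang};

fn main() {
    // After two applications of the blocking rewrite, the program is a
    // two-deep "linked list" of concatenates; each further application
    // peels off one more block and lengthens the chain by one.
    let blocked: RecExpr<SymbolLang> =
        "(concat (systolic-array (rows a 0 16) b)
                 (concat (systolic-array (rows a 16 32) b)
                         (matmul (rows a 32 64) b)))"
            .parse()
            .unwrap();
    println!("{}", blocked);
}
```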
This example hints at two principles to take forward. First, design the language for rewrite efficiency. I think this is more straightforward. For example, if the number of rewrites needed to discover a blocking strategy depends directly on the size of the matrix multiplication, there's likely a more efficient way to implement things.
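One hypothetical fix, as a sketch: fold the blocking into a single parameterized tiling operator, so that one rewrite application exposes a blocking regardless of matrix size. Here tile/untile and the fixed 16x16 tile size are invented for illustration; the tile size would really be a searched-over parameter.

```rust
use egg::{rewrite, Rewrite, SymbolLang};

fn main() {
    // One application suffices for any matrix size: the iteration over
    // blocks lives inside the (hypothetical) tile operator's semantics,
    // not in the length of a rewrite-built list.
    let _block: Rewrite<SymbolLang, ()> = rewrite!(
        "block-matmul";
        "(matmul ?a ?b)" => "(untile (systolic-array (tile ?a 16 16) (tile ?b 16 16)))"
    );
}
```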
Second, and more importantly, design the language for search space density. By density, I mean that, of all the designs present in the egraph after rewriting, a large fraction should be actually usable, and a large fraction of those should be interesting and potentially useful. For example, the current language design makes it easy to discover that we can block up a matrix multiply across seven systolic arrays of different shapes, but makes it very inefficient to find a simple blocking strategy for all of the layers of ResNet. Thus, we end up with many useless designs, and perhaps fail to find even one usable design.
Expanding Glenside's Search Space
In addition to improving the existing language, we also need to expand it to capture more design knobs. Currently, the primary knobs are the sizes of the systolic arrays and how matrix multiplications are blocked onto them. There are also some miscellaneous knobs, such as whether a convolution is done using the im2col strategy. However, these are far from the most important design knobs in building an accelerator; as such, we want to build into Glenside the ability to represent more facets of an accelerator design. Off the cuff, here are some of the ideas that have been in the back of my mind for a while:
Memory layout. This is perhaps as important in tensor languages as the kernel schedules that Halide/TVM abstract and tune over; I'd bet layouts determine the effectiveness of your algorithm as much as or more than some algorithmic concerns. I'm not sure how rich the space of layouts is, but this is one place where we are well-positioned to handle a rich space, if it exists (see the sketch after this list).
Memory hierarchy. I'm not sure how this will exist in Glenside, but I did want to capture memory movement as much as possible. Memory hierarchy seems like a potentially odd thing to represent in the egraph, or at least different from computational hardware: computational hardware has only certain capabilities, whereas memories' capabilities are all the same. In some sense, anything can be stored anywhere. Representing that in the egraph (or deciding not to) will be interesting.
Other chip concerns. We've baked in a lot of assumptions about the chip, such as how things are actually connected.
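As a sketch of what the memory layout knob might look like in the egraph (again with invented operators: layout tags a tensor with its layout, to-nchw converts back): a rewrite exposes the alternative layout, and extraction weighs the conversion cost against whatever the consuming hardware prefers.

```rust
use egg::{rewrite, Rewrite, SymbolLang};

fn main() {
    // Hypothetical layout rewrite: both layout choices end up in the same
    // eclass, so the decision is deferred to extraction rather than made
    // up front by the compiler.
    let _expose_nhwc: Rewrite<SymbolLang, ()> = rewrite!(
        "expose-nhwc";
        "(layout ?t nchw)" => "(to-nchw (layout ?t nhwc))"
    );
}
```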
I need to spend more time on this---finding the right group of knobs here will be absolutely essential!
Exploring Extraction
As I noted in the intro, another component we're currently weak on is extraction. While I think it's secondary to enriching the search space right now, it is just as important in the long run. We have discussed many potential extraction techniques: ILP, genetic algorithms, ML-guided search. We need to actually implement some of these.
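For a taste of the extraction interface at its simplest, an egraph library like egg lets you plug in a local cost function directly. Here's a toy sketch with invented costs; note that a greedy, purely local model like this can't reason globally (for example, charging for a systolic array once no matter how many operations share it), which is exactly why the richer procedures above are worth building.

```rust
use egg::{CostFunction, Id, Language, SymbolLang};

// A toy local cost model: hardware invocations are expensive, glue
// operators are cheap. The numbers are made up for illustration.
struct HardwareCost;

impl CostFunction<SymbolLang> for HardwareCost {
    type Cost = f64;
    fn cost<C>(&mut self, enode: &SymbolLang, mut costs: C) -> Self::Cost
    where
        C: FnMut(Id) -> Self::Cost,
    {
        let op_cost = match enode.op.as_str() {
            "systolic-array" => 100.0, // invoking a hardware unit
            "concat" => 1.0,           // data movement glue
            _ => 0.1,
        };
        // A node's cost is its own cost plus its children's best costs.
        enode.fold(op_cost, |sum, id| sum + costs(id))
    }
}

// Usage: egg::Extractor::new(&egraph, HardwareCost).find_best(root)
```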
Better Modeling
Extraction will depend heavily on having useful models of hardware. This is a research problem in itself, and one I should probably highlight more than I currently do. If we don't have useful models, then capturing a search space is nearly useless!
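For a flavor of what even a first-order model might look like, here is a rough cycle estimate for running a blocked matmul on a systolic array. It is illustrative and uncalibrated: it assumes perfect streaming and ignores memory movement entirely, which on real accelerators is most of the cost.

```rust
/// Very rough latency estimate for an r x c systolic array computing an
/// (m x n) x (n x p) matmul: one tile at a time, n streaming steps per
/// tile plus pipeline fill and drain, and no memory stalls. These are
/// assumptions a real model must discharge, not inherit.
fn systolic_matmul_cycles(m: u64, n: u64, p: u64, r: u64, c: u64) -> u64 {
    let row_tiles = (m + r - 1) / r; // ceil(m / r)
    let col_tiles = (p + c - 1) / c; // ceil(p / c)
    row_tiles * col_tiles * (n + r + c)
}

fn main() {
    // e.g., a 256x256x256 matmul on a 16x16 array
    println!("{} cycles", systolic_matmul_cycles(256, 256, 256, 16, 16));
}
```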
Getting More Realistic About What Accelerators Look Like
Glenside has always been divorced from reality. I don't have a chip-building background or an FPGA-hacking background. I don't know what accelerators generally look like. In a lot of ways, I think this has helped---I've been unconstrained by questions of what's possible or not. I think that same spirit will largely continue to benefit us. But I do think some knowledge of what accelerators look like would start to become useful.
My current understanding of ML accelerators is this: there is a global buffer for activations (input data or results of computations) and an SRAM for weights. The weight SRAM has an interface to DRAM for loading weights. Then there's the compute fabric itself, which interfaces with the global buffer and the SRAM and might include various processing elements---at the very least, a systolic array. In our case, we also have a small processor on the chip. This is very different from the view that Glenside might take of chips. Glenside simply attempts to "cover" all computations in a workload, totally unfettered by thoughts of what chips usually look like. It might assign each layer a different hardware unit, which likely wouldn't make sense in a real design unless you could pipeline across it (which would require batch size > 1, I think?).
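To make the contrast concrete, here is a sketch of the kind of fixed resource budget a real chip commits to up front. The fields and shape are my guesses at a typical design, not a description of any real accelerator; the point is that Glenside's covering view has no analogous global constraint.

```rust
// A guess at the fixed structure of a typical ML accelerator. Glenside's
// current "covering" approach invents hardware per operation instead of
// fitting every operation into one budget like this.
struct Accelerator {
    global_buffer_bytes: usize,           // activations: inputs and results
    weight_sram_bytes: usize,             // weights, loaded from DRAM
    systolic_arrays: Vec<(usize, usize)>, // (rows, cols) of each array
    has_scalar_core: bool,                // the small on-chip processor
}
```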
Batch Size Greater Than One?
So far we have handled only a batch size of 1. How often is batch size greater than 1 in practice? How are those cases usually handled? Does the hardware change for them? Could Glenside handle it?
Training
Training looks completely different in many ways. On the hardware side, it involves (at the very least) writing the weights, not just reading them. It also involves more computation, different orderings of computations, and different data-caching requirements.
Other Networks
CNNs are straightforward for Glenside to handle, but might be just as straightforward to handle in other ways (e.g. static analysis in Relay). That might not be the case for other networks, though, e.g. RNNs.
Conclusion
There is a lot on the roadmap for Glenside. This post is just an enumeration of some of those things. In the near term, it's pretty clear that we need to focus on enriching the egraph search space, but we also need to make some longer-term decisions about what comes after that.