This issue serves to collect high-level areas where we want to improve or extend Halide. Reading it will let you know what is on the minds of the core Halide developers. If there's something you think we're not considering that we should be, leave a comment. This document is a continual work in progress.
This document aims to address the following high-level questions:
- How should we organize development?
- How do we make Halide easier to use for new users?
- How do we make Halide easier to use for new contributors?
- How do we keep Halide maintainable over time?
- How do we make Halide easier to use for researchers wanting to cannibalize it, extend it, or compare to it?
- How do we make Halide more useful on current and upcoming hardware?
- How do we make Halide more useful for new types of application?
To the greatest extent possible we should attach actionable items to roadmap issues.
Documentation and education
The new user experience could use an audit (e.g. the README).
There are a large number of topics that lack tutorials. Some examples:
- The GPU memory model (e.g. dirty bits, implicit device copies, explicitly scheduled device copies)
- Using Func::compute_with
- Effectively picking a good TailStrategy
- Scheduling atomic reductions, including horizontal vector reductions
- Generators with multiple outputs (there's a trade-off between tuples, extra channels, compute_with)
- Using (unrolled) extra reduction dimensions for scattering to multiple sites (plus the scatter/gather intrinsics)
- Using extern funcs and extern stages in generators
- Calling other generators inside a generator
- Using a Generator class defined in the process directly via JIT (Generator::realize isn't discoverable)
- Overriding the runtime
- Automatic differentiation
- Integrating with OpenCV, TensorFlow, PyTorch, and other popular frameworks.
- lambda
- Buffer
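For instance, the last two items could be covered by something as small as the following (a hypothetical tutorial snippet; the gradient expression is arbitrary):

```cpp
#include "Halide.h"
#include <cstdio>
using namespace Halide;

int main() {
    // lambda builds an anonymous Func from Vars and an Expr in one line.
    Var x, y;
    Func gradient = lambda(x, y, cast<uint8_t>(min(x + y, 255)));

    // Buffer holds the concrete pixels; realize() JIT-compiles and runs.
    Buffer<uint8_t> out = gradient.realize({128, 128});
    printf("out(10, 10) = %d\n", out(10, 10));
    return 0;
}
```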
There is not enough educational material on the Halide expert-user development flow: looping between tweaking a schedule, benchmarking/profiling it, and examining the .stmt and assembly output.
One existing resource is this talk: https://www.youtube.com/watch?v=UeyWo42_PS8
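As an illustration, one iteration of that loop might look like the following (the toy pipeline and file names are ours, not a canonical example):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam in(UInt(8), 2);
    Var x, y;
    Func brighten;
    brighten(x, y) = in(x, y) + 1;  // hypothetical pipeline to study

    // Tweak the schedule...
    brighten.vectorize(x, 16).parallel(y);

    // ...then dump the lowered IR and assembly for inspection.
    brighten.compile_to_lowered_stmt("brighten.stmt", {in}, Text);
    brighten.compile_to_assembly("brighten.s", {in}, "brighten");
    return 0;
}
```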
Documentation for the developers
- There should be a guide for how an external contributor should make their first pull request on Halide and what to expect. This commonly lives in a top-level CONTRIBUTING.md document. There are also pull request templates we can create.
- There should be a more detailed document or talk describing the entire compilation pipeline, from the front-end IR to backend code, to help new developers understand the entire project.
Support for extending or repurposing parts of Halide for other projects
Some things that could help:
- Robust serialization and deserialization of the front-end IR and the lowered IR
- Being able to compile libHalide without LLVM
- Being able to delegate compilation of parts of a Halide pipeline to an external sub-compiler. (e.g. see https://docs.google.com/presentation/d/1e3gsYkOrsM4XnI2IuMmFtIU6MAzTZ1zajcqWUDMDJUg/edit?usp=sharing )
Build system issues
We shouldn't assume companies have functioning build tools
Some companies build projects using a mix of duct tape and glue in a platform-varying way. Any configuration that goes into the build system is very painful for them (e.g. GeneratorParams for generator variants). Large numbers of binaries (e.g. one generator binary per generator) can also be painful (e.g. in Visual Studio). We should consider making GenGen.cpp friendlier to the build system (e.g. by implementing caching or depfiles) to help out these users.
Our buildbots aren't keeping up and require too much manual maintenance
Our buildbots are overloaded and have increasingly out-of-date hardware in them. Some can only be administered by employees at specific companies. We need to figure out how to increase capacity without requiring excessive manual management of them.
Runtime issues
- The runtime includes a lot of global state. This is great for sharing things between all the Halide pipelines in a process, but if there are multiple distinct users of Halide in the same large process, things can get complicated quickly (e.g. if they want different custom allocators; see the sketch following this list). One option would be removing all global state and passing the whole runtime in as a struct of function pointers.
- While most of the important parts of the runtime can be overridden by setting function pointers, some parts can only be overridden using weak linkage or other linker tricks, which is problematic on some platforms in some build configurations.
- There needs to be more top-level documentation for the runtime, describing how one might want to customize it in various situations. Currently there are just a few paragraphs at the top of HalideRuntime.h, plus documentation on the individual functions.
- Runtime error handling is a contentious topic. The default behavior (abort on any error) is the wrong thing for production environments, and there isn't much guidance or consistency on how to handle errors in production.
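As a concrete reference point for the first and last items, both the allocator and the error handler can already be swapped out through the (process-global) function-pointer hooks declared in HalideRuntime.h. A minimal sketch; the logging behavior and POSIX-only posix_memalign are our choices, not canonical:

```cpp
#include "HalideRuntime.h"
#include <cstdio>
#include <cstdlib>

// Custom allocator pair. halide_malloc replacements must return
// sufficiently aligned memory; 128 bytes is a conservative choice.
void *my_malloc(void *user_context, size_t size) {
    void *ptr = nullptr;
    if (posix_memalign(&ptr, 128, size) != 0) return nullptr;  // POSIX-only, for brevity
    return ptr;
}

void my_free(void *user_context, void *ptr) {
    free(ptr);
}

// Error handler that records the failure instead of calling abort().
void my_error_handler(void *user_context, const char *msg) {
    fprintf(stderr, "halide error: %s\n", msg);
}

void configure_halide_runtime() {
    // These hooks are global: every Halide pipeline in the process sees
    // them, which is exactly the coexistence problem described above.
    halide_set_custom_malloc(my_malloc);
    halide_set_custom_free(my_free);
    halide_set_error_handler(my_error_handler);
}
```

With a handler installed, calls into AOT-compiled pipelines report failure through their integer return code, which the caller can check instead of letting the process abort.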
Lifecycle
Versioning
Since October 2020, Halide has used semantic versioning. The latest release is (or will soon be) v15.0.0. We should adopt a practice for keeping a changelog between versions for Halide users. Our current approach of labeling "important" PRs with release_notes has not scaled.
Packaging
Much work has been put into making Halide's CMake build amenable to third-party package maintainers. There is still more to do for cross-compiling our arm builds on x86.
We maintain a list of packaging partners here: #4660
Code reuse, modularity
How do we reuse existing Halide code without recompiling it, especially in a fast-prototyping JIT environment? An extension of extern function calls or of generators might be able to achieve this.
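For reference, the closest existing mechanism is wrapping precompiled code as an extern stage. A minimal sketch of that shape, where reusable_blur is a hypothetical previously-compiled pipeline exposed with the extern-stage C ABI:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(Float(32), 2);

    // "reusable_blur" stands in for a previously compiled Halide
    // pipeline exposed with the extern-stage C ABI:
    //   int reusable_blur(halide_buffer_t *in, halide_buffer_t *out);
    Func stage;
    stage.define_extern("reusable_blur", {input}, Float(32), 2);

    // The extern stage composes with ordinary Funcs, but its interior
    // is opaque: no fusion or scheduling across the boundary.
    Var x, y;
    Func output;
    output(x, y) = stage(x, y) * 2.0f;
    return 0;
}
```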
Building a Halide standard library
There should be a set of Halide functions people can just call or include in their programs (e.g. image resampling, FFT, Winograd convolution). The longstanding obstacle is that it's hard to compose the scheduling.
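A tiny sketch of the composition problem, with a hypothetical library helper:

```cpp
#include "Halide.h"
using namespace Halide;

// A hypothetical library routine: 2x nearest-neighbor upsample.
// It can return an unscheduled Func, but then every caller must
// schedule internals they did not write; if it schedules itself,
// it cannot know the context it will be computed inside.
Func upsample_x2(Func in) {
    Var x, y;
    Func out;
    out(x, y) = in(x / 2, y / 2);
    return out;  // who schedules this: the library or the caller?
}
```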
Fast prototyping
How can we make fast prototyping of algorithms in Halide easier? JIT is great for getting started, but not all platforms support it (e.g. iOS), and the step from JIT to AOT is large, in terms of what the code looks like syntactically, what the API is, and what the mental model is.
Consider typical deep learning/numerical computing workflows (PyTorch, NumPy, MATLAB, etc.): a user fires up an interpreter, manipulates and visualizes their data, experiments with different computation models, prints out intermediate values of their program to understand the data and debug, and reruns the program multiple times on different inputs while iterating.
Unfortunately, the current Halide workflow does not fit this very well, even with the Python frontend.
- JIT caches are cleared every time the program instance is terminated. Even if the Halide program has not changed, if you rerun the program (for different parameters or inputs), Halide needs to recompile the whole program. This has become a major bottleneck for fast iteration of ideas.
- Printing intermediate values of Halide programs for debugging and visualization is painful. Either you use the cumbersome print() (and recompile the program) or you add the intermediate Halide function to the outputs (and recompile the program).
- Halide's metaprogramming interface makes it less usable in a (Jupyter) notebook environment.
Two immediate work items:
- Have an option for JIT compilation to save the result to disk and load it back automatically when it is cached. This is related to the serialization effort.
- Have an interpreter for Halide (or equivalently an "eager mode", cf. TensorFlow) that defaults to some slow schedule (e.g. compute_root everything with basic parallelization); a sketch of that default follows.
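For the second item, the fallback schedule itself is easy to state. A minimal sketch of compute-root-everything with basic parallelization, applied by hand to a toy pipeline:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x, y;
    Func a, b;
    a(x, y) = x + y;
    b(x, y) = a(x, y) * 2;

    // The proposed eager-mode default: no fusion, no tiling, just
    // materialize every stage and parallelize the outer loop.
    for (Func f : {a, b}) {
        f.compute_root().parallel(y);
    }

    Buffer<int> out = b.realize({256, 256});
    return 0;
}
```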
GPU features
We should be able to place Funcs in texture memory and use texture sampling units to access them.
This is particularly relevant on mobile GPUs, where you can't otherwise get things to use the texture cache. It's also necessary for interop with other frameworks that use texture memory (e.g. Core ML).
An API to perform filtered texture sampling is needed. Ideally this would work cross-platform, even if not necessarily blazingly fast; being able to validate on CPUs is very useful. There are some design issues around the scope and cost of the sampler object allocations that many GPU APIs require.
Currently this has been low priority because we don't have examples where texture sampling matters a lot. Even for cases where it obviously should (e.g. bilateral guided upsample), it doesn't seem to matter much.
A good first step is supporting texture sampling on CUDA, because it doesn't require changing the way in which the original buffer is written to or allocated. An independent first step would be supporting texture memory on some GPU API without supporting filtered texture sampling. These two things can be done orthogonally.
Past issues on this topic: #1021, #1866
We should support tensor instructions.
We have support for dot product instructions on ARM and PTX via within-vector reductions. The next task is nested vectorization (#4873). After that we'll need to do some backend work to recognize the right set of multi-dimensional vector reductions that map to tensor cores. A relevant paper on the topic: https://dl.acm.org/doi/10.1145/3378678.3391880
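For reference, a minimal sketch of the pattern that lowers to a within-vector reduction today (the exact schedule needed to hit the dot-product instructions on a given target is our best understanding, not gospel):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam a(UInt(8), 1), b(UInt(8), 1);
    Var x;
    RDom r(0, 4);

    // A widening 4-wide multiply-accumulate per output element.
    Func dot;
    dot(x) = cast<uint32_t>(0);
    dot(x) += cast<uint32_t>(a(4 * x + r)) * cast<uint32_t>(b(4 * x + r));

    // Vectorizing the (associative) reduction dimension produces a
    // within-vector reduction; mapping it onto dot-product or tensor
    // instructions additionally wants x vectorized, which is where
    // nested vectorization (#4873) comes in.
    dot.update().atomic().vectorize(r);
    return 0;
}
```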
New CPU features
ARM SVE support
Machine learning use-cases
We should be able to compile generators to TensorFlow and Core ML custom ops.
We can currently do this for PyTorch (see apps/HelloPytorch), but it's not particularly discoverable.
We should have fully-scheduled examples of a few neural networks. We have resnet50, but it's still unscheduled.
Targeting MLIR is worth consideration as well.
This is likely a poor match, because most MLIR flavors operate at a higher level of abstraction than Halide (operations on tensors rather than loops around scalar computation).
Autoschedulers
There's lots of work to do before autoschedulers are truly useful. A list of tasks:
- We need to figure out how to provide stable autoschedulers that work with Halide master to serve as baselines for academic work, while at the same time being able to improve the autoschedulers over time.
- There needs to be a tutorial on using the standalone autoschedulers, including the autotuning modes of those that can autotune (a rough sketch of the current flow follows this list).
- We need to figure out how to include them in distributions.
- There should be a hello-world autoscheduler that serves as a guide for writing a custom one.
- There should be a one-click solution for all sorts of autoscheduling scenarios (pretraining, autotuning, heuristics-based, etc.).
- For several autoschedulers, the generated schedules may or may not work for image sizes smaller than the provided estimates. This is lousy, because autoschedulers should be usable by people who don't understand the scheduling language and don't know to fix tail strategies.
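Until that tutorial exists, here is a rough sketch of the current JIT-side flow. The plugin name, entry point, and exact auto_schedule signature have shifted across releases, so treat all of it as an approximation:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x, y;
    Func f;
    f(x, y) = (x + y) * 3;
    // Autoschedulers need size estimates for the outputs.
    f.set_estimate(x, 0, 1024).set_estimate(y, 0, 1024);

    // Load an autoscheduler plugin and apply it by name. Names and
    // signatures follow the v14-era API and may differ elsewhere.
    load_plugin("autoschedule_adams2019");
    Pipeline p(f);
    p.auto_schedule("Adams2019", get_jit_target_from_environment());

    Buffer<int> out = p.realize({1024, 1024});
    return 0;
}
```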
Things we can deprecate
- arm-32 (probably still need this)
- x86 without sse4.1