Commit df4601ff authored by Pierre Paleo

Add architecture page

parent 6f4df0f3
# Processing: concepts and classes
Nabu can be used as a library, either to use simple processing utilities, or to build a full pipeline on top of it. In any case, it is useful to know how things are organized internally.
When it comes to processing, nabu organizes its entities (classes, really) according to three "class types", from a bottom-up perspective:
- Processing
- Pipeline
- Reconstructor
This page explains what each type is about. In short, processing classes are assembled to form a pipeline, and pipeline(s) are configured/assembled by a Reconstructor.
*NB: Nabu can also be explored through another conceptual classification: its modules, which roughly reflect "steps" in a processing pipeline (i/o, pre-processing, reconstruction, post-processing).*
## Processing objects
Processing entities are functions or classes acting primarily on arrays (NumPy arrays, or extensions like pycuda/pyopencl arrays).
These functions/classes should
1. Be as straightforward to use as possible
2. Primarily act on arrays
3. Have a restricted scope ("do one thing and do it well")
This is an opinionated design decision of Nabu.
The rationale for points (1) and (2) is fast prototyping (for example from an IPython console, notebook, or script), which is probably one factor of the success of [tomopy](https://tomopy.readthedocs.io). NumPy arrays are the ubiquitous data container in Python scientific libraries.
Point (3) makes the code more robust: it is easier to read and to cover with unit tests.
These "atomic" building blocks are meant to be chained together to form a series of processing steps (a pipeline).
Examples of such "processing classes" are `FlatField`, `PaganinPhaseRetrieval`, `Backprojector`, ...
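As an illustration, here is a minimal sketch of what such an array-oriented processing class can look like. This is not Nabu's actual `FlatField` implementation; the class name and its API are invented for the example:

```python
import numpy as np

class ToyFlatField:
    """Hypothetical processing class: flat-field normalization
    (radio - dark) / (flat - dark), acting directly on numpy arrays."""

    def __init__(self, flat, dark):
        self.flat = flat.astype(np.float32)
        self.dark = dark.astype(np.float32)

    def normalize(self, radios):
        # Takes a stack of radios (n_images, n_rows, n_cols)
        # and returns a new, normalized array
        return (radios - self.dark) / (self.flat - self.dark)
```

Such an object does one thing (normalization), knows nothing about files or pipelines, and can be tried interactively on any suitable numpy array.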
## Pipelines
A "Nabu pipeline" is, quite naturally, an assembling of "processing objects". For example, a simple pipeline can be obtained by chaining objects `DataReader`, `FlatField` and `FBP`. This is what would correspond to a python script implementing procedural steps one after the other.
However, as needs usually vary from one beamline to the other, "Nabu pipelines" have to be flexible.
The processing objects should be usable in different ways without editing the code. Therefore, the pipeline should be made configurable through an external user configuration (e.g. a configuration file). The "Nabu pipeline" then ingests this user configuration and uses it to configure its internal processing objects.
To sum up, Nabu pipelines are made of the following ingredients:
- Processing building blocks (eg. `FlatField`)
- Information on how to use (configure) these building blocks
- Information on the dataset
*[This page](configparsing.md) explains how to extract the user configuration, translate this user configuration to actual processing classes parameters, and parse the dataset.*
`FullFieldPipeline` and `FullRadiosPipeline` are examples of such pipelines.
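To make the "configurable pipeline" idea concrete, here is a hedged sketch of a pipeline translating a user configuration into a chain of steps. The names (`TinyPipeline`, `flatfield`, `delta_beta`) are invented for illustration and do not reflect Nabu's actual configuration keys:

```python
class TinyPipeline:
    """Hypothetical pipeline: ingests a user configuration (a plain dict here,
    typically parsed from a configuration file) and decides which processing
    objects to instantiate and how to configure them."""

    def __init__(self, user_config):
        self.do_flatfield = user_config.get("flatfield", True)
        self.delta_beta = user_config.get("delta_beta", None)

    def steps(self):
        # Derive the chain of processing steps from the configuration
        chain = ["read"]
        if self.do_flatfield:
            chain.append("flatfield")
        if self.delta_beta is not None:
            chain.append("phase_retrieval")
        chain.append("reconstruction")
        return chain
```

With this design, enabling phase retrieval is a configuration change (`delta_beta: 100`), not a code change.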
Things could stop here, but in Nabu there is the following **design decision: a pipeline is used to process a (sub)volume that fits in memory**.
This seems to contrast with one primary purpose of nabu, which is to handle large amounts of data. We see in the following section how to handle this limitation.
## Reconstructor
Reconstructors are the final, ready-to-use objects to perform a full volume reconstruction. A *Reconstructor creates/configures/manages Pipeline objects*, in a similar way that a *Pipeline assembles Processing objects together*.
We may wonder why such objects are needed in the first place. After all, the "Pipeline objects" described above could be able to handle data not fitting in memory. The short answer is *work distribution*.
When it comes to distributing the work (reconstructing sub-volumes), there are two possible approaches:
1. Distribute the work within the Pipeline object.
2. Distribute the work outside the Pipeline object.
Approach (1) means that each Pipeline class must implement the workload distribution logic. This distribution logic depends on at least the following factors:
- How data is handled (groups of vertical images, horizontal slabs, etc.)
- What is the target: local machine, task scheduler (SLURM), etc.
This means that *each* Pipeline class must implement at least *four* distribution logics (one per combination of the factors above).
Instead, we follow approach (2): a Pipeline object is bound to a certain chunk/group size, computed so that the subvolume fits in memory. We therefore need a "Pipelines manager", which in our case is called Reconstructor, to handle the logic of distributing the work.
The Reconstructor is responsible for determining how a volume will be reconstructed by one or several Pipeline objects. It notably has to estimate the available resources (host/GPU memory, number of CPU cores, etc), and possibly distribute the workload.
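A hedged sketch of the kind of resource estimation involved; the formula, function name, and default numbers are illustrative, not Nabu's actual logic:

```python
def estimate_group_size(n_angles, width, dtype_size=4,
                        memory_budget=2 * 1024**3, margin=0.8):
    """Hypothetical helper: how many sinograms fit in a given memory budget.
    One sinogram costs n_angles * width * dtype_size bytes; a safety margin
    keeps room for intermediate buffers."""
    bytes_per_sinogram = n_angles * width * dtype_size
    return max(1, int(margin * memory_budget / bytes_per_sinogram))
```

For example, with 2000 projection angles, a detector width of 2048 pixels and float32 data, on the order of a hundred sinograms fit in a 2 GB budget with this formula.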
## Understanding the class types through a simple example
Suppose you want to build a very simple processing pipeline consisting in the following steps:
- Read data
- Perform flat-field normalization
- Transpose the volume (to get sinograms)
- Perform FBP reconstruction
- Save the resulting image
As there are five steps, the pipeline will be obtained by chaining five "building blocks": `Reader`, `FlatField`, `Transpose`, `FBP`, `Writer` - each of them can be a custom function or built-in nabu class/function.
Our simple pipeline - let's call it `SimplePipeline` - consists of assembling the five aforementioned building blocks. This `SimplePipeline` class (or function) will have to implement some logic, e.g. passing the result of `Reader` to `FlatField`.
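A deliberately naive sketch of this chaining logic (the class and the callables are invented stand-ins for the real building blocks, and the memory problem discussed next is ignored for now):

```python
class NaivePipeline:
    """Hypothetical pipeline: chain processing steps by feeding each one
    the output of the previous one."""

    def __init__(self, *steps):
        self.steps = steps

    def process(self):
        data = None  # the first step (Reader) produces the initial data
        for step in self.steps:
            data = step(data)
        return data
```

The five building blocks can then be any callables with a compatible input/output, e.g. `NaivePipeline(reader, flatfield, transpose, fbp, writer).process()`.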
One problem we face almost immediately is that our `SimplePipeline` cannot "ingest" (process) an entire dataset in a single pass. Usually, the data volume is too big to be transposed in one step.
*NB: For simpler "image processing pipelines", where no transposition is needed, this would not be a problem. The pipeline would process one image at a time (or several simultaneously to hide disk latency) in a loop. But because of the very nature of tomographic reconstruction (one output voxel needs information from all the input radios), things are more complicated.*
Therefore, `SimplePipeline` must be able to process a **subset** (sub-volume) of the dataset. Assuming this pipeline processes groups of radios, it would be called as:
```python
pipeline = SimplePipeline(size=100, ...) # process by group of 100 images
pipeline.process(subset=(0, 100))
pipeline.process(subset=(100, 200))
# ...
```
The `Reconstructor` classes in nabu are simply classes encapsulating the above logic. They automatically compute the `subset` size from the machine's available memory, and make the successive calls to `pipeline.process()`.
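The driving loop can be sketched as follows. This is a minimal illustration with a dummy pipeline; the function and class names are invented, and the real Reconstructor classes also handle resource estimation and workload distribution:

```python
def reconstruct_volume(pipeline_cls, n_images, group_size):
    """Hypothetical Reconstructor core loop: bind a pipeline to a group size,
    then process the dataset subset by subset."""
    pipeline = pipeline_cls(size=group_size)
    processed = []
    for start in range(0, n_images, group_size):
        subset = (start, min(start + group_size, n_images))
        pipeline.process(subset=subset)
        processed.append(subset)
    return processed


class DummyPipeline:
    """Stand-in pipeline accepting the (size=...) and process(subset=...)
    interface used above."""
    def __init__(self, size):
        self.size = size

    def process(self, subset):
        pass  # a real pipeline would process this sub-volume
```

For instance, `reconstruct_volume(DummyPipeline, 250, 100)` splits a 250-image dataset into subsets `(0, 100)`, `(100, 200)` and `(200, 250)`.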