Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FDataIrregular personal proposed reviews #593

Conversation

eliegoudout
Copy link
Contributor

@eliegoudout eliegoudout commented Nov 14, 2023

Hello,

I started looking at FDataIrregular out of curiosity and I propose a few modifications. I have not reviewed everything at all, but I might look at some more when I have spare time and motivation.

Current modifications:

  • Replaced occurrences of len(array.shape) with array.ndim,
  • Added properties FDataIrregular.points_split and FDataIrregular.values_split for clarity and to avoid repeatedly calling np.split(self.points, self.start_indices[1:]). As discussed in `FDataIrregular` discussion #592, it might be a good idea to make these cached_property's, but would require careful correspondng deleter implementation,
  • Cleaned FDataIrregular.restrict (better vectorization use, deleted loop over dimensions, no set-to-list or list-to-set operation) and added with_bound option for signature consistency with FDataGrid, but didn't implement it. Using with_bounds=True will raise a NotImplementedError,
  • Cleaned FDataIrregular.concatenate. It is generally not much faster, but I witnessed up to 40% gain (extreme case with lots of data and pretty high (co)dimensionality).

To do:

  • Validate start_indices (non-decreasing, allow empty samples and start_index = len(points),
  • Wrap reduceat or find alternative,
    • Implemented,
    • Choose implementation (dynamically?)
  • Use the above solution,
    • _get_sample_range_from_data + empty samples support (NaN),
      • _get_domain_range_from_sample_range NaN support,
    • ...
  • Clean _sort_by_arguments,
  • Clean _to_data_matrix,

Later work:

  • Fix __getitem__ ([] error),
  • See below range for empty samples could be (np.inf, -np.inf).

Copy link

codecov bot commented Nov 14, 2023

Codecov Report

Attention: Patch coverage is 90.58824% with 8 lines in your changes are missing coverage. Please review.

Project coverage is 86.65%. Comparing base (a875bd7) to head (cd7e73e).

Files Patch % Lines
skfda/representation/irregular.py 93.50% 5 Missing ⚠️
skfda/representation/basis/_fdatabasis.py 0.00% 2 Missing ⚠️
skfda/typing/_numpy.py 66.66% 1 Missing ⚠️
Additional details and impacted files
@@                       Coverage Diff                        @@
##           feature/irregular_operations     #593      +/-   ##
================================================================
- Coverage                         86.68%   86.65%   -0.03%     
================================================================
  Files                               156      156              
  Lines                             13326    13314      -12     
================================================================
- Hits                              11551    11537      -14     
- Misses                             1775     1777       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@vnmabus
Copy link
Member

vnmabus commented Nov 16, 2023

* Replaced occurrences of `len(array.shape)` with `array.ndim`,

I agree, this is better.

* Added properties `FDataIrregular.points_split` and `FDataIrregular.values_split` for clarity and to avoid repeatedly calling `np.split(self.points, self.start_indices[1:])`. As discussed in [`FDataIrregular` discussion #592](https://github.com/GAA-UAM/scikit-fda/discussions/592), it might be a good idea to make these `cached_property`'s, but would require careful correspondng `deleter` implementation,

I do not think this is necessary. There are three places in which split takes place:

  • _get_sample_range_from_data, where the minimum and maximum per function are computed. I think we can compute them without splitting, using np.minimum.reduceat and np.maximum.reduceat.
  • _sort_by_arguments, where the arguments per function are sorted. I think here there is no alternative, but this is done only once on object creation.
  • restrict. I do not think that it is necessary here, as you can compute the points to remove independently of the function where they belong, and then use add.reduceat to compute the new indices.
* Cleaned `FDataIrregular.restrict` (better vectorization use, deleted loop over dimensions, no set-to-list or list-to-set operation) and added `with_bound` option for signature consistency with `FDataGrid`, but didn't implement it. Using `with_bounds=True` will raise a `NotImplementedError`,

Thank you! I also did not notice that functions without values are not kept (also in the original). I think this would be surprising behavior, as the n-th function after restrict would not be the same. I would kept them, even if they have no points.

* Cleaned `FDataIrregular.concatenate`. It is generally not much faster, but I witnessed up to 40% gain (extreme case with lots of data and pretty high (co)dimensionality).

I am somehow surprised that appending to a list and compute np.concatenate is faster than allocating all the memory beforehand and assigning ranges. Any idea why? Maybe it is because the original code looped several times?

@eliegoudout
Copy link
Contributor Author

eliegoudout commented Nov 17, 2023

I do not think [points_splitand values_split properties are] necessary. There are three places in which split takes place:

* `_get_sample_range_from_data`, where the minimum and maximum per function are computed. I think we can compute them without splitting, using `np.minimum.reduceat` and `np.maximum.reduceat`.

* `_sort_by_arguments`, where the arguments per function are sorted. I think here there is no alternative, but this is done only once on object creation.

* `restrict`. I do not think that it is necessary here, as you can compute the points to remove independently of the function where they belong, and then use `add.reduceat` to compute the new indices.

Thanks for you're remarks. I was not familier with reduceat and indeed it might simplify some things. However, I would still argue that these properties do have proper interest:

  • For maintainability, much easier to read,
  • Lots of functionalities still remain to be implemented, so these might to come in handy,
  • Simply for the end user, who might not want to have to think "how can I get the individual samples?", especially if they are not very familiar with the library, or with numpy in general,
  • If we end up making them cached_propertys, there might also be some speed up (though probably negligible most of the time).

I don't have a super strong opinion about this, but I would still vote for these properties.

Regarding restrict, you're totally right, I will rewrite it, especially if we agree on your following point.

[F]unctions without values are not kept [..]. I think this would be surprising behavior, as the n-th function after restrict would not be the same. I would kept them, even if they have no points.

Hmm... that's a good point. I didn't even question the behaviour since I only meant to clean, but it might indeed be more sensible to keep empty functions. If there is a consensus, I'd gladly incorporate this behaviour.

Any idea why? Maybe it is because the original code looped several times?

Pure speculations:

  • Multiple loops indeed,
  • Also, if the total array is big, maybe it cannot all go into fast memory, making it slower to fill little by little.

@eliegoudout
Copy link
Contributor Author

eliegoudout commented Nov 17, 2023

I made another comment on purpose to avoid confusion. I would like to know what the position on start_indices is. Currently, nothing fundamentaly prohibits start_indices=[2,1,4] (decreasing and doesn't start with 0) for example. It does bug atm, but what should the data structure allow?

Personally, I think that:

  • start_indices should be non-decreasing (but may not be increasing),
  • start_indices[0] = 0 should be forced.

@eliegoudout
Copy link
Contributor Author

eliegoudout commented Nov 18, 2023

Sorry the comments are starting to pile up, but I discovered this edge case of ufunc.reduceat:
image
Which means that an "empty sample" (indicated by start_indices = [0, 2, 2, 3] for example for the second sample) would not behave as expected when using reduceat if we're not careful enough.

For example, as you suggested for the review of restrict, if we tried to use add.reduceat to compute the new indices, we could face this issue:

>>> np.add.reduceat(
...  [True, False, True, True, False, True],
...  [0, 2, 2, 3]
... )
array([1, 1, 1, 2])  # Instead of desired array([1, 0, 1, 2)

I don't know if there's any easy fix around this. But since I agree that empty samples should be allowed, reduceat might be limited for our use.

I pin this here for when I have more time to read it, I think it addresses this issue.

@vnmabus
Copy link
Member

vnmabus commented Nov 18, 2023

* For maintainability, much easier to read,

Right now, with just one use, I would argue that it is not easier.

* Lots of functionalities still remain to be implemented, so these might to come in handy,

Then we can reevaluate when we have more use cases for it. It is always easier to add features than to remove them.

* Simply for the end user, who might not want to have to think "how can I get the individual samples?", especially if they are not very familiar with the library, or with numpy in general,

But they can simply index the FDataIrregular, if what they want is individual samples, don't they?

* If we end up making them `cached_property`s, there might also be some speed up (though probably negligible most of the time).

I would say that caching something in programming is very dangerous (because cached things can easily become desynchronized), and should be used sparingly, only when the alternatives are much worse.

I don't have a super strong opinion about this, but I would still vote for these properties.

I do not discard the possibility right now, but I prefer to wait until the things are more clear.

Hmm... that's a good point. I didn't even question the behaviour since I only meant to clean, but it might indeed be more sensible to keep empty functions. If there is a consensus, I'd gladly incorporate this behaviour.

I think we should keep them, even if empty.

Pure speculations:

* Multiple loops indeed,

* Also, if the total array is big, maybe it cannot all go into fast memory, making it slower to fill little by little.

Well, until we do not have benchmark tests (#585), I think it is very difficult to justify doing micro-optimizations blindly, so I would keep your code for now.

@vnmabus
Copy link
Member

vnmabus commented Nov 18, 2023

Personally, I think that:

* `start_indices` should be non-decreasing (but may not be increasing),

* `start_indices[0] = 0` should be forced.

I agree with both of your suggestions. It was the way I was thinking about start_indices, even if that is not documented anywhere.

@vnmabus
Copy link
Member

vnmabus commented Nov 18, 2023

Sorry the comments are starting to pile up, but I discovered this edge case of ufunc.reduceat: image Which means that an "empty sample" (indicated by start_indices = [0, 2, 2, 3] for example for the second sample) would not behave as expected when using reduceat if we're not careful enough.

Wow, thank you for noticing this! This is a bit tragic, indeed. I think we can have a workaround for it (by computing the indexes with empty functions and overwriting them), but it will be a bit slower. We should probably encapsulate that into a function to prevent mistakes.

@eliegoudout
Copy link
Contributor Author

Then we can reevaluate when we have more use cases for it. It is always easier to add features than to remove them.

I get this point, it's a good one.

But they can simply index the FDataIrregular, if what they want is individual samples, don't they?

Clearly you're right 😬

I would say that caching something in programming is very dangerous (because cached things can easily become desynchronized), and should be used sparingly, only when the alternatives are much worse.

I thought so but wasn't entirely sure 👍

Overall I agree, I will remove these properties. If they ever become a bit more useful, we could still make them internal properties _points_split and _values_split.

@eliegoudout
Copy link
Contributor Author

Wow, thank you for noticing this! This is a bit tragic, indeed. I think we can have a workaround for it (by computing the indexes with empty functions and overwriting them), but it will be a bit slower. We should probably encapsulate that into a function to prevent mistakes.

It is indeed a bit tragic since, as you can see below, reduceat is blazingly faster! I tested with this (ugly in the form but standard in protocol) piece of code, on non corner cases.

Test ran

from math import ceil, floor, log10
from tabulate import tabulate
from time import perf_counter
import numpy as np


def method_reduceat(mask, idxs):
    """ reduceat """
    return np.add.reduceat(mask, idxs)

def method_split_iterate(mask, idxs):
    """ split then iterate """
    split = np.split(mask, idxs)[1:]
    return np.array([s.sum() for s in split])

def method_only_iterate(split):
    """ iterate over precomputed split """
    return np.array([s.sum() for s in split])

def time_one(n_points, n_samples):
    """ Times on execution of each in a random order. """
    rand_mask = np.random.random(n_points) > .5
    rand_idxs = np.r_[  # Starts with 0 and strictly increasing
        [0],
        np.random.choice(np.arange(1, n_points), n_samples - 1, replace=False)]
    rand_idxs.sort()
    split = np.split(rand_mask, rand_idxs)[1:]
    fs_args = [
        (method_reduceat, (rand_mask, rand_idxs)),
        (method_split_iterate, (rand_mask, rand_idxs)),
        (method_only_iterate, (split,)),
    ]
    n = len(fs_args)
    perm = np.random.permutation(n)
    res = [None] * n
    ts = np.ndarray(3)
    for i in perm:
        f, args = fs_args[i]
        t = -perf_counter()
        r = f(*args)
        t += perf_counter()
        res[i] = r
        ts[i] = t
        np.testing.assert_array_equal(r, res[perm[0]])
    return ts

# Util from https://github.com/eliegoudout/lasvegas/blob/dev/perf/__init__.py
def format_time(duration: float, num_digits: int = 4) -> str: 
    """ Formats a `float` duration in seconds to a human readable `str`.

    A few examples with `num_digits = 4` are given below, showcasing
    some special cases.
    ```
    ╭───────────────┬────────────────┬───────────────────────────────────────╮
    │   Duration    │     Result     │                  Comment              │
    ├───────────────┼────────────────┼───────────────────────────────────────┤
    │      1.5      │    1.500 ss    │ Significant 0's added                 │
    │      0.56789  │    567.9 ms    │ Last digit is rounded...              │
    │      0.99995  │    1.000 ss    │ ...which can lead to precision loss   │
    │      0.12345  │    123.4 ms    │ Rounds half to even (python built-in) │
    │   1234        │    1234. ss    │ Point is added for constant witdh     │
    │  12345        │    12345 ss    │ One more digit for longer durations   │
    │ 123456        │ AssertionError │ Exceeded max duration                 │
    │     -1        │ AssertionError │ Negative duration                     │
    │      0        │    0.000 as    │ Smallest unit for shorter durations   │
    │      5.67e-20 │    0.057 as    │ Precision is worse near 0.            │
    ╰───────────────┴────────────────┴───────────────────────────────────────╯
    ```

    Implementation heavily relies on following facts:
        - Consecutive units have constant ratio of `10 ** 3`,
        - Highest unit is the unit of `duration`'s encoding.

    Arguments:
    ----------
        duration (float): Expressed in seconds, duration to format. Must
            satisfy `0 <= duration < 10 ** (num_digits + 1) - .5`.
        num_digits (int): Number of significant digits to display.
            Larger durations can have one more and shorter durations
            less -- see examples above.

    Returns:
    --------
        (str): Formated duration -- _e.g._ `'567.9 ms'`.

    Raises:
    -------
        `AssertionError` if either `num_digits < 3` or
            `not 0 <= duration < 10 ** (num_digits + 1) - .5`
    """
    units = ['ss', 'ms', 'us', 'ns', 'ps', 'fs', 'as']
    max_pow = 3 * (len(units) - 1)
    n = num_digits
    assert n >= 3
    assert 0 <= duration < 10 ** (n+1) - .5, "Duration out of bounds."
    # Special case 0
    if duration == 0:
        return f"{0:.{n-1}f} " + units[-1]
    # Retrieve left shift for significant part
    left_shift = ceil(- log10(duration)) + n - 1
    significant = round(duration * 10 ** left_shift)
    # Special case `0.0099996` -> `'10.00ms'`
    if significant == 10 ** n:
        significant //= 10
        left_shift -= 1
    # If `duration` is barely too big: remove floating point
    if left_shift == -1:
        return f"{round(duration)} " + units[0]
    # Nominal case
    elif left_shift < max_pow + n:
        unit_index = max(0, 1 + (left_shift - n) // 3)
        y = significant * 10 ** (3 * unit_index - left_shift)
        n_left = int(log10(y) + 1)
        unit = units[unit_index]
        return f"{y:.{max(0, n-n_left)}f}{'.' if n == n_left else ''} " + unit
    # If so small that smallest unit loses precision
    else:
        return f"{duration * 10 ** max_pow:.{n-1}f} " + units[-1]

def expe(n_points, n_samples, n_runs):
    """ Runs multiple times and shows results. """
    T = np.r_[[time_one(n_points, n_samples) for _ in range(n_runs)]]
    T.sort(0)
    min_ = T.min(0)
    max_ = T.max(0)
    mean = T.mean(0)
    p = [5, 25, 50, 75, 95]
    std_p = T[ceil(len(T)*p[1]/100):floor(len(T)*p[-1]/100)+1].std(0)
    percentiles = np.percentile(T, p, 0)
    res = np.c_[min_, max_, mean, std_p, percentiles.T]
    names = ["reduceat", "split + iterate", "iterate only (pre-split)"]
    headers = ['name', 'Min', 'Max', 'Mean', f'Std {p[0]}-{p[-1]}', *map(lambda pp: f"{pp}%", p)]
    table = [[name] + list(map(format_time, line)) 
             for name, line in zip(names, res)]
    colalign = ('left',) + ('center',) * (res.shape[1])
    print(tabulate(table, headers=headers, tablefmt="rounded_outline", colalign=colalign))

n_points = 10000
n_samples = 1000
n_runs = 1000

expe(n_points, n_samples, n_runs)

Results:

>>> # Compute new `start_indices` given mask
>>> n_points = 10000
>>> n_samples = 1000
>>> n_runs = 1000
>>> 
>>> expe(n_points, n_samples, n_runs)
╭──────────────────────────┬──────────┬──────────┬──────────┬────────────┬──────────┬──────────┬──────────┬──────────┬──────────╮
│ name                     │   Min    │   Max    │   Mean   │  Std 5-95  │    5%    │   25%    │   50%    │   75%    │   95%    │
├──────────────────────────┼──────────┼──────────┼──────────┼────────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ reduceat                 │ 34.72 us │ 3.478 ms │ 46.86 us │  6.896 us  │ 35.19 us │ 36.67 us │ 37.98 us │ 44.99 us │ 69.08 us │
│ split + iterate          │ 3.950 ms │ 15.35 ms │ 4.543 ms │  425.9 us  │ 4.028 ms │ 4.113 ms │ 4.173 ms │ 4.620 ms │ 6.090 ms │
│ iterate only (pre-split) │ 2.155 ms │ 10.45 ms │ 2.537 ms │  255.1 us  │ 2.203 ms │ 2.276 ms │ 2.317 ms │ 2.507 ms │ 3.335 ms │
╰──────────────────────────┴──────────┴──────────┴──────────┴────────────┴──────────┴──────────┴──────────┴──────────┴──────────╯

@eliegoudout
Copy link
Contributor Author

Regarding empty samples:

I think we should keep them, even if empty.

I was implementing this, but we should decide what happens to sample_range in this case.
We could:

  • make it + or - np.inf (would cause issues with domain_range if it is deduced from sample_range),
  • forbid empty samples at instanciation, which wouls mean that empty sample always inherit from a previous sample_range (I think it's a very weird behaviour, but probably the least effort solution)
  • reconsider this attribute and its interaction with domain_range (for example, saying that -np.inf, +np.inf is ignored by domain_range). This can be done manually of by keeping track of empty samples via yet another (internal) attribute...
  • ?

Any good idea?

@vnmabus
Copy link
Member

vnmabus commented Nov 24, 2023

* make it + or - `np.inf` (would cause issues with `domain_range` if it is deduced from `sample_range`),

For me np.nan makes much more sense.

* forbid empty samples at instanciation, which wouls mean that empty sample always inherit from a previous `sample_range` (I think it's a very weird behaviour, but probably the least effort solution)

I would like to allow it if possible.

* reconsider this attribute and its interaction with `domain_range` (for example, saying that `-np.inf, +np.inf` is ignored by `domain_range`). This can be done manually of by keeping track of empty samples via yet another (internal) attribute...

This is ugly. I prefer np.nan, which is more accurate and easier to ignore, I think.

@eliegoudout
Copy link
Contributor Author

eliegoudout commented Nov 24, 2023

Okay, I come with a proposition for the reduceat problem. It was kind of painful because of all the subtle edge cases...

Anyways, here's a proposition:

def _reduceat(
    array: ArrayLike,
    indices: ArrayLike,
    axis: int = 0,
    dtype=None,
    out=None,
    *,
    ufunc,
    value_empty
):
    """Wrapped `np.ufunc.reduceat` to manage edge cases.

    The edge cases are the one described in the doc of
    `np.ufunc.reduceat`. Different behaviours are the following:
        - No exception is raised when `indices[i] < 0` or
            `indices[i] >=len(array)`. Instead, the corresponding value
            is `value_empty`.
        - When not in the previous case, the result is `value_empty` if
            `indices[i] >= indices[i+1]` and otherwise, the same as
            `ufunc.reduce(array[indices[i]:indices[i+1]])`.
    """
    array, indices = map(np.asarray, [array, indices])
    axis %= array.ndim
    ax_idx = (slice(None),) * axis
    n = array.shape[axis]

    pad_width = np.full((array.ndim, 2), 0)
    pad_width[axis, 1] = 1
    extended_array = np.pad(array, pad_width, mode="empty")
    extended_indices = np.append(indices, n)

    bad = (indices < 0) | (indices >= n)
    empty = (np.diff(extended_indices) <= 0) | bad
    extended_indices[:-1][bad] = n

    out = ufunc.reduceat(
        extended_array, extended_indices, axis=axis, dtype=dtype, out=out
    )[ax_idx + (slice(-1),)]
    out[ax_idx + (empty,)] = value_empty

    return out

which is used like so:

>>> array = [[0, 1, 2],
...          [0, 2, 1],
...          [1, 0, 2],
...          [1, 2, 0],
...          [2, 0, 1],
...          [2, 1, 0]]
>>> indices = [0, 0, 100, 2, 2, -1, 2, 5, 2]
>>> 
>>> _reduceat(array, indices, dtype=float, ufunc=np.minimum, value_empty=np.nan, axis=0)
array([[nan, nan, nan],
       [ 0.,  0.,  0.],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [ 1.,  0.,  0.],
       [nan, nan, nan],
       [ 1.,  0.,  0.]])
>>> _reduceat(array, indices, dtype=float, ufunc=np.minimum, value_empty=np.nan, axis=-1)
array([[nan,  0., nan, nan, nan, nan,  2., nan,  2.],
       [nan,  0., nan, nan, nan, nan,  1., nan,  1.],
       [nan,  0., nan, nan, nan, nan,  2., nan,  2.],
       [nan,  0., nan, nan, nan, nan,  0., nan,  0.],
       [nan,  0., nan, nan, nan, nan,  1., nan,  1.],
       [nan,  0., nan, nan, nan, nan,  0., nan,  0.]])

I essentially had 2 ideas for implementation: either by extending the array so I acn use index = len(array) for bad values. Or keep the original array, use index = len(array) - 1 and then keep track of those edge cases where I need to reapply ufunc once.
I first tried the second method as it seems a bit more efficient in case of huge arrays, but in the end, I became a bit too ugly for me.

I will next add this to irregular.py and use it whenever useful (lots of places, especially since we allow empty samples now).

Any comment is appreciated.

P.S.: I should point out that dtype is important here and the behaviour might need to be adjusted. For example, if array is only ints and value_empty = np.nan, then an error will occur because it's not convertible. The problem might really arises when the conversion silently works.

@eliegoudout
Copy link
Contributor Author

Also, the following is a problem (from __init__):

        if self.start_indices[-1] >= len(self.points):
            raise ValueError("Index in start_indices out of bounds")

How then to instanciate a fd where tailing samples are empty? I propose one of these solutions:

  • Either authorize len(self.points) (but not above),
  • Athorize any integer greater than or equal to len(self.points) and then:
    • Either clip to len(self.points),
    • Or don't.
  • Rethink start_indices to maybe have a "pairwise" syntax (meaning that we include the upper index at the end of start_indices.

@eliegoudout
Copy link
Contributor Author

Just for info, I played with a second implementation of _reduceat, simply using reduce multiple times while looping over indices. Sadly, It's pretty wild how differently they behave performance-wise.

Depending on:

  • array.ndim,
  • axis,
  • length of indices,

I can get from 10x faster to 10x slower... So I wouldn't know which to prioritize at the moment without further digging.

def _reduceat2(
    array: ArrayLike,
    indices: ArrayLike,
    axis: int = 0,
    dtype=None,
    out=None,
    *,
    ufunc,
    value_empty
) -> NDArray:
    """Wrapped `np.ufunc.reduceat` to manage edge cases.

    The edge cases are the one described in the doc of
    `np.ufunc.reduceat`. Different behaviours are the following:
        - No exception is raised when `indices[i] < 0` or
            `indices[i] >=len(array)`. Instead, the corresponding value
            is `value_empty`.
        - When not in the previous case, the result is `value_empty` if
            `indices[i] >= indices[i+1]` and otherwise, the same as
            `ufunc.reduce(array[indices[i]:indices[i+1]])`.
    """
    if not isinstance(axis, int):
        raise NotImplementedError

    array, indices = map(np.asarray, [array, indices])
    ndim = array.ndim
    assert -ndim <= axis < ndim
    axis %= ndim
    pre, (n,), post = map(tuple, np.split(array.shape, [axis, axis + 1]))
    shape = pre + (len(indices),) + post

    if dtype is None:
        dtype = array.dtype

    if out is None:
        out = np.empty(shape, dtype=dtype)
    else:
        out = out.astype(dtype)

    ii = [slice(None)] * ndim
    for i, (a, b) in enumerate(itertools.pairwise(np.append(indices, n))):
        ii[axis] = i
        ii_out = tuple(ii)
        if a < 0 or a >= min(b, n):  # Nothing to reduce
            out[ii_out] = value_empty
        else:
            ii[axis] = slice(a, b)
            ii_array = tuple(ii)
            out[ii_out] = ufunc.reduce(array[ii_array], axis=axis)
    return out

@vnmabus
Copy link
Member

vnmabus commented Nov 30, 2023

Any comment is appreciated.

I wonder how that compares with my own try (that only attempts to work when the indices are non-decreasing):

def _reduceat_vnmabus(
    ufunc,
    array: ArrayLike,
    indices: ArrayLike,
    axis: int = 0,
    dtype=None,
    out=None,
    *,
    value_empty
):
    """
    Wrapped `np.ufunc.reduceat` to manage edge cases.

    The edge cases are the one described in the doc of
    `np.ufunc.reduceat`. Different behaviours are the following:
        - No exception is raised when `indices[i] < 0` or
            `indices[i] >=len(array)`. Instead, the corresponding value
            is `value_empty`.
        - When not in the previous case, the result is `value_empty` if
            `indices[i] >= indices[i+1]` and otherwise, the same as
            `ufunc.reduce(array[indices[i]:indices[i+1]])`.
    """
    array = np.asarray(array)
    indices = np.asarray(indices)

    n = array.shape[axis]
    good_axis_idx = (indices >= 0) & (indices < n) & (np.diff(indices, append=n) > 0)

    n_out = len(indices)
    out_shape = list(array.shape)
    out_shape[axis] = n_out
    out = np.full_like(array, value_empty, shape=out_shape)

    good_idx = [slice(None)] * array.ndim
    good_idx[axis] = good_axis_idx
    good_idx = tuple(good_idx)

    reduce_at_out = ufunc.reduceat(
        array,
        indices[good_axis_idx],
        axis=axis,
        dtype=dtype,
    )
    out[good_idx] = reduce_at_out

A few (small) comments:

  • I think that the first argument should be the ufunc.
  • Using map hurts readability here, I think. Moreover, I also think that we can assume that this function will only be used with ndarray inputs, so we could remove the conversion step.
  • I am still a bit unsure on why did you need to extend the arrays.
  • The second version iterates over the functions using a Python for loop. I would say that is definitely going to be slower when the number of functions is high.

@vnmabus
Copy link
Member

vnmabus commented Nov 30, 2023

* I chose to allow `start_index = len(points)` to instanciate empty samples,

I agree.

* I temporarily included both implementations of `_reduceat` with hardocded `MODE = 1` variable (`2` also possible). Later benchmarks might decide what we do. It is a possibility that depending on the dimension of the arguments, we prefer mode `1` over `2` or vice versa, at run time

I do not want to overcomplicate things. I would rather have just one version, the one that performs better with lots of data (when performance really matters). This is likely version 1 or my own implementation (I did not benchmark it but should perform similarly). Version 2 has a Python loop over the functions, which will be noticeable when the number of functions is high.

* I'm wondering if, after all, the best value for `_sample_range` would not be `(np.inf, -np.inf)` for empty samples? It breaks monoticity, but is coherent with the idea than `min(empty_set) = np.inf`. Still, atm I chose `np.nan` as previously discussed.

I think that you are probably right here. It even generalizes better to other data types when we want to support them in the future, e.g. using the max and min representable integers.

I would like to raise personal concern regarding the current choice of using common points for mean and var. I thought the natural use case for FDataIrregular would be to have multiple recordings at slightly different time stamps because of natural uncertainty in the timing process. Then, the best solution, imo, would simply be to interpolate on a grid to determine, and then compute the operation. With the current implementation, one would probably get literally zero common point if they used real-world measures. Furthermore, the common_points with an empty set would also be an empty set. While an interpolated version of an empty set on a common grid would simple be all NaN's.

TLDR: Is the idea to work on common points possible to question? I think a more natural approach is to find an appropriate commone grid (equally spaced or not, to discuss...) and then to interpolate.

I agree with you that in practice the mean and variance computed this way would be empty. And it indeed concerns me too. However, I would argue that finding a common grid and interpolating is not something that should be automated in mean and var, unless there is a unique or best way to do it. Otherwise, we prefer the package to be "dumb" and force the user to make a conscious choice.

We have implemented a behavior similar to the one found in Tidyfun (which may not be a sensible choice). Our statistician has told us that usually in FDA the data are not kept in an irregular representation for much time, converting them to a basis representation as soon as possible, and performing the computations (such as mean and variance) in that representation. Thus, probably these methods will not be used much. We have currently a member of the team researching the best way to implement the conversion to a basis representation described in https://academic.oup.com/edited-volume/42134/chapter-abstract/356192087 , in which a mixed effects model is used to leverage information from all observations in the functional dataset in order to convert each one (so that the conversion makes sense even for very sparse data), instead of converting each observation individually.

@eliegoudout
Copy link
Contributor Author

eliegoudout commented Nov 30, 2023

Just a quick partial answer until I can get around a computer.

I am still a bit unsure on why did you need to extend the arrays.

If you're referring to MODE == 1: because I want to use native reduceat on slices i:len(array), and that natively raises exception. Of course, if you only work with non-decreasing indexes, it removes the problem, but the goal of my fix is to encompass all cases mentioned in the doc string.

I would say that is definitely going to be slower when the number of functions is high.

I agree, but the question becomes "when is it tangible?", which is deeply related to "how much data points do we expect for each sample for the average case?", which is rather subjective.


I will incorporate your proposal and propose a few bench results soon.

In the mean time, I think the primary focus should be, for the package as a whole in relation with #585 : let's identify covering use case scenarios.

What I mean by that is that we have the following (most important) degrees of freedom:

  • dim
  • codim
  • n_points
  • n_samples

And I think we should partition the possibilities in a meaningful way.
For example, I propose something like (completely fake proposition for demonstration purpose, not a real proposition):

  • scenario_0: n_points * (dim + codim) < 1000
  • scenario_1: dim == 1 and n_point < 100 * n_samples
  • scenario_2: dim == 1 and (not scenario_1
  • scenario 3: everything else.

I think this would allow for:

  • better benchmarking in general
  • better identification of the use cases,
  • possibly dynamically identifying the scenario to use the right implementations.

I really think this is one brick of what #585 requires.

@eliegoudout
Copy link
Contributor Author

Our statistician has told us that usually in FDA the data are not kept in an irregular representation for much time, converting them to a basis representation as soon as possible

Hmmm, then I would ask the following: what is the intended purpose of FDataIrregular? If this is just a proxy for constructing Grid or basis representation, then I would assume this choice entirely and not implement at all processing methods. What do you think?

@vnmabus
Copy link
Member

vnmabus commented Dec 1, 2023

Our statistician has told us that usually in FDA the data are not kept in an irregular representation for much time, converting them to a basis representation as soon as possible

Hmmm, then I would ask the following: what is the intended purpose of FDataIrregular? If this is just a proxy for constructing Grid or basis representation, then I would assume this choice entirely and not implement at all processing methods. What do you think?

Well, I think there is enough functionality that is meaningful for FDataIrregular (plotting, evaluation, etc) to justify it being a FData subclass. But yeah, reduction operations, such as the mean are a big problem. I nevertheless suspect that there MAY be a better way of computing the them taking the correlation between points into account (but without doing smoothing). However, I currently have no time to dedicate to such a derivation. If we discover some way to compute them better, that does not depend in heuristics or smoothing parameters, we could try to change it.

@vnmabus
Copy link
Member

vnmabus commented Jan 31, 2024

Sorry for the delay, but I was on vacation + job search so I did not have time. As far as I remember, this is only pending the choice for the _reduceat algorithm, backed by benchmarks if possible. I would like to finish this set of changes as soon as possible, to unlock and merge PR #536.

@eliegoudout
Copy link
Contributor Author

Hi :)

No problem, I'm sorry I also got quite busy and left this PR stale! I think that if you want to merge this quickly then we'd better not think too much about the test scenarios I suggested and simply go with your approach. I bet that handling exclusively non-decreasing arrays makes the implementation _reduceat_vnmabus much more efficient than the ones I proposed, no need to double check that imo.

As for the rest:

  • Do we leave the sample range for empty samples with NaN's? We agreed that (np.inf, -np.inf) would make a lot of sense, but I'm not confident enough that it wouldn't brake other things somewhere in your package (mainly because it's not non-decreasing anymore).
  • I think there's still a problem with __getitem__ as I wrote it in the opening message. Can we leave this for later?
  • I think I will not review the remaining parts of the data structure for now, mainly because as discussed, I'm uncomfortable with the current approach to mean and var for example.

If we agree on this, I can do a quick final push with your reduceat and we should be good to go!

Cheers!

@eliegoudout
Copy link
Contributor Author

Hi,

So I tested _reduceat_vnmabus after fixing the handing of dtype and out. I ran experiments with np.add on a boolean mask with len(mask)=1_000 and respectively len(indices)=30, 300, 3000 and it seemed consistently 30% faster. Since I think handling decreasing indices is not useful anyways, I chose it for this PR.

I undrafted the PR, ready for review/merge @vnmabus :)

@eliegoudout eliegoudout marked this pull request as ready for review February 2, 2024 14:15
Copy link
Member

@vnmabus vnmabus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of minor things before merging.

skfda/representation/irregular.py Outdated Show resolved Hide resolved
skfda/representation/irregular.py Outdated Show resolved Hide resolved
skfda/representation/irregular.py Outdated Show resolved Hide resolved
skfda/representation/irregular.py Outdated Show resolved Hide resolved
skfda/representation/irregular.py Outdated Show resolved Hide resolved
Copy link
Member

@vnmabus vnmabus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we dropped support for Python 3.9, tests now pass. So, I will merge it with the other branch, and if everything goes smoothly, I will merge the other with develop.

Thank @eliegoudout for all the review and programming effort that you have put here!

@vnmabus vnmabus merged commit 21f7bad into GAA-UAM:feature/irregular_operations Mar 11, 2024
9 of 10 checks passed
@vnmabus
Copy link
Member

vnmabus commented Mar 11, 2024

@all-contributors please add @eliegoudout for code, review and ideas.

Copy link
Contributor

@vnmabus

I've put up a pull request to add @eliegoudout! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants