Skip to content

FeaturesCollection

py3r.behaviour.features.features_collection.FeaturesCollection

FeaturesCollection(features_dict: dict[str, Features])

Bases: BaseCollection

Collection of Features objects, keyed by name. Note: type hints refer to Features, but factory methods allow for other classes; these are intended ONLY for subclasses of Features, and this is enforced.

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         _ = shutil.copy(p, d / 'A.csv'); _ = shutil.copy(p, d / 'B.csv')
...     tc = TrackingCollection.from_dlc({'A': str(d/'A.csv'), 'B': str(d/'B.csv')}, fps=30)
>>> fc = FeaturesCollection.from_tracking_collection(tc)
>>> list(sorted(fc.keys()))
['A', 'B']

each instance-attribute

each: Features

each_forcebatch instance-attribute

each_forcebatch: Features

features_dict property

features_dict

loc property

loc

iloc property

iloc

is_grouped property

is_grouped

True if this collection is a grouped view.

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
>>> coll.is_grouped
False

groupby_tags property

groupby_tags

The tag names used to form this grouped view (or None if flat).

group_keys property

group_keys

Keys for the groups in a grouped view. Empty list if not grouped.

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
...     coll['A'].add_tag('group','G1'); coll['B'].add_tag('group','G2')
>>> g = coll.groupby('group')
>>> sorted(g.group_keys)
[('G1',), ('G2',)]

from_tracking_collection classmethod

from_tracking_collection(
    tracking_collection: TrackingCollection,
    feature_cls=Features,
)

Create a FeaturesCollection from a TrackingCollection.

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         _ = shutil.copy(p, d / 'A.csv'); _ = shutil.copy(p, d / 'B.csv')
...     tc = TrackingCollection.from_dlc({'A': str(d/'A.csv'), 'B': str(d/'B.csv')}, fps=30)
>>> fc = FeaturesCollection.from_tracking_collection(tc)
>>> isinstance(fc['A'], Features) and isinstance(fc['B'], Features)
True

concat classmethod

concat(
    collections: list[FeaturesCollection],
    *,
    reindex: Literal[
        "rezero", "follow_previous", "keep_original"
    ] = "follow_previous",
) -> FeaturesCollection

Concatenate multiple FeaturesCollections along the time (frame) axis.

Each collection must have the same handles (keys). For each handle, the corresponding Features objects are concatenated in order. Supports both flat and grouped collections.

Parameters:

Name Type Description Default

collections

list[FeaturesCollection]

List of FeaturesCollection objects to concatenate, in temporal order. All must have matching keys (handles) and feature columns.

required

reindex

('rezero', 'follow_previous', 'keep_original')

How to handle frame indices: - "rezero": Reindex all frames starting from 0 (0, 1, 2, ...). - "follow_previous": Each chunk continues from where the previous ended. If chunk 1 ends at frame n, chunk 2 starts at n+1. - "keep_original": Leave indices untouched; duplicates are allowed.

"follow_previous"

Returns:

Type Description
FeaturesCollection

A new collection with concatenated Features objects for each handle.

Raises:

Type Description
ValueError

If collections is empty, keys don't match, or grouping structure differs.

Notes

For context-dependent features (normalization, embeddings with temporal windows, etc.), consider whether you need to recompute features on concatenated Tracking data rather than concatenating pre-computed features.

Examples:

Concatenate two flat collections:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> import pandas as pd
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> from py3r.behaviour.features.features_collection import FeaturesCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         _ = shutil.copy(p, d / 'A.csv'); _ = shutil.copy(p, d / 'B.csv')
...     tc1 = TrackingCollection.from_dlc({'A': str(d/'A.csv'),
...                                       'B': str(d/'B.csv')}, fps=30)
...     tc2 = TrackingCollection.from_dlc({'A': str(d/'A.csv'),
...                                        'B': str(d/'B.csv')}, fps=30)
>>> fc1 = FeaturesCollection.from_tracking_collection(tc1)
>>> fc2 = FeaturesCollection.from_tracking_collection(tc2)
>>> # Add a feature to all
>>> for f in list(fc1.values()) + list(fc2.values()):
...     s = pd.Series(range(len(f.tracking.data)), index=f.tracking.data.index)
...     f.store(s, 'counter', meta={})
>>> combined = FeaturesCollection.concat([fc1, fc2])
>>> len(combined['A'].data) == len(fc1['A'].data) + len(fc2['A'].data)
True
>>> 'concat' in combined['A'].meta
True

from_list classmethod

from_list(features_list: list[Features])

Create a FeaturesCollection from a list of Features objects, keyed by handle

Examples:

>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking import Tracking
>>> with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...     t1 = Tracking.from_dlc(str(p), handle='A', fps=30)
...     t2 = Tracking.from_dlc(str(p), handle='B', fps=30)
>>> f1, f2 = Features(t1), Features(t2)
>>> fc = FeaturesCollection.from_list([f1, f2])
>>> list(sorted(fc.keys()))
['A', 'B']

cluster_embedding

cluster_embedding(
    embedding_dict: dict[str, list[int]],
    n_clusters: int,
    random_state: int = 0,
    *,
    normalize: bool = False,
    feature_weights: dict[str, float] | None = None,
    lowmem: bool = False,
    decimation_factor: int = 10,
    missing_policy: Literal[
        "drop", "impute_weight"
    ] = "drop",
    auto_normalize: bool = False,
    rescale_factors: dict | None = None,
    custom_scaling: dict[str, dict] | None = None,
)

Perform k-means clustering using the specified embedding.

Returns (BatchResult, centroids, scaling_factors) where scaling_factors is a dict of one float per embedding column — the combined effect of normalisation and feature weights.

Parameters:

Name Type Description Default

normalize

bool

Divide each base feature by its global std before embedding.

False

feature_weights

dict[str, float] | None

Substring → weight mapping, e.g. {"speed": 4.0, "accel": 2.0}. Each key is matched against embedding column names by substring; matched columns are multiplied by the value. Resolved internally via :func:~py3r.behaviour.util.series_utils.build_column_weights. Raises if a rule matches no column (likely typo).

None

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> import pandas as pd
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         _ = shutil.copy(p, d / 'A.csv'); _ = shutil.copy(p, d / 'B.csv')
...     tc = TrackingCollection.from_dlc({'A': str(d/'A.csv'), 'B': str(d/'B.csv')}, fps=30)
>>> fc = FeaturesCollection.from_tracking_collection(tc)
>>> # Create a trivial feature 'counter' in each Features to embed
>>> for f in fc.values():
...     s = pd.Series(range(len(f.tracking.data)), index=f.tracking.data.index)
...     f.store(s, 'counter')
>>> batch, centroids, norm = fc.cluster_embedding(
...     {'counter':[0]}, n_clusters=2, lowmem=True)
>>> isinstance(centroids, pd.DataFrame)
True
>>> batch, centroids, norm = fc.cluster_embedding(
...     {'counter':[0]}, n_clusters=2, lowmem=True,
...     missing_policy='impute_weight')
>>> isinstance(centroids, pd.DataFrame)
True
>>> batch, centroids, norm = fc.cluster_embedding(
...     {'counter':[0]}, n_clusters=2, lowmem=True,
...     missing_policy='drop')
>>> isinstance(centroids, pd.DataFrame)
True

cluster_embedding_stream

cluster_embedding_stream(
    embedding_dict: dict[str, list[int]],
    n_clusters: int,
    random_state: int = 0,
    *,
    normalize: bool = False,
    feature_weights: dict[str, float] | None = None,
    missing_policy: Literal[
        "drop", "impute_weight"
    ] = "drop",
    chunk_size: int = 10000,
    n_epochs: int = 3,
    batch_size: int = 1024,
)

Memory-friendly clustering via streaming MiniBatchKMeans.

Unlike cluster_embedding, this never builds a combined DataFrame. Embeddings are extracted one Features at a time, sliced into fixed-size chunks, and fed to MiniBatchKMeans.partial_fit. Multiple epochs improve convergence; uniform chunk sizes prevent large recordings from dominating centroid updates.

Normalisation is computed on base feature columns (before embedding) so that all time-shifts of the same feature share the same std. The returned scaling_factors is a dict of one float per embedding column — the combined effect of normalisation and feature weights. Multiply raw embedding values by these to reproduce the transform.

Returns (BatchResult, centroids, scaling_factors).

Parameters:

Name Type Description Default

embedding_dict

dict[str, list[int]]

Feature columns and their time shifts for the embedding.

required

n_clusters

int

Number of clusters.

required

random_state

int

Seed for reproducibility.

0

normalize

bool

Divide each base feature by its global std before embedding.

False

feature_weights

dict[str, float] | None

Substring → weight mapping, e.g. {"speed": 4.0}. Resolved internally into per-column weights. Raises if a rule matches no column (likely typo).

None

missing_policy

('drop', 'impute_weight')

How to handle NaN rows.

"drop"

chunk_size

int

Max rows per partial_fit call.

10000

n_epochs

int

Number of full passes over the data.

3

batch_size

int

MiniBatchKMeans internal mini-batch size.

1024

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> import pandas as pd
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         _ = shutil.copy(p, d / 'A.csv'); _ = shutil.copy(p, d / 'B.csv')
...     tc = TrackingCollection.from_dlc({'A': str(d/'A.csv'), 'B': str(d/'B.csv')}, fps=30)
>>> fc = FeaturesCollection.from_tracking_collection(tc)
>>> for f in fc.values():
...     s = pd.Series(range(len(f.tracking.data)), index=f.tracking.data.index)
...     f.store(s, 'counter')
>>> batch, centroids, norm = fc.cluster_embedding_stream(
...     {'counter': [0]}, n_clusters=2)
>>> isinstance(centroids, pd.DataFrame) and centroids.shape[0] == 2
True

cluster_diagnostics

cluster_diagnostics(
    labels_result,
    n_clusters: int | None = None,
    *,
    low: float = 0.05,
    high: float = 0.9,
    verbose: bool = True,
)

Compute diagnostic stats for cluster label assignments.

Parameters:

Name Type Description Default

labels_result

Mapping from handle (or group->handle) to FeaturesResult of integer labels (with NA). Accepts the return shape of cluster_embedding(...)[0] (BatchResult or dict).

required

n_clusters

int | None

Optional number of clusters. If None, inferred from labels (max label + 1).

None

low

float

Lower prevalence threshold: cluster labels with prevalence below this in a recording are counted as low.

0.05

high

float

Upper prevalence threshold: cluster labels with prevalence above this in a recording are counted as high.

0.9

verbose

bool

If True, print a compact summary.

True

Returns:

Type Description
dict with:
  • 'global': {'cluster_prevalence': {label: frac, ...}, 'percent_nan': frac}
  • 'per_recording': DataFrame, rows per recording, cols ['percent_nan', 'num_missing', 'num_low', 'num_high']
  • 'summary': min/median/max for the per_recording columns
  • if grouped: 'per_group': {group_key: {'per_recording': df, 'summary': {...}}}

cross_predict_rms

cross_predict_rms(
    source_embedding: dict[str, list[int]],
    target_embedding: dict[str, list[int]],
    normalize_source: bool = False,
    normalize_pred: dict | str | None = None,
    set1: list | None = None,
    set2: list | None = None,
    predictor_cls=None,
    predictor_kwargs=None,
)

Dev mode only: not available in public release yet.

plot_cross_predict_vs_within staticmethod

plot_cross_predict_vs_within(
    results, from_group, to_group, show=True
)

Dev mode only: not available in public release yet.

plot_cross_predict_results staticmethod

plot_cross_predict_results(
    results,
    within_keys=None,
    between_keys=None,
    plot_type="bar",
    figsize=(10, 6),
    show=True,
)

Dev mode only: not available in public release yet.

dumbbell_plot_cross_predict staticmethod

dumbbell_plot_cross_predict(
    results,
    within_key,
    between_key,
    figsize=(3, 3),
    show=True,
)

Dev mode only: not available in public release yet.

train_knn_regressor

train_knn_regressor(
    *,
    source_embedding: dict[str, list[int]],
    target_embedding: dict[str, list[int]],
    predictor_cls=None,
    predictor_kwargs=None,
    normalize_source: bool = False,
    **kwargs,
)

Dev mode only: not available in public release yet.

predict_knn

predict_knn(
    model,
    source_embedding: dict[str, list[int]],
    target_embedding: dict[str, list[int]],
    rescale_factors: dict = None,
) -> pd.DataFrame

Dev mode only: not available in public release yet.

plot

plot(
    arg=None,
    figsize=(8, 2),
    show: bool = True,
    title: str = None,
)

Plot features for all collections in the MultipleFeaturesCollection. - If arg is a BatchResult or dict: treat as batch result and plot for each collection. - Otherwise: treat as column name(s) or None and plot for each collection. - If title is provided, it will be used as the overall title for the figure.

store

store(
    results_dict,
    name: str = None,
    meta: dict = None,
    overwrite: bool = False,
)

Store FeaturesResult objects returned by batch methods.

  • Flat collection: results_dict is {handle: FeaturesResult}
  • Grouped collection: results_dict is {group_key: {handle: FeaturesResult}}

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         _ = shutil.copy(p, d / 'A.csv'); _ = shutil.copy(p, d / 'B.csv')
...     tc = TrackingCollection.from_dlc({'A': str(d/'A.csv'), 'B': str(d/'B.csv')}, fps=30)
>>> fc = FeaturesCollection.from_tracking_collection(tc)
>>> # Build a simple FeaturesResult dict from distance_between
>>> rd = {h: feat.distance_between('p1','p2') for h, feat in fc.items()}
>>> fc.store(rd, name='d12')
>>> all('d12' in feat.data.columns for feat in fc.values())
True

Returns:

Type Description
str

The resolved stored column name. If auto-naming would resolve to multiple different names across leaves, raises ValueError.

stored_info

stored_info() -> pd.DataFrame

Summarize stored feature columns across the collection's leaf Features objects.

Returns a DataFrame indexed by feature with columns: - attached_to: number of recordings containing the feature - missing_from: number of recordings not containing the feature - type: pandas dtype string for the feature column when consistent, or a list of dtype strings when mixed across recordings.

values

values()

Values iterator (elements or sub-collections).

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
>>> len(list(coll.values())) == 2
True

items

items()

Items iterator (handle, element).

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
>>> sorted([h for h, _ in coll.items()])
['A', 'B']

keys

keys()

Keys iterator (handles or group keys).

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
>>> list(sorted(coll.keys()))
['A', 'B']

merge classmethod

merge(collections, *, copy=False)

Merge multiple collections into a single flat collection containing all leaf elements from each input.

Each input collection is flattened before merging, so grouped inputs are supported. The result is always a new flat collection. Leaves are shared by reference unless copy=True.

Parameters:

Name Type Description Default

collections

list[BaseCollection]

Two or more collections of the same concrete type. Every element across all collections must have a unique handle.

required

copy

bool

If True, each leaf is copied (via its .copy() method) so that the merged collection is fully independent of the originals.

False

Returns:

Type Description
BaseCollection

A new flat collection containing all leaves.

Raises:

Type Description
ValueError

If collections is empty, or if any handles are duplicated.

TypeError

If any input is not an instance of the calling class.

Warns:

Type Description
UserWarning

If the tag key sets differ across input collections (the merged collection will have mixed tag coverage).

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         _ = shutil.copy(p, d / 'A.csv'); _ = shutil.copy(p, d / 'B.csv')
...         _ = shutil.copy(p, d / 'C.csv'); _ = shutil.copy(p, d / 'D.csv')
...     c1 = TrackingCollection.from_dlc({'A': str(d/'A.csv'), 'B': str(d/'B.csv')}, fps=30)
...     c2 = TrackingCollection.from_dlc({'C': str(d/'C.csv'), 'D': str(d/'D.csv')}, fps=30)
>>> merged = TrackingCollection.merge([c1, c2])
>>> sorted(merged.keys())
['A', 'B', 'C', 'D']
>>> len(merged)
4

groupby

groupby(tags)

Group the collection by one or more existing tag names. Returns a grouped view (this same collection type) whose values are sub-collections keyed by a tuple of tag values in the order provided.

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
...     coll['A'].add_tag('group','G1'); coll['B'].add_tag('group','G2')
>>> g = coll.groupby('group')
>>> g.is_grouped
True
>>> sorted(g.group_keys)
[('G1',), ('G2',)]

flatten

flatten()

Flatten a MultipleCollection to a flat Collection. If already flat, return self.

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
...     coll['A'].add_tag('group','G1'); coll['B'].add_tag('group','G1')
...     g = coll.groupby('group')
>>> flat = g.flatten()
>>> flat.is_grouped
False
>>> sorted(flat.keys())
['A', 'B']

get_group

get_group(key)

Get a sub-collection by group key from a grouped view.

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
...     coll['A'].add_tag('group','G1'); coll['B'].add_tag('group','G2')
>>> g = coll.groupby('group')
>>> sub = g.get_group(('G1',))
>>> list(sub.keys())
['A']

regroup

regroup()

Recompute the same grouping using the current tags and the original grouping tag order. If not grouped, returns self.

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
...     coll['A'].add_tag('group','G1'); coll['B'].add_tag('group','G1')
...     g = coll.groupby('group')
...     coll['B'].add_tag('group','G2', overwrite=True)  # change tag
>>> g2 = g.regroup()
>>> sorted(g2.group_keys)
[('G1',), ('G2',)]

tags_info

tags_info(
    *, include_value_counts: bool = False
) -> pd.DataFrame

Summarize tag presence across the collection's leaf objects. Works for flat and grouped collections. If include_value_counts is True, include a column 'value_counts' with a dict of value->count for each tag. Returns a pandas.DataFrame with columns: ['tag', 'attached_to', 'missing_from', 'unique_values', ('value_counts')]

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
...     coll['A'].add_tag('genotype', 'WT')
...     coll['B'].add_tag('timepoint', 'T1')
>>> info = coll.tags_info(include_value_counts=True)
>>> int(info.loc['genotype','attached_to'])
1
>>> int(info.loc['genotype','missing_from'])
1
>>> int(info.loc['genotype','unique_values'])
1
>>> info.loc['genotype','value_counts']
{'WT': 1}
>>> int(info.loc['timepoint','attached_to'])
1

map_leaves

map_leaves(fn)

Apply a function to every leaf element and return a new collection of the same type. Preserves grouping shape and groupby metadata when grouped.

fn: callable(Element) -> ElementLike

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
>>> sub = coll.map_leaves(lambda t: t.loc[0:1])
>>> all(len(t.data) == 2 for t in sub.values())
True

copy

copy()

Creates a copy of the BaseCollection. Raises NotImplementedError if any leaf does not implement copy().

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking import Tracking
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         _ = shutil.copy(p, d / 'A.csv')
...         _ = shutil.copy(p, d / 'B.csv')
...     coll = TrackingCollection.from_folder(
...         str(d), tracking_loader=Tracking.from_dlc, fps=30
...     )
>>> coll_copy = coll.copy()
>>> sorted(coll_copy.keys())
['A', 'B']

save

save(
    dirpath: str,
    *,
    overwrite: bool = False,
    data_format: str = "parquet",
) -> None

Save this collection to a directory. Preserves grouping and delegates to leaf objects' save(dirpath, data_format, overwrite=True).

Examples:

>>> import tempfile, shutil, os
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
...     out = d / 'coll'
...     coll.save(str(out), overwrite=True, data_format='csv')
...     # collection-level manifest at top-level
...     assert os.path.exists(os.path.join(str(out), 'manifest.json'))
...     # element-level manifests under elements/<handle>/
...     el_manifest = os.path.join(str(out), 'elements', 'A', 'manifest.json')
...     assert os.path.exists(el_manifest)

load classmethod

load(dirpath: str)

Load a collection previously saved with save(). Uses the class's _element_type.load to reconstruct leaves.

Examples:

>>> import tempfile, shutil
>>> from pathlib import Path
>>> from py3r.behaviour.util.docdata import data_path
>>> from py3r.behaviour.tracking.tracking_collection import TrackingCollection
>>> with tempfile.TemporaryDirectory() as d:
...     d = Path(d)
...     with data_path('py3r.behaviour.tracking._data', 'dlc_single.csv') as p:
...         a = d / 'A.csv'; b = d / 'B.csv'
...         _ = shutil.copy(p, a); _ = shutil.copy(p, b)
...     coll = TrackingCollection.from_dlc({'A': str(a), 'B': str(b)}, fps=30)
...     out = d / 'coll'
...     coll.save(str(out), overwrite=True, data_format='csv')
...     coll2 = TrackingCollection.load(str(out))
>>> list(sorted(coll2.keys()))
['A', 'B']