# RoPE

Rotary Positional Embedding ([Su et al., 2021](https://arxiv.org/abs/2104.09864)) encodes both absolute position and relative distance between tokens by rotating pairs of elements in the last dimension. The rotation angle is derived from the token's absolute position. Before the rotation matrix is applied, elements are paired either by splitting the embedding in half or by pairing its alternate elements (`interleaved=True`).

Three ways to drive the rotation, in priority order:

1. Pass precomputed `cos` and `sin` directly.
2. Pass `position_ids` (and optionally `freqs`); the op constructs `cos`/`sin` internally.
3. Pass nothing extra; the op derives `position_ids` from `offset` + `scale` and `freqs` from `base`.

Use `position_ids` when your model computes position indices externally (custom sequence packing or variable-length inputs). Use `cos`/`sin` when you have precomputed frequency tensors. Use `offset` for KV-cache decoding steps where only a single new token position is needed.

## Constructor

```python
RoPE(scale=1.0, base=1e4, dims=None, interleaved=False)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `scale` | `float` | `1.0` | Frequency scaling factor applied to positions. |
| `base` | `float` | `1e4` | Base for the geometric frequency sequence. |
| `dims` | `int \| None` | `None` | Number of dimensions to rotate. If `None`, rotates all dimensions. |
| `interleaved` | `bool` | `False` | If `True`, uses interleaved (Hugging Face-style) rotation; if `False`, uses split-half rotation. |

## Forward

```python
def forward(
    self,
    input: torch.Tensor,
    cos: torch.Tensor | None = None,
    sin: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    freqs: torch.Tensor | None = None,
    offset: torch.Tensor | None = None,
) -> torch.Tensor
```

| Argument | Description |
|---|---|
| `input` | Tensor of rank ≥ 3, shape `[batch, ..., seq_len, embed]`. |
| `cos`, `sin` | Precomputed cosines / sines, broadcastable to `[batch, ..., seq_len, embed/2]`. If both are provided, all other position-construction arguments are ignored. |
| `position_ids` | Position indices, broadcastable to `[batch, ..., seq_len]`. Ignored if `cos` and `sin` are provided. |
| `freqs` | Custom angular frequencies of shape `[embed/2]`. Useful for advanced variants like SuScaledRoPE. Ignored if `cos` and `sin` are provided. |
| `offset` | Starting position for the sequence. Tensor of shape `[]` / `[1]` / `[batch]` / `[batch, 1]`, or an `int` attribute. If a tensor is provided alongside the int attribute, the tensor wins. Default: `0`. |

When `cos`/`sin` are not provided, position ids are constructed as `position_ids = (offset + arange(seq_len)) * scale`, and frequencies are `freqs[i] = 1 / base ** (i / (embed/2))`.

## Optional input resolution order

1. If `cos` and `sin` are both provided, use them directly.
2. Else, build `cos`/`sin` from `position_ids` and `freqs`:
   - `position_ids`: use the argument if provided; otherwise construct from `offset` and `scale`.
   - `freqs`: use the argument if provided; otherwise construct from `base`.

## Input name variants

| Arguments provided | `input_names` in IR |
|---|---|
| `input` only | `["input"]` |
| `input`, `cos`, `sin` | `["input", "cos", "sin"]` |
| `input`, `freqs` | `["input", "freqs"]` |
| `input`, `offset` | `["input", "offset"]` |
| `input`, `position_ids` | `["input", "position_ids"]` |

## ExternalizeSpec

```python
ExternalizeSpec(
    target_class=RoPE,
    composite_op_name="rope",
    composite_attrs=["scale", "base", "dims", "interleaved"],
)
```

## Partial rotation (`dims`)

When `dims` is set to a positive even integer smaller than `embed`, only the first `dims` features are rotated; the rest pass through unchanged:

```python
y_partial = rope(input[..., :dims])
output = torch.cat([y_partial, input[..., dims:]], dim=-1)
```

When `dims` is `None` or `dims >= embed`, the full last dimension is rotated.

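
The default path (no `cos`/`sin`, `position_ids`, or `freqs` supplied) can be illustrated with a minimal pure-Python sketch. It is not the op's implementation: `rope_reference` is a hypothetical helper that works on plain lists for a single sequence, and it makes three assumptions. It uses the standard rotation sign convention from the RoFormer paper. It takes interleaved pairing to mean elements `(2i, 2i+1)`. For partial rotation it computes `freqs` over the rotated slice of width `dims`, mirroring the `rope(input[..., :dims])` snippet above.

```python
import math


def rope_reference(x, offset=0, scale=1.0, base=1e4, dims=None, interleaved=False):
    """Hypothetical reference: rotate one sequence given as seq_len rows of embed floats."""
    embed = len(x[0])
    # Partial rotation: only the first d features are rotated (d assumed even).
    d = embed if dims is None or dims >= embed else dims
    half = d // 2
    # freqs[i] = 1 / base ** (i / (d / 2)), computed over the rotated slice.
    freqs = [1.0 / base ** (i / half) for i in range(half)]
    out = []
    for t, row in enumerate(x):
        pos = (offset + t) * scale  # position_ids = (offset + arange(seq_len)) * scale
        new_row = list(row)  # features at index >= d pass through unchanged
        for i in range(half):
            # Assumed pairing: (2i, 2i+1) if interleaved, else (i, i + d/2).
            a, b = (2 * i, 2 * i + 1) if interleaved else (i, i + half)
            c, s = math.cos(pos * freqs[i]), math.sin(pos * freqs[i])
            new_row[a] = row[a] * c - row[b] * s
            new_row[b] = row[a] * s + row[b] * c
        out.append(new_row)
    return out


# KV-cache style step: rotating only the newest token with offset=5
# should reproduce row 5 of a full-sequence pass.
seq = [[float(t + j) for j in range(4)] for t in range(6)]
full = rope_reference(seq)
step = rope_reference([seq[5]], offset=5)
```

Under these assumptions, `step[0]` and `full[5]` are computed from the same position value, which is why a decoding step only needs `offset` rather than a full `position_ids` tensor.
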
## Data types

| Tensor | Allowed dtypes |
|---|---|
| `input`, `cos`, `sin`, `freqs`, `output` | `fp32`, `fp16`, `bf16` |
| `position_ids`, `offset` | integer |

The dtypes of `input`, `cos`, and `sin` must be promotable to a common dtype; the output uses that promoted dtype.

## Reference

- [Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)