# Recipe Reference
A recipe is a YAML file that defines what data to fetch, how to train, and where to write the artifact. One recipe produces one model and one /predict/{name} endpoint.
## Top-level fields
| Field | Type | Required | Description |
|---|---|---|---|
name | string | yes | Endpoint name. Pattern: ^[A-Za-z0-9_-]{1,64}$. Becomes /predict/{name}. |
source | object | yes | Data source config. type field is the discriminator (csv, parquet, bigquery, or any plugin). Validated in two stages: the rest of the recipe is parsed first, then the source dict is dispatched to the plugin's Config class. As a result, errors in source.* surface after errors elsewhere in the recipe; an unknown source.type raises a DataSourceError listing all registered type names. |
schema | object | yes | Column mapping. |
cleansing | object | no | Data quality gates. |
item_metadata | object | no | Metadata joined into predict responses. |
training | object | yes | Algorithm and tuning settings. |
output | object | yes | Artifact path and versioning. |
name is validated at YAML load via the ^[A-Za-z0-9_-]{1,64}$ regex. The Recipe pydantic model uses validate_assignment=True, so any post-construction mutation of name re-runs the validator and raises ValidationError on illegal values. The helper recotem.recipe.models.validate_for_filesystem(name) is exported for callers who construct names programmatically without pydantic.
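As a quick illustration of the name rules (the comment annotations are not recipe syntax):

```yaml
name: news_articles        # valid -> served at /predict/news_articles
# name: news-articles-v2   # valid: hyphens and digits are allowed
# name: news.articles      # invalid: '.' is outside [A-Za-z0-9_-]
# name: ""                 # invalid: 1-64 characters required
```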
## source

### source.type: csv (also parquet)

```yaml
source:
  type: csv
  path: gs://bucket/interactions.csv.gz
  delimiter: ","    # default ","
  encoding: utf-8   # default utf-8
  header: 0         # row index of the header row, default 0
  dtype:            # optional explicit column dtypes
    user_id: str
    item_id: str
```

| Field | Type | Default | Notes |
|---|---|---|---|
path | string | required | Local path, file://, s3://, gs://, az://, abfs(s)://, http://, or https:// URI. HTTP/HTTPS requires a sha256 integrity pin; see Path rules and data-sources/csv. |
delimiter | string | "," | Passed straight to pandas sep=. Multi-character separators trigger pandas' Python parser (slower); a single character uses the C parser. CSV only. |
encoding | string | "utf-8" | Any encoding accepted by pandas. |
header | int | 0 | Row number of the header. |
dtype | map | null | Key = column name, value = pandas dtype string. |
sha256 | string | optional (required when path is http:// or https://) | 64-char lowercase hex; verified against the fetched bytes; mismatch raises DataSourceError |
For Parquet files use type: parquet. Only path and (optional) sha256 are accepted — delimiter, encoding, header, and dtype are not valid keys on a parquet source and will fail recipe load.
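A minimal remote parquet source, combining the parquet restriction with the HTTP(S) integrity pin (URL and digest are placeholders):

```yaml
source:
  type: parquet
  path: https://example.com/interactions.parquet
  sha256: "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"  # placeholder digest
  # delimiter, encoding, header, dtype are CSV-only keys and would fail recipe load here
```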
### source.type: bigquery

```yaml
source:
  type: bigquery
  query: |
    SELECT user_pseudo_id AS user_id, item_id, TIMESTAMP_MICROS(event_timestamp) AS ts
    FROM `proj.analytics_123.events_*`
    WHERE _TABLE_SUFFIX BETWEEN @start_date AND @end_date
  query_parameters:
    start_date: "20260401"
    end_date: "20260507"
  project: my-gcp-project  # optional; falls back to ADC project
```

| Field | Type | Default | Notes |
|---|---|---|---|
query | string | required | SQL. Trusted code — not env-expanded. Use @param for dynamic values. |
query_parameters | map | {} | BigQuery named parameters bound to @name placeholders. |
project | string | null | GCP project ID. Falls back to ADC ambient project. |
Install the extra: pip install "recotem[bigquery]".
Environment variable expansion is never performed inside query or query_parameters. Use @param placeholders for dynamic values so they are bound as query parameters rather than spliced into the SQL string, which forecloses SQL injection.
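A contrast of the two (table name and dates are illustrative): the bound parameter is substituted by BigQuery, while a ${RECOTEM_RECIPE_*} reference inside query would reach BigQuery as literal text:

```yaml
source:
  type: bigquery
  query: |
    SELECT user_id, item_id, ts
    FROM `proj.dataset.events`       -- illustrative table
    WHERE event_date >= @start_date  -- bound named parameter
    -- '${RECOTEM_RECIPE_START}' here would NOT be expanded
  query_parameters:
    start_date: "20260401"
```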
## schema

```yaml
schema:
  user_column: user_id   # required
  item_column: item_id   # required
  time_column: ts        # required when split.scheme is time_user or time_global
```

| Field | Type | Required | Notes |
|---|---|---|---|
user_column | string | yes | Column name in the fetched DataFrame. |
item_column | string | yes | Column name in the fetched DataFrame. |
time_column | string | conditional | Required for time_user and time_global split schemes. |
time_unit | string | conditional | Required when time_column contains integer (numeric) values. One of s, ms, us, ns. Omitting this field for a numeric time column raises a TrainingError (code: time_unit_required) to avoid silent nanosecond interpretation of Unix timestamps. String and datetime columns are unaffected by this field. |
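For example, a schema for a source whose ts column stores integer epoch milliseconds (column names illustrative):

```yaml
schema:
  user_column: user_id
  item_column: item_id
  time_column: ts   # integer Unix timestamps
  time_unit: ms     # required for numeric time columns; omitting it raises time_unit_required
```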
## cleansing

```yaml
cleansing:
  drop_null_ids: true   # default true
  dedup: keep_last      # keep_first | keep_last | none
  min_rows: 1000        # exit 4 with min_data_violation if below
  min_users: 10
  min_items: 10
```

| Field | Type | Default | Notes |
|---|---|---|---|
drop_null_ids | bool | true | Drop rows where user_id or item_id is null. |
dedup | string | keep_last | How to handle duplicate (user, item) pairs. |
min_rows | int | null (no check) | Minimum row count after cleansing. |
min_users | int | null (no check) | Minimum distinct user count. |
min_items | int | null (no check) | Minimum distinct item count. |
Violation of any min_* threshold exits with code 4 and "code": "min_data_violation" in the JSON error line.
dedup values:
| Value | Behaviour |
|---|---|
keep_first | Keep the first occurrence of each (user, item) pair. |
keep_last | Keep the last occurrence of each (user, item) pair by row order in the source DataFrame. |
none | No deduplication. |
keep_first / keep_last use the row order returned by the data source — they do not sort by time_column. If you need time-ordered deduplication, sort in the source query (BigQuery ORDER BY ts) or pre-sort the CSV before training.
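For example, to make keep_last mean "keep each pair's most recent interaction", impose the order in the source itself (illustrative BigQuery query):

```yaml
source:
  type: bigquery
  query: |
    SELECT user_id, item_id, ts
    FROM `proj.dataset.events`  -- illustrative table
    ORDER BY ts                 -- row order now encodes time
cleansing:
  dedup: keep_last  # the surviving row is now the most recent one
```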
## item_metadata

```yaml
item_metadata:
  type: parquet                          # csv | parquet
  path: gs://bucket/items.parquet
  fields: [title, category, image_url]   # non-empty allow-list
  on_field_missing: error                # error | null (default error)
```

| Field | Type | Default | Notes |
|---|---|---|---|
type | string | required | csv or parquet. |
path | string | required | See Path rules. |
fields | list[string] | required | Non-empty. Only listed fields are returned in predict responses. |
on_field_missing | string | error | What to do if a fields entry is absent in the file. error fails the model load (at startup the recipe registers as loaded=false with last_load_error set; on hot-swap the previous model keeps serving and the failure is surfaced via /health and the recotem_artifact_load_failures_total metric); null fills the column with null. |
sha256 | string | optional (required when path is http:// or https://) | 64-char lowercase hex; verified against the fetched bytes; mismatch raises DataSourceError |
item_id_column | string | "item_id" | Column name in the metadata file that holds item identifiers. Override when your metadata file uses a different column name (e.g. product_id). Must be a non-empty, non-whitespace string. |
Server-side field suppression is also available via RECOTEM_METADATA_FIELD_DENY (comma-separated column names), applied as a post-join column drop.
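A variant that maps a differently named ID column and tolerates missing fields (path and column names illustrative):

```yaml
item_metadata:
  type: csv
  path: s3://my-bucket/products.csv
  item_id_column: product_id   # the metadata file keys items by product_id
  fields: [title, category, badge]
  on_field_missing: null       # a missing field is filled with null instead of failing the load
```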
## training

```yaml
training:
  algorithms: [IALS, CosineKNN, TopPop]  # at least one required
  metric: ndcg                           # ndcg | map | recall | hit
  cutoff: 20
  n_trials: 40
  per_algorithm_trials:                  # optional per-algorithm budget
    IALS: 24
    CosineKNN: 12
    TopPop: 4
  per_trial_timeout_seconds: 600
  timeout_seconds: 1800
  parallelism: 1
  storage_path: ""                       # "" = in-memory Optuna; path = SQLite resume
  split:
    scheme: time_user                    # random | time_global | time_user
    heldout_ratio: 0.1
    test_user_ratio: 1.0
    seed: 42
```

| Field | Type | Default | Notes |
|---|---|---|---|
algorithms | list[string] | required | IALS, CosineKNN (alias CosinekNN), TopPop, RP3beta, DenseSLIM, TruncatedSVD, BPRFM. Full irspack class names (e.g. IALSRecommender) are also accepted. Hyperparameter ranges come from each recommender's default_suggest_parameter in irspack — they are not user-tunable from the recipe. |
metric | string | ndcg | One of ndcg, map, recall, hit. |
cutoff | int | 20 | Recommendation list length for evaluation (must be ≥ 1). |
n_trials | int | 40 | Total Optuna trial budget (must be ≥ 1). |
per_algorithm_trials | map | null | Per-algorithm trial overrides. Explicit 0 disables that algorithm (it is dropped from the search entirely). Algorithms in algorithms that are unspecified in this map split whatever budget remains after honouring the explicit values. If the explicit values sum to more than n_trials, positive values are scaled down proportionally (each remains ≥ 1 when at least n_trials slots exist; otherwise the first n_trials non-zero algorithms get one trial each and the remainder are skipped — the total budget never exceeds n_trials). Unknown algorithm keys are rejected at recipe-load time with a ValidationError — each key must be a valid alias or class name present in algorithms. When parallelism > 1, the actual per-algorithm trial count may exceed the configured budget by up to parallelism - 1 trials due to in-flight concurrent trials; a warning is logged on each run where this condition applies. See the worked budget example after this table. |
per_trial_timeout_seconds | int | null | Soft per-trial wall-clock cap. Implemented by running the trial in a worker thread; if it overshoots, Optuna prunes the trial but the underlying thread is daemonised and may continue until it finishes naturally (CPU/memory still spent). The count of threads still running at the time the study finishes is reported as n_orphaned in the train_done structured log event. Operators can monitor this field to detect trials that consistently exceed the timeout and adjust per_trial_timeout_seconds or timeout_seconds accordingly. |
timeout_seconds | int | null | Overall tuning wall-clock cap. |
parallelism | int | 1 | Optuna n_jobs (Python threads, not processes). Algorithms whose hot loop is GIL-bound see little speed-up; native-code learners (IALS, RP3beta) benefit most. |
storage_path | string | "" | Empty = in-memory (no resume). A bare path becomes a SQLite URL (sqlite:///<path>); explicit sqlite://, postgresql://, postgres://, and mysql:// URLs are also accepted. Study name is recotem_<recipe_name>_<run_id> and load_if_exists=True, so a fresh run_id per train invocation always starts a new study (resume requires reusing the same run_id — pass recotem train --run-id <stable>). SQLite over NFS corrupts — keep SQLite databases on a local filesystem. URLs must not embed credentials (postgresql://user:pass@host/db is rejected with SearchError so userinfo cannot leak through SQLAlchemy tracebacks). Provide credentials via PGPASSFILE / ~/.pgpass / SQLAlchemy env vars instead. |
split.scheme | string | random | random, time_global, or time_user. See semantics below. |
split.heldout_ratio | float | 0.1 | Fraction of interactions held out. Must be in (0, 1). |
split.test_user_ratio | float | 1.0 | Fraction of users included in the test split. Must be in (0, 1]. |
split.seed | int | 42 | Random seed for the split (passed to irspack as random_state). |
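To make the per_algorithm_trials budget rules concrete, here is the worked example referenced above; algorithm choices and counts are illustrative:

```yaml
training:
  algorithms: [IALS, CosineKNN, RP3beta, TopPop]
  n_trials: 40
  per_algorithm_trials:
    IALS: 24    # explicit budget
    TopPop: 0   # explicit 0: dropped from the search entirely
  # CosineKNN and RP3beta are unspecified, so they split the remaining
  # 40 - 24 = 16 trials (8 apiece, assuming an even split).
  # Had IALS been set to 80, the explicit values would exceed n_trials
  # and positive values would be scaled down proportionally to fit 40.
```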
Split scheme semantics:

| Scheme | Behaviour |
|---|---|
random | Interactions are held out uniformly at random per user. time_column is unused. |
time_user | For each user, the most recent heldout_ratio fraction of that user's interactions (ranked by time_column) is held out. The cutoff is computed per user. |
time_global | A single global cutoff at the 1 - heldout_ratio quantile of time_column over the whole dataset; every interaction at or after the cutoff is held out, regardless of user. Users with no post-cutoff interactions become train-only. |

time_user and time_global require schema.time_column. Missing time_column with these schemes is a recipe validation error and exits with code 2.
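A worked illustration of the time_global cutoff (counts and ratio are made up):

```yaml
# With 10,000 interactions and heldout_ratio: 0.1, the cutoff is the
# 0.9 quantile of time_column, roughly the 9,000th timestamp in sorted
# order. The ~1,000 interactions at or after it are held out; a user
# whose activity all predates the cutoff contributes training data only.
training:
  split:
    scheme: time_global
    heldout_ratio: 0.1
```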
If a search produces no completed trials, training exits with code 4 and "code": "no_completed_trials". If every completed trial scores exactly 0.0, it exits 4 with "code": "zero_score" (typically caused by a per_trial_timeout_seconds that is too short or a validation set that is too small).
## output

```yaml
output:
  path: ./artifacts/news_articles.recotem
  versioning: append_sha   # always_overwrite | append_sha (default append_sha)
```

| Field | Type | Default | Notes |
|---|---|---|---|
path | string | required | Artifact destination. See Path rules. |
versioning | string | append_sha | How artifacts are written. |
versioning modes:
| Mode | Behaviour |
|---|---|
always_overwrite | Writes directly to <path>. |
append_sha | Writes to <path>.<sha8>.recotem, then atomically updates a pointer file at <path>. The server reads through the pointer. |
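For orientation, a sketch of the on-disk result of each mode for path: ./artifacts/news.recotem (the sha8 fragment is a placeholder):

```yaml
# always_overwrite:
#   ./artifacts/news.recotem                    <- the artifact, rewritten in place
#
# append_sha:
#   ./artifacts/news.recotem.3fa9c2d1.recotem   <- immutable versioned artifact
#   ./artifacts/news.recotem                    <- pointer file, atomically updated;
#                                                  the server reads through it
output:
  path: ./artifacts/news.recotem
  versioning: append_sha
```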
## Path rules
Applies to output.path, source.path, and item_metadata.path.
Path schemes for source.path and item_metadata.path are restricted to an explicit allow-list: bare local path (no scheme prefix), file://, s3://, gs://, az://, abfs://, abfss://, http://, https://. Schemes are explicitly enumerated rather than relying on fsspec's full registry to prevent unvetted handlers from being reachable via recipe content. Chained fsspec protocols (paths containing ::) are also rejected. Schemes http:// and https:// additionally require an sha256 integrity pin on the same config block.
Decompressed-size cap not enforced: `RECOTEM_MAX_DOWNLOAD_BYTES` caps raw I/O bytes only. Compressed CSV and columnar Parquet sources can expand to a multiple of the raw size after decompression; the resulting DataFrame is not size-capped. Run `recotem train` inside a cgroup or Kubernetes Pod with a memory limit to contain the impact. See the security page's "Decompressed-size cap not enforced" section.
output.path is restricted to the following schemes: bare local path (no prefix), file://, s3://, gs://, az://, abfs://, abfss://. Other schemes are rejected: http://, https://, ftp://, and ftps:// because Recotem does not support writing artifacts over those protocols; memory:// because it is process-local and would not survive past the training run.
Embedded credentials (s3://AKIA...:secret@bucket/) are rejected at recipe load on every path field.
Local paths are resolved to absolute. If RECOTEM_ARTIFACT_ROOT is set, output.path must resolve to a path under it after realpath resolution (symlink escapes are rejected).
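Putting the output.path rules together (paths and the artifact root are illustrative):

```yaml
# Accepted:
#   ./artifacts/model.recotem   bare local path, resolved to absolute
#   gs://bucket/model.recotem   allow-listed remote scheme
# Rejected:
#   https://host/model.recotem  artifacts are never written over HTTP(S)
#   s3://AKIA...:secret@bucket/ embedded credentials fail recipe load
# With RECOTEM_ARTIFACT_ROOT=/srv/recotem/artifacts:
#   /srv/recotem/artifacts/m.recotem  ok: under the root after realpath
#   /tmp/escape.recotem               rejected: resolves outside the root
output:
  path: ./artifacts/model.recotem
```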
## Environment variable expansion
Syntax: ${RECOTEM_RECIPE_VAR}. Only variables matching the prefix RECOTEM_RECIPE_* are expanded. Matching is case-insensitive (the upper-cased name is checked against the prefix and blacklist). Additional values can be injected without exporting to the shell environment using recotem train --env-var KEY=VALUE (repeatable). The KEY must still start with RECOTEM_RECIPE_ and pass the blacklist check. Example: recotem train recipe.yaml --env-var RECOTEM_RECIPE_DATE=20260501.
Blacklisted (never expanded regardless of prefix): exact names RECOTEM_SIGNING_KEYS and RECOTEM_API_KEYS; names starting with AWS_, GCP_, GOOGLE_, or AZURE_; and any name containing the substrings SECRET, PASSWORD, PASSWD, TOKEN, KEY, AUTH, BEARER, CRED, or PRIVATE (all comparisons case-insensitive).
The *KEY* substring match is intentionally broad — any variable whose uppercased name contains the substring KEY (no underscore boundary) is rejected. This includes RECOTEM_RECIPE_PARTITION_KEY, RECOTEM_RECIPE_APIKEY, and RECOTEM_RECIPE_KEYBOARD. Use a name that does not contain KEY (e.g. RECOTEM_RECIPE_PARTITION_COLUMN).
Expansion is never performed inside any key named query or query_parameters at any nesting level (not just under source). All other strings — including source.path, output.path, and item_metadata.path — are expanded.
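A minimal end-to-end sketch of expansion (variable value and path are illustrative):

```yaml
source:
  type: csv
  path: gs://bucket/interactions-${RECOTEM_RECIPE_DATE}.csv.gz
# Invoked as:
#   recotem train recipe.yaml --env-var RECOTEM_RECIPE_DATE=20260501
# path expands to gs://bucket/interactions-20260501.csv.gz at YAML load.
```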
### Prefix vs. blacklist interaction
The RECOTEM_RECIPE_ prefix check is applied to the full variable name. Only the tail portion (after RECOTEM_RECIPE_) is subject to the blacklist substring rules. For example, RECOTEM_RECIPE_GCP_PROJECT satisfies the prefix check; it is not blocked by the GCP_* blacklist-prefix rule because that rule matches only names whose uppercased form starts with GCP_ (e.g. GCP_SOMETHING). The variable RECOTEM_RECIPE_GCP_PROJECT starts with RECOTEM_RECIPE_, not GCP_. The examples/ga4-bigquery/ recipe uses this pattern legitimately. However, it would be blocked if its name contained KEY, TOKEN, SECRET, or any other blacklisted substring (case-insensitive).
Expansion is single-pass and runs once at YAML load time. There is no escape syntax: a literal ${...} cannot be preserved in the YAML, because a name matching the prefix is expanded and a name failing the prefix check raises an error. There is no default-value syntax: ${VAR:-default} is not supported and would be treated as an attempt to expand the literal name VAR:-default. Substituted values are not re-scanned for further ${...} references.
A missing, malformed, or blacklisted variable produces a RecipeError (exit 2). The error message names the variable but never includes its value.
## Loading a directory of recipes
recotem serve --recipes <dir> and load_recipes_directory() enumerate only direct *.yaml children of <dir> (non-recursive). Subdirectories are ignored. Each recipe file must remain inside the directory after realpath resolution — symlinks pointing outside are rejected.
Duplicate name field handling differs by call site:

| Call site | Behaviour |
|---|---|
recotem train / load_recipes_directory() (strict) | A duplicate name across any two files raises RecipeError immediately and aborts the entire load. |
recotem serve / load_recipes_directory_lenient() (lenient) | The first file loaded wins; any subsequent file with the same name is skipped and a recipe_duplicate_name_skipped warning is emitted to the structured log. The serve process continues with the surviving recipe. |
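For example, given a directory shaped like this (file names illustrative), the two files that both declare name: news collide:

```yaml
# recipes/
#   a.yaml      # name: news  <- loaded
#   b.yaml      # name: news  <- strict: RecipeError aborts the whole load
#               #                lenient: skipped with recipe_duplicate_name_skipped
#   sub/c.yaml  # ignored: enumeration is non-recursive
#   d.yml       # ignored: only direct *.yaml children are considered
```

Which duplicate survives under lenient loading depends on load order; here a.yaml is assumed to load first.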
## Full example

```yaml
name: news_articles
source:
  type: bigquery
  query: |
    SELECT user_pseudo_id AS user_id,
           (SELECT value.int_value FROM UNNEST(event_params) WHERE key='article_id') AS item_id,
           TIMESTAMP_MICROS(event_timestamp) AS ts
    FROM `proj.analytics_123.events_*`
    WHERE _TABLE_SUFFIX BETWEEN @start_date AND @end_date
      AND event_name = 'select_content'
  query_parameters:
    start_date: "20260401"
    end_date: "20260507"
  project: my-gcp-project
schema:
  user_column: user_id
  item_column: item_id
  time_column: ts
cleansing:
  drop_null_ids: true
  dedup: keep_last
  min_rows: 5000
  min_users: 100
  min_items: 50
item_metadata:
  type: parquet
  path: gs://my-bucket/items.parquet
  fields: [title, category]
  on_field_missing: error
training:
  algorithms: [IALS, CosineKNN, TopPop]
  metric: ndcg
  cutoff: 20
  n_trials: 40
  timeout_seconds: 1800
  split:
    scheme: time_user
    heldout_ratio: 0.1
    seed: 42
output:
  path: gs://my-bucket/artifacts/news_articles.recotem
  versioning: append_sha
```