
# Recipe Reference

A recipe is a YAML file that defines what data to fetch, how to train, and where to write the artifact. One recipe produces one model and one `/predict/{name}` endpoint.

## Top-level fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `name` | string | yes | Endpoint name. Pattern: `^[A-Za-z0-9_-]{1,64}$`. Becomes `/predict/{name}`. |
| `source` | object | yes | Data source config. The `type` field is the discriminator (`csv`, `parquet`, `bigquery`, or any plugin). Validated in two stages: the rest of the recipe is parsed first, then the source dict is dispatched to the plugin's `Config` class. As a result, errors in `source.*` surface after errors elsewhere in the recipe; an unknown `source.type` raises a `DataSourceError` listing all registered type names. |
| `schema` | object | yes | Column mapping. |
| `cleansing` | object | no | Data quality gates. |
| `item_metadata` | object | no | Metadata joined into predict responses. |
| `training` | object | yes | Algorithm and tuning settings. |
| `output` | object | yes | Artifact path and versioning. |

`name` is validated at YAML load via the `^[A-Za-z0-9_-]{1,64}$` regex. The `Recipe` pydantic model uses `validate_assignment=True`, so any post-construction mutation of `name` re-runs the validator and raises `ValidationError` on illegal values. The helper `recotem.recipe.models.validate_for_filesystem(name)` is exported for callers who construct names programmatically without pydantic.
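The documented pattern can be checked with the standard `re` module — a minimal stand-in for the exported helper, not recotem's own code:

```python
import re

# The documented endpoint-name pattern: 1-64 chars of [A-Za-z0-9_-].
NAME_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")

def is_valid_name(name: str) -> bool:
    """Return True when `name` is usable as /predict/{name}."""
    return NAME_RE.fullmatch(name) is not None

print(is_valid_name("news_articles"))   # True
print(is_valid_name("news/articles"))   # False: '/' is not allowed
print(is_valid_name("x" * 65))          # False: too long
```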


## source

### source.type: csv (also parquet)

```yaml
source:
  type: csv
  path: gs://bucket/interactions.csv.gz
  delimiter: ","         # default ","
  encoding: utf-8        # default utf-8
  header: 0              # row index of the header row, default 0
  dtype:                 # optional explicit column dtypes
    user_id: str
    item_id: str
```
| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `path` | string | required | Local path, `file://`, `s3://`, `gs://`, `az://`, `abfs(s)://`, `http://`, or `https://` URI. HTTP/HTTPS requires a `sha256` integrity pin; see Path rules and data-sources/csv. |
| `delimiter` | string | `","` | Passed straight to pandas `sep=`. Multi-character separators trigger pandas' Python parser (slower); a single character uses the C parser. CSV only. |
| `encoding` | string | `"utf-8"` | Any encoding accepted by pandas. |
| `header` | int | `0` | Row number of the header. |
| `dtype` | map | null | Key = column name, value = pandas dtype string. |
| `sha256` | string | optional (required when `path` is `http://` or `https://`) | 64-char lowercase hex; verified against the fetched bytes; mismatch raises `DataSourceError`. |

For Parquet files use `type: parquet`. Only `path` and (optional) `sha256` are accepted — `delimiter`, `encoding`, `header`, and `dtype` are not valid keys on a parquet source and will fail recipe load.
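The integrity check described for `sha256` can be sketched with `hashlib`; recotem raises `DataSourceError` on mismatch, while this stand-alone version raises `ValueError`:

```python
import hashlib

def verify_sha256(data: bytes, expected: str) -> None:
    """Raise ValueError when the fetched bytes do not match the pinned
    digest. (Recotem itself raises DataSourceError; ValueError stands
    in here.)"""
    actual = hashlib.sha256(data).hexdigest()  # 64-char lowercase hex
    if actual != expected.lower():
        raise ValueError(f"sha256 mismatch: expected {expected}, got {actual}")

payload = b"hello"
verify_sha256(
    payload,
    "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824",
)  # passes silently
```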

### source.type: bigquery

```yaml
source:
  type: bigquery
  query: |
    SELECT user_pseudo_id AS user_id, item_id, TIMESTAMP_MICROS(event_timestamp) AS ts
    FROM `proj.analytics_123.events_*`
    WHERE _TABLE_SUFFIX BETWEEN @start_date AND @end_date
  query_parameters:
    start_date: "20260401"
    end_date: "20260507"
  project: my-gcp-project   # optional; falls back to ADC project
```
| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `query` | string | required | SQL. Trusted code — not env-expanded. Use `@param` for dynamic values. |
| `query_parameters` | map | `{}` | BigQuery named parameters bound to `@name` placeholders. |
| `project` | string | null | GCP project ID. Falls back to ADC ambient project. |

Install the extra: `pip install "recotem[bigquery]"`.

Environment variable expansion is never performed inside `query` or `query_parameters`. Use `@param` placeholders instead of string interpolation to prevent SQL injection.


## schema

```yaml
schema:
  user_column: user_id    # required
  item_column: item_id    # required
  time_column: ts         # required when split.scheme is time_user or time_global
```
| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `user_column` | string | yes | Column name in the fetched DataFrame. |
| `item_column` | string | yes | Column name in the fetched DataFrame. |
| `time_column` | string | conditional | Required for `time_user` and `time_global` split schemes. |
| `time_unit` | string | conditional | Required when `time_column` contains integer (numeric) values. One of `s`, `ms`, `us`, `ns`. Omitting this field for a numeric time column raises a `TrainingError` (code: `time_unit_required`) to avoid silent nanosecond interpretation of Unix timestamps. String and datetime columns are unaffected by this field. |
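The hazard that `time_unit` guards against can be seen with plain stdlib `datetime` arithmetic (an illustration, not recotem code):

```python
from datetime import datetime, timezone

raw = 1_700_000_000  # a Unix timestamp in *seconds*

# Interpreted with the correct unit, the value is a recent date...
as_seconds = datetime.fromtimestamp(raw, tz=timezone.utc)
print(as_seconds.year)   # 2023

# ...but silently treating the same integer as nanoseconds puts it
# less than two seconds after the epoch.
as_nanos = datetime.fromtimestamp(raw / 1_000_000_000, tz=timezone.utc)
print(as_nanos.year)     # 1970
```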

## cleansing

```yaml
cleansing:
  drop_null_ids: true        # default true
  dedup: keep_last           # keep_first | keep_last | none
  min_rows: 1000             # exit 4 with min_data_violation if below
  min_users: 10
  min_items: 10
```
| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `drop_null_ids` | bool | `true` | Drop rows where `user_id` or `item_id` is null. |
| `dedup` | string | `keep_last` | How to handle duplicate (user, item) pairs. |
| `min_rows` | int | null (no check) | Minimum row count after cleansing. |
| `min_users` | int | null (no check) | Minimum distinct user count. |
| `min_items` | int | null (no check) | Minimum distinct item count. |

Violation of any `min_*` threshold exits with code 4 and `"code": "min_data_violation"` in the JSON error line.
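The threshold logic amounts to three counts over the cleansed rows; a minimal sketch (not recotem's implementation, which exits the process rather than returning a value):

```python
def check_minimums(rows, min_rows=None, min_users=None, min_items=None):
    """Return the name of the first violated threshold, or None.
    `rows` is an iterable of (user_id, item_id) pairs after cleansing."""
    rows = list(rows)
    users = {u for u, _ in rows}
    items = {i for _, i in rows}
    if min_rows is not None and len(rows) < min_rows:
        return "min_rows"
    if min_users is not None and len(users) < min_users:
        return "min_users"
    if min_items is not None and len(items) < min_items:
        return "min_items"
    return None

print(check_minimums([("u1", "i1"), ("u1", "i2")], min_users=10))  # min_users
```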

`dedup` values:

| Value | Behaviour |
| --- | --- |
| `keep_first` | Keep the first occurrence of each (user, item) pair. |
| `keep_last` | Keep the last occurrence of each (user, item) pair by row order in the source DataFrame. |
| `none` | No deduplication. |

`keep_first` / `keep_last` use the row order returned by the data source — they do not sort by `time_column`. If you need time-ordered deduplication, sort in the source query (BigQuery `ORDER BY ts`) or pre-sort the CSV before training.
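The order-dependent behaviour can be sketched with a plain dict, which keys by (user, item) and preserves source row order (a stand-in for the pandas-based implementation):

```python
def dedup(rows, mode="keep_last"):
    """Deduplicate (user, item, payload) rows by (user, item) key,
    honouring source row order — no sorting by time is performed."""
    if mode == "none":
        return list(rows)
    seen = {}
    for row in rows:
        key = (row[0], row[1])
        if mode == "keep_first":
            seen.setdefault(key, row)   # first occurrence wins
        else:
            seen[key] = row             # keep_last: later rows overwrite
    return list(seen.values())

rows = [("u1", "i1", "morning"), ("u1", "i1", "evening")]
print(dedup(rows, "keep_first"))  # [('u1', 'i1', 'morning')]
print(dedup(rows, "keep_last"))   # [('u1', 'i1', 'evening')]
```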


## item_metadata

```yaml
item_metadata:
  type: parquet            # csv | parquet
  path: gs://bucket/items.parquet
  fields: [title, category, image_url]   # non-empty allow-list
  on_field_missing: error  # error | null (default error)
```
| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `type` | string | required | `csv` or `parquet`. |
| `path` | string | required | See Path rules. |
| `fields` | list[string] | required | Non-empty. Only listed fields are returned in predict responses. |
| `on_field_missing` | string | `error` | What to do if a `fields` entry is absent in the file. `error` fails the model load (at startup the recipe registers as `loaded=false` with `last_load_error` set; on hot-swap the previous model keeps serving and the failure is surfaced via `/health` and the `recotem_artifact_load_failures_total` metric); `null` fills the column with null. |
| `sha256` | string | optional (required when `path` is `http://` or `https://`) | 64-char lowercase hex; verified against the fetched bytes; mismatch raises `DataSourceError`. |
| `item_id_column` | string | `"item_id"` | Column name in the metadata file that holds item identifiers. Override when your metadata file uses a different column name (e.g. `product_id`). Must be a non-empty, non-whitespace string. |

Server-side field suppression is also available via `RECOTEM_METADATA_FIELD_DENY` (comma-separated column names), applied as a post-join column drop.


## training

```yaml
training:
  algorithms: [IALS, CosineKNN, TopPop]    # at least one required
  metric: ndcg                              # ndcg | map | recall | hit
  cutoff: 20
  n_trials: 40
  per_algorithm_trials:                     # optional per-algorithm budget
    IALS: 24
    CosineKNN: 12
    TopPop: 4
  per_trial_timeout_seconds: 600
  timeout_seconds: 1800
  parallelism: 1
  storage_path: ""                          # "" = in-memory Optuna; path = SQLite resume
  split:
    scheme: time_user                       # random | time_global | time_user
    heldout_ratio: 0.1
    test_user_ratio: 1.0
    seed: 42
```
| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `algorithms` | list[string] | required | `IALS`, `CosineKNN` (alias `CosinekNN`), `TopPop`, `RP3beta`, `DenseSLIM`, `TruncatedSVD`, `BPRFM`. Full irspack class names (e.g. `IALSRecommender`) are also accepted. Hyperparameter ranges come from each recommender's `default_suggest_parameter` in irspack — they are not user-tunable from the recipe. |
| `metric` | string | `ndcg` | One of `ndcg`, `map`, `recall`, `hit`. |
| `cutoff` | int | `20` | Recommendation list length for evaluation (must be ≥ 1). |
| `n_trials` | int | `40` | Total Optuna trial budget (must be ≥ 1). |
| `per_algorithm_trials` | map | null | Per-algorithm trial overrides. Explicit `0` disables that algorithm (it is dropped from the search entirely). Algorithms in `algorithms` that are unspecified in this map split whatever budget remains after honouring the explicit values. If the explicit values sum to more than `n_trials`, positive values are scaled down proportionally (each remains ≥ 1 when at least `n_trials` slots exist; otherwise the first `n_trials` non-zero classes get one trial each and the remainder are skipped — the total budget never exceeds `n_trials`). Unknown algorithm keys are rejected at recipe-load time with a `ValidationError` — each key must be a valid alias or class name present in `algorithms`. When `parallelism > 1`, the actual per-algorithm trial count may exceed the configured budget by up to `parallelism - 1` trials due to in-flight concurrent trials; a warning is logged on each run where this condition applies. |
| `per_trial_timeout_seconds` | int | null | Soft per-trial wall-clock cap. Implemented by running the trial in a worker thread; if it overshoots, Optuna prunes the trial but the underlying thread is daemonised and may continue until it finishes naturally (CPU/memory still spent). The count of threads still running at the time the study finishes is reported as `n_orphaned` in the `train_done` structured log event. Operators can monitor this field to detect trials that consistently exceed the timeout and adjust `per_trial_timeout_seconds` or `timeout_seconds` accordingly. |
| `timeout_seconds` | int | null | Overall tuning wall-clock cap. |
| `parallelism` | int | `1` | Optuna `n_jobs` (Python threads, not processes). Algorithms whose hot loop is GIL-bound see little speed-up; native-code learners (IALS, RP3beta) benefit most. |
| `storage_path` | string | `""` | Empty = in-memory (no resume). A bare path becomes a SQLite URL (`sqlite:///<path>`); explicit `sqlite://`, `postgresql://`, `postgres://`, and `mysql://` URLs are also accepted. Study name is `recotem_<recipe_name>_<run_id>` and `load_if_exists=True`, so a fresh `run_id` per train invocation always starts a new study (resume requires reusing the same `run_id` — pass `recotem train --run-id <stable>`). SQLite over NFS corrupts — keep SQLite databases on a local filesystem. URLs must not embed credentials (`postgresql://user:pass@host/db` is rejected with `SearchError` so userinfo cannot leak through SQLAlchemy tracebacks). Provide credentials via `PGPASSFILE` / `~/.pgpass` / SQLAlchemy env vars instead. |
| `split.scheme` | string | `random` | `random`, `time_global`, or `time_user`. See semantics below. |
| `split.heldout_ratio` | float | `0.1` | Fraction of interactions held out. Must be in (0, 1). |
| `split.test_user_ratio` | float | `1.0` | Fraction of users included in the test split. Must be in (0, 1]. |
| `split.seed` | int | `42` | Random seed for the split (passed to irspack as `random_state`). |
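The budget split for `per_algorithm_trials` can be sketched as a plain function. This simplified version (not recotem's implementation) honours explicit values, drops explicit zeros, and divides the leftover budget evenly among unspecified algorithms; the documented proportional down-scaling for over-subscribed explicit values is omitted:

```python
def allocate_trials(algorithms, n_trials, per_algorithm=None):
    """Return {algorithm: trial_count}. Explicit 0 drops the algorithm;
    unspecified algorithms share the remaining budget evenly."""
    per_algorithm = per_algorithm or {}
    explicit = {a: per_algorithm[a] for a in algorithms if a in per_algorithm}
    rest = [a for a in algorithms if a not in per_algorithm]
    remaining = n_trials - sum(explicit.values())
    plan = {a: n for a, n in explicit.items() if n > 0}
    for i, a in enumerate(rest):
        # even split; the first (remaining % len(rest)) algorithms get one extra
        plan[a] = remaining // len(rest) + (1 if i < remaining % len(rest) else 0)
    return plan

print(allocate_trials(["IALS", "CosineKNN", "TopPop"], 40,
                      {"IALS": 24, "CosineKNN": 12, "TopPop": 4}))
# {'IALS': 24, 'CosineKNN': 12, 'TopPop': 4}
```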

Split scheme semantics:

- `random` — interactions are held out uniformly at random per user. `time_column` is unused.
- `time_user` — for each user, the most recent `heldout_ratio` of that user's interactions (ranked by `time_column`) are held out. Cutoff is computed per user.
- `time_global` — a single global cutoff at the `1 - heldout_ratio` quantile of `time_column` over the whole dataset; every interaction at or after the cutoff is held out, regardless of user. Users with no post-cutoff interactions become train-only.
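The `time_global` scheme can be sketched in a few lines over (user, item, timestamp) tuples — an illustration of the documented semantics, not the irspack-backed implementation:

```python
def time_global_split(interactions, heldout_ratio=0.1):
    """One global cutoff at the (1 - heldout_ratio) quantile of the
    timestamps; everything at or after the cutoff is held out,
    regardless of user."""
    ts = sorted(t for _, _, t in interactions)
    cutoff = ts[int(len(ts) * (1 - heldout_ratio))]
    train = [r for r in interactions if r[2] < cutoff]
    heldout = [r for r in interactions if r[2] >= cutoff]
    return train, heldout, cutoff

data = [("u", f"i{k}", k) for k in range(1, 11)]  # ten interactions, ts 1..10
train, heldout, cutoff = time_global_split(data, heldout_ratio=0.1)
print(cutoff, len(train), len(heldout))  # 10 9 1
```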

`time_user` and `time_global` require `schema.time_column`. Missing `time_column` with these schemes is a recipe validation error and exits with code 2.

If a search produces no completed trials, training exits with code 4 and `"code": "no_completed_trials"`. If every completed trial scores exactly 0.0, exit 4 with `"code": "zero_score"` (typically caused by too short a `per_trial_timeout_seconds` or a too-small validation set).


## output

```yaml
output:
  path: ./artifacts/news_articles.recotem
  versioning: append_sha     # always_overwrite | append_sha (default append_sha)
```
| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `path` | string | required | Artifact destination. See Path rules. |
| `versioning` | string | `append_sha` | How artifacts are written. |

versioning modes:

| Mode | Behaviour |
| --- | --- |
| `always_overwrite` | Writes directly to `<path>`. |
| `append_sha` | Writes to `<path>.<sha8>.recotem`, then atomically updates a pointer file at `<path>`. The server reads through the pointer. |
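A minimal sketch of the `append_sha` flow for local paths (assumed mechanics — the real artifact format and remote-path handling are out of scope): write the versioned file, then swing the pointer with `os.replace`, which is atomic within one filesystem:

```python
import hashlib
import os
import tempfile

def write_append_sha(path: str, artifact: bytes) -> str:
    """Write artifact to <path>.<sha8>.recotem, then atomically point
    <path> at it. Returns the versioned filename."""
    sha8 = hashlib.sha256(artifact).hexdigest()[:8]
    versioned = f"{path}.{sha8}.recotem"
    with open(versioned, "wb") as f:
        f.write(artifact)
    # Atomic pointer update: write a temp file in the same directory,
    # then os.replace() it over <path> so readers never see a partial write.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(versioned)
    os.replace(tmp, path)
    return versioned

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "model.recotem")
    versioned = write_append_sha(target, b"artifact-bytes")
    print(open(target).read() == versioned)  # True
```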

## Path rules

Applies to `output.path`, `source.path`, and `item_metadata.path`.

Path schemes for `source.path` and `item_metadata.path` are restricted to an explicit allow-list: bare local path (no scheme prefix), `file://`, `s3://`, `gs://`, `az://`, `abfs://`, `abfss://`, `http://`, `https://`. Schemes are explicitly enumerated rather than relying on fsspec's full registry, to prevent unvetted handlers from being reachable via recipe content. Chained fsspec protocols (paths containing `::`) are also rejected. The `http://` and `https://` schemes additionally require a `sha256` integrity pin on the same config block.

**Decompressed-size cap not enforced.** `RECOTEM_MAX_DOWNLOAD_BYTES` caps raw I/O bytes only. Compressed CSV and columnar Parquet sources can expand to a multiple of the raw size after decompression; the resulting DataFrame is not size-capped. Run `recotem train` inside a cgroup or Kubernetes Pod with a memory limit to contain the impact. See security — Decompressed-size cap not enforced.

`output.path` is restricted to the following schemes: bare local path (no prefix), `file://`, `s3://`, `gs://`, `az://`, `abfs://`, `abfss://`. Other schemes are rejected: `http://`, `https://`, `ftp://`, and `ftps://` because Recotem does not support writing artifacts over those protocols; `memory://` because it is process-local and would not survive past the training run.

Embedded credentials (`s3://AKIA...:secret@bucket/`) are rejected at recipe load on every path field.

Local paths are resolved to absolute. If `RECOTEM_ARTIFACT_ROOT` is set, `output.path` must resolve to a path under it after realpath resolution (symlink escapes are rejected).
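The containment check can be sketched with `os.path.realpath` and `os.path.commonpath` — an illustration of the documented rule, not recotem's code:

```python
import os
import tempfile

def check_under_root(path: str, root: str) -> str:
    """Reject paths that escape `root` after symlink/.. resolution,
    mimicking the documented RECOTEM_ARTIFACT_ROOT behaviour."""
    real_root = os.path.realpath(root)
    real_path = os.path.realpath(os.path.abspath(path))
    if os.path.commonpath([real_root, real_path]) != real_root:
        raise ValueError("output.path escapes RECOTEM_ARTIFACT_ROOT")
    return real_path

with tempfile.TemporaryDirectory() as root:
    inside = check_under_root(os.path.join(root, "m.recotem"), root)
    print(inside.startswith(os.path.realpath(root)))  # True
```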


## Environment variable expansion

Syntax: `${RECOTEM_RECIPE_VAR}`. Only variables matching the prefix `RECOTEM_RECIPE_*` are expanded. Matching is case-insensitive (the upper-cased name is checked against the prefix and blacklist). Additional values can be injected without exporting to the shell environment using `recotem train --env-var KEY=VALUE` (repeatable). The `KEY` must still start with `RECOTEM_RECIPE_` and pass the blacklist check. Example: `recotem train recipe.yaml --env-var RECOTEM_RECIPE_DATE=20260501`.

Blacklisted (never expanded regardless of prefix): exact names `RECOTEM_SIGNING_KEYS` and `RECOTEM_API_KEYS`; names starting with `AWS_`, `GCP_`, `GOOGLE_`, or `AZURE_`; and any name containing the substrings `SECRET`, `PASSWORD`, `PASSWD`, `TOKEN`, `KEY`, `AUTH`, `BEARER`, `CRED`, or `PRIVATE` (all comparisons case-insensitive).

The `*KEY*` substring match is intentionally broad — any variable whose uppercased name contains the substring `KEY` (no underscore boundary) is rejected. This includes `RECOTEM_RECIPE_PARTITION_KEY`, `RECOTEM_RECIPE_APIKEY`, and `RECOTEM_RECIPE_KEYBOARD`. Use a name that does not contain `KEY` (e.g. `RECOTEM_RECIPE_PARTITION_COLUMN`).

Expansion is never performed inside any key named `query` or `query_parameters` at any nesting level (not just under `source`). All other strings — including `source.path`, `output.path`, and `item_metadata.path` — are expanded.

### Prefix vs. blacklist interaction

The `RECOTEM_RECIPE_` prefix check is applied to the full variable name. Only the tail portion (after `RECOTEM_RECIPE_`) is subject to the blacklist substring rules. For example, `RECOTEM_RECIPE_GCP_PROJECT` satisfies the prefix check; it is not blocked by the `GCP_*` blacklist-prefix rule because that rule matches only names whose uppercased form starts with `GCP_` (e.g. `GCP_SOMETHING`). The variable `RECOTEM_RECIPE_GCP_PROJECT` starts with `RECOTEM_RECIPE_`, not `GCP_`. The examples/ga4-bigquery/ recipe uses this pattern legitimately. However, it would be blocked if its name contained `KEY`, `TOKEN`, `SECRET`, or any other blacklisted substring (case-insensitive).
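The documented rules can be sketched as a predicate (an illustration, not recotem's code; checking the substrings against the full uppercased name is equivalent here because `RECOTEM_RECIPE_` itself contains no blacklisted substring):

```python
BLOCKED_EXACT = {"RECOTEM_SIGNING_KEYS", "RECOTEM_API_KEYS"}
BLOCKED_PREFIXES = ("AWS_", "GCP_", "GOOGLE_", "AZURE_")
BLOCKED_SUBSTRINGS = ("SECRET", "PASSWORD", "PASSWD", "TOKEN",
                      "KEY", "AUTH", "BEARER", "CRED", "PRIVATE")

def is_expandable(name: str) -> bool:
    """Prefix check on the full name, then the blacklist
    (exact names, prefixes, substrings), all case-insensitive."""
    upper = name.upper()
    if not upper.startswith("RECOTEM_RECIPE_"):
        return False
    if upper in BLOCKED_EXACT or upper.startswith(BLOCKED_PREFIXES):
        return False
    return not any(s in upper for s in BLOCKED_SUBSTRINGS)

print(is_expandable("RECOTEM_RECIPE_DATE"))           # True
print(is_expandable("RECOTEM_RECIPE_GCP_PROJECT"))    # True
print(is_expandable("RECOTEM_RECIPE_PARTITION_KEY"))  # False (contains KEY)
print(is_expandable("GCP_PROJECT"))                   # False (prefix check)
```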

Expansion is single-pass and runs once at YAML load time. There is no escape syntax: a literal `${...}` in the YAML cannot be preserved, since a name matching the prefix is expanded and a name failing the prefix check raises an error. There is no default-value syntax (`${VAR:-default}` is not supported and would attempt to expand the literal name `VAR:-default`), and substituted values are not re-scanned for further `${...}` references.

A missing, malformed, or blacklisted variable produces a `RecipeError` (exit 2). The error message names the variable but never includes its value.

## Loading a directory of recipes

`recotem serve --recipes <dir>` and `load_recipes_directory()` enumerate only direct `*.yaml` children of `<dir>` (non-recursive). Subdirectories are ignored. Each recipe file must remain inside the directory after realpath resolution — symlinks pointing outside are rejected.

Duplicate `name` field handling differs by call site:

- `recotem train` / `load_recipes_directory()` (strict): a duplicate `name` across any two files raises `RecipeError` immediately and aborts the entire load.
- `recotem serve` / `load_recipes_directory_lenient()` (lenient): the first file loaded wins; any subsequent file with the same `name` is skipped and a `recipe_duplicate_name_skipped` warning is emitted to the structured log. The serve process continues with the surviving recipe.
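The enumeration rules can be sketched with `pathlib` — a stand-in illustration, not the library function itself:

```python
from pathlib import Path

def enumerate_recipes(directory: str):
    """Yield direct *.yaml children only (non-recursive), skipping any
    file that resolves outside the directory after symlink resolution."""
    root = Path(directory).resolve()
    for candidate in sorted(root.glob("*.yaml")):
        resolved = candidate.resolve()
        if root not in resolved.parents:
            continue  # symlink escaping the directory: rejected
        yield resolved
```

Files in subdirectories and files with other extensions are never considered, matching the documented non-recursive behaviour.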

## Full example

```yaml
name: news_articles

source:
  type: bigquery
  query: |
    SELECT user_pseudo_id AS user_id,
           (SELECT value.int_value FROM UNNEST(event_params) WHERE key='article_id') AS item_id,
           TIMESTAMP_MICROS(event_timestamp) AS ts
    FROM   `proj.analytics_123.events_*`
    WHERE  _TABLE_SUFFIX BETWEEN @start_date AND @end_date
      AND  event_name = 'select_content'
  query_parameters:
    start_date: "20260401"
    end_date: "20260507"
  project: my-gcp-project

schema:
  user_column: user_id
  item_column: item_id
  time_column: ts

cleansing:
  drop_null_ids: true
  dedup: keep_last
  min_rows: 5000
  min_users: 100
  min_items: 50

item_metadata:
  type: parquet
  path: gs://my-bucket/items.parquet
  fields: [title, category]
  on_field_missing: error

training:
  algorithms: [IALS, CosineKNN, TopPop]
  metric: ndcg
  cutoff: 20
  n_trials: 40
  timeout_seconds: 1800
  split:
    scheme: time_user
    heldout_ratio: 0.1
    seed: 42

output:
  path: gs://my-bucket/artifacts/news_articles.recotem
  versioning: append_sha
```