Understanding Enhanced Metadata

Enhanced metadata describes a track through two complementary outputs: a set of fields that capture musical and contextual attributes, and a 768-dimension embedding vector that captures sonic similarity. This page explains the terminology, how to interpret each output, and when to use which.

For the full list of fields and their score types, see the Enhanced Metadata overview. For the full event schema and example payload, see Enhanced Metadata Event Schema.

Terminology

Field: a single predicted attribute of a track (e.g. Main Genre, Mood, Tempo, Loudness). Each field appears as one entry in the response.

Class: one of the possible predicted categories within a field. For example, Main Genre has 16 classes (rock, pop, jazz, …) and Tempo has 3 classes (slow, mid-tempo, fast).

Label: the predicted class returned for a track. Some fields return a single label string; multi-prediction fields return a labels array with up to 2 or 3 entries, in descending order of confidence.

Score: a numeric output describing the model's prediction. Scores come in two main forms: a distribution of probabilityScores (0 to 1) across every class in a field, or a single value on a defined range (typically -1 to 1).

The term "tag" is sometimes used informally to mean a field, a class, or a returned label. To avoid ambiguity, this documentation uses field, class, and label exclusively.

Field Score Types

Fields fall into two categories based on how their scores are structured.

Distribution-scored fields return a distribution object with a probabilityScore (0 to 1) for every class, alongside one or more predicted labels. Used for Content, Genres, Mood, Vocals, Instruments, Recording Environment, Sound Generation, Harmony, Scale, Key, Tonality, Rhythm, Tempo, Audience, Origin, and Use Case.

Interval-scored fields return a continuous value on a defined range, paired with a discretised label (e.g. low / moderate / high arousal). Used for Arousal, Valence, Engagement, Pleasantness, Loudness, Roughness, Timbre, Space, Texture, and Grooviness.

BPM returns a raw numeric value with no class.

Note that Loudness is measured directly from the audio per the ITU-R BS.1770-4 standard (in LKFS) rather than predicted by the model. Its value is a deterministic measurement, and its label is a categorisation of that measurement, not a model output.

Reading Distribution Scores

A probabilityScore divides the model's belief across all classes in a field, so the meaning of a given score depends on the class count.

A 0.5 on a 2-class field is borderline; a 0.5 on a 135-class field indicates very high affinity, since one class has claimed half of the total weight out of 135 possible options.

This is most visible in Genres. Main Genre (16 classes) tends to produce higher, more concentrated scores because the categories are broad and distinct. Sub Genre (135 classes) produces lower, more distributed scores, reflecting the natural overlap between adjacent sub-styles such as deep house and ambient house.

Lower sub-genre scores are expected, not a sign of low confidence.
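
One way to make scores from fields of different sizes easier to compare is to express each score as a multiple of the uniform baseline, i.e. the score every class would receive if the weight were spread evenly (1 divided by the class count). The sketch below is an illustrative reading aid, not part of the Enhanced Metadata output; the function name is an assumption introduced here, and the class counts are the ones quoted on this page.

```python
def lift_over_uniform(probability_score: float, class_count: int) -> float:
    """Return how many times larger a score is than the uniform baseline.

    The uniform baseline is 1 / class_count, so the ratio simplifies to
    probability_score * class_count.
    """
    if class_count < 1:
        raise ValueError("class_count must be at least 1")
    return probability_score * class_count


# The same raw score of 0.5 read against fields of different sizes.
for field_name, class_count in [
    ("2-class field", 2),
    ("Main Genre", 16),
    ("Sub Genre", 135),
]:
    lift = lift_over_uniform(0.5, class_count)
    print(f"{field_name}: 0.5 is {lift:.1f}x the uniform baseline")
```

A ratio close to 1 means the class is no more favoured than chance, while a large ratio means the class stands well clear of the rest, even when the raw score looks modest.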

Confidence Threshold and Returned Labels

For multi-label fields (Main Genre, Sub Genre, Mood, Instruments, Audience Age, Audience Region, Use Case), only classes above the model's internal confidence threshold are returned in the labels array, capped at the field's maximum (2 or 3 depending on the field).

For single-label fields, exactly one label is always returned.

If a track's probability is spread across many classes without any single one clearly dominating, only the classes above the threshold will be returned. In that case, the labels array may contain fewer entries than the field's maximum (e.g. one Main Genre label instead of three).
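
The selection behaviour described above can be pictured with a minimal sketch. The real threshold is internal to the model and is not documented here, so the value 0.25 below is purely hypothetical, as are the function name, the class names other than those quoted on this page, and the example scores. The sketch only illustrates the "above threshold, capped at the maximum" logic.

```python
def select_labels(
    distribution: dict[str, float],
    threshold: float,
    max_labels: int,
) -> list[str]:
    """Return classes scoring above the threshold, best first, capped at max_labels."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    above = [name for name, score in ranked if score > threshold]
    return above[:max_labels]


HYPOTHETICAL_THRESHOLD = 0.25  # assumption: the real value is not published

# Weight concentrated in a few classes: the cap of 2 applies.
concentrated = {"rock": 0.40, "pop": 0.30, "electronic": 0.28, "jazz": 0.02}
print(select_labels(concentrated, HYPOTHETICAL_THRESHOLD, max_labels=2))

# Weight spread thinly: only one class clears the threshold.
spread = {"rock": 0.30, "folk": 0.20, "pop": 0.18, "electronic": 0.17, "jazz": 0.15}
print(select_labels(spread, HYPOTHETICAL_THRESHOLD, max_labels=3))
```

In the second case the array holds a single label even though the field allows three, which matches the behaviour described above.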

The Embedding

Each track also returns a 768-dimension embedding vector that encodes its sonic essence as a single point in a high-dimensional space. Tracks that sound similar sit close together in that space, making the embedding the foundation for similarity search, recommendation, and classification.

The 768 dimensions are not individually meaningful. No single dimension corresponds to an attribute such as "tempo" or "brightness".

Instead, the model distributes sonic attributes such as pitch, timbre, frequency patterns, and rhythmic feel across the full vector. Meaning emerges from the relationship between vectors, not from inspecting any one component.

Comparing Embeddings

Similarity between two tracks is measured by comparing their vectors. A standard metric for audio embedding similarity is cosine similarity, which measures the angle between two vectors and ranges from -1 (opposite direction) to 1 (identical direction). Higher values mean a closer sonic match.

The absolute scale matters less than relative ranking. Embeddings are most useful for ordering candidates (nearest-neighbour search) rather than producing standalone confidence numbers.
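
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula; the toy vectors have 3 dimensions for readability and are invented for illustration, whereas real embeddings have 768. In practice a numerical library or a vector database would usually do this work.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (invented values).
query = [1.0, 0.0, 0.0]
candidate_near = [0.9, 0.1, 0.0]
candidate_far = [0.1, 0.9, 0.2]

# Only the ordering matters here: the nearer candidate scores higher.
print(cosine_similarity(query, candidate_near) > cosine_similarity(query, candidate_far))
```

Because the metric depends only on direction, scaling a vector by a positive constant does not change the result.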

Fields vs. the Embedding

Fields and the embedding are complementary.

Fields are discrete, named, and bounded by their class lists, making them best for filtering, faceting, and explainability.

The embedding is continuous and unbounded, making it best for similarity, search, and recommendation.

Most production systems combine the two: fields narrow the candidate pool, embeddings rank within it.

Because both outputs are derived from audio rather than listener behaviour, they work for brand-new tracks with no listening history. This is a key advantage over collaborative-filtering recommendation systems.
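
The filter-then-rank pattern described in this section might look like the sketch below. The catalogue is a list of plain dictionaries whose keys ("id", "main_genres", "embedding") are hypothetical and chosen for illustration; they are not the event schema. The toy embeddings use 3 dimensions instead of 768, and the function names are assumptions introduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_then_rank(query_embedding, catalogue, required_genre, top_k=10):
    """Keep tracks carrying the required genre label, then order them by similarity."""
    # Step 1: fields narrow the candidate pool.
    candidates = [t for t in catalogue if required_genre in t["main_genres"]]
    # Step 2: embeddings rank within the pool, most similar first.
    ranked = sorted(
        candidates,
        key=lambda t: cosine_similarity(query_embedding, t["embedding"]),
        reverse=True,
    )
    return [t["id"] for t in ranked[:top_k]]


# Hypothetical catalogue with invented values.
catalogue = [
    {"id": "track-a", "main_genres": ["rock"], "embedding": [0.9, 0.1, 0.0]},
    {"id": "track-b", "main_genres": ["jazz"], "embedding": [1.0, 0.0, 0.0]},
    {"id": "track-c", "main_genres": ["rock", "pop"], "embedding": [0.1, 0.9, 0.2]},
]

print(filter_then_rank([1.0, 0.0, 0.0], catalogue, required_genre="rock"))
```

Here track-b is the closest match by embedding but is excluded by the genre filter, which shows how the two outputs play different roles: the field decides what is eligible, and the embedding decides the order.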