Curiosity - Federated Search

Federated Search

Federated search extends workspace search to include results from external systems. When a user searches, the workspace executes your custom code alongside its normal retrieval — calling any API, database, or service you choose — and blends those results into the unified result list.

This gives users a single search box that spans everything: internal graph data and external services like documentation platforms, issue trackers, ERP systems, knowledge bases, or public APIs.

How it works

A federated search index is a Custom Code Search Index. Every search query goes through these steps:

The workspace executes your C# search code with the user's query string.
Your code calls the external system and gets back a ranked list of results.
For each result, your code creates a virtual node: a temporary in-memory graph node that holds the result data.
Those virtual nodes appear in search results alongside native graph nodes.
When a lazy (not yet materialized) virtual node is about to be displayed, or is scrolled to during pagination, the workspace calls your materialize code to hydrate it with full content on demand.

Virtual nodes are never persisted to storage and are automatically cleared from memory after a period of inactivity.

Creating a federated search index

In the Workspace Admin UI:

Go to Build → Code Indexes.
Click Create federated search index (the wave-magnifier button in the toolbar).
Enter a Virtual Node Type name. This is the schema the virtual nodes will use; the data is held in memory only and never written to the database. The examples below use Ticket — the same node type as the support-graph examples — so federated issues from an external tracker blend in with the workspace's native tickets.
Write the Search code and, if you want lazy materialization, the Materialize code.
Save.

Search code

The search code runs on every query. It receives the raw query string from SearchQuery.OriginalQuery, calls your external system, creates virtual nodes for each result, and returns a KeyedScoredUIDs map of UIDs to scores.

Available variables:

Variable	Type	Description
`Graph`	`Mosaik.GraphDB.Safe.Graph`	Graph instance for creating virtual nodes.
`SearchQuery`	`ISearchExpression`	The search expression. Use `.OriginalQuery` for the raw query string.
`UserUID`	`UID128`	UID of the user performing the search.
`VirtualNodeSource`	`IVirtualNodeSource`	Marker to attach to non-materialized virtual nodes.
`CancellationToken`	`CancellationToken`	Always pass this to async calls.
`CreateHttpClient()`	method	Returns an `HttpClient` for outbound HTTP calls.

Return type: KeyedScoredUIDs — a map of TypedUID128 → float scores. Use KeyedScoredUIDs.Rent() to get an instance. Return KeyedScoredUIDs.Empty() if the external system returned no results.

Minimal example — federating an external issue tracker into Ticket virtual nodes:

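var query = SearchQuery.OriginalQuery;

using var httpClient = CreateHttpClient();
httpClient.Timeout = TimeSpan.FromSeconds(10);   // a slow API must not stall the whole search

IssueSearchResponse results;
try
{
    using var response = await httpClient.GetAsync(
        $"https://issues.example.com/api/search?q={Uri.EscapeDataString(query)}&limit=20",
        CancellationToken);
    response.EnsureSuccessStatusCode();
    results = await response.Content.ReadFromJsonAsync<IssueSearchResponse>(CancellationToken);
}
catch (Exception ex) when (ex is HttpRequestException
                           || (ex is TaskCanceledException && !CancellationToken.IsCancellationRequested))
{
    // API error or timeout: degrade gracefully so native results are still returned.
    return KeyedScoredUIDs.Empty();
}

if (results?.Issues == null || results.Issues.Count == 0)
    return KeyedScoredUIDs.Empty();

// Rent only once we know there is something to return.
var scoredUIDs = KeyedScoredUIDs.Rent();
int rank = 0;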
foreach (var issue in results.Issues)
{
    var node = await Graph.GetOrAddLockedAsync(N.Ticket.Type, issue.Id);
    node[N.Ticket.Subject] = issue.Subject;
    node[N.Ticket.Body]    = issue.Body;
    node[N.Ticket.Url]     = issue.Url;
    node.Timestamp = new Time(issue.CreatedAt);
    await Graph.CommitVirtualNodeAsync(node);

    scoredUIDs[new TypedUID128(node.TypeUID, node.UID)] = 1f / (rank + 1);
    rank++;
}

return scoredUIDs;

public record IssueSearchResponse(List<Issue> Issues);
public record Issue(string Id, string Subject, string Body, string Url, DateTimeOffset CreatedAt);

N.Ticket is the strongly-typed accessor for the virtual node type — whatever name you entered as the Virtual Node Type becomes N.<YourType>. Here the index was created with Ticket, so N.Ticket.Subject, N.Ticket.Body, and N.Ticket.Url line up with the fields of the Ticket node used elsewhere in the docs. A virtual node type can declare any fields you want to populate (Url above isn't part of the ingested Ticket schema — virtual nodes are in-memory, so the shape is whatever your code writes).

Scoring

The score you assign each UID determines its initial rank within the federated results. A common convention is 1f / (rank + 1) (reciprocal rank), which preserves the ordering the external system returned. If the external system gives you an actual relevance score, use it directly.
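As a small illustration of the reciprocal-rank convention, the sketch below scores a list of result IDs by position. It is plain C# with no workspace types; the `FederatedScoring` class and its method names are hypothetical, and string IDs stand in for the UIDs used in the search code.

```csharp
using System.Collections.Generic;

public static class FederatedScoring
{
    // Reciprocal rank: position 0 scores 1.0, position 1 scores 0.5, and so on.
    public static float ReciprocalRank(int rank) => 1f / (rank + 1);

    // Scores each ID by the position at which the external system returned it.
    // If an ID appears twice, the first (best) position wins.
    public static Dictionary<string, float> ScoreByPosition(IReadOnlyList<string> orderedIds)
    {
        var scores = new Dictionary<string, float>(orderedIds.Count);
        for (int rank = 0; rank < orderedIds.Count; rank++)
        {
            scores.TryAdd(orderedIds[rank], ReciprocalRank(rank));
        }
        return scores;
    }
}
```

Because the scores strictly decrease with position, sorting by score reproduces the order the external system returned.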

Lazy materialization

With a paginated API, fetch only the first page up front. For the remaining results, which may never be viewed, create virtual nodes that carry only a synthetic key encoding enough information (here the query, page index, and position within the page) to fetch the content later. The workspace calls your materialize code only when it actually needs to display one of those nodes.

Pattern — eager first page, lazy remainder:

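// In the search code, after fetching the first page.
// firstPageIssues and totalCount are taken from that first-page response.
int pageSize = 20;
int index = 0;

foreach (var issue in firstPageIssues)         // first page already in memory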
{
    var id = $"{query}|page:0|pos:{index}";
    var node = await Graph.GetOrAddLockedAsync(N.Ticket.Type, id);
    node[N.Ticket.Subject] = issue.Subject;     // eagerly populated
    node[N.Ticket.Url]     = issue.Url;
    await Graph.CommitVirtualNodeAsync(node);

    scoredUIDs[new TypedUID128(node.TypeUID, node.UID)] = 1f / (index + 1);
    index++;
}

// Remaining results — virtual shells only, materialized on demand
int maxResults = Math.Min(200, totalCount);
for (; index < maxResults; index++)
{
    int pageIndex  = index / pageSize;
    int posInPage  = index % pageSize;
    var id = $"{query}|page:{pageIndex}|pos:{posInPage}";

    var node = await Graph.GetOrAddLockedAsync(N.Ticket.Type, id);
    node[IVirtualNodeSource.FieldName] = VirtualNodeSource;  // marks node as lazy
    await Graph.CommitVirtualNodeAsync(node);

    scoredUIDs[new TypedUID128(node.TypeUID, node.UID)] = 1f / (index + 1);
}
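The synthetic key has to be parsed back in the materialize code, so it can help to keep building and parsing in one place. The following is a minimal sketch, assuming the `{query}|page:{n}|pos:{n}` format used above; the `SyntheticKey` class is a hypothetical helper, not part of the workspace API. It parses from the end of the string so that a `|` typed by the user in the query cannot shift the fields.

```csharp
using System;
using System.Globalization;

public static class SyntheticKey
{
    private const string PageMarker = "|page:";
    private const string PosMarker = "|pos:";

    // Builds "{query}|page:{pageIndex}|pos:{posInPage}".
    public static string Build(string query, int pageIndex, int posInPage) =>
        $"{query}{PageMarker}{pageIndex}{PosMarker}{posInPage}";

    // Splits a key back into its parts. Returns false if the key is malformed.
    public static bool TryParse(string key, out string query, out int pageIndex, out int posInPage)
    {
        query = string.Empty;
        pageIndex = 0;
        posInPage = 0;

        // Locate the markers from the end, so markers inside the query are ignored.
        int posAt = key.LastIndexOf(PosMarker, StringComparison.Ordinal);
        if (posAt < 0) return false;

        string head = key.Substring(0, posAt);
        int pageAt = head.LastIndexOf(PageMarker, StringComparison.Ordinal);
        if (pageAt < 0) return false;

        string pageText = head.Substring(pageAt + PageMarker.Length);
        string posText = key.Substring(posAt + PosMarker.Length);

        // NumberStyles.None accepts digits only (no sign, no whitespace).
        if (!int.TryParse(pageText, NumberStyles.None, CultureInfo.InvariantCulture, out pageIndex)) return false;
        if (!int.TryParse(posText, NumberStyles.None, CultureInfo.InvariantCulture, out posInPage)) return false;

        query = head.Substring(0, pageAt);
        return true;
    }
}
```

With this helper, `SyntheticKey.Build("printer | offline", 2, 5)` round-trips through `TryParse` to the same query, page 2, and position 5.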

Materialize code

The materialize code runs when a lazy virtual node actually needs to be shown. It does not receive Graph — it only reads from and writes to Content.

Available variables:

Variable	Type	Description
`Content`	`NodeContent`	The node being materialized. Read the ID, populate fields.
`CancellationToken`	`CancellationToken`	Pass to any async calls.
`CreateHttpClient()`	method	Returns an `HttpClient`.
`Logger`	`ILogger`	Standard logger.

Minimal example:

var id = Content[N.Ticket.ID];

// Parse the synthetic key "{query}|page:{n}|pos:{n}" from the end,
// so a '|' inside the query cannot shift the fields.
var parts      = id.Split('|');
var posInPage  = int.Parse(parts[^1].Split(':')[1]);
var pageIndex  = int.Parse(parts[^2].Split(':')[1]);
var query      = string.Join('|', parts[..^2]);

// The page request must repeat the original query, otherwise the page index is meaningless.
using var httpClient = CreateHttpClient();
var response = await httpClient.GetAsync(
    $"https://issues.example.com/api/results?q={Uri.EscapeDataString(query)}&page={pageIndex}&limit=20",
    CancellationToken);
response.EnsureSuccessStatusCode();

var results = await response.Content.ReadFromJsonAsync<List<Issue>>(CancellationToken);
var issue = results[posInPage];

Content[N.Ticket.Subject] = issue.Subject;
Content[N.Ticket.Body]    = issue.Body;
Content[N.Ticket.Url]     = issue.Url;

public record Issue(string Subject, string Body, string Url);

Caching pages in materialize code

If your API is paginated and fetching one result means fetching a whole page, cache the page so the next materialization call for the same page costs no extra request.

Cache.ClearStale();

// query, pageIndex, and posInPage are parsed from the synthetic key as in the minimal example.
// The cache key includes the query because the cache is shared across users and searches.
var entries = await Cache.Pages
    .GetOrAdd($"{query}|page:{pageIndex}", _ => (DateTimeOffset.UtcNow, FetchPageAsync(query, pageIndex)))
    .task;

var issue = entries[posInPage];
Content[N.Ticket.Subject] = issue.Subject;
Content[N.Ticket.Body]    = issue.Body;
Content[N.Ticket.Url]     = issue.Url;

async Task<List<Issue>> FetchPageAsync(string q, int page)
{
    using var httpClient = CreateHttpClient();
    var response = await httpClient.GetAsync(
        $"https://issues.example.com/api/results?q={Uri.EscapeDataString(q)}&page={page}&limit=20",
        CancellationToken);
    response.EnsureSuccessStatusCode();
    return await response.Content.ReadFromJsonAsync<List<Issue>>(CancellationToken);
}

public record Issue(string Subject, string Body, string Url);

public static class Cache
{
    public static readonly ConcurrentDictionary<string, (DateTimeOffset created, Task<List<Issue>> task)> Pages = new();

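    public static void ClearStale()
    {
        var maxAge = TimeSpan.FromMinutes(30);
        foreach (var kv in Pages)
        {
            // Also drop failed or cancelled fetches so the next call retries them
            // instead of replaying the cached error until the entry expires.
            var failed = kv.Value.task.IsFaulted || kv.Value.task.IsCanceled;
            if (failed || DateTimeOffset.UtcNow - kv.Value.created > maxAge)
                Pages.TryRemove(kv.Key, out _);
        }
    }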
}

Keep TTLs short (15–30 minutes). The materialize scope is shared across concurrent users — always use ConcurrentDictionary and always call ClearStale() to avoid unbounded memory growth.
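One detail worth knowing about `ConcurrentDictionary.GetOrAdd` is that the value factory can run more than once when several threads miss the same key at the same moment, which would trigger duplicate page fetches. The sketch below is one possible variant, not the workspace's own mechanism: a hypothetical generic `PageCache<T>` that wraps each fetch in `Lazy<Task<T>>` so only the stored entry's fetch actually starts, evicts entries older than the TTL, and removes an entry whose fetch failed.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class PageCache<T>
{
    private readonly ConcurrentDictionary<string, (DateTimeOffset Created, Lazy<Task<T>> Page)> _pages = new();
    private readonly TimeSpan _ttl;

    public PageCache(TimeSpan ttl) => _ttl = ttl;

    public int Count => _pages.Count;

    public async Task<T> GetOrFetchAsync(string key, Func<Task<T>> fetch)
    {
        EvictStale();

        // GetOrAdd may build several Lazy wrappers under contention, but only the
        // one that ends up stored is ever evaluated, so the fetch starts once.
        var entry = _pages.GetOrAdd(key, _ => (DateTimeOffset.UtcNow, new Lazy<Task<T>>(fetch)));
        try
        {
            return await entry.Page.Value;
        }
        catch
        {
            // Remove the failed entry so a later call retries the fetch.
            _pages.TryRemove(key, out _);
            throw;
        }
    }

    private void EvictStale()
    {
        var now = DateTimeOffset.UtcNow;
        foreach (var kv in _pages)
        {
            if (now - kv.Value.Created > _ttl)
                _pages.TryRemove(kv.Key, out _);
        }
    }
}
```

In materialize code this could be held in a static field and called as `await pages.GetOrFetchAsync($"{query}|page:{pageIndex}", () => FetchPageAsync(query, pageIndex))`, where `pages` is an assumed static instance and `FetchPageAsync` is a page-fetching function of your own that takes the query and page index.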

When to use federated search

Good use cases:

External knowledge bases — Confluence, SharePoint, Notion, and similar tools that have a search API but whose content you don't want to duplicate.
Public or partner APIs — academic repositories, patent databases, regulatory filing systems, supplier catalogs.
Legacy systems without an export path — if the system has a search endpoint, federated search can surface its results immediately.
Live data — results that must be real-time and can't tolerate the lag of an ingestion pipeline (stock levels, live case statuses).

Poor use cases:

Frequently queried, mostly static content — ingest it via a connector so the workspace can tokenize, embed, and rank it together with everything else.
High-traffic corpora — every workspace query triggers a call to the external API. At hundreds of queries per second, rate limits and latency become serious problems. Use a pull-based connector instead.
Content that needs facets or graph relationships — virtual nodes are temporary and cannot be traversed or filtered the way real graph nodes can.

Limits and behavior notes

Virtual nodes are in-memory only. They are never persisted and will be cleared after inactivity. Do not rely on them for durability.
A slow external API delays the entire search response for that user. Always pass CancellationToken and set httpClient.Timeout.
ACL filters do not automatically apply to virtual nodes. If the external system has its own permission model, filter results for UserUID inside the search code before adding them to scoredUIDs.
Federated results participate in type-based facets (the node type you assign), but cannot be filtered by graph-derived facets because they have no persistent graph edges.
The result count shown to users is capped at the number of virtual nodes you return, so it may not match the total reported by the external system.
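The permission note above can be sketched as a filter applied before any virtual node is created. This is only an illustration: the `ExternalIssue` record, its `AllowedUsers` field, and the `AccessFilter` helper are hypothetical, and how `UserUID` maps to an identity in the external system is an assumption you would have to resolve for your own integration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one result from the external system,
// including the external user IDs allowed to see it.
public sealed record ExternalIssue(string Id, string Subject, IReadOnlySet<string> AllowedUsers);

public static class AccessFilter
{
    // Keeps only the issues the given external user may see, preserving the original order.
    public static List<ExternalIssue> VisibleTo(string externalUserId, IEnumerable<ExternalIssue> issues) =>
        issues.Where(issue => issue.AllowedUsers.Contains(externalUserId)).ToList();
}
```

Filtering before the scoring loop means hidden results never become virtual nodes and never consume a rank position. Where the external API can enforce permissions itself, for example by accepting the user's identity with the request, that is usually the safer choice.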

Common pitfalls

No timeout on HTTP calls. A slow external API blocks the search response for every user. Set httpClient.Timeout and pass CancellationToken to every async call.
Hardcoded secrets in code. Use environment variables or the workspace secrets store. Never embed API keys directly in the index code.
Too many lazy nodes. Returning 10 000 virtual shells means 10 000 potential materialization calls. Cap results at what users will actually page through — 200–500 is usually more than enough.
Throwing on API errors. If the external API is down, return KeyedScoredUIDs.Empty() rather than throwing. A failed federated index should degrade gracefully and still return native results.
Cache without cleanup. An unbounded cache in the materialize scope leaks memory permanently. Always call ClearStale() at the top of the materialize code.
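The timeout and error-handling pitfalls can be addressed together in one small wrapper. The sketch below uses only standard .NET types; the `SafeFetch` class and its method name are hypothetical. It applies a per-call timeout through a linked `CancellationTokenSource` and returns `null` on an HTTP error, a network failure, or a timeout, while still propagating cancellation requested by the caller.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class SafeFetch
{
    // Returns the response body, or null if the call fails or exceeds the timeout.
    public static async Task<string?> TryGetStringAsync(
        HttpClient httpClient, string url, TimeSpan timeout, CancellationToken cancellationToken)
    {
        // Cancelled when either the caller cancels or the timeout elapses.
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        timeoutCts.CancelAfter(timeout);

        try
        {
            using var response = await httpClient.GetAsync(url, timeoutCts.Token);
            if (!response.IsSuccessStatusCode)
                return null;

            return await response.Content.ReadAsStringAsync(timeoutCts.Token);
        }
        catch (HttpRequestException)
        {
            return null;   // DNS failure, refused connection, and similar network errors
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            return null;   // our own timeout fired, not the caller's cancellation
        }
    }
}
```

In search code, a `null` result would map to returning `KeyedScoredUIDs.Empty()`, so the user still gets native results when the external system is unavailable.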