# Connectors
Connectors are the most flexible way to ingest data into Curiosity Workspace. A connector is a program (or integration component) that:
- reads from a source system (files, databases, APIs, event streams)
- maps source records into node schemas and edge schemas
- commits changes into the workspace graph
## Why connectors matter
Connectors are where you encode the “truth” of how your source systems map into your graph model:
- stable keys and deduplication
- relationship creation
- incremental updates and deletes
- enrichment (aliases, normalization, derived fields)
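Stable keys are the foundation for the rest of the list. As a minimal sketch, a connector can derive one deterministic key per source record, so re-ingesting the same record always maps to the same node. The `system:id` format and the normalization rules here are illustrative conventions, not a Curiosity API:

```csharp
public static class StableKeys
{
    // Derive a deterministic key from the source system name and the
    // record's external identifier. Normalizing case and whitespace
    // prevents "ERP:42" and "erp: 42" from becoming two nodes.
    public static string ForRecord(string sourceSystem, string externalId) =>
        $"{sourceSystem.Trim().ToLowerInvariant()}:{externalId.Trim()}";
}
```

Keys derived from mutable fields (display names, titles) break as soon as the source edits them; prefer the source system's own immutable identifier.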
## Minimal connector mapping example (C#)
The demo repository illustrates a common ingestion structure: define schemas, upsert nodes, link edges, then commit.
```csharp
// Define schemas once (or validate they exist)
await graph.CreateNodeSchemaAsync<Device>();
await graph.CreateNodeSchemaAsync<Part>();
await graph.CreateEdgeSchemaAsync(typeof(Edges));

// Upsert nodes by key
var deviceNode = graph.TryAdd(new Device { Name = "iPhone 14 Pro Max" });
var partNode = graph.TryAdd(new Part { Name = "Loudspeaker" });

// Link nodes with an edge (and optionally the inverse edge name)
graph.Link(deviceNode, partNode, Edges.HasPart, Edges.PartOf);

await graph.CommitPendingAsync();
```
This pattern scales well when you add batching, incremental cursors, and observability.
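Batching itself needs nothing from the workspace API. As an illustrative sketch, a generic helper can split any record stream into commit-sized chunks before each commit call (the extension-method shape and the batch size are assumptions, not part of the demo):

```csharp
using System.Collections.Generic;

public static class Batching
{
    // Split a record stream into fixed-size batches so each commit
    // stays small and a failure only re-processes one batch.
    public static IEnumerable<List<T>> InBatchesOf<T>(this IEnumerable<T> source, int size)
    {
        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size);
            }
        }
        if (batch.Count > 0)
            yield return batch; // flush the final partial batch
    }
}
```

A typical loop would then upsert one batch, commit, advance the cursor, and repeat.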
## Connector responsibilities
At minimum, a connector should:
- create/update schemas (or validate schemas exist)
- upsert nodes by stable keys
- create/update edges between nodes
- commit in batches and handle retries
- log ingestion progress and failures
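The retry responsibility can be sketched independently of any workspace API. This illustrative helper retries a failing async operation with exponential backoff; the attempt count and base delay are placeholder defaults, not recommended values:

```csharp
using System;
using System.Threading.Tasks;

public static class Retries
{
    // Run a commit-like operation, retrying transient failures with
    // exponential backoff (200ms, 400ms, 800ms, ...). The final
    // attempt rethrows so the caller can log and halt the run.
    public static async Task RetryAsync(Func<Task> operation, int maxAttempts = 3, int baseDelayMs = 200)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await operation();
                return;
            }
            catch when (attempt < maxAttempts)
            {
                await Task.Delay(baseDelayMs * (1 << (attempt - 1)));
            }
        }
    }
}
```

In practice you would also distinguish transient errors (timeouts, rate limits) from permanent ones (schema mismatches), which should fail fast.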
## Designing connector-friendly schemas
Good connector design starts with good schema design:
- each node type has a stable key
- relationships are explicit edges
- large text fields are properties (later indexed for search/embeddings)
See Schema Design.
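As a sketch of these principles, a connector-friendly node type keeps the stable key separate from mutable display fields and models large text as an ordinary property. The property names here are hypothetical, not taken from the demo schemas:

```csharp
// Illustrative node type: one stable key, explicit display fields,
// and a large-text property that can later be indexed for search
// or embeddings.
public sealed class SupportTicket
{
    // Stable key from the source system (never a mutable title).
    public string Key { get; set; } = "";
    public string Title { get; set; } = "";
    // Large text stays a property; relationships to other nodes
    // (reporter, device, part) are edges, not embedded strings.
    public string Description { get; set; } = "";
}
```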
## Incremental ingestion patterns
Choose one:
- Full refresh: rebuild everything on each run (simple, expensive)
- Incremental: ingest only changes since last run (recommended for production)
- Event-driven: apply changes as they happen (fastest, most complex)
For incremental ingestion, track:
- source cursor/watermark (timestamp, sequence)
- deletes (tombstones or periodic reconciliation)
- schema evolution strategy (backfills)
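A watermark can be as simple as a timestamp persisted between runs. This sketch stores it in a local file using the round-trip `"O"` format; the file-based storage is an assumption for illustration, and production connectors often keep the cursor in the workspace itself or a state store:

```csharp
using System;
using System.Globalization;
using System.IO;

public sealed class Watermark
{
    private readonly string _path;
    public Watermark(string path) => _path = path;

    // Load the last successfully ingested timestamp; default to
    // DateTime.MinValue on the first run so everything is picked up.
    public DateTime Load() =>
        File.Exists(_path)
            ? DateTime.Parse(File.ReadAllText(_path), CultureInfo.InvariantCulture,
                             DateTimeStyles.RoundtripKind)
            : DateTime.MinValue;

    // Persist the new watermark only AFTER a successful commit, so a
    // failed run is retried from the previous cursor.
    public void Save(DateTime cursor) =>
        File.WriteAllText(_path, cursor.ToString("O"));
}
```

The ordering rule matters: saving the cursor before the commit succeeds can silently drop records.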
## Testing and validation
After running a connector, validate:
- counts per node type
- edge completeness
- no duplicate keys
- search indexes include the intended fields (if configured)
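The duplicate-key check can be run with plain LINQ over the keys a connector produced. This in-memory sketch is illustrative; at scale you would query the graph rather than materialize all keys:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class IngestionChecks
{
    // Return every key that appears more than once. A non-empty
    // result means the key derivation is unstable or the source
    // contains true duplicates that need an explicit dedup rule.
    public static List<string> DuplicateKeys(IEnumerable<string> keys) =>
        keys.GroupBy(k => k)
            .Where(g => g.Count() > 1)
            .Select(g => g.Key)
            .ToList();
}
```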
## Common pitfalls
- Unstable keys cause duplicates and broken links.
- Missing edges make the graph unusable for navigation and graph-filtered search.
- Ingesting everything as text reduces the value of schemas and facets.
## Next steps
- Build ingestion workflows: Ingestion Pipelines
- Learn best practices for modeling: Schema Design