Curiosity

XML Connector

Stream XML documents (RSS feeds, SOAP exports, legacy data dumps) into the graph using System.Xml.XmlReader from the .NET BCL — no third-party package required. For small documents the example also shows the LINQ-to-XML alternative (XDocument), which is more convenient when memory isn't a concern.

Packages

Curiosity.Library on NuGet

System.Xml is part of the BCL — no additional package needed.

dotnet add package Curiosity.Library

Expected source shape

A flat list of <article> elements inside a root:

<?xml version="1.0" encoding="utf-8"?>
<feed>
  <article id="A-1001">
    <title>Battery life improvements</title>
    <author>Alice Anders</author>
    <publishedAt>2025-10-21T09:00:00Z</publishedAt>
    <body>...</body>
  </article>
  <article id="A-1002">
    <title>Q4 roadmap</title>
    <author>Bob Boniface</author>
    <publishedAt>2025-10-22T14:30:00Z</publishedAt>
    <body>...</body>
  </article>
</feed>

Connector code — streaming with XmlReader

Program.cs
using System.Globalization;
using System.Xml;
using Curiosity.Library;

using var graph = Graph.Connect(
    endpoint:      Environment.GetEnvironmentVariable("CURIOSITY_ENDPOINT")!,
    token:         Environment.GetEnvironmentVariable("CURIOSITY_TOKEN")!,
    connectorName: "xml-articles");

await graph.CreateNodeSchemaAsync<Article>();
graph.SetAutoCommitCost(everyNodes: 10_000);

var path = args.Length > 0 ? args[0] : "feed.xml";

var settings = new XmlReaderSettings
{
    Async            = true,
    IgnoreWhitespace = true,
    IgnoreComments   = true,
    DtdProcessing    = DtdProcessing.Ignore,   // safety: never resolve external DTDs
};

using var reader = XmlReader.Create(path, settings);

var ingested = 0;
while (await reader.ReadAsync())
{
    if (reader.NodeType != XmlNodeType.Element || reader.Name != "article") continue;

    var article = new Article
    {
        Id = reader.GetAttribute("id"),
    };

    // Read children of <article>. ReadElementContentAsStringAsync already advances
    // past the element it consumes, so we must NOT call ReadAsync again right after
    // it — doing so would skip every other sibling element.
    using var subtree = reader.ReadSubtree();
    await subtree.ReadAsync();                 // position on <article> itself
    await subtree.ReadAsync();                 // move to its first child
    while (!subtree.EOF)
    {
        if (subtree.NodeType != XmlNodeType.Element)
        {
            await subtree.ReadAsync();
            continue;
        }
        switch (subtree.Name)
        {
            case "title":  article.Title  = await subtree.ReadElementContentAsStringAsync(); break;
            case "author": article.Author = await subtree.ReadElementContentAsStringAsync(); break;
            case "body":   article.Body   = await subtree.ReadElementContentAsStringAsync(); break;
            case "publishedAt":
                var raw = await subtree.ReadElementContentAsStringAsync();
                article.PublishedAt = DateTimeOffset.Parse(raw, CultureInfo.InvariantCulture);
                break;
            default:
                await subtree.ReadAsync();     // skip unknown elements
                break;
        }
    }

    if (string.IsNullOrEmpty(article.Id)) continue;

    graph.AddOrUpdate(article);
    ingested++;
}

await graph.CommitPendingAsync();
Console.WriteLine($"Ingested {ingested} articles from {path}");

// Type declarations must come after all top-level statements.
[Node]
public class Article
{
    [Key]       public string         Id          { get; set; }
    [Property]  public string         Title       { get; set; }
    [Property]  public string         Author      { get; set; }
    [Property]  public string         Body        { get; set; }
    [Timestamp] public DateTimeOffset PublishedAt { get; set; }
}

How it works

XmlReader is a forward-only cursor over the XML stream: only the current node is in memory. ReadSubtree() returns a reader scoped to the current element and its descendants, so the inner loop walks the children of <article> without leaving the element. When the subtree reader is closed, the outer reader is positioned on the </article> end tag; the next ReadAsync then moves on to the following sibling. One subtlety: ReadElementContentAsStringAsync itself advances past the element it consumes, so an inner loop must not unconditionally call ReadAsync right after it, or every other child element gets silently skipped.
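The positioning behavior can be checked against a small in-memory document (the feed below is illustrative only, not part of the connector):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

var xml = "<feed><article id=\"A-1\"><title>T1</title></article><article id=\"A-2\"/></feed>";
using var reader = XmlReader.Create(new StringReader(xml));

var ids = new List<string>();
while (reader.Read())
{
    if (reader.NodeType != XmlNodeType.Element || reader.Name != "article") continue;
    ids.Add(reader.GetAttribute("id"));
    using var subtree = reader.ReadSubtree();   // scoped to this <article> only
    while (subtree.Read()) { /* only nodes inside this <article> appear here */ }
    // Closing the subtree leaves `reader` on </article>;
    // the outer Read() then moves on to the next sibling.
}

Console.WriteLine(string.Join(",", ids));       // A-1,A-2
```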

DtdProcessing = DtdProcessing.Ignore is a security hardening step: it skips the DTD entirely, so the reader never resolves external references (the XXE attack vector). Note that the XmlReaderSettings default is actually Prohibit, which throws an XmlException on any DTD; either mode is safe. Never set DtdProcessing.Parse for untrusted input.
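A quick way to see the difference between the modes (the DTD URL below is a made-up placeholder):

```csharp
using System.IO;
using System.Xml;

var xml = "<!DOCTYPE feed SYSTEM \"http://evil.example/feed.dtd\"><feed/>";

// Prohibit (the XmlReaderSettings default) rejects any DTD outright.
var threw = false;
try
{
    using var r = XmlReader.Create(new StringReader(xml),
        new XmlReaderSettings { DtdProcessing = DtdProcessing.Prohibit });
    while (r.Read()) { }
}
catch (XmlException) { threw = true; }

// Ignore skips the DTD without resolving anything, and parsing continues.
using var ok = XmlReader.Create(new StringReader(xml),
    new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore });
var rootRead = false;
while (ok.Read())
    if (ok.NodeType == XmlNodeType.Element && ok.Name == "feed") rootRead = true;

Console.WriteLine($"{threw} {rootRead}");   // True True
```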

Small-file alternative — XDocument

For files small enough to fit comfortably in memory, LINQ-to-XML is much more readable:

using System.Xml.Linq;

var doc = XDocument.Load(path);

foreach (var el in doc.Root!.Elements("article"))
{
    graph.AddOrUpdate(new Article
    {
        Id          = (string) el.Attribute("id"),
        Title       = (string) el.Element("title"),
        Author      = (string) el.Element("author"),
        Body        = (string) el.Element("body"),
        PublishedAt = (DateTimeOffset) el.Element("publishedAt"),
    });
}

await graph.CommitPendingAsync();

The explicit casts handle missing children differently by type: (string) el.Element(...) returns null when the element is absent, while a cast to a non-nullable value type such as (DateTimeOffset) throws on a missing element. Use the nullable form, (DateTimeOffset?), when the field is optional.
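The cast semantics are easy to verify on a stub element (the document below is illustrative):

```csharp
using System.Xml.Linq;

var el = XElement.Parse("<article id=\"A-1\"><title>Hello</title></article>");

var title  = (string) el.Element("title");                 // "Hello"
var author = (string) el.Element("author");                // null: element is missing
var when   = (DateTimeOffset?) el.Element("publishedAt");  // null: nullable cast tolerates absence

// A non-nullable cast on a missing element throws ArgumentNullException:
// var boom = (DateTimeOffset) el.Element("publishedAt");

Console.WriteLine($"{title} {author is null} {when is null}");   // Hello True True
```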

Use XDocument when the file is under ~100 MB. Past that, switch back to XmlReader to avoid loading the whole DOM.
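There is also a middle path for large files: stream with XmlReader but materialize one <article> at a time as an XElement via XNode.ReadFrom, which gives each record LINQ-to-XML ergonomics without building the full DOM. A sketch, using an in-memory document for brevity (note that ReadFrom already advances the reader, so Read is only called when no element was just consumed):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static IEnumerable<XElement> StreamArticles(TextReader source)
{
    var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore };
    using var reader = XmlReader.Create(source, settings);
    reader.MoveToContent();                    // position on the root element
    reader.Read();                             // step to its first child
    while (!reader.EOF)
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "article")
            yield return (XElement) XNode.ReadFrom(reader);  // consumes one <article>
        else
            reader.Read();                     // skip whitespace and other nodes
    }
}

var xml = "<feed><article id=\"A-1\"/><article id=\"A-2\"/></feed>";
var ids = StreamArticles(new StringReader(xml))
          .Select(el => (string) el.Attribute("id"))
          .ToList();
Console.WriteLine(string.Join(",", ids));      // A-1,A-2
```

The same enumerator works over a FileStream for multi-gigabyte feeds; only one <article> is ever in memory at a time.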

Namespace-aware reading

Real-world feeds (RSS, Atom, OData) use XML namespaces. With XmlReader check LocalName instead of Name, and NamespaceURI to disambiguate. With XDocument use XName.Get("article", "http://example.com/feed") or the XNamespace + "article" idiom.

XNamespace ns = "http://example.com/feed";
foreach (var el in doc.Root!.Elements(ns + "article"))
{
    var title = (string) el.Element(ns + "title");
    // ...
}
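For completeness, the XmlReader side of namespace handling, sketched against the same hypothetical namespace:

```csharp
using System.IO;
using System.Xml;

var xml = "<f:feed xmlns:f=\"http://example.com/feed\"><f:article id=\"A-1\"/></f:feed>";
using var reader = XmlReader.Create(new StringReader(xml));

var found = 0;
while (reader.Read())
{
    // Name would be "f:article" (prefix included); compare LocalName + NamespaceURI.
    if (reader.NodeType == XmlNodeType.Element
        && reader.LocalName == "article"
        && reader.NamespaceURI == "http://example.com/feed")
        found++;
}
Console.WriteLine(found);   // 1
```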

Notes & pitfalls

  • XXE. Always set DtdProcessing = DtdProcessing.Ignore (or keep the XmlReaderSettings default, Prohibit) for untrusted XML. Setting Parse opens the door to denial-of-service and file-disclosure attacks via external entities.
  • Large files + LINQ. XDocument.Load loads the whole tree, and the in-memory DOM is typically several times the size of the file on disk. Use XmlReader instead.
  • Mixed content. If an element contains text and child elements (<p>Hello <b>world</b></p>), ReadElementContentAsString throws. Use ReadInnerXml to get the raw content or implement a recursive walk.
  • Encoding declarations. XmlReader.Create(path, ...) detects the encoding from the BOM and the XML prolog. If you open the file yourself with a StreamReader and guess the encoding wrong, characters are silently corrupted; let XmlReader handle the file directly.
  • CDATA. Treated as text content. ReadElementContentAsString strips the <![CDATA[…]]> wrapper for you.
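The mixed-content workaround from the list above, as a minimal sketch:

```csharp
using System.IO;
using System.Xml;

var xml = "<body><p>Hello <b>world</b></p></body>";
using var reader = XmlReader.Create(new StringReader(xml));
reader.ReadToFollowing("p");

// ReadElementContentAsString would throw here (the element has child elements);
// ReadInnerXml returns the raw markup between <p> and </p> instead.
var inner = reader.ReadInnerXml();
Console.WriteLine(inner);   // Hello <b>world</b>
```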

See also

© 2026 Curiosity. All rights reserved.