Parquet / Avro recipe
Source: ParquetSample/ · columnar batch files (Parquet or Avro), auto-detected by extension.
Owns in the academic graph: course grades — Grade (composite key <student>/<course>/<term>), linked to Student, Course, Subject, Term.
What it teaches
- Column projection — declare exactly the columns you need; the reader skips the rest. Often a 5–10× speedup on wide files.
- Row-group streaming — bounded memory regardless of file size.
- A format abstraction (`IColumnarSource` returning `ColumnarRow` dictionaries) with two implementations: `ParquetSource` (Parquet.Net) and `AvroSource` (Apache.Avro).
- Auto-detection of the source format from the file extension.
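The abstraction and extension-based detection listed above might look roughly like the sketch below. The member names on `IColumnarSource` and `ColumnarRow`, the `SourceFormat` enum, the `FormatDetector` helper, and the `.parquet`/`.avro` extension mapping are assumptions made for illustration; the recipe's actual types may differ. `InMemorySource` is a hypothetical stand-in for the real readers, so the sketch runs without Parquet.Net or Apache.Avro.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical row type: a thin wrapper over a column-name -> value dictionary.
public sealed class ColumnarRow
{
    private readonly IReadOnlyDictionary<string, object?> _values;
    public ColumnarRow(IReadOnlyDictionary<string, object?> values) => _values = values;

    public bool Has(string name) => _values.ContainsKey(name);

    // Returns default(T) when the column is absent or holds a different type.
    public T? Get<T>(string name) =>
        _values.TryGetValue(name, out var v) && v is T typed ? typed : default;
}

// Hypothetical shape of the format abstraction.
public interface IColumnarSource
{
    // Lazily yields rows restricted to the requested columns.
    IEnumerable<ColumnarRow> Read(IReadOnlyCollection<string> columns);
}

public enum SourceFormat { Parquet, Avro }

public static class FormatDetector
{
    // Picks the format from the file extension, ignoring case.
    public static SourceFormat FromPath(string path) =>
        Path.GetExtension(path).ToLowerInvariant() switch
        {
            ".parquet" => SourceFormat.Parquet,
            ".avro" => SourceFormat.Avro,
            var ext => throw new NotSupportedException($"Unrecognized extension '{ext}'."),
        };
}

// Stand-in implementation used only to exercise the interface.
public sealed class InMemorySource : IColumnarSource
{
    private readonly IEnumerable<IReadOnlyDictionary<string, object?>> _rows;
    public InMemorySource(IEnumerable<IReadOnlyDictionary<string, object?>> rows) => _rows = rows;

    public IEnumerable<ColumnarRow> Read(IReadOnlyCollection<string> columns)
    {
        foreach (var row in _rows) // one row at a time, never the whole input
        {
            // Keep only the projected columns.
            var projected = row.Where(kv => columns.Contains(kv.Key))
                               .ToDictionary(kv => kv.Key, kv => kv.Value);
            yield return new ColumnarRow(projected);
        }
    }
}
```

Because callers depend only on `IColumnarSource`, the ingestion code stays the same whichever reader the detected format selects.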
Column projection + typed read
public static readonly string[] Columns = new[]
{
    "student_id", "course_code", "subject", "term",
    "letter_grade", "gpa_points", "credit_hours",
};

public static void Ingest(Graph graph, ColumnarRow row)
{
    // Missing string columns fall back to empty strings.
    var studentId = row.Get<string>("student_id") ?? string.Empty;
    var courseCode = row.Get<string>("course_code") ?? string.Empty;
    var termName = row.Get<string>("term") ?? string.Empty;
    var letter = row.Get<string>("letter_grade") ?? string.Empty;
    var gpaPoints = row.Get<double>("gpa_points");
    // Assumption: credit_hours is stored as a double, like gpa_points.
    var credits = row.Get<double>("credit_hours");

    // Composite key: <student>/<course>/<term>.
    var gradeKey = $"{studentId}/{courseCode}/{termName}";
    var grade = graph.AddOrUpdate(new Nodes.Grade
    {
        Id = gradeKey,
        Letter = letter,
        GpaPoints = gpaPoints,
        CreditHours = credits,
    });

    // Link the grade to its student in both directions.
    var student = graph.TryAdd(new Nodes.Student { Id = studentId });
    graph.Link(student, grade, Edges.Received, Edges.ReceivedBy);
}
Configuration
| Variable | Purpose | Default |
|---|---|---|
| `RECIPE_DATA_PATH` | Parquet or Avro file path | `data/grades.parquet` |
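A minimal sketch of how the path could be resolved, assuming `RECIPE_DATA_PATH` is a process environment variable. The `RecipeConfig` class and `ResolveDataPath` method are hypothetical names, and treating a blank value as unset is a choice made here, not something the recipe states.

```csharp
using System;

public static class RecipeConfig
{
    // Default taken from the configuration table.
    public const string DefaultDataPath = "data/grades.parquet";

    // Returns the configured path, or the default when the variable is unset or blank.
    public static string ResolveDataPath()
    {
        var value = Environment.GetEnvironmentVariable("RECIPE_DATA_PATH");
        return string.IsNullOrWhiteSpace(value) ? DefaultDataPath : value;
    }
}
```

Pointing the variable at a `.avro` file is then enough to switch formats, since the format is detected from the extension.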
Reuse notes
- For Parquet, set the row-group size on the producer side (~100k rows) for predictable I/O.
- For Avro, prefer schema-evolution-aware reads when your schema changes over time.
- `row.Get<T>(name)` returns `default(T)` if a column is missing; validate at ingestion when columns are not guaranteed to exist.
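One way to do that validation is to compare the projected column list against the columns the file actually exposes before ingesting any rows. Otherwise a missing `gpa_points` column would silently read as `0.0`. The sketch below assumes the available column names can be obtained from the source as plain strings, which the recipe does not show. `ColumnGuard` and its methods are hypothetical helpers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnGuard
{
    // Returns the required columns that are absent from the available set, in request order.
    public static IReadOnlyList<string> Missing(
        IEnumerable<string> required, IEnumerable<string> available)
    {
        var present = new HashSet<string>(available, StringComparer.Ordinal);
        return required.Where(c => !present.Contains(c)).ToList();
    }

    // Fails fast with a message naming every missing column.
    public static void EnsurePresent(
        IEnumerable<string> required, IEnumerable<string> available)
    {
        var missing = Missing(required, available);
        if (missing.Count > 0)
        {
            throw new InvalidOperationException(
                "Missing required columns: " + string.Join(", ", missing));
        }
    }
}
```

Calling `ColumnGuard.EnsurePresent(Columns, availableColumns)` once per file, before the first row is ingested, keeps the check out of the per-row path.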