Building a Production Neurosymbolic Pipeline for Scientific Discourse Graphs

LLMs are competent at reading scientific papers. They are not yet competent at producing knowledge graphs that downstream systems can trust. The output is plausible, well-formatted, and frequently wrong in ways that only become visible after the graph is queried, joined to other data, or used to train another model.

Through HADI Technology, I worked with Fylo on the ingestion pipeline that turns scientific papers into a typed discourse graph of claims, evidence, hypotheses, and research questions. The interesting engineering was not the LLM extraction. It was the symbolic scaffolding around it that turned a noisy generative process into a graph reliable enough to ship.

This post covers the three design decisions that mattered most: a ShEx schema as executable contract, a phased extraction pipeline that closes the validator-LLM loop, and a three-tier merge policy that handles cross-document identity correctly. The graph neural network and infrastructure architecture behind the related striff.io code review system follows a similar neurosymbolic staging pattern, described in a companion infrastructure post.


Schema as Executable Contract

Discourse graphs need a stricter schema than most domain models. Nodes carry semantic role (claim, evidence, hypothesis), strict cardinality on outgoing edges, and IRI references that must resolve within the same graph. An Association claim missing its arg0 and arg1 factors is structurally meaningless. A schema that lets these through is documentation, not validation.

We chose ShEx (Shape Expressions) over JSON Schema or pydantic because ShEx natively expresses graph-level constraints: cardinality across edges, IRI references, shape disjunctions. The schema lives in a single .shex file, is machine-checkable through pyshex, and is the single source of truth that everything else derives from.
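
To make the contract concrete, here is a minimal ShEx shape in the spirit of the one described, expressing the Association constraint from above. The namespace, shape names, and property names are illustrative, not the actual Fylo schema:

```shex
PREFIX :    <https://fylo.example/discourse#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# An Association is meaningless without both argument factors:
# exactly one arg0 and exactly one arg1 (ShEx's default cardinality).
:AssociationShape {
  :arg0       @:FactorShape ;
  :arg1       @:FactorShape ;
  :confidence xsd:decimal ;
  :evidence   @:EvidenceShape *   # any number of supporting evidence nodes
}
```

The `@:FactorShape` references are what JSON Schema cannot express: they require the edge target to exist in the graph and itself conform to a shape.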

The schema participates in two places. A curated excerpt is injected into the LLM system prompt so the model sees the constraints before generating. After extraction, pyshex’s ShExEvaluator runs against the produced subgraph and rejects nodes that violate their shape definitions. The prompt-time injection biases the model toward valid output; the post-extraction validation catches the cases where the bias was insufficient.

Edge confidence scores are clamped to [0, 1] and rounded at the validator boundary. Different LLM calls produce different confidence ranges, and without normalization at this point, downstream analytics build on numerically unstable signals.
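
The boundary normalization is simple but worth pinning down. A minimal sketch, with the function name and the treatment of unparseable scores assumed:

```python
def normalize_confidence(raw, ndigits=3):
    """Clamp an LLM-reported confidence into [0, 1] and round it,
    so downstream analytics see a stable numeric range."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 0.0  # unparseable scores are treated as no-confidence
    return round(min(1.0, max(0.0, value)), ndigits)
```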

ShEx handles structural constraints well but struggles with logical constraints across the graph, the kind that say “a hypothesis cannot be confirmed unless at least two non-overlapping evidence lines support it.” The architecture I would build today layers ShEx as the cheap structural gate that runs every pass, with Z3 as the more expensive semantic gate that runs at canonicalization time.
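
The "two non-overlapping evidence lines" rule would compile into Z3 assertions in that architecture; as a plain-Python sketch of the same semantic gate (the `stance` and `doc_id` field names are assumptions, and "non-overlapping" is approximated as "distinct source documents"):

```python
def hypothesis_confirmable(evidence_lines):
    """Semantic-gate sketch: a hypothesis may be marked confirmed only
    if at least two supporting evidence lines come from non-overlapping
    sources, approximated here as distinct source documents."""
    supporting = [e for e in evidence_lines if e.get("stance") == "supports"]
    distinct_docs = {e["doc_id"] for e in supporting}
    return len(distinct_docs) >= 2
```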


Closing the Validator-LLM Loop

The standard approach has the LLM extract once and the validator either pass or fail the whole batch. The problem is that LLMs produce confident but malformed output, and re-running the same prompt with the same document reproduces the same errors.

The pipeline I shipped at Fylo decomposes a single chunk into three sequential extraction phases, each with a specialized prompt and scope. Phase 1 extracts entities only, with the prompt explicitly forbidding edge calls. This produces a clean inventory that can be validated against node-level shapes before any relational reasoning happens. Phase 2 extracts relations only, with the Phase 1 entities passed back as context. New nodes are permitted only when an edge would otherwise be dangling. Phase 3 repairs orphans: any node that ended up disconnected gets a final pass where the LLM is shown the dangling nodes specifically and asked to either form connecting associations or accept that they should drop.

This decomposition matters more than the prompts themselves. Phase 1 outputs are validated before Phase 2 starts, so Phase 2 produces edges guaranteed to reference real, validated nodes. Phase 3 addresses the failure mode where graphs become a pile of disconnected entities, by giving the model an explicit second chance with focused context.
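
The control flow is more important than any single prompt. A sketch of the orchestration, where `llm` and `validate_nodes` are hypothetical callables standing in for the real prompt templates and the ShEx node-shape validator:

```python
def extract_chunk(chunk, llm, validate_nodes):
    """Three-phase extraction sketch for a single document chunk."""
    # Phase 1: entities only; validated before any relational reasoning.
    entities = validate_nodes(llm("entities_only", chunk))

    # Phase 2: relations only, with the validated entities as context,
    # so every edge references a real, validated node ID.
    edges = llm("relations_only", chunk, context=entities)

    # Phase 3: orphan repair -- show the model the disconnected nodes
    # specifically and ask it to connect or drop them.
    connected = {e["src"] for e in edges} | {e["dst"] for e in edges}
    orphans = [n for n in entities if n["id"] not in connected]
    if orphans:
        edges += llm("repair_orphans", chunk, context=orphans)
    return entities, edges
```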

The regex-based LLM output parser had to go. Regex cannot reliably handle nested parentheses inside string literals, and the old parser silently dropped well-formed extractions whenever a string contained an unescaped quote. The replacement uses ast.literal_eval rather than eval for safety, and ships with full test coverage.
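
The core of the replacement parser fits in a few lines. The tuple-per-line output format shown here is an assumption for illustration, not the actual Fylo wire format:

```python
import ast

def parse_extraction_line(line):
    """Parse one LLM output line as a Python literal, e.g. a tuple like
    ('association', 'Protein X', 'inflammation', 0.85).

    ast.literal_eval is a real parser, so nested parentheses and quotes
    inside string literals are handled correctly, and -- unlike eval --
    it cannot execute arbitrary code."""
    try:
        value = ast.literal_eval(line.strip())
    except (ValueError, SyntaxError):
        return None  # malformed line: surface for repair, don't crash
    return value if isinstance(value, tuple) else None
```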

The other change that mattered was tightening the schema prompt with explicit cardinality language. The Phase 1 prompt for Association nodes makes it unambiguous: every Association must include both an arg0 and an arg1 factor edge, or it will fail validation. This is the prompt-engineering equivalent of writing good error messages: tell the model exactly what will go wrong before it goes wrong.


Cross-Document Canonicalization

The obvious approach to cross-document deduplication is one similarity threshold and one merge action. This breaks immediately on a real schema. The right behaviour for a Factor node like “Protein X” is aggressive merging across documents because the same protein in two papers is the same protein. The right behaviour for an Association claim is the opposite. Two papers asserting that “Protein X reduces inflammation” might be expressing the same association or two subtly different claims with different evidence. Merging them collapses signal you cannot recover.

The pipeline classifies node types into three merge policies. Aggressive types (factors, variables of interest, study designs, statistical methods) go through direct deduplication. When a new node arrives, its embedding is compared against role-filtered ANN candidates from the existing graph, and if the top candidate clears the threshold, the two nodes merge. Provenance is unioned, embeddings are weight-averaged based on accumulated confidence, and the lower-confidence node’s ID is rewritten across all incident edges.
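
The merge step itself can be sketched compactly. Field names are illustrative, and `confidence` here is the accumulated merge weight rather than a clamped edge score:

```python
def merge_nodes(survivor, duplicate):
    """Aggressive-merge sketch: union provenance and take a
    confidence-weighted average of the two embeddings."""
    w1, w2 = survivor["confidence"], duplicate["confidence"]
    total = w1 + w2
    survivor["embedding"] = [
        (w1 * a + w2 * b) / total
        for a, b in zip(survivor["embedding"], duplicate["embedding"])
    ]
    survivor["provenance"] = sorted(
        set(survivor["provenance"]) | set(duplicate["provenance"])
    )
    survivor["confidence"] = total  # accumulated weight biases future merges
    return survivor
```

The edge-rewriting step (repointing the duplicate's incident edges at the survivor) is omitted here; in practice it runs in the same transaction as the merge.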

Cautious types (associations, hypotheses, research questions) use canonicalization rather than direct merging. Once enough cautious nodes of a given type accumulate, a synthetic canonical node is created with a CANON_ prefix. Subsequent nodes either attach to the canonical via a mentions edge, or trigger creation of a new canonical when they represent a distinct cluster. The originals stay in the graph with their citations and confidence scores intact. The canonical becomes the queryable hub; the originals remain the evidence behind it. A small set of structural types are never merged at all, because doing so would lose per-paper provenance.
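
A minimal sketch of the attach-or-seed decision for cautious types, assuming a `similarity` callable over node embeddings (the clustering in production is more involved than a single threshold):

```python
def canonicalize(nodes, similarity, threshold=0.85):
    """Cautious-type canonicalization sketch: originals stay intact; each
    node attaches to a CANON_ hub via a 'mentions' edge, or seeds a new
    hub when no existing hub is similar enough."""
    canonicals, edges = [], []
    for node in nodes:
        best = max(canonicals, key=lambda c: similarity(node, c), default=None)
        if best is None or similarity(node, best) < threshold:
            # Seed a new canonical hub from this node's content.
            best = dict(node, id=f"CANON_{len(canonicals)}")
            canonicals.append(best)
        edges.append({"src": node["id"], "rel": "mentions", "dst": best["id"]})
    return canonicals, edges
```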

Role-filtered ANN search matters as much as the policy itself. A FAISS HNSW index returns candidates fast, but the candidate list will mix node types unless you filter at the search boundary. We restrict candidates to the same type as the query node before similarity scoring runs. Embedding maintenance under merge is the other non-obvious piece: the survivor’s embedding is a weight-averaged combination of the source vectors, upserted back into the ANN index so subsequent comparisons see the consolidated state.
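
One simple way to enforce the filter at the search boundary is one index per node type. A brute-force sketch of that structure (a real deployment would back each bucket with a FAISS HNSW index rather than linear cosine scan):

```python
import math

class RoleFilteredIndex:
    """Per-type index buckets, so candidate lists never mix node types."""

    def __init__(self):
        self._buckets = {}  # node_type -> {node_id: vector}

    def upsert(self, node_type, node_id, vector):
        """Insert or overwrite, e.g. after a merge updates the survivor."""
        self._buckets.setdefault(node_type, {})[node_id] = vector

    def search(self, node_type, query, k=5):
        """Top-k cosine-similar nodes of the SAME type as the query."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))
        bucket = self._buckets.get(node_type, {})
        scored = sorted(((cosine(query, v), nid) for nid, v in bucket.items()),
                        reverse=True)
        return scored[:k]
```

Because `upsert` overwrites in place, writing the weight-averaged survivor embedding back keeps subsequent searches consistent with the consolidated graph state.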

The test of cross-document canonicalization is whether the graph converges. A pipeline that adds papers and produces a graph that grows without becoming more connected is not building a knowledge graph.


Making the Symbolic Layer Visible

Symbolic system improvements are abstract until you can see the graph. Telling a stakeholder that edge confidence scores are now properly normalized, or that cautious-type canonicalization is reducing duplicate hypotheses, communicates nothing. The same change rendered as a side-by-side visual diff communicates immediately.

FyloVisualizer, a D3.js force-directed graph viewer I built to close that loop, renders the complete discourse graph with 167 nodes across 4 types, edges weighted by confidence, hover for metadata and ontology tags, and filtering by semantic role. It loads graph JSON directly from the extraction pipeline output, with no manual curation.

This pattern recurs across my work. The enriched SVG diagrams striff.io renders for code architecture diffs serve the same purpose: they translate backend ML quality changes into something a non-technical user can immediately judge. In a neurosymbolic system where most of the engineering is invisible, the visualization layer is part of the production debugging surface.


What I Would Change and What I Would Keep

A schema is not documentation; it is an executable contract that participates in both prompting and validation. LLMs converge toward valid output when you close the feedback loop with phase-specific prompts that match the structure of the schema, not when you re-prompt with the same instructions.

The neurosymbolic staging pattern generalizes to any domain where structured output requirements meet a model that cannot quite hit them alone: code analysis, scientific discourse graphs, legal document analysis, regulatory compliance.

© 2026 Muntazir Fadhel. All rights reserved.