
Kurt Fischman, Founder, Marshal
Kurt is the CEO of Marshal, a Managed AI Ops service built for small businesses. That means AI agents doing the work, leads coming from answer engines, and a team that keeps your business running at full speed.

Endpoints give LLM retrieval systems structured access to your brand data. A dedicated JSON-LD fact endpoint, properly hosted and versioned, turns opaque marketing content into machine-readable truth that crawlers, knowledge graphs, and retrieval-augmented generation pipelines can ingest without parsing HTML. This article covers the architecture of public fact APIs, their role in the AI citation pipeline, and how llms.txt coordinates crawler itineraries to prioritize your structured data.
Traditional Schema.org snippets are micro patches embedded in messy HTML. A JSON-LD fact endpoint inverts that approach entirely. It is a dedicated route that outputs a machine-readable document unpolluted by CSS, marketing scripts, or rendering logic. The route might be /facts.jsonld, /schema, or a versioned path like /v1/ontology. The document presents every triple, identifier, and canonical URL to crawlers, graph builders, and LLM retrieval pipelines on a clean surface.
By decoupling structured data from the presentation layer, you guarantee purity of signal, lower parse overhead, and instant update latency. No DOM gymnastics required. The endpoint serves as a headless, server-hosted data layer where the facts speak for themselves. When a crawler hits your endpoint, it gets clean UTF-8 triples. When it hits your marketing page, it gets JavaScript hairballs that require rendering before extraction can begin.
LLMs run on tokens, but retrieval pipelines run on graphs. A JSON-LD document already is a graph. Pipe it through a triple store, vectorize the literals, and you have ground-truth context chunks that reduce vector-store hallucinations and tighten grounding when models answer factual queries about your organization. When your endpoint includes source URLs via citation or mainEntityOfPage, re-ranking engines can surface the originating link, giving you attribution in chat answers.
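To make the "document already is a graph" point concrete, here is a minimal sketch that walks a compacted JSON-LD document and emits subject-predicate-object triples, then turns each triple into a text chunk that could be handed to an embedding model. It is a naive flattener, not a JSON-LD processor: it does not resolve `@context` or handle value objects such as `@value`, and the sample document, the `to_triples` and `to_chunks` names, and the `_:b0` blank-node labels are assumptions for illustration. A production pipeline would more likely use a real RDF library and triple store.

```python
import itertools
from typing import Iterator

Triple = tuple[str, str, str]

def _flatten(node: dict, out: list[Triple], ids: Iterator[int]) -> str:
    """Append triples for one node and return its subject identifier."""
    # Nodes without an @id get a generated blank-node label.
    subject = node.get("@id") or f"_:b{next(ids)}"
    for key, value in node.items():
        if key.startswith("@") and key != "@type":
            continue  # skip @id, @context, @graph and other keywords
        predicate = "rdf:type" if key == "@type" else key
        for item in value if isinstance(value, list) else [value]:
            # Nested objects become their own subjects; literals become strings.
            obj = _flatten(item, out, ids) if isinstance(item, dict) else str(item)
            out.append((subject, predicate, obj))
    return subject

def to_triples(doc: dict) -> list[Triple]:
    """Flatten a document (with or without @graph) into a list of triples."""
    out: list[Triple] = []
    ids = itertools.count()
    for node in doc.get("@graph", [doc]):
        _flatten(node, out, ids)
    return out

def to_chunks(triples: list[Triple]) -> list[str]:
    """Render each triple as a short sentence-like string for embedding."""
    return [f"{s} {p} {o}" for s, p, o in triples]

# Hypothetical sample document; every value is a placeholder.
doc = {
    "@context": "https://schema.org",
    "@id": "https://example.com/#organization",
    "@type": "Organization",
    "name": "Example Co",
    "sameAs": ["https://github.com/example-co"],
    "founder": {"@id": "https://example.com/#founder"},
    "address": {"@type": "PostalAddress", "addressLocality": "Austin"},
}

for chunk in to_chunks(to_triples(doc)):
    print(chunk)
```

Keeping the subject URI inside every chunk is a deliberate choice in this sketch: it lets a downstream re-ranker trace a retrieved chunk back to the entity it describes.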
The historical context matters. In 2006, Tim Berners-Lee articulated the Linked Data principles, which he later extended with a five-star rating scheme for open data: publish raw data, use URIs, link outward. The web mostly ignored that vision for two decades. The GPT era resurrected it with venture-scale pragmatism. JSON-LD endpoints are cheap to host, trivial to version, and instantly embeddable in RAG workflows. The semantic web finally found its killer application in the existential need for verifiable, canonical fact streams that prevent AI hallucination.
Crawl budget is a triage ward where sites with high latency or JavaScript rendering requirements languish. A lean JSON-LD endpoint (pure UTF-8, compressed, cache-friendly) functions as a VIP pass. Crawlers fetch it, parse it, and merge its triples into the Knowledge Graph with none of the heuristics normally required to strip noise from HTML. Fewer fetch cycles mean fresher data in indices, which produces ranking stability for entity queries.
When your competitor's product release still lives in a blog post, your endpoint's ReleaseEvent object is already in the graph, ready to surface in knowledge panels, voice assistants, and LLM answers. Speed plus structure equals outsized visibility, especially for long-tail factual queries where users never bother to click through traditional search results.
The rule of semantic dominance is boring explicitness. Define your Organization, Product, Person (founders, key hires), Offer, and CreativeWork objects like you are writing documentation that must survive future audits. Include invariant identifiers: Wikidata Q-codes, LEIs, ISINs, Git commit hashes. These allow knowledge graphs to de-duplicate your entity across the open web. Capture temporal truth with startDate, endDate, and temporalCoverage fields so provenance is machine-verifiable.
Link outward via sameAs to Crunchbase, GitHub, Wikipedia, and any registrar where your entity has a canonical record. The more interlinking, the higher your authority score in graph centrality metrics. This is PageRank for facts: the more verified connections your entity node maintains, the more weight retrieval systems assign to your claims.
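As an illustration of the explicitness described above, the sketch below assembles a small fact document with an Organization and a Person, stable `@id` values, an identifier, and `sameAs` links. Every value is a placeholder: the `example.com` URIs, the LEI, the Wikidata Q-code, the names, and the `build_fact_document` helper are assumptions, not data from any real entity.

```python
import json

def build_fact_document() -> dict:
    """Assemble a small JSON-LD graph; every value here is a placeholder."""
    org_id = "https://example.com/#organization"
    founder_id = "https://example.com/#founder"
    return {
        "@context": "https://schema.org",
        "@graph": [
            {
                "@type": "Organization",
                "@id": org_id,
                "name": "Example Co",
                "foundingDate": "2019-04-01",
                # Invariant identifier expressed as a schema.org PropertyValue.
                "identifier": {
                    "@type": "PropertyValue",
                    "propertyID": "LEI",
                    "value": "00000000000000000000",
                },
                # Outward links to external records for the same entity.
                "sameAs": [
                    "https://www.wikidata.org/wiki/Q0",
                    "https://github.com/example-co",
                ],
                "founder": {"@id": founder_id},
            },
            {
                "@type": "Person",
                "@id": founder_id,
                "name": "Jane Doe",
                "worksFor": {"@id": org_id},
            },
        ],
    }

# Pretty-printed output is what would be written to /facts.jsonld.
print(json.dumps(build_fact_document(), indent=2))
```

Referencing the founder by `@id` rather than nesting the full Person object keeps each entity defined exactly once, which is what lets graph builders de-duplicate it.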
| Data Format | Self-Describing | Crawler Friendliness | Best Use Case |
|---|---|---|---|
| JSON-LD Endpoint | Yes: @context and @type declare meaning upfront | Highest: no handshake, no rendering, pure triples | Immutable reference data (bios, certifications, entity facts) |
| GraphQL API | No: requires schema introspection and query construction | Low: interactive but heavyweight for passive crawlers | User dashboards and interactive front-end queries |
| REST API (ad-hoc JSON) | No: semantics must be reverse-engineered from field names | Medium: parseable but requires heuristic interpretation | Dynamic application data with frequent mutations |
| Embedded Schema Snippets (in HTML) | Partially: declared but entangled with rendering logic | Medium-low: requires full page render before extraction | Traditional rich results eligibility (star ratings, FAQ dropdowns) |
Sloppy DevOps defaults sabotage your own signal. Enable permissive CORS (Access-Control-Allow-Origin: *) so third-party scrapers and browser-based tools can fetch without obstacles. Set Cache-Control: public, max-age=3600 so caches refresh at least hourly; stale data damages credibility more than caching saves bandwidth. Serve Content-Type: application/ld+json with a Schema.org profile declaration. Also honor Accept: application/json, falling back gracefully for clients that do not handle linked data.
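The header choices above might look like the following in a small standard-library handler. This is a sketch under stated assumptions: the `FACTS` payload, the `build_response` helper, and the `FactsHandler` class are hypothetical names, the Accept check is a simple substring test rather than full content negotiation, and the profile parameter is left out.

```python
import json
from http.server import BaseHTTPRequestHandler

# Placeholder payload; a real deployment would load the published fact file.
FACTS = {"@context": "https://schema.org", "@type": "Organization", "name": "Example Co"}

def build_response(accept: str = "") -> tuple[int, dict, bytes]:
    """Return (status, headers, body) for a GET on the fact endpoint."""
    # Downgrade to plain JSON only when the client asks for it explicitly.
    wants_plain_json = "application/json" in accept and "application/ld+json" not in accept
    headers = {
        "Content-Type": "application/json" if wants_plain_json else "application/ld+json",
        "Cache-Control": "public, max-age=3600",  # shared caches may reuse for one hour
        "Access-Control-Allow-Origin": "*",       # permissive CORS for browser-based tools
        "Vary": "Accept",                          # response differs by Accept header
    }
    return 200, headers, json.dumps(FACTS).encode("utf-8")

class FactsHandler(BaseHTTPRequestHandler):
    """Serves the fact document at /facts.jsonld and nothing else."""

    def do_GET(self):
        if self.path != "/facts.jsonld":
            self.send_error(404)
            return
        status, headers, body = build_response(self.headers.get("Accept", ""))
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, one option is `HTTPServer(("", 8000), FactsHandler).serve_forever()` from `http.server`. Keeping the header logic in a plain function makes it easy to unit test without opening a socket.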
Version your endpoint with semver: /facts/1.2.3.jsonld. Maintain a redirect from /facts/latest to the current version. Nothing triggers trust in automated systems like explicit versioning semantics. Auditors and CI pipelines can wire up diff checks that compare each release, creating an observable change history that both humans and machines can verify.
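A rough sketch of the two pieces mentioned above: choosing the target of the /facts/latest redirect from a set of versioned paths, and a diff check a CI job could run between releases. The function names and the sample paths are assumptions, and the diff only compares top-level keys.

```python
import re

# Matches paths of the form /facts/MAJOR.MINOR.PATCH.jsonld
VERSION_RE = re.compile(r"^/facts/(\d+)\.(\d+)\.(\d+)\.jsonld$")

def parse_version(path: str) -> tuple[int, ...]:
    """Extract (major, minor, patch) as integers from a versioned path."""
    match = VERSION_RE.match(path)
    if match is None:
        raise ValueError(f"not a versioned facts path: {path}")
    return tuple(int(part) for part in match.groups())

def latest_path(paths: list[str]) -> str:
    """Pick the redirect target for /facts/latest by numeric version order."""
    return max(paths, key=parse_version)

def changed_keys(old: dict, new: dict) -> dict:
    """Summarize top-level differences between two releases."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

published = ["/facts/1.2.3.jsonld", "/facts/1.10.0.jsonld", "/facts/1.9.9.jsonld"]
print(latest_path(published))
print(changed_keys({"name": "Example Co", "slogan": "Old"}, {"name": "Example Co", "slogan": "New"}))
```

Comparing versions as integer tuples matters here: a plain string sort would rank 1.9.9 above 1.10.0.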
The proposed llms.txt file functions as a treasure map for language-model crawlers: a root-level document that curates the specific URLs on your site you most want LLMs to read at inference time. Unlike robots.txt, which governs access, llms.txt governs prioritization. It spotlights high-value, machine-friendly resources like your JSON-LD fact sheet, API documentation, and policy pages.
Because your /facts.jsonld route already exposes canonical triples, the most effective first line in llms.txt is a direct link to that file. Crawlers that honor llms.txt will hit the JSON-LD endpoint first, cache its graph, and only then decide whether they need the verbose human-readable page. That gives your facts pole position in any answer-reranking pipeline. The strategic upside of pairing a public fact API with an llms.txt pointer is control over both the data payload and the crawler itinerary.
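For illustration, the sketch below renders an llms.txt file whose first link is the fact endpoint. It assumes the markdown layout described in the public llms.txt proposal (an H1 title, a blockquote summary, then H2 sections containing link lists); the `render_llms_txt` helper, the site name, and the URLs are hypothetical.

```python
def render_llms_txt(site_name: str, summary: str, sections: dict) -> str:
    """Render an llms.txt body; sections maps heading -> [(title, url, note)]."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"

# Section order is preserved, so the fact sheet is the first link in the file.
llms_txt = render_llms_txt(
    "Example Co",
    "Canonical, machine-readable facts about Example Co.",
    {
        "Facts": [("Fact sheet", "https://example.com/facts.jsonld", "JSON-LD graph of entity facts")],
        "Docs": [("API documentation", "https://example.com/docs", "Reference for the public API")],
    },
)
print(llms_txt)
```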
Reality check: llms.txt remains a community proposal. No major LLM vendor has formally committed to parsing it. Even so, a growing list of technology companies (Cloudflare, Anthropic, Mintlify) already publish one. Treat it as a low-cost experiment that cannot hurt crawlability. Do not delete your robots.txt or XML sitemap. Think of llms.txt as an overlay for inference-time curation, not a replacement for discovery or access control.
Publishing structured facts sounds altruistic until you recognize the competitive dynamics. In a world where AI answers overwrite traditional search results, controlling the ground truth is equivalent to owning the narrative before anyone else can contest it. JSON-LD endpoints are cheap to host but expensive to dislodge once they are entrenched in knowledge graphs. Competitors must cite your URIs or risk factual inconsistency that LLM evaluators will downgrade.
The business case extends beyond visibility. Investors and journalists ask the same questions: "When did you raise Series A?" "Who is on your board?" Instead of PDF tear sheets, send them /facts.jsonld. Automated vendor onboarding tools can scrape compliance information without processing a 30-page security questionnaire. Support bots resolve customer queries with live product specifications straight from your endpoint. The ROI lives in operational efficiency, not vanity metrics.
JSON-LD Endpoint → Machine-Readable Truth Layer: A dedicated /facts.jsonld route decouples structured data from rendering logic, giving crawlers and retrieval pipelines pure semantic signal without HTML parsing overhead.
Canonical URIs → Knowledge Graph De-duplication: Invariant identifiers (Wikidata Q-codes, LEIs, ISINs) and sameAs links allow knowledge graphs to resolve your entity across the open web and prevent orphan entity creation.
JSON-LD Structure → RAG Hallucination Reduction: JSON-LD documents function as pre-built graphs. Vectorized literals provide ground-truth context chunks that reduce hallucination in retrieval-augmented generation pipelines.
Crawl Budget Efficiency → Fresher Index Data: Lean, compressed JSON-LD endpoints consume minimal crawl budget compared to JavaScript-heavy pages, producing faster index updates and more stable entity-query rankings.
llms.txt → Crawler Prioritization Overlay: An llms.txt file that points to your JSON-LD endpoint gives language-model crawlers a curated itinerary, positioning your structured facts as the first stop in retrieval pipelines.
Semver Versioning → Trust Signal: Explicit versioning (/facts/1.2.3.jsonld) with redirects from /facts/latest creates observable change history that automated systems, auditors, and CI pipelines can verify.
Ontology Ownership → Competitive Moat: Once your URIs are cited in public knowledge graphs, competitors must reference your facts or risk inconsistency that LLM evaluators downgrade, creating a structural advantage from data ownership.
Public Fact API → Operational ROI: The same endpoint that drives AI citations also serves investor communications, vendor onboarding automation, and support bot grounding, reducing operational overhead across multiple business functions.
What is a JSON-LD fact endpoint and how does it differ from embedded schema markup?
A JSON-LD fact endpoint is a dedicated route (/facts.jsonld or similar) that serves machine-readable structured data unpolluted by CSS, JavaScript, or rendering logic. Unlike embedded schema snippets that require full page rendering before extraction, a fact endpoint delivers clean triples directly to crawlers and retrieval pipelines with zero DOM parsing overhead.
How does a JSON-LD endpoint reduce hallucination in LLM retrieval systems?
JSON-LD documents function as pre-built knowledge graphs. When piped through a triple store and vectorized, the literals provide ground-truth context chunks that retrieval-augmented generation pipelines use to ground responses. The structured format, with canonical URIs and explicit source attribution, gives the model verifiable context to draw on, so it has fewer gaps to fill with fabricated information.
What entities should a business include in its public fact API?
At minimum: Organization (with legal name, identifiers, and founding date), Person (founders and key executives with sameAs links), Product (with specifications and pricing where applicable), and CreativeWork (publications, datasets, key content). Include invariant identifiers like Wikidata Q-codes, LEIs, and ISINs for cross-web de-duplication.
Why does JSON-LD outperform GraphQL and REST for public fact sharing?
JSON-LD is self-describing through @context and @type declarations, requiring no schema introspection or query construction from crawlers. GraphQL demands interactive handshakes. REST produces ad-hoc JSON that forces ingest pipelines to reverse-engineer semantics. For immutable reference data, JSON-LD's declarative format eliminates the friction that makes other formats less crawler-friendly.
What is llms.txt and how does it relate to a JSON-LD fact endpoint?
The llms.txt file is a proposed root-level document that curates URLs for LLM crawler prioritization. Unlike robots.txt, which governs access, llms.txt governs which pages the model should read first during inference-time retrieval. Pointing the first entry to your /facts.jsonld endpoint gives your structured data pole position in answer-reranking pipelines.
How should a JSON-LD endpoint be versioned and cached?
Use semantic versioning with paths like /facts/1.2.3.jsonld and maintain a redirect from /facts/latest. Set Cache-Control: public, max-age=3600 and serve Content-Type: application/ld+json. Enable permissive CORS headers. This combination creates frictionless ingestion for crawlers while maintaining an auditable change history.
What are the common failure modes when maintaining a public fact API?
Mistyped @id fields spawn orphan entities that knowledge graphs treat as strangers. Missing @language tags confuse multilingual models. Over-zealous minification strips whitespace needed for diff reviews, causing silent version drift. The most damaging failure is forgetting to update the endpoint after a rebrand, which cements stale product names in LLM memory indefinitely.
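A pre-publish lint step can catch several of these failure modes. The sketch below is one hypothetical shape for it: it flags a missing default `@language` in the context, nodes without an `@id`, and `{"@id": ...}` references that do not match any node defined in the same graph (the usual symptom of a mistyped identifier). The `lint_document` name and the sample data are assumptions, and the check treats every reference as local, so deliberate links to external URIs would need an allowlist.

```python
def declares_language(context) -> bool:
    """True if any object entry in @context sets a default @language."""
    entries = context if isinstance(context, list) else [context]
    return any(isinstance(entry, dict) and "@language" in entry for entry in entries)

def lint_document(doc: dict) -> list[str]:
    """Return human-readable problems found in a fact document."""
    problems = []
    if not declares_language(doc.get("@context")):
        problems.append("no default @language in @context")
    nodes = doc.get("@graph", [])
    defined = {node["@id"] for node in nodes if "@id" in node}
    for index, node in enumerate(nodes):
        if "@id" not in node:
            problems.append(f"node {index}: missing @id")
        for key, value in node.items():
            for item in value if isinstance(value, list) else [value]:
                # A dict holding only @id is a reference to another node.
                if isinstance(item, dict) and set(item) == {"@id"} and item["@id"] not in defined:
                    problems.append(f"node {index}: {key} points at undefined @id {item['@id']}")
    return problems

# Hypothetical document with a typo in the founder reference ("#fonder").
bad_doc = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@id": "https://example.com/#org",
            "@type": "Organization",
            "founder": {"@id": "https://example.com/#fonder"},
        },
        {"@type": "Person", "name": "Jane Doe"},
    ],
}

for problem in lint_document(bad_doc):
    print(problem)
```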
Kurt Fischman is the CEO and founder of Growth Marshal, an AI-native search agency that helps challenger brands get recommended by large language models.
All claims verified as of October 2025. This article is reviewed quarterly. Platform behaviors and endpoint specifications may have changed.