Content Provenance Glossary

Definitions for 42+ terms in the content provenance ecosystem: C2PA, cryptographic watermarking, Merkle tree authentication, EU AI Act, willful infringement, and more.

Maintained by Encypher - authors of Section A.7 of the C2PA 2.3 specification and co-chairs of the C2PA Text Provenance Task Force.

A

Article 50 (EU AI Act)

Article 50 of the EU AI Act imposes transparency obligations on providers of AI systems deployed in consumer-facing applications - including chatbots and text generators. It requires that AI-generated content be identifiable as such. The EU AI Act entered into force in August 2024; the Article 50 obligations apply from August 2, 2026. The specific technical standard for machine-readable marking is addressed in Article 52.

Article 52 (EU AI Act)

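Article 52 of the EU AI Act, as numbered in earlier drafts of the regulation, requires providers of AI systems that generate images, audio, and video to ensure that outputs are marked as AI-generated in a machine-readable format. In the final published text (Regulation (EU) 2024/1689), this marking obligation appears in Article 50(2). The deadline for compliance is August 2, 2026. Fines for non-compliance can reach EUR 15 million or 3% of global annual turnover, whichever is higher. C2PA manifests with the appropriate digital source type field satisfy the machine-readable marking requirement. This marking requirement complements the Article 50 transparency obligations for consumer-facing AI systems such as chatbots and text generators.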

Audio Provenance

Audio provenance is the cryptographic record of an audio file's origin, creation method, and modification history. C2PA manifests are embedded in audio file containers for WAV, MP3, AAC, FLAC, AIFF, and M4A formats. Audio provenance identifies whether audio was recorded by a human, generated by an AI voice synthesis system, or edited from a source recording. AI-generated audio marked with the appropriate C2PA digital source type field satisfies EU AI Act Article 52 requirements.

C

C2PA (Coalition for Content Provenance and Authenticity)

C2PA is the standards body that publishes the open technical specification for digital content provenance. Founded in 2021 by Adobe, Arm, BBC, Intel, Microsoft, and Truepic, C2PA now has over 200 member organizations including Google, Reuters, OpenAI, Sony, and Encypher. The organization operates under the Joint Development Foundation. C2PA is not a product or platform - it defines how content provenance manifests are structured (using JUMBF containers and COSE signatures), embedded in 31 MIME types, and independently verified. Encypher authored Section A.7 of C2PA 2.3 (text provenance). Erik Svilich co-chairs the C2PA Text Provenance Task Force.

C2PA Manifest

A C2PA manifest is the data structure embedded in a content file that records its provenance. It contains: one or more claims (the signed provenance records), the signer's X.509 certificate chain, assertion data about creation and modification, ingredient references to source files, and any access rules. The manifest is embedded in a JUMBF container for binary media types. For text, it is encoded using the methods defined in Section A.7. The manifest is self-contained: verification does not require querying an external database.

C2PA 2.3

C2PA 2.3 is the current version of the C2PA specification, published on January 8, 2026. It introduced Section A.7, which defines the text provenance framework - authored by Encypher - and made several improvements to the manifest structure and ingredient chain handling. Prior versions include C2PA 1.0 (2022), C2PA 1.3 (2023), and C2PA 2.0 (2024). The specification is freely available at spec.c2pa.org.

Coalition Licensing

Coalition licensing is a model in which multiple content rights holders - such as publishers - join a collective licensing agreement managed by an intermediary. An AI company signs one agreement with the coalition and gains access to content from all member publishers, without negotiating with each publisher individually. Encypher's publisher coalition uses this model: publishers set their licensing tier (Bronze, Silver, Gold) and Encypher handles the licensing relationship with AI companies. Revenue flows back to publishers based on content usage.

Content Authenticity

Content authenticity is the property of a piece of content being verifiably what it claims to be - created by whom it claims, at the time it claims, without unauthorized modification. Content authenticity is established through cryptographic mechanisms: digital signatures whose verification fails if any bit of the content changes. This distinguishes content authenticity from content integrity (has the file been corrupted?) and content quality (is the content accurate?). Content authenticity is the core property that C2PA manifests establish.

Content Authenticity Initiative (CAI)

The Content Authenticity Initiative (CAI) is an industry coalition operated by Adobe that promotes awareness and adoption of C2PA provenance standards among publishers, journalists, and consumers. CAI provides developer tools, the Content Credentials browser extension (which shows provenance badges on supported content), and educational resources. CAI and C2PA work in parallel: C2PA defines the technical standard; CAI promotes its adoption. Membership in CAI does not require membership in C2PA, and vice versa.

Content Licensing

Content licensing is the legal agreement by which a rights holder grants another party permission to use their content under defined terms. In the context of AI, content licensing covers the use of publisher content for AI training, retrieval-augmented generation (RAG), indexing, and attribution in AI outputs. Machine-readable rights terms embedded in content provenance manifests specify the licensing tier (such as Bronze for indexing, Silver for RAG, Gold for training) so that AI systems can read and respect those terms automatically.
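
The automatic tier check described above could be sketched roughly as follows. The tier-to-use mapping mirrors the examples in the text; the layout of the `rights` dictionary and the `is_use_permitted` helper are hypothetical and do not represent an Encypher or C2PA schema. The sketch also assumes each tier grants only the use named for it, since the text does not say whether higher tiers include lower-tier uses.

```python
# Hypothetical mapping taken from the examples in the text. Whether a higher
# tier also includes lower-tier uses is not stated, so each tier grants only
# its named use here.
TIER_USES = {
    "bronze": {"indexing"},
    "silver": {"rag"},
    "gold": {"training"},
}


def is_use_permitted(rights: dict, intended_use: str) -> bool:
    """Return True if the embedded licensing tier covers the intended use."""
    tier = str(rights.get("licensing_tier", "")).lower()
    return intended_use.lower() in TIER_USES.get(tier, set())


# Illustrative rights record, as an AI pipeline might read it from a manifest.
rights = {"rights_holder": "Example Publisher", "licensing_tier": "Silver"}
print(is_use_permitted(rights, "rag"))       # True
print(is_use_permitted(rights, "training"))  # False
```

An unknown or missing tier permits nothing in this sketch, which is the conservative default for a system that is supposed to respect rights terms.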

Content Provenance

Content provenance is the cryptographic record of a piece of content's origin, authorship, creation method, and modification history. It is embedded directly into the content - not stored in a separate database - so the record travels with the content wherever it goes. The C2PA open standard defines how content provenance manifests are structured and verified. Content provenance is deterministic: verification either confirms or fails. It is not a detection or classification system. Encypher implements content provenance for 31 media types including text (via Section A.7), images, audio, and video.

COSE (CBOR Object Signing and Encryption)

COSE is the cryptographic signing standard used in C2PA manifests. It is the CBOR-encoded equivalent of JOSE (JSON Object Signing and Encryption), using compact binary encoding for efficiency in embedded contexts. A COSE signature in a C2PA manifest covers the hash of the claim data. If the claim is altered after signing, the signature verification fails. The COSE structure includes the algorithm identifier (typically ECDSA with P-256), the signed payload, and the certificate chain needed for verification.
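
The hash-then-sign behavior described above can be illustrated with the Python standard library alone. This is a sketch under loud assumptions: HMAC-SHA256 stands in for the asymmetric ECDSA signature, JSON stands in for CBOR, and `DEMO_KEY`, `claim_digest`, `sign`, and `verify` are hypothetical names. It is not a COSE implementation; it only shows why altering a claim after signing makes verification fail.

```python
import hashlib
import hmac
import json

# Stand-in key: real C2PA signing uses an asymmetric key pair
# (for example ECDSA with P-256) and an X.509 certificate chain.
DEMO_KEY = b"demo-key-not-for-production"


def claim_digest(claim: dict) -> bytes:
    """Hash a canonical serialization of the claim (JSON here; C2PA uses CBOR)."""
    canonical = json.dumps(claim, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).digest()


def sign(claim: dict) -> bytes:
    """Stand-in 'signature' over the claim digest (HMAC, not ECDSA)."""
    return hmac.new(DEMO_KEY, claim_digest(claim), hashlib.sha256).digest()


def verify(claim: dict, signature: bytes) -> bool:
    """Recompute the digest and compare against the stored signature."""
    return hmac.compare_digest(sign(claim), signature)


claim = {"title": "Example article", "creator": "Example Publisher"}
signature = sign(claim)
print(verify(claim, signature))     # True

tampered = dict(claim, creator="Someone Else")
print(verify(tampered, signature))  # False
```

With a real asymmetric signature, verification would use only the signer's public key from the certificate chain, so a verifier could check the claim without being able to forge one.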

Cryptographic Watermarking

Cryptographic watermarking embeds a cryptographically signed record of content origin directly into the content using a method that survives normal distribution. Unlike statistical watermarking, the result is deterministic: verification either succeeds or fails with certainty. No false positives are possible. For images, audio, and video, Encypher uses C2PA JUMBF container embedding. For text, Encypher uses proprietary invisible encoding. The watermark contains who created the content, when, what rights apply, and a signature that proves the record has not been altered.

D

Digital Source Type

Digital source type is a field in the C2PA claim that identifies how content was produced. The controlled vocabulary includes values such as trainedAlgorithmicMedia (AI-generated content), compositeWithTrainedAlgorithmicMedia (human-AI hybrid), and digitalCapture (camera photograph). Setting the appropriate digital source type is how C2PA satisfies EU AI Act Article 52 machine-readable marking requirements: AI-generated content is unambiguously marked as such in the manifest that travels with the content.
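
As a rough illustration of how software might read this field, the sketch below checks a simplified, already-decoded manifest for an AI-related digital source type. The dictionary layout and the `involves_ai` helper are assumptions made for the example, not the output format of any particular C2PA library; the vocabulary terms are the ones named above, shown with the IPTC NewsCodes URI prefix.

```python
IPTC_PREFIX = "http://cv.iptc.org/newscodes/digitalsourcetype/"

# Terms from the text that indicate AI involvement.
AI_SOURCE_TYPES = {
    "trainedAlgorithmicMedia",
    "compositeWithTrainedAlgorithmicMedia",
}


def source_type_term(value: str) -> str:
    """Strip the IPTC vocabulary prefix, if present, leaving the bare term."""
    return value[len(IPTC_PREFIX):] if value.startswith(IPTC_PREFIX) else value


def involves_ai(manifest: dict) -> bool:
    """True if any recorded action carries an AI-related digital source type."""
    for assertion in manifest.get("assertions", []):
        for action in assertion.get("data", {}).get("actions", []):
            term = source_type_term(action.get("digitalSourceType", ""))
            if term in AI_SOURCE_TYPES:
                return True
    return False


# Simplified, hypothetical manifest structure for illustration only.
generated = {
    "assertions": [
        {
            "label": "c2pa.actions",
            "data": {
                "actions": [
                    {
                        "action": "c2pa.created",
                        "digitalSourceType": IPTC_PREFIX + "trainedAlgorithmicMedia",
                    }
                ]
            },
        }
    ]
}
photo = {
    "assertions": [
        {
            "label": "c2pa.actions",
            "data": {
                "actions": [
                    {"action": "c2pa.created", "digitalSourceType": IPTC_PREFIX + "digitalCapture"}
                ]
            },
        }
    ]
}
print(involves_ai(generated))  # True
print(involves_ai(photo))      # False
```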

Document-Level Signing

Document-level signing authenticates a content file as a whole. A C2PA manifest with a document-level signature proves that the entire document was authored by the claimed party, at the claimed time, and has not been modified since signing. This is the core capability of the C2PA standard. Document-level signing is distinct from sentence-level attribution (Encypher's proprietary technology), which proves that specific sentences within a document came from a specific source.

E

EU AI Act

The EU AI Act is the European Union's comprehensive regulation of artificial intelligence systems, enacted in 2024. It applies a tiered framework based on risk level. For content provenance, the key provisions are Article 50 (transparency for consumer-facing AI systems) and Article 52 (machine-readable marking of AI-generated images, audio, and video). The Act entered into force in August 2024, and these transparency obligations apply from August 2, 2026. The Act applies to any provider placing AI systems on the EU market, regardless of where the provider is based.

F

Fingerprinting

Content fingerprinting identifies content by generating a compact representation (fingerprint or hash) of its characteristics and comparing it against a reference database. Perceptual hashing (pHash) for images and acoustic fingerprinting for audio are common approaches. Fingerprinting is a lookup system: it tells you whether a piece of content matches something in the database, not who created it or when. It requires a populated database and cannot encode rights terms. Content provenance via C2PA is a signing system - self-contained, offline, and not dependent on a database match.

Formal Notice

Formal notice, in copyright law, is documentation that places an infringer on notice that a work is protected and that specific rights terms apply. Formal notice is relevant because US copyright law permits higher statutory damages - up to $150,000 per work - when infringement is willful. Willfulness requires that the infringer had notice. A C2PA manifest with embedded rights terms constitutes formal notice: every copy of the content carries the rights information in a machine-readable format. An AI system that processes signed content and proceeds without a license cannot claim ignorance of the rights terms.

I

Image Provenance

Image provenance is the cryptographic record of an image's origin, creator, creation tool, and modification history. The C2PA standard supports 13 image MIME types: JPEG, PNG, WebP, TIFF, HEIC, HEIF, AVIF, GIF, SVG, BMP, DNG, JPEG 2000, and JPEG XL. Manifests are embedded in the image file using JUMBF containers. Image provenance is used by camera manufacturers (Leica, Nikon), AI image generators (Adobe Firefly, DALL-E), and news organizations to establish the authenticity and origin of photographs and generated images.

Ingredient Chain

An ingredient chain is the provenance record of all source files used to produce a piece of content. In C2PA terminology, an "ingredient" is a piece of content that was used as input to create another piece of content. The manifest of the derived content references the manifests of its ingredients, creating a traceable chain from the final output back to all source materials. For AI-generated images, the ingredient chain might reference the model used, training images if tracked, and any source photographs used in the generation process.

Innocent Infringement

Innocent infringement is copyright infringement where the infringer had no reason to know that the work was protected or that their use was unauthorized. US copyright law (17 U.S.C. section 504(c)(2)) permits a court to reduce statutory damages to as little as $200 per work when the infringer had no notice. The distinction between innocent and willful infringement is why machine-readable rights terms embedded in content provenance manifests matter: they establish notice, converting potential innocent infringement into willful infringement with substantially higher damages exposure.

Invisible Embedding

Invisible embedding is the technique of hiding data within content without visible impact on how it appears to readers or viewers. For text, Encypher uses proprietary encoding that is invisible in standard rendering. For images, C2PA manifests are embedded in the binary file container, not in the visible pixel data. For audio, manifests are stored in the file header or a sidecar chunk. Invisible embedding distinguishes watermarking from metadata fields (which are visible in file inspector tools but not in content viewers) and from markup (which alters the content's visual presentation).

IPTC (International Press Telecommunications Council)

IPTC is the standards body for news media metadata, including the widely used IPTC Photo Metadata Standard and XMP schemas used in image files. IPTC metadata fields such as creator, copyright notice, and rights usage terms are stored in an image file's IIM or XMP metadata blocks. Unlike C2PA manifests, IPTC metadata is not cryptographically signed and can be stripped or altered without detection. C2PA is designed to complement IPTC: IPTC provides rich editorial metadata, while C2PA provides the cryptographic proof that the metadata has not been tampered with.

J

JUMBF (JPEG Universal Metadata Box Format)

JUMBF is the container format used to embed C2PA manifests in binary media files. It builds on the box structure of the ISO Base Media File Format (ISOBMFF), the length-and-type box layout used in JPEG 2000, MP4, and other formats. JUMBF supports nested boxes, allowing a manifest to contain multiple claims, each with its own signature. The container is part of the file but ignored by applications that do not implement C2PA. JUMBF embedding is the standard approach for images, audio, and video. For text, Section A.7 defines alternative encoding methods.
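
The nested box layout can be sketched with a minimal box walker. This is an illustration of the generic length-plus-type box header only, not a JUMBF or C2PA parser: the payloads are placeholder bytes, the helpers `make_box` and `iter_boxes` are hypothetical, and the special length values 0 (box runs to end of data) and 1 (extended 64-bit length) are deliberately not handled.

```python
import struct


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Build a box: 4-byte big-endian length (header included) + 4-byte type."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


def iter_boxes(data: bytes):
    """Yield (type, payload) for each box found at the top level of data."""
    offset = 0
    while offset + 8 <= len(data):
        length, box_type = struct.unpack_from(">I4s", data, offset)
        # Lengths 0 and 1 have special meanings that this sketch does not support.
        if length < 8 or offset + length > len(data):
            raise ValueError("malformed or unsupported box length")
        yield box_type.decode("ascii"), data[offset + 8 : offset + length]
        offset += length


# A superbox ("jumb") holding a description box ("jumd") and one content box.
# The payload bytes are placeholders, not valid JUMBF content.
inner = make_box(b"jumd", b"placeholder-description") + make_box(b"cbor", b"\xa0")
superbox = make_box(b"jumb", inner)

for box_type, payload in iter_boxes(superbox):
    print(box_type)                                      # jumb
    print([child for child, _ in iter_boxes(payload)])   # ['jumd', 'cbor']
```

Because every box announces its own length, a reader that does not recognize a box type can skip it, which is how non-C2PA applications ignore the container.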

L

Live Stream Provenance

Live stream provenance is an emerging C2PA use case for establishing the authenticity of real-time video and audio broadcasts. Unlike recorded media, live streams present the challenge of embedding provenance before distribution begins and maintaining chain of custody across transcoding and CDN distribution. C2PA has defined approaches for frame-level signing in live streams, used by broadcasters to authenticate news footage as it is transmitted. This area of the standard is actively developing as of C2PA 2.3.

M

Machine-Readable Rights

Machine-readable rights are rights terms embedded in content in a structured format that software systems can read and interpret automatically, without human review. In the C2PA manifest, rights terms are stored as assertions that specify: who holds the rights, what uses are permitted (indexing, RAG, training), what restrictions apply, and how to contact the rights holder for licensing. The EU AI Act requires AI providers to respect machine-readable rights reservations - the same requirement that robots.txt addresses for web crawling, but for content that has already been distributed.

Merkle Tree Authentication

Merkle tree authentication is a cryptographic technique that uses a hierarchical structure of hashes to efficiently verify the integrity of a dataset. In content provenance, Merkle trees enable verification of individual content segments without needing the entire document. Encypher uses Merkle tree authentication for sentence-level attribution, enabling cryptographic proof that a specific sentence belongs to a specific document and has not been altered.
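
A generic Merkle inclusion proof over sentences can be sketched as below. This illustrates the textbook technique, not Encypher's implementation: the hash function (SHA-256), the leaf and node prefixes, the rule of duplicating the last node on odd-sized levels, and the function names are all assumptions chosen for the example.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Return all tree levels, from leaf hashes up to the single root."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [h(b"\x00" + leaf) for leaf in leaves]  # prefix separates leaves from nodes
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level = level + [level[-1]]
        level = [h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def inclusion_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from one leaf and its proof; no other leaves needed."""
    node = h(b"\x00" + leaf)
    for sibling, sibling_is_left in proof:
        node = h(b"\x01" + sibling + node) if sibling_is_left else h(b"\x01" + node + sibling)
    return node == root


sentences = [b"First sentence.", b"Second sentence.", b"Third sentence."]
levels = build_levels(sentences)
root = levels[-1][0]  # in a provenance system, this root is what gets signed

proof = inclusion_proof(levels, 2)
print(verify(b"Third sentence.", proof, root))    # True
print(verify(b"Third sentence!", proof, root))    # False
```

The verifier needs only the sentence, a proof whose length grows with the logarithm of the number of sentences, and the signed root, which is why a single segment can be checked without the rest of the document.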

P

Perceptual Hash (pHash)

A perceptual hash (pHash) is a fingerprint of an image or audio file based on its perceptual characteristics - how it looks or sounds to humans - rather than its exact bytes. Similar images produce similar pHashes, so a perceptual hash can be used to find near-duplicate images even after resizing, compression, or minor edits. Perceptual hashing is used in copyright enforcement systems (such as Content ID on YouTube) to detect reused media. pHash is a lookup tool: it identifies matches against a database. It does not prove who created the content, when, or what rights apply.
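
The idea that similar images yield similar hashes can be shown with an average hash, one of the simplest perceptual hashes (it is not the DCT-based pHash algorithm). The 4x4 grayscale grids below are made-up data; a real pipeline would first decode and downscale the image, which is omitted here.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Average hash: one bit per pixel, set when the pixel is above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


original = [
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
]
brightened = [[value + 20 for value in row] for row in original]
inverted = [[255 - value for value in row] for row in original]

print(hamming_distance(average_hash(original), average_hash(brightened)))  # 0
print(hamming_distance(average_hash(original), average_hash(inverted)))    # 16
```

A small Hamming distance signals a likely match against a reference hash, but as the entry notes, the result says nothing about who created the content or what rights apply.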

Provenance Chain

A provenance chain is the complete history of a piece of content from its original creation through all transformations, edits, and uses. In C2PA, the provenance chain is built from ingredient references: each manifest references the manifests of its source materials, creating a traceable chain. For AI-generated content, the provenance chain might start with training data attribution, continue through the model used for generation, and end with the specific output. Provenance chain completeness is limited by what was signed: unsigned source materials appear as gaps in the chain.
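
The way unsigned sources show up as gaps can be sketched by walking ingredient references. The `Asset` class and `find_gaps` function are hypothetical stand-ins for illustration; a real implementation would read ingredient references from validated manifests and would need to guard against cyclic references, which this sketch does not.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Hypothetical stand-in for a piece of content and its manifest status."""
    name: str
    signed: bool
    ingredients: list["Asset"] = field(default_factory=list)


def find_gaps(asset: Asset, path: tuple[str, ...] = ()) -> list[str]:
    """Walk ingredient references depth-first and report unsigned assets."""
    here = path + (asset.name,)
    gaps = [] if asset.signed else [" > ".join(here)]
    for ingredient in asset.ingredients:
        gaps.extend(find_gaps(ingredient, here))
    return gaps


photo = Asset("photo.jpg", signed=True)
stock = Asset("stock.png", signed=False)  # no manifest: a gap in the chain
composite = Asset("composite.psd", signed=True, ingredients=[photo, stock])
final = Asset("final.jpg", signed=True, ingredients=[composite])

print(find_gaps(final))  # ['final.jpg > composite.psd > stock.png']
```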

Provenance Markers

Provenance markers is Encypher's general term for the invisible encoding used to embed content provenance data in text. This includes variation selector markers (VS markers, the default encoding method) and ZWC markers (an alternative method optimized for compatibility with Microsoft Word). Provenance markers are invisible to readers and carry the signed provenance manifest payload within the text body. The term provenance markers is preferred over "watermark" in technical documentation to distinguish from both visible watermarks and statistical watermarking.

Q

Quote Integrity Verification

Quote integrity verification is the capability to check whether a sentence or passage that appears in an AI output matches the signed original in the source document. When an AI system generates a response citing a publisher's article, quote integrity verification confirms that the quoted text has not been altered, hallucinated, or misattributed. Encypher's sentence-level Merkle tree technology enables this check: the hash of the quoted sentence is verified against the signed tree from the source document. This capability protects publishers from brand damage caused by AI hallucinations that misquote their content.

R

robots.txt

The robots.txt file is a standard web protocol that instructs web crawlers which pages or content they should not crawl. AI-specific signals extend this idea: publishers can disallow named AI crawler user agents in robots.txt, and the noai and noimageai directives (typically set in robots meta tags or HTTP headers) indicate that content should not be used for AI training. robots.txt applies at crawl time, on the publisher's server. C2PA provenance is fundamentally different: it is embedded in the content itself and travels wherever the content goes, including content already in AI training corpora before any crawl directive existed. robots.txt and C2PA provenance are complementary, not competing - one controls access, the other proves ownership.
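
Crawl-time control of this kind can be checked with Python's standard `urllib.robotparser` module. The robots.txt contents and URLs below are illustrative; `GPTBot` is used as an example of an AI crawler user agent and `SomeSearchBot` is a made-up name.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block one AI crawler entirely, and keep
# everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/articles/1"))         # False
print(parser.can_fetch("SomeSearchBot", "https://example.com/articles/1"))  # True
print(parser.can_fetch("SomeSearchBot", "https://example.com/private/x"))   # False
```

The check only has effect if the crawler chooses to run it before fetching, and it says nothing about copies of the article that already exist elsewhere, which is the gap that embedded provenance addresses.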

S

Section A.7 (C2PA 2.3)

Section A.7 of the C2PA 2.3 specification defines how provenance manifests are embedded in unstructured text content - articles, social posts, and any text-based material. It defines three encoding approaches: VS marker encoding (invisible inline embedding), sidecar manifests (a separate manifest file accompanying the text), and remote references (a URL pointing to a manifest stored externally). Section A.7 was authored by Encypher and published as part of C2PA 2.3 on January 8, 2026. The specification is available at spec.c2pa.org.

Self-Service Licensing

Self-service licensing is a licensing model where content users can obtain a license without direct negotiation with the rights holder. In the context of publisher-AI licensing, self-service licensing means an AI company can access licensed content from a publisher coalition through a single API agreement, with usage tracked and revenue distributed automatically. Encypher's coalition licensing model is a self-service approach: no per-publisher negotiations, no manual rights clearance, and machine-readable terms enforced programmatically.

Sentence-Level Attribution

Sentence-level attribution uses Encypher's proprietary cryptographic technology to prove that a specific sentence came from a specific source and has not been altered. This enables the legal claim "sentence 47 of this article was used in this AI output" with cryptographic certainty. C2PA authenticates documents as a whole; sentence-level attribution is Encypher's proprietary extension. Sentence-level attribution is patent-pending.

Statistical Watermarking

Statistical watermarking embeds imperceptible patterns in content - text, images, or audio - and detects them using a trained model or algorithm. The detection output is a probability score: the content is X% likely to be watermarked. SynthID (Google DeepMind) is the most widely known implementation. Statistical watermarking differs from cryptographic watermarking in a fundamental way: it cannot prove who created content, when, or what rights apply. It can only indicate likelihood. False positives - human-created content identified as watermarked - are a significant practical limitation. Statistical watermarks are also fragile: paraphrasing text or recompressing images can reduce or eliminate the watermark signal.

T

TDM Reservation

Text and Data Mining (TDM) reservation is a rights declaration that a content creator or publisher does not permit their content to be used for automated text and data mining, including AI training. The EU Copyright Directive (Directive (EU) 2019/790, Article 4(3)) established that rights holders can reserve TDM rights using machine-readable means. Crawler opt-out signals, such as robots.txt rules and the noai directive, are one TDM reservation mechanism. C2PA manifests with embedded rights terms are another - one that travels with the content rather than residing on the publisher's server.

Text Provenance

Text provenance is the cryptographic record of a text document's origin, authorship, and modification history. The C2PA 2.3 specification Section A.7 - authored by Encypher - defines the framework for text provenance. Unlike images and video, which have binary containers for manifest embedding, text requires special encoding techniques. Encypher's primary approach uses VS markers embedded invisibly within text to carry the manifest payload. Text provenance enables publishers to prove authorship of articles, claims, and original research - and to detect when AI systems have used their content without a license.

V

Variation Selector Markers (VS Markers)

Variation selector markers (VS markers) are Encypher's default encoding method for embedding C2PA provenance data invisibly within text. VS markers are undetectable to readers, survive copy-paste and digital distribution, and enable cryptographic verification of text content. They are the default encoding in Encypher's API and are C2PA-aligned.
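
The general idea of carrying bytes in Unicode variation selectors can be sketched as follows. This is not Encypher's encoding, which the text does not specify; it is a generic illustration that maps each byte to one of the 256 variation selector code points (U+FE00 to U+FE0F and U+E0100 to U+E01EF) and attaches them after the first visible character. All function names are hypothetical, and the sketch would misread text that already contains variation selectors, such as some emoji sequences.

```python
from typing import Optional


def byte_to_vs(b: int) -> str:
    """Map a byte value to one of the 256 Unicode variation selectors."""
    return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16)


def vs_to_byte(ch: str) -> Optional[int]:
    """Inverse mapping; returns None for ordinary characters."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def embed(text: str, payload: bytes) -> str:
    """Attach the payload as variation selectors after the first character."""
    if not text:
        raise ValueError("need at least one visible character to carry the payload")
    return text[0] + "".join(byte_to_vs(b) for b in payload) + text[1:]


def extract(text: str) -> bytes:
    """Recover the payload bytes from any variation selectors in the text."""
    return bytes(b for b in map(vs_to_byte, text) if b is not None)


def strip_markers(text: str) -> str:
    """Return the text with all variation selectors removed."""
    return "".join(ch for ch in text if vs_to_byte(ch) is None)


marked = embed("Signed text", b"ok")
print(extract(marked))                         # b'ok'
print(strip_markers(marked) == "Signed text")  # True
print(len(marked) - len("Signed text"))        # 2
```

In a provenance system the payload would be a signed manifest rather than two bytes, and whether the selectors survive a given editor or platform is something that has to be tested per environment.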

Video Provenance

Video provenance is the cryptographic record of a video file's origin, creation method, and modification history. The C2PA standard supports four video MIME types: MP4, MOV, M4V, and MKV. Manifests are embedded in the video container. For AI-generated video (deepfakes, synthetic video), provenance marks the content as AI-generated with the appropriate digital source type field, satisfying EU AI Act Article 52 requirements. Live stream provenance - for real-time video broadcast authenticity - is an emerging area of C2PA development.

W

Willful Infringement

Willful infringement is copyright infringement where the infringer knew, or had reason to know, that their conduct was infringing. US copyright law (17 U.S.C. section 504(c)(2)) provides for enhanced statutory damages of up to $150,000 per work for willful infringement, compared to up to $30,000 for standard infringement and as little as $200 for innocent infringement. Proving willfulness requires showing that the infringer had notice. C2PA manifests with embedded rights terms constitute formal notice that travels with every copy of the content. The legal significance: an AI system that processes content with embedded rights terms and proceeds without a license is not an innocent infringer.
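
The per-work figures above scale with the number of works involved. The sketch below is illustrative arithmetic using only the dollar amounts quoted in this entry; the `exposure` helper and the 500-work example are hypothetical, and actual awards are set by a court within the statutory ranges.

```python
# Per-work statutory damages figures quoted in the text (USD).
INNOCENT_FLOOR = 200
STANDARD_CEILING = 30_000
WILLFUL_CEILING = 150_000


def exposure(works: int) -> dict[str, int]:
    """Multiply the quoted per-work figures by the number of works."""
    return {
        "innocent_floor": works * INNOCENT_FLOOR,
        "standard_ceiling": works * STANDARD_CEILING,
        "willful_ceiling": works * WILLFUL_CEILING,
    }


print(exposure(500))
# {'innocent_floor': 100000, 'standard_ceiling': 15000000, 'willful_ceiling': 75000000}
```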

Z

ZWC Markers (Zero-Width Character Markers)

ZWC markers are Encypher's alternative encoding method optimized for Microsoft Word and Office document workflows. They use a set of Unicode characters specifically chosen for stability in Word's text processing pipeline. ZWC markers are selected via the Encypher API when Word compatibility is required.