People make decisions every day that involve risk and uncertainty. Generally, we reconcile a variety of decision models using risk criteria provided by organizational policies or guided by personal belief and trust systems. Often we must address ambiguous situations in uncertain terms and with uncertainty about the information at hand. And, more often than not, we are thrust into situations where we must dynamically decide whom to trust, what to trust, how much to trust, and when to trust.

The underlying question for our decision-making is whether there are adequate primitives in our base trust systems to underpin our social and business interactions in constructive and positive ways – interactions that are bombarded incessantly with information – some false, some true, some unknown. What can we rely on to separate fact from fiction? Can we rely on technology, on friends, on professional reputation, on laws to enable trustworthy interaction within a single community or webspace, and across different communities and webspaces?

One of the key factors affecting our understanding of risk is “Where does the data come from that we work with and depend on; and, can we trust the originator of this data?” Our perceptions of the source of the information we use are greatly affected by our belief systems and, in turn, shape whether we rely on that information as trustworthy. For example, stockholders and investors may trust a company’s financial statements audited by a certified public accounting firm to a much higher degree than stock tips provided by Internet spammers. Establishing the source and history of an object is referred to as provenance. According to Wikipedia, “the primary purpose of provenance is to confirm the time, place, and if appropriate the person responsible, for the creation, production or discovery of the object. Comparative techniques, expert opinions, written and verbal records and the results of various kinds of scientific tests are often used to help establish provenance.” There are many types of objects where provenance plays an increasingly important role. Some examples include:

• Where does this software module come from? – Software provenance
• Where is this cyberattack coming from? – Network provenance and attack attribution
• What is the origin of this web service? – Service provenance
• What is the pedigree of this content? – Information provenance
• What is the source of this food poisoning? – Product provenance

The provenance of data products generated by complex transformations such as workflows is of considerable value. From it, one can ascertain the quality of the data based on its ancestral data and derivations, trace back the sources of errors, automatically re-enact derivations to update data, and attribute data sources. Provenance is also essential because it can be used to drill down to the source of data in a data warehouse, track the creation of intellectual property, and provide an audit trail for regulatory purposes.
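
To make this concrete, here is a minimal Python sketch of how such lineage might be recorded and walked; the record fields and names are illustrative only, not drawn from any particular provenance standard:

```python
from dataclasses import dataclass, field

# A minimal, hypothetical provenance record for a workflow-derived data
# product. Field names are illustrative, not from any standard.
@dataclass
class ProvenanceRecord:
    artifact_id: str                              # the derived data product
    process: str                                  # the transformation that produced it
    inputs: list = field(default_factory=list)    # ancestral artifact ids
    agent: str = ""                               # who or what ran the process

def ancestry(records: dict, artifact_id: str) -> set:
    """Walk the provenance graph back to all ancestral artifacts,
    e.g. to trace a data error to its original source."""
    seen = set()
    stack = [artifact_id]
    while stack:
        rec = records.get(stack.pop())
        if rec is None:
            continue
        for parent in rec.inputs:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

records = {
    "report.csv": ProvenanceRecord("report.csv", "aggregate", ["clean.csv"], "etl-bot"),
    "clean.csv":  ProvenanceRecord("clean.csv", "deduplicate", ["raw.csv"], "etl-bot"),
}
print(ancestry(records, "report.csv"))  # {'clean.csv', 'raw.csv'}
```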

The importance of provenance seems to increase in proportion to the degree of risk involved; and the risk increases with the value of the object and the impact that the object may have on the value chains that comprise the production, distribution, and consumption of the object. Given that similar data or even the same data object may be delivered by different information providers, it sometimes becomes important to know the specifics of a provenance – its actors, artifacts, processes, roles, and causalities. Why is this detail important? In most cases, because one of these information providers may be less trustworthy than another.

There has been quite a bit of research and development activity over the last few years, driven by the threats of misinformation, fake news, and deep fakes. These threats pose significant potential damage to the reputations of organizations and individuals. In an economy where 70% to 80% of market value comes from hard-to-assess intangible assets such as brand equity, intellectual capital, and goodwill, organizations are especially vulnerable to anything that damages their reputations. Sometimes these threats reflect a contest of philosophies or political views in which the winner has the best AI/ML technology – deep fakes involve a cat-and-mouse game that uses machine learning to generate the fake and machine learning to detect it once it appears. Other times these threats are aimed at monetary gain through deception. All of these threats to authentic content are hard to detect, eliminate, and contain, especially given the torrent of content flowing across the Internet. They impose a critical need to ensure transparency, understanding, and ultimately, trust in content. People want to know upfront that what they’re seeing online is authentic. People can be misled when they don’t know who has altered content and what has been changed. The ability to provide proper content attribution for creators, editors, and publishers is also critical to proactively ensure the trust and authenticity of online content.

The need for authenticity of digital content has provided the impetus for R&D efforts focused on making digital content “authentic-by-design.” These efforts cover topics and technology such as decentralized systems and identities, provenance, blockchain distributed ledgers, distributed file systems and content-addressable storage, secure metadata, privacy and zero-knowledge proofs, data compression algorithms, artificial intelligence, computer vision, ontologies, and roots of trust. What is really interesting is that some of the leading efforts aim to place the determination of what is authentic content not solely in the hands of big corporate platform providers, big media publishers, or even “professional fact-checkers,” but instead to distribute mechanisms for transparent content publication and guarantees of authenticity across the Internet – rooted in a provenance chain extending from a camera phone’s trusted execution environment (TEE) to distributed ledger technology (DLT) implementations at the edge. That is not to say that the large platform providers and big media aren’t in the mix of this research activity as well, but the underlying philosophies of the R&D efforts point towards a more democratic web – one where the means for creating and publishing digital content, and for discerning what is authentic, are in the hands of the individual creator, publisher, and consumer (with transparency and authenticity buttressed by trusted technology). This democratic form of publication and consumption of digital content helps to avoid suppression or distortion of content by a few large [social] media providers and enables a grassroots, verifiable approach to authentic content.

Three significant R&D efforts that are underway and that support this approach include the Content Authenticity Initiative (CAI), the Starling Framework for Data Integrity, and extensions to the Semantic Web.

The focus of CAI, led by Adobe and now supported by a variety of media publishers and technology companies, is to develop an open industry standard and an ecosystem of supporters for digital content attribution. Most attribution information is embedded in the metadata of assets via long-established standards such as EXIF and XMP. However, most assets appear on the Web without this information intact. Content moderators, fact-checkers and end-users are left to reconstruct context through imperfect and inefficient methods. CAI intends to provide a layer of tamper-evident attribution and history data built upon XMP, Schema.org and other metadata standards. This attribution information will be bound to the assets it describes. Content with attribution exposes indicators of authenticity so that consumers can have awareness of who has altered content and what exactly has been changed. This ability to provide content attribution for creators, publishers and consumers is essential to engender trust online. At the same time, it is critically important that those same content creators be able to protect their privacy when necessary. Any solution attempting to restore trust must be globally viable across technology contexts and minimize opportunities to cause unintended harms or risks. It must also have freedom of creative expression in media production at its core. Adobe intends to include a version of CAI’s standard for attribution in its Photoshop and other content production products it offers. An early demonstration capability of how CAI works in action is based on Qualcomm’s Snapdragon chip and Truepic’s trusted image verification technology and can be found here. W3C and ISO are the target standards organizations for the initiative’s specification products.

CAI is being designed to balance ease of use with security against tampering, along with strong links to identity through the use of digital signatures and hashing. Identity can be that of an individual, where prudent, or that of the trusted cryptographic signing entity as a proxy. That is, in many cases the individual who created or edited an asset will not be the holder of the signing certificate. Instead, the signing certificate belongs to the hardware or software actor (e.g., Truepic Camera, Adobe Photoshop, BBC) that performed the actions on behalf of someone else. This model allows CAI to provide anonymity (and/or pseudonymity) where desired. For scenarios where the certificate holder is able to reliably establish the identity of the individual, and the individual wishes their identity associated with an asset, an identity assertion is used. Decentralized identities (DIDs) are one type of identity token that will be supported by CAI’s standard for attribution. Learn more about DIDs in another recent blog by ActiveCyber.net here.
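
To illustrate the signing model in the abstract, here is a minimal Python sketch (using the open-source cryptography package) of the general pattern: attribution data bound to an asset by a hash and signed by the acting tool’s key rather than the individual’s. The claim fields and the “Example Photo App” signer are hypothetical stand-ins, not the actual CAI data model:

```python
import hashlib, json
from cryptography.hazmat.primitives.asymmetric import ed25519

# Stand-in for the raw bytes of a captured image.
asset_bytes = b"example image bytes"

# A claim binds attribution data to the asset via its hash. These field
# names are illustrative only, not the actual CAI data model.
claim = {
    "asset_hash": hashlib.sha256(asset_bytes).hexdigest(),
    "actions": ["captured", "cropped"],
    "signer": "Example Photo App",   # the software actor signing as a proxy
    "identity_assertion": None,      # optionally, the creator's DID if disclosed
}
claim_bytes = json.dumps(claim, sort_keys=True).encode()

# In practice the key would live in the tool's certificate or a TEE;
# here we generate one just for the sketch.
signing_key = ed25519.Ed25519PrivateKey.generate()
signature = signing_key.sign(claim_bytes)

# A verifier recomputes the hash and checks the signature with the
# signer's public key; verify() raises InvalidSignature on tampering.
signing_key.public_key().verify(signature, claim_bytes)
print("claim verified")
```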

CAI calls out a special but logical caveat – an asset with valid CAI information does not imply anything about the trustworthiness of the asset’s content. Specifically, the CAI attribution model does not prevent a malicious user from stripping all of the CAI data (claims and assertions) from an asset and then adding new claims representing themselves as the originator. Similarly, the “analog hole” or “rebroadcast attack” – common methods for subverting provenance systems by capturing an image of a photograph or computer screen – is not addressed directly by the model. However, according to CAI, there are some solutions that can be implemented in concert with the CAI model to achieve resilience against intentional misuse:

  • An actor could use watermarking technology to durably embed information (either perceptibly or imperceptibly) about the asset’s current claim. The watermark could subsequently be used to recover provenance data (a toy sketch follows this list).
  • A camera device or software could utilize depth mapping to capture scene information (as CAI assertions) which would indicate whether a photograph depicts a 3D scene or a rebroadcast photo of a photo.
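
As a toy illustration of the watermarking idea, the Python sketch below embeds a claim identifier in the least-significant bits of raw pixel bytes so it can later be recovered. Production watermarks are far more sophisticated and survive compression, scaling, and rebroadcast; this sketch does not:

```python
# Toy least-significant-bit watermark: embeds a short claim identifier
# into raw pixel bytes so provenance data can later be recovered.
def embed(pixels: bytearray, payload: bytes) -> bytearray:
    bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
    assert len(bits) <= len(pixels), "image too small for payload"
    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit   # overwrite the lowest bit
    return out

def extract(pixels: bytearray, n_bytes: int) -> bytes:
    bits = [p & 1 for p in pixels[: n_bytes * 8]]
    return bytes(
        sum(bits[i * 8 + j] << j for j in range(8)) for i in range(n_bytes)
    )

stego = embed(bytearray(256), b"claim:42")
print(extract(stego, 8))  # b'claim:42'
```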

Many details have yet to be worked out to ensure a CAI-based system can embrace time-based media like audio, video, and streaming formats.

Another challenge facing CAI is that there is currently no universal approach for storing attribution data that suits all use cases. Depending on the systems involved, the attribution information may be large enough to make it impractical to embed in a file containing digital content. Conversely, some creators may have privacy concerns such that, to preserve anonymity, neither the metadata associated with digital content nor the asset itself can be stored on servers in the cloud. Therefore, the CAI imagines data storage as a continuum of options ranging from file-based to cloud-based, with hybrid approaches in between.

That is where the Starling initiative comes into play. The Starling Framework for Data Integrity is a distributed storage solution built on Filecoin, using Filecoin’s protocol and core implementation to create an immutable archive. The effort, spearheaded by the USC Shoah Foundation and Stanford’s Department of Electrical Engineering, lets organizations leverage the power of cryptography and distributed systems to authenticate digital video and images.

Starling is a hardware-based system, similar to Truepic, that uniquely signs each photo or video and creates two hashes: one of the image and the other of the metadata. Starling then registers the content, which is pushed off the phone and stored on the Filecoin network. Finally, to manage all the hashes generated during the capture and storage processes, Starling includes a hash/certification management system that lets organizations engage multiple experts to verify footage. Each organization can then publish on its own DLT, and this knowledge graph can be accessed by users on any platform.

Starling has three modules – capture, storage, and verify – and uses the IPFS/Filecoin framework to enable them. During capture, a combination of hardware (HTC) and software (IPFS) creates a chain of custody from the cameras to digital platforms. The image is paired with metadata from an array of sensors on the device to prove the footage was taken at a specific time, date, and location. The footage is then cryptographically hashed using IPFS, creating a content identifier (CID) that serves as a unique fingerprint of that footage. The data is then replicated onto multiple IPFS/Filecoin storage nodes, which natively use CIDs. Content addressing is powerful because a change to a single pixel will generate a completely different hash for the footage. Also, when data is fetched using a CID rather than a URL, you are guaranteed to see the intended version of that data. In essence, Filecoin and IPFS form a decentralized, global network that is far more difficult to hack than a centralized system.
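
A simplified Python stand-in for content addressing illustrates why this works. A real IPFS CID wraps the digest in multihash/multibase encodings, but the core property – any change to the bytes yields a different identifier, and fetched bytes can be verified against the identifier – is the same:

```python
import hashlib

# Simplified stand-in for an IPFS CID: a real CID wraps the digest in
# multihash/multibase encodings, but the core property is the same.
def content_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

original = b"\x00" * 1_000_000        # stand-in for raw footage bytes
tampered = bytearray(original)
tampered[500_000] ^= 1                # flip a single bit in one "pixel"

print(content_id(original)[:16])
print(content_id(bytes(tampered))[:16])  # completely different identifier

def fetch_verified(cid: str, data: bytes) -> bytes:
    """Fetching by CID lets the receiver verify it got the intended bytes."""
    if content_id(data) != cid:
        raise ValueError("content does not match its CID")
    return data
```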

One issue I have with the Starling and CAI approaches is their requirement to include a “human-in-the-loop” for verification purposes. I can foresee this becoming closed chains of attestation to “facts” rather than truly open knowledge circles or a web of trust. It seems to me that verifiers should be able to depend on AI/ML to perform this function, hence eliminating potentially biased human fact-checkers from the digital content publication and verification process. And yes, I understand that AI/ML algorithms can be poisoned or tainted as well, but at least they can be tested and challenged on a scientific basis. An alternative is to add reputation or a trust index as a parameter to the verification process – applied to either human-based or AI/ML-based verification processes.

Establishing the trustworthiness of information providers today relies on a well-established practice of combining reputation and provenance information. For instance, consider a service where several subprocesses are executed in sequence to produce a particular output. It may be necessary to know the sources that were used to generate this output – for example, to verify that the process was correctly executed, or to check that the sources are reliable (a reputation factor). This scenario is typical in data provenance applications, where it is necessary to know which elements were used to generate a target data source and how trustworthy the originating data sources are. A prime example is Google search. In this case, metadata acts as a surrogate for the actual records. Can we trust the agents and processes that selected the metadata that takes us to the data? We base our use of Google or another search engine on the “reputation” of that search engine. However, in some cases we need to understand the specifics of the provenance of the metadata to determine whether the search results are being manipulated. The manipulation may be by a trusted party, in which case the results may actually have more value. But there may also be cases where the manipulation is a cause for concern, or at least calls for further verification. These observations suggest that information retrieval in a distributed environment is going to become a much more complex process, one that depends on the provenance and reputation of the source provider(s).
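
As a toy model of that chained dependence, the sketch below combines per-stage source reliabilities into a crude composite score for a pipeline’s output. The stage names, the scores, and the independence assumption behind the simple product are all illustrative:

```python
from math import prod

# Toy model: an output produced by several subprocesses in sequence, each
# drawing on a source with an estimated reliability in [0, 1]. Assuming
# independence (a strong simplification), a crude composite reliability
# is just the product of the per-stage reliabilities.
pipeline = [
    ("crawler", 0.99),
    ("metadata-extractor", 0.95),
    ("ranking-model", 0.90),
]

composite = prod(score for _, score in pipeline)
print(f"composite reliability: {composite:.3f}")  # 0.846
```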

This is where the Semantic Web could help. The Semantic Web has started to emerge as a movement away from the centralization of services like search, social media, and chat applications that depend on a single organization to function. A case in point is the work by Berners-Lee on a project called Solid, built around personal data stores, or “pods,” over which individuals retain control. Berners-Lee has formed a startup, Inrupt, to advance the idea and attract volunteer developers.

Ontologies provide a great way to add context to the metadata. The Web Ontology Language (OWL), along with RDFa, provides a good technology foundation for the Semantic Web to help address the complexities of reputation and provenance within a distributed information retrieval environment. Specially crafted Semantic Web metadata can weave together sensible, well-structured, domain-specific digital assets that carry reputation and provenance information. Some types of metadata that may support a trust index or feed a reputation score could include ratings for the following (a sketch of how such ratings might combine into a trust index follows this list):

  • Correctness to include accurate portrayal of the actual phenomena and valid deductions
  • Currency of the information [latency between the point of capture and the closest point of persistence]
  • Completeness and relevancy to ensure that everything required, and no more, is included in the information and metadata provided
  • Conformance to standards, technical rules, and internal business policies
  • Compliance to regulations and laws
  • Security and privacy [adequacy of controls related to authentication, authorization, integrity, privacy, auditing]
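
Here is the promised sketch of how such ratings might combine into a trust index. The 0-to-1 scale and the weights are assumptions for illustration, not any standard:

```python
# Hypothetical trust index: each rating category from the list above is
# scored on a 0-to-1 scale and combined with weights. Both the weights
# and the scale are illustrative assumptions, not a standard.
WEIGHTS = {
    "correctness": 0.30,
    "currency": 0.10,
    "completeness_relevancy": 0.15,
    "conformance": 0.15,
    "compliance": 0.15,
    "security_privacy": 0.15,
}

def trust_index(ratings: dict) -> float:
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[k] * ratings[k] for k in WEIGHTS)

asset_ratings = {
    "correctness": 0.9, "currency": 0.8, "completeness_relevancy": 0.7,
    "conformance": 1.0, "compliance": 1.0, "security_privacy": 0.6,
}
print(f"trust index: {trust_index(asset_ratings):.3f}")  # weighted combination
```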

These ratings could apply to the digital asset in question or to an entity such as a “fact-checker” – whether human or automated. They can also apply to a reputation authority who may issue trust statements or reputation assurances regarding a service provider. A reputation authority is generally defined as an organization offering a reputation service to a community of registered Service Providers. A trust statement is a composite document containing the information necessary for a Service Requester to make an informed decision before entering into a transaction with a Service Provider; it might include reputation metrics, terms and conditions, and guarantees. For example, a reputation authority may issue a guarantee or bond in support of a reputation rating.

With respect to a security and privacy rating parameter, it should be noted that the introduction of reputation and provenance can be a double-edged sword. On the one hand, it can be very advantageous for the purposes noted above. On the other, it may contain information far more sensitive than the data to which it corresponds. An example would be an intelligence report whose pedigree tracks right back to the sources, the people who processed the information, and the techniques used. Hence the potential need for security controls to protect reputation and provenance information.

Adding AI/ML techniques to the Semantic Web is an area of active research and could also boost support for a trust service based on reputation and provenance metadata. AI/ML will require extensions to OWL – for example, to annotate conditional probabilities.
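
As a hint of what such an extension might look like, the sketch below (using the rdflib package) attaches a made-up ex:hasConfidence annotation to an attribution statement; OWL itself defines no such probability construct, so the ex: namespace and property are purely illustrative:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

# OWL has no built-in notion of probability; this sketch attaches a
# hypothetical ex:hasConfidence annotation to a provenance statement to
# suggest how an ML-oriented extension might look.
EX = Namespace("http://example.org/trust#")
g = Graph()
g.bind("ex", EX)

stmt = URIRef("http://example.org/trust#attribution-001")
g.add((stmt, RDF.type, EX.AttributionClaim))
g.add((stmt, EX.assertsCreator, Literal("Alice")))
g.add((stmt, EX.hasConfidence, Literal(0.92, datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```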

There has been some standards work involving reputation systems as well. The more recent effort by ISO – ISO 20488:2018, Online consumer reviews – Principles and requirements for their collection, moderation and publication – is applicable to any organization that publishes consumer reviews online, including suppliers of products and services that collect reviews from their own customers, a third party contracted by the supplier, or an independent third party. It has special relevance to sectors, like tourism, where customer experience and reputation ratings are overwhelming factors in decision making. It is also a major influence when it comes to buying products that have to perform in a certain way, such as sporting goods.

Over 10 years ago, OASIS also made an attempt at a reputation management standard. I was personally involved in this early effort as part of the OASIS Open Reputation Management Systems (ORMS) Technical Committee. The committee was formed to develop a system providing the ability to use common data formats for representing reputation data and standard definitions of reputation scores. The system was not intended to define algorithms for computing the scores; however, it was designed to provide the means for understanding the relevancy of a score within a given transaction. It did provide an XML schema for reputation scoring and management of reputation scores, as well as a reference model.

As shown in the context diagram for ORMS, a good reputation management system will separate the reputation of the evaluator from the data that is used to evaluate a given entity in the system. In this fashion, aggregators will have a reputation that can be used to score how well they do in gathering good data, and feedback providers will have their own reputation that can be used to purge or clean the feedback they provide on other entities. Such systems will be less susceptible to data manipulation and better able to provide constructive reputation or trustworthiness scores. A specific aspect is that reputers, reputees, and the reputation service provider may determine the criteria to be evaluated. Both reputers and reputees may apply their respective weightings, allowing the reputation service provider to calculate overall ratings.
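
A minimal sketch of that separation, as I read it: each feedback provider (reputer) carries its own reputation, which weights the feedback it gives about a reputee. The weighting scheme below is my own illustration, not the ORMS specification:

```python
# Minimal sketch of the separation ORMS describes: each feedback provider
# (reputer) has its own reputation, which weights the feedback it gives
# about a reputee. The weighting scheme is an assumption for illustration.
reputer_reputation = {"alice": 0.9, "bob": 0.5, "spam-bot": 0.05}

feedback = [  # (reputer, score given to the reputee, in [0, 1])
    ("alice", 0.95),
    ("bob", 0.80),
    ("spam-bot", 0.00),   # low-reputation feedback barely moves the score
]

def reputee_score(feedback, reputer_reputation) -> float:
    weights = [reputer_reputation[r] for r, _ in feedback]
    return sum(w * s for w, (_, s) in zip(weights, feedback)) / sum(weights)

print(f"{reputee_score(feedback, reputer_reputation):.2f}")  # ≈ 0.87
```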

Even given these standards, there are still many questions that arise when trying to address this combination of provenance and reputation:

  • What protocols should exist between reputation management systems and provenance systems?
  • Can the reputation of a source serve as a piece of metadata to help mitigate disputes when multiple views of a data object or other object are encountered, i.e., different provenances?
  • Why does a reputation score fluctuate? Does the fluctuation reflect a real or perceived change in the reputation (and therefore the activity of the reputee), or is the reputation being manipulated by other means? Can understanding the provenance of the score help to detect these issues?
  • Do reputation and provenance systems need to operate under a coherent “global order?” How are reputation and provenance systems affected by different contexts, rules, and policies?
  • How does a reputation emerge – what kickstarts it? Does provenance play a role in the emergence of a reputation? How can a Reputation Authority assist in the kickstart process?
  • How does a reputation take on robustness – i.e., immunization to attacks that are unwarranted? How can provenance play a role in establishing the robustness of a reputation?
  • How do we manage the risks of dealing with provenance (sources and methods) while providing some notification of the “trustworthiness” of the information provided? Does the reputation of the data aggregator help in establishing the trustworthiness of the provenance of the data?

So thanks for taking the time to read this somewhat lengthy article. Let me know what you think about reputation and provenance and the roles they should play in establishing authenticity of digital content.


And thanks to my subscribers and visitors to my site for checking out ActiveCyber.net! Please give us your feedback because we’d love to know some topics you’d like to hear about in the area of active cyber defenses, authenticity, PQ cryptography, risk assessment and modeling, autonomous security, digital forensics, securing OT / IIoT and IoT systems, Augmented Reality, or other emerging technology topics. Also, email chrisdaly@activecyber.net if you’re interested in interviewing or advertising with us at Active Cyber™.