Current Security Trends Reveal Difficulties in Assuring Authenticity

Recently I was thinking about some of the major security challenges of 2020 and beyond into 2021: the ongoing mitigation of the SolarWinds supply chain compromise; election fraud; the problems around disinformation and deepfakes; false flags in cyberattacks and the difficulty of making accurate attribution; reputation risks due to cyberattacks, doxxing, and information operations; record numbers of phishing attacks; machine learning poisoning; search engine and social media manipulation; online surveillance and privacy violations; ransomware; identity theft and spoofing; sock puppets and source hacking; gaps in our critical infrastructure protections; pervasive supply chain and cyber vulnerabilities; and amazingly shoddy cyber hygiene. And now we are moving headlong into IoT, 5G, and XR – new ways to increase our attack surface by orders of magnitude. It makes me think we really need to re-evaluate our cyber systems and reset some protection baselines while asking some hard questions:

  • Can we trust the providers of what we see, hear and read?
  • How can we ensure that we get unfiltered / unsuppressed access to content that we have legitimate and lawful rights to access?
  • Are the digital things that we see, hear, and read authentic and legitimate representations?
  • When necessary, can we positively identify and trace a digital entity and its associated attributes to its corresponding physical entity and vice versa?
  • Can we trust enterprises and governments to manage our identity information or should we look to adopt a decentralized, person-centric identity approach?
  • Can we track the provenance of digital content and software, as it gets modified, and is transported across the Internet?
  • Can we trust our digital and physical supply chains and trading partners? Do we have the ability to track and trace physical and corresponding digital elements in our supply chain and ensure their authenticity? Can we trust the technology and products that we use today to perform as specified and as required by law?
  • How can we better manage and fairly protect personal and brand reputations online?
  • How can we make our social platforms more accountable in maintaining conformance to laws and social norms? Can we ensure that they stay neutral or at least act as multi-voice platforms? Is it time to change the Communications Decency Act to reflect the reality of these social media platforms?

Many of the questions raised above relate to the means we use to ensure authenticity, whether we are dealing with authentic (and legitimate) content, authentic software and devices, or an authentic [and authorized] person at the other end of our [Internet/fill-in-the-box] communication.

Authenticity problems are also likely to get worse before they get better. Already, sifting through content to determine “what” is genuine is very hard: roughly 40 zettabytes (40 trillion gigabytes) of data sit on the Internet, growing by more than 2.5 quintillion bytes (roughly 2.5 billion GB) every day, and that growth is accelerating as tens of billions of IoT devices come online and spew even more into this ocean of data. Paradoxically, as the world gets flatter and more digital, access to content is increasingly controlled and manipulated by large corporations and governments using proprietary APIs to the platforms that host the content we want to access. Content and users are analyzed by “smart” autonomous agents powered by artificial intelligence and machine learning that use these information piles to match profiles and push content to users’ likes. This proprietary, API-based approach to content access makes it difficult, if not impossible, to ensure authenticity in a universally consistent manner.

It is also hard to determine “who” is genuine in this growing digital world. Identities are virtual, as everyone takes on multiple digital personas and every “thing” gets an identity. AI-powered digital personas will dynamically link information from various sources — such as books, research papers, notes, and media interviews — and turn that disparate information into a knowledge system that people can interact with digitally. Identity is also complicated by “deepfakes” – a popular image synthesis technique based on artificial intelligence. It is more powerful than traditional image-to-image translation because it can generate images without paired training data. The goal of a deepfake is to capture common characteristics from a collection of existing images and find a way to imbue other images with those characteristics, e.g., shapes and styles. Generative adversarial networks (GANs) provide one way to implement deepfakes, and the results are becoming increasingly difficult to distinguish from a real image or video. Combining deepfakes with the ability to profile users and create digital personas can easily lead to online attacks on reputation, doxxing, bullying, other types of disinformation campaigns, as well as fraud.

Cloud computing, “Bring Your Own Device,” remote work, IoT, autonomous systems, and the broad use of 5G all erode the relevance of fixed network boundaries, whether physical or software-defined. This erosion of the network perimeter, coupled with problems in the digital [and physical] supply chains and escalating cyber security issues, raises questions of what and whom to trust – i.e., are the persons, devices, and data on the network authentic? In this digital world it will be harder to distinguish fact from fiction, authentic personas from fake, phishing from legitimate emails, and counterfeit or gray market items from the “real article.”

So exactly what do I mean by “authenticity?” According to the Stanford Encyclopedia of Philosophy, the term “authentic” is used either in the strong sense of being “of undisputed origin or authorship,” or in a weaker sense of being “faithful to an original” or a “reliable, accurate representation.” And according to the American Heritage® Dictionary of the English Language, Fifth Edition, authenticity is defined as the quality or condition of being authentic, trustworthy, or genuine. Authentic content tries to genuinely be of service to its users and its industry without trying to manipulate its audience. Anything that seeks to trick consumers is the opposite of authentic. These definitions seem to closely reflect my view on authenticity for the purposes of this article.

So what can we do to improve authenticity? That is the subject of this article and several more to come. I hope to cover the questions above, and some more ideas and emerging innovations around establishing authenticity, how reputation and provenance play a role in authenticity, and how a secure, decentralized “platform” foundation is instrumental to bootstrapping and maintaining authenticity in our daily lives.

Content and Identity Authenticity Needs to be Woven Into the Fabric of the Internet

Moving forward, the current detect and prevent / mitigate approaches, such as Facebook’s efforts to detect and remove deepfakes, or the “fact checking” approach used by many third-party (mostly partisan) groups to verify authentic and legitimate content, generally do not scale, nor do they provide timely responses to authenticity challenges. These approaches need to be reinforced or replaced in some way. I believe authenticity challenges need to be addressed from the ground up in a secure-by-design approach – or what I refer to as authenticity-by-design.

An authenticity-by-design approach must focus on assuring a strong foundation of underlying and reinforcing properties – verifiable identity, integrity, availability and accessibility, accountability, transparency, and privacy. Generally, when data is transparent and easily accessible, it is easier to identify anomalies… or straight-up deceit. These properties should be backstopped by automated reasoning systems to reduce the uncertainty and vagueness created by the vastness of the Internet and human intentions. In addition, an authenticity-by-design approach would benefit from a democratization and decentralization of digital content and a collaborative, open approach to capability development and delivery. Checking ourselves out of proprietary content silos takes power away from web monopolies like Google, Facebook, and Twitter. We’re so used to third-party platforms collecting, distilling, and presenting information that we forget they’re not strictly necessary, and they really don’t add value when it comes to authenticity of content.

When thinking of these properties and capabilities, I am reminded of Tim Berners-Lee’s Semantic Web. Envisioned in the late 1980s and through many ups and downs since then, the Semantic Web today provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by the World Wide Web Consortium (W3C) with participation from a large number of researchers and industrial partners.

By user:Marobi1 – https://en.wikipedia.org/wiki/File:Semantic-web-stack.png, CC0, https://commons.wikimedia.org/w/index.php?curid=35013850

As shown in the Figure, Berners-Lee’s Semantic Web, now reinforced by advancements in AI and ML, provides a good foundation for an authenticity framework. Semantic Web technologies enable people to create content ecosystems on the internet, build vocabularies, and write rules for handling data, including opting in or out of attribution as a publisher, or enabling privacy of linked data. Linked data is enabled by technologies such as RDF, SPARQL, OWL, and RIF/SWRL. The Semantic Web is silo-less; it is free, open, and abstract, enabling communication between different languages and platforms that would be far more difficult otherwise.

The Semantic Web also presents a potential downside: an advanced implementation might make it easier for governments to control the viewing of online information, as this information would be much easier for an automated content-blocking machine to understand. In addition, the issue has been raised that, with the use of geolocation metadata, there would be very little anonymity associated with the authorship of articles on things such as a personal blog. Some of these concerns were addressed in the “Policy Aware Web” project, and they remain an active research and development topic at the W3C. The same control and privacy problems could also be said of any widely adopted, centrally-based content attribution protocol, such as the one being developed under Adobe’s Content Authenticity Initiative.

Progressing in parallel with the Semantic Web are several initiatives and standards that might also address some of the issues it raises. These initiatives focus on the establishment of decentralized identities, such as:

  • World Wide Web Consortium (W3C) Verifiable Credentials and Decentralized Identifiers (DIDs) standards. Verifiable credentials (VCs) are the electronic equivalent of the physical credentials that we all possess today, such as plastic cards, passports, driving licenses, qualifications and awards, etc. The data model for verifiable credentials is a World Wide Web Consortium Recommendation, “Verifiable Credentials Data Model 1.0 – Expressing verifiable information on the Web,” published 19 November 2019. Such a model is decentralized and gives much more autonomy and flexibility to the participants. DIDs are identifiers which are globally unique, highly available, and cryptographically verifiable (a minimal sketch of a DID and its DID document appears after this list).
  • European Self-Sovereign Identity Framework (ESSIF). The ESSIF makes use of DIDs and the European Blockchain Services Infrastructure (EBSI).
  • Decentralized Identity Foundation (DIF). DIF is an engineering-driven organization focused on developing the foundational elements necessary to establish an open ecosystem for decentralized identity and ensure interoperability between all participants. Groups in DIF develop specifications and emerging standards for protocols, components, and data formats that implementers can execute against, with special focus on DIDs.
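To make these building blocks more concrete, below is a minimal Python sketch (assuming the cryptography package) that generates an Ed25519 key pair and assembles an illustrative DID document roughly in the shape of the W3C DID Core data model. The did:example method, the identifier derivation, and the publicKeyHex field are hypothetical placeholders for illustration, not a conformant implementation of any particular DID method.

```python
# Illustrative sketch of a DID and DID document (not a conformant DID method).
# Assumes the 'cryptography' package is installed: pip install cryptography
import json
import hashlib
from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.hazmat.primitives import serialization

# 1. Generate an Ed25519 key pair; the public key is published in the DID document.
private_key = ed25519.Ed25519PrivateKey.generate()
public_bytes = private_key.public_key().public_bytes(
    serialization.Encoding.Raw, serialization.PublicFormat.Raw
)

# 2. Derive a hypothetical method-specific identifier from the public key.
did = "did:example:" + hashlib.sha256(public_bytes).hexdigest()[:32]

# 3. Assemble a DID document that binds the identifier to its verification key.
did_document = {
    "@context": "https://www.w3.org/ns/did/v1",
    "id": did,
    "verificationMethod": [{
        "id": f"{did}#key-1",
        "type": "Ed25519VerificationKey2020",
        "controller": did,
        "publicKeyHex": public_bytes.hex(),  # placeholder encoding for illustration
    }],
    "authentication": [f"{did}#key-1"],
}

print(json.dumps(did_document, indent=2))
```

Anyone who resolves this DID to its document can retrieve the public key and verify signatures made with the corresponding private key, which is the basic mechanic that verifiable credentials build on.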

Decentralized identity solutions enable trustable ways to store and share verified identity while also preserving privacy and human autonomy. While data analytics are becoming increasingly effective in identifying and linking the digital trails of individual persons, it has become correspondingly necessary to defend the privacy of individual users and implement instruments that allow and facilitate anonymous access to services. Decentralized identity solutions can provide a means of authentication while still allowing the individual – be it employee, entity partner, or device – to be in control of their identity data.

According to Anne Bailey, an analyst at KuppingerCole, in her report Leadership Brief: What to Consider When Evaluating Decentralized Identity?, “Decentralized identity solutions serve to create a different infrastructure for managing digital identity that returns control of PII data to the individual owner, streamline the user experience, address questions of digital document integrity, and provide resiliency against malicious attacks. Early adopters consider decentralized identity solutions as a way to redesign organizational IAM to suit current data privacy expectations and support the influx of uniquely identified IoT devices. Decentralized ID should create a record in the enterprise user directory, but with attestations of verified identity credentials published to a blockchain identity ecosystem that preserves ownership by the user and enables reuse of verified identity credentials with other enterprises.” These concepts are represented in the KuppingerCole Figure below.

from KuppingerCole Leadership Brief – What to Consider When Evaluating Decentralized Identity? – Anne Bailey, March 5, 2020

 

So what would an authenticity stack look like when we combine identity and content? I believe that Andy Tobin’s 3 Pillars of Self-Sovereign Identity (SSI) provides a good general description of what capabilities the combination could entail.

He describes a simple three-layer stack which I consider the basis for an authenticity stack. First, a layer providing secure connections is needed. The work that the DIF Working Group is doing on DIDComm 2.0 would be a key protocol to consider when building this layer of the authenticity stack. DIDComm is a standard, open protocol for establishing unique, private, and secure connections between multiple parties without requiring the assistance of an intermediary “connection broker,” like Google, WhatsApp, an email provider, or a phone carrier. DIDComm is based on a standard format known as JSON Web Message. Secure connections are typically created by two or more peers creating and exchanging decentralized identifiers or “DIDs.” There are a variety of different implementations of DIDs, known as DID methods, available in the market, each with fundamentally different properties. Regardless of DID method, once two parties have exchanged DIDs, they can communicate securely as though through a private tunnel that nobody else can see or enter. DIDs can be created by anyone at any time, and you can have different DIDs for each of your digital relationships in order to keep them separate. DIDs provide secure connectivity; they do not by themselves provide trust.
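The sketch below is not the DIDComm wire format; it is a rough Python illustration (assuming the cryptography package) of the idea behind the “private tunnel”: each peer holds a key agreement key referenced from its exchanged DID, the two derive a shared session key, and messages travel encrypted and authenticated over any transport without a connection broker in the middle. The key names and message content are made up for the example.

```python
# Rough sketch of a pairwise "private tunnel" between two DID-holding peers.
# This is NOT the DIDComm v2 wire format; it only illustrates the underlying idea.
# Assumes the 'cryptography' package is installed.
import os
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

# Each peer creates a key agreement key (referenced from its DID document).
alice_key = X25519PrivateKey.generate()
bob_key = X25519PrivateKey.generate()

def derive_session_key(own_private, peer_public):
    """Derive a symmetric session key from an X25519 key exchange."""
    shared = own_private.exchange(peer_public)
    return HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                info=b"did-pairwise-demo").derive(shared)

# After exchanging DIDs (and thus public keys), both sides derive the same key.
alice_session = derive_session_key(alice_key, bob_key.public_key())
bob_session = derive_session_key(bob_key, alice_key.public_key())
assert alice_session == bob_session

# Alice sends Bob an authenticated, encrypted message over any transport.
nonce = os.urandom(12)
ciphertext = ChaCha20Poly1305(alice_session).encrypt(nonce, b"hello Bob", b"")
print(ChaCha20Poly1305(bob_session).decrypt(nonce, ciphertext, b""))  # b'hello Bob'
```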

A second layer focused on enabling trust is next. This authenticity layer would be based on a standard, open “digital data watermarking” protocol for issuing, holding, and verifying protected data, including verifiable credentials or “VCs.” This enables anyone to verify the source, integrity, and validity of any data that is presented to them, and to do so robustly and securely. This mechanism uses well-proven public key cryptography to digitally sign each data element. Additionally, this layer includes protocols for delegation of data, encryption, secure data storage, and approaches to revocation. The open Verifiable Credentials (VC) standard is the embodiment of this protocol, and is now a formal recommendation at the W3C. Any data can be put into a verifiable credential for any purpose by anyone. The combination of human trust in the issuer of the credential and cryptographic trust in the protocol is what provides digital trust between two parties.
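Here is a minimal sketch of the signing idea behind this layer, assuming the cryptography package: the issuer signs the credential payload with its private key, and any verifier holding the issuer’s public key (for example, resolved from the issuer’s DID document) can check integrity and origin. A production verifiable credential would use the W3C proof formats and canonicalization, which this sketch deliberately skips; the credential fields and DIDs are hypothetical.

```python
# Minimal sketch of issuing and verifying a signed credential payload.
# Real verifiable credentials use W3C proof suites and canonicalization;
# this only shows the core sign/verify idea. Assumes 'cryptography' is installed.
import json
from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.exceptions import InvalidSignature

issuer_key = ed25519.Ed25519PrivateKey.generate()

credential = {
    "@context": ["https://www.w3.org/2018/credentials/v1"],
    "type": ["VerifiableCredential"],
    "issuer": "did:example:issuer123",                       # hypothetical issuer DID
    "credentialSubject": {"id": "did:example:alice", "degree": "BSc Security"},
}

# The issuer signs a deterministic serialization of the credential.
payload = json.dumps(credential, sort_keys=True).encode()
signature = issuer_key.sign(payload)

# A verifier with the issuer's public key checks integrity and origin.
try:
    issuer_key.public_key().verify(signature, payload)
    print("credential signature is valid")
except InvalidSignature:
    print("credential has been tampered with")
```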

A third layer is also critical – somewhere to store the public verification keys of connections and data owners. This allows anyone to locate and retrieve public keys at any time in order to verify the source, integrity, and validity of any data that adheres to the protocols in the previous layers. These keys and other cryptographic data are typically held in DID documents. While these could be stored in any database, in order for the source of information to be globally trusted, many decentralized identity systems have chosen distributed ledger technologies (DLTs) for their unique properties:

  • no backdoor or admin access for malicious changing of data;
  • no reliance on a single monopolistic provider that can turn it off, and;
  • chronologically ordered so you know you are retrieving current keys.

Note: DLT-based decentralized identity solutions should store PII data on the user’s personal device (in the secure enclave for iOS, for example), or in a dedicated hardware wallet. Only a hash of encrypted PII data should be stored on the ledger chain. Hashing out and deleting the underlying data can satisfy a GDPR erasure request even though the hash of the personal data remains on the ledger.
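A small Python sketch of that pattern, with an in-memory list standing in for the ledger and the Fernet recipe from the cryptography package standing in for the wallet’s encryption: the PII stays encrypted off-ledger, only a hash of the ciphertext is anchored, and deleting the local data leaves nothing personally revealing behind.

```python
# Sketch of keeping PII off-ledger: encrypt locally, anchor only a hash on the "ledger".
# Assumes the 'cryptography' package; the in-memory list stands in for a real DLT.
import hashlib
from cryptography.fernet import Fernet

ledger = []                           # stand-in for a distributed ledger
wallet_key = Fernet.generate_key()    # held on the user's device / secure enclave
wallet = Fernet(wallet_key)

pii = b'{"name": "Alice Example", "dob": "1990-01-01"}'

# 1. Encrypt the PII and keep the ciphertext only in the user's wallet.
ciphertext = wallet.encrypt(pii)

# 2. Anchor only a hash of the encrypted PII to the ledger.
anchor = hashlib.sha256(ciphertext).hexdigest()
ledger.append({"did": "did:example:alice", "pii_hash": anchor})

# 3. Later, the user can prove the data is unchanged by re-presenting the ciphertext...
assert hashlib.sha256(ciphertext).hexdigest() == ledger[0]["pii_hash"]

# 4. ...while an erasure request deletes the ciphertext and key; only the hash remains,
#    which by itself reveals nothing about the underlying personal data.
del ciphertext, wallet_key
print("off-ledger PII deleted; ledger still holds:", ledger[0])
```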

The opportunity to leverage this three layer stack is not limited to identity, either. While many initial implementations have an identity focus, these foundational building blocks can be used for any type of data, as well as for people, organizations, and things. What’s more, this capability enables secure messaging by default, as well as a new type of “quad-factor” authentication comprising device, biometric/PIN, DID connection, and verifiable credential proof. Taken together, these developments raise the bar on Internet identity, security, and privacy and provide a good basis for an authenticity stack.

So what do the advancements in the Semantic Web add to the authenticity stack? I see four key advantages:

  1. the Semantic Web enables M2M authenticity, which will become increasingly important as IoT and autonomous things really take hold and begin to dominate the generation of content and the use of identity on the internet.
  2. the Semantic Web enables language independence through data enrichment, so more people can participate in developing better content regardless of language.
  3. the Semantic Web associates meaning with data through semantic annotation and ontologies. Ontologies are critical for applications that need to search across or merge information from diverse communities.

Ontologies based on standards such as OWL (Web Ontology Language) facilitate machine interpretability of Web content by providing additional vocabulary along with a formal semantics. Ontologies are also one way to build and support knowledge graphs. The Resource Description Framework (RDF) is a widely used general-purpose language that produces a flexible graph model for representing meta-information and enabling data interchange on the web.
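As a small illustration of what this looks like in practice, the sketch below (assuming the rdflib package) builds a tiny RDF graph with a fragment of class hierarchy and some provenance-style statements, then queries it with SPARQL. The vocabulary and terms are invented for the example.

```python
# Tiny RDF graph with a small ontology fragment, queried with SPARQL.
# Assumes the 'rdflib' package is installed: pip install rdflib
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.org/authenticity#")   # hypothetical vocabulary
g = Graph()
g.bind("ex", EX)

# A minimal ontology: Articles and Deepfakes are both kinds of Content.
g.add((EX.Article, RDFS.subClassOf, EX.Content))
g.add((EX.Deepfake, RDFS.subClassOf, EX.Content))

# Some instance data with a provenance-style annotation.
g.add((EX.post42, RDF.type, EX.Article))
g.add((EX.post42, EX.publishedBy, EX.ActiveCyber))
g.add((EX.post42, EX.title, Literal("Assuring Authenticity")))

# SPARQL: find every content item (of any subclass) and who published it.
query = """
    SELECT ?item ?publisher WHERE {
        ?item a/rdfs:subClassOf* ex:Content ;
              ex:publishedBy ?publisher .
    }
"""
for row in g.query(query, initNs={"ex": EX, "rdfs": RDFS}):
    print(row.item, "published by", row.publisher)
```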

There are four semantic annotation formats that extend HTML syntax to create machine-readable semantic markup for HTML documents: Microformats, RDFa, Microdata, and JSON-LD. Semantic markup is often generated automatically rather than manually. RDF plays the role of a common model, a kind of “glue” to integrate the data. That does not mean the data must be physically converted into RDF form and stored as RDF/XML. Instead, automatic procedures can produce RDF data on the fly in answer to queries – for example, SQL-to-RDF converters for relational databases, or GRDDL processors for XHTML files with microformats or RDFa. RDF data may also be embedded in the data by other tools (e.g., Adobe’s XMP data that Photoshop automatically adds to JPEG images). Authoring tools also exist to develop ontologies at a high level instead of editing the ontology files directly. Of course, direct editing of RDF data is sometimes necessary, but it can be expected to become less and less prevalent as smarter editors come to the fore.
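For example, the JSON-LD markup that sites embed in web pages can be pulled straight into an RDF graph. The sketch below assumes rdflib 6 or later, which bundles a JSON-LD parser (earlier versions need the separate rdflib-jsonld plugin); an inline context is used so the example runs without fetching a remote context, and the values are placeholders.

```python
# Parse JSON-LD markup (as embedded in a web page) into ordinary RDF triples.
# Assumes rdflib >= 6, which bundles JSON-LD support (older versions need rdflib-jsonld).
from rdflib import Graph

jsonld_markup = """
{
  "@context": { "schema": "http://schema.org/" },
  "@id": "https://example.org/posts/42",
  "@type": "schema:Article",
  "schema:headline": "Assuring Authenticity",
  "schema:author": { "@type": "schema:Person", "schema:name": "A. Author" }
}
"""

g = Graph()
g.parse(data=jsonld_markup, format="json-ld")

# The markup is now plain RDF that can be merged, queried, or validated like any other.
for subject, predicate, obj in g:
    print(subject, predicate, obj)
print(f"{len(g)} triples extracted from the JSON-LD markup")
```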

  4. the Semantic Web provides usage and validation rules to make sure that knowledge remains consistent across multiple viewpoints, retains good quality against validation rules, and supports the inference of additional knowledge that is not explicitly stated in the data. Some examples include SWRL (Semantic Web Rule Language), SPARQL, and RDQL (RDF Data Query Language). Recently, the Shapes Constraint Language (SHACL) has also entered these ranks. With SHACL, a schema language now exists that lets schema-less graphs become Knowledge Graphs. SHACL provides a set of critical capabilities in the semantic stack. In terms of Knowledge Graphs, this means that SHACL can increase the coherence of data and assist with data quality by making things explicit. SHACL also provides a nice separation of concerns between meaning (OWL) and usage (SHACL).
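A minimal sketch of SHACL in action, assuming the pyshacl and rdflib packages: the shapes graph says every Article must state exactly one publishedBy, and validation flags the node that omits it. The vocabulary is the same invented one used in the earlier RDF sketch.

```python
# Minimal SHACL validation sketch using pySHACL: pip install pyshacl rdflib
from rdflib import Graph
from pyshacl import validate

# Data graph: one article carries provenance, the other does not.
data_ttl = """
@prefix ex: <http://example.org/authenticity#> .
ex:post42 a ex:Article ; ex:publishedBy ex:ActiveCyber .
ex:post43 a ex:Article .
"""

# Shapes graph: every ex:Article must state exactly one ex:publishedBy.
shapes_ttl = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/authenticity#> .
ex:ArticleShape a sh:NodeShape ;
    sh:targetClass ex:Article ;
    sh:property [ sh:path ex:publishedBy ; sh:minCount 1 ; sh:maxCount 1 ] .
"""

data_graph = Graph().parse(data=data_ttl, format="turtle")
shapes_graph = Graph().parse(data=shapes_ttl, format="turtle")

conforms, _report_graph, report_text = validate(data_graph, shacl_graph=shapes_graph)
print("conforms:", conforms)   # False: ex:post43 is missing ex:publishedBy
print(report_text)
```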

The Semantic Web is also aided by the application of automated reasoning and AI to address a variety of challenges such as:

  • the vagueness of user queries, of the concepts represented by content providers, of matching query terms to provider terms, and of trying to combine different knowledge bases. Fuzzy logic is the most common technique for dealing with vagueness.
  • uncertainty – for example, a patient might present a set of symptoms that correspond to a number of distinct diagnoses, each with a different probability. Probabilistic reasoning techniques are generally employed to address uncertainty (a small worked example follows this list).
  • inconsistency – there are logical contradictions that arise during the development of large ontologies, and when ontologies from separate sources are combined. Deductive reasoning fails when faced with inconsistency, because “anything follows from a contradiction.” Defeasible reasoning and paraconsistent reasoning are two techniques that can be employed to deal with inconsistency.
  • deceit – this is when the producer of the information is intentionally misleading the consumer of the information. Cryptographic techniques are currently used to alleviate this threat by providing a means to determine the information’s integrity, including the identity of the entity that produced or published the information; however, credibility issues still have to be addressed in cases of potential deceit.
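To make the uncertainty bullet concrete, here is a toy worked example of probabilistic reasoning in Python using Bayes’ rule. All prevalences and likelihoods are made-up numbers, and the three candidate diagnoses are treated as exhaustive purely for the sake of the illustration.

```python
# Toy probabilistic reasoning example: rank candidate diagnoses given one symptom.
# All numbers below are invented for illustration only.

# P(diagnosis): assumed prior prevalence of each condition.
priors = {"flu": 0.10, "common_cold": 0.30, "covid": 0.05}

# P(fever | diagnosis): assumed likelihood of the symptom under each condition.
likelihoods = {"flu": 0.90, "common_cold": 0.20, "covid": 0.80}

# Bayes' rule: P(d | fever) = P(fever | d) * P(d) / P(fever),
# with P(fever) obtained by summing over the candidates (assumed exhaustive here).
evidence = sum(likelihoods[d] * priors[d] for d in priors)
posteriors = {d: likelihoods[d] * priors[d] / evidence for d in priors}

for diagnosis, p in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"P({diagnosis} | fever) = {p:.2f}")
```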

So to conclude, there is a base set of open technology that we can begin to draw upon (and that some of us have been drawing upon for some time) to ensure authenticity-by-design for the content we create and the personas and identities that we adopt or which we assign to devices. These authenticity technology instruments must be made transparent to the average user through a simple UX to generate the level of adoption we need to get authenticity where it needs to be.

Stay tuned for my next article where I will discuss how emerging ecosystems play a growing role in ensuring authenticity. Also in future articles, I will explore supply chain authenticity for physical objects, digital objects, and software, and the roles of provenance and reputation in establishing and maintaining authenticity-by-design.


And thanks to my subscribers and visitors to my site for checking out ActiveCyber.net! Please give us your feedback because we’d love to know some topics you’d like to hear about in the area of active cyber defenses, PQ cryptography, risk assessment and modeling, autonomous security, digital forensics, securing OT / IIoT and IoT systems, Augmented Reality, or other emerging technology topics. Also, email chrisdaly@activecyber.net if you’re interested in interviewing or advertising with us at Active Cyber™.