Date: 2023-08-07

Publishers are seeking ways to safeguard their subscription businesses from generative AI chatbots, such as OpenAI’s ChatGPT, that allow users to bypass paywalls. To address this issue, industry experts emphasize the need for chatbot developers to indicate when they are accessing publishers’ content, enabling publishers to treat them in a similar manner to search engine bots.

Generative AI chatbots operate similarly to search engine bots, which scan websites and gather information to include in search results. Although OpenAI recently suspended this browsing feature in ChatGPT, Google’s Bard and Microsoft’s Bing have yet to disable their bots’ ability to crawl web content.

Publishers can disable crawl access for bots, but differentiating between AI bots and search engine bots, like those from Google, poses a challenge, since the latter allow pages to be indexed and appear in search results. According to Arvid Tchivzhel, managing director at Mather Economics’ digital consulting practice, without a unified “do not crawl” standard and technology for selective blocking, publishers lack sufficient means to prevent large language models from crawling their websites.
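To make the selective-blocking problem concrete, here is a minimal sketch of per-bot crawl rules expressed in a robots.txt file and checked with Python’s standard `urllib.robotparser`. The user-agent names `ExampleAIBot` and `ExampleSearchBot` and the URL are hypothetical placeholders, not real crawlers. The approach only helps if an AI crawler announces itself under a known name and chooses to honor the rules, which is exactly the gap described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow one self-identified AI crawler,
# allow every other bot (including search engine crawlers).
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
article_url = "https://publisher.example/news/story.html"  # placeholder URL

# A compliant crawler would run this check before fetching the page.
ai_allowed = parser.can_fetch("ExampleAIBot", article_url)
search_allowed = parser.can_fetch("ExampleSearchBot", article_url)
print(ai_allowed, search_allowed)
```

Because compliance is voluntary, rules like these express a publisher’s wishes rather than enforcing them.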

To understand the potential solutions available to publishers, it is important to examine the two main methods for implementing paywalls: JavaScript-based paywalls and paywalls built on a content delivery network (CDN).

JavaScript-based paywalls work by presenting a pop-up overlay on a reader’s device after the page has loaded, requiring users to log in to access further content. This approach is similar to how advertisements are displayed on a webpage.

On the other hand, CDN-based paywalls load the page on a separate server rather than directly on the user’s device. The page is only permitted to load once the reader has logged in. Examples of CDNs include Cloudflare and AWS, as well as Zuora’s Zephr, which has developed its own CDN.
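The difference between the two paywall styles can be illustrated with a simplified sketch. It assumes a hypothetical `Article` record and two rendering functions invented for this example. It is not how any particular vendor implements its paywall, only a picture of why a crawler that ignores JavaScript can read past one kind of paywall but not the other.

```python
from dataclasses import dataclass


@dataclass
class Article:
    teaser: str  # free preview text
    body: str    # subscriber-only text


def render_client_side_paywall(article: Article) -> str:
    """JavaScript-style paywall: the full text is sent to every visitor,
    and an overlay script hides it in the browser after the page loads."""
    return (
        f"<article>{article.teaser}{article.body}</article>"
        "<script src='/paywall-overlay.js'></script>"
    )


def render_server_side_paywall(article: Article, logged_in: bool) -> str:
    """CDN-style paywall: the gating decision is made before the response
    leaves the server, so the body is never sent to anonymous visitors."""
    if logged_in:
        return f"<article>{article.teaser}{article.body}</article>"
    return f"<article>{article.teaser}</article><p>Log in to continue reading.</p>"


story = Article(teaser="Council approves budget. ", body="Full subscriber-only details.")

# A crawler that does not execute JavaScript simply reads the raw response.
crawler_sees_client_side = render_client_side_paywall(story)
crawler_sees_server_side = render_server_side_paywall(story, logged_in=False)
```

In the first case the subscriber-only text is present in the response a bot receives; in the second it is absent unless the request is authenticated.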

While CDNs offer stronger protection against AI bots, it remains unclear whether they can block them effectively. Paywall management companies suggest that blocking AI crawlers would require AI organizations to flag their bots accordingly and to use consistent, known IP addresses that do not change.

Companies like paywall platform Piano are exploring solutions to combat generative AI crawling. Piano is developing a product called Edge Experience, which can secure content within a CDN. The feature will launch in beta with approximately five clients in the coming month and aims to block generative AI bots by enabling clients to identify and block specific user agents associated with such crawlers.
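User-agent blocking of the kind described here can be sketched in a few lines. This is a generic illustration, not Piano’s Edge Experience; the blocklist tokens `exampleaibot` and `samplellmcrawler` and the function names are hypothetical, and the check only works when a crawler sends an honest, recognizable User-Agent header.

```python
# Hypothetical substrings that identify AI crawlers in the User-Agent header.
BLOCKED_AGENT_TOKENS = ("exampleaibot", "samplellmcrawler")


def is_blocked_user_agent(user_agent: str, blocked_tokens=BLOCKED_AGENT_TOKENS) -> bool:
    """Return True if the User-Agent contains any blocked token (case-insensitive)."""
    ua = user_agent.lower()
    return any(token in ua for token in blocked_tokens)


def status_for_request(headers: dict) -> int:
    """Decide at the edge: 403 for blocked crawlers, 200 for everyone else."""
    user_agent = headers.get("User-Agent", "")
    return 403 if is_blocked_user_agent(user_agent) else 200


bot_status = status_for_request({"User-Agent": "Mozilla/5.0 (compatible; ExampleAIBot/1.0)"})
reader_status = status_for_request({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
```

A crawler that spoofs a browser User-Agent would pass this check, which is why the header alone is a weak signal.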

Industry experts stress the importance of a unified approach by publishers to combat AI bot crawlers. One possible strategy involves negotiating licensing deals with generative AI companies, allowing them to use and distribute publishers’ content.

Analyzing bot traffic is suggested as a valuable tool for monitoring AI crawlers. The Philadelphia Inquirer’s chief technology and product officer, Matt Boggie, says the publisher treats a significant surge in requests from a single IP or a limited range of IPs as a red flag. However, real-time tracking remains challenging, as these anomalies can easily go unnoticed throughout the day.
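The red flag described above, a surge from one IP or a narrow range of IPs, can be approximated by counting requests per address and per network block. The sketch below is an assumption-laden illustration rather than The Inquirer’s method: the function name, the thresholds, and the choice of /24 blocks for IPv4 are all hypothetical, and the sample addresses come from ranges reserved for documentation.

```python
import ipaddress
from collections import Counter


def flag_surges(request_ips, per_ip_limit, per_network_limit, prefix=24):
    """Return (hot_ips, hot_networks) whose request counts exceed the limits.

    request_ips: iterable of IPv4 address strings, one per logged request.
    prefix: network size used to group nearby addresses (assumed /24 here).
    """
    ip_counts = Counter(request_ips)
    network_counts = Counter()
    for ip, count in ip_counts.items():
        # strict=False lets us derive the enclosing network from a host address.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        network_counts[str(network)] += count
    hot_ips = sorted(ip for ip, n in ip_counts.items() if n > per_ip_limit)
    hot_networks = sorted(net for net, n in network_counts.items() if n > per_network_limit)
    return hot_ips, hot_networks


# Sample log: one noisy IP, one range spreading requests over many IPs, one normal reader.
sample_log = (
    ["203.0.113.7"] * 50
    + [f"198.51.100.{i}" for i in range(1, 31)] * 2
    + ["192.0.2.10"] * 3
)
hot_ips, hot_networks = flag_surges(sample_log, per_ip_limit=20, per_network_limit=40)
```

Grouping by network catches crawlers that spread requests across neighboring addresses so that no single IP stands out. Running such a check over a sliding time window, rather than a whole day of logs, is one way to approach the real-time tracking problem.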

Given the difficulty of determining the source of bots, publishers like The Inquirer are wary of apparent bot activity. In fact, URLs from The Inquirer were identified in a dataset published by The Washington Post, which exposed websites used to train AI chatbots.