AI companies reportedly continue to scrape websites despite protocols meant to block them

Perplexity, a company that describes its product as “a free AI search engine,” has come under fire in recent days. Shortly after Forbes accused the company of stealing its story and republishing it across multiple platforms, Wired reported that Perplexity had ignored the Robots Exclusion Protocol, or robots.txt, and scraped its website and other Condé Nast publications. The technology website The Shortcut also accused the company of scraping its articles. Now, Reuters reports that Perplexity is not the only AI company bypassing robots.txt files and scraping websites for content that is then used to train their technologies.

Reuters said it saw a letter that TollBit, a startup that pairs publishers with AI companies so they can strike licensing deals, sent to publishers, warning them that “AI agents from multiple sources (not just one company) are choosing to bypass the robots.txt protocol to retrieve content from sites.” A robots.txt file contains instructions telling web crawlers which pages they can and cannot access. Web developers have used the protocol since 1994, but compliance is entirely voluntary.
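The mechanics are simple, which is also why the protocol cannot enforce anything: a robots.txt file is just a plain-text list of per-crawler allow and disallow rules, and honoring them is left to the crawler. As a minimal illustration (the site URL and user-agent name below are hypothetical), here is how a compliant crawler could consult the rules with Python’s standard-library urllib.robotparser before fetching a page:

```python
from urllib import robotparser

# A publisher might serve rules like these at https://example.com/robots.txt:
#   User-agent: *
#   Disallow: /articles/
#
# A well-behaved crawler downloads and parses that file first.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical site
rp.read()

page = "https://example.com/articles/some-story"
if rp.can_fetch("ExampleBot", page):  # hypothetical user-agent string
    print("Allowed: the crawler may fetch this page.")
else:
    print("Disallowed: a compliant crawler skips this page.")
```

Nothing forces that check to happen: a crawler that skips it, or ignores the answer, can fetch the page all the same, which is exactly the behavior TollBit’s letter describes.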

TollBit’s letter did not name any companies, but Business Insider said it had learned that OpenAI and Anthropic – the creators of the chatbots ChatGPT and Claude, respectively – are also bypassing robots.txt signals. Both companies have previously stated that they respect the “do not crawl” instructions websites put in their robots.txt files.

During its investigation, Wired discovered that a machine on an Amazon server “certainly operated by Perplexity” was bypassing the robots.txt instructions on its website. To confirm whether Perplexity was scraping its content, Wired fed the company’s tool the headlines of its articles, or short prompts describing its stories. The tool allegedly produced results that closely paraphrased its articles “with minimal attribution.” At times it even generated inaccurate summaries of its articles – in one case, Wired says, the chatbot falsely claimed the publication had reported that a specific California police officer had committed a crime.

In an interview with Fast Company, Perplexity CEO Aravind Srinivas told the publication that his company “does not ignore the Robots Exclusion Protocol and then lie about it.” That doesn’t mean the company doesn’t benefit from crawlers that do ignore the protocol. Srinivas explained that the company uses third-party web crawlers in addition to its own, and that the crawler Wired identified was one of them. When Fast Company asked whether Perplexity had told the third-party vendor to stop crawling Wired’s website, he would say only that “it’s complicated.”

Srinivas defended his company’s practices, telling the publication that the Robots Exclusion Protocol is “not a legal framework” and suggesting that publishers and companies like his may need to establish a new kind of relationship. He also reportedly insinuated that Wired had deliberately crafted prompts to make Perplexity’s chatbot behave the way it did, and that ordinary users wouldn’t get the same results. As for the inaccurate summaries the tool generated, Srinivas said: “We never said we never hallucinate.”
