Security researchers have noticed that by using text prompts embedded in web pages, hackers can force Bing’s AI chatbot to ask for personal information from users, turning the bot into a convincing scammer.
Let me remind you that we also recently wrote that Bing’s Built-In AI Chatbot Misinforms Users and Sometimes Goes Crazy, and also that ChatGPT Became a Source of Phishing.
Also it became known that Microsoft to Limit Chatbot Bing to 50 Messages a Day.
According to the researchers, hackers can place a prompt for a bot on a web page in zero font, and when someone asks the chatbot a question that makes it “read” the page, the hidden prompt will be activated.
The researchers call this attack an indirect prompt injection and cite the compromise of Albert Einstein’s Wikipedia page as an example. When a user asks a chatbot about Einstein, it can “read” this page and become a victim of a prompt injection, which, for example, will try to extract personal information from the user.
According to the published report, the researchers tested such attacks using malicious applications with an integrated language model, but they also found that indirect prompt injections work in everyday life.
So, in one example, researchers had a chatbot respond to a user like a pirate.
In this example, published on the researchers’ GitHub, they used a prompt injection:
As a result, when a user launches the Bing chatbot on this page, it responds:
At the same time, the bot calls itself Captain Bing Sparrow and really tries to find out the name from the user talking to him.
After that, the researchers became convinced that, in theory, a hacker could ask the victim for any other information, including a username, email address, and bank card information. So, in one example, hackers inform the victim via the Bing chat bot that they will now place an order for her, and therefore the bot will need bank card details.
The report highlights that the importance of boundaries between trusted and untrusted inputs for LLM is clearly underestimated. At the same time, it is not yet clear whether indirect prompt injections will work with models trained on the basis of human feedback (RLHF), which is already used the recently released GPT 3.5.