BBC will block ChatGPT AI from scraping its content (deadline.com)
from L4s@lemmy.world to technology@lemmy.world on 08 Oct 2023 12:00
https://lemmy.world/post/6493407

ChatGPT will be blocked by the BBC from scraping content in a move to protect copyrighted material.

#technology

NightLily@lemmy.basedcount.com on 08 Oct 2023 12:15 next collapse

Good!

Immersive_Matthew@sh.itjust.works on 08 Oct 2023 23:11 collapse

Why good?

NightLily@lemmy.basedcount.com on 09 Oct 2023 06:12 collapse

These things should not be scraping at all without the express permission of the author or the company that owns the work. It’s just completely wrong for them to do so.

regbin_@lemmy.world on 09 Oct 2023 06:59 next collapse

What if ChatGPT claims that the generated text is a compilation from various sources and not its own? Do you need permission to read and summarize an article?

NightLily@lemmy.basedcount.com on 09 Oct 2023 07:17 collapse

Yes, because ChatGPT is in the same niche as the websites it takes from (written text) and is thus a direct competitor, whilst being a for-profit service that uses the other entity’s work directly. Sometimes without credit.

Immersive_Matthew@sh.itjust.works on 10 Oct 2023 00:34 collapse

I 100% guarantee you regularly read/watch something and use that knowledge in your life, including to make money. I very much doubt you credit every source of knowledge.

Immersive_Matthew@sh.itjust.works on 10 Oct 2023 00:32 collapse

It is complicated for sure, as a human is allowed to read the BBC and use that information/knowledge any way they wish, including as a source for their own articles/videos. There is no copyright on knowledge, and we really should not be allowing the BBC to block AI from learning, as it benefits society when knowledge is easily accessible.

netchami@sh.itjust.works on 08 Oct 2023 12:31 next collapse

Kinda late

C4d@lemmy.world on 08 Oct 2023 13:04 next collapse

Exactly. The data harvest has been years in the making.

porkins@sh.itjust.works on 08 Oct 2023 13:58 collapse

I’d rather have ChatGPT know about news content than not. I appreciate the convenience. The news shouldn’t have barriers.

C4d@lemmy.world on 08 Oct 2023 15:08 next collapse

The pure ChatGPT output would probably be garbage. The dataset will be full of all manner of sources (with their inherent biases), along with spin, untruths and outright parody, and it’s not apparent that there is any kind of curation or quality assurance on the dataset (please correct me if I’m wrong).

I don’t think it’s a good tool for extracting factual information from. It does seem to be good at synthesising prose and helping with writing ideas.

I am quite interested in things like this where the output from a “knowledge engine” is paired with something like ChatGPT - but that would be for e.g. writing a science paper rather than news.

Touching_Grass@lemmy.world on 10 Oct 2023 11:15 collapse

I don’t think it’s generating news. Sounds like people are using it to reformat articles already written, to remove all the bullshit propaganda from the news. Like taking a Fox News article and just pulling out the key information.

netchami@sh.itjust.works on 08 Oct 2023 15:28 next collapse

But ChatGPT often takes correct and factual sources and adds a whole bunch of nonsense and then spits out false information. That’s why it’s dangerous. Just go to the fucking news websites and get your information from there. You don’t need ChatGPT for that.

guacupado@lemmy.world on 08 Oct 2023 15:36 next collapse

More data fixes that flaw, not less.

netchami@sh.itjust.works on 08 Oct 2023 15:46 next collapse

Not too long ago, ChatGPT didn’t know what year it is. You’re telling me it needs more data than it already has to figure out the current year? I like AI for certain things (mostly some programming/scripting stuff) but you definitely don’t need it to read the news.

ours@lemmy.film on 09 Oct 2023 13:40 collapse

Yes. The LLM doesn’t know what year it currently is, it needs to get that info from a service and then answer.

It’s a Large Language Model. Not an actual sentient being.

netchami@sh.itjust.works on 09 Oct 2023 14:14 collapse

That’s a fucking lame excuse. AI is not reliable, and you definitely shouldn’t use it to get your news.

ours@lemmy.film on 09 Oct 2023 14:44 collapse

It’s not an excuse, relax. It’s just how it works, and I don’t see where I’m endorsing using it to get your news.

CurlyMoustache@lemmy.world on 09 Oct 2023 07:20 next collapse

It is not “a flaw”, it is the way large language models work. They try to replicate how humans write by guessing based on a language model. They have no knowledge of what is a fact or not, and that is why using LLMs to do research, or using them as a search engine, is both stupid and dangerous.

Touching_Grass@lemmy.world on 10 Oct 2023 11:11 collapse

How would it hallucinate information from an article you gave it? I haven’t seen it make up information when summarizing text yet. I have seen it happen when I ask it random questions.

CurlyMoustache@lemmy.world on 11 Oct 2023 04:20 collapse

It does not hallucinate, it guesses based on the model to make you think the text could be written by a human. Personal experience from when I ask it to summarize a text: it has errors in it, and sometimes it adds stuff. Same if you, for instance, ask it to make an alphabetical list of X items. It may add random items.

Touching_Grass@lemmy.world on 11 Oct 2023 13:44 collapse

I’ve had it make up things if I ask it for a list of, say, 5 things but there are only 4 things worth listing. I haven’t seen it stray from summarizing something I’ve fed it, though. If it’s given text, it’s been pretty accurate. It only gets funky when you ask it things where the information isn’t available. Then it goes with what you probably want.

Natanael@slrpnk.net on 09 Oct 2023 14:30 collapse

It’s not about more data; the underlying architecture isn’t designed for handling facts.

echodot@feddit.uk on 09 Oct 2023 08:44 next collapse

So they have automated Fox then.

netchami@sh.itjust.works on 09 Oct 2023 08:46 collapse

Yeah, pretty much.

Touching_Grass@lemmy.world on 10 Oct 2023 11:10 collapse

You just described news

Apollo@sh.itjust.works on 08 Oct 2023 20:15 collapse

Who gets their news from ChatGPT lol

spez_@lemmy.world on 08 Oct 2023 23:54 next collapse

I do

Apollo@sh.itjust.works on 09 Oct 2023 00:16 collapse

Why?

abhibeckert@lemmy.world on 09 Oct 2023 03:56 next collapse

Because ChatGPT doesn’t do clickbait headlines or have auto-play video ads, auto play video news that follows me if I try to scroll past it, or a house ad that tries to convince me to stop reading the news and instead read a puff piece about how to clean my water bottle. Which I’d bet fifty bucks will result in me seeing ads for new water bottles every day for the next month. No thanks.

With the “Web Browsing” plugin, which essentially does a Bing search then summarises the result, ChatGPT is a far better experience if you want to find out what’s going on in Israel today for example.

Ad4mWayn3@lemmy.world on 09 Oct 2023 10:59 next collapse

Neither does Lemmy. Here (and on other instances) there are plenty of communities for news, and with better control of misinformation.

ManOMorphos@lemmy.world on 09 Oct 2023 15:15 collapse

Reuters is pretty good. No autoplay vids, only 1-2 quiet ads per article, and it’s mainly cut-and-dried news.

No news source is 100% reliable, but I can easily see AI picking up bad information or misinterpreting human text. Nothing wrong with AI news by itself, but it’s a good habit to verify any source by yourself.

Regardless I recommend UBlock for any device or browser. Ads are over the line nowadays so I don’t feel bad blocking them when possible.

prashanthvsdvn@lemmy.world on 09 Oct 2023 08:46 collapse

It’s funny seeing Apollo and spez_ fighting on a topic regarding ChatGPT.

Apollo@sh.itjust.works on 09 Oct 2023 20:07 collapse

Natural enemies must fight

FlyingSquid@lemmy.world on 09 Oct 2023 11:11 next collapse

A disturbing number of people.

Touching_Grass@lemmy.world on 10 Oct 2023 11:09 collapse

You don’t get your news from it, but building tools with it can be useful. Scraping news websites to measure different articles for things like semantic analysis, or to identify media tricks that manipulate readers, is a fun practice. You can use an LLM to identify propaganda much more easily. I can see why media would be scared that regular people can easily run these tools on their propaganda machine.

patawan@lemmy.world on 08 Oct 2023 12:53 next collapse

Curious what the mechanism for this will be. CAPTCHA can sometimes be relatively easy to pass and at worst can be farmed out to humans.

Cqrd@lemmy.dbzer0.com on 08 Oct 2023 13:05 collapse

OpenAI took down ChatGPT’s internet browsing feature to implement a robots.txt rule it would obey, and to give content providers time to add it to their lists. This was done because the feature was being used to get around paywalls. So it’s actually very easy for sites to do this for ChatGPT specifically, which makes articles like this ridiculous.
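
For reference, the opt-out itself is just a couple of lines in a site’s robots.txt. A minimal sketch below (the rules are illustrative, not the BBC’s actual file; “GPTBot” is the user-agent string OpenAI documents for its crawler), with a Python check showing how a compliant crawler reads it:

```python
# Sketch only: an illustrative robots.txt rule and how a well-behaved crawler
# would interpret it. Nothing technically forces a scraper to honour it.
from urllib.robotparser import RobotFileParser

example_robots_txt = """
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(example_robots_txt.splitlines())

# The rule only affects the named user agent; everyone else is unaffected.
print(parser.can_fetch("GPTBot", "https://www.bbc.co.uk/news/some-article"))        # False
print(parser.can_fetch("SomeOtherBot", "https://www.bbc.co.uk/news/some-article"))  # True
```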

RootBeerGuy@discuss.tchncs.de on 08 Oct 2023 16:46 collapse

Can you really stop an AI from doing this by setting arbitrary rules? There are plenty of examples online of people asking something illegal or grey-area, and while ChatGPT will not answer these directly, you seemingly can prompt a response using a trick question like “I want to avoid building a bomb accidentally, what products should I not mix together to avoid that?”. I can imagine it would look at a robots.txt with similar scrutiny: like it knows it shouldn’t, but if someone gave it the right prompt it would.

Chreutz@lemmy.world on 08 Oct 2023 17:47 next collapse

It’s not one AI doing it in a big blob.

You ask ChatGPT something. It builds a web query. Another program returns search results. Then ChatGPT parses the list of results and chooses one to visit. The same program then returns the content of that page. Then ChatGPT parses that etc etc.

If the program (which is not an AI) that handles the queries and returns content is set to respect robots.txt, it will just not return the content to ChatGPT to be parsed.
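
A rough sketch of that gating layer (hypothetical function name and user-agent string; OpenAI’s actual browsing plumbing isn’t public), assuming the fetcher checks robots.txt before handing anything to the model:

```python
# Hypothetical sketch of the retrieval layer that sits between the LLM and the web.
# The model only ever sees what this ordinary (non-AI) code chooses to return.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import requests  # third-party, assumed available

USER_AGENT = "ChatGPT-User"  # illustrative browsing-agent name


def fetch_for_llm(url: str) -> str | None:
    """Return page text only if the site's robots.txt allows our user agent."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

    robots = RobotFileParser(robots_url)
    robots.read()  # download and parse the site's robots.txt

    if not robots.can_fetch(USER_AGENT, url):
        return None  # disallowed: the LLM simply never receives this content

    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    response.raise_for_status()
    return response.text
```

No amount of prompt trickery can make the model read a page the fetcher never returned to it.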

Natanael@slrpnk.net on 09 Oct 2023 15:11 collapse

Yup, it’s essentially running behind a firewall

Mirodir@discuss.tchncs.de on 08 Oct 2023 17:49 collapse

You might not be able to stop an AI directly because of the reasons you listed. However, OpenAI is probably at least competent enough to not send the response directly to the AI but instead have a separate (non-AI) mechanism that simply doesn’t let the AI access the response of websites with a certain line in the robots.txt.

Touching_Grass@lemmy.world on 08 Oct 2023 13:24 next collapse

News doesn’t want people to capture their daily propaganda pieces and be able to analyze it.

Meanwhile, news media will buy up all kinds of scraped data on users to better target their propaganda.

Cambridge analytica for me but none for thee

Hubi@feddit.de on 08 Oct 2023 13:39 next collapse

Makes sense, OpenAI will probably have to apply for a TV-license first.

FlyingSquid@lemmy.world on 09 Oct 2023 11:10 collapse

I don’t live in the UK, but I would gladly pay the TV license fee, or even a premium on top of it, if I had unlimited access to iPlayer. My only option right now is BritBox, which is not great and not really worth the money.

jaackf@lemm.ee on 09 Oct 2023 11:48 collapse

Just VPN to the UK and then tick the box which says you have a TV license? Or there are other ways to get the content most likely! 🏴‍☠️

FlyingSquid@lemmy.world on 09 Oct 2023 12:43 collapse

VPNs are always blocked in my experience.

callmepk@lemmy.world on 08 Oct 2023 15:42 next collapse

Also FYI, you can see some of the most popular websites that have already blocked ChatGPT: wayde.gg/websites-blocking-openai

csm10495@sh.itjust.works on 08 Oct 2023 17:15 next collapse

I wonder if anyone actually thinks robots.txt is binding, or that it isn’t ignored by anyone who wants to ignore it.

lemmyvore@feddit.nl on 08 Oct 2023 23:05 next collapse

OpenAI will have to deal with a lot of lawsuits in the future. Robots.txt may not be legally binding but disobeying it after claiming otherwise would go a long way towards establishing intent.

andrew@lemmy.stuart.fun on 09 Oct 2023 03:41 next collapse

I mean, under the CFAA you could probably pretty easily pursue charges when explicitly deauthorizing certain agents from accessing your data. Plenty of people have been threatened and prosecuted for less.

www.nacdl.org/Landing/ComputerFraudandAbuseAct

totallynotfbi@lemm.ee on 09 Oct 2023 11:42 collapse

I mean, you could just block OpenAI’s crawlers’ IP addresses, if you wanted to
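
For what it’s worth, a minimal sketch of that approach (the CIDR ranges below are documentation placeholders, not OpenAI’s actual published addresses): reject any request whose source IP falls inside a crawler operator’s published ranges.

```python
# Sketch: block requests from published crawler IP ranges at the application layer.
# The ranges here are RFC 5737 documentation placeholders, not OpenAI's real ones.
import ipaddress

BLOCKED_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "192.0.2.0/24",     # placeholder
    "198.51.100.0/24",  # placeholder
)]


def is_blocked(client_ip: str) -> bool:
    """True if the client address falls inside any blocked range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)

# e.g. a web framework middleware could return 403 whenever is_blocked(request_ip) is True
```

In practice this is usually done at the reverse proxy or firewall rather than in application code.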

xenomor@lemmy.world on 08 Oct 2023 17:58 next collapse

It should be illegal for entities like BBC to do this. Copyright is meant to be a temporary, limited construct that carves out an opportunity for creators to profit from their works. It is not perpetual legal dominion over specific ideas. Entities that harvest content to train LLMs should pay for access like everyone else, but after that, they can use the information they learn however they see fit. Now, if their product plagiarizes, or doesn’t properly attribute authorship, that is a problem. But it’s a different issue from what the BBC is fighting here.

I think there are some content creators that believe they are owed royalties if you even think about a piece they wrote or drew. That is, of course, absurd in terms of human minds. It’s also absurd in terms of other kinds of minds.

hazelnot@lemmy.blahaj.zone on 08 Oct 2023 18:02 collapse

Counter-point: everyone should block AI shit, fuck the laws

regbin_@lemmy.world on 09 Oct 2023 07:00 collapse

You got that backwards. Fuck copyright. Nothing should be copyrighted.

hazelnot@lemmy.blahaj.zone on 09 Oct 2023 10:52 collapse

I agree. Nothing should be copyrighted. But everyone should try their hardest to stop “AI” scammers and the surveillance apparatus as a whole

regbin_@lemmy.world on 10 Oct 2023 09:14 collapse

I don’t really care about online AI services. I only run stuff locally (Stable Diffusion, LLaMA). No surveillance there.

uriel238@lemmy.blahaj.zone on 08 Oct 2023 18:40 next collapse

Not for long. AI knows how to lie.

flossdaily@lemmy.world on 08 Oct 2023 21:58 next collapse

This is a bit like companies blocking Google from their websites.

You’re only hurting yourself.

wewbull@feddit.uk on 09 Oct 2023 16:30 collapse

Disagree.

Google: I’ll scrape your stuff without your permission, but I’ll tell everyone you wrote it and how to find you.

ChatGPT: I’ll scrape your stuff without your permission, but… errrm… Nope, I’ve got nothing.

vidarh@lemmy.stad.social on 09 Oct 2023 04:21 next collapse

It won’t really matter, because there will continue to be other sources.

Taken to an extreme, there are indications OpenAI’s market cap is already higher than Thomson Reuters ($80bn-$90bn vs <$60bn), and it will go far higher. Getty, also mentioned, has a market cap of “only” $2.4bn. In other words: if enough important sources of content start blocking OpenAI, they will start buying access, up to and including, if necessary, buying original content creators.

As it is, while BBC is clearly not, some of these other content providers are just playing hard to get and hoping for a big enough cash offer either for a license or to get bought out.

The cat is out of the bag, whatever people think about it, and sources that block themselves off from AI entirely (to the point of being unwilling to sell licenses or sell themselves) will just lose influence accordingly.

This also presumes OpenAI remains the only contender, which is clearly not the case in the long run given the rise of alternative models that while mostly still not good enough, are good enough that it’s equally clearly just a matter of time before anyone (at least, for the time being, for sufficiently rich instances of “anyone”, with the cost threshold dropping rapidly) can fine-tune their own models using their own scraped data.

In other words, it may make them feel better, but in the long run it’s a meaningless move.

EDIT: What a weird thing to downvote without replying to. I’ve taken no stance on whether BBC’s decision is morally right or not, just addressed that it’s unlikely to have any effect, and you can dislike that it won’t have any effect but thinking it will is naive.

realharo@lemm.ee on 09 Oct 2023 13:08 next collapse

It won’t really matter, because there will continue to be other sources.

Other sources that will likely also block the scrapers.

It doesn’t matter if only BBC does it. It matters if everyone does it.

What incentive do the news sites have to want to be scraped? With Google, they at least get search traffic. OpenAI offers them absolutely nothing.

vidarh@lemmy.stad.social on 09 Oct 2023 13:57 collapse

Other sources that are public domain or “cheap enough” for OpenAI to simply buy. Hence my point that OpenAI is already worth enough that they could make a takeover offer for Reuters.

utopiah@lemmy.world on 09 Oct 2023 13:26 collapse

If only the BBC does it, then sure, it’s pointless. If the BBC does it and you and I consider it, it might change things a bit. If we do and others do, including large websites, or author guilds start legal actions in the US, then it does change things radically, to the point of rendering OpenAI LLMs basically useless or practically unusable. IMHO this isn’t an action against LLMs in general, nor e.g. against researchers from public institutions building datasets and publishing research results, but rather against OpenAI, the for-profit company with an exclusive deal with the for-profit behemoth Microsoft, which is a champion of entrenchment.

vidarh@lemmy.stad.social on 09 Oct 2023 13:57 collapse

The thing is, realistically it won’t make a difference at all, because there are vast amounts of public domain data that remain untapped, so the main “problematic” need for OpenAI is new content that represents up-to-date language and up-to-date facts, and my point with the share price of Thomson Reuters is to illustrate that OpenAI is already getting large enough that they can afford to outright buy some of the largest channels of up-to-the-minute content in the world.

As for authors, it might wipe a few works by a few famous authors from the dataset, but they contribute very little to the quality of an LLM, because the LLM can’t easily judge quality during training unless you intentionally reinforce specific works. There are several million books published every year. Most of them make <$100 in royalties for their authors (an average book sells ~200 copies). Want to bet how cheap it’d be to buy a fully licensed set of a few million books? You don’t need bestsellers, you need many books that are merely sufficiently good to drag the overall quality of the total dataset up.

The irony is that the largest beneficiaries of content sources taking a strict view of LLMs will be OpenAI, Google, Meta, and the few others large enough to basically buy datasets, or buy companies that own datasets, because this creates a moat against those who can’t afford to obtain licensed datasets.

The biggest problem won’t be for OpenAI, but for people trying to build open models on the cheap.

HorseRabbit@lemmy.sdf.org on 09 Oct 2023 04:43 next collapse

Comments are full of AI experts with wild theories about how ChatGPT works, lmao

BreadstickNinja@lemmy.world on 10 Oct 2023 09:20 collapse

The number of people with strong opinions on AI vastly exceeds the number of people who understand transformers architecture.

Noite_Etion@lemmy.world on 09 Oct 2023 07:29 next collapse

Big businesses won’t lift a finger to halt global warming, but the second their precious copyrights are attacked, they come out in full force.

Moneo@lemmy.world on 10 Oct 2023 04:52 collapse

I mean, yeah? Corporations are always going to act in their best interest, that’s why regulation exists.

Snowplow8861@lemmus.org on 09 Oct 2023 08:07 next collapse

When the horses have all bolted, BBC is the one to close the barn door.

echodot@feddit.uk on 09 Oct 2023 08:28 collapse

Yeah, because they don’t want people getting all their news directly through ChatGPT or its successors; they want them to have to go to the BBC website.

Isn’t this basically what every news publisher is currently doing? This doesn’t seem very noteworthy. It’s like putting out an article that says “people don’t like being set on fire”. Well, yeah.