AI chatbots unable to accurately summarise news, BBC finds

@[email protected] · 1 day ago

paraphrand · 1 day ago (edited)

I don’t think giving the temperature knob to end users is the answer.

Turning it to max for max correctness and low creativity won’t work in an intuitive way.

Sure, turning it down from the balanced middle value will make it more “creative” and unexpected, and this is useful for idea generation, etc. But a knob that goes from “good” to “sort of off the rails, but in a good way” isn’t a great user experience for most people.

Most people understand this stuff as intended to be intelligent. Correct. Etc. Or they at least understand that’s the goal. Once you give them a knob to adjust the “intelligence level,” you’ll have more pushback on these things not meeting their goals. “I clearly had it in factual/correct/intelligent mode. Not creativity mode. I don’t understand why it left out these facts and invented a back story to this small thing mentioned…”

Not everyone is an engineer. Temp is an obtuse thing.

But you do have a point about presenting these as cloud genies that will do spectacular things for you. This is not a great way to be executing this as a product.

I loathe how these things are advertised by Apple, Google and Microsoft.

@[email protected] · 1 day ago (edited)

Temperature isn’t even “creativity” per se, it’s more a band-aid to patch looping and dryness in long responses.
Lower temperature is much better with modern sampling algorithms, e.g., MinP, DRY, maybe dynamic temperature like mirostat and such. Ideally, structured output, too. Unfortunately, corporate APIs usually don’t offer this.
It can be mitigated with finetuning against looping/repetition/slop, but most models are the opposite, massively overtuning on their own output which “inbreeds” the model.
And yes, domain-specific queries are best. Basically the user needs separate prompt boxes for coding, summaries, creative suggestions and such, each with their own tuned settings (and ideally tuned models). You are right, this is a much better idea than offering a temperature knob to the user, but… most UIs don’t even do this for some reason?
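
For anyone who hasn't poked at samplers, here is a rough sketch of what temperature and a MinP-style cutoff do to a token distribution. The logits are made up, the function names are mine, and real backends differ in details such as the order the samplers run in (this sketch applies temperature first, which is an assumption, not a rule).

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * (top probability), then renormalize the rest."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# Made-up scores for four candidate tokens (illustration only).
logits = [4.0, 3.0, 1.0, -2.0]

for t in (0.3, 1.0, 1.5):
    probs = min_p_filter(softmax_with_temperature(logits, t), min_p=0.1)
    print(t, [round(p, 3) for p in probs])
```

The cutoff scales with the top token's probability, so the unlikely tail gets dropped whether the distribution is sharp or flat. That is why a low temperature plus a filter like this tends to behave better than cranking temperature alone.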

What I am getting at is this is not a problem companies seem interested in solving. They want to treat the users as idiots without the attention span to even categorize their question.

@[email protected] · 1 day ago

This is really a non-issue, as the LLM itself should have no problem setting a reasonable value itself. User wants a summary? Obviously maximum factual. He wants gaming ideas? Etc.
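
Something like the following is what I have in mind, as a minimal sketch. The keyword matcher is only a stand-in for whatever does the categorizing (the LLM itself or a small classifier), and every name and number here is hypothetical, not a recommended setting for any model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingPreset:
    temperature: float
    min_p: float

# Hypothetical values for illustration only; tune per model and task.
PRESETS = {
    "summary": SamplingPreset(temperature=0.2, min_p=0.1),
    "coding": SamplingPreset(temperature=0.3, min_p=0.1),
    "creative": SamplingPreset(temperature=1.0, min_p=0.05),
}
DEFAULT = SamplingPreset(temperature=0.6, min_p=0.05)

# Crude keyword lists standing in for a real classifier (assumption).
KEYWORDS = {
    "summary": ("summarize", "summarise", "summary", "tl;dr"),
    "coding": ("code", "function", "bug", "compile"),
    "creative": ("idea", "brainstorm", "story"),
}

def categorize(request: str) -> str:
    """Return the first category whose keywords appear in the request."""
    text = request.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "default"

def pick_preset(request: str) -> SamplingPreset:
    """Map a request to sampling settings, falling back to DEFAULT."""
    return PRESETS.get(categorize(request), DEFAULT)

print(pick_preset("Summarise this news article for me"))
print(pick_preset("Give me some gaming ideas"))
```

The point is only that the user never sees a knob: the request picks the settings.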

@[email protected] · 1 day ago (edited)

For local LLMs, this is an issue because it breaks your prompt cache and slows things down, without a specific tiny model to “categorize” text… which few have really worked on.

I don’t think the corporate APIs or UIs even do this. You are not wrong, but it’s just not done for some reason.

It could be that the trainers don’t realize it’s an issue. For instance, “0.5-0.7” is the recommended range for DeepSeek R1, but I find much lower or slightly higher is far better, depending on the category and other sampling parameters.