Well, that’s awesome.

  • @[email protected]
    26
    2 months ago

    The problem is that LLMs aren’t human speech and any dataset that includes them cannot be an accurate representation of human speech.

    It’s not “LLMs convinced humans to use ‘delve’ a lot”. It’s “this dataset is muddy as hell because a huge proportion of it is randomly generated noise”.

    • @[email protected]
-7
2 months ago

      What is “human speech”? Again, so many people (around the world) have picked up idioms and speaking cadences based on the media they consume. A great example is that two of my best friends are from the UK but have been in the US long enough that their families make fun of them. Yet their kid actually pronounces it “al-you-min-ee-uhm” even though they both say “al-ooh-min-um”. Why? Because he watches a cartoon where they pronounce it the British way.

      And I already referenced SoCal-ification, which is heavily based on screenwriters and actors who live in LA. Again, do we not speak “human speech” because it was artificially influenced?

      Like, yeah, LLMs are “tainted” with the word “delve” (which I am pretty sure comes from YouTube scripts anyway but…). So are people. There is a lot of value in researching WHY a given word or idiom becomes so popular but, at the end of the day… people be saying “delve” a lot.

      • @[email protected]
        10
        2 months ago

        Speech written by a human. It’s not complicated.

        It cannot possibly be human speech if it was produced by a machine.