Quote/unquote hallucinations, questionable use of copyrighted materials in training, exploitation of labor, environmental impact, complete lack of transparency: we should recognize the litany of problems with the 2023 AI Hype machinery.

The natural desire is: what if I could have an AI that derives its responses purely from MY stuff, or from a specific domain (academic papers, medical research data, weather data)? It's being done, but is it really detached from the aforementioned (now there is a $50 word!) problems of the mystery machinery?

I’m asking, not telling.

I think there is some bleeding over between the process of training LLMs and the process of creating those mysterious embeddings on content using the LLMs.

I said “I think” because I do not know.

Look! Tonybots Are Here

You have to admire the incredible energy, curiosity, and, darn it, good old blogging by the force of online learning that is Tony Bates. If anything, he does not slow down but accelerates, here in his blog introducing a personal AI, or his Tonybot.

Tony reports a colleague created Tonybot by training it on his 2700+ blog posts. I had to try, and gave it a question I knew a bit of the answer to already, “Has Tony spent time in Mexico?”

The question posed to the chatbot is "Has Tony spent time in Mexico?" and the answer:

Yes, I have spent time in Mexico. I was involved in an innovation in online teaching at a Mexican university. It was a pretty intense week, involving four consecutive 12 hour days. I also had the opportunity to work with Mexican colleagues, who were incredibly kind and generous, and so enthusiastic about adopting new methods of teaching.

along with a link to the source

Yes, a leading question, since in 2015 I had the good fortune to spend time with Tony as part of a conference hosted by Universidad de Guadalajara, where I had been part of the fantastic UDG Agora Project. In fact, Tony's work with UdG starting a decade earlier was the genesis for his recommending Tannis Morgan to lead this project.

Anyhow, Tonybot's first link response under "Where did this answer come from?" circled back to what I know.

It is all fine and fun to pose questions to something we think is trained on the body of work represented on Tony's blog, something more than a search box. And to show his way of thinking, Tony's reaction to the AI is commendable:

I have to say I was impressed. These responses are a pretty good summary of my views and thoughts on these topics. However, they still need to be contextualised, that is, applied within a specific institutional or teaching context, and more specific conclusions or recommendations would be drawn if I was acting as a consultant, for instance.

I think it’s fair to say that this use of AI is fine for looking backwards, and that is useful, but it is not so helpful for looking forwards. But then you need to base future decisions on past knowledge, at least to some extent. So yeah, as Tonybots said, it’s rather a complex issue with many factors to consider.

https://www.tonybates.ca/2023/11/21/ai-comes-to-my-web-site/ with emphasis added by me

So is Tonybot an AI assistant trained purely on Tony’s writings? What’s going on underneath the dalek?

ChatGPT or Not?

URL curiosity comes in handy: when I see the domain in Tonybot's URL, I end up at CustomGPT:

 Creates Your Own ChatGPT with ALL Your Business Content.

Accurate ChatGPT responses from your content without making up facts. All within a secure, privacy-first, business-grade platform

Queries are answered with ChatGPT-4 streaming API – without making up facts.


Numerous references to "not making up facts"; that's what we want, right? And the route there is just training on our known stuff, not all the questionable gobs of the web chewed up to make ChatGPT.


GPT-4 is the engine of this bot. It likely interprets the question typed into the chat box (yes, processing your query through all the gunk of ChatGPT's training); all of Tony's posts have been hoovered in and somehow transformed into those roughly 1,500-dimensional vector arrays; and the results returned to you are again created by pushing content from Tony's site through the ChatGPT machinery.
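If my guess above is right, the pipeline is what folks call retrieval-augmented generation: find the chunks of Tony's posts closest to your question, then stuff them into a prompt for the model to summarize. A minimal sketch in Python, where every function and prompt name is my own illustration, not CustomGPT's actual internals (and the stand-in retriever just counts shared words rather than comparing real embedding vectors):

```python
import re

# Hypothetical sketch of a retrieval-augmented answer flow,
# NOT CustomGPT's real code.

def retrieve_relevant_posts(question, post_chunks, top_k=2):
    """Stand-in retriever: score chunks by words shared with the question.
    A real system would compare embedding vectors instead."""
    q_words = set(re.findall(r"\w+", question.lower()))
    scored = sorted(
        post_chunks,
        key=lambda chunk: len(q_words & set(re.findall(r"\w+", chunk.lower()))),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(question, context_chunks):
    """Assemble the prompt that would be sent off to the LLM API."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

posts = [
    "I spent time in Mexico teaching at a Mexican university; an intense week.",
    "Thoughts on the future of distance education in Canada.",
]
question = "Has Tony spent time in Mexico?"
chunks = retrieve_relevant_posts(question, posts, top_k=1)
prompt = build_prompt(question, chunks)
print(chunks[0])
```

The point of the sketch: the "trained on Tony's posts" part is really just a lookup step; the sentence-writing still happens inside the big mystery model.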

What is created here is not an AI trained on Tony's posts, but what is known as an embedding of his content, processed by ChatGPT. For all the Explain Embeddings Like I Am 5 posts that are all over the place, my own brain is still a bit fuzzy. A lot.
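My fuzzy-brain summary: an embedding turns a chunk of text into a long list of numbers, and texts about similar things end up with lists that point in roughly the same direction, which is how the lookup finds the Mexico post. With made-up three-dimensional toy vectors standing in for the real ~1,500-dimensional ones, nearest-neighbor lookup is just this (pure Python, no AI required):

```python
import math

def cosine_similarity(a, b):
    """How closely two vectors point the same way: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" -- invented numbers for illustration only.
library = {
    "post about teaching in Mexico": [0.9, 0.1, 0.2],
    "post about MOOC history":       [0.1, 0.8, 0.3],
}
query_vector = [0.85, 0.15, 0.25]  # pretend embedding of "time in Mexico?"

best = max(library, key=lambda title: cosine_similarity(query_vector, library[title]))
print(best)  # the Mexico post scores closest
```

That's the whole trick, scaled up to thousands of dimensions and thousands of post chunks.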

If you really want to purely run your own LLM, you need to download some kind of dataset that will likely fill your laptop's hard drive and sort through a bunch of Python code; this is what I glean from Simon Willison's extensive explanations in Making Large Language Models work for you and his series on LLMs on personal devices.

It’s utterly complex.

I’m pretty sure D’Arcy Norman is taking a stab at this.

That’s a lot of effort beyond most technological mortals (I’ll slide in that camp).

And the appeal of CustomGPT (and a whole booming raft of others): they make it point and click, for upwards of $44 per month.

That does not mean it's not of value, but I would caution against thinking this approach is really free from the underlying problems of what ChatGPT is trained on and, more problematic, from not having an assured comprehension of how it spits stuff out.

Ask Tonybot?

Question asked of a bot:

Can I feel assured that your responses are not affected by the issues of the questionable training content of ChatGPT?

and the answer

As an AI, I don’t have feelings or personal biases. My responses are generated based on a mixture of licensed data, data created by human trainers, and publicly available data. I don’t have the ability to access or use personal data unless it has been shared with me in the course of our conversation. I am designed to respect user privacy and confidentiality. However, it’s important to note that while efforts are made to ensure the quality and reliability of my responses, they should be used as a guide and not a definitive source of information.

That sounds more like ChatGPT than Tony Bates: perfectly composed, grammatically proper sentences that could really be more or less reading the label of a soup can.

Again, I credit and want to always emulate Tony's thinking and approach here: don't just pontificate, but experiment and analyze.

But also, it's worth considering the difference between LLM training (what creates all those vector databases) and what is shoved through it and sieved out on the other side.

Don’t expect any CogDogBots in the near future.

And mostly, such great memories of time spent with the real Tony in Mexico.

Tony Bates And Camera flickr photo by cogdogblog shared under a Creative Commons (BY) license

Featured Image: My own remixed photo representing an ideal separation of technologies that is not. Rusty Circuits flickr photo by cogdogblog shared under a Creative Commons (BY) license

If this kind of stuff has value, please support me by tossing a one time PayPal kibble or monthly on Patreon
