Text has become powerful data for AI models

ChatGPT and GPT-4 are now hot topics. I wrote earlier that they would alter the user experience forever. Another interesting area is how this opens a lot of new opportunities to use text (natural language) as input. It is, of course, quite natural that a language-model-based AI can take text input. But it is still interesting to think about how this enables a lot of new data to be analyzed and opens new areas for AI, even in politics, history, or medicine.

Traditionally, numbers and tables, or more generally structured data, have been the main input for data analytics and AI. It could be data on how a system in a factory works, financial data, marketing results, sales data, or whatever formalized numerical data is relevant for different needs. But it has been hard to utilize, for example, newspaper articles, free-text reports written by analysts, doctors, diplomats or sales reps, or academic papers. We can also note that most data from history is in text format.

As I mentioned in my earlier GPT article, only time will tell how much GPT tech changes AI as a whole. But already, we can clearly see that there are areas in which it opens totally new opportunities. One of them is user experience: you can start to ‘talk with data’. Another very interesting area is how much new, important input data this brings to AI.

Early text experiments

This is not the first time text data has been utilized for AI. There have been models, for example, to pre-process text into a structured format, after which the structured data can be used for data analytics and AI.

But this approach has often been quite limited. I have been involved in projects that analyzed news articles from around the world to track spreading diseases and pandemics, and financial news about M&As to extract structured data that combines better with other financial data.

However, these solutions have been quite cumbersome to use in practice, and they have been more like academic research projects.
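
For contrast, this kind of extraction is now a few lines of code. Here is a minimal sketch, assuming the openai Python package (v1 API); the prompt, model name and record fields are illustrative assumptions of mine, and a production system would need validation and error handling:

```python
# Sketch: extracting a structured M&A record from a news article with an LLM.
# Assumes the openai Python package (v1 API); model name and fields are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_ma_record(article_text: str) -> dict:
    """Ask the model to turn free-text financial news into a structured record."""
    prompt = (
        "Extract the acquirer, target, deal value and currency from this "
        "news article. Reply with JSON only, using the keys "
        '"acquirer", "target", "value" and "currency".\n\n' + article_text
    )
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; any chat-capable model would do
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```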

New domains of AI-powered insights

Let’s look at some examples of what kind of interesting new data GPT tech, or large language models (LLMs) more generally, can offer for AI applications:

  1. Political analysis. Analyzing countries, their politics and trends has traditionally involved lots of manual work. We have had numerical data like opinion polls, election results, and budget data. But if we start to analyze all newspaper articles, opinion pieces, social media discussions, political speeches, political reports, debates, etc., it opens a lot of new opportunities. And if you want to understand geopolitics better, you can do this in the local languages of each country and perhaps also enable better sentiment analysis (see the sketch after this list).
  2. Health and wellbeing services. These services can now combine structured data (e.g. clinical data and wearables data) with free-text health records, academic papers, health literature and many other reports. This can offer a totally new level of insight.
  3. History research. It is possible to combine historical texts and other data, and use new tools to analyze them and find patterns in a much more powerful way than researchers can manually.
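
To make point 1 concrete, here is a hedged sketch of LLM-based sentiment scoring for political articles; the prompt wording and the label set are my own assumptions, not an established method:

```python
# Sketch: scoring the sentiment of political articles with an LLM.
# The prompt wording and label set are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def political_sentiment(article_text: str) -> str:
    """Classify an article's tone toward the sitting government as one label."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[{
            "role": "user",
            "content": (
                "Classify the sentiment of this political article toward "
                "the sitting government as exactly one word: positive, "
                "negative or neutral.\n\n" + article_text
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()
```

Because the underlying models are multilingual, the same call can score articles written in local languages, which is what makes the geopolitics use case interesting.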

The quality of input data

It is also important to emphasize two things: (1) this doesn’t mean these tools are going to directly replace doctors or researchers, but that there will be powerful new tools to analyze more data and find ways to use it, and (2) the reliability of the results also depends significantly on the input data.

It is not guaranteed that all input data is reliable, but this is a different issue from the technical capability to analyze data and find answers for the user. Obviously, if the underlying data is unreliable, the answers will be too. ChatGPT uses publicly available data, but you can build solutions that only use qualified data sources, e.g., for health services.

At the same time, in politics and history research, opinions, unreliable data and conflicting information play important roles. This requires tools that can detect these conflicts and inconsistencies and account for them.

It is also obvious that not only ordinary users but also professionals need to learn to use these tools. For example, if you start to analyze politics and history with GPT tools, it will most probably take time and testing to find the optimal way to use them.

But this is true of most new tools. Spreadsheets, search engines and mobile apps have also changed our way of doing many things. Even so, Excel doesn’t guarantee you always get the right results from your calculations. Search engines don’t guarantee you always get the right answers. And transportation mobile apps don’t guarantee your bus, train or flight will be on time.

Learning to use text tools

When I founded and worked in data analytics and AI companies, we always knew that pre-processing of data was easily 70% to 80% of the work, and the actual analytical and AI models only 20% to 30%. This work basically means tasks like cleaning data (e.g. removing data you cannot use, incomplete series, and unreliable data points), finding ways to interpret data from different sources, and combining and linking data sets to each other (e.g. recognizing all the data points from different sources that relate to a certain night in order to analyze your sleep patterns).
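
As an illustration of that traditional 70% to 80% of the work, here is a minimal sketch of cleaning and linking two data sources for the sleep example; the file and column names are hypothetical:

```python
# Sketch: traditional pre-processing, cleaning two sources and linking them
# by night for sleep analysis. File and column names are hypothetical.
import pandas as pd

wearable = pd.read_csv("wearable.csv", parse_dates=["timestamp"])
diary = pd.read_csv("sleep_diary.csv", parse_dates=["date"])

# Cleaning: drop unusable rows and obviously unreliable data points.
wearable = wearable.dropna(subset=["heart_rate"])
wearable = wearable[wearable["heart_rate"].between(30, 220)]

# Linking: assign each reading to a "night" (readings after noon belong to
# the night starting that day), then join with the diary per night.
wearable["night"] = (wearable["timestamp"] - pd.Timedelta(hours=12)).dt.normalize()
per_night = wearable.groupby("night")["heart_rate"].mean().reset_index()
combined = per_night.merge(diary, left_on="night", right_on="date", how="inner")
```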

GPT tech can help significantly with this data pre-processing. It makes it much easier to create new models and combine data from different sources, especially when the data formats are very different. There is a lot of data in the world and human beings have a very limited capability to process it manually, so any new technology that helps include more of it in models can significantly change how data is utilized. It is sometimes very valuable if a machine can simply propose how to categorize or organize data, as in the sketch below.
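
Here is a hedged sketch of that ‘machine proposes categories’ idea; again, the prompt and model name are illustrative assumptions:

```python
# Sketch: asking an LLM to propose categories for free-text records.
# The prompt and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def propose_categories(records: list[str], max_categories: int = 8) -> str:
    """Return the model's suggested category names for a sample of records."""
    sample = "\n".join(f"- {r}" for r in records[:50])  # keep the prompt small
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Propose at most {max_categories} category names that would "
                f"cover these records, one name per line:\n{sample}"
            ),
        }],
    )
    return response.choices[0].message.content
```

A human would still review and refine the proposed categories; the point is that the machine gives a useful starting point for organizing the data.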

We can see how GPT tech changes how we interact with data, both in the user experience (enabling ‘talking’ with data) and by making it easier to use more data sources, especially text, for analytics and AI models. AI is designed precisely to utilize huge masses of data, helping humans extract relevant information and get support in decision making. But as humans, we must learn to use these new tools.
