Unstructured data can be tamed and put to good use

Image by eamesBot | Bigstockphoto

More than 90% of the data in existence is unstructured, and the overall mix has remained broadly the same over the past decade, according to IDC’s latest report, “Worldwide Global DataSphere and Global Storage Sphere Structured and Unstructured Forecast 2021-2025” (July 2021). However, this is forecast to change as more unstructured data is tamed through the addition of metadata and moves into the structured data world.

A key driver is the advent of software that can analyse the contents of unstructured data and give them context. For example, video analytics software can tag objects in the frames of a video file and assign them specific references that can be stored and searched. This sounds mundane, but for owners of retail stores, it allows shoppers to be digitally identified and followed around a location to see which goods they look at and which ones they end up taking to the checkout. In short, it brings the rich data of an online purchase on Amazon into the physical store. Similar revenue-generating opportunities exist in many other areas, which means that unstructured data suddenly becomes highly valuable.
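As a minimal sketch of how tagging makes footage searchable, the Python below treats each tag as a small structured record. The `Detection` fields, file names, shopper IDs and zones are all hypothetical, invented for illustration; a real video analytics product would define its own output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One hypothetical tag produced by video analytics software."""
    video_file: str
    timestamp_s: float
    shopper_id: str
    zone: str  # an aisle name or "checkout"


# Invented sample tags; a real system would derive these from camera footage.
detections = [
    Detection("store_cam_01.mp4", 12.0, "shopper-17", "snacks"),
    Detection("store_cam_02.mp4", 95.0, "shopper-17", "checkout"),
    Detection("store_cam_01.mp4", 47.5, "shopper-17", "beverages"),
    Detection("store_cam_01.mp4", 30.0, "shopper-22", "snacks"),
]


def zones_visited(tags, shopper_id):
    """Return the zones one shopper passed through, ordered by time."""
    own = sorted(
        (t for t in tags if t.shopper_id == shopper_id),
        key=lambda t: t.timestamp_s,
    )
    return [t.zone for t in own]


def reached_checkout(tags, shopper_id):
    """True if the shopper was tagged in the checkout zone at any point."""
    return "checkout" in zones_visited(tags, shopper_id)


print(zones_visited(detections, "shopper-17"))
print(reached_checkout(detections, "shopper-22"))
```

The video files themselves stay unstructured; only the tags are structured, and they are what make the footage answerable to questions such as "who reached the checkout?".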

But first, some definitions. Unstructured data is information in its raw format; it often lives in or near the location in which it was collected. It covers all types of raw data collected, including data that hasn’t been catalogued or analysed. Structured data, meanwhile, is organised, quantitative data, most commonly numerical or text-based, that is held in a standard format in a fixed field within a file or record. Information that resides in spreadsheets or relational databases is a typical example of structured data. This structure makes it simple to query the data when looking for specific items or groups of information.
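To illustrate why fixed fields make data simple to query, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `sales` table, its columns and its rows are assumptions made up for this example.

```python
import sqlite3

# An in-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Every row must fit these predefined, typed columns (hypothetical schema).
conn.execute(
    """
    CREATE TABLE sales (
        id      INTEGER PRIMARY KEY,
        product TEXT NOT NULL,
        region  TEXT NOT NULL,
        amount  REAL NOT NULL
    )
    """
)

# Invented sample rows.
conn.executemany(
    "INSERT INTO sales (product, region, amount) VALUES (?, ?, ?)",
    [
        ("pizza", "north", 12.5),
        ("pizza", "south", 10.0),
        ("salad", "north", 7.5),
    ],
)

# Because the structure is known in advance, one query groups and totals it.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(totals)
```

No parsing or tagging step is needed before the query, which is the practical advantage the definition above describes.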

Unstructured information is often described as qualitative data, meaning that it is simply information that has been observed or recorded. For instance, Internet of Things (IoT) sensors in a factory might collect data about the ongoing performance of equipment. The information is then sent to servers and stored in whatever format it arrives in, such as PDF or video files.

Other examples of unstructured data include satellite photos, weather reports, patients’ biosignal data in a hospital, and video imagery that has not yet been tagged or catalogued in an organised way. The common denominator is that the data is passively gathered and transmitted without any pre-defined organisational formatting. Unstructured data can be extremely useful for spotting larger trends and constructing predictive models once it has been reviewed and understood as part of a massive dataset, but in its raw form it is difficult to search and analyse for business analytics.

Structured vs. Unstructured Data

The main difference between structured and unstructured data is the formatting. Unstructured data is stored in its native format, such as a PDF, video or sensor output. Structured data follows a strictly predefined form, or carries predefined signifiers that describe it in a standardised way, so that it can easily be placed into a table, spreadsheet or relational database.

Unstructured data is often housed in what’s called a data lake, which is essentially a repository that stores raw data in various formats. Structured data resides in data warehouses, repositories that only accept data formatted to pre-defined specifications. In other words, a data lake holds unstructured data and may also hold structured data, while a data warehouse holds only organised, formatted, structured data.
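The contrast can be sketched in a few lines of Python. This is a toy model rather than how any real lake or warehouse product works: the `DataLake` and `DataWarehouse` classes and the `REQUIRED_FIELDS` schema are hypothetical names chosen for illustration.

```python
# Hypothetical schema that every warehouse record must match.
REQUIRED_FIELDS = {"sensor_id": str, "reading": float, "unit": str}


class DataLake:
    """Toy lake: accepts any payload in its native form."""

    def __init__(self):
        self.objects = []

    def store(self, name, payload):
        # No validation: bytes, text and dicts are all kept as-is.
        self.objects.append((name, payload))


class DataWarehouse:
    """Toy warehouse: accepts only records that match the schema."""

    def __init__(self):
        self.rows = []

    def load(self, record):
        if set(record) != set(REQUIRED_FIELDS):
            raise ValueError(f"fields do not match schema: {sorted(record)}")
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(record[field], expected_type):
                raise ValueError(f"{field} must be {expected_type.__name__}")
        self.rows.append(record)


lake = DataLake()
lake.store("camera_feed.mp4", b"\x00\x01\x02")        # raw bytes
lake.store("reading.json", {"sensor_id": "s1", "reading": 21.5, "unit": "C"})

warehouse = DataWarehouse()
warehouse.load({"sensor_id": "s1", "reading": 21.5, "unit": "C"})

try:
    warehouse.load({"sensor_id": "s2", "reading": "hot"})  # wrong shape
except ValueError as err:
    print("rejected:", err)
```

The lake keeps everything, including the structured reading, while the warehouse refuses anything that does not fit its predefined specification.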

Whether data is in a lake or a warehouse, the information is stored in some form of database. The main difference is that structured data is typically stored in a relational database such as PostgreSQL, where it is organised into rows and columns and queried with Structured Query Language (SQL). This organisation makes structured data far easier for users, or machines, to search, sort and work with. Unstructured data, by contrast, is typically stored in a non-relational (NoSQL) database such as MongoDB.

The two types of data also differ in how they may be analysed, as well as in the tools and personnel needed to work with and manipulate them. Unstructured data is typically analysed using techniques such as data stacking and data mining, which have been developed to work with metadata and reach more general conclusions. More mathematical forms of analysis, such as data classification, clustering and regression analysis, can be used with structured data. In terms of tools and technologies, structured data facilitates the use of management and analytics tools. Examples of tools used to work with structured data are:

  • Relational Database Management Systems (RDBMS)
  • Customer Relationship Management (CRM)
  • Online Analytical Processing (OLAP)
  • Online Transactional Processing (OLTP)
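As a hedged illustration of the more mathematical analysis that structured data allows, the sketch below fits a straight line by ordinary least squares using only plain Python. The two columns and their values are invented; in practice they would be read from a table in one of the tools listed above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length, at least 2 rows")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variance terms (unnormalised; the factor cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical numeric columns from a structured table.
machine_hours = [1.0, 2.0, 3.0, 4.0]
units_produced = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(machine_hours, units_produced)
print(f"units = {slope:.2f} * hours + {intercept:.2f}")
```

The calculation works directly on the columns because every value is already numeric and in a known position, which is what structured storage guarantees.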

Software that can work with large datasets existing in multiple formats is typically used for managing and analysing unstructured data. Examples of tools for managing unstructured data include:

  • NoSQL Database Management Systems (DBMS)
  • AI-Driven Data Analysis Tools
  • Data Visualisation Tools

Unstructured data often requires management by a well-trained expert, along with software tools that have more advanced Artificial Intelligence (AI) and predictive modelling capabilities than those used for structured data. Machine learning is one of the strategies used to analyse unstructured data. As outlined above, tagged video data in retail environments is a good example of unstructured data being put to good use.

Because structured data is already sorted and organised, the software tools used to work with these datasets are more accessible for non-expert business users. For example, inputs, searches, queries, and data manipulation are often self-service via a highly organised user interface.

Another example of unstructured data in use is Pizza Hut in Vietnam, which uses a smart camera system to analyse the visual actions of its cooks and serving staff. This helps ensure that food preparation follows established procedures and complies with the hygiene and preventative measures required during COVID-19. The video imagery captured needs to be both analysed and stored.

Video imagery, whether raw or semi-processed, can be very storage-intensive. This drives a higher-than-ever demand for mass-capacity storage systems centred around hard drives, which continue to provide significant total cost of ownership advantages as advances in HDD technology make ever-higher capacities possible.

Unstructured data needs to be accessed near its source and moved, as needed, to a variety of private and public cloud data centres where it is used for different purposes. This need is also driving the shift from closed, proprietary and siloed IT architectures to open, composable, hybrid architectures in which data moves freely and efficiently across the distributed enterprise.

Mass-capacity storage systems, such as Seagate’s new Exos® CORVAULT™, allow vast amounts of unstructured data to be stored in macro edge and data centre environments. The high-density storage system offers SAN-level performance and is built on Seagate’s breakthrough storage architecture, which combines the sixth-generation VelosCT™ ASIC, ADAPT erasure code data protection and Autonomous Drive Regeneration.

Additionally, modular storage solutions such as Seagate’s new Lyve Mobile offer a better way to move massive amounts of data physically by road transport from one storage location to another.

Conclusion

Today, the two types of data serve different purposes. Unstructured data is the raw output of devices or software that collect information, and it is moved into data lakes in its original format. Structured data is organised in numerical or text format and can be catalogued, organised, reorganised and analysed within pre-defined parameters.

As more unstructured data enters the realm of the structured IT environment, particularly streaming data from an army of IoT devices together with a mass of tagged video data, there is an opportunity for organisations to transform that data into information and knowledge. Unstructured data can be extremely useful in spotting larger trends and constructing predictive models when reviewed and understood as part of a large dataset. Therein lies the opportunity: those with the vision to draw new insight from this rich well of intelligence can use it to launch new products and services.


By B S Teh, Executive VP of Global Sales and Sales Operations, Seagate Technology
