Data is a term that is thrown around quite a lot these days, and many people use the word without fully understanding what it means. In technology, the word “data” is approximately synonymous with “information”. That is, data about a person is just unprocessed information about them.
Almost all companies produce and consume huge amounts of data every second. Websites such as YouTube will generate data about the videos that you watch, how long you watch them for, what time you watch them at, who you share them with and so on and so forth. This data can then be used to recommend other videos that you might enjoy and even serve you adverts based on your interests.
Types of Data
In general, data can be structured or unstructured.
Structured data has a specific format that can’t be changed. You could think of this like fixed columns in a spreadsheet such as “Name”, “Age” and “Nationality”.
Unstructured data, on the other hand, is anything that doesn’t have a pre-defined structure. Things like audio files or the text in an email fit into this category as it is not possible to predict their format or length.
There is a third category in between these two called semi-structured data. This tends to be a combination of both structured and unstructured data. Take the example of an email message: although the text itself is unstructured, there is still an underlying structure, including fields such as “Date”, “From” and “To”.
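To make the email example concrete, here is a minimal Python sketch using the standard library's email module. The message itself, the addresses and the helper name split_email are all made up for illustration; the point is only that the header fields can be looked up by name, while the body is free text.

```python
from email import message_from_string

# Hypothetical raw message, invented for this illustration.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Date: Mon, 01 Jan 2024 09:30:00 +0000
Subject: Lunch?

Hi Bob, are you free for lunch on Friday? No worries if not.
"""


def split_email(raw):
    """Separate the structured header fields from the unstructured body."""
    message = message_from_string(raw)
    # Structured part: named fields that every email is expected to carry.
    structured = {name: message[name] for name in ("Date", "From", "To")}
    # Unstructured part: free text of any length and format.
    unstructured = message.get_payload().strip()
    return structured, unstructured


structured, unstructured = split_email(RAW_EMAIL)
print(structured)
print(unstructured)
```

The structured part could be stored in fixed columns just like a spreadsheet, whereas the body would need something like text analysis before it could be used in the same way.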
Working with data can present some challenges. To help understand these challenges, it is common to think about “The Four Vs of Data”:
Volume refers to the size of the dataset that needs to be processed. These can often be petabytes or terabytes in scale and so special technologies are often used to analyse the data – you wouldn’t want to load this amount of data into Microsoft Excel on your laptop!
Velocity refers to the speed at which data is produced. Take the Internet of Things, which refers to the connectivity of devices from cars to toasters. Each of these devices can have hundreds of sensors recording data such as the temperature, humidity or light level every second or minute. Multiply this by billions of users and it is easy to see how quickly data might be produced!
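To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Python. Every figure in it (one million devices, 100 sensors each, one reading per second, 8 bytes per reading) is an assumption chosen purely for illustration, not a measurement of any real system.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def readings_per_day(devices, sensors_per_device, readings_per_second):
    """Total number of sensor readings produced in one day."""
    return devices * sensors_per_device * readings_per_second * SECONDS_PER_DAY


def terabytes_per_day(readings, bytes_per_reading):
    """Convert a daily reading count into terabytes (1 TB = 10**12 bytes)."""
    return readings * bytes_per_reading / 10**12


# Hypothetical figures, chosen only to show the scale involved.
daily_readings = readings_per_day(
    devices=1_000_000, sensors_per_device=100, readings_per_second=1
)
daily_terabytes = terabytes_per_day(daily_readings, bytes_per_reading=8)

print(f"{daily_readings:,} readings per day")
print(f"about {daily_terabytes:.0f} TB per day")
```

Even with these modest assumptions the total comes to trillions of readings a day, which is why velocity and volume tend to go hand in hand.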
Variety refers to all the different types of data that might be produced. For example there might be combinations of structured, unstructured or semi-structured data to deal with. Furthermore, it might be necessary to deal with numbers, text, pictures or audio. The different formats can be challenging to consolidate into a single model.
Veracity refers to the quality of the data being produced. It is sometimes impossible to tell how accurate data is, as there might be a lot of noise. For example, data produced in a scientific trial is likely to be of higher quality than the data produced by a consumer smartphone. This is due to better quality sensors and more scientific methods of collecting data.
It’s only recently that data management has been brought to the forefront of business. It is a complicated, often subtle, yet utterly essential part of big data.
Data management ties in closely with the four Vs mentioned above. It is all about collecting, processing, and then storing data in an efficient and responsible way. So it draws on a bunch of different ideas from data privacy to storage software options.
Great data management involves balancing conflicting pros and cons. For example, you need to ensure that data is easily accessible to your data scientists so that they can create crazy algorithms and provide you with business insight. Yet at the same time, data must be carefully protected, with access granted only when truly needed; personal information such as medical records or bank account details should only be viewable and usable by a select few.
How about another example? You probably want your business to be using the latest database hardware and software to store your data. Keeping up to date means that your big data estate will be more secure, easier to use, and simpler to manage. However, every time you want to make an update (e.g. moving to a better third-party provider or buying new hardware), it will likely involve many months of planning, approvals, and transition before you actually get the benefit. So you can see that once again, data management is a complicated balancing act.
We’re not going to attempt to explain data management in detail here because it is such a large topic. But if you are interested, have a read of this article for a more thorough, industry-focused take on big data management. NOTE: don’t worry if you don’t understand it all! There is a lot of techy terminology here which you can pick up later when needed.
Data is key to technology today. If you’re interested in learning more about Big Data or Data Science, you can take a look at our learning paths here.