All The Terms In The World Of Data
Getting started in data science, analytics, business intelligence, programming, or any other topic that we cover on DataRundown can be overwhelming, especially when you consider the variety of concepts and techniques.
In this glossary we have put together over 50 common terms, each with a brief description. We hope it will serve as a handy quick reference whenever you’re reading an article or working on a project. We will update the list continuously – if you think we are missing a term, please let us know.
An algorithm is a methodical, systematic description of how to solve a problem: a step-by-step procedure of well-defined, executable instructions designed to perform a task or solve a problem. You can think of an algorithm as a job description for solving a problem – it describes the steps the program should take to accomplish the task.
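As a small illustration, the step-by-step nature of an algorithm can be sketched in a few lines of Python – here, a simple procedure for finding the largest number in a list:

```python
def find_largest(numbers):
    """Return the largest value by checking each element in turn."""
    largest = numbers[0]        # step 1: assume the first element is the largest
    for value in numbers[1:]:   # step 2: walk through the remaining elements
        if value > largest:     # step 3: compare each element to the current best
            largest = value     # step 4: update the best if a larger value appears
    return largest              # step 5: report the result

print(find_largest([3, 41, 12, 9, 74, 2]))  # 74
```

Each line corresponds to one well-defined instruction, which is exactly what makes the procedure an algorithm.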
Apache Spark, or just Spark, is among the most popular tools in the big data industry. It’s an open-source framework, maintained by the Apache Software Foundation, that is used for large-scale data processing and analysis. Apache Spark comes with a group of tools that can be used for various tasks, such as structured data processing, graph processing, and machine learning.
Download and learn more about Apache Spark at the official website
API (Application Programming Interface) is a software intermediary that allows two separate applications to communicate with one another. APIs define methods of communication between various components.
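To make the idea concrete, here is a minimal sketch in Python of a hypothetical in-memory "service" exposing a tiny API. The function names and data are invented for illustration – the point is that the consuming application communicates only through the published functions, never with the internals:

```python
# Internal state of the "service" - hidden behind the API.
_CATALOG = {"sku-1": 9.99, "sku-2": 4.50}

def get_price(sku):
    """API method: look up a product price by SKU."""
    return _CATALOG.get(sku)

def list_skus():
    """API method: list all known product identifiers."""
    return sorted(_CATALOG)

# A separate application communicates with the service only via its API.
print(list_skus())         # ['sku-1', 'sku-2']
print(get_price("sku-1"))  # 9.99
```

Web APIs work the same way at a larger scale: the methods and parameters are published, while the implementation behind them stays hidden.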
Artificial intelligence (AI) is a broad field of computer science concerned with creating intelligent computers capable of doing activities that normally require human intelligence. Artificial intelligence (AI) tries to leverage computers and machines to mimic the problem-solving and decision-making capabilities of the human mind.
Machine learning (ML) is an application of artificial intelligence (AI) that allows systems to automatically learn and improve from experience without being explicitly programmed.
Some applications of AI include:
- Image recognition
- Speech recognition
- Computer vision
- Virtual reality
- and many more …
Augmented Reality (AR) uses the existing real-world environment – the one that is right in front of us – and puts virtual information on top of it to enhance the experience. AR augments your surroundings by adding digital elements to a live view, often using a smartphone’s camera. By contrast, Virtual Reality (VR) uses an entirely computer-generated simulation of an alternate world.
In a computer context, “backend” (sometimes “back-end”, “back end” or “server side”) refers to the parts of a website or software program that are not visible to users. In contrast, the front end refers to a program’s or website’s user interface. In programming terminology, the backend is the “data access layer,” while the front end is the “presentation layer.”
The backend parts are responsible for storing and organizing data, and ensuring everything on the user-side actually works.
Bayes Theorem is a mathematical formula for determining conditional probability. Conditional probability is the likelihood of an event or outcome happening, based on the occurrence of a previous event or outcome. Bayes Theorem is widely used in the field of machine learning.
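As a rough illustration (the numbers here are hypothetical), Bayes Theorem can be worked through in a few lines of Python – given a medical test, what is the probability of actually having a disease after testing positive?

```python
# Bayes Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease = 0.01               # prior: 1% of the population has the disease
p_pos_given_disease = 0.95     # test sensitivity
p_pos_given_healthy = 0.05     # false-positive rate

# Total probability of a positive test, over both possibilities.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Conditional probability of disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.161
```

Even with a fairly accurate test, the probability after one positive result is only about 16% – a classic demonstration of why the prior matters.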
Bayesian Network, also known as a Bayes network or Bayes net, is a type of probabilistic graphical model that uses Bayesian inference for probability computations. Bayesian networks are graphs that represent the relationship between random variables for a given problem. Bayesian networks are useful for taking an observed event and forecasting the likelihood that any of numerous known causes had a contributing factor.
A concept in business analytics that reveals insights into the behaviour of users, focusing on understanding how consumers act and why. It is used on websites, e-commerce platforms, mobile apps, chat, email, connected products/Internet of Things (IoT), and other digital channels. Behaviour analytics is a data-driven approach to tracking, forecasting, and utilising users’ behaviour data within a digital product.
Big data is, as the name suggests, huge, hard-to-manage, and complex data sets, often from new data sources. These data sets are so extensive that conventional data processing software just can’t manage them; new methods and tools are needed to handle them. Big data can be analyzed for insights that improve decisions and provide confidence when making strategic business moves.
Digital archives that are composed of unchangeable, digitally recorded data in packages called blocks. The “block” in a blockchain refers to a block of transactions that have been sent to the network. The “chain” refers to a series of these blocks. Each block is “chained” to the next block using a cryptographic signature.
Blockchains are best known for their crucial role in cryptocurrency systems, such as Bitcoin, for maintaining a secure and decentralised record of transactions. One key difference between a typical database and a blockchain is how the data is structured.
Clustering is the task of dividing a population or set of data points into several smaller groups, so that data points in the same group are more similar to each other than to those in other groups. The aim is to identify items with similar traits and assign them to clusters.
Uses of clustering include market segmentation, image segmentation, search-result grouping, and so on. There are different types of clustering in machine learning, such as centroid-based clustering, density-based clustering, distribution-based clustering, and hierarchical clustering.
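A minimal sketch of centroid-based clustering (k-means) can be written in plain Python. The data points and starting centroids below are made up for illustration; real workloads would use a library such as scikit-learn:

```python
def kmeans_1d(points, centroids, iterations=10):
    """Tiny centroid-based clustering (k-means) on 1-D data."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)  # [1.5, 10.0]
```

The two clusters that emerge group the low values together and the high values together, which is exactly the "more similar within a group than between groups" idea described above.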
Customer engagement is the emotional connection between a customer and a brand. Customer engagement is the procedure of interacting with customers through a combination of channels, for example, social media or websites, to strengthen the relationship.
A database is a structured collection of information or data that is often kept electronically in a computer system. The database is usually managed by a database management system (DBMS). Databases are the cornerstone of any Software Applications as this is where we store all of our data.
There are several different types of databases, some examples are: Relational databases (SQL databases), Cloud databases, NoSQL databases, and many more…
A field that focuses on identifying data sources and on the collection, administration, and storage of data. It is a broad field with applications in almost every industry. Organizations can collect massive amounts of data and need the right competence and technology to ensure the data is usable by the time it reaches the data scientists and analysts.
Data governance (DG) regulates the availability, accessibility, integrity, and security of data in corporate systems. Data governance is based on internal data standards and regulations that also oversee data consumption. Data governance is a collection of processes, roles, policies, standards, and metrics that ensure the effective and efficient use of information.
The process of finding anomalies, patterns, and correlations in large data sets, involving methods at the intersection of machine learning, statistics, and database systems, to predict outcomes.
The simplest definition of data science is the extraction of actionable insights from raw data. Data science combines multiple fields, including statistics, computer science, scientific methods, artificial intelligence (AI), and data analysis, to extract value from data.
A person who applies the scientific method of observing, recording, analyzing and reporting results to understand information and use it to solve problems. A data scientist requires both technical and non-technical skills.
The technical skills include programming, with languages like Python, R, C/C++ and SQL, and mathematics, with extra focus on statistics, calculus, and linear algebra. The non-technical skills involve critical thinking and problem solving, a strong business understanding, and communication and storytelling skills.
The practice of building a narrative around data and its accompanying visualisations to help convey context and the meaning of data in a compelling fashion. In other words, use the data to paint the picture and effectively use data to tell your story.
The three key components of data storytelling are data, narrative, and visuals.
Refers to the tools, processes, and rules that define how business data is managed, analyzed, and acted upon. Data strategy describes a “set of choices and decisions that together, chart a high-level course of action to achieve high-level goals” (Source: DAMA). A data strategy includes business plans to use data for competitive advantage and to support the company’s strategic goals.
Refers to the graphical representation of data using visual elements such as charts, graphs, and tables. The intent is to enable decision-making with a suitable presentation of insights.
A central storage of data that can be used to analyze and make more informed decisions. It’s a system used to quickly analyze business trends using data from numerous sources, and it is designed to make it easy to answer critical statistical questions without being highly experienced in database management.
In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. Decision trees are part of machine learning and allow you to visualise how decisions are made. The solution takes different paths depending on the decision – in other words, going down different branches of the tree – and the answer to each decision is typically a yes or no.
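A hand-written decision tree can be sketched directly as nested conditionals. The loan-approval rule and thresholds below are entirely hypothetical, chosen only to show the branching structure:

```python
def loan_decision(income, has_debt):
    """A tiny hand-written decision tree: each question sends the
    input down a different branch, ending in a yes/no leaf."""
    if income >= 50000:      # root node: first question
        if has_debt:         # branch: a second question
            return "no"      # leaf
        return "yes"         # leaf
    return "no"              # leaf

print(loan_decision(60000, has_debt=False))  # yes
print(loan_decision(60000, has_debt=True))   # no
```

A machine learning algorithm learns these questions and thresholds from data instead of having them written by hand, but the resulting structure is the same.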
A subset of machine learning, which is essentially a neural network with three or more layers, based on artificial neural networks inspired by the structure of the human brain. It learns from extensive amounts of data and is especially good at finding patterns in unstructured data such as text and images.
Event data is any data that you want to measure about an event. For example, when a customer clicks on a product link, the event data would track the product name, category, and other information about the click.
Event data can help organizations understand how people interact with their products and services, what works well, and where they need to make improvements.
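As a sketch, a single event is often captured as a small record of key–value pairs. The field names and values below are hypothetical, but the shape is typical:

```python
# A hypothetical click event captured as a small record.
event = {
    "event_type": "product_click",
    "timestamp": "2023-05-01T12:30:00Z",
    "product_name": "Running Shoes",
    "category": "Footwear",
    "user_id": "u-123",
}

print(event["event_type"])  # product_click
```

Analytics systems collect millions of such records and aggregate them to answer questions like "which products get clicked most?".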
Read more about Event Data in our post: Event Data: Everything You Need To Know About Event Data
The front end of a program or website is what you see and interact with in your interface. The front end is also referred to as the “client side”; it includes everything the user sees directly, from text and colors to buttons, images, and navigation menus. Responsiveness and performance are two main objectives of the front end.
The three main front end languages are HTML, CSS, and JavaScript.
The term fuzzy refers to things that are not clear or are vague. Fuzzy logic is a technique that tries to embody human-like thinking in a control system. The system is designed to give acceptable reasoning rather than 100% accurate reasoning.
A system based on fuzzy logic will mimic human deductive thinking, that is, the process people use to draw conclusions from what they know.
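The core idea can be sketched with a membership function: instead of a hard true/false answer, a value belongs to a category to some degree between 0 and 1. The temperature thresholds below are arbitrary, chosen only for illustration:

```python
def warm_membership(temp_c):
    """Degree (0.0 to 1.0) to which a temperature counts as 'warm',
    rising linearly between 15 and 25 degrees Celsius."""
    if temp_c <= 15:
        return 0.0
    if temp_c >= 25:
        return 1.0
    return (temp_c - 15) / 10

print(warm_membership(10))  # 0.0  -> clearly not warm
print(warm_membership(20))  # 0.5  -> somewhat warm
print(warm_membership(30))  # 1.0  -> fully warm
```

A fuzzy control system combines many such membership functions with rules like "if warm and humid, increase fan speed", producing smooth, human-like behaviour.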
Java is a powerful general-purpose programming language that can be used to create everything from desktop and mobile apps to big data processing and robot programming. Java is one of the most well-known and popular programming languages today. Its popularity has grown thanks to its platform independence, object-oriented nature, and ease of use.
Read more about Java in our post Java: One of the most popular programming languages
JSON Data Format
JSON (JavaScript Object Notation) is a lightweight, text-based data format used to store and exchange data, for example between a server and a web application.
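A quick sketch in Python, using the standard library's `json` module, shows what JSON looks like in practice (the customer record is made up for illustration):

```python
import json

# Serialize a Python dict to a JSON string, then parse it back.
customer = {"id": 42, "name": "Ada", "active": True}

text = json.dumps(customer)
print(text)              # {"id": 42, "name": "Ada", "active": true}

restored = json.loads(text)
print(restored["name"])  # Ada
```

Because JSON is plain text with a simple structure, almost every language and API can produce and consume it, which is why it has become a default data-exchange format on the web.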
Julia is an open-source programming language, released in 2012, that was created to be as easy to use as languages such as R and Python while being as fast as C. Julia is a high-level, general-purpose language that can be used to write code that is fast to execute and easy to implement for scientific calculations.
Download Julia and learn more about how you can use the Julia programming language at the official website
Machine data, also called machine-generated data, is the data that is automatically generated by the activities and operations of networked devices, such as computers, smartphones, embedded systems, sensors, and many other sources.
Machine data is considered to be the largest source of big data and also the most complex. Estimations suggest that most of the world’s data is generated from machines.
Machine learning (ML) is an application of Artificial Intelligence (AI) that allows systems to automatically learn and improve from experience without being explicitly programmed.
Machine learning can be broadly defined as the capability of a machine to imitate intelligent human behavior. What this means is that machine learning focuses on developing programs that can access data and use it to learn for themselves.
Some applications of machine learning are:
- Image recognition
- Speech recognition
- Self-driving cars
- Social media algorithms
- And many many more
If you want a comprehensive introduction to machine learning, check out our post: Introduction to Machine Learning for Beginners
Refers to the set of identifiers that gives context to business data, such as location, customer, product, asset, etc. This core data is fundamental for running operations within a company; without it, there would be no way to compare data between systems uniformly.
Master Data Management
Master data management (MDM) does exactly what the name implies – it manages master data. An MDM system incorporates the applications and technologies that consolidate, cleanse, and extend master data and synchronize it with applications, business processes, and analytical tools.
Examples of some MDM systems include SAP, Salesforce and Oracle.
Metadata is data that provides information about other data – in other words, data about data. You can think of metadata as references to data.
Some examples of basic metadata are author, date created, date modified, and file size. Another example of metadata is everything written on a letter envelope to help the actual content, the letter, get delivered to its intended receiver.
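As a small sketch, the separation between data and metadata can be made explicit in code. The document content and field values below are invented for illustration:

```python
# The "data" is the document body; the metadata describes it.
document = "Quarterly sales figures..."

metadata = {
    "author": "J. Smith",
    "created": "2023-01-15",
    "modified": "2023-02-02",
    "size_bytes": len(document.encode("utf-8")),
}

print(metadata["author"])      # J. Smith
print(metadata["size_bytes"])
```

Search engines, file systems, and data catalogs all rely on records like this to find and manage data without reading the data itself.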
The metaverse is, at the moment, a shorthand for virtual worlds. A metaverse is a network of 3D virtual worlds focused on social connection. The metaverse will use technologies like augmented reality, virtual reality, smart glasses, etc. to create and simulate these 3D spaces. The ambition with the metaverse is that users can interact with each other and engage with apps and services in a more appealing way.
Non-fungible token (NFT) is a non-interchangeable unit of data stored on a blockchain, a form of a digital record, that can be sold and traded. NFT data units may be connected with digital files such as photos, videos, and audio. The NFT works like a certificate of authenticity for an object, real or virtual.
The chain of ownership is both verified and publicly documented in the worldwide network, as the digital record is stored on a blockchain. This means that ownership changes are permanently recorded, and it is said to be nearly impossible (in theory at least) to swap in a fake one.
NoSQL databases (“not only SQL”) are non-tabular databases and store data differently than relational tables. NoSQL databases have a simple and flexible structure and are schema-free (data can be stored without a predefined structure).
Common types of NoSQL databases include column stores, document stores, key-value stores, graph stores, object stores, etc.
Source code that is made freely available for possible modification and redistribution, meaning that its design is publicly accessible. Open source projects, products, or initiatives embrace principles of open exchange, collaborative participation, transparency, and community-oriented development.
A variable whose value is very different from what is expected given the values of other variables in the dataset. Outliers can be indicators of rare or unexpected events, or of unreliable data.
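One common way to detect outliers is the z-score: how many standard deviations a value sits from the mean. A minimal sketch using Python's standard library, with made-up sensor readings and an illustrative threshold of 2:

```python
from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    """Flag values whose z-score (distance from the mean, measured
    in standard deviations) exceeds the threshold."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

readings = [10, 11, 9, 10, 12, 11, 10, 95]
print(find_outliers(readings))  # [95]
```

The value 95 sits far outside the pattern of the other readings, so it is flagged; whether it is a rare real event or a faulty measurement is a judgment the analyst still has to make.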
Overfitting is used when we talk about training our machine learning models. Overfitting refers to a model that follows the training data too well. This happens when a model learns the detail and so-called noise in the training data to the extent that it negatively impacts the model’s performance on new real-life data.
Why is overfitting a problem? Well, the noise or random occurrences (outliers etc.) in the training set will be learned as concepts by the model, and it will try to execute these concepts on the new data set.
A data-driven approach that helps businesses transform the way they analyse, monitor, and optimise their processes. Key benefits of process mining include uncovering the actual process, identifying bottlenecks and inefficiencies, basing decisions on data, and finding best practices for processes and ways of working.
R in data science is used to handle, store and analyse data and can be used for data analysis and statistical modelling. R is used in statistical computing and graphics, is easy to learn, and is free, open-source software. Want to learn more about R and its background? Check out The R Project for Statistical Computing
SQL (Structured Query Language) is one of the most important data science programming languages, as it’s used for performing various operations on the data stored in the databases like updating records, deleting records, creating and modifying tables, views, etc.
SQL is the language used to work with relational databases. A relational database consists of multiple tables that relate to each other. Data science is the comprehensive study of data, and to work with data, we need to extract it from the database.
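A short sketch using Python's built-in `sqlite3` module shows SQL in action against a tiny in-memory relational database (the table and rows are invented for illustration):

```python
import sqlite3

# Create a tiny in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "UK"), (2, "Linus", "FI"), (3, "Grace", "US")],
)

# Extract data with a SQL query.
rows = conn.execute(
    "SELECT name FROM customers WHERE country = ? ORDER BY name", ("UK",)
).fetchall()
print(rows)  # [('Ada',)]
conn.close()
```

The same `SELECT`, `INSERT`, `UPDATE`, and `DELETE` statements work, with minor dialect differences, across database systems such as PostgreSQL, MySQL, and SQL Server.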
Read more about SQL in our post: SQL The powerful language to extract data from databases
Structured data is highly organized and follows the same format, making it easily searchable and managed. Examples of structured data could be dates, customer IDs, phone numbers, etc.
I like to think of structured data as data I can store and display in fixed rows, columns and relational databases that are easy to search.
Learn more about structured data in our post: Structured vs Unstructured Data: The Complete Guide
TensorFlow is an open source framework that has become a standard tool for Machine Learning. TensorFlow has an extensive ecosystem of tools, libraries, and community resources that lets data scientists quickly build and deploy machine learning applications.
The major benefit of using TensorFlow is abstraction – allowing the data scientist to focus on the overall logic of the application rather than going into too much detail.
Learn more about TensorFlow at the official website
Unstructured data has no predefined format or organization, making it more challenging to collect, process, and analyze. It is information that either does not have a predetermined data model or is not set up and managed in a predefined manner.
Examples of unstructured data could be email messages, customer chats, video files, images, interviews, and so on.
Learn more about unstructured data in our post: Structured vs Unstructured Data: The Complete Guide
Web3 is characterised by internet services and mobile apps rebuilt on decentralised blockchain technology. The basic notion is that it will be decentralised, rather than controlled by governments and businesses like today’s internet.