Statistics and Probability for Data Science form the foundation of data analysis, enabling data scientists to extract meaningful insights from raw data. Statistics involves the collection, analysis, interpretation, and presentation of data, helping to summarize large datasets and draw conclusions. Key concepts in statistics include measures of central tendency (mean, median, mode), measures of variability (variance, standard deviation), and inferential statistics, which help in making predictions or generalizations about a population based on sample data.

Techniques like hypothesis testing, confidence intervals, and regression analysis are commonly used to evaluate data and make data-driven decisions. Probability, on the other hand, is the study of uncertainty and the likelihood of events occurring. It provides a mathematical framework for modeling uncertainty and making predictions. Concepts like conditional probability, Bayes’ theorem, and probability distributions (such as normal, binomial, and Poisson distributions) are essential for building predictive models in data science.

Understanding probability allows data scientists to assess risks, quantify uncertainties, and build more accurate models, particularly in areas like machine learning and AI. Together, statistics and probability are indispensable tools in data science, helping professionals analyze data, identify patterns, and make informed predictions and decisions that drive business outcomes.

What is Data?

Data refers to raw facts, figures, or information that can be collected, analyzed, and interpreted to generate meaningful insights. It can take various forms, such as numbers, text, images, audio, or even sensor readings. On its own, raw data may not have much value until it is processed, analyzed, or organized in a way that makes it useful for decision-making or problem-solving.

There are two main types of data:

  • Quantitative Data: This is numerical data that can be measured and quantified. Examples include age, salary, temperature, or sales figures. Quantitative data is typically used for statistical analysis, and it can be discrete (countable values) or continuous (measurable quantities).
  • Qualitative Data: Also known as categorical or descriptive data, this refers to non-numeric information. Examples include names, colors, or customer reviews. Qualitative data is typically used for categorization and analysis of trends or patterns in behavior or opinions.

In the context of data science, data is the foundation for performing analyses, building predictive models, and deriving insights that can guide business decisions, improve processes, and drive innovation. Data can be collected from various sources, such as surveys, experiments, transactions, sensors, or social media, and it is often stored in databases or data warehouses for easy access and analysis.
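As a small illustration, the sketch below (using pandas, with made-up values) shows how quantitative and qualitative data might sit side by side in a single table; the column names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical customer records: 'age' and 'purchase_amount' are quantitative,
# 'city' and 'satisfaction' are qualitative (categorical).
df = pd.DataFrame({
    "age": [25, 34, 29, 41],                                # quantitative, discrete
    "purchase_amount": [19.99, 250.00, 74.50, 5.25],        # quantitative, continuous
    "city": ["Paris", "Tokyo", "Paris", "Austin"],          # qualitative, nominal
    "satisfaction": ["Good", "Excellent", "Fair", "Good"],  # qualitative, ordinal
})

print(df.dtypes)                   # numeric columns vs. object (text) columns
print(df["age"].mean())            # numerical summaries only make sense for quantitative data
print(df["city"].value_counts())   # categorical data is summarized by counts
```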

1. Quantitative Data (Numerical Data)

Quantitative data refers to data that is expressed in numerical terms, allowing for mathematical calculations and statistical analysis.

1. Discrete Data: This type of data consists of distinct, countable values, typically expressed as whole numbers.


Example:

  • Number of students in a class (e.g., 25 students)
  • Number of cars in a parking lot (e.g., 50 cars)

2. Continuous Data: Continuous data can take any value within a range and is measurable with precision. It is typically represented in decimal or fractional form.


Example:

  • Height of a person (e.g., 5.9 feet, 6.1 feet)
  • Temperature (e.g., 22.5°C, 35.6°C)
  • Time taken to run a marathon (e.g., 3.2 hours)

2. Qualitative Data (Categorical Data)

Qualitative data is non-numeric and describes qualities or characteristics. This type of data can be used to classify and categorize information.

1. Nominal Data: Nominal data consists of categories with no inherent order or ranking. It is used for labeling or naming things.


Example:

  • Colors of cars (e.g., Red, Blue, Green)
  • Types of fruit (e.g., Apple, Banana, Orange)
  • Nationality (e.g., American, French, Japanese)

2. Ordinal Data: Ordinal data represents categories with a meaningful order or ranking, but the intervals between categories are not necessarily equal or consistent.


Example:

  • Education level (e.g., High School, Bachelor’s, Master’s, PhD)
  • Customer satisfaction ratings (e.g., Poor, Fair, Good, Excellent)
  • Movie ratings (e.g., 1 star, 2 stars, 3 stars)

3. Binary Data

Binary data is a special case of categorical data where there are only two possible categories, typically represented by two states.

Example:

  • Whether a customer purchased a product or not (Yes/No)
  • A light is on or off (On/Off)
  • Gender (Male/Female)

4. Time Series Data

Time series data refers to a sequence of data points collected or recorded at regular time intervals. It is used to analyze trends over time.

Example:

  • Daily stock prices over a month (e.g., the closing price of a stock for each day of the month)
  • Hourly temperature readings (e.g., temperature recorded every hour throughout the day)
  • Monthly sales revenue (e.g., total sales for each month)

5. Textual Data

Textual data consists of words, sentences, or documents. It is often unstructured and requires techniques such as natural language processing (NLP) to analyze and extract insights.

Example:

  • Customer reviews (e.g., "Great product, would buy again!")
  • Social media posts (e.g., tweets, Facebook status updates)
  • Email contents (e.g., body of an email)
  • Articles or books (e.g., a news article or a research paper)

6. Spatial Data (Geospatial Data)

Spatial data is information that has a geographic or spatial component. It represents the location or shape of objects in space.

Example:

  • GPS coordinates (e.g., latitude 37.7749° N, longitude 122.4194° W)
  • Locations of stores on a map (e.g., store locations in a city)
  • Satellite images (e.g., images showing different landforms or urban areas)
  • Map data (e.g., roads, cities, and landmarks on a map)

7. Structured Data

Structured data refers to data that is organized in a predefined format, often in rows and columns, such as in relational databases. It is highly organized and easy to search and process.

Example:

  • A spreadsheet of sales data (e.g., columns for product name, price, quantity sold, and total revenue)
  • A customer database (e.g., columns for customer name, phone number, address, and purchase history)
  • Employee data in an HR management system (e.g., name, employee ID, department, and salary)

8. Unstructured Data

Unstructured data does not have a predefined format and is often more difficult to analyze. It includes text, images, video, audio, and other non-tabular formats.

Example:

  • Social media content (e.g., tweets, photos, or video posts)
  • Emails (e.g., the body of an email without any structure)
  • Audio recordings (e.g., podcasts or customer service calls)
  • Images and videos (e.g., pictures, surveillance footage)

9. Semi-Structured Data

Semi-structured data contains elements of both structured and unstructured data. It doesn’t conform to the rigid structure of a database but has some organization, such as tags or markers, to separate elements.

Example:

  • XML files (e.g., a file storing hierarchical data with tags like <name>, <address>, <phone>)
  • JSON files (e.g., a data format that represents data objects with key-value pairs)
  • NoSQL database records (e.g., data in MongoDB or CouchDB, where the format is flexible)
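For instance, a JSON record has a flexible, tagged structure that Python's built-in json module can parse directly. The snippet below is a minimal sketch; the field names and values are hypothetical.

```python
import json

# A hypothetical semi-structured record: nested key-value pairs,
# but no fixed schema enforced by a database.
raw = '{"name": "Alice", "address": {"city": "Paris", "zip": "75001"}, "phone": ["+33-1-2345"]}'

record = json.loads(raw)           # parse the JSON text into Python objects
print(record["name"])              # access fields by key
print(record["address"]["city"])   # nested elements are reached the same way
```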

10. Relational Data

Relational data refers to data that is stored in a relational database, where data is organized into tables that are related to each other through keys (such as primary and foreign keys). This type of data is often used in applications requiring structured relationships between datasets.

Example:

A database for an e-commerce store that links customer details to their orders through an Order ID

  • Customers Table: customer_id, name, address
  • Orders Table: order_id, customer_id (foreign key), product_name, quantity
  • A university database that links students to the courses they are enrolled in.
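The short sketch below, using Python's built-in sqlite3 module and made-up rows, mimics the customers/orders example above, linking the two tables through the customer_id key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
cur.execute("""CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
               product_name TEXT, quantity INTEGER,
               FOREIGN KEY (customer_id) REFERENCES customers(customer_id))""")

cur.execute("INSERT INTO customers VALUES (1, 'Alice', '12 Main St')")
cur.execute("INSERT INTO orders VALUES (100, 1, 'Laptop', 1)")

# Join the related tables through the customer_id key.
for row in cur.execute("""SELECT c.name, o.product_name, o.quantity
                          FROM customers c JOIN orders o ON c.customer_id = o.customer_id"""):
    print(row)  # ('Alice', 'Laptop', 1)
```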

Mean, Median, and Mode

In statistics, the mean, median, and mode are fundamental measures of central tendency. These measures help describe the center of a dataset, summarizing a large amount of data into a single value that represents the "average" or central point. Let’s take a closer look at each:

1. Mean (Average)

The mean is the most commonly used measure of central tendency. It is calculated by adding up all the values in a dataset and then dividing by the number of values.

Formula:

Mean = ∑X / N

Where:

  • ∑X is the sum of all the values in the dataset.
  • N is the number of values in the dataset.

Example:

Suppose you have the following dataset: 2, 4, 6, 8, 10.

  • The sum of the values = 2 + 4 + 6 + 8 + 10 = 30
  • Number of values = 5

Mean = 30 / 5 = 6

So, the mean is 6.

When to use the Mean:

The mean is useful when the data is symmetrically distributed, and there are no extreme outliers, as outliers can significantly skew the mean.
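In Python, the same calculation can be done with the built-in statistics module. This is a minimal sketch using the dataset from the example above.

```python
import statistics

data = [2, 4, 6, 8, 10]
print(sum(data) / len(data))   # 6.0, computed directly from the formula
print(statistics.mean(data))   # 6, using the standard library helper
```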

2. Median

The median is the middle value in a dataset when the values are ordered from smallest to largest (or vice versa). If there is an odd number of values, the median is the value at the center. If there is an even number of values, the median is the average of the two middle values.

Steps to Calculate the Median:

  • Sort the data in increasing or decreasing order.
  • If the number of values is odd, the median is the middle value.
  • If the number of values is even, the median is the average of the two middle values.

Example 1 (Odd number of values):

Dataset: 1, 3, 5, 7, 9

  • Sorted data: 1, 3, 5, 7, 9
  • Middle value: 5

So, the median is 5.

Example 2 (Even number of values):

Dataset: 1, 3, 5, 7

  • Sorted data: 1, 3, 5, 7
  • The two middle values are 3 and 5.
  • Median = (3 + 5) / 2 = 4

So, the median is 4.

When to use the Median:

The median is particularly useful when there are outliers, or the dataset is skewed. Unlike the mean, the median is not affected by extremely high or low values.
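The statistics module handles both the odd and even cases shown above; a small sketch:

```python
import statistics

print(statistics.median([1, 3, 5, 7, 9]))  # odd count -> middle value: 5
print(statistics.median([1, 3, 5, 7]))     # even count -> average of 3 and 5: 4.0
```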

3. Mode

The mode is the value that appears most frequently in a dataset. A dataset can have one mode, more than one mode, or no mode at all.

  • Unimodal: One value appears most frequently.
  • Bimodal: Two values appear with the same highest frequency.
  • Multimodal: More than two values appear with the same highest frequency.
  • No Mode: All values appear with the same frequency.

Example 1 (Unimodal):

Dataset: 2, 3, 4, 4, 5

  • The number 4 appears twice, while all other numbers appear once.
  • So, the mode is 4.

Example 2 (Bimodal):

Dataset: 1, 2, 2, 3, 3, 4

  • Both 2 and 3 appear twice, while the rest appear once.
  • The dataset is bimodal with modes 2 and 3.

Example 3 (No Mode):

Dataset: 1, 2, 3, 4, 5

  • All values appear exactly once.
  • So, there is no mode.

When to use the Mode:

The mode is useful when you want to know the most frequent or popular item in a dataset. It’s often used for categorical data where the values represent categories, like voting preferences, product choices, or survey responses.
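A quick sketch of the three cases using statistics.mode and statistics.multimode (the latter is available in Python 3.8+):

```python
import statistics

print(statistics.mode([2, 3, 4, 4, 5]))          # unimodal: 4
print(statistics.multimode([1, 2, 2, 3, 3, 4]))  # bimodal: [2, 3]
print(statistics.multimode([1, 2, 3, 4, 5]))     # every value ties: [1, 2, 3, 4, 5],
                                                 # i.e. no single mode stands out
```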

Variance and Standard Deviation

Variance and Standard Deviation are both measures of dispersion or spread in a dataset. They provide insights into how much the values in a dataset deviate from the mean. While they are closely related, they are presented in different forms, and each has its use case.

1. Variance

Variance is a statistical measure that describes the average squared deviation of each data point from the mean of the dataset. It gives an idea of how spread out the data is. A higher variance indicates that the data points are more spread out, while a lower variance indicates that the data points are closer to the mean.

The formula for Variance:

Population Variance (for the entire population):

σ² = (1/N) ∑(Xᵢ − μ)²

Where:

  • σ² = population variance
  • N = number of data points in the population
  • Xᵢ = each data point
  • μ = mean of the population

Sample Variance (for sample data):

s² = (1/(n−1)) ∑(Xᵢ − X̄)²

Where:

  • s² = sample variance
  • n = number of data points in the sample
  • X̄ = sample mean

Example of Variance:

Consider the dataset: 2, 4, 6, 8, 10

  1. Find the mean:
     Mean = (2 + 4 + 6 + 8 + 10) / 5 = 6

  2. Subtract the mean from each value:
     • (2 − 6) = −4
     • (4 − 6) = −2
     • (6 − 6) = 0
     • (8 − 6) = 2
     • (10 − 6) = 4

  3. Square each difference:
     • (−4)² = 16
     • (−2)² = 4
     • (0)² = 0
     • (2)² = 4
     • (4)² = 16

  4. Find the average of the squared differences:
     Variance = (16 + 4 + 0 + 4 + 16) / 5 = 40 / 5 = 8

So, the variance is 8 (for the population variance).

When to Use Variance:

Variance is particularly useful in assessing the overall spread of a dataset. However, because the units of variance are the square of the units of the original data (for example, if the data is in meters, the variance is in square meters), it can be difficult to interpret directly in real-world terms.
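The worked example above can be reproduced in a few lines. This is a small sketch using the statistics module; pvariance divides by N, matching the population formula.

```python
import statistics

data = [2, 4, 6, 8, 10]
mean = sum(data) / len(data)                     # 6.0
squared_diffs = [(x - mean) ** 2 for x in data]  # [16.0, 4.0, 0.0, 4.0, 16.0]
print(sum(squared_diffs) / len(data))            # 8.0, population variance from the formula
print(statistics.pvariance(data))                # 8, same result from the library
```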

2. Standard Deviation

The standard deviation is simply the square root of the variance. It measures the average amount by which each data point in a set differs from the mean. Since the standard deviation is in the same units as the original data, it is often more interpretable than variance.

The formula for Standard Deviation:

Population Standard Deviation:

σ = √[ (1/N) ∑(Xᵢ − μ)² ]

Sample Standard Deviation:

s = √[ (1/(n−1)) ∑(Xᵢ − X̄)² ]

Example of Standard Deviation (using the same dataset: 2, 4, 6, 8, 10):

From the previous calculation, the variance was 8. Now, to find the standard deviation:

Standard Deviation = √8 ≈ 2.83
So, the standard deviation of this dataset is approximately 2.83.
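Continuing the same example in code (a small sketch; pstdev is the population standard deviation):

```python
import math
import statistics

data = [2, 4, 6, 8, 10]
print(math.sqrt(statistics.pvariance(data)))  # 2.828..., square root of the variance
print(statistics.pstdev(data))                # same value, computed directly
```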

Population Data vs. Sample Data

Here’s a table comparing Population Data and Sample Data:

| Aspect | Population Data | Sample Data |
|---|---|---|
| Definition | The entire set of data or individuals of interest in a study. | A subset of the population, used to represent the entire population. |
| Size | Includes every individual or data point in the population. | Includes only a portion of the individuals or data points from the population. |
| Purpose | Represents the complete dataset of interest for a specific study or analysis. | Represents the population when it is impractical or impossible to collect data from the entire population. |
| Symbol for Mean | μ (Greek letter "mu") | X̄ (X bar, the sample mean) |
| Symbol for Variance | σ² (sigma squared) | s² (sample variance) |
| Formula for Variance | σ² = (1/N) ∑(Xᵢ − μ)² | s² = (1/(n−1)) ∑(Xᵢ − X̄)² |
| Formula for Standard Deviation | σ = √[(1/N) ∑(Xᵢ − μ)²] | s = √[(1/(n−1)) ∑(Xᵢ − X̄)²] |
| Example | The total population of a country (e.g., all citizens of a country). | A survey of 500 people out of 10,000 citizens in a city. |
| Data Collection | Often difficult or costly to collect data from every individual. | Easier and cheaper to collect data from a sample. |
| Accuracy | Provides the most accurate and complete information. | Less accurate, as it only represents part of the population and is subject to sampling error. |
| Inference | No statistical inference needed; the data is complete. | Requires inference methods (e.g., confidence intervals, hypothesis testing) to generalize to the population. |
| Use Cases | Used when studying a small, easily accessible population or when complete data is available. | Used when studying large populations where collecting data from the entire group is impractical. |
| Error Type | No sampling error, since all members are included. | Potential sampling error due to the sampling process. |
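The divisor is the practical difference between the two columns: population formulas divide by N, sample formulas by n − 1. A short sketch contrasting the two on the same numbers:

```python
import statistics

data = [2, 4, 6, 8, 10]

print(statistics.pvariance(data))  # 8     -> divides by N (population)
print(statistics.variance(data))   # 10    -> divides by n - 1 (sample)
print(statistics.pstdev(data))     # 2.828... (population standard deviation)
print(statistics.stdev(data))      # 3.162... (sample standard deviation)
```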

What is Probability?

Probability is a branch of mathematics that deals with the likelihood or chance of an event occurring. It provides a quantitative description of the likelihood of various outcomes in uncertain situations. Probability is used to model random events and is foundational to fields like statistics, finance, science, and many areas of decision-making.

Key Concepts in Probability

1. Event:

  • An event is a specific outcome or a set of outcomes of a random experiment.
  • Example: Rolling a die and getting a "3" is an event.

2. Sample Space (S):

  • The sample space is the set of all possible outcomes of an experiment.
  • Example: When rolling a fair six-sided die, the sample space is 
  • S={1,2,3,4,5,6}
  • S={1,2,3,4,5,6}.

3. Outcome:

  • An outcome is a single result from an experiment.
  • Example: A single roll of a die, where the outcome could be any of the numbers 1 through 6.

4. Probability of an Event (P):

  • The probability of an event is a measure of the likelihood that the event will occur, ranging from 0 (impossible event) to 1 (certain event).
  • Mathematically, the probability is calculated as:
    P(A) = (Number of favorable outcomes) / (Total number of possible outcomes)
  • Example: The probability of rolling a 3 on a fair six-sided die is P(3) = 1/6, since there is one "3" among six possible outcomes.

5. Complementary Events:

  • The complement of an event A is the event that A does not occur. The probability of the complement is 1 − P(A).
  • Example: The complement of "rolling a 3" on a die is "not rolling a 3". If P(3) = 1/6, then P(not 3) = 1 − 1/6 = 5/6.
6. Independent Events:

  • Two events are independent if the occurrence of one does not affect the probability of the other. For independent events, the probability of both events occurring is the product of their probabilities.
  • Example: Flipping a coin and rolling a die. The outcome of the coin flip does not affect the die roll.
    P(heads and 4) = P(heads) × P(4) = 1/2 × 1/6 = 1/12
    (A short simulation of this product rule appears after point 7 below.)
7. Dependent Events:

  • Two events are dependent if the occurrence of one event affects the probability of the other event.
  • Example: Drawing cards from a deck without replacement. If you draw a red card first, the probability of drawing another red card changes.
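To make the independence rule from point 6 concrete, here is a small simulation sketch; the random seed and trial count are arbitrary choices.

```python
import random

random.seed(0)
trials = 100_000

hits = 0
for _ in range(trials):
    coin = random.choice(["heads", "tails"])  # flip a fair coin
    die = random.randint(1, 6)                # roll a fair die
    if coin == "heads" and die == 4:
        hits += 1

print(hits / trials)   # close to 1/12 ≈ 0.0833, the product P(heads) * P(4)
```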

Types of Probability

1. Classical Probability:

  • Based on equally likely outcomes. Used when all outcomes of an experiment are equally likely to occur.
  • Formula:
    P(A) = (Number of favorable outcomes for A) / (Total number of possible outcomes)
  • Example: Tossing a fair coin. The probability of getting heads is P(heads) = 1/2.

2. Empirical (Experimental) Probability:

  • Based on observed data or experiments. The probability is determined by performing an experiment and counting the relative frequency of the event.
  • Formula:
    P(A) = (Number of times event A occurs) / (Total number of trials)
  • Example: If you roll a die 100 times and get a 3 on 15 of those rolls, the empirical probability of rolling a 3 is P(3) = 15/100 = 0.15.
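A minimal sketch of the empirical approach, counting how often a 3 comes up in simulated rolls:

```python
import random

random.seed(1)
rolls = [random.randint(1, 6) for _ in range(100)]  # 100 simulated die rolls

empirical_p3 = rolls.count(3) / len(rolls)  # relative frequency of rolling a 3
print(empirical_p3)   # varies run to run; approaches 1/6 ≈ 0.167 as the number of trials grows
```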

3. Subjective Probability:

  • Based on personal judgment or experience rather than mathematical calculation or experimentation.
  • Example: A weather forecaster might estimate a 70% chance of rain based on weather patterns, though this is a subjective assessment.

Conditional Probability

Conditional probability is the probability of an event occurring, given that another event has already occurred. It answers the question: what is the probability of event A happening, given that event B has occurred?

The notation for conditional probability is P(A|B), which reads as "the probability of A given B". This concept is essential in fields like statistics, machine learning, and decision-making, as it allows us to refine predictions based on new information.

Mathematical Definition

The conditional probability of an event A occurring, given that event B has already occurred, is given by the formula:

P(A|B) = P(A∩B) / P(B)

Where:

  • P(A|B) is the probability of A happening given B.
  • P(A∩B) is the probability that both A and B occur (the intersection of events A and B).
  • P(B) is the probability of event B occurring.

Note: This formula is valid only if P(B) > 0, because the denominator cannot be zero.

Understanding Conditional Probability

To understand conditional probability more intuitively, let's consider an example:

Example: Drawing Cards from a Deck

Suppose you have a standard deck of 52 playing cards, and you want to find the probability of drawing a King given that you have already drawn a Heart.

  • Define the events:
    1. Let event A be "drawing a King."
    2. Let event B be "drawing a Heart."
  • Find P(A∩B): The intersection of events A and B is drawing the King of Hearts, which is one specific card in the deck. Therefore:
    P(A∩B) = 1/52
  • Find P(B): The probability of drawing any Heart (13 Hearts in total) is:
    P(B) = 13/52 = 1/4
  • Apply the formula:
    P(A|B) = P(A∩B) / P(B) = (1/52) / (13/52) = 1/13

So, the probability of drawing a King, given that the card is a Heart, is 1/13.
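The same answer can be checked by brute force, enumerating a full deck and counting. This is a sketch with hypothetical rank and suit labels.

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["Hearts", "Diamonds", "Clubs", "Spades"]
deck = list(product(ranks, suits))          # all 52 (rank, suit) pairs

hearts = [card for card in deck if card[1] == "Hearts"]
kings_among_hearts = [card for card in hearts if card[0] == "K"]

# P(King | Heart) = P(King and Heart) / P(Heart), computed by counting
print(len(kings_among_hearts) / len(hearts))   # 1/13 ≈ 0.0769
```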

Bayes' Theorem

Bayes' Theorem is a fundamental concept in probability theory that describes how to update the probability of a hypothesis (or event) based on new evidence. It provides a way to calculate conditional probabilities by relating prior knowledge with the likelihood of observed data.

Mathematically, Bayes’ Theorem is expressed as:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where:

  • P(A|B) is the posterior probability: the probability of event A happening given that event B has occurred.
  • P(B|A) is the likelihood: the probability of observing event B given that event A is true.
  • P(A) is the prior probability: the initial belief about the probability of event A before observing the evidence.
  • P(B) is the marginal likelihood: the total probability of event B occurring.
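As a quick numeric check (a sketch reusing the card example from the previous section): with P(B|A) = P(Heart|King) = 1/4, P(A) = P(King) = 4/52, and P(B) = P(Heart) = 13/52, the theorem recovers the same conditional probability of 1/13.

```python
from fractions import Fraction

p_heart_given_king = Fraction(1, 4)    # a King is equally likely to be any suit
p_king = Fraction(4, 52)               # prior: 4 Kings in a 52-card deck
p_heart = Fraction(13, 52)             # marginal: 13 Hearts in the deck

# Bayes' theorem: P(King | Heart) = P(Heart | King) * P(King) / P(Heart)
p_king_given_heart = p_heart_given_king * p_king / p_heart
print(p_king_given_heart)   # 1/13
```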

Conclusion

Statistics and probability are essential in data science, enabling professionals to analyze and interpret data effectively. Statistics provides tools for summarizing data, making inferences, and testing hypotheses, while probability models uncertainty and helps in predicting outcomes. Together, they empower data scientists to make data-driven decisions and solve complex problems.

FAQs


Statistics helps data scientists analyze, summarize, and draw insights from data. It provides methods for making inferences about populations based on sample data, testing hypotheses, and evaluating model performance.

Probability models uncertainty and randomness, helping data scientists predict outcomes, estimate risks, and build predictive models. It’s crucial in machine learning, decision-making, and understanding data distributions.

Descriptive statistics summarize and describe data (e.g., mean, variance), while inferential statistics use sample data to make predictions or generalizations about a population (e.g., hypothesis testing, confidence intervals).

Bayes' Theorem is a method for updating the probability of a hypothesis based on new evidence. It is widely used in machine learning, particularly in classifiers like Naive Bayes, and for probabilistic decision-making.

Probability helps in building models that can predict outcomes based on uncertain or incomplete data. It’s essential for techniques like regression, classification, and anomaly detection, allowing models to handle uncertainty and make informed predictions.

Hypothesis testing involves evaluating two competing hypotheses using sample data to determine if there is enough evidence to support a specific claim about a population. It helps make decisions or inferences from data.
