Data Understanding
Data Science Methodology
Essentially, the data understanding section of the data science methodology answers the question: Is the data that you collected representative of the problem to be solved?
In order to understand the data related to congestive heart failure admissions, descriptive statistics needed to be run against the data columns that would become variables in the model.
1. First, these statistics included univariate statistics on each variable, such as mean, median, minimum, maximum, and standard deviation.
2. Second, pairwise correlations were used to see how closely certain variables were related and which ones, if any, were so highly correlated that they would be essentially redundant, making only one of them relevant for modeling.
3. Third, histograms of the variables were examined to understand their distributions.
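The first two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the values, and the 0.95 correlation threshold are all invented here and do not come from the case study.

```python
import statistics
from itertools import combinations

# Hypothetical admissions columns (names and values are made up for illustration).
columns = {
    "age": [54, 61, 67, 72, 78, 83],
    "length_of_stay": [3, 4, 4, 6, 7, 9],
    "num_prior_admissions": [0, 1, 1, 2, 3, 4],
}

def describe(values):
    """Univariate statistics for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "std": statistics.stdev(values),  # sample standard deviation
    }

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def highly_correlated(cols, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| at or above the threshold."""
    pairs = []
    for a, b in combinations(cols, 2):
        r = pearson(cols[a], cols[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs

summary = {name: describe(vals) for name, vals in columns.items()}
redundant_pairs = highly_correlated(columns)
```

Each pair in redundant_pairs is a candidate for keeping only one of the two variables. The threshold is a judgment call, not a fixed rule.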
Histograms are a good way to understand how the values of a variable are distributed, and which sorts of data preparation may be needed to make the variable more useful in a model.
For example, for a categorical variable that has too many distinct values to be informative in a model, the histogram would help decide how to consolidate those values.
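As a rough sketch of that idea, the snippet below counts how often each value of a categorical variable occurs, prints a text histogram, and folds rare values into one bucket. The column name, its values, the min_count cutoff, and the "other" label are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical categorical column (values are invented for illustration).
admission_source = [
    "emergency", "emergency", "emergency", "emergency",
    "referral", "referral", "referral",
    "transfer", "clinic", "hospice",
]

def text_histogram(values):
    """Print one bar per distinct value, most frequent first."""
    for value, count in Counter(values).most_common():
        print(f"{value:<10} {'#' * count} ({count})")

def consolidate_rare(values, min_count=2, other_label="other"):
    """Replace values seen fewer than min_count times with a single label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

text_histogram(admission_source)
consolidated = consolidate_rare(admission_source)
```

In practice the grouping would usually follow domain knowledge rather than frequency alone. The histogram only shows where consolidation is worth considering.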
The univariate statistics and histograms are also used to assess data quality. From the information provided, certain values can be re-coded or perhaps even dropped if necessary, such as when a certain variable has missing values.
The question then becomes, does "missing" mean anything?
Sometimes a missing value means "no" or "0" (zero); at other times it simply means "we don't know". A variable may also contain invalid or misleading values, such as a numeric variable called "age" that contains values from 0 to 100 and also 999, where that "triple-9" actually means "missing" but would be treated as a valid value unless we corrected it.
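The "triple-9" case can be illustrated with a small sketch. The sample ages are invented, and treating 999 as the only missing-value code is an assumption that would need to be confirmed against the data dictionary.

```python
import statistics

# Assumed sentinel code for "missing"; confirm against the data dictionary.
MISSING_CODES = {999}

def recode_missing(values, missing_codes=MISSING_CODES):
    """Replace sentinel codes with None so they are no longer treated as real values."""
    return [None if v in missing_codes else v for v in values]

def mean_ignoring_missing(values):
    """Mean over the values that are actually present."""
    present = [v for v in values if v is not None]
    return statistics.mean(present)

# Hypothetical "age" column: 0 is kept as a valid age, 999 is not.
age = [45, 67, 999, 82, 0, 999, 73]
clean_age = recode_missing(age)

raw_mean = statistics.mean(age)                  # distorted by the two 999 entries
clean_mean = mean_ignoring_missing(clean_age)    # computed on the five real ages
```

Comparing raw_mean with clean_mean shows how strongly an uncorrected sentinel value can distort even a simple statistic.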
Initially, the meaning of congestive heart failure admission was decided on the basis of a primary diagnosis of congestive heart failure.
But working through the data understanding stage revealed that the initial definition was not capturing all of the congestive heart failure admissions that were expected, based on clinical experience.
This meant looping back to the data collection stage and adding secondary and tertiary diagnoses, and building a more comprehensive definition of congestive heart failure admission.
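The effect of broadening the definition can be sketched as follows. The records, the field names, and the "CHF" label are hypothetical placeholders; real data would use actual diagnosis codes.

```python
# Placeholder label standing in for the real diagnosis codes (assumption).
CHF_CODE = "CHF"

# Hypothetical admission records, invented for illustration.
admissions = [
    {"id": 1, "primary": "CHF", "secondary": "DIABETES", "tertiary": None},
    {"id": 2, "primary": "PNEUMONIA", "secondary": "CHF", "tertiary": None},
    {"id": 3, "primary": "RENAL_FAILURE", "secondary": "HYPERTENSION", "tertiary": "CHF"},
    {"id": 4, "primary": "FRACTURE", "secondary": None, "tertiary": None},
]

def is_chf_admission(record, fields=("primary",)):
    """True if any of the given diagnosis fields carries the CHF label."""
    return any(record.get(field) == CHF_CODE for field in fields)

# Initial definition: primary diagnosis only.
narrow = [r["id"] for r in admissions if is_chf_admission(r)]

# Revised definition: primary, secondary, or tertiary diagnosis.
all_fields = ("primary", "secondary", "tertiary")
broad = [r["id"] for r in admissions if is_chf_admission(r, all_fields)]
```

With these made-up records the narrow definition finds one admission and the broad one finds three, which mirrors the kind of gap the clinical review pointed out.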
This is just one example of the iterative processes in the methodology.
The more one works with the problem and the data, the more one learns, and therefore the more the model can be refined, ultimately leading to a better solution to the problem.
This ends the Data Understanding section of this course.
Okiliong
What are some potential challenges or issues that can arise during the data understanding stage?
During the data understanding stage, there are several potential challenges or issues that can arise. Some of them include:
1. Insufficient or incomplete data: The data collected may not be sufficient or may have missing values, which can affect the accuracy and reliability of the analysis.
2. Data quality issues: The data may contain errors, outliers, or inconsistencies that need to be addressed before proceeding with the analysis.
3. Data bias: The data may be biased towards certain groups or may not be representative of the population, leading to biased results and conclusions.
4. Data privacy and security: Handling sensitive or confidential data requires ensuring proper security measures are in place to protect the data and comply with privacy regulations.
5. Data compatibility: Data from different sources may have different formats, structures, or units, making it challenging to integrate and analyze them together.
6. Data interpretation: Understanding the meaning and context of the data can be complex, especially when dealing with complex variables or domain-specific terminology.
7. Data volume and scalability: Large datasets can pose challenges in terms of storage, processing power, and computational resources required for analysis.
8. Stakeholder involvement: Involving stakeholders and subject matter experts in the data understanding stage can be challenging, especially if there are communication gaps or conflicting priorities.
It's important to address these challenges and issues during the data understanding stage to ensure the accuracy and reliability of the subsequent analysis and modeling steps.
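As one small illustration of the data compatibility point, the sketch below brings weights recorded in different units onto a single unit before they are combined. The two sources and their records are hypothetical.

```python
LB_PER_KG = 2.20462  # pounds per kilogram

def to_kg(value, unit):
    """Convert a weight to kilograms; fail loudly on an unknown unit."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value / LB_PER_KG
    raise ValueError(f"unknown weight unit: {unit!r}")

# Hypothetical records from two sources that use different units.
source_a = [{"patient": "A1", "weight": 82.0, "unit": "kg"}]
source_b = [{"patient": "B1", "weight": 220.462, "unit": "lb"}]

combined_kg = {
    rec["patient"]: round(to_kg(rec["weight"], rec["unit"]), 1)
    for rec in source_a + source_b
}
```

Raising an error on an unrecognized unit is a deliberate choice here: silently passing such a value through would hide exactly the kind of inconsistency this stage is meant to surface.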