Data Science Fundamental Problems
Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It encompasses several key areas, including data collection and storage, data cleaning and preprocessing, exploratory data analysis, statistical modeling, machine learning, and data visualization. These techniques are used to extract valuable insights, predict future trends, and drive data-driven decision-making.
There are fundamental problems that are essential to understanding the discipline of data science. One of these addresses the following question: Given two separable groups of data, what is the equation of the hyperplane that separates both groups and maximizes the sum of distances of the groups from the hyperplane?
Lets take these groups of data for an example:
For the hyperplane to:
- Separate both groups.
- Maximize the sum of distances from the points of the groups to the hyperplane.
To satisfy the first restriction, we need to create a line connecting the two groups and find the perpendicular line to that line.
To connect the groups, we can find the line that connects the centroid of each group.
To calculate the centroids, we just need to take the mean of the \(x\) and\(y\) components of each group:
For Group 1:
\[\text{mean}(x_{\text{g1}}) = \frac{\sum_{i=1}^{7} x_i}{7} = 3\]
\[\text{mean}(y_{\text{g1}}) = \frac{\sum_{i=1}^{7} y_i}{7} = 8\]
\[(x_1,y_1) = (3,8)\]
For Group 2:
\[\text{mean}(x_{\text{g2}}) = \frac{\sum_{i=1}^{4} x_i}{4} = 8.5\]
\[\text{mean}(y_{\text{g2}}) = \frac{\sum_{i=1}^{4} y_i}{4} = 3\]
\[(x_2,y_2) = (8.5,3)\]
With this information, we can find the equation of the line:
\(y = mx+ b\)
Where \(m = \frac{{y_2 - y_1}}{{x_2 - x_1}}\)
Substituting, we have:
\[m = \frac{{3 - 8}}{{8.5 - 3}} = \frac{-5}{5.5}\]
To find \(b\) we can use a point and substitute into the equation. Let’s take \((3,8)\) for example:
\[8 = \frac{-5}{5.5} \cdot 3 + b\]
\[b = 10.7272\]
So, the equation will be:
\[y = \frac{-5x}{5.5}+ 10.7272\]
Now, to find the perpendicular line to that line. Actually, we don’t need all the information from the perpendicular line; we just need the slope. We know that the slope of a perpendicular line is \(m_p = \frac{-1}{m_c}\), where \(m_c\) is the slope of the line connecting the centroids.This is because two lines are perpendicular if the product of their slopes is \(-1\).
Doing the math, we have:
\[m_c = \frac{-5}{5.5}\]
\[m_p = \frac{5.5}{5} = \frac{11}{10}\]
So, the equation of the hyperplane looks like this:
\[y = \frac{11x}{10}+ b\]
Here is where we need to satisfy our second constraint:
If we want to maximize the sum of distances of both groups to the hyperplane, then the sum of distances of each group must be equal.
The equation of the distance from a point to a hyperplane is:
\[\text{distance}(x_0, y_0) = \frac{|Ax_0 + By_0 + C|}{\sqrt{A^2 + B^2}}\]
Where \(Ax + By + C = 0\) is the equation of the line.
So the sum of distances of group 1 will be:
\[\sum_{i=1}^{7} \text{distance}(x_i, y_i) = \sum_{i=1}^{7} \frac{|Ax_i + By_i + C|}{\sqrt{A^2 + B^2}}\]
Similarly, for group 2:
\[\sum_{i=1}^{4} \text{distance}(x_i, y_i) = \sum_{i=1}^{4} \frac{|Ax_i + By_i + C|}{\sqrt{A^2 + B^2}}\]
The sum of distances of each group must be equal; therefore, we can equate these equations:
\[\sum_{i=1}^{7} \frac{|Ax_i + By_i + C|}{\sqrt{A^2 + B^2}} = \sum_{i=1}^{4} \frac{|Ax_i + By_i + C|}{\sqrt{A^2 + B^2}}\]
Simplifying, we can multiply \(\sqrt{A^2 + B^2}\) in on both sides and factorize \(A\) and \(B\):
\[\begin{aligned} &\left| A \cdot \begin{bmatrix} x_{1g1} \\ x_{2g1} \\ \vdots \\ x_{7g1} \end{bmatrix} + B \cdot \begin{bmatrix} y_{1g1} \\ y_{2g1} \\ \vdots \\ y_{7g1} \end{bmatrix} + 7C \right| = \left| A \cdot \begin{bmatrix} x_{1g2} \\ x_{2g2} \\ \vdots \\ x_{4g2} \end{bmatrix} + B \cdot \begin{bmatrix} y_{1g2} \\ y_{2g2} \\ \vdots \\ y_{4g2} \end{bmatrix} + 4C \right| \end{aligned}\]Substituting, we have:
\[|21A +56B +7C| = |34A + 12 B + 4C|\]
Let’s remember that we know that our equation looks like this:
\[y = \frac{11x}{10}+ b\]
We can transform this equation into the \(Ax + By + C = 0\) form:
\[\frac{-11}{10}x + y - b = 0\]
So we know that the values of \(A\) and \(B\) are \(- \frac{11}{10}\) and \(1\) respectively and \(C\) is equal to \(-b\).
To find \(C\) we can solve for:
\[|21A +56B +7C| = |34A + 12 B + 4C|\]
\[|-23.1+56 +7C| = |-37.4+ 12 + 4C|\]
\[|32.9+ 7C| = |-25.4 + 4C|\]
Dividing each side by \(|-25.4 + 4C|\)
\[\frac{|32.9+ 7C|}{|-25.4 + 4C|} = 1\]
For being an absolute value, we have two solutions:
\[\frac{32.9+ 7C}{-25.4 + 4C} = 1\]
or
\[\frac{32.9+ 7C}{-25.4 + 4C} = -1\]
Solving for the first case:
\[32.9+ 7C = -25.4 + 4C\]
\[58.3 = -3C\]
\[C = -19.433333\]
We can observe that this line does not separate the two groups.
Solving for the second case:
\[32.9+ 7C = 25.4 - 4C\]
\[7.5 = -11C\]
\[C = -0.68181818\]
This line separates the groups and maximizes the sum of distances from each group to the line.
Therefore, the optimal hyperplane has the following form:
\[-1.1x + y -0.68181818 = 0\]
And the maximum distance for both groups is 18.9204511.
We can generalize this, let’s try in another pair of groups of data:
With 100 points for each group.
With 50 and 35 points for Group 1 and Group 2.
With 120 and 85 points for Group 1 and Group 2.
Acknowledgements
I would like to express my gratitude to my colleagues Miguel Perez, Frank Ochoa and Leonardo Cumplido, for their invaluable contributions to this project. Their insights, discussions, and support were instrumental in reaching the solution presented in this blog. Thank you for your collaboration and dedication.