# What to do with nominal variables in a mixed data set?

… and I keep looking for a definitive answer … (so do not read on, if you expect one!)

I would like to perform PCA on a mixed dataset of nominal and continuous variables. Examples of such variables are gender (m/f; nominal), living on a farm (y/n; nominal) and organic contaminant concentration data (continuous). The nominal variables cannot be ordered.

According to Kolenikov & Angeles the problem is that

“discrete data tend to have high skewness and kurtosis, especially if the majority of the data points are concentrated in a single category”

[Kolenikov & Angeles, 2004, The Use of Discrete Data in PCA: Theory, Simulations, and Applications to Socioeconomic Indices].

In brief – if using my data as is, since I have lots of data in a single category, I am potentially introducing a bias. Furthermore, in a board post on methodspace.com it has been stated that

“the problem is that when the distributions are far from 50:50 for a dichotomous variable, the correlations are suppressed.”
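To get a feeling for this suppression, here is a small simulation sketch in plain Python. It is my own illustration, not something taken from the board post: two standard normal variables with a known correlation are cut into 0/1 variables at a threshold, and the ordinary Pearson correlation of the resulting binary variables is computed. The function names, the correlation of 0.6, the sample size and the thresholds (0 for a 50:50 split, 1.645 for roughly 95:5) are all assumptions chosen for the example.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def dichotomized_correlation(rho, threshold, n=20000, seed=1):
    """Correlation of two 0/1 variables obtained by cutting correlated normals.

    rho       -- correlation of the underlying continuous variables (assumed)
    threshold -- cut point; 0.0 gives a 50:50 split, 1.645 roughly 95:5
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # z2 has correlation rho with z1 by construction
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        xs.append(1 if z1 > threshold else 0)
        ys.append(1 if z2 > threshold else 0)
    return pearson(xs, ys)


balanced = dichotomized_correlation(0.6, 0.0)    # 50:50 split
skewed = dichotomized_correlation(0.6, 1.645)    # roughly 95:5 split
print("balanced split:", round(balanced, 3))
print("skewed split:  ", round(skewed, 3))
```

Both binary correlations come out below the underlying 0.6, and the skewed split loses noticeably more than the balanced one, which is the effect described in the quote.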

This suppression certainly applies to my data. A first step towards a solution is also offered there (reiterating Kolenikov & Angeles’ suggestion):

“You can also go for estimating polychoric/ tetrachoric correlations. Here you assume that each binary (or ordered discrete) indicator is a manifestation of underlying continuous variables, and so you estimate the correlations between these underlying variables.”
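To make the idea of an "underlying" correlation a bit more concrete, here is a minimal sketch for a single 2×2 table. It does not do the maximum-likelihood estimation that dedicated packages perform; instead it uses the classical "cos-pi" approximation, r ≈ cos(π / (1 + sqrt(ad/bc))), which is only a rough stand-in and is known to work best when both variables are split close to 50:50. The function names and the example counts are made up for illustration.

```python
import math


def phi_coefficient(a, b, c, d):
    """Pearson correlation of two binary variables from 2x2 counts.

    a = both 0, b = (0, 1), c = (1, 0), d = both 1.
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom


def tetrachoric_cos_pi(a, b, c, d):
    """Cos-pi approximation to the tetrachoric correlation (not an ML fit)."""
    ad, bc = a * d, b * c
    if ad == 0 and bc == 0:
        raise ValueError("table is degenerate: ad and bc are both zero")
    if bc == 0:
        return 1.0  # odds ratio is infinite, approximation tends to +1
    return math.cos(math.pi / (1.0 + math.sqrt(ad / bc)))


# Hypothetical, balanced table: 40 concordant and 10 discordant cases per cell
table = (40, 10, 10, 40)
print("phi:        ", round(phi_coefficient(*table), 3))     # 0.6
print("tetrachoric:", round(tetrachoric_cos_pi(*table), 3))  # 0.809
```

For this table the estimated correlation of the underlying continuous variables is larger than the phi coefficient computed directly on the 0/1 codes, which is the direction of correction the quote describes.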

On the stats.stackexchange.com board, different avenues are discussed:

“Although a PCA applied on binary data would yield results comparable to those obtained from a Multiple Correspondence Analysis (factor scores and eigenvalues are linearly related), there are more appropriate techniques to deal with mixed data types, namely Multiple Factor Analysis for mixed data available in the FactoMineR R package. […] The challenge with categorical variables is to find a suitable way to represent distances between variable categories and individuals in the factorial space. To overcome this problem, you can look for a non-linear transformation of each variable — whether it be nominal, ordinal, polynomial, or numerical — with optimal scaling.”

On the other hand, at least for some data sets, polychoric PCA does not substantially improve results, as Kolenikov & Angeles (2004) admit:

“The gain from using computationally intensive polychoric correlations in getting the “correct” variable weights may not be very large compared to the PCA on ordinal data. However, only the polychoric analysis gives consistent estimates of the explained proportion. […] The misclassification rates, as well as Spearman correlation of the theoretical and empirical welfare indices, are not substantially different between the ordinal, group means and polychoric versions of PCA, although the difference is statistically significant due to huge sample size of the simulation results data set.”

Additionally, my data set contains a lot of values below the detection limit (<LOD, coded as zero) and some missing data. The former need to be recoded to better represent an “actual value”, since the true concentration is not zero. A simple and often used replacement is 1/2 LOD, which is prone to bias; a better option is Kaplan-Meier estimates (see D. R. Helsel (2005) More than obvious: Better methods for interpreting nondetect data, Environ Sci Technol 39, 419A–423A). A convenient side effect is that the zeros are removed, which matters if the data are log-transformed before PCA (log(0) is not finite).
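The simple 1/2 LOD substitution can be sketched in a few lines. This is only an illustration of the basic approach, not of the Kaplan-Meier method; the function name, the use of `None` for missing values and the example numbers are my own assumptions.

```python
def substitute_half_lod(values, lod):
    """Replace nondetects (coded as 0) by lod / 2; leave everything else alone.

    Missing values are assumed to be coded as None and are passed through
    unchanged, so they can be handled in a separate step.
    """
    half = lod / 2.0
    return [half if v is not None and v == 0 else v for v in values]


# Hypothetical concentrations with two nondetects and one missing value
concentrations = [0.0, 1.2, None, 0.0, 3.4]
print(substitute_half_lod(concentrations, lod=0.5))
# [0.25, 1.2, None, 0.25, 3.4]
```

Keeping nondetects and missing values apart matters here: a nondetect carries information (the value is somewhere below the LOD), whereas a missing value does not.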

In conclusion, I will try the following for now, always as a basic (b) and a comprehensive (c) approach for comparison:

• 1/2 LOD replacement for <LOD values (b)
• Kaplan-Meier estimates for <LOD values (c)
• Replacement of discrete variables: 0.5 & 1 instead of 0 & 1 to make values finite (b)
• Employ the polycor R package to treat discrete variables, hopefully giving me finite values (c)

Other avenues to explore…

• not to carry out PCA, but to use the FactoMineR package for Multiple Factor Analysis
• optimal scaling using the homals package (which does not work right now in the most recently released version of R; it aborts when loading the required rgl package)

The next issue is that I do not yet know how to validate the results. As a first step, I will certainly carry out a detailed analysis and comparison of the scree, loadings and scores plots resulting from the approaches described above.