This data set contains MeSH tags for 7470 cases, 2696 of which are normal. Within a MeSH tag, the root word and the later words carry different significance: the root word sets the topic of the full sentence, whereas the later words are qualifiers that describe the situation. The full data set can be seen at the end of the post.

The goal here is to create word embeddings for the root words such that similar words lie close together. Applying the word2vec algorithm in gensim gave very poor results on this data set. We therefore tried the Hellinger PCA method, which performs dimension reduction on the co-occurrence matrix.

The co-occurrence matrix encodes the conditional probability of word co-occurrence: each entry estimates how likely a context word is to appear given a dictionary word. The text (corpus) is first used to form the dictionary words and the context words; the matrix then has dictionary words as rows and context words as columns.
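As a rough illustration of the idea, here is a minimal NumPy sketch of Hellinger PCA. It is not the code used for this post: the function name `hellinger_pca`, the toy counts, and the choice to center the columns before the SVD are all assumptions made for the example.

```python
import numpy as np

def hellinger_pca(counts, dim):
    """Embed each row of a co-occurrence count matrix into `dim` dimensions."""
    counts = np.asarray(counts, dtype=float)
    # Row-normalize the counts into conditional probabilities P(context | word).
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_sums,
                      out=np.zeros_like(counts), where=row_sums > 0)
    # After the square root, Euclidean distance between rows is
    # proportional to the Hellinger distance between the distributions.
    roots = np.sqrt(probs)
    # PCA via SVD of the column-centered matrix (centering is an assumption).
    centered = roots - roots.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :dim] * s[:dim]

# Made-up counts: 4 dictionary words (rows) x 3 context words (columns).
toy_counts = np.array([[8, 1, 0],
                       [7, 2, 0],
                       [0, 1, 9],
                       [1, 0, 8]])
embedding = hellinger_pca(toy_counts, dim=2)
```

In this toy example the first two rows have similar context distributions, so their embeddings end up closer to each other than to the last two rows.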

pre-processing

In this data set, there are 1679 unique MeSH sentences, 178 unique words and 118 unique root words.

There are 23 words that appear only once and 12 words that appear only twice in the data set, as seen in Table 1. Of the 23 one-appearance words, 11 occur as one-word sentences. We remove these 11, arriving at a dictionary of 167 words and a text of 1668 sentences.

| one appearance | two appearances |
| osteoporosis | aortic aneurysm |
| sarcoidosis | tuberculosis |
| no indexing | acute |
| technical quality of image unsatisfactory | hyperostosis, diffuse idiopathic skeletal |
| cystic fibrosis | cardiophrenic angle |
| bronchiolitis | funnel chest |
| cholelithiasis | contrast media |
| normal | pneumoperitoneum |
| pectus carinatum | bone and bones |
| hypertension, pulmonary | hernia, diaphragmatic |
| pulmonary disease, chronic obstructive | multilobar |
| hypovolemia | mastectomy |
| pleural sinus | |
| hemopneumothorax | |
| bronchitis | |
| supracardiac | |
| esophagus | |
| fibrosis | |
| colonic interposition | |
| azygos lobe | |
| expansile bone lesions | |
| aortic valve | |
| hemothorax | |

Table 1. Words that only appear once or twice in the data set. The words in bold font are stand-alone sentences and are removed from the dictionary.
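The pre-processing step described above can be sketched as follows. This is only an illustration under assumptions: each sentence is taken to be a list of words with the root word first, and the function name and the toy corpus are hypothetical rather than taken from the real data.

```python
from collections import Counter

def drop_rare_single_word_sentences(sentences):
    """Remove one-word sentences whose only word appears exactly once."""
    word_counts = Counter(word for sentence in sentences for word in sentence)
    kept = [s for s in sentences
            if not (len(s) == 1 and word_counts[s[0]] == 1)]
    return kept, word_counts

# Hypothetical toy corpus: each sentence is [root word, qualifier, ...].
toy_sentences = [
    ["opacity", "lung", "left"],
    ["opacity", "lung", "right"],
    ["normal"],                  # appears once and stands alone: removed
    ["cardiomegaly"],            # stands alone but appears twice: kept
    ["cardiomegaly", "mild"],
]
kept_sentences, counts = drop_rare_single_word_sentences(toy_sentences)
dictionary = sorted({w for s in kept_sentences for w in s})
```

A word that appears once but inside a longer sentence is kept, which matches the rule of removing only the one-appearance words that form stand-alone sentences.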

construct co-occurrence matrix

The dictionary words are the words that receive embeddings. We choose the root words of the MeSH sentences as dictionary words; there are 107 of them (the 118 unique root words minus the 11 removed during pre-processing).

The context words are basically raw features. In this data set, there are 87 qualifier words with a total of 4413 appearances. The common way to pick the context words is to use the top 10% or 20% of the most frequently occurring words. I don’t think that works in this case. The top 30 qualifier words in this data set are shown in Table 2. They account for 3806 appearances, about 86% of the total.

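| qualifier word | count |
| lung | 649 |
| right | 450 |
| left | 346 |
| bilateral | 272 |
| mild | 227 |
| base | 196 |
| multiple | 168 |
| upper lobe | 115 |
| lower lobe | 114 |
| round | 94 |
| small | 93 |
| hilum | 91 |
| interstitial | 87 |
| prominent | 82 |
| patchy | 70 |
| apex | 68 |
| posterior | 65 |
| thoracic vertebrae | 61 |
| thorax | 58 |
| middle lobe | 55 |
| scattered | 53 |
| ribs | 52 |
| severe | 52 |
| chronic | 51 |
| focal | 45 |
| degenerative | 44 |
| streaky | 41 |
| pleura | 40 |
| enlarged | 34 |
| diffuse | 33 |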

Table 2. Most frequently occurring qualifier words.
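To make the construction concrete, here is a small sketch of building a root-by-qualifier co-occurrence matrix with the most frequent qualifiers as context words. The sentence layout (root word first, qualifiers after), the function name `build_cooccurrence`, the `top_k` parameter, and the toy sentences are assumptions for illustration only.

```python
from collections import Counter
import numpy as np

def build_cooccurrence(sentences, top_k=None):
    """Count how often each root word co-occurs with each qualifier word."""
    qualifier_counts = Counter(q for s in sentences for q in s[1:])
    # Keep the top_k most frequent qualifiers as context words (all if None).
    contexts = [q for q, _ in qualifier_counts.most_common(top_k)]
    roots = sorted({s[0] for s in sentences})
    row = {r: i for i, r in enumerate(roots)}
    col = {c: j for j, c in enumerate(contexts)}
    matrix = np.zeros((len(roots), len(contexts)), dtype=int)
    for s in sentences:
        for q in s[1:]:
            if q in col:
                matrix[row[s[0]], col[q]] += 1
    return roots, contexts, matrix

# Hypothetical toy sentences: [root word, qualifier, ...].
toy_sentences = [
    ["opacity", "lung", "left"],
    ["opacity", "lung", "right"],
    ["cardiomegaly", "mild"],
    ["atelectasis", "lung", "left"],
]
roots, contexts, matrix = build_cooccurrence(toy_sentences, top_k=2)
```

Note that a root word whose qualifiers all fall outside the chosen context words gets an all-zero row (here `cardiomegaly`), so the cut-off for context words directly affects which dictionary words can be embedded meaningfully.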

Many of these qualifier words describe a specific feature, as shown in Table 3, or refer to an organ. If they were used as the context that defines the dictionary words, closeness between dictionary words would merely indicate similarity in features such as left-right position.

Even after PPMI (positive pointwise mutual information) reweighting, the word embedding still does not look good under t-SNE: most points sit close together and their distances do not seem to make much sense.
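For reference, PPMI reweighting of a count matrix can be sketched as below. This is a generic textbook-style version with base-2 logarithms, not necessarily the exact variant used here; the function name `ppmi` and the toy counts are assumptions.

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information of a co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        p_wc = counts / counts.sum()              # joint P(word, context)
        p_w = p_wc.sum(axis=1, keepdims=True)     # marginal P(word)
        p_c = p_wc.sum(axis=0, keepdims=True)     # marginal P(context)
        pmi = np.log2(p_wc / (p_w * p_c))
    # Zero counts give -inf and empty rows or columns give nan; set both to 0.
    pmi[~np.isfinite(pmi)] = 0.0
    # Keep only positive associations.
    return np.maximum(pmi, 0.0)

# Made-up counts: each word co-occurs with exactly one context word.
toy_counts = np.array([[2, 0],
                       [0, 2]])
weights = ppmi(toy_counts)
```

PPMI down-weights context words that are frequent everywhere, but it cannot change what the context words describe, which may be why the reweighting alone did not help.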

| feature | context words |
| left-right | left, right, bilateral |
| front-back | anterior, posterior |
| up-down | apex, base |
| distribution | patchy, focal, streaky, diffuse, scattered |
| shape | round, irregular |
| stage | acute, chronic |
| size | small, large |
| visibility | prominent, obscured |
| severity | mild, severe, moderate, borderline |

Table 3. Context words that describe a specific feature.

summary

So far the embedding does not work. Maybe all the feature words in Table 3 should be removed from the context words first? Or maybe I should not use the root words alone as target words. Or is the data set simply too small?

appendix: the full data set

The full data set is shown as follows. The ‘root’ node is only for display purposes. All words in blue boxes except ‘root’ are the root words in MeSH sentences. The qualifiers are in orange boxes. A blank orange box means that the root word can be used alone.