Source | Persagen.com |
Author | Dr. Victoria A. Stuart, Ph.D. |
Created | 2019-11-28 |
Last modified | |
Summary | Evaluation of biomedical contextual language models, visualization |
I have been evaluating contextual language models for biomedical natural language processing (BioNLP). Several platforms, including a number of general-use NLP packages, support these models.
Several of those general-use packages provide opportunities for visualizing natural-language tags and embeddings. For example, spaCy's visualizers allow the display of color-coded entities, as I describe here (I also describe those CRAFT corpus entity labels here).
Likewise, I was intrigued by Visualizing spaCy vectors in TensorBoard, an example on the spaCy examples page: it is apparently possible to view those embeddings (tensors) in the TensorFlow Embedding Projector [example]!
At the time (2019-11-27) I was looking at Flair embeddings, while awaiting the anticipated release of a pretrained BioFlair model, so I thought I would try viewing those embeddings in TensorFlow's Embedding Projector.
Note: there is currently (late Nov 2019) a bug in Torch / PyTorch (on which Flair depends) that prevents its installation under Python 3.8. Since my Arch Linux environment runs Python 3.8 and there is currently no Python 3.8 install wheel for Torch / PyTorch [screenshot; a Py3.8-compatible release is expected ~mid-Dec 2019], I cannot currently install Flair in that environment.
My solution was to create a Python 3.7 venv (which I describe in this StackOverflow answer), then do the various installations inside it.
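For reference, creating and activating such a venv looks like the following sketch (the path matches the VIRTUAL_ENV shown in the transcript below, and p37 is my ~/.bashrc alias for the activation command; a python3.7 binary must already be installed):

## Create a Python 3.7 virtual environment (assumes a python3.7 binary is installed):
python3.7 -m venv ~/venv/py3.7

## Activate it (aliased as p37 in my ~/.bashrc):
source ~/venv/py3.7/bin/activate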
Having installed Flair, Torch / PyTorch, TensorFlow, etc. in that Py3.7 venv, I proceeded to figure out how to load the Flair embeddings in TF Projector. The following code provides a step-by-step explanation.
[Click here to read the following code as a single (monochromatic, plain-text) file in the browser.]
## Install Python 3.7 in a Python 3.8 environment:
## https://stackoverflow.com/a/58964629/1904943

## Test (in terminal):

[victoria@victoria ~]$ date
Wed 20 Nov 2019 04:25:38 PM PST

[victoria@victoria ~]$ p37    ## ~/.bashrc alias
[Python 3.7 venv (source ~/venv/py3.7/bin/activate)]

(py3.7) [victoria@victoria ~]$ env | grep -i virtual
VIRTUAL_ENV=/home/victoria/venv/py3.7

(py3.7) [victoria@victoria ~]$ python --version
Python 3.7.4

(py3.7) [victoria@victoria ~]$ pip install --upgrade pip
...
Successfully installed pip-19.3.1

## https://github.com/lanpa/tensorboardX
## Also installs (if I recall) tensorflow, other dependencies:

(py3.7) [victoria@victoria ~]$ pip install tensorboardX    ## << note: capital X
...
## If needed: pip install moviepy

(py3.7) [victoria@victoria ~]$ pip install flair
...
Successfully installed Cython-0.29.14 SudachiPy-0.4.0 attrs-19.3.0 backcall-0.1.0
  boto-2.49.0 boto3-1.10.23 botocore-1.13.23 bpemb-0.3.0 certifi-2019.9.11
  cffi-1.13.2 chardet-3.0.4 click-7.0 cloudpickle-1.2.2 cycler-0.10.0
  dartsclone-0.6 decorator-4.4.1 deprecated-1.2.7 docutils-0.15.2 flair-0.4.4
  future-0.18.2 gensim-3.8.1 hyperopt-0.2.2 idna-2.8 importlib-metadata-0.23
  ipython-7.6.1 ipython-genutils-0.2.0 jedi-0.15.1 jmespath-0.9.4 joblib-0.14.0
  kiwisolver-1.1.0 kytea-0.1.4 langdetect-1.0.7 matplotlib-3.1.1
  more-itertools-7.2.0 mpld3-0.3 natto-py-0.9.0 networkx-2.2 numpy-1.17.4
  packaging-19.2 parso-0.5.1 pexpect-4.7.0 pickleshare-0.7.5 pluggy-0.13.0
  prompt-toolkit-2.0.10 ptyprocess-0.6.0 py-1.8.0 pycparser-2.19 pygments-2.4.2
  pymongo-3.9.0 pyparsing-2.4.5 pytest-5.3.0 python-dateutil-2.8.1
  regex-2019.11.1 requests-2.22.0 s3transfer-0.2.1 sacremoses-0.0.35
  scikit-learn-0.21.3 scipy-1.3.2 segtok-1.5.7 sentencepiece-0.1.83 six-1.13.0
  sklearn-0.0 smart-open-1.9.0 sortedcontainers-2.1.0 sqlitedict-1.6.0
  tabulate-0.8.6 tiny-tokenizer-3.0.1 torch-1.3.1 torchvision-0.4.2 tqdm-4.38.0
  traitlets-4.3.3 transformers-2.1.1 urllib3-1.24.3 wcwidth-0.1.7 wrapt-1.11.2
  zipp-0.6.0

(py3.7) [victoria@victoria ~]$ python
Python 3.7.4 (default, Nov 20 2019, 11:36:53)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import flair    ## works, yea!! :-D
>>>
[victoria@victoria tensorflow]$ cd /mnt/Vancouver/apps/tensorflow/
[victoria@victoria tensorflow]$ date; pwd; echo; ls -l
Thu 28 Nov 2019 10:50:19 AM PST
/mnt/Vancouver/apps/tensorflow
total 928
-rw------- 1 victoria victoria 19305 Nov 28 10:49 _readme-tensorflow-victoria.txt
drwxr-xr-x 11 victoria victoria 4096 Nov 26 16:45 runs
[victoria@victoria tensorflow]$ tensorboard --logdir runs/
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.0.0 at http://localhost:6006/ (Press CTRL+C to quit)
...
## Flair imports (Sentence lives in flair.data):
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.embeddings import FlairEmbeddings, StackedEmbeddings

## Example sentence (biomedical text):
sentence = Sentence('The RAS-MAPK signalling cascade serves as a central node in transducing signals from membrane receptors to the nucleus.')

## Named-entity tagging (not needed for the embeddings; included for interest):
tagger = SequenceTagger.load('ner')
tagger.predict(sentence)

## Load the PubMed-trained forward and backward Flair language models ...
embeddings_f = FlairEmbeddings('pubmed-forward')
embeddings_b = FlairEmbeddings('pubmed-backward')

## ... and stack them, so that each token receives a single, concatenated embedding:
stacked_embeddings = StackedEmbeddings([
    embeddings_f,
    embeddings_b,
])

## Embed the sentence (attaches a .embedding tensor to each token):
stacked_embeddings.embed(sentence)

## str(token) looks like "Token: 1 The"; field [2] is the token text:
tokens = [str(token).split()[2] for token in sentence]
print(tokens)
'''
['The', 'RAS-MAPK', 'signalling', 'cascade', 'serves', 'as', 'a', 'central', 'node', 'in', 'transducing', 'signals', 'from', 'membrane', 'receptors', 'to', 'the', 'nucleus.']
'''
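As an aside, Flair's Token objects expose the token string directly via the Token.text attribute, so the same list can be built without parsing str(token); a minimal equivalent:

## Equivalent, using Token.text rather than parsing str(token):
tokens = [token.text for token in sentence]
print(tokens)    ## same output as above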
for token in sentence:
    print(token)
    print(token.embedding)
    print(token.embedding.shape)
'''
Token: 1 The
tensor([ 0.0077, -0.0227, -0.0004, ..., 0.1377, -0.0003, 0.0028])
torch.Size([2300])
Token: 2 RAS-MAPK
tensor([-0.0007, -0.1601, -0.0274, ..., 0.1982, 0.0013, 0.0042])
torch.Size([2300])
Token: 3 signalling
tensor([ 4.2534e-03, -3.1018e-01, -3.9660e-01, ..., 5.9336e-02,
-9.4445e-05, 1.0025e-02])
torch.Size([2300])
Token: 4 cascade
tensor([ 0.0026, -0.0087, -0.1398, ..., -0.0037, 0.0012, 0.0274])
torch.Size([2300])
Token: 5 serves
tensor([-0.0005, -0.0164, -0.0233, ..., -0.0013, 0.0039, 0.0004])
torch.Size([2300])
Token: 6 as
tensor([ 3.8261e-03, -7.6409e-02, -1.8632e-02, ..., -2.8906e-03,
-4.4556e-04, 5.6909e-05])
torch.Size([2300])
Token: 7 a
tensor([ 0.0035, -0.0207, 0.1700, ..., -0.0193, 0.0017, 0.0006])
torch.Size([2300])
Token: 8 central
tensor([ 0.0159, -0.4097, -0.0489, ..., 0.0743, 0.0005, 0.0012])
torch.Size([2300])
Token: 9 node
tensor([ 9.7725e-03, -3.3817e-01, -2.2848e-02, ..., -6.6284e-02,
2.3646e-04, 1.0505e-02])
torch.Size([2300])
Token: 10 in
tensor([ 0.0219, -0.0677, -0.0154, ..., 0.0102, 0.0066, 0.0016])
torch.Size([2300])
Token: 11 transducing
tensor([ 0.0092, -0.0431, -0.0450, ..., 0.0060, 0.0002, 0.0005])
torch.Size([2300])
Token: 12 signals
tensor([ 0.0047, -0.2732, -0.0408, ..., 0.0136, 0.0005, 0.0072])
torch.Size([2300])
Token: 13 from
tensor([ 0.0072, -0.0173, -0.0149, ..., -0.0013, -0.0004, 0.0056])
torch.Size([2300])
Token: 14 membrane
tensor([ 0.0086, -0.1151, -0.0629, ..., 0.0043, 0.0050, 0.0016])
torch.Size([2300])
Token: 15 receptors
tensor([ 7.6452e-03, -2.3825e-01, -1.5683e-02, ..., -5.4974e-04,
-1.4646e-04, 6.6120e-03])
torch.Size([2300])
Token: 16 to
tensor([ 0.0038, -0.0354, -0.1337, ..., 0.0060, -0.0004, 0.0102])
torch.Size([2300])
Token: 17 the
tensor([ 0.0186, -0.0151, -0.0641, ..., 0.0188, 0.0391, 0.0069])
torch.Size([2300])
Token: 18 nucleus.
tensor([ 0.0003, -0.0461, 0.0043, ..., -0.0126, -0.0004, 0.0142])
torch.Size([2300])
'''
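Each token's stacked embedding has 2,300 dimensions: the concatenation of the forward and backward PubMed Flair embeddings. As a quick sanity check, Flair's embedding_length property reports each embedding's dimensionality (the 1,150-per-model figure below is my inference from the 2,300-dimension stacked size):

## Sanity check: the stacked embedding size is the sum of its parts:
print(embeddings_f.embedding_length)          ## 1150
print(embeddings_b.embedding_length)          ## 1150
print(stacked_embeddings.embedding_length)    ## 2300 (= 1150 + 1150)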
## The embeddings above are PyTorch tensors (Flair depends on Torch/PyTorch).
## https://stackoverflow.com/questions/53903373/convert-pytorch-tensor-to-python-list
## https://pytorch.org/docs/stable/tensors.html#torch.Tensor.tolist
## https://stackoverflow.com/questions/29895602/how-to-save-output-from-python-like-tsv
## https://stackoverflow.com/a/29896136/1904943
In an earlier iteration of this effort I saved the Flair tokens as metadata, and the embeddings (tensors) as tab-separated values. While those files are not needed here, I leave this code for future reference; TSV vector and metadata files of this form can, for example, be loaded directly into the standalone Embedding Projector at https://projector.tensorflow.org/.
import csv
metadata_f = 'metadata.tsv'
tensors_f = 'tensors.tsv'
with open(metadata_f, 'w', encoding='utf8', newline='') as tsv_file:
    tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
    for token in tokens:
        ## Assign to a dummy variable ( _ ) to suppress writerow()'s returned
        ## character counts. Note: writerow([token]), not writerow(token) --
        ## passing the bare string would write each character as a separate field:
        _ = tsv_writer.writerow([token])
'''
[victoria@victoria tensorflow]$ cat metadata.tsv
The
RAS-MAPK
signalling
cascade
serves
as
a
central
node
in
transducing
signals
from
membrane
receptors
to
the
nucleus.
'''
import torch    ## (not strictly needed here: tolist() is a torch.Tensor method)
with open(tensors_f, 'w', encoding='utf8', newline='') as tsv_file:
    tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
    for token in sentence:
        embedding = token.embedding
        ## https://stackoverflow.com/questions/12770213/writerow-csv-returns-a-number-instead-of-writing-rows
        ## Assign to a dummy variable ( _ ) to suppress character counts;
        ## tolist() converts the PyTorch tensor to a Python list:
        _ = tsv_writer.writerow(embedding.tolist())
## CAUTION: even for the single, short sentence used in this example, the
## following `cat` statement generates an ENORMOUS list!
'''
[victoria@victoria tensorflow]$ cat tensors.tsv
0.007691788021475077 -0.02268664352595806 -0.0004340760060586035 ...
'''
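For future reference, those TSV files are straightforward to read back into Python; a minimal sketch, assuming NumPy is available:

import numpy as np

## Reload the saved vectors and token labels (file names as written above):
vectors = np.loadtxt('tensors.tsv', delimiter='\t')    ## shape: (18, 2300)
with open('metadata.tsv', encoding='utf8') as f:
    labels = [line.strip() for line in f]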
## https://stackoverflow.com/questions/40849116/how-to-use-tensorboard-embedding-projector/41177133
## https://stackoverflow.com/a/41177133/1904943

[victoria@victoria tensorflow]$ p37
[Python 3.7 venv (source ~/venv/py3.7/bin/activate)]

(py3.7) [victoria@victoria tensorflow]$ python
Python 3.7.4 (default, Nov 20 2019, 11:36:53)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

## TEST:

>>> import numpy as np
>>> from torch.utils.tensorboard import SummaryWriter

>>> vectors = np.array([[0,0,1], [0,1,0], [1,0,0], [1,1,1]])
>>> metadata = ['001', '010', '100', '111']    ## labels

>>> print(metadata)
['001', '010', '100', '111']

>>> print(vectors)
[[0 0 1]
 [0 1 0]
 [1 0 0]
 [1 1 1]]

>>> writer = SummaryWriter()
>>> writer.add_embedding(vectors, metadata)
>>> writer.close()
>>>

## That (Nov 28, 2019: ~11:08 am) generated a new run, "Nov28_11-08-09_victoria",
## visible in the TensorFlow TensorBoard.  When I clicked that link, those data
## opened in the Projector!

## ----------------------------------------------------------------------------

>>> tokens = [str(token).split()[2] for token in sentence]
>>> print(tokens)
'''
['The', 'RAS-MAPK', 'signalling', 'cascade', 'serves', 'as', 'a', 'central',
 'node', 'in', 'transducing', 'signals', 'from', 'membrane', 'receptors', 'to',
 'the', 'nucleus.']
'''

>>> tokens_array = np.array(tokens)
>>> print(tokens_array)
'''
['The' 'RAS-MAPK' 'signalling' 'cascade' 'serves' 'as' 'a' 'central' 'node'
 'in' 'transducing' 'signals' 'from' 'membrane' 'receptors' 'to' 'the'
 'nucleus.']
'''

>>> for token in tokens_array: print(token)
...
'''
The
RAS-MAPK
signalling
cascade
serves
as
a
central
node
in
transducing
signals
from
membrane
receptors
to
the
nucleus.
'''

>>> embeddings = [token.embedding for token in sentence]
>>> print(embeddings)
'''
[tensor([ 0.0077, -0.0227, -0.0004,  ...,  0.1377, -0.0003,  0.0028]),
 tensor([-0.0007, -0.1601, -0.0274,  ...,  0.1982,  0.0013,  0.0042]),
 tensor([ 4.2534e-03, -3.1018e-01, -3.9660e-01,  ...,  5.9336e-02, -9.4445e-05,  1.0025e-02]),
 tensor([ 0.0026, -0.0087, -0.1398,  ..., -0.0037,  0.0012,  0.0274]),
 tensor([-0.0005, -0.0164, -0.0233,  ..., -0.0013,  0.0039,  0.0004]),
 tensor([ 3.8261e-03, -7.6409e-02, -1.8632e-02,  ..., -2.8906e-03, -4.4556e-04,  5.6909e-05]),
 tensor([ 0.0035, -0.0207,  0.1700,  ..., -0.0193,  0.0017,  0.0006]),
 tensor([ 0.0159, -0.4097, -0.0489,  ...,  0.0743,  0.0005,  0.0012]),
 tensor([ 9.7725e-03, -3.3817e-01, -2.2848e-02,  ..., -6.6284e-02,  2.3646e-04,  1.0505e-02]),
 tensor([ 0.0219, -0.0677, -0.0154,  ...,  0.0102,  0.0066,  0.0016]),
 tensor([ 0.0092, -0.0431, -0.0450,  ...,  0.0060,  0.0002,  0.0005]),
 tensor([ 0.0047, -0.2732, -0.0408,  ...,  0.0136,  0.0005,  0.0072]),
 tensor([ 0.0072, -0.0173, -0.0149,  ..., -0.0013, -0.0004,  0.0056]),
 tensor([ 0.0086, -0.1151, -0.0629,  ...,  0.0043,  0.0050,  0.0016]),
 tensor([ 7.6452e-03, -2.3825e-01, -1.5683e-02,  ..., -5.4974e-04, -1.4646e-04,  6.6120e-03]),
 tensor([ 0.0038, -0.0354, -0.1337,  ...,  0.0060, -0.0004,  0.0102]),
 tensor([ 0.0186, -0.0151, -0.0641,  ...,  0.0188,  0.0391,  0.0069]),
 tensor([ 0.0003, -0.0461,  0.0043,  ..., -0.0126, -0.0004,  0.0142])]
'''

>>> import torch    ## (not strictly needed: tolist() is a torch.Tensor method)
>>> embeddings = [token.embedding.tolist() for token in sentence]

## *** CAUTION -- EVEN FOR THIS ONE SENTENCE THIS IS AN ENORMOUS LIST!! ***
>>> print(embeddings)
'''
[[0.007691788021475077, -0.02268664352595806, ..., -0.0004157265357207507, 0.014170931652188301]]
'''

>>> embeddings_array = np.array(embeddings)
>>> print(embeddings_array)
'''
[[ 7.69178802e-03 -2.26866435e-02 -4.34076006e-04 ...  1.37687057e-01
  -3.07319278e-04  2.84141395e-03]
 [-7.38183910e-04 -1.60104632e-01 -2.73584425e-02 ...  1.98223457e-01
   1.31987268e-03  4.19976842e-03]
 [ 4.25336510e-03 -3.10180396e-01 -3.96601588e-01 ...  5.93362860e-02
  -9.44453641e-05  1.00254947e-02]
 ...
 [ 3.82626243e-03 -3.53914015e-02 -1.33689731e-01 ...  5.97812422e-03
  -3.52837233e-04  1.01681864e-02]
 [ 1.86223574e-02 -1.51006011e-02 -6.41461909e-02 ...  1.87926367e-02
   3.90900113e-02  6.87920302e-03]
 [ 2.52505066e-04 -4.60800231e-02  4.34845686e-03 ... -1.26084751e-02
  -4.15726536e-04  1.41709317e-02]]
'''
>>>
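As an aside, the Python-list intermediate can be skipped by stacking the per-token tensors directly; a minimal sketch, assuming the embedding tensors live on the CPU (call .cpu() first if a GPU is in use):

>>> import torch
>>> ## Stack the 18 per-token tensors into one (18, 2300) tensor, then convert:
>>> embeddings_array = torch.stack([token.embedding for token in sentence]).numpy()
>>> embeddings_array.shape
(18, 2300)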
OK, we now have everything needed to visualize those tensors (reformatted as NumPy arrays) in TensorFlow’s Embedding Projector! :-D
>>> from torch.utils.tensorboard import SummaryWriter
>>> writer = SummaryWriter()
## Load those data:
>>> writer.add_embedding(embeddings_array, tokens_array)
>>> writer.close()
>>>
## Wait a few seconds for tensorboard, http://localhost:6006/#projector
## to refresh in Firefox (manually reload the browser, if needed).
## My new "run" appears!  "Nov28_11-54-28_victoria":
## /mnt/Vancouver/apps/tensorflow/runs/Nov28_11-54-28_victoria
## Yea: works!! :-D
## SimpleScreenRecorder video screen capture below. :-)
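If desired, the run directory and embedding tag can be set explicitly, which makes runs easier to identify in TensorBoard / the Projector; a sketch (the log_dir and tag values below are hypothetical examples):

>>> from torch.utils.tensorboard import SummaryWriter
>>> ## Named run directory and tag (both names are hypothetical):
>>> writer = SummaryWriter(log_dir='runs/flair_pubmed_demo')
>>> writer.add_embedding(embeddings_array, metadata=list(tokens_array), tag='flair-pubmed')
>>> writer.close()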