Source | Persagen.com |
Author | Dr. Victoria A. Stuart, Ph.D. |
Created | 2019-11-28 |
Last modified | |
Summary | Evaluation of biomedical contextual language models, visualization |
I have been evaluating contextual language models for biomedical natural language processing (BioNLP). Several platforms, including a number of general-use NLP packages, support these models.
Several of those general-use packages provide opportunities for visualizing natural-language tags and embeddings. For example, spaCy's visualizers allow the display of color-coded entities, as I describe here (I also describe those CRAFT corpus entity labels here).
Likewise, I was intrigued by Visualizing spaCy vectors in TensorBoard, an example on the spaCy examples page: it is apparently possible to view those embeddings (tensors) in the TensorFlow Embedding Projector [example]!
At the time (2019-11-27) I was looking at Flair embeddings, while awaiting the anticipated release of a pretrained BioFlair model, so I thought I would try viewing those embeddings in TensorFlow's Embedding Projector.
Note: there is currently (late Nov 2019) a bug in Torch / PyTorch (on which Flair depends) that prevents its installation under Python 3.8. Since my Arch Linux environment runs Python 3.8 and there is currently no Python 3.8 install wheel for Torch / PyTorch [screenshot; a Py3.8-compatible release is expected ~mid-Dec 2019], I cannot currently install Flair in that environment.
My solution was to create a Python 3.7 venv (which I describe in this StackOverflow answer), then do the various installations inside it.
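For reference, creating and activating such a venv looks like the following sketch (the path matches the VIRTUAL_ENV shown in the transcript below, and p37 is my ~/.bashrc alias for the activation command; a python3.7 binary must already be installed):

## Create a Python 3.7 virtual environment (assumes a python3.7 binary is installed):
python3.7 -m venv ~/venv/py3.7

## Activate it (aliased as p37 in my ~/.bashrc):
source ~/venv/py3.7/bin/activate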
Having installed Flair, Torch / PyTorch, TensorFlow, etc. in that Py3.7 venv, I proceeded to figure out how to load the Flair embeddings in TF Projector. The following code provides a step-by-step explanation.
[Click here to read the following code as a single (monochromatic, plain-text) file in the browser.]
## Install Python 3.7 in a Python 3.8 environment:
## https://stackoverflow.com/a/58964629/1904943

## Test (in terminal):

[victoria@victoria ~]$ date
Wed 20 Nov 2019 04:25:38 PM PST

[victoria@victoria ~]$ p37    ## ~/.bashrc alias
[Python 3.7 venv (source ~/venv/py3.7/bin/activate)]

(py3.7) [victoria@victoria ~]$ env | grep -i virtual
VIRTUAL_ENV=/home/victoria/venv/py3.7

(py3.7) [victoria@victoria ~]$ python --version
Python 3.7.4

(py3.7) [victoria@victoria ~]$ pip install --upgrade pip
...
Successfully installed pip-19.3.1

## https://github.com/lanpa/tensorboardX
## Also installs (if I recall) tensorflow, other dependencies:

(py3.7) [victoria@victoria ~]$ pip install tensorboardX    ## << note: capital X
...
## If needed: pip install moviepy

(py3.7) [victoria@victoria ~]$ pip install flair
...
Successfully installed Cython-0.29.14 SudachiPy-0.4.0 attrs-19.3.0 backcall-0.1.0
  boto-2.49.0 boto3-1.10.23 botocore-1.13.23 bpemb-0.3.0 certifi-2019.9.11
  cffi-1.13.2 chardet-3.0.4 click-7.0 cloudpickle-1.2.2 cycler-0.10.0
  dartsclone-0.6 decorator-4.4.1 deprecated-1.2.7 docutils-0.15.2 flair-0.4.4
  future-0.18.2 gensim-3.8.1 hyperopt-0.2.2 idna-2.8 importlib-metadata-0.23
  ipython-7.6.1 ipython-genutils-0.2.0 jedi-0.15.1 jmespath-0.9.4 joblib-0.14.0
  kiwisolver-1.1.0 kytea-0.1.4 langdetect-1.0.7 matplotlib-3.1.1
  more-itertools-7.2.0 mpld3-0.3 natto-py-0.9.0 networkx-2.2 numpy-1.17.4
  packaging-19.2 parso-0.5.1 pexpect-4.7.0 pickleshare-0.7.5 pluggy-0.13.0
  prompt-toolkit-2.0.10 ptyprocess-0.6.0 py-1.8.0 pycparser-2.19 pygments-2.4.2
  pymongo-3.9.0 pyparsing-2.4.5 pytest-5.3.0 python-dateutil-2.8.1
  regex-2019.11.1 requests-2.22.0 s3transfer-0.2.1 sacremoses-0.0.35
  scikit-learn-0.21.3 scipy-1.3.2 segtok-1.5.7 sentencepiece-0.1.83 six-1.13.0
  sklearn-0.0 smart-open-1.9.0 sortedcontainers-2.1.0 sqlitedict-1.6.0
  tabulate-0.8.6 tiny-tokenizer-3.0.1 torch-1.3.1 torchvision-0.4.2 tqdm-4.38.0
  traitlets-4.3.3 transformers-2.1.1 urllib3-1.24.3 wcwidth-0.1.7 wrapt-1.11.2
  zipp-0.6.0

(py3.7) [victoria@victoria ~]$ python
Python 3.7.4 (default, Nov 20 2019, 11:36:53)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import flair    ## works, yea!! :-D
>>>
[victoria@victoria tensorflow]$ cd /mnt/Vancouver/apps/tensorflow/
[victoria@victoria tensorflow]$ date; pwd; echo; ls -l
Thu 28 Nov 2019 10:50:19 AM PST
/mnt/Vancouver/apps/tensorflow
total 928
-rw------- 1 victoria victoria 19305 Nov 28 10:49 _readme-tensorflow-victoria.txt
drwxr-xr-x 11 victoria victoria 4096 Nov 26 16:45 runs
[victoria@victoria tensorflow]$ tensorboard --logdir runs/
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.0.0 at http://localhost:6006/ (Press CTRL+C to quit)
...
## Flair imports (Sentence lives in flair.data):
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.embeddings import FlairEmbeddings, StackedEmbeddings

## Example sentence (biomedical text):
sentence = Sentence('The RAS-MAPK signalling cascade serves as a central node in transducing signals from membrane receptors to the nucleus.')

## Named-entity tagging (not needed for the embeddings; included for interest):
tagger = SequenceTagger.load('ner')
tagger.predict(sentence)

## Load the PubMed-trained forward and backward Flair language models ...
embeddings_f = FlairEmbeddings('pubmed-forward')
embeddings_b = FlairEmbeddings('pubmed-backward')

## ... and stack them, so that each token receives a single, concatenated embedding:
stacked_embeddings = StackedEmbeddings([
    embeddings_f,
    embeddings_b,
])

## Embed the sentence (attaches a .embedding tensor to each token):
stacked_embeddings.embed(sentence)

## str(token) looks like "Token: 1 The"; field [2] is the token text:
tokens = [str(token).split()[2] for token in sentence]
print(tokens)
'''
['The', 'RAS-MAPK', 'signalling', 'cascade', 'serves', 'as', 'a', 'central', 'node', 'in', 'transducing', 'signals', 'from', 'membrane', 'receptors', 'to', 'the', 'nucleus.']
'''
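As an aside, Flair's Token objects expose the token string directly via the Token.text attribute, so the same list can be built without parsing str(token); a minimal equivalent:

## Equivalent, using Token.text rather than parsing str(token):
tokens = [token.text for token in sentence]
print(tokens)    ## same output as above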
for token in sentence:
    print(token)
    print(token.embedding)
    print(token.embedding.shape)
'''
Token: 1 The
tensor([ 0.0077, -0.0227, -0.0004, ..., 0.1377, -0.0003, 0.0028])
torch.Size([2300])
Token: 2 RAS-MAPK
tensor([-0.0007, -0.1601, -0.0274, ..., 0.1982, 0.0013, 0.0042])
torch.Size([2300])
Token: 3 signalling
tensor([ 4.2534e-03, -3.1018e-01, -3.9660e-01, ..., 5.9336e-02,
-9.4445e-05, 1.0025e-02])
torch.Size([2300])
Token: 4 cascade
tensor([ 0.0026, -0.0087, -0.1398, ..., -0.0037, 0.0012, 0.0274])
torch.Size([2300])
Token: 5 serves
tensor([-0.0005, -0.0164, -0.0233, ..., -0.0013, 0.0039, 0.0004])
torch.Size([2300])
Token: 6 as
tensor([ 3.8261e-03, -7.6409e-02, -1.8632e-02, ..., -2.8906e-03,
-4.4556e-04, 5.6909e-05])
torch.Size([2300])
Token: 7 a
tensor([ 0.0035, -0.0207, 0.1700, ..., -0.0193, 0.0017, 0.0006])
torch.Size([2300])
Token: 8 central
tensor([ 0.0159, -0.4097, -0.0489, ..., 0.0743, 0.0005, 0.0012])
torch.Size([2300])
Token: 9 node
tensor([ 9.7725e-03, -3.3817e-01, -2.2848e-02, ..., -6.6284e-02,
2.3646e-04, 1.0505e-02])
torch.Size([2300])
Token: 10 in
tensor([ 0.0219, -0.0677, -0.0154, ..., 0.0102, 0.0066, 0.0016])
torch.Size([2300])
Token: 11 transducing
tensor([ 0.0092, -0.0431, -0.0450, ..., 0.0060, 0.0002, 0.0005])
torch.Size([2300])
Token: 12 signals
tensor([ 0.0047, -0.2732, -0.0408, ..., 0.0136, 0.0005, 0.0072])
torch.Size([2300])
Token: 13 from
tensor([ 0.0072, -0.0173, -0.0149, ..., -0.0013, -0.0004, 0.0056])
torch.Size([2300])
Token: 14 membrane
tensor([ 0.0086, -0.1151, -0.0629, ..., 0.0043, 0.0050, 0.0016])
torch.Size([2300])
Token: 15 receptors
tensor([ 7.6452e-03, -2.3825e-01, -1.5683e-02, ..., -5.4974e-04,
-1.4646e-04, 6.6120e-03])
torch.Size([2300])
Token: 16 to
tensor([ 0.0038, -0.0354, -0.1337, ..., 0.0060, -0.0004, 0.0102])
torch.Size([2300])
Token: 17 the
tensor([ 0.0186, -0.0151, -0.0641, ..., 0.0188, 0.0391, 0.0069])
torch.Size([2300])
Token: 18 nucleus.
tensor([ 0.0003, -0.0461, 0.0043, ..., -0.0126, -0.0004, 0.0142])
torch.Size([2300])
'''
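Each token's stacked embedding has 2,300 dimensions: the concatenation of the forward and backward PubMed Flair embeddings. As a quick sanity check, Flair's embedding_length property reports each embedding's dimensionality (the 1,150-per-model figure below is my inference from the 2,300-dimension stacked size):

## Sanity check: the stacked embedding size is the sum of its parts:
print(embeddings_f.embedding_length)          ## 1150
print(embeddings_b.embedding_length)          ## 1150
print(stacked_embeddings.embedding_length)    ## 2300 (= 1150 + 1150)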
## The embeddings above are PyTorch tensors (Flair depends on Torch/PyTorch).
## https://stackoverflow.com/questions/53903373/convert-pytorch-tensor-to-python-list
## https://pytorch.org/docs/stable/tensors.html#torch.Tensor.tolist
## https://stackoverflow.com/questions/29895602/how-to-save-output-from-python-like-tsv
## https://stackoverflow.com/a/29896136/1904943
In an earlier iteration of this effort I saved the Flair tokens as metadata, and the embeddings (tensors) as tab-separated values. While those files are not needed here, I leave this code for future reference; TSV vector and metadata files of this form can, for example, be loaded directly into the standalone Embedding Projector at https://projector.tensorflow.org/.
import csv
metadata_f = 'metadata.tsv'
tensors_f = 'tensors.tsv'
with open(metadata_f, 'w', encoding='utf8', newline='') as tsv_file:
    tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
    for token in tokens:
        ## Assign to a dummy variable ( _ ) to suppress writerow()'s returned
        ## character counts. Note: writerow([token]), not writerow(token) --
        ## passing the bare string would write each character as a separate field:
        _ = tsv_writer.writerow([token])
'''
[victoria@victoria tensorflow]$ cat metadata.tsv
The
RAS-MAPK
signalling
cascade
serves
as
a
central
node
in
transducing
signals
from
membrane
receptors
to
the
nucleus.
'''
import torch    ## (not strictly needed here: tolist() is a torch.Tensor method)
with open(tensors_f, 'w', encoding='utf8', newline='') as tsv_file:
    tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
    for token in sentence:
        embedding = token.embedding
        ## https://stackoverflow.com/questions/12770213/writerow-csv-returns-a-number-instead-of-writing-rows
        ## Assign to a dummy variable ( _ ) to suppress character counts;
        ## tolist() converts the PyTorch tensor to a Python list:
        _ = tsv_writer.writerow(embedding.tolist())
## CAUTION: even for the single, short sentence used in this example, the
## following `cat` statement generates an ENORMOUS list!
'''
[victoria@victoria tensorflow]$ cat tensors.tsv
0.007691788021475077 -0.02268664352595806 -0.0004340760060586035 ...
'''
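For future reference, those TSV files are straightforward to read back into Python; a minimal sketch, assuming NumPy is available:

import numpy as np

## Reload the saved vectors and token labels (file names as written above):
vectors = np.loadtxt('tensors.tsv', delimiter='\t')    ## shape: (18, 2300)
with open('metadata.tsv', encoding='utf8') as f:
    labels = [line.strip() for line in f]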
## https://stackoverflow.com/questions/40849116/how-to-use-tensorboard-embedding-projector/41177133
## https://stackoverflow.com/a/41177133/1904943

[victoria@victoria tensorflow]$ p37
[Python 3.7 venv (source ~/venv/py3.7/bin/activate)]

(py3.7) [victoria@victoria tensorflow]$ python
Python 3.7.4 (default, Nov 20 2019, 11:36:53)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

## TEST:

>>> import numpy as np
>>> from torch.utils.tensorboard import SummaryWriter

>>> vectors = np.array([[0,0,1], [0,1,0], [1,0,0], [1,1,1]])
>>> metadata = ['001', '010', '100', '111']    ## labels

>>> print(metadata)
['001', '010', '100', '111']

>>> print(vectors)
[[0 0 1]
 [0 1 0]
 [1 0 0]
 [1 1 1]]

>>> writer = SummaryWriter()
>>> writer.add_embedding(vectors, metadata)
>>> writer.close()
>>>

## That (Nov 28, 2019: ~11:08 am) generated a new run, "Nov28_11-08-09_victoria",
## visible in the TensorFlow TensorBoard.  When I clicked that link, those data
## opened in the Projector!

## ----------------------------------------------------------------------------

>>> tokens = [str(token).split()[2] for token in sentence]
>>> print(tokens)
'''
['The', 'RAS-MAPK', 'signalling', 'cascade', 'serves', 'as', 'a', 'central',
 'node', 'in', 'transducing', 'signals', 'from', 'membrane', 'receptors', 'to',
 'the', 'nucleus.']
'''

>>> tokens_array = np.array(tokens)
>>> print(tokens_array)
'''
['The' 'RAS-MAPK' 'signalling' 'cascade' 'serves' 'as' 'a' 'central' 'node'
 'in' 'transducing' 'signals' 'from' 'membrane' 'receptors' 'to' 'the'
 'nucleus.']
'''

>>> for token in tokens_array: print(token)
...
'''
The
RAS-MAPK
signalling
cascade
serves
as
a
central
node
in
transducing
signals
from
membrane
receptors
to
the
nucleus.
'''

>>> embeddings = [token.embedding for token in sentence]
>>> print(embeddings)
'''
[tensor([ 0.0077, -0.0227, -0.0004,  ...,  0.1377, -0.0003,  0.0028]),
 tensor([-0.0007, -0.1601, -0.0274,  ...,  0.1982,  0.0013,  0.0042]),
 tensor([ 4.2534e-03, -3.1018e-01, -3.9660e-01,  ...,  5.9336e-02, -9.4445e-05,  1.0025e-02]),
 tensor([ 0.0026, -0.0087, -0.1398,  ..., -0.0037,  0.0012,  0.0274]),
 tensor([-0.0005, -0.0164, -0.0233,  ..., -0.0013,  0.0039,  0.0004]),
 tensor([ 3.8261e-03, -7.6409e-02, -1.8632e-02,  ..., -2.8906e-03, -4.4556e-04,  5.6909e-05]),
 tensor([ 0.0035, -0.0207,  0.1700,  ..., -0.0193,  0.0017,  0.0006]),
 tensor([ 0.0159, -0.4097, -0.0489,  ...,  0.0743,  0.0005,  0.0012]),
 tensor([ 9.7725e-03, -3.3817e-01, -2.2848e-02,  ..., -6.6284e-02,  2.3646e-04,  1.0505e-02]),
 tensor([ 0.0219, -0.0677, -0.0154,  ...,  0.0102,  0.0066,  0.0016]),
 tensor([ 0.0092, -0.0431, -0.0450,  ...,  0.0060,  0.0002,  0.0005]),
 tensor([ 0.0047, -0.2732, -0.0408,  ...,  0.0136,  0.0005,  0.0072]),
 tensor([ 0.0072, -0.0173, -0.0149,  ..., -0.0013, -0.0004,  0.0056]),
 tensor([ 0.0086, -0.1151, -0.0629,  ...,  0.0043,  0.0050,  0.0016]),
 tensor([ 7.6452e-03, -2.3825e-01, -1.5683e-02,  ..., -5.4974e-04, -1.4646e-04,  6.6120e-03]),
 tensor([ 0.0038, -0.0354, -0.1337,  ...,  0.0060, -0.0004,  0.0102]),
 tensor([ 0.0186, -0.0151, -0.0641,  ...,  0.0188,  0.0391,  0.0069]),
 tensor([ 0.0003, -0.0461,  0.0043,  ..., -0.0126, -0.0004,  0.0142])]
'''

>>> import torch    ## (not strictly needed: tolist() is a torch.Tensor method)
>>> embeddings = [token.embedding.tolist() for token in sentence]

## *** CAUTION -- EVEN FOR THIS ONE SENTENCE THIS IS AN ENORMOUS LIST!! ***
>>> print(embeddings)
'''
[[0.007691788021475077, -0.02268664352595806, ..., -0.0004157265357207507, 0.014170931652188301]]
'''

>>> embeddings_array = np.array(embeddings)
>>> print(embeddings_array)
'''
[[ 7.69178802e-03 -2.26866435e-02 -4.34076006e-04 ...  1.37687057e-01
  -3.07319278e-04  2.84141395e-03]
 [-7.38183910e-04 -1.60104632e-01 -2.73584425e-02 ...  1.98223457e-01
   1.31987268e-03  4.19976842e-03]
 [ 4.25336510e-03 -3.10180396e-01 -3.96601588e-01 ...  5.93362860e-02
  -9.44453641e-05  1.00254947e-02]
 ...
 [ 3.82626243e-03 -3.53914015e-02 -1.33689731e-01 ...  5.97812422e-03
  -3.52837233e-04  1.01681864e-02]
 [ 1.86223574e-02 -1.51006011e-02 -6.41461909e-02 ...  1.87926367e-02
   3.90900113e-02  6.87920302e-03]
 [ 2.52505066e-04 -4.60800231e-02  4.34845686e-03 ... -1.26084751e-02
  -4.15726536e-04  1.41709317e-02]]
'''
>>>
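As an aside, the Python-list intermediate can be skipped by stacking the per-token tensors directly; a minimal sketch, assuming the embedding tensors live on the CPU (call .cpu() first if a GPU is in use):

>>> import torch
>>> ## Stack the 18 per-token tensors into one (18, 2300) tensor, then convert:
>>> embeddings_array = torch.stack([token.embedding for token in sentence]).numpy()
>>> embeddings_array.shape
(18, 2300)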
OK, we now have everything needed to visualize those tensors (reformatted as NumPy arrays) in TensorFlow’s Embedding Projector! :-D
>>> from torch.utils.tensorboard import SummaryWriter
>>> writer = SummaryWriter()
## Load those data:
>>> writer.add_embedding(embeddings_array, tokens_array)
>>> writer.close()
>>>
## Wait a few seconds for tensorboard, http://localhost:6006/#projector
## to refresh in Firefox (manually reload the browser, if needed).
## My new "run" appears!  "Nov28_11-54-28_victoria":
## /mnt/Vancouver/apps/tensorflow/runs/Nov28_11-54-28_victoria
## Yea: works!! :-D
## SimpleScreenRecorder video screen capture below. :-)
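If desired, the run directory and embedding tag can be set explicitly, which makes runs easier to identify in TensorBoard / the Projector; a sketch (the log_dir and tag values below are hypothetical examples):

>>> from torch.utils.tensorboard import SummaryWriter
>>> ## Named run directory and tag (both names are hypothetical):
>>> writer = SummaryWriter(log_dir='runs/flair_pubmed_demo')
>>> writer.add_embedding(embeddings_array, metadata=list(tokens_array), tag='flair-pubmed')
>>> writer.close()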