Source | Persagen.com |
Author | Dr. Victoria A. Stuart, Ph.D. |
Created | 2019-07-30 |
Last modified | |
Summary | Visualization of data in Python using NetworkX |
Related | KEGG Pathways Representation II: Cytoscape |
Hello!
I constantly search for satisfactory representations and renderings of relational data; for example, representing metabolic pathways as relational property graphs.
A little over a year ago (2018-04) I posted a description of my efforts to render the KEGG glycolytic and TCA cycle (Krebs cycle) pathways in Neo4j - see my research blog post from that month, Creating A Metabolic Pathway In Neo4j, and my accompanying StackOverflow post.
However, while Neo4j offers an excellent, mature platform (including the Cypher graph query language), from my perspective there are limitations:
Hence, around that time I was becoming much less enthused about Neo4j, which I increasingly regarded as "bloatware."
With PostgreSQL serving as a well-supported and highly functional RDBMS, I sought my own graph network visualization and analysis solutions that allow facile programmatic access to relational datastores, amenable to downstream machine learning (ML) and natural language processing (NLP) applications. Additionally, I'd like to be able to access multidimensional data (e.g. tensor representations).
Applications include knowledge graph construction, in silico modeling, metabolic flux balance analysis, etc.
In parallel to my NetworkX experiments (summarized in the following subsection), I briefly looked at SageMath directed graphs, and spent perhaps a week looking at PyViz/Holoviews [website].
While Holoviews offers decent matplotlib graphs (also used by NetworkX, below), I was favorably impressed with the slick browser-based visualizations and interfaces provided by Bokeh, permitting mouseover displays of node and edge attributes, etc. - well summarized and illustrated in my 2018-11 blog post, "Interactive Data Visualization in Python With Bokeh."
However, a deal-breaker for me once again was Bokeh's inability to natively show labeled nodes and edges in Bokeh's HTML representations - where there appears to be a reliance on graph legends, with no permanent (non-mouseover) node/edge labels. The lack of permanent (displayed) labels on nodes/edges remains [2019-06] an acknowledged issue:
While it appears that you can use Bokeh Labels to label nodes and edges, this seems like an unwieldy workaround. Likewise, while this example (Alaska Airline Routes) may prove me wrong, that figure appears to be a matplotlib graph [hv.extension('matplotlib')]: as I recall, my issue was the lack of labeled nodes and edges in the HTML plots (related GitHub issue).
Lastly, while I found the Holoviews / Bokeh communities to be moderately active, with a reasonable level of available documentation, frustratingly the code in their examples is generally insufficient to replicate their results.
Having examined other options and discovering their limitations, I was pleased to find that NetworkX offered several attractive attributes.
I recently (Jul 2019) spent a couple of weeks thoroughly investigating the NetworkX platform for my research needs.
While I was pleased, overall, with my programming and modeling in NetworkX, there were again some limitations. Most significantly, NetworkX builds graphs from Python dictionaries, in effect {(src, tgt): rel}, where the keys are (source, target) node pairs and the values are the edges (relations). (Note that the keys of a Python dict must be unique!)
Thus, if your data contain "duplicate" entries (e.g. node-rel-node triples that share the same source and target nodes),
then while those underlying data remain unperturbed, when constructing the graphs NetworkX silently drops what it infers as "duplicate" relations - because of the constraint that dict keys [(src, tgt) pairs] must be unique.
Unaddressed, this results in graphs that do not faithfully and accurately represent the underlying data.
That issue was encountered in my first script (below).
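The behavior described above can be reproduced in a few lines. This is a minimal sketch and not the code from my scripts: the three triples are hypothetical sample data, and the attribute name "enzyme" is an assumption chosen for illustration. It also shows, for comparison, NetworkX's MultiDiGraph class (not used in this post), which keeps parallel edges between the same pair of nodes.

```python
import networkx as nx

# Hypothetical sample data: two different enzymes connect the same pair of compounds.
triples = [
    ("glucose", "hexokinase", "glucose-6-phosphate"),
    ("glucose", "glucokinase", "glucose-6-phosphate"),
    ("glucose-6-phosphate", "phosphoglucose isomerase", "fructose-6-phosphate"),
]

G = nx.DiGraph()
for src, rel, tgt in triples:
    # The (src, tgt) pair behaves like a dict key: adding the same pair again
    # updates the existing edge instead of creating a second one.
    G.add_edge(src, tgt, enzyme=rel)

print(len(triples))         # relations in the source data
print(G.number_of_edges())  # edges that survive in the DiGraph
print(G["glucose"]["glucose-6-phosphate"]["enzyme"])  # the last write wins

# For comparison only: a MultiDiGraph stores one edge per (src, tgt, key).
M = nx.MultiDiGraph()
for src, rel, tgt in triples:
    M.add_edge(src, tgt, key=rel)

print(M.number_of_edges())
```

No warning or error is raised when the second glucose edge replaces the first, which is why the loss is easy to miss.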
To remedy that issue, I needed to return to the approach described in my 2018-04 StackOverflow post; namely, the use of "tags" to uniquely identify every node and edge in a graph. This issue / approach is illustrated in these pp. from my programming notebook:
While the first script rather easily represents graphs as an edge adjacency framework (where the edges define the graph),
the second script required that the relations themselves be considered as nodes, so that I could unambiguously specify each node-relation-node relationship:
Although this provides a robust solution, I needed to restructure my graph input data, and I lost the facile embedding of edge attributes that I used in the first script.
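The idea of treating each relation as its own uniquely tagged node can be sketched as follows. This is an illustration under stated assumptions, not the code from my second script: the sample triples, the "#<index>" tag format, and the "kind" / "label" attribute names are all hypothetical.

```python
import networkx as nx

# Hypothetical sample data: the first two triples share the same (src, tgt) pair.
triples = [
    ("glucose", "hexokinase", "glucose-6-phosphate"),
    ("glucose", "glucokinase", "glucose-6-phosphate"),
    ("glucose-6-phosphate", "phosphoglucose isomerase", "fructose-6-phosphate"),
]

G = nx.DiGraph()
for i, (src, rel, tgt) in enumerate(triples):
    # Tag each relation so that it becomes a unique node (assumed tag format).
    rel_node = f"{rel}#{i}"
    G.add_node(src, kind="compound")
    G.add_node(tgt, kind="compound")
    G.add_node(rel_node, kind="enzyme", label=rel)
    # src -> relation node -> tgt: every triple now has its own path.
    G.add_edge(src, rel_node)
    G.add_edge(rel_node, tgt)

enzyme_nodes = [n for n, d in G.nodes(data=True) if d["kind"] == "enzyme"]
print(enzyme_nodes)          # one node per relation in the source data
print(G.number_of_edges())   # two plain, unlabeled edges per triple
```

All three relations are retained, at the cost that the edges themselves no longer carry the enzyme information.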
In the first script, where nodes represent KEGG compound (i.e., biochemical metabolites) and edges represent enzymes, it is easy to separately add node and edge attributes - e.g. from Pandas dataframes.
In the second script KEGG compounds and enzymes are both represented as nodes, hence we lose the ability to label edges and add edge-centric attributes (in a facile manner), and the source data preparation is (in my opinion) somewhat more convoluted.
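As a rough sketch of the edge-centric approach of the first script, node and edge attributes can be attached separately with nx.set_node_attributes and nx.set_edge_attributes. The plain dictionaries below stand in for rows that would normally come from Pandas dataframes; the attribute names ("kegg_id", "enzyme", "ec") are assumptions for illustration, and the identifiers should be checked against KEGG before reuse.

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("glucose", "glucose-6-phosphate")
G.add_edge("glucose-6-phosphate", "fructose-6-phosphate")

# Node attributes, keyed by node (illustrative values).
node_attrs = {
    "glucose": {"kegg_id": "C00031"},
    "glucose-6-phosphate": {"kegg_id": "C00092"},
    "fructose-6-phosphate": {"kegg_id": "C00085"},
}

# Edge attributes, keyed by (src, tgt) pair (illustrative values).
edge_attrs = {
    ("glucose", "glucose-6-phosphate"): {"enzyme": "hexokinase", "ec": "2.7.1.1"},
    ("glucose-6-phosphate", "fructose-6-phosphate"): {
        "enzyme": "glucose-6-phosphate isomerase",
        "ec": "5.3.1.9",
    },
}

nx.set_node_attributes(G, node_attrs)
nx.set_edge_attributes(G, edge_attrs)

for src, tgt, data in G.edges(data=True):
    print(src, "->", tgt, data)
```

Because edge attributes are keyed by (src, tgt), this convenience only holds when each pair of compounds is linked by a single enzyme.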
Noting those observations, here are my two scripts - to which the Reader is referred for details. The code is fully commented, additionally with embedded sample outputs.
Script 1: networkx_practice.py
Script 2: networkx_practice_2.py
Code is in Python 3 (I run these in a Python 3.7 venv).
If you want to run these, you'll need to edit paths in those scripts (search for "Vancouver"). If I forgot to include a datafile, simply email me (info@Persagen.com).
Code is in a linear format, as I was just testing and evaluating; for production, you can wrap code sections into functions (def name() ...) and/or methods. Refer here for ideas.
I program in Vim (Neovim) in a widescreen terminal with textwidth=220. If you view the code wrapped with shorter lines, it's going to look pretty messy - ymmv.
While the actual code is reasonably compact, the scripts are thoroughly commented - mostly so that if / when I return to them, I can easily understand and follow what I was thinking and doing.
Like Holoviews / Bokeh, NetworkX was both promising and frustrating. While sorting through the issues above, I began searching for additional solutions. For my research purposes, there are two additional solutions.
The R programming language (with which I am acquainted) offers superb utilities for working with genomic data, e.g. via the Bioconductor project - which in turn includes the KEGGlincs package for explicitly recreating KEGG pathway maps and overlaying NIH LINCS transcriptional data.
KEGGlincs can be used with Cytoscape to visualize the graph (Cytoscape must be running, since the interaction goes through the CyREST interface layer):
Pretty cool!
Continued in my follow-on post, KEGG Pathways Representation II: Cytoscape
Enjoy!