Information Visualization MOOC (Indiana University), Part 1

For the past few weeks I have been enrolled in an Information Visualization MOOC through Indiana University. This course has focused on the use of the Sci2 Tool for preparing, analyzing and visualizing various kinds of data. The following four graphs are examples of the kinds of visualizations we’re learning. These are my first attempts using the Sci2 Tool, and as such they are quite basic. However, I think they illustrate some of the simplest functionality of the tool, and I look forward to continuing to learn through application and practice.

Sample Social Network

(click graph for a larger view)

The graph above was the first we created in the course. It is a social network using Sci2’s built-in GUESS visualization suite. This network depicts major Florentine families, their connections (marriage and business ties), their relative wealth, and the number of priorates (seats in government) each family held. The article upon which this data is based can be found here.

What we were emphasizing in this exercise was how to convey multiple kinds of information in a single graph: marriage and business ties (qualitative; nominal) were displayed with edge colour, wealth (quantitative; ratio) was shown by relative size of nodes, and the number of seats (quantitative; interval) was illustrated by brightness of hue (‘greenness’ of node).

Sample Burst Analysis

(click graph for a larger view)

This is a temporal graph using Kleinberg’s burst-detection algorithm, which identifies sudden increases in the frequency of words. Here I have graphed words appearing in the titles of the 20% most cited articles in the journal Health Psychology from 1984 to 2013 (the first terms to ‘burst’ do so in 1991).

Sample Choropleth Map

(click graph for a larger view)

This is a choropleth map, which shades portions of the map based on certain numeric values. In the graph above I am depicting the number of patents related to influenza filed by country. The more intense the colour, the more patents (from yellow to red). Similar to the heat map, this is an easy and intuitive way to visualize geospatial data.

Sample Proportional Map

(click graph for a larger view)

This graph shows the amount of money that has been awarded by NSF grants for topics related to ‘pain’ in different states between 1952 and 2010. The larger the circle, the more money has been awarded. Unlike the choropleth map, which can only depict one numerical attribute, the proportional symbol map could have visualized multiple attributes (e.g., the number of grants could have been depicted by the intensity of the colour of the circles).

Coauthorship Networks

I’ve recently been exploring different ways of constructing a coauthorship network. In the past I’ve imported data gathered from PsycInfo in .csv file format into the information visualization software Gephi. This time I’ve decided to try using the Sci2 Tool and the Web of Science databases to get my data into Gephi for visualization and data exploration.

I used Scott Weingart’s guide for building co-citation networks using the Sci2 Tool. It provides more detail and background on the process than I will post here, so check it out (it also has screenshots helpful for navigating the Web of Science and Sci2 Tool). Here I merely want to go over the steps I followed for a particular project, so the description is thin.

I am interested in collaborative communities working in the discipline of health psychology. I want to use coauthorship networks to explore the most influential fields of collaborative research in the subdiscipline. I am trying to learn how to craft coauthorship networks and to begin to examine ways I might be able to use them in my research.

Necessary resources:

  • The Sci2 Tool (available here)
  • Access to Web of Science
  • Gephi (available here)

Part One: Downloading Data

1) Go to the Web of Science and search for a publication of interest using the ‘Publication Name’ in the dropdown menu. I’ve decided to use the journal Health Psychology.
2) Refine results: under Document Type, click ‘Article’, then hit ‘Refine’
3) Download all the records. Web of Science limits you to 500 records per download, so you have to download a separate file for each 500-record chunk (i.e., 1-500, 501-1000, 1001-1500, etc.). Using the ‘Send to:’ dropdown menu, click on ‘Other File Formats’. At the pop-up box, check the box for records 1 to 500 and enter those numbers, change the record content to ‘Full Record and Cited References’, and change the file format to ‘Plain Text’. Save it somewhere you won’t forget. Do this for each 500-record chunk.
4) Your files will probably be named: savedrecs.txt, savedrecs(1).txt, savedrecs(2).txt, and so on. The first two lines and the last line of every file are special header and footer lines. Header lines start with FN then VR and the footer line starts with EF. Since we want to merge all the files we have to delete the footer of the first file, the header and footer of the second file (and the others), and the header of the last file, so that the file has one header and one footer, and none in between.
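
The merge in step 4 can also be scripted rather than done by hand. The following is a minimal sketch, assuming the header lines begin with `FN ` and `VR ` and the footer is a line containing only `EF`, as described above; the function name and file paths are my own placeholders, not part of Web of Science or Sci2.

```python
from pathlib import Path


def merge_wos_exports(paths, out_path):
    """Concatenate Web of Science plain-text exports into one file.

    Keeps the FN/VR header lines from the first file only and writes a
    single EF footer at the very end. Returns the number of lines written.
    """
    merged = []
    for i, path in enumerate(paths):
        # utf-8-sig drops a byte-order mark if the export starts with one.
        lines = Path(path).read_text(encoding="utf-8-sig").splitlines()
        for line in lines:
            is_header = line.startswith(("FN ", "VR "))
            is_footer = line.rstrip() == "EF"
            # Drop every footer, and every header after the first file.
            if is_footer or (is_header and i > 0):
                continue
            merged.append(line)
    merged.append("EF")  # one footer for the whole merged file
    Path(out_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return len(merged)
```

Calling `merge_wos_exports(["savedrecs.txt", "savedrecs(1).txt"], "merged.txt")` with the files listed in download order should give a single file with one header and one footer. It is worth opening the result to confirm the first two lines and the last line look right before loading it into Sci2.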

Part Two: Creating a Coauthorship Network

1) Open the Sci2 Tool and go to ‘File > Load’. Open your merged records file, and select the ‘ISI flat format’ file.
2) Making sure the ‘Unique ISI Records’ file is selected, go to ‘Data Preparation > Extract Coauthorship Network’.
3) To make things more manageable I’ve decided to look at only the 100 most cited authors from a total of 1996 authors. To do this go to ‘Preprocessing > Networks > Extract Top Nodes’ and select ‘times_cited’.
4) Then I exported my new coauthorship network (called ‘Top 100 nodes by times_cited’) into Gephi by going to ‘Visualizations > Networks > Gephi’.
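
To make clear what ‘Extract Coauthorship Network’ produces conceptually, here is a rough sketch of the same idea in Python. It is not Sci2’s implementation. It assumes each record lists its authors in an `AU` field with indented continuation lines and ends with an `ER` line; the function names are my own.

```python
from collections import Counter
from itertools import combinations


def parse_authors(wos_text):
    """Return one list of author names per record in a plain-text export."""
    records, authors, in_au = [], [], False
    for line in wos_text.splitlines():
        tag = line[:2]
        if tag == "AU":
            authors.append(line[3:].strip())
            in_au = True
        elif in_au and tag == "  " and line.strip():
            # Indented lines continue the AU field (one author per line).
            authors.append(line.strip())
        else:
            in_au = False
            if tag == "ER":  # end of the current record
                records.append(authors)
                authors = []
    return records


def coauthor_edges(records):
    """Count how many papers each pair of authors shares."""
    edges = Counter()
    for authors in records:
        # Sorting gives every pair a single, consistent key.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges
```

Each key in the result is an edge between two authors and each value is the number of papers they wrote together, which is the edge weight in a coauthorship network.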

Part Three: Visualizing the Network

1) I begin by filtering the weakly connected nodes. I use two filters, ‘Giant Component’ and ‘Degree Range’, both in the ‘Topology’ folder in Library (top right corner after clicking on the ‘Filters’ tab). Under the parameters of the degree range I set the minimum degree to 2. By filtering the nodes like this I’ve removed authors who have not coauthored with at least two other members of this sample (i.e., the 100 most cited). This is highly selective, but I am interested in only the most cited AND the most collaborative authors from this journal.
2) Next I click on the ‘Statistics’ tab and run statistics. Of particular interest is the ‘Modularity’ statistic, which is a community detection algorithm.
3) Moving to the tabs on the left side, I start with ‘Ranking’. I want to set the relative size of the nodes based on number of published works, so I click on the red-diamond and select ‘number_of_authored_works’ from the dropdown menu and ‘Apply’ to re-size them. I’m also interested in having the colours reflect the community membership of the nodes, so I click on the colour-wheel and select ‘Modularity Class’ from the dropdown menu and hit ‘Apply’ again.
4) To set the colours I click on the ‘Partition’ tab and hit the refresh button (with the green arrows) to update my choices of partition parameters so I can select ‘Modularity Class’.
5) Next I want to position the nodes in the network using the Force Atlas layout: click on the ‘Layout’ tab and select ‘Force Atlas’ from the dropdown menu and adjust the parameters according to slide 12 from the Gephi Tutorial on Layouts.
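
For readers who want to see what the two filters in step 1 are doing, here is a small sketch in plain Python. It is only an illustration of the idea, not Gephi’s code: it keeps the largest connected component and then, in a single pass, keeps the nodes with at least two neighbours inside it. The function names are my own.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Turn (a, b) pairs into a node -> set-of-neighbours mapping."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def giant_component(adj):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in component:
                    component.add(nb)
                    queue.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def degree_at_least(adj, nodes, minimum=2):
    """Keep nodes with at least `minimum` neighbours inside `nodes`."""
    return {n for n in nodes if len(adj[n] & nodes) >= minimum}
```

With edges A-B, B-C, A-C, C-D and E-F, the giant component is A, B, C and D, and the degree filter then drops D because it has only one neighbour.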


Part Four: Exploring Communities

1) There are eight communities in my network. Since I am interested in the research topics of these communities, I decided to identify each community’s members and examine which topics are subject to the most collaboration within them. I start by extracting the names of the authors: I go back to ‘Filters’ and add another subfilter, ‘Attributes > Partitions > Modularity Class’. Clicking through the various communities I can switch between ‘Overview’ and ‘Data Laboratory’ to collect data about the members of each community.
2) With the names of the community members in hand I head back to the Web of Science database and do an Advanced Search with the following search formula: SO="JOURNAL" AND ((AU=AUTHOR) OR (AU=AUTHOR) OR (AU=AUTHOR)), switching out JOURNAL for ‘Health Psychology’ and AUTHOR for the various authors in the community.
3) I then sort the search results by most cited to least and begin manually examining articles that contain two or more authors from the community selected. I chose to collect 5 articles from each community using this search strategy (more would have been better, but here I am simply testing out an approach).
4) Next I collect the keywords from each of the five articles from each of the eight groups and enter these words into a tag cloud (using software, like Wordle, for visualization).
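
Typing the search formula from step 2 by hand gets tedious with eight communities, so it can be generated. This is a minimal sketch that only builds the string in the pattern shown above; the function name is my own, and the way author names are written in the example (surname followed by initials) is an assumption about what the Advanced Search expects.

```python
def build_wos_query(journal, authors):
    """Build an Advanced Search string from a journal title and author names."""
    if not authors:
        raise ValueError("at least one author is required")
    # One (AU=...) term per author, joined with OR.
    author_terms = " OR ".join(f"(AU={name})" for name in authors)
    return f'SO="{journal}" AND ({author_terms})'


# Illustrative names only; substitute the members of one community.
print(build_wos_query("Health Psychology", ["Weinstein ND", "Rothman AJ"]))
```

The result can be pasted straight into the Advanced Search box, one query per community.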



These are examples of two of the tag clouds I generated doing this with the journal Health Psychology (1984-2010).
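
A tag cloud is just a picture of word counts, so the numbers behind it can be checked with a few lines of Python. This is a sketch under my own assumptions: keywords are compared case-insensitively and each keyword counts once per article. The sample keywords in the usage note are made up for illustration and are not from my data.

```python
from collections import Counter


def keyword_frequencies(keyword_lists):
    """Count in how many articles each keyword appears."""
    counts = Counter()
    for keywords in keyword_lists:
        # A set ensures a keyword is counted once per article.
        counts.update({k.strip().lower() for k in keywords if k.strip()})
    return counts
```

For example, `keyword_frequencies([["Smoking", "Stress"], ["stress", "Coping"]])` counts ‘stress’ twice and the other two words once each, and `.most_common()` on the result lists the words that would appear largest in the cloud.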

Examining themes and patterns of word frequencies in this (limited) sample suggested that a number of more-or-less distinct, highly collaborative research communities exist among psychologists publishing in Health Psychology (the journal of Div.38 of the APA). I have applied preliminary labels to these communities, but this is a rough work in progress…


Health Psychology (1984 – 2013) Preliminary Co-Authorship Analysis

A coauthorship network is a set of nodes and edges where a node represents an author and an edge indicates that two authors have worked together on a paper. This network allows the structure of collaboration to be visualized, which is intended to provide insight into how communities interact.

To reduce the network’s size I have extracted the 100 most cited authors during this period. I have deleted authors who have not co-authored with other members of the network (i.e., another top cited author).

Early Period: 1984 – 1993


Middle Period: 1994 – 2004


Late Period: 2005 – 2013


Health Psychology (1984 – 2013) Preliminary Citation Analysis

Health Psychology (est. 1982 by APA Div.38)

Published monthly, beginning in January

Total number of articles between January 1984 – October 2014: 2,124

Total referenced works in articles: 58,539

Total number of citations among all articles: 88,141

Early Period: 1984 – 1993

Division membership growth: 34% increase

Most cited articles from Health Psychology:

  1. Optimism, Coping, and Health – Assessment and Implications of Generalized Outcome Expectancies (Michael F. Scheier and Charles S. Carver, 1985) [2317]
  2. Psychosocial Models of the Role of Social Support in the Etiology of Physical Disease (Sheldon Cohen, 1988) [703]
  3. The Precaution Adoption Process (Neil D. Weinstein, 1988) [637]
  4. The Relative Efficacy of Avoidant and Non-Avoidant Coping Strategies – A Meta-Analysis (Jerry Suls and Barbara Fletcher, 1988) [601]
  5. Why Won’t it Happen to Me – Perceptions of Risk-Factors and Susceptibility (Neil D. Weinstein, 1984) [524]
  6. Standardized, Individualized, Interactive, and Personalized Self-Help Programs for Smoking Cessation (Prochaska, JO; Diclemente, CC; Velicer, WF; et al., 1993) [480]
  7. Hostility and Health – Current Status of a Psychosomatic Hypothesis (Smith, TW, 1992) [447]
  8. Testing 4 Competing Theories of Health-Protective Behavior (Neil D. Weinstein, 1993) [399]
  9. The Contemplation Ladder – Validation of a Measure of Readiness to Consider Smoking Cessation (Biener, L; Abrams, DB, 1991) [395]
  10. The Stages and Processes of Exercise Adoption and Maintenance in a Worksite Sample (Marcus, BH; Rossi, JS; Selby, VA; et al., 1992) [378]

Most cited authors from Health Psychology:

  1. Neil D. Weinstein (1431)
  2. JS Rossi (1194)
  3. DB Abrams (1022)
  4. S Cohen (911)
  5. BH Marcus (883)
  6. TW Smith (870)
  7. JO Prochaska (649)
  8. WF Velicer (649)
  9. W Rakowski (624)
  10. Andrew Baum (532)

Middle Period: 1994 – 2003

Division membership growth: 16% decrease

Most cited articles from Health Psychology:

  1. Stages of Change and Decisional Balance for 12 Problem Behaviors (Prochaska, JO; Velicer, WF; Rossi, JS; et al., 1994) [1078]
  2. Relationship of subjective and objective social status with psychological and physiological functioning: Preliminary data in healthy white women (Adler, NE; Epel, ES; Castellazzo, G; et al., 2000) [465]
  3. Effects of Psychosocial Interventions with Adult Cancer-Patients – A Meta-Analysis of Randomized Experiments (Meyer, TJ; Mark, MM, 1995) [446]
  4. Long-term maintenance of weight loss: Current status (Jeffery, RW; Drewnowski, A; Epstein, LH; et al., 2000) [415]
  5. Personal and environmental factors associated with physical inactivity among different racial-ethnic groups of US middle-aged and older-aged women (King, AC; Castro, C; Wilcox, S; et al., 2000) [408]
  6. 10-year Outcomes of Behavioral Family-based Treatment for Childhood Obesity (Epstein, LH; Valoski, A; Wing, RR; et al., 1994) [408]
  7. Patterns, correlates, and barriers to medication adherence among persons prescribed new treatments for HIV disease (Catz, SL; Kelly, JA; Bogart, LM; et al., 2000) [379]
  8. Cognitive-behavioral stress management intervention decreases the prevalence of depression and enhances benefit finding among women under treatment for early-stage breast cancer (Antoni, MH; Lehman, JM; Kilbourn, KM; et al., 2001) [376]
  9. Validation of susceptibility as a predictor of which adolescents take up smoking in the United States (Pierce, JP; Choi, WS; Gilpin, EA; et al., 1996) [373]
  10. Religious involvement and mortality: A meta-analytic review (McCullough, ME; Hoyt, WT; Larson, DB; et al., 2000) [347]

Most cited authors from Health Psychology:

  1. S Cohen (1487)
  2. AJ Rothman (1098)
  3. C Lerman (940)
  4. LH Epstein (935)
  5. NE Adler (840)
  6. JF Sallis (828)
  7. MH Antoni (825)
  8. Karen Matthews (762)
  9. ND Weinstein (745)
  10. VS Helgeson (736)

Late Period: 2004 – 2013

Division membership growth: 2% decrease

Most cited authors from Health Psychology:

  1. S Michie (698)
  2. C Abraham (670)
  3. FX Gibbons (649)
  4. M Gerrard (612)
  5. GB Chapman (402)
  6. KD McCaul (394)
  7. ND Weinstein (389)
  8. M Conner (319)
  9. AC King (298)
  10. KD Brownell

Most cited works from three periods:

Timeline created using Timeline JS

APA’s Health Psychology Division Membership 1984 – 2013

The American Psychological Association’s Division of Health Psychology over the past thirty years: [more coming soon!]

Health Psychology (Division 38; red), Social & Personality Psychology (Division 8; green), and Clinical Neuropsychology (Division 40; blue) of the American Psychological Association, membership compared between 1984 and April 2013.


Health Psychology’s Division 38 membership between 1984 – April 2013


Health Psychology Division 38 Co-Membership in 1991


popular subjects of early health psychology

tag-cloud - HP.1982-1992

Most popular article subjects, Health Psychology, 1982-1992 (469).

tag-cloud - BM.1975-1992

Most popular article subjects, Behavioral Medicine, 1975-1992 (353).

tag-cloud - HP+BM.1975-1992

Most popular article subjects from both Health Psychology and Behavioral Medicine, 1975-1992 (822).

authors - HP+BM - 1975-1992

Most published (primary) authors from both Health Psychology and Behavioral Medicine, 1975-1992 (822).


Breiger-type networks

Here are two examples of Breiger-type networks using the techniques described by Michael Pettit here.

These networks represent institutional links between the first ten presidents of Division 38 (Health Psychology) of the American Psychological Association.

For simplification, I’ve only included those institutions where more than one individual studied or taught. A shared institution (i.e., a line between two individuals) does not necessarily mean a concurrently shared institution.
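
The idea behind the two figures can be sketched in code: the same list of who was at which institution yields one network of people (linked by shared institutions) and one network of institutions (linked by shared people). This is my own rough illustration of that two-way projection, not a reproduction of Pettit’s procedure. The function names are hypothetical and the names in the usage note are placeholders rather than real data.

```python
from collections import defaultdict
from itertools import combinations


def invert(memberships):
    """Turn key -> set of values into value -> set of keys."""
    inverted = defaultdict(set)
    for key, values in memberships.items():
        for value in values:
            inverted[value].add(key)
    return dict(inverted)


def shared_counts(memberships):
    """Link every pair of keys by the number of values they share."""
    links = {}
    for a, b in combinations(sorted(memberships), 2):
        shared = memberships[a] & memberships[b]
        if shared:
            links[(a, b)] = len(shared)
    return links


def breiger_networks(person_to_insts):
    """Return (people network, institution network) as weighted edge dicts."""
    inst_to_people = invert(person_to_insts)
    # Keep only institutions where more than one individual studied or taught.
    inst_to_people = {i: p for i, p in inst_to_people.items() if len(p) > 1}
    kept = invert(inst_to_people)  # person -> retained institutions
    return shared_counts(kept), shared_counts(inst_to_people)
```

Given `{"President A": {"Univ X", "Univ Y"}, "President B": {"Univ X"}, "President C": {"Univ X", "Univ Y"}}`, the people network links A and C with weight 2 and each of them to B with weight 1, while the institution network links Univ X and Univ Y with weight 2.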

div38 first 10 pres

Figure 1: People

div38 first 10 inst

Figure 2: Groups