Coauthorship Networks

I’ve recently been exploring different ways of constructing a coauthorship network. In the past I’ve imported data gathered from PsycInfo in .csv file format into the information visualization software Gephi. This time I’ve decided to try using the Sci2 Tool and the Web of Science databases to get my data into Gephi for visualization and data exploration.

I used Scott Weingart’s guide for building co-citation networks using the Sci2 tool. It provides more detail and background on the process than I will post here, so check it out (it also has screenshots helpful for navigating the Web of Science and Sci2 Tool). Here I merely want to go over the steps I followed for a particular project, so description is thin.

I am interested in collaborative communities working in the discipline of health psychology. I want to use coauthorship networks to explore the most influential fields of collaborative research in the subdiscipline. I am trying to learn how to craft coauthorship networks and to begin to examine ways I might be able to use them in my research.

Necessary resources:

  • The Sci2 Tool (available here)
  • Access to Web of Science
  • Gephi (available here)

Part One: Downloading Data

1) Go to the Web of Science and search for a publication of interest using the ‘Publication Name’ in the dropdown menu. I’ve decided to use the journal Health Psychology.
2) Refine results: under Document Type, click ‘Article’, then hit ‘Refine’
3) Download all the records. Web of Science limits you to 500 records per download, so you have to download a separate file for each 500-record chunk (i.e., 1-500, 501-1000, 1001-1500, etc.). Using the ‘Send to:’ dropdown menu, click on ‘Other File Formats’. At the pop-up box, select the records option and enter 1 to 500, change the record content to ‘Full Record and Cited References’, and change the file format to ‘Plain Text’. Save it somewhere you won’t forget. Do this for each 500-record chunk.
4) Your files will probably be named: savedrecs.txt, savedrecs(1).txt, savedrecs(2).txt, and so on. The first two lines and the last line of every file are special header and footer lines. The header lines start with FN and then VR, and the footer line starts with EF. Since we want to merge all the files into one, delete the footer of the first file, the header and footer of every file in the middle, and the header of the last file, then paste the contents together in order. The merged file should have one header at the top, one footer at the bottom, and none in between.
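
If you would rather not edit the files by hand, the merge in step 4 can be scripted. Below is a minimal Python sketch of that idea, not something from Weingart's guide. It assumes the header lines (FN, VR) only appear before the first record of each file and that the footer is a line containing just EF. The function names and the output filename are my own placeholders.

```python
from pathlib import Path


def merge_wos_exports(texts):
    """Merge the text of several Web of Science plain-text exports.

    Keeps the FN/VR header lines of the first export only and writes
    a single EF footer at the very end.
    """
    merged = []
    for index, text in enumerate(texts):
        in_header = True  # header lines come before the first record
        for line in text.lstrip("\ufeff").splitlines():
            tag = line[:2]
            if in_header and tag in ("FN", "VR"):
                if index == 0:
                    merged.append(line)  # keep the first file's header
                continue
            if line.strip() == "EF":
                continue  # drop every footer; one is added back below
            if tag.strip():
                in_header = False  # first real field tag: records started
            merged.append(line)
    merged.append("EF")
    return "\n".join(merged) + "\n"


def merge_files(paths, out_path):
    """Read the exports in the given order and write one merged file."""
    texts = [Path(p).read_text(encoding="utf-8-sig") for p in paths]
    Path(out_path).write_text(merge_wos_exports(texts), encoding="utf-8")
```

Usage would look something like merge_files(["savedrecs.txt", "savedrecs(1).txt"], "merged.txt"), listing the files in download order.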

Part Two: Creating a Coauthorship Network

1) Open the Sci2 Tool and go to ‘File > Load’. Open your merged records file, and select the ‘ISI flat format’ file.
2) Making sure the ‘Unique ISI Records’ file is selected, go to ‘Data Preparation > Extract Coauthorship Network’.
3) To make things more manageable I’ve decided to look at only the 100 most cited authors from a total of 1996 authors. To do this go to ‘Preprocessing > Networks > Extract Top Nodes’ and select ‘times_cited’.
4) Then I exported my new coauthorship network (called ‘Top 100 nodes by times_cited’) into Gephi by going to ‘Visualizations > Networks > Gephi’.
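
To make it clearer what steps 2 and 3 are doing to the data, here is a rough Python sketch of the same idea. It is not Sci2's actual code. It assumes the merged file uses the usual two-letter field tags (AU for authors, TC for times cited, ER to end a record), and it assumes an author's citation score is the sum of TC over their papers, which may not be exactly how Sci2 computes ‘times_cited’. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def parse_records(text):
    """Yield one dict per record: field tag -> list of values."""
    record, tag = {}, None
    for line in text.lstrip("\ufeff").splitlines():
        head = line[:2]
        if head == "ER":  # end of record
            if record:
                yield record
            record, tag = {}, None
        elif head in ("FN", "VR", "EF"):  # file header/footer lines
            tag = None
        elif head.strip():  # a new field tag starts here
            tag = head
            record.setdefault(tag, []).append(line[3:].strip())
        elif tag and line.strip():  # indented continuation line
            record[tag].append(line.strip())


def coauthor_network(records):
    """Count works and citations per author, and papers per author pair."""
    works, cited, edges = Counter(), Counter(), Counter()
    for rec in records:
        authors = sorted(set(rec.get("AU", [])))
        times_cited = int(rec.get("TC", ["0"])[0] or 0)
        for author in authors:
            works[author] += 1
            cited[author] += times_cited  # assumption: sum of TC
        for pair in combinations(authors, 2):
            edges[pair] += 1  # edge weight = number of joint papers
    return works, cited, edges


def top_nodes(cited, edges, n):
    """Keep the n most cited authors and the edges among them."""
    keep = {author for author, _ in cited.most_common(n)}
    kept = {p: w for p, w in edges.items() if p[0] in keep and p[1] in keep}
    return keep, kept
```

Calling top_nodes(cited, edges, 100) would give roughly the kind of reduced network I exported to Gephi.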

Part Three: Visualizing the Network

1) I begin by filtering out the weakly connected nodes. I use two filters, ‘Giant Component’ and ‘Degree Range’, both in the ‘Topology’ folder of the Library (top right corner after clicking on the ‘Filters’ tab). In the parameters of the degree range I set the minimum degree to 2. Filtering the nodes like this removes authors who have fewer than two coauthorship ties to other members of this sample (i.e., the 100 most cited). This is highly selective, but I am interested in only the most cited AND the most collaborative authors from this journal.
2) Next I click on the ‘Statistics’ tab and run the statistics. Of particular interest is the ‘Modularity’ statistic, which is a community detection algorithm.
3) Moving to the tabs on the left side, I start with ‘Ranking’. I want to set the relative size of the nodes based on number of published works, so I click on the red-diamond and select ‘number_of_authored_works’ from the dropdown menu and ‘Apply’ to re-size them. I’m also interested in having the colours reflect the community membership of the nodes, so I click on the colour-wheel and select ‘Modularity Class’ from the dropdown menu and hit ‘Apply’ again.
4) To set the colours I click on the ‘Partition’ tab and hit the refresh button (the one with the green arrows) so that ‘Modularity Class’ appears among the partition parameters and I can select it.
5) Next I want to position the nodes in the network using the Force Atlas layout: click on the ‘Layout’ tab and select ‘Force Atlas’ from the dropdown menu and adjust the parameters according to slide 12 from the Gephi Tutorial on Layouts.
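
For anyone curious what the two filters in step 1 amount to, here is a small Python sketch of the logic as I understand it. It is not Gephi's implementation. It assumes the network is a plain list of undirected author pairs, that degree means the number of distinct coauthors, and that the degree filter is applied once within the giant component. The function names are my own.

```python
from collections import defaultdict, deque


def adjacency(edges):
    """Build an undirected neighbour map from (author, author) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def giant_component(adj):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from this start node
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def filter_network(edges, min_degree=2):
    """Keep giant-component nodes with at least min_degree neighbours."""
    adj = adjacency(edges)
    giant = giant_component(adj)
    keep = {n for n in giant if len(adj[n] & giant) >= min_degree}
    kept_edges = [(a, b) for a, b in edges if a in keep and b in keep]
    return keep, kept_edges
```

Note that this applies the degree cut only once, so a node that survives may end up with fewer than two neighbours after the others are removed.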


Part Four: Exploring Communities

1) There are eight communities in my network. Since I am interested in the research topics of these communities, I decided to identify each community’s members and examine which topics are subject to the most collaboration within them. I start by extracting the names of the authors: I go back to ‘Filters’ and add another subfilter, ‘Attributes > Partitions > Modularity Class’. Clicking through the various communities, I can switch between ‘Overview’ and ‘Data Laboratory’ to collect data about the members of each community.
2) With the names of the community members in hand I head back to the Web of Science database and do an Advanced Search with the following search formula: SO="JOURNAL" AND ((AU=AUTHOR) OR (AU=AUTHOR) OR (AU=AUTHOR)), switching out JOURNAL for ‘Health Psychology’ and AUTHOR for the various authors in the community.
3) I then sort the search results by most cited to least and begin manually examining articles that contain two or more authors from the community selected. I chose to collect 5 articles from each community using this search strategy (more would have been better, but here I am simply testing out an approach).
4) Next I collect the keywords from each of the five articles from each of the eight groups and enter these words into a tag cloud (using software, like Wordle, for visualization).
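
Steps 2 to 4 involve a fair amount of typing and counting, so here is a small Python sketch of how parts of them could be scripted. It only builds the search string, checks the "two or more community authors" rule, and tallies keywords for a tag cloud; it does not talk to the Web of Science. The function names are hypothetical and the author names in the usage note are made up.

```python
from collections import Counter


def build_query(journal, authors):
    """Build the Advanced Search formula from step 2 for one community."""
    author_clause = " OR ".join(f"(AU={author})" for author in authors)
    return f'SO="{journal}" AND ({author_clause})'


def shares_authors(article_authors, community, minimum=2):
    """True if the article has at least `minimum` community members."""
    return len(set(article_authors) & set(community)) >= minimum


def keyword_counts(keyword_lists):
    """Tally keywords across articles, ignoring case and extra spaces."""
    counts = Counter()
    for keywords in keyword_lists:
        counts.update(k.strip().lower() for k in keywords if k.strip())
    return counts
```

For example, build_query("Health Psychology", ["Smith J", "Jones K"]) gives SO="Health Psychology" AND ((AU=Smith J) OR (AU=Jones K)), and the counts from keyword_counts can be pasted into whatever tag cloud tool you prefer.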



These are examples of two of the tag clouds I generated doing this with the journal Health Psychology (1984-2010).

Examining themes and patterns of word frequencies in this (limited) sample suggests that a number of more-or-less distinct, highly collaborative research communities exist among psychologists publishing in Health Psychology (the journal of Div. 38 of the APA). I have applied preliminary labels to these communities, but this is a rough work in progress…