02-network-data-formats.Rmd

# Network Data in R {#NetworkData}

![](images/image_break.png){width=100%}

A network is simply a set of entities and the formally defined relationships among them. There are, however, many different ways that networks can be encoded and displayed. This section provides examples of many of the most common network formats and data types discussed in Chapter 3 of Brughmans and Peeples 2023. For most of the examples below we use the Cibola technological similarity network data set (described in Chapter 2.8.3 and [here](#Cibola)) because it is relatively small and easy to display in a variety of formats.

Throughout this document, we refer to the unique bounded entities connected in a formal network as **nodes** and the connections between them as **edges** but note that there are many other terms used in the literature and in the documentation for the R packages used here. Nodes are often referred to as **vertices** or **actors** and edges are often referred to as **ties** or **links**. Note that we use **network** to refer to the formal system of interdependent pairwise relationships (edges) among a set of entities (nodes) but the term **graph** is often used equivalently in mathematics and other fields.

## Network Data Formats{#NetworkDataFormats}

This section follows Chapter 3.2 in Brughmans and Peeples (2023) to provide examples of the same network and attribute data in a variety of different data formats as well as code for converting among these formats in R. 

The network data formats we discuss in this section include:

* **Edge list** - A network data format consisting of a list of connected node pairs. *E=((n1,n2),(n1,n3),(n1,n4),...,(ni,nj))*. It can also be represented as a matrix with two columns for source and target nodes respectively and with one edge per row.
* **Adjacency list** - A network data format consisting of a set of rows, where the first node in each row is connected to all subsequent nodes in that same row.
* **Adjacency matrix** - A network data format consisting of a matrix of size *n x n*, with a set of rows equal to the number of nodes, and a set of columns equal to the number of nodes. When a pair of nodes is connected by an edge (i.e., when they are adjacent), then the corresponding cell will have an entry.
* **Incidence matrix** - A network data format consisting of a matrix of size *n x e*, with a set of rows equal to the number of nodes, and a set of columns equal to the number of edges. An entry is made in a cell if the corresponding node and edge are connected. Each column in the incidence matrix has two entries.

Let's first get started by initializing all of the packages we will use in this section.

```{r, warning=F, message=F, results = F}
# initialize packages
library(igraph)
library(statnet)
library(intergraph)
library(vegan)
library(multinet)
```

```{block, type="rmdnote"}
The primary packages used in this Section (`igraph`, `statnet`, and `intergraph`) are already described [in the last section](#PrimaryPackages). We also use here the `vegan` package which includes many functions focused on community ecology. In this document, we rely on this package to calculate several distance/similarity metrics that are useful for generating [similarity networks](#SimilarityNetworks). Finally, we provide a brief example at the end of this section using `multinet` which is a package focused on conducting [multilayer network analyses](#Multinet).
```

### Defining Network Objects in R{#NetworkDataFunctions}

In general most of the examples of network data formats in the remainder of this section are converted into R network objects in two basic steps. 

* First, we read in some external data file that contains network data, usually generated in some sort of spreadsheet program, and create an R data frame or matrix object.
* Next, we call a function that expects the given data format (edge list, adjacency matrix, incidence matrix, etc.) and converts it in to an R network object.

As you will see below, we mostly rely on `igraph` functions that take the following basic format:

`igraph::graph_from_**DataType**` 

where `**DataType**` is replaced with the appropriate format such as `edgelist`, `adjacency_matrix`, and so on. In most of the examples below we then plot the network just to confirm that everything works, but that is certainly optional.

### Edge List {#Edgelist}

The edge list is a very quick and easy way to capture network data. It simply lists the edges in the network one by one by node id: *E=((n1,n2),(n1,n3),(n1,n4),...,(ni,nj))*. For the purposes of data management it is usually easiest to create an edge list as a data frame or matrix where each row represents a pair of nodes with connections going from the node in one column to the node in the second column (additional columns can be used for edge weight or other edge attributes).

In this example, we import the Cibola data set in this format as a data frame and then convert it to an `igraph` network object for further analysis. You can download the [edge list file here](data/Cibola_edgelist.csv) to follow along on your own. Since the edges in this network are undirected this will be a simple binary network, and we will use the `directed = FALSE` argument in the `igraph::graph_from_edgelist` function call. This function simply takes a edge list in tabular format and converts it to a network object R recognizes that can further be used for analysis and visualization.

```{r, warning=F, message=F}
# Read in edge list file as data frame
cibola_edgelist <-
  read.csv(file = "data/Cibola_edgelist.csv", header = TRUE)

# Examine the first several rows
head(cibola_edgelist)

# Create graph object. The data frame is converted to a matrix as that
#is required by this specific function. Since this is an undirected
# network directed = FALSE.
cibola_net <-
  igraph::graph_from_edgelist(as.matrix(cibola_edgelist),
                              directed = FALSE)

# Display igraph network object and then plot a simple node-link diagram
cibola_net
# Set random seed to ensure graph layout stays the same each time.
set.seed(3523)
plot(cibola_net)
```

### Adjacency List{#AdjacencyList}

The adjacency list consists of a set of rows, where the first node in each row is connected to all subsequent nodes in that same row. It is therefore more concise than the edge list (in which each relationship has its own row), but unlike the edge list it does not result in rows of equal length (each row in an edge list typically has two values, representing the pair of nodes). Adjacency lists are relatively rare in practice but can sometimes be useful formats for directly gathering network data in small networks and are supported by many network analysis software packages. 

In the following chunk of code, we convert the network object we created above into an adjacency list using the `igraph::as_adj_edge_list()` function and examine a couple of the rows.

```{r, warning=F, message=F}
# Convert edge list to adjacency list using igraph function
adj_list <- igraph::as_adj_edge_list(cibola_net)

# examine adjacency list for the site Apache Creek
adj_list$`Apache Creek`

# It is also possible to call specific nodes by number. In this case,
# site 2 is Casa Malpais
adj_list[[2]]
```

The output for a particular node can be called by either referencing the name using using `$` followed by the site name or `[[k]]` double brackets where `k` is the row number of the node in question. The printed output is essentially a list of all of the edges incident on the node in question identified by the name of the sending and receiving node. 

### Adjacency Matrix{#AdjacencyMatrix}

The adjacency matrix is perhaps the most common and versatile network data format for data analysis in network science (in sociology it is sometimes referred to as the sociomatrix). It is a symmetric matrix of size *n x n*, with a set of rows and columns denoting the nodes in that network. The node names or identifiers are typically used to label both rows and columns. When a pair of nodes is connected by an edge (i.e. when they are adjacent), the corresponding cell will have an entry. The diagonal of this matrix represents "self loops" and can variously be defined as connected or unconnected depending on the application. 

We can obtain an adjacency matrix object in R by converting our network object created above or by reading in a file directly with rows and columns denoting site and with 0 or 1 denoting the presence or absence of a relation. We take the data frame object `adj_mat` which is a square matrix of 1s and 0s and convert it into a network object using the `igraph::graph_from_adjacency_matrix()` function. You can download the csv file to follow along on your own [here](data/Cibola_adj.csv).

```{r, warning=F, message=F}
# Convert to adjacency matrix then display first few rows/columns
adj_mat <- igraph::as_adjacency_matrix(cibola_net)
adj_mat[1:5, 1:5]

# Read in adjacency matrix and convert to network object for plotting
adj_mat2 <-
  read.csv(file = "data/Cibola_adj.csv",
           header = T,
           row.names = 1)

adj_mat2[1:4, 1:4]

cibola_net2 <-
  igraph::graph_from_adjacency_matrix(as.matrix(adj_mat2),
                                      mode = "undirected")
set.seed(4352)
plot(cibola_net2)
```

Note when you compare this network graph to the one produced based on the edge list there is an additional unconnected node (WS Ranch) that was not shown in the previous network. This is one of the advantages of an adjacency matrix is it provides a way of easily including unconnected nodes without having to manually add them or include self-loops.
 
### Incidence Matrix{#IncidenceMatrix}

An incidence matrix is most frequently used to define connections among different sets of nodes in a two-mode or bipartite network where the rows and columns represent two different classes of nodes and the presence/absence or value of an edge is indicated in the corresponding cell.

By way of example here we can read in the data that were used to generate the one-mode networks of ceramic technological similarity we have been examining so far. In the corresponding data frame, each row represents a site and each column represents a specific cluster of technological attributes in cooking pottery (see Peeples 2018, pg. 100-104 for more details) with the number in each cell representing the count of each technological cluster at each site.

After reading in this rectangular data frame, we can create a network object using the `igraph::graph_from_incidence_matrix()` function. We then plot it as a simple two-mode network with color representing node class. We discuss plotting options in greater detail in the visualization section of this document. You can download the csv file to follow along on your own [here](data/Cibola_clust.csv).

```{r, warning=F, message=F}
# Read in two-way table of sites and ceramic technological clusters
cibola_clust <-
  read.csv(file = "data/Cibola_clust.csv",
           header = TRUE,
           row.names = 1)
head(cibola_clust)

# Convert into a network object using the incidence matrix format. Note that
# multiple=TRUE as we want this defined as a bipartite network.
cibola_inc <-
  igraph::graph_from_incidence_matrix(cibola_clust,
                                      directed = FALSE,
                                      multiple = TRUE)
head(cibola_inc)

set.seed(4543)
# Plot as two-mode network
plot(cibola_inc, vertex.color = as.numeric(V(cibola_inc)$type) + 1)
```

### Node and Edge Information{#NodeAttributes}

Frequently we want to use other information about nodes and edges (node location, site type, edge weight, etc.) in our analyses and need to track these data in a separate attribute object or data column. One common way to do this is to simply create a data frame that contains the required attribute information and call specific data from this data frame when needed. As the following example shows, it is also possible to directly assign attributes to nodes or edges in an `igraph` network object and use those for subsequent analyses using the `V()` for nodes (V standing for vertices) and `E()` for edges calls within `igraph`.

In the following example we use [this file](data/Cibola_attr.csv) which includes basic attribute data by site (node) for all sites in the network we've been working with here. This file includes x and y coordinates for the sites, information on the presence/absence and shape of Great Kiva public architectural features at those sites, and the Region to which they have been assigned. First we read in the data.

```{r, warning=F, message=F}
# Read in attribute data and look at the first few rows.
cibola_attr <- read.csv(file = "data/Cibola_attr.csv", header = TRUE)
head(cibola_attr)

```

In order to assign an attribute to a particular node or edge we can use the `V` and `E` (vertex and edge) calls in `igraph`. For example, in the following example, we will assign a `region` variable to each node in the network we created above using the `V` function to assign a vertex attribute. You simply type the name of the network object in the parenthesis after `V` and use the `$` atomic variable symbol to assign a name to the attribute that will be associated with that network object.

```{r, warning=F, message=F}
# Assign a variable called "region" to the Cibola_net2 based on the
# column in the Cibola_attr table called "Region"
V(cibola_net2)$region <- cibola_attr$Region

# If we now call that attribute we get a vector listing each assigned value
V(cibola_net2)$region

# Note that "region" is now listed as an attribute when we view
# the network object
cibola_net2
```

This can further be used for plotting or other analyses by calling the variable as a factor ([see this resource](https://r4ds.had.co.nz/factors.html) or the [Getting Started with R](#DataTypes) to learn more about the factor data.

```{r, warning=F, message=F}
set.seed(43534)
plot(cibola_net2, vertex.color = as.factor(V(cibola_net2)$region))
```

## Types of Networks{#TypesOfNetworks}

This section roughly follows Brughmans and Peeples (2023) Chapter 3.3 to describe and provide examples in R format of many of the most common types of networks. In the examples below we will use the `igraph` R package but we also show how to use the `statnet` and `network` packages where applicable.

In this section, we cover:

* **Simple Networks** - A set of nodes and a set of edges with no additional information about them.
* **Directed Networks** - A network consisting of a set of nodes and edges connecting them for which the orientation or direction is specified. In other words when A is connected to B, B is not necessarily connected to A.
* **Signed, Categorized, and Weighted Networks** - This category refers to networks where edges (relationships) have additional nominal, ordinal, or metric information encoded in them. A signed network is a network where the edges carry a positive or negative sign indicating some opposed property of relations in the network. A categorized network is a network where edges are classified according to some nominal category that does not necessarily represent an opposition. A weighted network is one in which the edges carry a non-binary value which indicates the strength of a given relationship.
* **Two-Mode Networks** - A network where two separate categories of nodes are defined with edges defined only between these categories.
* **Similarity Networks** - Networks where edges are defined or weighted based on a quantitative metric of similarity or distance based on node attributes or artifact assemblages.
* **Ego Networks** - A network including a focal node, the set of nodes the ego is connected to by an edge and the edges between nodes in this set.
* **Multilayer Networks** - A network where a single set of nodes is connected by two or more sets of edges that each represent a different kind of relationship among the nodes.

### Simple Networks{#SimpleNetworks}

We call a network a simple network (or simple graph) if we only have a set of nodes and a set of edges connecting them, with no additional information about the edges or specific rules they need to follow. Simple networks are, in other words, unweighted and undirected one-mode networks. By way of example we will use the Cibola region [adjacency matrix file](data/Cibola_adj.csv) and convert it into a simple network using both `igraph` and `network`. Notice how in both examples we specify that this is an undirected network (`mode = "undirected"` and `directed = FALSE`). 

```{r, warning=F, message=F}
# Read in raw adjacency matrix file
adj_mat2 <-
  read.csv(file = "data/Cibola_adj.csv",
           header = T,
           row.names = 1)

# Convert to a network object using igraph
simple_net_i <-
  igraph::graph_from_adjacency_matrix(as.matrix(adj_mat2),
                                      mode = "undirected")
simple_net_i

# Covert to a network object using statnet/network
simple_net_s <-
  network::network(as.matrix(adj_mat2), directed = FALSE)
simple_net_s
```

Notice how the two formats differ in the way they internally store network data in R and the way they print output to the screen but that both show a total of 31 nodes (vertices) and 167 edges (for the igraph object the first row specifies node and edge numbers between the -- marks).

### Directed Networks{#DirectedNetworks}

Sometimes relationships are directional, meaning they have an orientation. For example, the flow of a river is directed downstream. In such cases we can incorporate this information in our network data by distinguishing between the source and the target of an edge. 

By way of example here we will modify the Cibola network edge list to remove some number of edges at random to simulate directed network data. We will then convert these data into various network and matrix formats to illustrate how directed networks are stored and used in R. We first use the `sample` function to define a sub-sample of our network nodes and then we create a network object from that random sub-sample. By randomly removing some edges from the edge list we are left with a directed network where not all edges will be reciprocated. 

```{r, message=F, warning=F}
# Read in edge list file as data frame
cibola_edgelist <-
  read.csv(file = "data/Cibola_edgelist.csv", header = TRUE)

# Create a random sub-sample of 125 edges out of the total 167 using
# the "sample" function
set.seed(45325)
el2 <- cibola_edgelist[sample(seq(1, nrow(cibola_edgelist)), 125,
                              replace = FALSE), ]

# Create graph object from the edge list using the directed=TRUE argument
# to ensure this is treated as a directed network object.
directed_net <-
  igraph::graph_from_edgelist(as.matrix(el2), directed = TRUE)
directed_net

# View as adjacency matrix of directed network object
(as_adjacency_matrix(directed_net))[1:5, 1:5]

# Plot network
set.seed(4353)
plot(directed_net)
```

Notice that when we look at the igraph network plot it has arrows indicating the direction of connection in the edge list. If you are making your own directed edge list, the sending node by default will be in the first column and the receiving node in the second column. In the adjacency matrix the upper and lower triangles are no longer identical. Again, if you are generating your own adjacency matrix, you can simply mark edges sent from nodes denoted as rows and edges received from the same nodes as columns. Finally, in the plot, since R recognizes this as a directed igraph object when we plot the network, it automatically shows arrows indicating the direction of the edge.

### Signed, Categorized, and Weighted Networks{#WeightedNetworks}

In many situations we want to add values to specific edges such as signs (sometimes called valences), nominal categories, or weights defining the strength or nature of relationships. There are a variety of ways that we can record and assign such weights or values to edges in R. The simplest way is to directly include that information in one of the formats described above such as an edge list or adjacency matrix. For example, we can add a third column to an edge list that denotes the weight, category, or sign of each edge or can fill the cells in an adjacency matrix with specific values rather than simply 1s or 0s.

In this example, we will randomly generate edge weights for the Cibola network edge list and adjacency matrix to illustrate how R handles these formats. We use the `sample` function again to create a random vector of values between 1 and 4 for every edge in the network and then add it to the edge list as a new variable called `$Weight`. We then convert this data frame into a network object.

```{r, warning=F, message=F}
# Read in edge list file as data frame
cibola_edgelist <-
  read.csv(file = "data/Cibola_edgelist.csv", header = TRUE)
# Add additional column of weights as random integers between 1 and 4
# for each edge
cibola_edgelist$weight <-
  sample(seq(1, 4), nrow(cibola_edgelist), replace = TRUE)

# Create weighted network object calling only the first two columns
weighted_net <-
  igraph::graph_from_edgelist(as.matrix(cibola_edgelist[, 1:2]),
                              directed = FALSE)
# add edge attribute to indicate weight
E(weighted_net)$weight <- cibola_edgelist$weight

# Explore the first few rows and columns of network object
head(get.data.frame(weighted_net))

# View network as adjacency matrix. Notice the attr="weight" command that
# indicates which edge attribute to use for values in the matrix
head(as_adjacency_matrix(weighted_net, attr = "weight"))[1:5, 1:5]

# Plot the network
set.seed(574)
plot(weighted_net, edge.width = E(weighted_net)$weight)
```

Notice in the final plot that line thickness is used to indicate edges with various weights. We will explore further options for such visualizations in the network [visualizations section](#Visualization) of this document.

### Two-mode Networks and Affiliation Networks{#TwoMode}

Two-mode networks are networks where two separate categories of nodes are defined with a structural variable (edges) between these categories. In sociology, two-mode networks are often used for studying the affiliation of individuals with organizations, such as the presence of professionals on the boards of companies or the attendance of scholars at conferences (referred to as affiliation networks).

Two-mode network data are typically recorded in a two-way table with rows and columns representing two different classes of nodes and with individual cells representing the presence/absence or weight of edges between those classes of nodes. By way of example here we will return to the table of ceramic technological clusters by sites for the Cibola region data. The simplest way to create an unweighted two-mode network from these data is to create a network object directly from a two-way table as we saw above. In this example this will create an edge between each site and each technological cluster present there irrespective of relative frequency using the `igraph::graph_from_incidence_matrix` function.

```{r, warning=F, message=F}
# Read in two-way table of sites and ceramic technological clusters
cibola_clust <- read.csv(file = "data/Cibola_clust.csv",
                         header = TRUE,
                         row.names = 1)
# Create network from incidence matrix based on presence/absence of
# a cluster at a site
cibola_inc <- igraph::graph_from_incidence_matrix(cibola_clust,
                                                  directed = FALSE,
                                                  multiple = TRUE)
cibola_inc
set.seed(4537643)
# Plot as two-mode network
plot(cibola_inc, vertex.color = as.numeric(V(cibola_inc)$type) + 1)
```

In this case since most clusters are present at most sites, this creates a pretty busy network that may not be particularly useful. An alternative to this is to define some threshold (either in terms of raw count or proportion) to define an edge between a node of one class and another. We provide an example here and build a function that you could modify to do this for your own data. In this function you can set the proportion threshold that you would like to used to define an edge between two classes of nodes. If the proportion of that cluster at that site is greater than or equal to that threshold an edge will be present. In the example below the threshold is set at 0.25 meaning that we will define edges between nodes that share a common type that makes of at least a quarter of the assemblage of both sites.

```{r, warning=F, message=F}
# Define function for creating incidence matrix with threshold
two_mode <- function(x, thresh = 0.25) {
  # Create matrix of proportions from x input into function
  temp <- prop.table(as.matrix(x), 1)
  # Define anything with greater than or equal to threshold as
  # present (1)
  temp[temp >= thresh] <- 1
  # Define all other cells as absent (0)
  temp[temp < 1] <- 0
  # Return the new binarized table as output of the function
  return(temp)
}

# Run the function and create network object
# thresh is set to 0.25 but could be any values from 0-1
mod_clust <- two_mode(cibola_clust, thresh = 0.25)
# Examine the first few rows
head(mod_clust)
# Create a graph matrix from the new incidence matrix
two_mode_net <- igraph::graph_from_incidence_matrix(
                          mod_clust,
                          directed = FALSE,
                          multiple = TRUE)

# Plot results
set.seed(4537)
plot(two_mode_net,
     vertex.color = as.numeric(V(cibola_inc)$type) + 1)
```

Notice how there are now far fewer edges and if you are familiar with the sites in question you might notice some clear regional patterning. 

It is also possible to create one-mode projections of the two-mode data here using simple matrix algebra. All you need to do is multiply a matrix by the transpose of that matrix. The results will be a adjacency matrix for whichever set of nodes represented the rows in the first matrix in the matrix multiplication. Here is an example using the `mod_clust` incidence matrix with threshold created above. In the resulting incidence matrix individual cells will represent the number of different edges in common between the nodes in question and can be treated like an edge weight. The diagonal of the matrix will be the total number of clusters that were present in a site assemblage. In R the operator `%*%` indicates matrix multiplication and the function `t()` will transpose a given matrix.

```{r, warning=F, message=F}
# In R the command "%*%" indicates matrix multiplication and "t()"
# gives the transpose of the matrix within the parentheses.
# Lets first create a one-mode projection focused on sites
site_mode <- mod_clust %*% t(mod_clust)
site_net <- igraph::graph_from_adjacency_matrix(site_mode,
                                                mode = "undirected",
                                                diag = FALSE)
plot(site_net)

# Now lets create a one-mode projection focused on ceramic
# technological clusters.
# Notice that the only change is we switch which side of the
# matrix multiplication we transpose.
clust_mode <- t(mod_clust) %*% mod_clust
head(clust_mode)

clust_net <- igraph::graph_from_adjacency_matrix(clust_mode,
                                                 mode = "undirected",
                                                 diag = FALSE)
plot(clust_net)
```

### Similarity Networks{#SimilarityNetworks}

Similarity networks simply refer to one-mode networks where nodes are defined as entities of interest with edges defined and/or weighted based on some metric of similarity (or distance) defined based on the features, attributes, or assemblage associated with that node. Such an approach is frequently used in archaeology to explore material cultural networks where nodes are contexts of interests (e.g., sites, excavation units, houses, etc.) and edges are defined or weighted based on similarities in the relative frequencies of artifacts or particular classes of artifacts recovered in those contexts.

There are many different ways to define and track similarity network data for use in R. In this example, we will show several methods using the affiliation data we used in the previous example. Specifically, we will define and weight edges based on similarities in the frequencies of ceramic technological clusters at sites in our Cibola region sample.

For most of these examples we will use the `statnet` package and the `network` package object format within it rather than `igraph` because `statnet` has a few additional functions that are useful when working with similarity data. In the following examples, we will first demonstrate several different similarity/distance metrics and then discuss approaches to binarization of these similarity networks and other options for working with weighted data.

#### Brainerd-Robinson Similarity {- #BrainerdRobinson}

The first metric we will explore here is a rescaled version of the Brainerd-Robinson (BR) similarity metric. This BR measure is commonly used in archaeology including in a number of recent (and not so recent) network studies. This measure represents the total similarity in proportional representation of categories and is defined as:

$$S = {\frac{2-\sum_{k} \left|x_{k} - y_{k}\right|} {2}}$$

where, for all categories $k$, $x$ is the proportion of $k$ in the first assemblage and $y$ is the proportion of $k$ in the second. We subtract the sum from 2 as 2 is the maximum proportional difference possible between two samples. We further divide the result by 2. This provides a scale of similarity from 0 to 1 where 1 is perfect similarity and 0 indicates no similarity. The chunk below defines the code for calculating this modified BR similarity measure. Note here we use a distance metric called "Manhattan Distance" built into the `vegan` package in R. This metric is identical to the Brainerd-Robinson metric. We rescale our results to range from 0 to 1 after calculation.

```{r}
# Read in raw data
cibola_clust <-
  read.csv(file = "data/Cibola_clust.csv",
           header = TRUE,
           row.names = 1)

# First we need to convert the ceramic technological clusters into proportions
clust_p <- prop.table(as.matrix(cibola_clust), margin = 1)

# The following line uses the vegdist function in the vegan package
# to calculate the Brainerd-Robinson similarity score. Since vegdist
# by default defines an unscaled distance we must subtract the results
# from 2 and then divide by 2 to get a similarity scaled from 0 to 1.
cibola_br <- ((2 - as.matrix(vegan::vegdist(clust_p,
                                           method = "manhattan"))) / 2)

# Lets look at the first few rows.
cibola_br[1:4, 1:4]
```

At this point we could simply define this as a weighted network object where weights are equal to the similarity scores, or we could define a threshold for defining edges as present or absent. We will discuss these options in detail after presenting other similarity/distance metrics.

#### Morisita's Overlap Index {#Morisita}

Another measure that has been used for defining similarities among assemblages for archaeological similarity networks is Morisita's overlap index. This measure is a measure of the overlap of individual assemblages within a larger population that takes the size of samples into account. Specifically, the approach assumes that as sample size increases diversity will likely increase. This measure produces results that are very similar to the Brainerd-Robinson metric in practice in most cases but this measure may be preferred where there are dramatic differences in assemblage sizes among observations. 

Morisita's index is calculated as:

$$C_D=\frac{2 \Sigma^Sx_iy_i}{(D_x + D_y)XY}$$

where

* $x_i$ is the number of rows where category $i$ is represented in the total $X$ from the population.
* $y_i$ is the number of rows where category $i$ is presented in the total $Y$ from the population.
* $D_x$ and $D_y$ are the Simpson's diversity index values for $x$ and $y$ respectively.
* $S$ is the total number of columns.

This metric ranges from 0 (where no categories overlap at all) to 1 where the categories occur in the same proportions in both samples. Because this metric works on absolute counts we can run the `vegdist` function directly on the `Cibola_clust` object. Because we want a similarity rather than a distance (which is the default for this function in R) we subtract the results from 1. 

```{r}
# Calculate matrix of Morisita similarities based on the
# Cibola_clust two-way table.
cibola_mor <- 1 - as.matrix(vegan::vegdist(cibola_clust,
                                           method = "morisita"))
cibola_mor[1:4, 1:4]
```


#### $\chi^{2}$ Distance {- #ChiSquare}

The next measure we will use is the $\chi^{2}$ distance metric which is the basis of correspondence analysis and related methods commonly used for frequency seriation in archaeology (note that this should probably really be called the $\chi$ distance since the typical form we use is not squared, but the name persists this way in the literature so that's what we use here). This measure is defined as:

$$\chi_{jk} = \sqrt{\sum \frac 1{c_{j}} 
({x_{j}-y_{j})^{2}}}$$

where $c_j$ denotes the $j_{th}$ element of the average row profile (the proportional abundance of $j$ across all rows) and $x$ and $y$ represent row profiles for the two sites under comparison. This metric therefore takes raw abundance (rather than simply proportional representation) into account when defining distance between sites. The definition of this metric is such that rare categories play a greater role in defining distances among sites than common categories (as in correspondence analysis). This measure has a minimum value of 0 and no theoretical upper limit. 

The code for calculating $\chi^{2}$ distances is defined in the chunk below and a new object called `Cibola_X` is created using this measure. It is sometimes preferable to rescale this measure so that it is bounded between 0 and 1. We create a second object called `Cibola_X01` which represents rescaled distances by simply dividing the matrix by the maximum observed value (there are many other ways to do this but this will be fine for our demonstration purposes). Again, we subtract these results from 1 to convert a distance to a similarity.

```{r, warning=F, message=F}
# Define function for calculating chi-squared distance
chi_dist <- function(x) {
  # calculates the profile for every row
  rowprof <- x / apply(x, 1, sum)
  # calculates the average profile
  avgprof <- apply(x, 2, sum) / sum(x)
  # creates a distance object of chi-squared distances
  chid <- dist(as.matrix(rowprof) %*% diag(1 / sqrt(avgprof)))
  # return the results
  return(as.matrix(chid))
}

# Run the script and then create the rescaled 0-1 version
cibola_x <- chi_dist(cibola_clust)
cibola_x01 <- 1 - (cibola_x / max(cibola_x))

cibola_x01[1:4, 1:4]
```

#### Jaccard Similarity{- #Jaccard}

In many situations we may have either only presence/absence data or data where a one or a few categories dominate the assemblages of most sites. In the former case, the measures of similarity above will not work. In the latter case, although the measures above may work, the results may largely be a product of similarities or differences in those few common categories. In such cases, it may be preferable to use a similarity metric based on presence/absence data. The Jaccard similarity coefficient for two sets of presence/absence values is simply the intersection of values in those two cases divided by the union of those two cases. Formally this can be written as:

$$J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

where

* $|A \cap B|$ is the intersection of vectors A and B or the number of categories where both vectors = present.
* $|A \cup B|$ is the union of vectors A and B or the total number of categories where *either* A or B is present.

We can define a simple function in R to calculate Jaccard similarity coefficients. These values will always range from 0 (no categories in common) to 1 (all categories in common).

```{r}
jaccard <- function(a, b) {
  intersection <- length(which((a-b) == 0))
  union <- length(a) + length(b) - intersection
  return(intersection / union)
}
```

In order to try this function out, we will create three simple vectors of values and then compare them. Note that this function compares matches for the position of each item in the vector. So if `vec1[1] = 1` and `vec2[1] = 1` that is considered a match:

```{r}
vec1 <- c(0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0)
vec2 <- c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1)
vec3 <- c(0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1)

jaccard(vec1, vec2)

jaccard(vec2, vec3)

jaccard(vec1, vec3)
```

If we want to run this for an entire incidence matrix we could roll it into another function. Note that this function automatically takes an incidence matrix, converts it into presence/absence data and then calculates Jaccard coefficients for every pair of nodes. The output is a square matrix of similarities from row to row. Let's try it out in our `cibola_clust` data we used above:

```{r}
jaccard_inc <- function(dat) {
  dat[dat > 0] <- 1
  out <- matrix(NA, nrow(dat), nrow(dat))
  for (i in seq_len(nrow(dat))) {
    for (j in seq_len(nrow(dat))) {
      out[i, j] <- jaccard(dat[i, ], dat[j, ])
    }
  }
  return(out)
}

cibola_j <- jaccard_inc(dat = cibola_clust)

cibola_j[1:4, 1:4]

```

Note that we have created R scripts with the two functions above as separate files. If you would like to initialize those functions without copying and pasting the code above, you can use the `source()` function to call a function straight from a file. These are both located in the "scripts" folder so we add that sub-folder to the file name in the argument. [Click here to download jaccard.R](scripts/jaccard.R) and [here for jaccard_inc.R](scripts/jaccard_inc.R).

```{r}
source("scripts/jaccard.R")
source("scripts/jaccard_inc.R")

```

#### Creating Network Objects from Similarity Matrices {- #NetFromSim}

Now that we have defined our measures of similarity, the next step is to convert these into network objects that our R packages will be able to work with. We can do this by either creating binary networks (where ties are either present or absent) or weighted networks (which in many cases are simply the raw similarity/distance matrices we calculated above). We will provide examples of both approaches, starting with simple binary networks. There are many ways to define networks from matrices like those we generated above and our examples below should not been seen as an exhaustive set of procedures.

##### Creating binary network objects {-}

First, we will produce a network object based on our BR similarity matrix created above. In this example, we define ties as present between pairs of sites when they share more than 65% commonality (BR > 0.65) in terms of the proportions of ceramics recovered from pairs of sites. 

In the code below, the `event2dichot` function (from the `statnet` package) takes our matrix and divides it into 1s and 0s based on the cut off we choose. Here we're using and `absolute` cut off meaning we're assigning a specific value to use as the cut off (0.65). We then send the output of this function to the network function just as before. 

```{r}
# Define our binary network object from BR similarity
brnet <-
  network(event2dichot(cibola_br,
                       method = "absolute",
                       thresh = 0.65),
                       directed = FALSE)
# Now let's add names for our nodes based on the row names
# of our original matrix
brnet %v% "vertex.names" <- row.names(cibola_clust)
# look at the results.
brnet

# plot network using default layout
set.seed(7564)
plot(brnet)
```

In the next chunk of code we will use the $\chi^2$ distances to create binary networks. This time, we will not use an absolute value to define ties as present, but instead will define those similarities greater than 80 percent of all similarities as present. We will then once again plot just as above.

```{r }
# Note we use 1 minus chacoX01 here so to convert a distance
# to a similarity
xnet <-
  network(event2dichot(cibola_x01,
                       method = "quantile",
                       thresh = 0.80),
                       directed = FALSE)
# Once again add vertex names as row names of data frame
xnet %v% "vertex.names" <- row.names(cibola_clust)
# look at the results
xnet

# plot network using default layout
set.seed(346)
plot(xnet)
```

Finally, we'll use our Jaccard coefficients and we will define our threshold for defining edges as present or absent as the grand mean of the entire similarity matrix using `method = "mean"`. There are more options in the `event2dichot` function. See the help material for more.

```{r }
jnet <-
  network(event2dichot(cibola_j,
                       method = "mean"),
                       directed = FALSE)

jnet %v% "vertex.names" <- row.names(cibola_clust)

jnet

# plot network using default layout
set.seed(343546)
plot(jnet)
```

##### Creating Weighted Network Objects {-}

It is also possible to use R to create weighted networks where individual edges are valued. We have found that this works reasonably well with networks of co-presence or something similar (counts of mentions in texts or monuments for example) but this does not perform well when applied to large similarity or distance matrices (because every possible link has a value, the network gets unwieldy very fast). In the latter case, we have found it is often better to just work directly with the underlying similarity/distance matrix.

If you do, however, chose to create a weighted network object from a similarity matrix it only requires a slight modification from the procedure above. In the chunk of code below, we will simply add the arguments `ignore.eval = FALSE` and `names.eval = "weight"` to let the network function know we would like weights to be retained and we would like that attribute called 'weight'. We will apply this to the matrix of Morisita similarities defined above and then plot the result. 

```{r }
# create weighted network object from co-occurrence matrix by
# adding the ignore.eval=F argument
mor_wt <- network(
  cibola_mor,
  directed = FALSE,
  ignore.eval = FALSE,
  names.eval = "weight"
)

mor_wt %v% "vertex.names" <- row.names(cibola_mor)
mor_wt

# plot weighted network using default layout
set.seed(4634)
plot(mor_wt)
```

The resulting network is nearly complete so it is a bit unwieldy for plotting but calculating network statistics on this weighted network can often still be useful as we will see in the exploratory analysis section.

### Ego Networks{#EgoNetworks}

When we aim to understand the relational environment within which an entity is embedded, because it is relevant for our research questions or because data collection challenges dictate this focus, archaeological network research can make use of so-called ego-networks: a type of network that includes a focal node (the so-called ego), the set of nodes the ego is connected to by an edge (the so-called alters) and the edges between this set of nodes. 

Extracting an ego-network from an existing igraph network object in R is very easy. Here we will extract and plot the ego-network for Apache Creek, the first site in the network files we created above. First we read in data and create a network object and then we apply the `igraph::make_ego_graph` function to the network object we created.

```{r, warning=F, message=F}
# Read in edge list file as data frame
cibola_edgelist <-
  read.csv(file = "data/Cibola_edgelist.csv", header = TRUE)

# Create graph object. The data frame is converted to a matrix as
# that is required by this specific function. Since this is an
# undirected network, directed = FALSE.
cibola_net <-
  igraph::graph_from_edgelist(as.matrix(cibola_edgelist),
                              directed = FALSE)

# Extract ego-networks
ego_nets <- make_ego_graph(cibola_net)

# Examine the first ego-network
ego_nets[[1]]

# Plot Apache Creek ego-network
set.seed(754)
plot(ego_nets[[1]])

# Plot Platt Ranch ego-network for comparison
set.seed(45367)
plot(ego_nets[[30]])
```

In these ego-networks, only nodes connected to the target nodes (Apache Creek in the first example and then Platt Ranch in the second) are shown and only edges among those included nodes are shown.

It is also possible to determine the size of ego-networks for an entire one-mode network using the `ego_size` function. The output of this function is a vector that can be further assigned as a network node attribute.

```{r}
ego_size(cibola_net)

```

### Multilayer Network{#Multinet}

In the simplest terms, multilayer networks are networks where a single set of nodes are connected by two or more sets of edges that each represent a different kind of relationship among the nodes. This is a relatively new area of network science in archaeological network research but we expect this will likely change in the coming years. There are now new R packages which help manage and analyze multilayer network data.

The `multinet` package (Rossi and Vega 2021) is designed to facilitate the analysis of multilayer networks. In order to explore some of the possibilities here we use example data and analyses included in this package. Specifically, we will look at the famous network data on Florentine families in the 14th century where connections were defined in terms of both business and marriage.

```{r}
# create object with Florentine multilayer network data
florentine <- ml_florentine()

# Examine the data
florentine
summary(florentine)

# plot the data
plot(florentine)

```

The `multinet` network objects are compatible with `igraph` and individual layers can be analyzed just like other `igraph` network objects. Where this `multinet` approach likely has greater utility is in conducting comparisons among layers or conducting analyses that take several layers into account simultaneously. A detailed exploration of this approach is beyond the scope of this document (but we provide a simple example below) and we suggest interested readers read the package information and tutorials associated with this package for more. In the example here we calculate degree across multiple layers using the `degree_ml` function and then run a Louvain cluster detection algorithm across all graph layers using `glouvain_ml`. Multilayer networks have considerable potential for archaeological data and we hope to see more research in this area in the future.

```{r}
# If we want to calculate degree centrality across multiple layers of a
# multilayer network, the multinet package can help us do that directly
# and quite simply.
multinet::degree_ml(florentine)

# Similarly, we could apply cluster detection algorithms to all layers
# of a multilayer network simultaneously.
multinet::glouvain_ml(florentine)
```

For an archaeological example of multilevel network analysis [this GitHub project](https://github.com/ajupton/archy-multilayer-nets) by Andy Upton.

## Converting Among Network Object Types{#ConvertingNetworkFormats}

```{block, type="rmdtip"}
In most of the examples in this document we have been using the `igraph` package but for the similarity networks we chose to use `statnet` due to the convenience of functions for working directly with similarity data. Not to worry as it is easy to convert one format to another and preserve all of the attributes using a package called `intergraph`. By way of example below we can covert the weighted network object we created in the previous step and convert it to a `igraph` object and view the attributes using the `asIgraph` function. If we wanted to go the other direction and covert a `igraph` object to a `network` object (which is the format all of the `statnet` package require) we would instead use `asNetwrok`.
```

Here is a simple example:

```{r, warning=F, message=F}
mor_wt_i <- asIgraph(mor_wt)
mor_wt_i

# view first 10 edge weights to show that they are retained
E(mor_wt_i)$weight[1:10]
```

And back in the other direction:

```{r}
mor_new <- asNetwork(mor_wt_i)
mor_new

(mor_new %e% "weight")[1:10]
```