About My Research

Pedigree Reconstruction: Text

Pedigree Reconstruction and Kinship Inference (A Cautionary Tale)

Throughout most of my career I have worked on the problem of pedigree reconstruction. I would say I made two particularly important contributions to this field. I was the first was to show that it was possible to reconstruct a multigenerational pedigree without benefit of any age-related data (Almudevar 2003). The trick is essentially to recognize a pedigree as a Bayesian network.

My second contribution was to show that it is feasible to accurately partition a large sample of individuals from a single generation into sibling groups based on genetic data, without the benefit of parental data. Why is this problem difficult? The distributional properties of genetic data are well known, so it is always possible to calculate a statistical likelihood of the data for any given sibling partition. Then all we need to do is calculate the likelihood for all partitions, then choose the one with the maximum likelihood.

The problem with this approach is that the number of partitions of, say, N individuals grows very quickly. In fact, this number is well known in combinatorics, referred to as the Bell number. There are 5 partitions of the three labels {1,2,3} ([123], [1][23], [2][13], [3][12], [1][2][3]), so we have Bell number B[3] = 5. A recursive formula exists which gives B[4] = 15. Then B[10] = 115975. The problem I was originally given was to partition 781 Atlantic salmon into siblings, so calculating the likelihood for each partition was not an option.

When we get genetic data, we usually have data for K > 1 loci (a locus is a location on a specific chromosome). The more loci we have, the more accurate the inference. At each locus, the genes we observe obey the laws of Mendelian inheritance. That means that a true sibling group can have no more than four types of genes among them at a single locus, since they have the same two parents, who themselves can have no more than four types of genes among them (there are other restrictions, but we'll work with this for now). Suppose among a sample of 781 Atlantic salmon we observe, say 10 distinct genes at a single locus. It is certainly much easier to search for subsets of 4 genes from 10 than to search over partitions of 781 labels. Of course, one locus would not be enough to usefully resolve a sibling partition, but the information from multiple loci can be combined to accomplish this. This was the subject of Almudevar & Field (1999).

In order to complete the problem for N = 781, I had to introduce a modified version of the algorithm. To limit the number of solutions considered, a lower bound T on the size of an inferred sibling group could be set. The larger the value T, the smaller the number of putative sibling groups considered. So suppose we set T = 100. We might find one sibling group of size, say, 125. If we remove this sibling group from the sample, we now have 781 - 125 = 656 individuals. Then apply the algorithm again to the 656 remaining individuals, setting the lower bound to, say, T = 50. We might find another sibling group. Remove that one, repeat with a lower value of T, until we have found all the sibling groups (we call them maximal sibling groups (MSG)). Note that 2 individuals can never be excluded as siblings, so for the unrestricted algorithm set T = 3.

From Almudevar & Field (1999):

"An estimate of the sibling structure of the complete salmon data set was then attempted. The sample size was too large to compute the MSGs from the list using a threshold of 3, so the threshold was set higher, then subsequently reduced until an entire partition was reconstructed. In particular, a list of MSGs using a threshold of 100 was constructed, then the partitioning algorithm was applied. The resulting sibling groups were removed from the data set, then the process was repeated using a threshold of 50. Finally, the process was repeated with the remaining individuals using a threshold of 3. The estimated partition was reconstructed from the results of the three steps. The total computation time was approximately 11 hours on a SPARCstation 4."

"A very close agreement was found between the true and the estimated partition. For each true sibling group there clearly corresponded an estimated sibling group. For each pair of true sibling group and corresponding estimated sibling group it was possible to identify individuals present in one group but not the other. These individuals may be considered to be classification errors. Table 4 summarizes the numbers of error classifications attained by the 12 true sibling groups. The table indicates a total of 13 classification errors."

A newer version of this algorithm partitioned the same data in 34.1 seconds with one classification error (Almudevar & Anderson 2012).

Unfortunately, a number of authors attempted to replicate the results of Almudevar & Field (1999) on the same data, but without using the threshold modification. The algorithm will not complete under this condition, and this was clearly reported in Almudevar & Field (1999), as indicated in bold font above. However, the threshold modification was not mentioned in these reports, giving the impression that the results of Almudevar & Field (1999) could not be replicated.

This seems to have had a cascade of consequences. I reported some of these in two peer-reviewed publications (Almudevar 2011, Almudevar & Anderson 2012). The bottom line, however, is that progress in this field actually moved backwards, despite the all-encompassing rigor imposed on the scientific community by the peer-review system.

Pedigree Reconstruction: Text

A selection of my publications in this field

Almudevar A, Field C (1999) Estimation of single generation sibling relationships based on DNA markers. Journal of Agricultural, Biological and Environmental Statistics. 4:136-165.
Almudevar A (2001) A bootstrap assessment of variability in pedigree reconstruction based on DNA markers. Biometrics. 57:757-763.
Almudevar A (2001) Most powerful permutation invariant tests for relatedness hypotheses based on genotypic data. Biometrics. 57:1080-1088.
Almudevar A (2003) A simulated annealing algorithm for maximum likelihood pedigree reconstruction. Theoretical Population Biology. 63:63-75.
Almudevar A (2007) A graphical approach to relatedness inference. Theoretical Population Biology. 71:213-229.
Almudevar A (2011) A commentary on some recent methods in sibling group reconstruction based on set coverings. Optimization Methods and Software. 26:993-1003.
Almudevar A, Anderson EC (2012) A new version of PRT software for sibling groups reconstruction with comments regarding several issues in the sibling reconstruction problem. Molecular Ecology Resources. 12(1):164-78.
Almudevar A, Lacombe J (2012) On the choice of prior density for the Bayesian analysis of pedigree structure. Theoretical Population Biology. 81(2):131-143.
Almudevar A (2016) An information theoretic approach to pedigree reconstruction. Theoretical Population Biology. 107:52-64.

Pedigree Reconstruction: Image