http://stato-ontology.org/
Alejandra Gonzalez-Beltran (http://orcid.org/0000-0003-3499-8262)
STATO: the statistical methods ontology
Camille Maumet (http://orcid.org/0000-0002-6290-553X)
STATO is the statistical methods ontology. It contains concepts and properties covering statistical methods, probability distributions and other notions relevant to statistical analysis, including their relationships to study designs and plots.
stat-ontology@googlegroups.com
This Ontology is distributed under a Creative Commons Attribution License
RC1.4
http://creativecommons.org/licenses/by/3.0/
Philippe Rocca-Serra (http://orcid.org/0000-0001-9853-5668)
Thomas Nichols (http://orcid.org/0000-0002-4516-5103)
Chris Mungall (http://orcid.org/0000-0002-6601-2165)
Orlaith Burke
Statistical Method, Design of Experiment, Plots, Statistical Model
Nolan Nichols (http://orcid.org/0000-0003-1099-3328)
Hanna Cwiek (https://orcid.org/0000-0001-9113-567X)
https://github.com/ISA-tools/stato/issues
Relates an entity in the ontology to the name of the variable that is used to represent it in the code that generates the BFO OWL file from the lispy specification.
Really of interest to developers only
BFO OWL specification label
Relates an entity in the ontology to the term that is used to represent it in the CLIF specification of BFO2
Person:Alan Ruttenberg
Really of interest to developers only
BFO CLIF specification label
editor preferred label
editor preferred label
editor preferred term
editor preferred term
editor preferred term~editor preferred label
The concise, meaningful, and human-friendly name for a class or property preferred by the ontology developers. (US-English)
PERSON:Daniel Schober
GROUP:OBI:<http://purl.obolibrary.org/obo/obi>
editor preferred label
editor preferred label
editor preferred term
editor preferred term
editor preferred term~editor preferred label
example
A phrase describing how a term should be used and/or a citation to a work which uses it. May also include other kinds of examples that facilitate immediate understanding, such as widely known prototypes or instances of a class, or cases where a relation is said to hold.
PERSON:Daniel Schober
GROUP:OBI:<http://purl.obolibrary.org/obo/obi>
example of usage
has curation status
PERSON:Alan Ruttenberg
PERSON:Bill Bug
PERSON:Melanie Courtot
OBI_0000281
has curation status
definition
definition
definition
textual definition
textual definition
The official OBI definition, explaining the meaning of a class or property. Shall be Aristotelian, formalized and normalized. Can be augmented with colloquial definitions.
The official definition, explaining the meaning of a class or property. Shall be Aristotelian, formalized and normalized. Can be augmented with colloquial definitions.
2012-04-05:
Barry Smith
The official OBI definition, explaining the meaning of a class or property: 'Shall be Aristotelian, formalized and normalized. Can be augmented with colloquial definitions' is terrible.
Can you fix to something like:
A statement of necessary and sufficient conditions explaining the meaning of an expression referring to a class or property.
Alan Ruttenberg
Your proposed definition is a reasonable candidate, except that it is very common that necessary and sufficient conditions are not given. Mostly they are necessary, occasionally they are necessary and sufficient or just sufficient. Often they use terms that are not themselves defined and so they effectively can't be evaluated by those criteria.
On the specifics of the proposed definition:
We don't have definitions of 'meaning' or 'expression' or 'property'. For 'reference' in the intended sense I think we use the term 'denotation'. For 'expression', I think you mean symbol, or identifier. For 'meaning' it differs for class and property. For class we want documentation that lets the intended reader determine whether an entity is an instance of the class, or not. For property we want documentation that lets the intended reader determine, given a pair of potential relata, whether the assertion that the relation holds is true. The 'intended reader' part suggests that we also specify who we expect would be able to understand the definition, and also generalizes over human and computer readers to include textual and logical definitions.
Personally, I am more comfortable weakening definition to documentation, with instructions as to what is desirable.
We also have the outstanding issue of how to aim different definitions at different audiences. A clinical audience reading ChEBI wants a different sort of documentation/definition from a chemistry-trained audience, and similarly there is a need for a definition that is adequate for an ontologist to work with.
PERSON:Daniel Schober
GROUP:OBI:<http://purl.obolibrary.org/obo/obi>
definition
definition
definition
textual definition
textual definition
editor note
An administrative note intended for its editor. It may not be included in the publication version of the ontology, so it should contain nothing necessary for end users to understand the ontology.
PERSON:Daniel Schober
GROUP:OBI:<http://purl.obolibrary.org/obo/obi>
editor note
term editor
Name of editor entering the term in the file. The term editor is a point of contact for information regarding the term. The term editor may be, but is not always, the author of the definition, which may have been worked upon by several people
20110707, MC: label update to term editor and definition modified accordingly. See https://github.com/information-artifact-ontology/IAO/issues/115.
PERSON:Daniel Schober
GROUP:OBI:<http://purl.obolibrary.org/obo/obi>
term editor
alternative term
An alternative name for a class or property which means the same thing as the preferred name (semantically equivalent)
PERSON:Daniel Schober
GROUP:OBI:<http://purl.obolibrary.org/obo/obi>
alternative term
definition source
Formal citation, e.g. an identifier in an external database, or free text, used to indicate/attribute the source(s) of the definition. EXAMPLE: author name, URI, MeSH term C04, PubMed ID, wiki URI on 31.01.2007
PERSON:Daniel Schober
Discussion on obo-discuss mailing-list, see http://bit.ly/hgm99w
GROUP:OBI:<http://purl.obolibrary.org/obo/obi>
definition source
curator note
An administrative note of use for a curator but of no use for a user
PERSON:Alan Ruttenberg
curator note
term tracker item
the URI for an OBI Terms ticket at sourceforge, such as https://sourceforge.net/p/obi/obi-terms/772/
An IRI or similar locator for a request or discussion of an ontology term.
Person: Jie Zheng, Chris Stoeckert, Alan Ruttenberg
Person: Jie Zheng, Chris Stoeckert, Alan Ruttenberg
The 'tracker item' can associate a tracker with a specific ontology term.
term tracker item
imported from
For external terms/classes, the ontology from which the term was imported
PERSON:Alan Ruttenberg
PERSON:Melanie Courtot
GROUP:OBI:<http://purl.obolibrary.org/obo/obi>
imported from
OBO foundry unique label
An alternative name for a class or property which is unique across the OBO Foundry.
The intended usage of that property is as follows: OBO Foundry unique labels are automatically generated based on regular expressions provided by each ontology, so that SO could specify unique label = 'sequence ' + [label], MA could specify 'mouse ' + [label], etc. Upon importing terms, ontology developers can choose whether or not to use the 'OBO foundry unique label' for an imported term. The same applies to tools.
PERSON:Alan Ruttenberg
PERSON:Bjoern Peters
PERSON:Chris Mungall
PERSON:Melanie Courtot
GROUP:OBO Foundry <http://obofoundry.org/>
OBO foundry unique label
elucidation
person:Alan Ruttenberg
Person:Barry Smith
Primitive terms in a highest-level ontology such as BFO are terms which are so basic to our understanding of reality that there is no way of defining them in a non-circular fashion. For these, therefore, we can provide only elucidations, supplemented by examples and by axioms
elucidation
has associated axiom(nl)
Person:Alan Ruttenberg
Person:Alan Ruttenberg
An axiom associated with a term expressed using natural language
has associated axiom(nl)
has associated axiom(fol)
Person:Alan Ruttenberg
Person:Alan Ruttenberg
An axiom expressed in first order logic using CLIF syntax
has associated axiom(fol)
ISA alternative term
An alternative term used by the ISA tools project (http://isa-tools.org).
Requested by Alejandra Gonzalez-Beltran
https://sourceforge.net/tracker/?func=detail&aid=3603413&group_id=177891&atid=886178
Person: Alejandra Gonzalez-Beltran
Person: Philippe Rocca-Serra
ISA tools project (http://isa-tools.org)
ISA alternative term
IEDB alternative term
An alternative term used by the IEDB.
PERSON:Randi Vita, Jason Greenbaum, Bjoern Peters
IEDB
IEDB alternative term
temporal interpretation
https://github.com/oborel/obo-relations/wiki/ROAndTime
an alternative term used by the STATO statistical methods ontology and the ISA team
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO alternative term
R command syntax, or a link to R documentation, in support of statistical ontology classes or data transformations
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
R command
an annotation property to provide a canonical command to invoke a method implementation using the Python programming language
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Python command
the most common series or system of written mathematical symbols used to represent the entity
AGB
preferred mathematical notation
Examples of a Contributor include a person, an organisation, or a service. Typically, the name of a Contributor should be used to indicate the entity.
An entity responsible for making contributions to the content of the resource.
Contributor
Contributor
Examples of a Creator include a person, an organisation, or a service. Typically, the name of a Creator should be used to indicate the entity.
An entity primarily responsible for making the content of the resource.
Creator
Creator
Typically, Date will be associated with the creation or availability of the resource. Recommended best practice for encoding the date value is defined in a profile of ISO 8601 [W3CDTF] and follows the YYYY-MM-DD format.
A date associated with an event in the life cycle of the resource.
Date
Date
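The recommended W3CDTF date encoding above can be illustrated with a small Python sketch; the function names (`encode_date`, `is_w3cdtf_date`) are ours, not part of any ontology or the Dublin Core specification:

```python
import re
from datetime import date

# Regex for the YYYY-MM-DD profile of ISO 8601 (W3CDTF) recommended above.
W3CDTF_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def encode_date(d: date) -> str:
    """Encode a date value in the recommended YYYY-MM-DD form."""
    return d.isoformat()

def is_w3cdtf_date(value: str) -> bool:
    """Check whether a date annotation value follows the recommended format."""
    return bool(W3CDTF_DATE.match(value))
```

For example, `encode_date(date(2018, 5, 11))` yields `"2018-05-11"`, which `is_w3cdtf_date` accepts, while a value like `"11.05.2018"` is rejected.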
Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content.
An account of the content of the resource.
Description
Description
Typically, Format may include the media-type or dimensions of the resource. Format may be used to determine the software, hardware or other equipment needed to display or operate the resource. Examples of dimensions include size and duration. Recommended best practice is to select a value from a controlled vocabulary (for example, the list of Internet Media Types [MIME] defining computer media formats).
The physical or digital manifestation of the resource.
Format
Format
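The controlled vocabulary of Internet Media Types mentioned above can be consulted from Python's standard library; this is only an illustration of looking up a Format value, and the helper name `media_type_of` is ours:

```python
import mimetypes

def media_type_of(filename):
    """Look up an Internet Media Type (MIME) for a resource by filename,
    one controlled vocabulary suitable for a Format annotation value."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type
```

For instance, `media_type_of("paper.pdf")` returns `"application/pdf"`.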
The present resource may be derived from the Source resource in whole or in part. Recommended best practice is to reference the resource by means of a string or number conforming to a formal identification system.
A reference to a resource from which the present resource is derived.
Source
Source
Typically, a Subject will be expressed as keywords, key phrases or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme.
The topic of the content of the resource.
Subject and Keywords
Subject and Keywords
Mark Miller
2018-05-11T13:47:29Z
label
label
is part of
my brain is part of my body (continuant parthood, two material entities)
my stomach cavity is part of my stomach (continuant parthood, immaterial entity is part of material entity)
this day is part of this year (occurrent parthood)
a core relation that holds between a part and its whole
Everything is part of itself. Any part of any part of a thing is itself part of that thing. Two distinct things cannot be part of each other.
Occurrents are not subject to change and so parthood between occurrents holds for all the times that the part exists. Many continuants are subject to change, so parthood between continuants will only hold at certain times, but this is difficult to specify in OWL. See https://code.google.com/p/obo-relations/wiki/ROAndTime
Parthood requires the part and the whole to have compatible classes: only an occurrent can be part of an occurrent; only a process can be part of a process; only a continuant can be part of a continuant; only an independent continuant can be part of an independent continuant; only an immaterial entity can be part of an immaterial entity; only a specifically dependent continuant can be part of a specifically dependent continuant; only a generically dependent continuant can be part of a generically dependent continuant. (This list is not exhaustive.)
A continuant cannot be part of an occurrent: use 'participates in'. An occurrent cannot be part of a continuant: use 'has participant'. A material entity cannot be part of an immaterial entity: use 'has location'. A specifically dependent continuant cannot be part of an independent continuant: use 'inheres in'. An independent continuant cannot be part of a specifically dependent continuant: use 'bearer of'.
part_of
part of
http://www.obofoundry.org/ro/#OBO_REL:part_of
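The axioms quoted above (everything is part of itself; any part of a part is a part) amount to a reflexive, transitive closure over asserted direct-part edges. A toy Python sketch, in no way part of the ontology itself, with names of our choosing:

```python
def parts_of(whole, direct_parts):
    """All parts of `whole` under the axioms quoted above:
    reflexivity (everything is part of itself) and transitivity
    (any part of a part is itself a part).
    `direct_parts` maps an entity to its asserted direct parts."""
    closure = {whole}          # everything is part of itself
    frontier = [whole]
    while frontier:
        w = frontier.pop()
        for p in direct_parts.get(w, ()):   # a part of a part is a part
            if p not in closure:
                closure.add(p)
                frontier.append(p)
    return closure
```

With `{"body": ["brain", "stomach"], "stomach": ["stomach cavity"]}`, the stomach cavity comes out as a part of the body, matching the brain/stomach examples above.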
has part
my body has part my brain (continuant parthood, two material entities)
my stomach has part my stomach cavity (continuant parthood, material entity has part immaterial entity)
this year has part this day (occurrent parthood)
a core relation that holds between a whole and its part
Everything has itself as a part. Any part of any part of a thing is itself part of that thing. Two distinct things cannot have each other as a part.
Occurrents are not subject to change and so parthood between occurrents holds for all the times that the part exists. Many continuants are subject to change, so parthood between continuants will only hold at certain times, but this is difficult to specify in OWL. See https://code.google.com/p/obo-relations/wiki/ROAndTime
Parthood requires the part and the whole to have compatible classes: only an occurrent can have an occurrent as part; only a process can have a process as part; only a continuant can have a continuant as part; only an independent continuant can have an independent continuant as part; only a specifically dependent continuant can have a specifically dependent continuant as part; only a generically dependent continuant can have a generically dependent continuant as part. (This list is not exhaustive.)
A continuant cannot have an occurrent as part: use 'participates in'. An occurrent cannot have a continuant as part: use 'has participant'. An immaterial entity cannot have a material entity as part: use 'location of'. An independent continuant cannot have a specifically dependent continuant as part: use 'bearer of'. A specifically dependent continuant cannot have an independent continuant as part: use 'inheres in'.
has_part
has part
realized in
this disease is realized in this disease course
this fragility is realized in this shattering
this investigator role is realized in this investigation
is realized by
realized_in
[copied from inverse property 'realizes'] to say that b realizes c at t is to assert that there is some material entity d & b is a process which has participant d at t & c is a disposition or role of which d is bearer_of at t & the type instantiated by b is correlated with the type instantiated by c. (axiom label in BFO2 Reference: [059-003])
Paraphrase of elucidation: a relation between a realizable entity and a process, where there is some material entity that is bearer of the realizable entity and participates in the process, and the realizable entity comes to be realized in the course of the process
realized in
realizes
this disease course realizes this disease
this investigation realizes this investigator role
this shattering realizes this fragility
to say that b realizes c at t is to assert that there is some material entity d & b is a process which has participant d at t & c is a disposition or role of which d is bearer_of at t & the type instantiated by b is correlated with the type instantiated by c. (axiom label in BFO2 Reference: [059-003])
Paraphrase of elucidation: a relation between a process and a realizable entity, where there is some material entity that is bearer of the realizable entity and participates in the process, and the realizable entity comes to be realized in the course of the process
realizes
preceded by
An example is: translation preceded_by transcription; aging preceded_by development (not however death preceded_by aging). Where derives_from links classes of continuants, preceded_by links classes of processes. Clearly, however, these two relations are not independent of each other. Thus if cells of type C1 derive_from cells of type C, then any cell division involving an instance of C1 in a given lineage is preceded_by cellular processes involving an instance of C. The assertion P preceded_by P1 tells us something about Ps in general: that is, it tells us something about what happened earlier, given what we know about what happened later. Thus it does not provide information pointing in the opposite direction, concerning instances of P1 in general; that is, that each is such as to be succeeded by some instance of P. Note that an assertion to the effect that P preceded_by P1 is rather weak; it tells us little about the relations between the underlying instances in virtue of which the preceded_by relation obtains. Typically we will be interested in stronger relations, for example in the relation immediately_preceded_by, or in relations which combine preceded_by with a condition to the effect that the corresponding instances of P and P1 share participants, or that their participants are connected by relations of derivation, or (as a first step along the road to a treatment of causality) that the one process in some way affects (for example, initiates or regulates) the other.
is preceded by
preceded_by
http://www.obofoundry.org/ro/#OBO_REL:preceded_by
preceded by
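At the instance level, precedence between processes can be sketched by modelling each process as a (start, end) interval; this is a toy illustration under our own assumption that p is preceded by q when q ends no later than p starts, not an implementation from RO:

```python
def preceded_by(p, q):
    """True when process interval p is preceded by process interval q,
    i.e. q ends no later than p starts. Intervals are (start, end) pairs."""
    p_start, _p_end = p
    _q_start, q_end = q
    return q_end <= p_start

# Example from the note above: translation preceded_by transcription.
transcription = (0.0, 5.0)
translation = (6.0, 10.0)
```

Here `preceded_by(translation, transcription)` holds, while the reverse does not.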
precedes
precedes
has measurement unit label
This document is about information artifacts and their representations
is_about is a (currently) primitive relation that relates an information artifact to an entity.
7/6/2009 Alan Ruttenberg. Following discussion with Jonathan Rees, and introduction of "mentions" relation. Weaken the is_about relationship to be primitive.
We will try to build it back up by elaborating the various subproperties that are more precisely defined.
Some currently missing phenomena that should be considered "about" are predications - "The only person who knows the answer is sitting beside me" , Allegory, Satire, and other literary forms that can be topical without explicitly mentioning the topic.
person:Alan Ruttenberg
Smith, Ceusters, Ruttenberg, 2000 years of philosophy
is about
A person's name denotes the person. A variable name in a computer program denotes some piece of memory. Lexically equivalent strings can denote different things, for instance "Alan" can denote different people. In each case of use, there is a case of the denotation relation obtaining, between "Alan" and the person that is being named.
denotes is a primitive, instance-level, relation obtaining between an information content entity and some portion of reality. Denotation is what happens when someone creates an information content entity E in order to specifically refer to something. The only relation between E and the thing is that E can be used to 'pick out' the thing. This relation connects those two together. Freedictionary.com sense 3: To signify directly; refer to specifically
2009-11-10 Alan Ruttenberg. Old definition said the following to emphasize the generic nature of this relation. We no longer have 'specifically denotes', which would have been primitive, so make this relation primitive.
g denotes r =def
r is a portion of reality
there is some c that is a concretization of g
every c that is a concretization of g specifically denotes r
person:Alan Ruttenberg
Conversations with Barry Smith, Werner Ceusters, Bjoern Peters, Michel Dumontier, Melanie Courtot, James Malone, Bill Hogan
denotes
m is a quality measurement of q at t when
q is a quality
there is a measurement process p that has specified output m, a measurement datum, that is about q
8/6/2009 Alan Ruttenberg: The strategy is to be rather specific with this relationship. There are other kinds of measurements that are not of qualities, such as those that measure time. We will add these as separate properties for the moment and see about generalizing later
From the second IAO workshop [Alan Ruttenberg 8/6/2009: not completely current, though bringing in comparison is probably important]
This one is the one we are struggling with at the moment. The issue is what a measurement measures. On the one hand saying that it measures the quality would include it "measuring" the bearer = referring to the bearer in the measurement. However this makes comparisons of two different things not possible. On the other hand not having it inhere in the bearer, on the face of it, breaks the audit trail.
Werner suggests a solution based on "Magnitudes" a proposal for which we are awaiting details.
--
From the second IAO workshop, various comments, [commented on by Alan Ruttenberg 8/6/2009]
unit of measure is a quality, e.g. the length of a ruler.
[We decided to hedge on what units of measure are, instead talking about measurement unit labels, which are the information content entities that are about whatever measurement units are. For IAO we need that information entity in any case. See the term measurement unit label]
[Some struggling with the various subflavors of is_about. We subsequently removed the relation represents, and describes until and only when we have a better theory]
a represents b means either a denotes b or a describes b
describe:
a describes b means a is about b and a allows an inference of at least one quality of b
We have had a long discussion about denotes versus describes.
From the second IAO workshop: An attempt at tying the quality to the measurement datum more carefully.
a is a magnitude means a is a determinate quality particular inhering in some bearer b existing at a time t that can be represented/denoted by an information content entity e that has parts denoting a unit of measure, a number, and b. The unit of measure is an instance of the determinable quality.
From the second meeting on IAO:
An attempt at defining assay using Barry's "reliability" wording
assay:
process and has_input some material entity
and has_output some information content entity
and which is such that instances of this process type reliably generate outputs that describe the input.
This one is the one we are struggling with at the moment. The issue is what a measurement measures. On the one hand saying that it measures the quality would include it "measuring" the bearer = referring to the bearer in the measurement. However this makes comparisons of two different things not possible. On the other hand not having it inhere in the bearer, on the face of it, breaks the audit trail.
Werner suggests a solution based on "Magnitudes" a proposal for which we are awaiting details.
Alan Ruttenberg
is quality measurement of
relating a Cartesian spatial coordinate datum to a unit label that, together with the values, represents a point
has coordinate unit label
relates a process to a time-measurement-datum that represents the duration of the process
Person:Alan Ruttenberg
is duration of
inverse of the relation 'is quality measurement of'
2009/10/19 Alan Ruttenberg. Named 'junk' relation useful in restrictions, but not a real instance relationship
Person:Alan Ruttenberg
is quality measured as
relates a time stamped measurement datum to the time measurement datum that denotes the time when the measurement was taken
Alan Ruttenberg
has time stamp
relates a time stamped measurement datum to the measurement datum that was measured
Alan Ruttenberg
has measurement datum
is_supported_by_data
The relation between the conclusion "Gene tpbA is involved in EPS production" and the data items produced using two sets of organisms, one being a tpbA knockout, the other being tpbA wildtype, tested in polysaccharide production assays and analyzed using an ANOVA.
The relation between a data item and a conclusion where the conclusion is the output of a data interpreting process and the data item is used as an input to that process
OBI
OBI
Philly 2011 workshop
is_supported_by_data
has_specified_input
has_specified_input
see is_input_of example_of_usage
A relation between a planned process and a continuant participating in that process that is not created during the process. The presence of the continuant during the process is explicitly specified in the plan specification which the process realizes the concretization of.
8/17/09: specified inputs of one process are not necessarily specified inputs of a larger process that it is part of. This is in contrast to how 'has participant' works.
PERSON: Alan Ruttenberg
PERSON: Bjoern Peters
PERSON: Larry Hunter
PERSON: Melanie Courtot
has_specified_input
is_specified_input_of
some Autologous EBV(Epstein-Barr virus)-transformed B-LCL (B lymphocyte cell line) is_input_for instance of Chromium Release Assay described at https://wiki.cbil.upenn.edu/obiwiki/index.php/Chromium_Release_assay
A relation between a planned process and a continuant participating in that process that is not created during the process. The presence of the continuant during the process is explicitly specified in the plan specification which the process realizes the concretization of.
Alan Ruttenberg
PERSON:Bjoern Peters
is_specified_input_of
has_specified_output
has_specified_output
A relation between a planned process and a continuant participating in that process. The presence of the continuant at the end of the process is explicitly specified in the objective specification which the process realizes the concretization of.
PERSON: Alan Ruttenberg
PERSON: Bjoern Peters
PERSON: Larry Hunter
PERSON: Melanie Courtot
has_specified_output
is_specified_output_of
is_specified_output_of
A relation between a planned process and a continuant participating in that process. The presence of the continuant at the end of the process is explicitly specified in the objective specification which the process realizes the concretization of.
Alan Ruttenberg
PERSON:Bjoern Peters
is_specified_output_of
is_proxy_for
position on a gel is_proxy_for mass and charge of molecule in a western blot. Fluorescent intensity is_proxy_for amount of protein labeled with GFP. Examples:
A260/A280 (of a DNA sample) is_proxy_for DNA-purity. NMR Sample scan is a proxy for sample quality.
Within the assay mentioned here: https://wiki.cbil.upenn.edu/obiwiki/index.php/Chromium_Release_assay
level of radioactivity is_proxy_for level of toxicity
A relation between continuant instances c1 and c2 where, within an experiment/protocol application, measurement of c1 is used to determine what a measurement of c2 would be.
A relation between continuant instances c1 and c2 where, within a protocol application, measurement of c1 is related to what would be the measurement of c2. (another definition)
Alan Ruttenberg
is_proxy_for
achieves_planned_objective
A cell sorting process achieves the objective specification 'material separation objective'
This relation obtains between a planned process and an objective specification when the criteria specified in the objective specification are met at the end of the planned process.
BP, AR, PPPB branch
PPPB branch derived
modified according to email thread from 1/23/09 in accordance with DT and PPPB branch
achieves_planned_objective
has grain
the relation of the cells in the skin of the finger to the finger, in which an indeterminate number of grains are parts of the whole by virtue of being grains in a collective that is part of the whole, and in which removing one granular part does not necessarily damage or diminish the whole. Ontological: whether there is a fixed, or nearly fixed, number of parts - e.g. fingers of the hand, chambers of the heart, or wheels of a car - such that there can be a notion of a single one being missing, or whether, by contrast, the number of parts is indeterminate - e.g., cells in the skin of the hand, red cells in blood, or rubber molecules in the tread of the tire of the wheel of the car.
Discussion in Karlsruhe with, among others, Alan Rector, Stefan Schulz, Marijke Keet, Melanie Courtot, and Alan Ruttenberg. Definition taken from the definition of granular parthood in the cited paper. Needs work to put into standard form
PERSON: Alan Ruttenberg
PAPER: Granularity, scale and collectivity: When size does and does not matter, Alan Rector, Jeremy Rogers, Thomas Bittner, Journal of Biomedical Informatics 39 (2006) 333-349
has grain
objective_achieved_by
This relation obtains between an objective specification and a planned process when the criteria specified in the objective specification are met at the end of the planned process.
OBI
OBI
objective_achieved_by
is member of organization
Relating a legal person or organization to an organization in the case where the legal person or organization has a role as member of the organization.
2009/10/01 Alan Ruttenberg. Barry prefers generic is-member-of. Question of what the range should be. For now organization. Is organization a population? Would the same relation be used to record members of a population
JZ: Discussed on May 7, 2012 OBI dev call. Bjoern points out that we need to allow for organizations to be members of organizations. And agreed by the other OBI developers. So, human and organization were specified in 'Domains'. The textual definition was updated based on it.
Person:Alan Ruttenberg
Person:Helen Parkinson
Person:Alan Ruttenberg
Person:Helen Parkinson
2009/09/28 Alan Ruttenberg. Fucoidan-use-case
is member of organization
has organization member
Relating an organization to a legal person or organization.
See tracker:
https://sourceforge.net/tracker/index.php?func=detail&aid=3512902&group_id=177891&atid=886178
Person: Jie Zheng
has organization member
specifies value of
A relation between a value specification and an entity which the specification is about.
specifies value of
has value specification
A relation between an information content entity and a value specification that specifies its value.
PERSON: James A. Overton
OBI
has value specification
inheres in
this fragility inheres in this vase
this red color inheres in this apple
a relation between a specifically dependent continuant (the dependent) and an independent continuant (the bearer), in which the dependent specifically depends on the bearer for its existence
A dependent inheres in its bearer at all times for which the dependent exists.
inheres_in
inheres in
bearer of
this apple is bearer of this red color
this vase is bearer of this fragility
a relation between an independent continuant (the bearer) and a specifically dependent continuant (the dependent), in which the dependent specifically depends on the bearer for its existence
A bearer can have many dependents, and its dependents can exist for different periods of time, but none of its dependents can exist when the bearer does not exist.
bearer_of
is bearer of
bearer of
participates in
this blood clot participates in this blood coagulation
this input material (or this output material) participates in this process
this investigator participates in this investigation
a relation between a continuant and a process, in which the continuant is somehow involved in the process
participates_in
participates in
has participant
this blood coagulation has participant this blood clot
this investigation has participant this investigator
this process has participant this input material (or this output material)
a relation between a process and a continuant, in which the continuant is somehow involved in the process
Has_participant is a primitive instance-level relation between a process, a continuant, and a time at which the continuant participates in some way in the process. The relation obtains, for example, when this particular process of oxygen exchange across this particular alveolar membrane has_participant this particular sample of hemoglobin at this particular time.
has_participant
http://www.obofoundry.org/ro/#OBO_REL:has_participant
has participant
A journal article is an information artifact that inheres in some number of printed journals. For each copy of the printed journal there is some quality that carries the journal article, such as a pattern of ink. The journal article (a generically dependent continuant) is concretized as the quality (a specifically dependent continuant), and both depend on that copy of the printed journal (an independent continuant).
An investigator reads a protocol and forms a plan to carry out an assay. The plan is a realizable entity (a specifically dependent continuant) that concretizes the protocol (a generically dependent continuant), and both depend on the investigator (an independent continuant). The plan is then realized by the assay (a process).
A relationship between a generically dependent continuant and a specifically dependent continuant, in which the generically dependent continuant depends on some independent continuant in virtue of the fact that the specifically dependent continuant also depends on that same independent continuant. A generically dependent continuant may be concretized as multiple specifically dependent continuants.
is concretized as
A journal article is an information artifact that inheres in some number of printed journals. For each copy of the printed journal there is some quality that carries the journal article, such as a pattern of ink. The quality (a specifically dependent continuant) concretizes the journal article (a generically dependent continuant), and both depend on that copy of the printed journal (an independent continuant).
An investigator reads a protocol and forms a plan to carry out an assay. The plan is a realizable entity (a specifically dependent continuant) that concretizes the protocol (a generically dependent continuant), and both depend on the investigator (an independent continuant). The plan is then realized by the assay (a process).
A relationship between a specifically dependent continuant and a generically dependent continuant, in which the generically dependent continuant depends on some independent continuant in virtue of the fact that the specifically dependent continuant also depends on that same independent continuant. Multiple specifically dependent continuants can concretize the same generically dependent continuant.
concretizes
this catalysis function is a function of this enzyme
a relation between a function and an independent continuant (the bearer), in which the function specifically depends on the bearer for its existence
A function inheres in its bearer at all times for which the function exists, however the function need not be realized at all the times that the function exists.
function_of
is function of
function of
this red color is a quality of this apple
a relation between a quality and an independent continuant (the bearer), in which the quality specifically depends on the bearer for its existence
A quality inheres in its bearer at all times for which the quality exists.
is quality of
quality_of
quality of
this investigator role is a role of this person
a relation between a role and an independent continuant (the bearer), in which the role specifically depends on the bearer for its existence
A role inheres in its bearer at all times for which the role exists, however the role need not be realized at all the times that the role exists.
is role of
role_of
role of
this enzyme has function this catalysis function (more colloquially: this enzyme has this catalysis function)
a relation between an independent continuant (the bearer) and a function, in which the function specifically depends on the bearer for its existence
A bearer can have many functions, and its functions can exist for different periods of time, but none of its functions can exist when the bearer does not exist. A function need not be realized at all the times that the function exists.
has_function
has function
this apple has quality this red color
a relation between an independent continuant (the bearer) and a quality, in which the quality specifically depends on the bearer for its existence
A bearer can have many qualities, and its qualities can exist for different periods of time, but none of its qualities can exist when the bearer does not exist.
has_quality
has quality
this person has role this investigator role (more colloquially: this person has this role of investigator)
a relation between an independent continuant (the bearer) and a role, in which the role specifically depends on the bearer for its existence
A bearer can have many roles, and its roles can exist for different periods of time, but none of its roles can exist when the bearer does not exist. A role need not be realized at all the times that the role exists.
has_role
has role
derives from
this cell derives from this parent cell (cell division)
this nucleus derives from this parent nucleus (nuclear division)
a relation between two distinct material entities, the new entity and the old entity, in which the new entity begins to exist when the old entity ceases to exist, and the new entity inherits the significant portion of the matter of the old entity
This is a very general relation. More specific relations are preferred when applicable, such as 'directly develops from'.
derives_from
derives from
this parent cell derives into this cell (cell division)
this parent nucleus derives into this nucleus (nuclear division)
a relation between two distinct material entities, the old entity and the new entity, in which the new entity begins to exist when the old entity ceases to exist, and the new entity inherits the significant portion of the matter of the old entity
This is a very general relation. More specific relations are preferred when applicable, such as 'directly develops into'. To avoid making statements about a future that may not come to pass, it is often better to use the backward-looking 'derives from' rather than the forward-looking 'derives into'.
derives_into
derives into
is location of
my head is the location of my brain
this cage is the location of this rat
a relation between two independent continuants, the location and the target, in which the target is entirely within the location
Most location relations will only hold at certain times, but this is difficult to specify in OWL. See https://code.google.com/p/obo-relations/wiki/ROAndTime
location_of
location of
located in
my brain is located in my head
this rat is located in this cage
a relation between two independent continuants, the target and the location, in which the target is entirely within the location
Location as a relation between instances: The primitive instance-level relation c located_in r at t reflects the fact that each continuant is at any given time associated with exactly one spatial region, namely its exact location. Following this, we can use this relation to define a further instance-level location relation - not between a continuant and the region which it exactly occupies, but rather between one continuant and another. c is located in c1, in this sense, whenever the spatial region occupied by c is part_of the spatial region occupied by c1. Note that this relation comprehends both the relation of exact location between one continuant and another which obtains when r and r1 are identical (for example, when a portion of fluid exactly fills a cavity), as well as those sorts of inexact location relations which obtain, for example, between brain and head or between ovum and uterus.
Most location relations will only hold at certain times, but this is difficult to specify in OWL. See https://code.google.com/p/obo-relations/wiki/ROAndTime
located_in
http://www.obofoundry.org/ro/#OBO_REL:located_in
located in
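The definition above says c is located in c1 whenever the spatial region c occupies is part_of the region c1 occupies. A toy sketch, under the simplifying assumption that regions can be modeled as finite point sets and part_of as the subset relation:

```python
# Toy model of 'located in': each continuant's exact location is a
# set of points; c located_in c1 iff region(c) is a subset of
# region(c1). Exact location is the special case of equal regions.
# The point sets below are purely illustrative.

def located_in(region_c, region_c1):
    return region_c <= region_c1  # part_of modeled as subset

cage = frozenset(range(0, 100))
rat = frozenset(range(10, 20))
fluid = frozenset(range(0, 100))  # exactly fills the cage-shaped cavity

assert located_in(rat, cage)                      # inexact location
assert located_in(fluid, cage) and fluid == cage  # exact location
assert not located_in(cage, rat)
```

The asymmetry of the last assertion mirrors the fact that the location relation runs from target to location, not the reverse.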
move to BFO?
Allen
A relation that holds between two occurrents. This is a grouping relation that collects together all the Allen relations.
temporal relation
property to indicate that a design declares a variable; the inverse property is 'is declared by'
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
declares
property to indicate the variables declared by a design; the inverse property is 'declares'
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
is declared by
the relationship between a fraction and the number above the line
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB
has numerator
relationship between a planned process and the plan specification that it carries out; it is defined as equivalent to the composed relationship (realizes o concretizes)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB
executes
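Since 'executes' is defined as the property chain (realizes o concretizes), it can be sketched as a composition of two mappings. The individuals named here are hypothetical:

```python
# Sketch of the property chain 'executes' = realizes o concretizes:
# a planned process realizes a plan, the plan concretizes a plan
# specification, so the process executes that specification.

realizes = {"assay process 01": "plan 01"}      # process -> plan
concretizes = {"plan 01": "assay protocol v2"}  # plan -> plan specification

def executes(process):
    """Compose the two relations: realizes, then concretizes."""
    plan = realizes.get(process)
    return concretizes.get(plan) if plan is not None else None

assert executes("assay process 01") == "assay protocol v2"
```

In OWL 2 the same inference is obtained with an ObjectPropertyChain axiom, so 'executes' triples never need to be asserted directly.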
This is the inverse of 'specifies value of' and it is intended to say things such as 'compound' 'assumes values specified by' 'independent variable specification'
A relation between an entity and a value specification, where the value specification is about the entity.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB
assumes values specified by
relationship between an element and a set it belongs to
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB
is member of
relationship between a set and one of its elements
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB
has member
Inverse relation of 'denotes', where denotation is what happens when someone creates an information content entity E in order to specifically refer to something (from 'denotes' definition).
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
is denoted by
the relationship between a fraction and the number below the line (or divisor)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB
has denominator
has effect on
has fixed effect on
has interaction effect on
has random effect on
has order in sequence
Relationship between a parameter of a model and the estimate produced by an estimation process, as used in statistical modeling.
estimate of
computed_from is a relation between two information content entities denoting how one is derived from the other through the application of a data transformation or computation process.
computed from
is model for
is modeled by
has measurement value
has x coordinate value
has y coordinate value
has specified numeric value
A relation between a value specification and a number that quantifies it.
A range of 'real' might be better than 'float'. For now we follow 'has measurement value' until we can consider technical issues with SPARQL queries and reasoning.
PERSON: James A. Overton
OBI
has specified numeric value
has specified value
A relation between a value specification and a literal.
This is not an RDF/OWL object property. It is intended to link a value found in e.g. a database column of 'M' (the literal) to an instance of a value specification class, which can then be linked to indicate that this is about the biological gender of a human subject.
OBI
has specified value
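The editor note above describes 'has specified value' as a link from a value specification instance to a literal found in, e.g., a database column. A minimal sketch, with illustrative names only:

```python
# Sketch of 'has specified value' as a data property: a categorical
# value specification carries the literal found in a database column
# (e.g. 'M'), while a separate link records what the specification
# is about. Class and field names are hypothetical.

class CategoricalValueSpecification:
    def __init__(self, specified_value, is_about):
        self.has_specified_value = specified_value  # the literal
        self.is_about = is_about                    # what it specifies

gender_spec = CategoricalValueSpecification(
    "M", "biological gender of subject 001")
assert gender_spec.has_specified_value == "M"
```

The point of the indirection is that the bare literal 'M' acquires meaning only through the value specification instance it is attached to.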
A relationship (data property) between an entity and its value.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
has value
entity
Entity
Julius Caesar
Verdi’s Requiem
the Second World War
your body mass index
BFO 2 Reference: In all areas of empirical inquiry we encounter general terms of two sorts. First are general terms which refer to universals or types: animal, tuberculosis, surgical procedure, disease. Second are general terms used to refer to groups of entities which instantiate a given universal but do not correspond to the extension of any subuniversal of that universal because there is nothing intrinsic to the entities in question by virtue of which they – and only they – are counted as belonging to the given group. Examples are: animal purchased by the Emperor, tuberculosis diagnosed on a Wednesday, surgical procedure performed on a patient from Stockholm, person identified as candidate for clinical trial #2056-555, person who is signatory of Form 656-PPV, painting by Leonardo da Vinci. Such terms, which represent what are called ‘specializations’ in [81].
Entity doesn't have a closure axiom because the subclasses don't necessarily exhaust all possibilities. For example Werner Ceusters' 'portions of reality' include 4 sorts: entities (as BFO construes them), universals, configurations, and relations. It is an open question as to whether entities as construed in BFO will at some point also include these other portions of reality. See, for example, 'How to track absolutely everything' at http://www.referent-tracking.com/_RTU/papers/CeustersICbookRevised.pdf
An entity is anything that exists or has existed or will exist. (axiom label in BFO2 Reference: [001-001])
entity
continuant
Continuant
An entity that exists in full at any time in which it exists at all, persists through time while maintaining its identity and has no temporal parts.
BFO 2 Reference: Continuant entities are entities which can be sliced to yield parts only along the spatial dimension, yielding for example the parts of your table which we call its legs, its top, its nails. ‘My desk stretches from the window to the door. It has spatial parts, and can be sliced (in space) in two. With respect to time, however, a thing is a continuant.’ [60, p. 240]
Continuant doesn't have a closure axiom because the subclasses don't necessarily exhaust all possibilities. For example, in an expansion involving bringing in some of Ceusters' other portions of reality, questions are raised as to whether universals are continuants
A continuant is an entity that persists, endures, or continues to exist through time while maintaining its identity. (axiom label in BFO2 Reference: [008-002])
if b is a continuant and if, for some t, c has_continuant_part b at t, then c is a continuant. (axiom label in BFO2 Reference: [126-001])
if b is a continuant and if, for some t, c is continuant_part of b at t, then c is a continuant. (axiom label in BFO2 Reference: [009-002])
if b is a material entity, then there is some temporal interval (referred to below as a one-dimensional temporal region) during which b exists. (axiom label in BFO2 Reference: [011-002])
(forall (x y) (if (and (Continuant x) (exists (t) (continuantPartOfAt y x t))) (Continuant y))) // axiom label in BFO2 CLIF: [009-002]
(forall (x y) (if (and (Continuant x) (exists (t) (hasContinuantPartOfAt y x t))) (Continuant y))) // axiom label in BFO2 CLIF: [126-001]
(forall (x) (if (Continuant x) (Entity x))) // axiom label in BFO2 CLIF: [008-002]
(forall (x) (if (MaterialEntity x) (exists (t) (and (TemporalRegion t) (existsAt x t))))) // axiom label in BFO2 CLIF: [011-002]
continuant
occurrent
Occurrent
An entity that has temporal parts and that happens, unfolds or develops through time.
BFO 2 Reference: every occurrent that is not a temporal or spatiotemporal region is s-dependent on some independent continuant that is not a spatial region
BFO 2 Reference: s-dependence obtains between every process and its participants in the sense that, as a matter of necessity, this process could not have existed unless these or those participants existed also. A process may have a succession of participants at different phases of its unfolding. Thus there may be different players on the field at different times during the course of a football game; but the process which is the entire game s-depends_on all of these players nonetheless. Some temporal parts of this process will s-depend_on on only some of the players.
Occurrent doesn't have a closure axiom because the subclasses don't necessarily exhaust all possibilities. An example would be the sum of a process and the process boundary of another process.
Simons uses different terminology for relations of occurrents to regions: Denote the spatio-temporal location of a given occurrent e by 'spn[e]' and call this region its span. We may say an occurrent is at its span, in any larger region, and covers any smaller region. Now suppose we have fixed a frame of reference so that we can speak not merely of spatio-temporal but also of spatial regions (places) and temporal regions (times). The spread of an occurrent (relative to a frame of reference) is the space it exactly occupies, and its spell is likewise the time it exactly occupies. We write 'spr[e]' and 'spl[e]' respectively for the spread and spell of e, omitting mention of the frame.
An occurrent is an entity that unfolds itself in time or it is the instantaneous boundary of such an entity (for example a beginning or an ending) or it is a temporal or spatiotemporal region which such an entity occupies_temporal_region or occupies_spatiotemporal_region. (axiom label in BFO2 Reference: [077-002])
Every occurrent occupies_spatiotemporal_region some spatiotemporal region. (axiom label in BFO2 Reference: [108-001])
b is an occurrent entity iff b is an entity that has temporal parts. (axiom label in BFO2 Reference: [079-001])
(forall (x) (if (Occurrent x) (exists (r) (and (SpatioTemporalRegion r) (occupiesSpatioTemporalRegion x r))))) // axiom label in BFO2 CLIF: [108-001]
(forall (x) (iff (Occurrent x) (and (Entity x) (exists (y) (temporalPartOf y x))))) // axiom label in BFO2 CLIF: [079-001]
occurrent
ic
IndependentContinuant
a chair
a heart
a leg
a molecule
a spatial region
an atom
an orchestra.
an organism
the bottom right portion of a human torso
the interior of your mouth
A continuant that is a bearer of qualities and realizable entities, in which other entities inhere and which itself cannot inhere in anything.
b is an independent continuant = Def. b is a continuant which is such that there is no c and no t such that b s-depends_on c at t. (axiom label in BFO2 Reference: [017-002])
For any independent continuant b and any time t there is some spatial region r such that b is located_in r at t. (axiom label in BFO2 Reference: [134-001])
For every independent continuant b and time t during the region of time spanned by its life, there are entities which s-depends_on b during t. (axiom label in BFO2 Reference: [018-002])
(forall (x t) (if (IndependentContinuant x) (exists (r) (and (SpatialRegion r) (locatedInAt x r t))))) // axiom label in BFO2 CLIF: [134-001]
(forall (x t) (if (and (IndependentContinuant x) (existsAt x t)) (exists (y) (and (Entity y) (specificallyDependsOnAt y x t))))) // axiom label in BFO2 CLIF: [018-002]
(iff (IndependentContinuant a) (and (Continuant a) (not (exists (b t) (specificallyDependsOnAt a b t))))) // axiom label in BFO2 CLIF: [017-002]
independent continuant
s-region
SpatialRegion
BFO 2 Reference: Spatial regions do not participate in processes.
Spatial region doesn't have a closure axiom because the subclasses don't exhaust all possibilities. An example would be the union of a spatial point and a spatial line that doesn't overlap the point, or two spatial lines that intersect at a single point. In both cases the resultant spatial region is neither 0-dimensional, 1-dimensional, 2-dimensional, nor 3-dimensional.
A spatial region is a continuant entity that is a continuant_part_of spaceR as defined relative to some frame R. (axiom label in BFO2 Reference: [035-001])
All continuant parts of spatial regions are spatial regions. (axiom label in BFO2 Reference: [036-001])
(forall (x y t) (if (and (SpatialRegion x) (continuantPartOfAt y x t)) (SpatialRegion y))) // axiom label in BFO2 CLIF: [036-001]
(forall (x) (if (SpatialRegion x) (Continuant x))) // axiom label in BFO2 CLIF: [035-001]
spatial region
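The CLIF axioms above are universally quantified, so over a finite toy model they can be checked directly. A sketch for axiom [036-001] ("all continuant parts of spatial regions are spatial regions"), with hypothetical instances:

```python
# Checking CLIF axiom [036-001] over a finite toy model: every
# continuant part of a spatial region must itself be a spatial
# region. Instance names and part-of pairs are illustrative only.

spatial_regions = {"r", "r_left", "r_right"}
continuant_part_of = {("r_left", "r"), ("r_right", "r")}  # (part, whole)

def axiom_036_001(regions, parts):
    """forall x y: SpatialRegion(x) & continuantPartOf(y, x)
    -> SpatialRegion(y)"""
    return all(y in regions for (y, x) in parts if x in regions)

assert axiom_036_001(spatial_regions, continuant_part_of)
```

Adding a part of "r" that is not itself listed as a spatial region would make the check fail, which is exactly the kind of violation the axiom rules out.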
2d-s-region
TwoDimensionalSpatialRegion
an infinitely thin plane in space.
the surface of a sphere-shaped part of space
A two-dimensional spatial region is a spatial region that is of two dimensions. (axiom label in BFO2 Reference: [039-001])
(forall (x) (if (TwoDimensionalSpatialRegion x) (SpatialRegion x))) // axiom label in BFO2 CLIF: [039-001]
two-dimensional spatial region
process
Process
a process of cell-division
a beating of the heart
a process of meiosis
a process of sleeping
the course of a disease
the flight of a bird
the life of an organism
your process of aging.
An occurrent that has temporal proper parts and for some time t, p s-depends_on some material entity at t.
p is a process = Def. p is an occurrent that has temporal proper parts and for some time t, p s-depends_on some material entity at t. (axiom label in BFO2 Reference: [083-003])
BFO 2 Reference: The realm of occurrents is less pervasively marked by the presence of natural units than is the case in the realm of independent continuants. Thus there is here no counterpart of ‘object’. In BFO 1.0 ‘process’ served as such a counterpart. In BFO 2.0 ‘process’ is, rather, the occurrent counterpart of ‘material entity’. Those natural – as contrasted with engineered, which here means: deliberately executed – units which do exist in the realm of occurrents are typically either parasitic on the existence of natural units on the continuant side, or they are fiat in nature. Thus we can count lives; we can count football games; we can count chemical reactions performed in experiments or in chemical manufacturing. We cannot count the processes taking place, for instance, in an episode of insect mating behavior. Even where natural units are identifiable, for example cycles in a cyclical process such as the beating of a heart or an organism’s sleep/wake cycle, the processes in question form a sequence with no discontinuities (temporal gaps) of the sort that we find for instance where billiard balls or zebrafish or planets are separated by clear spatial gaps. Lives of organisms are process units, but they too unfold in a continuous series from other, prior processes such as fertilization, and they unfold in turn in continuous series of post-life processes such as post-mortem decay. Clear examples of boundaries of processes are almost always of the fiat sort (midnight, a time of death as declared in an operating theater or on a death certificate, the initiation of a state of war)
(iff (Process a) (and (Occurrent a) (exists (b) (properTemporalPartOf b a)) (exists (c t) (and (MaterialEntity c) (specificallyDependsOnAt a c t))))) // axiom label in BFO2 CLIF: [083-003]
process
disposition
Disposition
an atom of element X has the disposition to decay to an atom of element Y
certain people have a predisposition to colon cancer
children are innately disposed to categorize objects in certain ways.
the cell wall is disposed to filter chemicals in endocytosis and exocytosis
BFO 2 Reference: Dispositions exist along a strength continuum. Weaker forms of disposition are realized in only a fraction of triggering cases. These forms occur in a significant number of cases of a similar type.
b is a disposition means: b is a realizable entity & b’s bearer is some material entity & b is such that if it ceases to exist, then its bearer is physically changed, & b’s realization occurs when and because this bearer is in some special physical circumstances, & this realization occurs in virtue of the bearer’s physical make-up. (axiom label in BFO2 Reference: [062-002])
If b is a realizable entity then for all t at which b exists, b s-depends_on some material entity at t. (axiom label in BFO2 Reference: [063-002])
(forall (x t) (if (and (RealizableEntity x) (existsAt x t)) (exists (y) (and (MaterialEntity y) (specificallyDepends x y t))))) // axiom label in BFO2 CLIF: [063-002]
(forall (x) (if (Disposition x) (and (RealizableEntity x) (exists (y) (and (MaterialEntity y) (bearerOfAt x y t)))))) // axiom label in BFO2 CLIF: [062-002]
disposition
realizable
RealizableEntity
the disposition of this piece of metal to conduct electricity.
the disposition of your blood to coagulate
the function of your reproductive organs
the role of being a doctor
the role of this boundary to delineate where Utah and Colorado meet
A specifically dependent continuant that inheres in continuant entities and is not exhibited in full at every time in which it inheres in an entity or group of entities. The exhibition or actualization of a realizable entity is a particular manifestation, functioning or process that occurs under certain circumstances.
To say that b is a realizable entity is to say that b is a specifically dependent continuant that inheres in some independent continuant which is not a spatial region and is of a type instances of which are realized in processes of a correlated type. (axiom label in BFO2 Reference: [058-002])
All realizable dependent continuants have independent continuants that are not spatial regions as their bearers. (axiom label in BFO2 Reference: [060-002])
(forall (x t) (if (RealizableEntity x) (exists (y) (and (IndependentContinuant y) (not (SpatialRegion y)) (bearerOfAt y x t))))) // axiom label in BFO2 CLIF: [060-002]
(forall (x) (if (RealizableEntity x) (and (SpecificallyDependentContinuant x) (exists (y) (and (IndependentContinuant y) (not (SpatialRegion y)) (inheresIn x y)))))) // axiom label in BFO2 CLIF: [058-002]
realizable entity
0d-s-region
ZeroDimensionalSpatialRegion
A zero-dimensional spatial region is a point in space. (axiom label in BFO2 Reference: [037-001])
(forall (x) (if (ZeroDimensionalSpatialRegion x) (SpatialRegion x))) // axiom label in BFO2 CLIF: [037-001]
zero-dimensional spatial region
quality
Quality
the ambient temperature of this portion of air
the color of a tomato
the length of the circumference of your waist
the mass of this piece of gold.
the shape of your nose
the shape of your nostril
a quality is a specifically dependent continuant that, in contrast to roles and dispositions, does not require any further process in order to be realized. (axiom label in BFO2 Reference: [055-001])
If an entity is a quality at any time that it exists, then it is a quality at every time that it exists. (axiom label in BFO2 Reference: [105-001])
(forall (x) (if (Quality x) (SpecificallyDependentContinuant x))) // axiom label in BFO2 CLIF: [055-001]
(forall (x) (if (exists (t) (and (existsAt x t) (Quality x))) (forall (t_1) (if (existsAt x t_1) (Quality x))))) // axiom label in BFO2 CLIF: [105-001]
quality
sdc
SpecificallyDependentContinuant
Reciprocal specifically dependent continuants: the function of this key to open this lock and the mutually dependent disposition of this lock: to be opened by this key
of one-sided specifically dependent continuants: the mass of this tomato
of relational dependent continuants (multiple bearers): John’s love for Mary, the ownership relation between John and this statue, the relation of authority between John and his subordinates.
the disposition of this fish to decay
the function of this heart: to pump blood
the mutual dependence of proton donors and acceptors in chemical reactions [79]
the mutual dependence of the role predator and the role prey as played by two organisms in a given interaction
the pink color of a medium rare piece of grilled filet mignon at its center
the role of being a doctor
the shape of this hole.
the smell of this portion of mozzarella
A continuant that inheres in or is borne by other entities. Every instance of A requires some specific instance of B which must always be the same.
b is a relational specifically dependent continuant = Def. b is a specifically dependent continuant and there are n > 1 independent continuants c1, … cn which are not spatial regions and are such that for all 1 ≤ i < j ≤ n, ci and cj share no common parts, and are such that for each 1 ≤ i ≤ n, b s-depends_on ci at every time t during the course of b’s existence (axiom label in BFO2 Reference: [131-004])
b is a specifically dependent continuant = Def. b is a continuant & there is some independent continuant c which is not a spatial region and which is such that b s-depends_on c at every time t during the course of b’s existence. (axiom label in BFO2 Reference: [050-003])
Specifically dependent continuant doesn't have a closure axiom because the subclasses don't necessarily exhaust all possibilities. We're not sure what else will develop here, but for example there are questions such as what promises, obligations, etc. are.
(iff (RelationalSpecificallyDependentContinuant a) (and (SpecificallyDependentContinuant a) (forall (t) (exists (b c) (and (not (SpatialRegion b)) (not (SpatialRegion c)) (not (= b c)) (not (exists (d) (and (continuantPartOfAt d b t) (continuantPartOfAt d c t)))) (specificallyDependsOnAt a b t) (specificallyDependsOnAt a c t)))))) // axiom label in BFO2 CLIF: [131-004]
(iff (SpecificallyDependentContinuant a) (and (Continuant a) (forall (t) (if (existsAt a t) (exists (b) (and (IndependentContinuant b) (not (SpatialRegion b)) (specificallyDependsOnAt a b t))))))) // axiom label in BFO2 CLIF: [050-003]
specifically dependent continuant
role
Role
John’s role of husband to Mary is dependent on Mary’s role of wife to John, and both are dependent on the object aggregate comprising John and Mary as member parts joined together through the relational quality of being married.
the priest role
the role of a boundary to demarcate two neighboring administrative territories
the role of a building in serving as a military target
the role of a stone in marking a property boundary
the role of subject in a clinical trial
the student role
A realizable entity the manifestation of which brings about some result or end that is not essential to a continuant in virtue of the kind of thing that it is but that can be served or participated in by that kind of continuant in some kinds of natural, social or institutional contexts.
BFO 2 Reference: One major family of examples of non-rigid universals involves roles, and ontologies developed for corresponding administrative purposes may consist entirely of representatives of entities of this sort. Thus ‘professor’, defined as follows: b instance_of professor at t =Def. there is some c, c instance_of professor role & c inheres_in b at t, denotes a non-rigid universal and so also do ‘nurse’, ‘student’, ‘colonel’, ‘taxpayer’, and so forth. (These terms are all, in the jargon of philosophy, phase sortals.) By using role terms in definitions, we can create a BFO conformant treatment of such entities drawing on the fact that, while an instance of professor may be simultaneously an instance of trade union member, no instance of the type professor role is also (at any time) an instance of the type trade union member role (any more than any instance of the type color is at any time an instance of the type length). If an ontology of employment positions should be defined in terms of roles following the above pattern, this enables the ontology to do justice to the fact that individuals instantiate the corresponding universals – professor, sergeant, nurse – only during certain phases in their lives.
b is a role means: b is a realizable entity & b exists because there is some single bearer that is in some special physical, social, or institutional set of circumstances in which this bearer does not have to be & b is not such that, if it ceases to exist, then the physical make-up of the bearer is thereby changed. (axiom label in BFO2 Reference: [061-001])
(forall (x) (if (Role x) (RealizableEntity x))) // axiom label in BFO2 CLIF: [061-001]
role
1d-s-region
OneDimensionalSpatialRegion
an edge of a cube-shaped portion of space.
A one-dimensional spatial region is a line or aggregate of lines stretching from one point in space to another. (axiom label in BFO2 Reference: [038-001])
(forall (x) (if (OneDimensionalSpatialRegion x) (SpatialRegion x))) // axiom label in BFO2 CLIF: [038-001]
one-dimensional spatial region
3d-s-region
ThreeDimensionalSpatialRegion
a cube-shaped region of space
a sphere-shaped region of space,
A three-dimensional spatial region is a spatial region that is of three dimensions. (axiom label in BFO2 Reference: [040-001])
(forall (x) (if (ThreeDimensionalSpatialRegion x) (SpatialRegion x))) // axiom label in BFO2 CLIF: [040-001]
three-dimensional spatial region
gdc
GenericallyDependentContinuant
The entries in your database are patterns instantiated as quality instances in your hard drive. The database itself is an aggregate of such patterns. When you create the database you create a particular instance of the generically dependent continuant type database. Each entry in the database is an instance of the generically dependent continuant type IAO: information content entity.
the pdf file on your laptop, the pdf file that is a copy thereof on my laptop
the sequence of this protein molecule; the sequence that is a copy thereof in that protein molecule.
A continuant that is dependent on one or more independent continuant bearers. Every instance of A requires some instance of (an independent continuant type) B, but which instance of B serves can change from time to time.
b is a generically dependent continuant = Def. b is a continuant that g-depends_on one or more other entities. (axiom label in BFO2 Reference: [074-001])
(iff (GenericallyDependentContinuant a) (and (Continuant a) (exists (b t) (genericallyDependsOnAt a b t)))) // axiom label in BFO2 CLIF: [074-001]
generically dependent continuant
function
Function
the function of a hammer to drive in nails
the function of a heart pacemaker to regulate the beating of a heart through electricity
the function of amylase in saliva to break down starch into sugar
BFO 2 Reference: In the past, we have distinguished two varieties of function, artifactual function and biological function. These are not asserted subtypes of BFO:function however, since the same function – for example: to pump, to transport – can exist both in artifacts and in biological entities. The asserted subtypes of function that would be needed in order to yield a separate monohierarchy are not artifactual function, biological function, etc., but rather transporting function, pumping function, etc.
A function is a disposition that exists in virtue of the bearer’s physical make-up and this physical make-up is something the bearer possesses because it came into being, either through evolution (in the case of natural biological entities) or through intentional design (in the case of artifacts), in order to realize processes of a certain sort. (axiom label in BFO2 Reference: [064-001])
(forall (x) (if (Function x) (Disposition x))) // axiom label in BFO2 CLIF: [064-001]
function
material
MaterialEntity
a flame
a forest fire
a human being
a hurricane
a photon
a puff of smoke
a sea wave
a tornado
an aggregate of human beings.
an energy wave
an epidemic
the undetached arm of a human being
An independent continuant that is spatially extended, whose identity is independent of that of other entities, and which can be maintained through time.
BFO 2 Reference: Material entities (continuants) can preserve their identity even while gaining and losing material parts. Continuants are contrasted with occurrents, which unfold themselves in successive temporal parts or phases [60]
BFO 2 Reference: Object, Fiat Object Part and Object Aggregate are not intended to be exhaustive of Material Entity. Users are invited to propose new subcategories of Material Entity.
BFO 2 Reference: ‘Matter’ is intended to encompass both mass and energy (we will address the ontological treatment of portions of energy in a later version of BFO). A portion of matter is anything that includes elementary particles among its proper or improper parts: quarks and leptons, including electrons, as the smallest particles thus far discovered; baryons (including protons and neutrons) at a higher level of granularity; atoms and molecules at still higher levels, forming the cells, organs, organisms and other material entities studied by biologists, the portions of rock studied by geologists, the fossils studied by paleontologists, and so on. Material entities are three-dimensional entities (entities extended in three spatial dimensions), as contrasted with the processes in which they participate, which are four-dimensional entities (entities extended also along the dimension of time). According to the FMA, material entities may have immaterial entities as parts – including the entities identified below as sites; for example the interior (or ‘lumen’) of your small intestine is a part of your body. BFO 2.0 embodies a decision to follow the FMA here.
A material entity is an independent continuant that has some portion of matter as proper or improper continuant part. (axiom label in BFO2 Reference: [019-002])
Every entity which has a material entity as continuant part is a material entity. (axiom label in BFO2 Reference: [020-002])
every entity of which a material entity is continuant part is also a material entity. (axiom label in BFO2 Reference: [021-002])
(forall (x) (if (MaterialEntity x) (IndependentContinuant x))) // axiom label in BFO2 CLIF: [019-002]
(forall (x) (if (and (Entity x) (exists (y t) (and (MaterialEntity y) (continuantPartOfAt x y t)))) (MaterialEntity x))) // axiom label in BFO2 CLIF: [021-002]
(forall (x) (if (and (Entity x) (exists (y t) (and (MaterialEntity y) (continuantPartOfAt y x t)))) (MaterialEntity x))) // axiom label in BFO2 CLIF: [020-002]
material entity
immaterial
ImmaterialEntity
BFO 2 Reference: Immaterial entities are divided into two subgroups: boundaries and sites, which bound, or are demarcated in relation to, material entities, and which can thus change location, shape and size as their material hosts move or change shape or size (for example: your nasal passage; the hold of a ship; the boundary of Wales, which moves with the rotation of the Earth) [38, 7, 10]
immaterial entity
peptide
Amide derived from two or more amino carboxylic acid molecules (the same or different) by formation of a covalent bond from the carbonyl carbon of one to the nitrogen atom of another with formal loss of water. The term is usually applied to structures formed from alpha-amino acids, but it includes those derived from any amino carboxylic acid. X = OH, OR, NH2, NHR, etc.
peptide
deoxyribonucleic acid
High molecular weight, linear polymers, composed of nucleotides containing deoxyribose and linked by phosphodiester bonds; DNA contains the genetic information of organisms.
deoxyribonucleic acid
molecular entity
Any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer etc., identifiable as a separately distinguishable entity.
We are assuming that every molecular entity has to be completely connected by chemical bonds. This excludes protein complexes, which comprise at least two separate molecular entities. We will follow up with ChEBI to ensure this is their understanding as well.
molecular entity
atom
A chemical entity constituting the smallest component of an element having the chemical properties of the element.
atom
nucleic acid
A macromolecule made up of nucleotide units and hydrolysable into certain pyrimidine or purine bases (usually adenine, cytosine, guanine, thymine, uracil), D-ribose or 2-deoxy-D-ribose and phosphoric acid.
nucleic acid
ribonucleic acid
High molecular weight, linear polymers, composed of nucleotides containing ribose and linked by phosphodiester bonds; RNA is central to the synthesis of proteins.
ribonucleic acid
macromolecule
A macromolecule is a molecule of high relative molecular mass, the structure of which essentially comprises the multiple repetition of units derived, actually or conceptually, from molecules of low relative molecular mass.
polymer
macromolecule
cell
cell
PMID:18089833. Cancer Res. 2007 Dec 15;67(24):12018-25. "...Epithelial cells were harvested from histologically confirmed adenocarcinomas .."
A material entity of anatomical origin (part of or deriving from an organism) that has as its parts a maximally connected cell compartment surrounded by a plasma membrane.
cell
cell
cultured cell
A cell in vitro that is or has been maintained or propagated as part of a cell culture.
cultured cell
experimentally modified cell in vitro
A cell in vitro that has undergone physical changes as a consequence of a deliberate and specific experimental procedure.
experimentally modified cell in vitro
molecular_function
A molecular process that can be carried out by the action of a single macromolecular machine, usually via direct physical interactions with other molecular entities. Function in this sense denotes an action, or activity, that a gene product (or a complex) performs. These actions are described from two distinct but related perspectives: (1) biochemical activity, and (2) role as a component in a larger system/process.
GO:molecular_function
catalytic activity
Catalysis of a biochemical reaction at physiological temperatures. In biologically catalyzed reactions, the reactants are known as substrates, and the catalysts are naturally occurring macromolecular substances known as enzymes. Enzymes possess specific binding sites for substrates, and are usually composed wholly or largely of protein, but RNA that has catalytic activity (ribozyme) is often also regarded as enzymatic.
catalytic activity
biological_process
A biological process represents a specific objective that the organism is genetically programmed to achieve. Biological processes are often described by their outcome or ending state, e.g., the biological process of cell division results in the creation of two daughter cells (a divided cell) from a single parent cell. A biological process is accomplished by a particular set of molecular functions carried out by specific gene products (or macromolecular complexes), often in a highly regulated manner and in a particular temporal sequence.
biological_process
gene expression
The process in which a gene's sequence is converted into a mature gene product or products (proteins or RNA). This includes the production of an RNA transcript as well as any processing to produce a mature RNA product or an mRNA or circRNA (for protein-coding genes) and the translation of that mRNA or circRNA into protein. Protein maturation is included when required to form an active form of a product from an inactive precursor form.
gene expression
protein complex
A ribosome is a protein complex
A stable macromolecular complex composed (only) of two or more polypeptide subunits along with any covalently attached molecules (such as lipid anchors or oligosaccharide) or non-protein prosthetic groups (such as nucleotides or metal ions). Prosthetic group in this context refers to a tightly bound cofactor. The component polypeptide subunits may be identical.
protein complex
conditional specification
a directive information entity that specifies what should happen if the trigger condition is fulfilled
PlanAndPlannedProcess Branch
OBI branch derived
OBI_0000349
conditional specification
measurement unit label
Examples of measurement unit labels are liters, inches, weight per volume.
A measurement unit label is a label that is part of a scalar measurement datum and denotes a unit of measure.
2009-03-16: provenance: a term measurement unit was proposed for OBI (OBI_0000176), edited by Chris Stoeckert and Cristian Cocos, and subsequently moved to IAO, where the objective for which the original term was defined was satisfied with the definition of this, different, term.
2009-03-16: review of this term done during the OBI workshop winter 2009 and the current definition was considered acceptable for use in OBI. If there is a need to modify this definition please notify OBI.
PERSON: Alan Ruttenberg
PERSON: Melanie Courtot
measurement unit label
objective specification
In the protocol of a ChIP assay the objective specification says to identify protein and DNA interaction.
a directive information entity that describes an intended process endpoint. When part of a plan specification the concretization is realized in a planned process in which the bearer tries to effect the world so that the process endpoint is achieved.
2009-03-16: original definition when imported from OBI read: "objective is an non realizable information entity which can serve as that proper part of a plan towards which the realization of the plan is directed."
2014-03-31: In the example of usage ("In the protocol of a ChIP assay the objective specification says to identify protein and DNA interaction") there is a protocol which is the ChIP assay protocol. In addition to being concretized on paper, the protocol can be concretized as a realizable entity, such as a plan that inheres in a person. The objective specification is the part that says that some protein and DNA interactions are identified. This is a specification of a process endpoint: the boundary in the process before which they are not identified and after which they are. During the realization of the plan, the goal is to get to the point of having the interactions, and participants in the realization of the plan try to do that.
Answers the question, why did you do this experiment?
PERSON: Alan Ruttenberg
PERSON: Barry Smith
PERSON: Bjoern Peters
PERSON: Jennifer Fostel
goal specification
OBI Plan and Planned Process/Roles Branch
OBI_0000217
objective specification
Pour the contents of flask 1 into flask 2
a directive information entity that describes an action the bearer will take
Alan Ruttenberg
OBI Plan and Planned Process branch
action specification
datum label
A label is a symbol that is part of some other datum and is used either to partially define the denotation of that datum or to provide a means for identifying the datum as a member of the set of data with the same label.
http://www.golovchenko.org/cgi-bin/wnsearch?q=label#4n
GROUP: IAO
9/22/11 BP: changed the rdfs:label for this class from 'label' to 'datum label' to convey that this class is not intended to cover all kinds of labels (stickers, radiolabels, etc.), and not even all kinds of textual labels, but rather the kind of labels occurring in a datum.
datum label
information carrier
In the case of a printed paperback novel the physicality of the ink and of the paper form part of the information bearer. The qualities of appearing black and having a certain pattern for the ink and appearing white for the paper form part of the information carrier in this case.
A quality of an information bearer that imparts the information content
12/15/09: There is a concern that some ways that carry information may be processes rather than qualities, such as in a 'delayed wave carrier'.
2014-03-10: We are not certain that all information carriers are qualities. There was a discussion of dropping it.
PERSON: Alan Ruttenberg
Smith, Ceusters, Ruttenberg, 2000 years of philosophy
information carrier
data item
Data items include counts of things, analyte concentrations, and statistical summaries.
a data item is an information content entity that is intended to be a truthful statement about something (modulo, e.g., measurement precision or other systematic errors) and is constructed/acquired by a method which reliably tends to produce (approximately) truthful statements.
2/2/2009 Alan and Bjoern discussing FACS run output data. This is a data item because it is about the cell population. Each element records an event and is typically further composed of a set of measurement data items that record the fluorescent intensity stimulated by one of the lasers.
2009-03-16: data item deliberately ambiguous: we merged data set and datum to be one entity, not knowing how to define singular versus plural. So data item is more general than datum.
2009-03-16: removed datum as alternative term as datum specifically refers to singular form, and is thus not an exact synonym.
2014-03-31: See discussion at http://odontomachus.wordpress.com/2014/03/30/aboutness-objects-propositions/
JAR: datum -- well, this will be very tricky to define, but maybe some information-like stuff that might be put into a computer and that is meant, by someone, to denote and/or to be interpreted by some process... I would include lists, tables, sentences... I think I might defer to Barry, or to Brian Cantwell Smith
JAR: A data item is an approximately justified approximately true approximate belief
PERSON: Alan Ruttenberg
PERSON: Chris Stoeckert
PERSON: Jonathan Rees
data
data item
symbol
a serial number such as "12324X"
a stop sign
a written proper name such as "OBI"
An information content entity that is a mark(s) or character(s) used as a conventional representation of another entity.
20091104, MC: this needs work and will most probably change
2014-03-31: We would like to have a deeper analysis of 'mark' and 'sign' in the future (see https://github.com/information-artifact-ontology/IAO/issues/154).
PERSON: James A. Overton
PERSON: Jonathan Rees
based on Oxford English Dictionary
symbol
information content entity
Examples of information content entites include journal articles, data, graphical layouts, and graphs.
A generically dependent continuant that is about some thing.
2014-03-10: The use of "thing" is intended to be general enough to include universals and configurations (see https://groups.google.com/d/msg/information-ontology/GBxvYZCk1oc/-L6B5fSBBTQJ).
information_content_entity 'is_encoded_in' some digital_entity in obi before split (040907). information_content_entity 'is_encoded_in' some physical_document in obi before split (040907).
Previous. An information content entity is a non-realizable information entity that 'is encoded in' some digital or physical entity.
PERSON: Chris Stoeckert
OBI_0000142
information content entity
1
1
10 feet. 3 ml.
a scalar measurement datum is a measurement datum that is composed of two parts, numerals and a unit label.
2009-03-16: we decided to keep datum singular in scalar measurement datum, as in this case we explicitly refer to the singular form
Would write this as: has_part some 'measurement unit label' and has_part some numeral and has_part exactly 2, except for the fact that this won't let us take advantage of OWL reasoning over the numbers. Instead use the has measurement value property to represent the same. Use has measurement unit label (subproperty of has_part) so we can easily say that there is only one of them.
PERSON: Alan Ruttenberg
PERSON: Melanie Courtot
scalar measurement datum
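As a rough illustration of the two-part structure described in the definition and editor note above (a sketch only; the class and attribute names below are hypothetical and not drawn from IAO or STATO), a scalar measurement datum such as "10 feet" can be modelled as a numeric value paired with a measurement unit label:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScalarMeasurementDatum:
    """Hypothetical sketch: a scalar measurement datum with its two parts,
    a numeric measurement value and a measurement unit label."""
    measurement_value: float  # stands in for the 'has measurement value' property
    unit_label: str           # stands in for the 'has measurement unit label' part

# "10 feet" and "3 ml" from the examples of usage above
height = ScalarMeasurementDatum(measurement_value=10.0, unit_label="foot")
volume = ScalarMeasurementDatum(measurement_value=3.0, unit_label="millilitre")
```

Keeping the numeric value separate from the unit label mirrors the editor note's point that a plain numeral plus unit string would not support reasoning over the numbers.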
An information content entity whose concretizations indicate to their bearer how to realize them in a process.
2009-03-16: provenance: a term realizable information entity was proposed for OBI (OBI_0000337), edited by the PlanAndPlannedProcess branch. Original definition was "is the specification of a process that can be concretized and realized by an actor" with alternative term "instruction". It has been subsequently moved to IAO, where the objective for which the original term was defined was satisfied with the definition of this, different, term.
2013-05-30 Alan Ruttenberg: What differentiates a directive information entity from an information concretization is that it can have concretizations that are either qualities or realizable entities. The concretizations that are realizable entities are created when an individual chooses to take up the direction, i.e. has the intention to (try to) realize it.
8/6/2009 Alan Ruttenberg: Changed label from "information entity about a realizable" after discussions at ICBO
Werner pushed back on calling it realizable information entity as it isn't realizable. However this name isn't right either. An example would be a recipe. The realizable entity would be a plan, but the information entity isn't about the plan, it, once concretized, *is* the plan. -Alan
PERSON: Alan Ruttenberg
PERSON: Bjoern Peters
directive information entity
dot plot
Dot plot of SSC-H and FSC-H.
A dot plot is a report graph which is a graphical representation of data where each data point is represented by a single dot placed on coordinates corresponding to data point values in particular dimensions.
person:Allyson Lister
person:Chris Stoeckert
OBI_0000123
group:OBI
dot plot
graph
A diagram that presents one or more tuples of information by mapping those tuples into a two-dimensional space in a non-arbitrary way.
PERSON: Lawrence Hunter
person:Alan Ruttenberg
person:Allyson Lister
OBI_0000240
group:OBI
graph
rule
example to be added
a rule is an executable that guides, defines, or restricts actions
MSI
PRS
OBI_0500021
PRS
rule
algorithm
PMID: 18378114.Genomics. 2008 Mar 28. LINKGEN: A new algorithm to process data in genetic linkage studies.
A plan specification which describes the inputs and outputs of mathematical functions as well as the workflow of execution for achieving a predefined objective. Algorithms are usually realized by means of implementation as computer programs for execution by automata.
Philippe Rocca-Serra
PlanAndPlannedProcess Branch
OBI_0000270
adapted from discussion on OBI list (Matthew Pocock, Christian Cocos, Alan Ruttenberg)
algorithm
curation status specification
The curation status of the term. The allowed values come from an enumerated list of predefined terms. See the specification of these instances for more detailed definitions of each enumerated value.
Better to represent curation as a process with parts and then relate labels to that process (in IAO meeting)
PERSON:Bill Bug
GROUP:OBI:<http://purl.obolibrary.org/obo/obi>
OBI_0000266
curation status specification
data set
Intensity values in a CEL file or from multiple CEL files comprise a data set (as opposed to the CEL files themselves).
A data item that is an aggregate of other data items of the same type that have something in common. Averages and distributions can be determined for data sets.
2009/10/23 Alan Ruttenberg. The intention is that this term represent collections of like data. So this isn't for, e.g., the whole contents of a CEL file, which includes parameters, metadata etc. This is more like Java arrays of a certain rather specific type.
2014-05-05: Data sets are aggregates and thus must include two or more data items. We have chosen not to add logical axioms to make this restriction.
person:Allyson Lister
person:Chris Stoeckert
OBI_0000042
group:OBI
data set
image
An image is an affine projection onto a two-dimensional surface of measurements of some quality of an entity or entities, repeated at regular intervals across a spatial range, where the measurements are represented as color and luminosity on the projected surface.
person:Alan Ruttenberg
person:Allyson Lister
person:Chris Stoeckert
OBI_0000030
group:OBI
image
data about an ontology part is a data item about a part of an ontology, for example a term
Person:Alan Ruttenberg
data about an ontology part
plan specification
PMID: 18323827. Nat Med. 2008 Mar;14(3):226. New plan proposed to help resolve conflicting medical advice.
A directive information entity with action specifications and objective specifications as parts that, when concretized, is realized in a process in which the bearer tries to achieve the objectives by taking the actions specified.
2009-03-16: provenance: a term a plan was proposed for OBI (OBI_0000344), edited by the PlanAndPlannedProcess branch. Original definition was "a plan is a specification of a process that is realized by an actor to achieve the objective specified as part of the plan". It has been subsequently moved to IAO, where the objective for which the original term was defined was satisfied with the definition of this, different, term.
2014-03-31: A plan specification can have other parts, such as conditional specifications.
Alternative previous definition: a plan is a set of instructions that specify how an objective should be achieved
Alan Ruttenberg
OBI Plan and Planned Process branch
OBI_0000344
2/3/2009 Comment from OBI review.
Action specification not well enough specified.
Conditional specification not well enough specified.
Question whether all plan specifications have objective specifications.
Request that IAO either clarify these or change definitions not to use them
plan specification
measurement datum
Examples of measurement data are the recording of the weight of a mouse as {40,mass,"grams"}, the recording of an observation of the behavior of the mouse {,process,"agitated"}, the recording of the expression level of a gene as measured through the process of microarray experiment {3.4,luminosity,}.
A measurement datum is an information content entity that is a recording of the output of a measurement such as produced by a device.
2/2/2009 is_specified_output of some assay?
person:Chris Stoeckert
OBI_0000305
group:OBI
measurement datum
version number
A version number is an information content entity which is a sequence of characters borne by part of each of a class of manufactured products or its packaging and indicates its order within a set of other products having the same name.
Note: we feel that at the moment we are happy with a general version number, and that we will subclass as needed in the future. For example, see 7. genome sequence version
GROUP: IAO
version number
conclusion textual entity
that fucoidan has a small statistically significant effect on AT3 level but no useful clinical effect as in-vivo anticoagulant, a paraphrase of part of the last paragraph of the discussion section of the paper 'Pilot clinical study to evaluate the anticoagulant activity of fucoidan', by Lowenthal et al. PMID:19696660
A textual entity that expresses the results of reasoning about a problem, for instance as typically found towards the end of scientific papers.
2009/09/28 Alan Ruttenberg. Fucoidan-use-case
2009/10/23 Alan Ruttenberg: We need to work on the definition still
Person:Alan Ruttenberg
conclusion textual entity
scatter plot
Comparison of gene expression values in two samples can be displayed in a scatter plot
A scatterplot is a graph which uses Cartesian coordinates to display values for two variables for a set of data. The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
PERSON:Chris Stoeckert
PERSON:James Malone
PERSON:Melanie Courtot
scattergraph
WEB: http://en.wikipedia.org/wiki/Scatterplot
scatter plot
textual entity
Words, sentences, paragraphs, and the written (non-figure) parts of publications are all textual entities
A textual entity is a part of a manifestation (FRBR sense), a generically dependent continuant whose concretizations are patterns of glyphs intended to be interpreted as words, formulas, etc.
AR, (IAO call 2009-09-01): a document as a whole is not typically a textual entity, because it has pictures in it - rather there are parts of it that are textual entities. Examples: The title, paragraph 2 sentence 7, etc.
MC, 2009-09-14 (following IAO call 2009-09-01): textual entities live at the FRBR (http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records) manifestation level. Everything is significant: line break, pdf and html versions of same document are different textual entities.
PERSON: Lawrence Hunter
text
textual entity
table
| T F
--+-----
T | T F
F | F F
A textual entity that contains a two-dimensional arrangement of texts repeated at regular intervals across a spatial range, such that the spatial relationships among the constituent texts express propositions
PERSON: Lawrence Hunter
table
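The truth table in the example above can be read as a worked instance of the definition: the row and column position of each 'T'/'F' cell jointly express a proposition (here, the value of logical AND for the two operands). A minimal sketch, purely illustrative:

```python
# Encode the AND truth table from the example: the row and column keys are
# the two operands, and the cell at (row, column) expresses the proposition
# "row AND column". The spatial arrangement carries the meaning; the same
# four letters in a flat list would not express these propositions.
truth_table = {
    ("T", "T"): "T",
    ("T", "F"): "F",
    ("F", "T"): "F",
    ("F", "F"): "F",
}
```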
figure
Any picture, diagram or table
An information content entity consisting of a two dimensional arrangement of information content entities such that the arrangement itself is about something.
PERSON: Lawrence Hunter
figure
diagram
A molecular structure ribbon cartoon showing helices, turns and sheets and their relations to each other in space.
A figure that expresses one or more propositions
PERSON: Lawrence Hunter
diagram
document
A journal article, patent application, laboratory notebook, or a book
A collection of information content entities intended to be understood together as a whole
PERSON: Lawrence Hunter
document
1
A Cartesian spatial coordinate datum is a representation of a point in a spatial region, in which equal changes in the magnitude of a coordinate value denote length qualities with the same magnitude
2009-08-18 Alan Ruttenberg - question to BFO list about whether the BFO sense of the lower dimensional regions is that they are always part of actual space (the three dimensional sort) http://groups.google.com/group/bfo-discuss/browse_thread/thread/9d04e717e39fb617
Alan Ruttenberg
AR notes: We need to discuss whether it should include site.
cartesian spatial coordinate datum
http://groups.google.com/group/bfo-discuss/browse_thread/thread/9d04e717e39fb617
1
A Cartesian spatial coordinate datum that uses one value to specify a position along a one-dimensional spatial region
Alan Ruttenberg
one dimensional cartesian spatial coordinate datum
1
1
A Cartesian spatial coordinate datum that uses two values to specify a position within a two-dimensional spatial region
Alan Ruttenberg
two dimensional cartesian spatial coordinate datum
A scalar measurement datum that is the result of measurement of a mass quality
2009/09/28 Alan Ruttenberg. Fucoidan-use-case
Person:Alan Ruttenberg
mass measurement datum
A scalar measurement datum that is the result of measuring a temporal interval
2009/09/28 Alan Ruttenberg. Fucoidan-use-case
Person:Alan Ruttenberg
time measurement datum
Recording the current temperature in a laboratory notebook. Writing a journal article. Updating a patient record in a database.
a planned process in which a document is created or added to by including the specified input in it.
6/11/9: Edited at OBI workshop. We need to be able to identify a child form of information artifact which corresponds to something enduring (not brain-like). This used to be restricted to physical document or digital entity as the output, but that excludes e.g. an audio cassette tape
Bjoern Peters
wikipedia http://en.wikipedia.org/wiki/Documenting
documenting
line graph
A line graph is a type of graph created by connecting a series of data points together with a line.
PERSON:Chris Stoeckert
PERSON:Melanie Courtot
line chart
GROUP:OBI
WEB: http://en.wikipedia.org/wiki/Line_chart
line graph
The sentence "The article has Pubmed ID 12345." contains a CRID that has two parts: one part is the CRID symbol, which is '12345'; the other part denotes the CRID registry, which is Pubmed.
A symbol that is part of a CRID and that is sufficient to look up a record from the CRID's registry.
PERSON: Alan Ruttenberg
PERSON: Bill Hogan
PERSON: Bjoern Peters
PERSON: Melanie Courtot
CRID symbol
Original proposal from Bjoern, discussions at IAO calls
centrally registered identifier symbol
The sentence "The article has Pubmed ID 12345." contains a CRID that has two parts: one part is the CRID symbol, which is '12345'; the other part denotes the CRID registry, which is Pubmed.
An information content entity that consists of a CRID symbol and additional information about the CRID registry to which it belongs.
2014-05-05: In defining this term we take no position on what the CRID denotes. In particular do not assume it denotes a *record* in the CRID registry (since the registry might not have 'records').
Alan, IAO call 20101124: potentially the CRID denotes the instance it was associated with during creation.
Note, IAO call 20101124: URIs are not always CRIDs, as they are not centrally registered. We acknowledge that CRID is a subset of a larger identifier class, but this subset fulfills our current needs. OBI PURLs are CRIDs as they are registered with OCLC. UPCs (Universal Product Codes from AC Nielsen) are not CRIDs as they are not centrally registered.
PERSON: Alan Ruttenberg
PERSON: Bill Hogan
PERSON: Bjoern Peters
PERSON: Melanie Courtot
CRID
Original proposal from Bjoern, discussions at IAO calls
centrally registered identifier
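The two-part structure described above (a CRID symbol plus a denotation of the CRID registry it belongs to) can be sketched as a simple data structure. This is an illustrative sketch only; the class and field names are assumptions, not part of any ontology.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CRID:
    """Hypothetical sketch of a centrally registered identifier:
    a CRID symbol plus the registry it belongs to."""
    registry: str  # denotes the CRID registry, e.g. "PubMed"
    symbol: str    # the CRID symbol, e.g. "12345"

    def curie(self) -> str:
        # A compact written form pairing registry and symbol
        return f"{self.registry}:{self.symbol}"

article_id = CRID(registry="PubMed", symbol="12345")
print(article_id.curie())  # PubMed:12345
```

For the PubMed sentence used as the example above, "12345" is the CRID symbol and "PubMed" denotes the registry.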
PubMed is a CRID registry. It has a dataset of PubMed identifiers associated with journal articles.
A CRID registry is a dataset of CRID records, each consisting of a CRID symbol and additional information which was recorded in the dataset through an 'assigning a centrally registered identifier' process.
PERSON: Alan Ruttenberg
PERSON: Bill Hogan
PERSON: Bjoern Peters
PERSON: Melanie Courtot
CRID registry
Original proposal from Bjoern, discussions at IAO calls
centrally registered identifier registry
time stamped measurement datum
pmid:20604925 - time-lapse live cell microscopy
A data set that is an aggregate of data recording some measurement at a number of time points. The time series data set is an ordered list of pairs of time measurement data and the corresponding measurement data acquired at that time.
Alan Ruttenberg
experimental time series
time sampled measurement data set
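As a minimal sketch, the "ordered list of pairs of time measurement data and the corresponding measurement data" in the definition above can be represented directly. The variable names and values below are illustrative assumptions only.

```python
# A time-sampled measurement data set as an ordered list of
# (time point, measured value) pairs; values are invented for illustration.
time_series = [
    (0.0, 1.2),
    (1.0, 1.5),
    (2.0, 1.9),
    (3.0, 2.4),
]

# The defining property: pairs are ordered by their time component.
times = [t for t, _ in time_series]
assert times == sorted(times)
print(len(time_series))  # 4
```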
Viruses
Viruses
Euteleostomi
bony vertebrates
Euteleostomi
Bacteria
eubacteria
Bacteria
Archaea
Archaea
Eukaryota
eucaryotes
eukaryotes
Eukaryota
Euarchontoglires
Euarchontoglires
Tetrapoda
tetrapods
Tetrapoda
Amniota
amniotes
Amniota
Opisthokonta
Opisthokonta
Bilateria
Bilateria
Mammalia
mammals
Mammalia
Vertebrata <Metazoa>
Vertebrata
vertebrates
Vertebrata <Metazoa>
Homo sapiens
human
human being
man
Homo sapiens
fluorescent reporter intensity
A measurement datum that represents the output of a scanner measuring the intensity value for each fluorescent reporter.
person:Chris Stoeckert
group:OBI
From the DT branch: This term and definition were originally submitted by the community to our branch, but we thought they best fit DENRIE. However we see several issues with this. First of all, the name 'probe' might not be used in OBI; instead we have a 'reporter' role. Also, although the term 'probe intensity' is often used in communities such as the microarray one, the name 'probe' is ambiguous (some use it to refer to what's on the array, some use it to refer to what's hybridized to the array). Furthermore, this concept could possibly be encompassed by combining different OBI terms, such as the roles of analyte, detector and reporter (you need something hybridized to a probe on the array to get an intensity) and maybe a more general term for 'measuring intensities'. We need to find the right balance between what is consistent with OBI and combinations of its terms and what is user-friendly. Finally, note that 'intensity' is already in the OBI .owl file and is also in PATO. Why didn't OBI import it from PATO? This might be a problem.
fluorescent reporter intensity
planned process
planned process
Injecting mice with a vaccine in order to test its efficacy
A processual entity that realizes a plan which is the concretization of a plan specification.
'Plan' includes a future direction sense. That can be problematic if plans are changed during their execution. There are however implicit contingencies for protocols that an agent has in his mind that can be considered part of the plan, even if the agent didn't have them in mind before. Therefore, a planned process can diverge from what the agent would have said the plan was before executing it, by adjusting to problems encountered during execution (e.g. choosing another reagent with equivalent properties, if the originally planned one has run out.)
We are only considering successfully completed planned processes. A plan may be modified, and details added during execution. For a given planned process, the associated realized plan specification is the one encompassing all changes made during execution. This means that every process in which an agent acts towards achieving some objectives is a planned process.
Bjoern Peters
branch derived
6/11/9: Edited at workshop. Used to include: is initiated by an agent
This class merges the previously separated objective driven process and planned process, as the separation proved hard to maintain. (1/22/09, branch call)
planned process
biological feature identification objective
Biological_feature_identification_objective is an objective role carried out by the proposition defining the aim of a study designed to examine or characterize a particular biological feature.
Jennifer Fostel
biological feature identification objective
processed material
Examples include gel matrices, filter paper, parafilm and buffer solutions, mass spectrometer, tissue samples
A material entity that is created or changed during material processing.
PERSON: Alan Ruttenberg
processed material
investigation
Lung cancer investigation using expression profiling, a stem cell transplant investigation, biobanking is not an investigation, though it may be part of an investigation
a planned process that consists of the parts planning, study design execution, and documentation, and which produces conclusion(s).
Bjoern Peters
OBI branch derived
Could add specific objective specification
Following the OBI call of November 26th, 2012: it was decided there was no need for adding "achieves objective of drawing conclusion" as existing relations provided equivalent ability. This note closes the issue and validates the class definition to be part of the OBI core
editor = PRS
study
investigation
evaluant role
When a specimen of blood is assayed for glucose concentration, the blood has the evaluant role. When measuring the mass of a mouse, the evaluant is the mouse. When measuring the time of DNA replication, the evaluant is the DNA. When measuring the intensity of light on a surface, the evaluant is the light source.
a role that inheres in a material entity that is realized in an assay in which data is generated about the bearer of the evaluant role
Role call - 17nov-08: JF and MC think an evaluant role is always specified input of a process. Even in the case where we have an assay taking blood as evaluant and outputting blood, the blood is not the specified output at the end of the assay (the concentration of glucose in the blood is)
examples of features that could be described in an evaluant: a quality, e.g. "contains 10 pg/ml IL2", or "no glucose detected"
GROUP: Role Branch
OBI
Feb 10, 2009. changes after discussion at OBI Consortium Workshop Feb 2-6, 2009. accepted as core term.
evaluant role
assay
Assay the wavelength of light emitted by excited Neon atoms. Count of geese flying over a house.
A planned process with the objective to produce information about the material entity that is the evaluant, by physically examining it or its proxies.
12/3/12: BP: the reference to the 'physical examination' is included to point out that a prediction is not an assay, as that does not require physical examination.
PlanAndPlannedProcess Branch
measuring
scientific observation
OBI branch derived
study assay
any method
assay
quantitative confidence value
A data item which is used to indicate the degree of uncertainty about a measurement.
person:Chris Stoeckert
group:OBI
quantitative confidence value
culture medium
A growth medium or culture medium is a substance in which microorganisms or cells can grow. Wikipedia, growth medium, Feb 29, 2008
a processed material that provides the needed nourishment for microorganisms or cells grown in vitro.
changed from a role to a processed material based on the Aug 22, 2011 dev call. Details: see the tracker item: http://sourceforge.net/tracker/?func=detail&aid=3325270&group_id=177891&atid=886178
Modification made by JZ.
Person: Jennifer Fostel, Jie Zheng
OBI
culture medium
reagent role
Buffer, dye, a catalyst, a solvating agent.
A role inhering in a biological or chemical entity that is intended to be applied in a scientific technique to participate (or have molecular components that participate) in a chemical reaction that facilitates the generation of data about some entity distinct from the bearer, or the generation of some specified material output distinct from the bearer.
PERSON:Matthew Brush
reagent
PERSON:Matthew Brush
Feb 10, 2009. changes after discussion at OBI Consortium Workshop Feb 2-6, 2009. accepted as core term.
May 28 2013. Updated definition taken from ReO based on discussions initiated in Philly 2011 workshop. Former definition described a narrower view of reagents in chemistry that restricts bearers of the role to be chemical entities ("a role played by a molecular entity used to produce a chemical reaction to detect, measure, or produce other substances"). Updated definition allows for broader view of reagents in the domain of biomedical research to include larger materials that have parts that participate chemically in a molecular reaction or interaction.
(copied from ReO)
Reagents are distinguished from instruments or devices that also participate in scientific techniques by the fact that reagents are chemical or biological in nature and necessarily participate in or have parts that participate in some chemical interaction or reaction during their intended participation in some technique. By contrast, instruments do not participate in a chemical reaction/interaction during the technique.
Reagents are distinguished from study subjects/evaluants in that study subjects and evaluants are that about which conclusions are drawn and knowledge is sought in an investigation - while reagents, by definition, are not. It should be noted, however, that reagent and study subject/evaluant roles can be borne by instances of the same type of material entity - but a given instance will realize only one of these roles in the execution of a given assay or technique. For example, taq polymerase can bear a reagent role or an evaluant role. In a DNA sequencing assay aimed at generating sequence data about some plasmid, the reagent role of the taq polymerase is realized. In an assay to evaluate the quality of the taq polymerase itself, the evaluant/study subject role of the taq is realized, but not the reagent role since the taq is the subject about which data is generated.
In regard to the statement that reagents are 'distinct' from the specified outputs of a technique, note that a reagent may be incorporated into a material output of a technique, as long as the IDENTITY of this output is distinct from that of the bearer of the reagent role. For example, dNTPs input into a PCR are reagents that become part of the material output of this technique, but this output has a new identity (ie that of a 'nucleic acid molecule') that is distinct from the identity of the dNTPs that comprise it. Similarly, a biotin molecule input into a cell labeling technique are reagents that become part of the specified output, but the identity of the output is that of some modified cell specimen which shares identity with the input unmodified cell specimen, and not with the biotin label. Thus, we see that an important criteria of 'reagent-ness' is that it is a facilitator, and not the primary focus of an investigation or material processing technique (ie not the specified subject/evaluant about which knowledge is sought, or the specified output material of the technique).
reagent role
material processing
A cell lysis, production of a cloning vector, creating a buffer.
A planned process which results in physical changes in a specified input material
PERSON: Bjoern Peters
PERSON: Frank Gibson
PERSON: Jennifer Fostel
PERSON: Melanie Courtot
PERSON: Philippe Rocca Serra
material transformation
OBI branch derived
material processing
study subject role
Human subjects in a clinical trial, rats in a toxicogenomics study, tissue cultures subjected to drug tests, fish observed in an ecotoxicology study.
Parasite example: people are infected with a parasite which is then extracted; the participant under investigation could be the parasite, the people, or a population of which the people are members, depending on the nature of the study.
Lake example: a lake could realize this role in an investigation that assays pollution levels in samples of water taken from the lake.
A role that is realized through the execution of a study design in which the bearer of the role participates and in which data about that bearer is collected.
A participant can realize both "specimen role" and "participant under investigation role" at the same time. However "participant under investigation role" is distinct from "specimen role", since a specimen could somehow be involved in an investigation without being the thing that is under investigation.
GROUP: Role Branch
OBI
Following the OBI call of November 26th, 2012:
1. it was decided there was no need for moving the children class and making them siblings of study subject role.
2. it also settles the disambiguation about 'study subject'. This is about the individual participating in the investigation/study, Not the 'topic' (as in 'toxicity study') of the investigation/study
This note closes the issue and validates the class definition to be part of the OBI core
editor = PRS
participant under investigation role
specimen role
liver section; a portion of a culture of cells; a nematode or other animal once no longer a subject (generally killed); a portion of blood from a patient.
a role borne by a material entity that is gained during a specimen collection process and that can be realized by use of the specimen in an investigation
22Jun09. The definition includes whole organisms, and can include a human. The link between specimen role and study subject role has been removed. A specimen taken as part of a case study is not considered to be a population representative, while a specimen taken as representing a population (e.g. a person taken from a cohort, or a blood specimen taken from an animal) would be considered a population representative and would also bear material sample role.
Note: definition is in specimen creation objective which is defined as an objective to obtain and store a material entity for potential use as an input during an investigation.
blood taken from animal: animal continues in study, whereas blood has role specimen.
something taken from study subject, leaves the study and becomes the specimen.
parasite example
- when parasite in people we study people, people are subjects and parasites are specimen
- when parasite extracted, they become subject in the following study
specimen can later be subject.
GROUP: Role Branch
OBI
specimen role
sequence feature identification objective
Sequence_feature_identification_objective is a biological_feature_identification_objective role describing a study designed to examine or characterize molecular features exhibited at the level of a macromolecular sequence, e.g. nucleic acid, protein, polysaccharide.
Jennifer Fostel
sequence feature identification objective
intervention design
PMID: 18208636.Br J Nutr. 2008 Jan 22;:1-11.Effect of vitamin D supplementation on bone and vitamin D status among Pakistani immigrants in Denmark: a randomised double-blinded placebo-controlled intervention study.
An intervention design is a study design in which a controlled process applied to the subjects (the intervention) serves as the independent variable manipulated by the experimentalist. The treatment (perturbation or intervention) can be defined as a combination of values taken by the independent variables manipulated by the experimentalists; treatments are applied to the recruited subjects assigned (possibly by applying specific methods) to treatment groups. The specificity of intervention design is that the independent variables are manipulated and a response of the biological system is evaluated via response variables, possibly monitored by a series of assays.
Philippe Rocca-Serra
OBI branch derived
intervention design
gene list
Gene lists may arise from analysis to determine differentially expressed genes, may be collected from the literature for involvement in a particular process or pathway (e.g., inflammation), or may be the input for gene set enrichment analysis.
A data set of the names or identifiers of genes that are the outcome of an analysis or have been put together for the purpose of an analysis.
person:Chris Stoeckert
group:OBI
kind of report. (alan) need to be careful to distinguish from output of a data transformation or calculation. A gene list is a report when it is published as such? Relates to question of whether report is a whole, or whether it can be a part of some other narrative object.
gene list
molecular feature identification objective
Molecular_feature_identification_objective is a biological_feature_identification_objective role describing a study designed to examine or characterize molecular features of a biological system, e.g. expression profiling, copy number of molecular components, epigenetic modifications.
Jennifer Fostel
molecular feature identification objective
cDNA library
PMID:6110205. collection of cDNA derived from mouse splenocytes.
Mixed population of cDNAs (complementary DNA) made from mRNA from a defined source, usually a specific cell type. This term should be associated only with nucleic acid interactors, not with their protein products. For instance, in 2h screening, use living cells (MI:0349) as the sample process.
ALT DEF (PRS): a cDNA library is a collection of host cells, typically but not exclusively E. coli cells, modified by transfer of a plasmid DNA molecule used as a vector containing a fragment or the totality of a cDNA molecule (the insert). A cDNA library may have an array of roles and applications.
PERSON: Luisa Montecchi
PERSON: Philippe Rocca-Serra
GROUP: PSI
PRS: 22022008. Class moved under population; modification of definition and replacement of 'biomaterials' in previous definition with 'material'; addition of has_role restriction.
cDNA library
p-value
PMID:19696660
in contrast to the in-vivo data AT-III increased significantly from 113.5% at baseline to 117% after 4 days (n = 10, P-value = 0.02; Table 2).
A quantitative confidence value that represents the probability of obtaining a result at least as extreme as that actually obtained, assuming that the actual value was the result of chance alone.
Addition of restriction 'output of null hypothesis testing' by AGB and PRS while working on STATO
May be outside the scope of OBI long term, is needed so is retained
Alejandra Gonzalez-Beltran
PERSON:Chris Stoeckert
Philippe Rocca-Serra
WEB: http://en.wikipedia.org/wiki/P-value
p
p-value
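The definition above ("the probability of obtaining a result at least as extreme as that actually obtained, assuming that the actual value was the result of chance alone") can be illustrated with an exact two-sided binomial test. The scenario (8 heads in 10 tosses of a hypothesized fair coin) and the helper names are invented for illustration; this is a sketch, not a reference implementation.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float = 0.5) -> float:
    # Probability of exactly k successes in n trials under the null hypothesis
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(k: int, n: int, p: float = 0.5) -> float:
    # Sum the probabilities of every outcome at least as extreme
    # (i.e. no more probable under the null) as the observed one.
    observed = binom_pmf(k, n, p)
    return sum(binom_pmf(i, n, p) for i in range(n + 1)
               if binom_pmf(i, n, p) <= observed + 1e-12)

# 8 heads in 10 tosses of a fair coin:
print(round(two_sided_p_value(8, 10), 4))  # 0.1094
```

Here the outcomes counted as "at least as extreme" are 0, 1, 2, 8, 9 and 10 heads, giving 112/1024 = 0.1094, so the result would not be significant at the conventional 0.05 level.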
population
PMID12564891. Environ Sci Technol. 2003 Jan 15;37(2):223-8. Effects of historic PCB exposures on the reproductive success of the Hudson River striped bass population.
a population is a collection of individuals from the same taxonomic class living, counted or sampled at a particular site or in a particular area
1/28/2013, BP, on the call it was raised that we may want to switch to an external ontology for all population terms:
http://code.google.com/p/popcomm-ontology/
PERSON: Philippe Rocca-Serra
adapted from the Oxford English Dictionary
rem1: a collection somehow always involves a selection process
population
imaging assay
An imaging assay is an assay to produce a picture of an entity. definition_source: OBI.
PlanAndPlannedProcess Branch
OBI branch derived
imaging assay
organization
PMID: 16353909.AAPS J. 2005 Sep 22;7(2):E274-80. Review. The joint food and agriculture organization of the United Nations/World Health Organization Expert Committee on Food Additives and its role in the evaluation of the safety of veterinary drug residues in foods.
An entity that can bear roles, has members, and has a set of organization rules. Members of organizations are either organizations themselves or individual people. Members can bear specific organization member roles that are determined in the organization rules. The organization rules also determine how decisions are made on behalf of the organization by the organization members.
BP: The definition summarizes long email discussions on the OBI developer, roles, biomaterial and denrie branches. It leaves open if an organization is a material entity or a dependent continuant, as no consensus was reached on that. The current placement as material is therefore temporary, in order to move forward with development. Here is the entire email summary, on which the definition is based:
1) there are organization_member_roles (president, treasurer, branch editor), with individual persons as bearers
2) there are organization_roles (employer, owner, vendor, patent holder)
3) an organization has a charter / rules / bylaws, which specify what roles there are, how they should be realized, and how to modify the charter/rules/bylaws themselves.
It is debatable what the organization itself is (some kind of dependent continuant or an aggregate of people). This also determines who/what the bearers of organization_roles are. My personal favorite is still to define organization as a kind of 'legal entity', but thinking it through leads to all kinds of questions that are clearly outside the scope of OBI.
Interestingly enough, it does not seem to matter much where we place organization itself, as long as we can subclass it (University, Corporation, Government Agency, Hospital), instantiate it (Affymetrix, NCBI, NIH, ISO, W3C, University of Oklahoma), and have it play roles.
This leads to my proposal: we define organization through statements 1-3 above, but without an 'is a' statement for now. We can leave it in its current place in the is_a hierarchy (material entity) or move it up to 'continuant'. We leave further clarifications to BFO, and close this issue for now.
PERSON: Alan Ruttenberg
PERSON: Bjoern Peters
PERSON: Philippe Rocca-Serra
PERSON: Susanna Sansone
GROUP: OBI
organization
dye role
A molecular label role which inheres in a material entity and which is realized in the process of detecting a molecular dye that imparts color to some material of interest.
Jennifer Fostel
dye
A substance used to color materials www.answers.com/topic/dye 19feb09
dye role
protocol
PCR protocol: has objective specification, amplify a DNA fragment of interest, and has action specification describing the amounts of experimental reagents used (e.g. buffers, dNTPs, enzyme), and the temperature and cycle time settings for running the PCR.
A plan specification which has sufficient level of detail and quantitative information to communicate it between investigation agents, so that different investigation agents will reliably be able to independently reproduce the process.
PlanAndPlannedProcess Branch
OBI branch derived + wikipedia (http://en.wikipedia.org/wiki/Protocol_%28natural_sciences%29)
study protocol
protocol
adding a material entity into a target
Injecting a drug into a mouse. Adding IL-2 to a cell culture. Adding NaCl into water.
A process with the objective to place a material entity bearing the 'material to be added role' into a material bearing the 'target of material addition role'.
Class was renamed from 'administering substance', as this is commonly used only for additions into organisms.
BP
branch derived
adding a material entity into a target
analyte role
Glucose in blood (measured in an assay to determine the concentration of glucose).
A measurand role borne by a molecular entity or an atom and realized in an analyte assay which achieves the objective to measure the magnitude/concentration/amount of the analyte in the entity bearing evaluant role.
interestingly, an analyte is still an analyte even if it is not detected. for this reason it does not bear a specified input role
pH (technically the inverse log of [H+]) may be considered a quality; this remains to be tested.
qualities such as weight, color are not assayed but measured, so they do not fall into this category.
GROUP: Role Branch
OBI
Feb 10, 2009. changes after discussion at OBI Consortium Workshop Feb 2-6, 2009. accepted as core term.
analyte role
material to be added role
drug added to a buffer contained in a tube; substance injected into an animal;
material to be added role is a protocol participant role realized by a material which is added into a material bearing the target of material addition role in a material addition process
Role Branch
OBI
9 March 09 from discussion with PA branch
material to be added role
interpreting data
Concluding that a gene is upregulated in a tissue sample based on the band intensity in a western blot. Concluding that a patient has a infection based on measurement of an elevated body temperature and reported headache. Concluding that there were problems in an investigation because data from PCR and microarray are conflicting. Concluding that 'defects in gene XYZ cause cancer due to improper DNA repair' based on data from experiments in that study that gene XYZ is involved in DNA repair, and the conclusion of a previous study that cancer patients have an increased number of mutations in this gene.
A planned process in which data gathered in an investigation is evaluated in the context of existing knowledge with the objective to generate more general conclusions or to conclude that the data does not allow one to draw general conclusion
PERSON: Bjoern Peters
PERSON: Jennifer Fostel
Bjoern Peters
drawing a conclusion based on data
planning
The process of a scientist thinking about and deciding what reagents to use as part of a protocol for an experiment. Note that the scientist could be human or a "robot scientist" executing software.
a process of creating or modifying a plan specification
7/18/2011 BP: planning used to itself be a planned process. Barry Smith pointed out that this would lead to an infinite regression, as there would have to be a plan to conduct a planning process, which in itself would be the result of planning etc. Therefore, the restrictions on 'planning' were loosened to allow for informal processes that result in an 'ad hoc plan'. This required changing from 'has_specified_output some plan specification' to 'has_participant some plan specification'.
Bjoern Peters
Bjoern Peters
Plans and Planned Processes Branch
planning
light emission function
A light emission function is an excitation function to excite a material to a specific excitation state such that it emits light.
Bill Bug
Daniel Schober
Frank Gibson
Melanie Courtot
light emission function
contain function
A syringe, a beaker
A contain function is a function to constrain a material entity's location in space
Bill Bug
Daniel Schober
Frank Gibson
Melanie Courtot
contain function
heat function
A heat function is a function that increases the internal kinetic energy of a material
Bill Bug
Daniel Schober
Frank Gibson
Melanie Courtot
heat function
material separation function
A material separation function is a function that increases the resolution between two or more material entities. The distinction between the entities is usually based on some associated physical quality.
Bill Bug
Daniel Schober
Frank Gibson
Melanie Courtot
material separation function
excitation function
An excitation function is a function to inject energy by bombarding a material with energetic particles (e.g., photons), thereby imbuing internal material components such as electrons with additional energy. These internal, 'excited' particles may lead to the rupturing of covalent chemical bonds or may quickly relax back to their unexcited state with an exponential time course, thereby locally emitting energy in the form of photons.
Bill Bug
Daniel Schober
Frank Gibson
Melanie Courtot
excitation function
filter function
A filter function is a function to prevent the flow of certain entities based on a quality or qualities of the entity while allowing entities which have different qualities to pass through
Frank Gibson
filter function
cool function
A cool function is a function to decrease the internal kinetic energy of a material below the initial kinetic energy of that type of material.
Daniel Schober
Frank Gibson
Melanie Courtot
cool function
solid support function
Taped, glued, pinned, dried or molecularly bonded to a solid support
A solid support function is a function of a device on which an entity is kept in a defined position and prevented from moving
Daniel Schober
Frank Gibson
Melanie Courtot
solid support function
environment control function
An environmental control function is a function that regulates a contained environment within specified parameter ranges. For example the control of light exposure, humidity and temperature.
Bill Bug
Daniel Schober
Frank Gibson
Melanie Courtot
environment control function
sort function
A sort function is a function to distinguish material components based on some associated physical quality or entity and to partition the separate components into distinct fractions according to a defined order.
Daniel Schober
Frank Gibson
Melanie Courtot
sort function
cloning vector role
pBluescript plays the role of a cloning vector
A material to be added role played by a small, self-replicating DNA or RNA molecule - usually a plasmid or chromosome - and realized in a process whereby foreign DNA or RNA is inserted into the vector during the process of cloning.
JZ: related tracker: https://sourceforge.net/p/obi/obi-terms/102/
PERSON: Helen Parkinson
cloning vector role
cloning insert role
cloning insert role is a role which inheres in DNA or RNA and is realized by the process of being inserted into a cloning vector in a cloning process.
Feb 20, 2009. from Wikipedia: cloning of any DNA fragment essentially involves four steps: DNA fragmentation with restriction endonucleases, ligation of DNA fragments to a vector, transfection, and screening/selection. There are multiple processes involved, it is not just "cloning process"
GROUP: Role branch
OBII and Wikipedia
cloning insert role
extract
Up-regulation of inflammatory signalings by areca nut extract and role of cyclooxygenase-2 -1195G>a polymorphism reveal risk of oral cancer. Cancer Res. 2008 Oct 15;68(20):8489-98. PMID: 18922923
an extract is a material entity which results from an extraction process
PERSON: Philippe Rocca-Serra
extracted material
GROUP: OBI Biomatrial Branch
extract
transcription profiling assay
Whole genome transcription profiling of Anaplasma phagocytophilum in human and tick host cells by tiling array analysis. BMC Genomics. 2008 Jul 31;9:364. PMID: 18671858
An assay which aims to provide information about gene expression and transcription activity using ribonucleic acids collected from a material entity, using a range of techniques and instruments such as DNA sequencers, DNA microarrays, and Northern blots
Philippe Rocca-Serra
gene expression profiling
OBI
transcription profiling
transcription profiling assay
averaging objective
A mean calculation which has averaging objective is a descriptive statistics calculation in which the mean is calculated by taking the sum of all of the observations in a data set divided by the total number of observations. It gives a measure of the 'center of gravity' for the data set. It is also known as the first moment.
An averaging objective is a data transformation objective where the aim is to perform mean calculations on the input of the data transformation.
Elisabetta Manduchi
James Malone
PERSON: Elisabetta Manduchi
averaging objective
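The mean calculation described above, the sum of all observations in a data set divided by the total number of observations (the first moment, a measure of the data set's 'center of gravity'), is a one-liner; the data values below are illustrative only.

```python
def mean(observations):
    # Sum of all observations divided by the total number of observations
    return sum(observations) / len(observations)

data = [2.0, 4.0, 6.0, 8.0]
print(mean(data))  # 5.0
```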
enzyme
(protein or rna) or has_part (protein or rna) and has_function some GO:0003824 (catalytic activity)
MC: known issue: enzyme doesn't classify under material entity for now, as it isn't stated that anything that has_part some material entity is a material entity. If we add as equivalent classes to material entity has_part some material entity and part_of some material entity (each one in its own necessary and sufficient block), Pellet in P3 doesn't classify any more.
person: Melanie Courtot
GROUP:OBI
enzyme
adding material objective
creating a mouse infected with LCM virus
The specification of an objective to add a material into a target material. The adding is asymmetric in the sense that the target material largely retains its identity
BP
adding material objective
genotyping assay
High-throughput genotyping of oncogenic human papilloma viruses with MALDI-TOF mass spectrometry. Clin Chem. 2008 Jan;54(1):86-92. Epub 2007 Nov 2.PMID: 17981923
an assay which generates data about a genotype from a specimen of genomic DNA. A variety of techniques and instruments can be used to produce information about sequence variation at particular genomic positions.
Philippe Rocca-Serra
genotype profiling, SNP genotyping
OBI Biomaterial
SNP analysis
genotyping assay
analyte measurement objective
The objective to measure the concentration of glucose in a blood sample
an assay objective to determine the presence or concentration of an analyte in the evaluant
PERSON: Bjoern Peters
PPPB branch
analyte measurement objective
assay objective
the objective to determine the weight of a mouse.
an objective specification to determine a specified type of information about an evaluated entity (the material entity bearing evaluant role)
PPPB branch
PPPB branch
assay objective
analyte assay
example of usage: In lab test for blood glucose, the test is the assay, the blood bears evaluant_role and glucose bears the analyte role. The evaluant is considered an input to the assay and the information entity that records the measurement of glucose concentration the output
An assay with the objective to capture information about the presence, concentration, or amount of an analyte in an evaluant.
2013-09-23: simplify equivalent axiom
Note: is_realization of some analyte role isn't always true, for example when there is none of the analyte in the evaluant. For the moment we are writing it this way, but when the information ontology is further worked out this will be replaced with a condition discussing the measurement.
logical def modified to remove expression below, as some analyte assays report below the level of detection, and therefore not a scalar measurement datum, replaced by measurement datum
and
('has measurement unit label' some 'measurement unit label') and
('is quality measurement of' some 'molecular concentration'))
PERSON:Bjoern Peters, Helen Parkinson, Philippe Rocca-Serra, Alan Ruttenberg
PERSON:Bjoern Peters
PERSON:Helen Parkinson
PERSON:Philippe Rocca-Serra
PERSON:Alan Ruttenberg
GROUP:OBI Planned process branch
analyte assay
target of material addition role
peritoneum of an animal receiving an intraperitoneal injection; solution in a tube receiving additional material; location of absorbed material following a dermal application.
target of material addition role is a role realized by an entity into which a material is added in a material addition process
From Branch discussion with BP, AR, MC -- there is a need for the recipient to interact with the administered material. For example, a tooth receiving a filling was not considered to bear a target role.
GROUP: Role Branch
OBI
target of material addition role
normalized data set
A data set that is produced as the output of a normalization data transformation.
PERSON: James Malone
PERSON: Melanie Courtot
normalized data set
measure function
A glucometer measures blood glucose concentration, the glucometer has a measure function.
Measure function is a function that is borne by a processed material and realized in a process in which information about some entity is expressed relative to some reference.
PERSON: Daniel Schober
PERSON: Helen Parkinson
PERSON: Melanie Courtot
PERSON:Frank Gibson
measure function
material transformation objective
The objective to create a mouse infected with LCM virus. The objective to create a defined solution of PBS.
an objective specification that creates a specific output object from input materials.
PERSON: Bjoern Peters
PERSON: Frank Gibson
PERSON: Jennifer Fostel
PERSON: Melanie Courtot
PERSON: Philippe Rocca-Serra
artifact creation objective
GROUP: OBI PlanAndPlannedProcess Branch
material transformation objective
study design execution
injecting a mouse with PBS solution, weighing it, and recording the weight according to a study design.
a planned process that carries out a study design
removed axiom has_part some (assay or 'data transformation') per discussion on protocol application mailing list to improve reasoner performance. The axiom is still desired.
branch derived
6/11/9: edited at workshop. Used to be: study design execution is a process with the objective to generate data according to a concretized study design. The execution of a study design is part of an investigation, and minimally consists of an assay or data transformation.
study design execution
DNA sequencing
Genomic deletions of OFD1 account for 23% of oral-facial-digital type 1 syndrome after negative DNA sequencing. Thauvin-Robinet C, Franco B, Saugier-Veber P, Aral B, Gigot N, Donzel A, Van Maldergem L, Bieth E, Layet V, Mathieu M, Teebi A, Lespinasse J, Callier P, Mugneret F, Masurel-Paulet A, Gautier E, Huet F, Teyssier JR, Tosi M, Frébourg T, Faivre L. Hum Mutat. 2008 Nov 19. PMID: 19023858
DNA sequencing is a sequencing process which uses deoxyribonucleic acid as input and results in the creation of a DNA sequence information artifact using a DNA sequencer instrument.
Philippe Rocca-Serra
OBI Branch derived
nucleotide sequencing
DNA sequencing
material separation objective
The objective to obtain multiple aliquots of an enzyme preparation. The objective to obtain cells contained in a sample of blood.
is an objective to transform a material entity into spatially separated components.
PPPB branch
PPPB branch
material separation objective
clustered data set
A clustered data set is the output of a K means clustering data transformation
A data set that is produced as the output of a class discovery data transformation and consists of a data set with assigned discovered class labels.
PERSON: James Malone
PERSON: Monnie McGee
data set with assigned discovered class labels
AR thinks could be a data item instead
clustered data set
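The example above names K-means clustering as a producer of a clustered data set. A minimal one-dimensional sketch of Lloyd's K-means algorithm (illustrative only; function and variable names are not from the ontology):

```python
def k_means(points, centers, iterations=20):
    """Minimal 1-D Lloyd's algorithm: assign each point to its nearest
    center, then move each center to the mean of its assigned points.
    Returns the input points paired with their discovered class labels."""
    for _ in range(iterations):
        clusters = {c: [] for c in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(members) / len(members) if members else centers[i]
                   for i, members in clusters.items()]
    return [(p, min(range(len(centers)), key=lambda i: abs(p - centers[i])))
            for p in points]
```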
data set of features
A data set that is produced as the output of a descriptive statistical calculation data transformation and consists of producing a data set that represents one or more features of interest about the input data set.
PERSON: James Malone
PERSON: Monnie McGee
data set of features
differential expression analysis data transformation
A differential expression analysis data transformation is a data transformation that has objective differential expression analysis and that consists of
James Malone
Melanie Courtot
Monnie McGee
WEB:
differential expression analysis data transformation
material combination
Mixing two fluids. Adding salt into water. Injecting a mouse with PBS.
is a material processing with the objective to combine two or more material entities as input into a single material entity as output.
created at workshop as parent class for 'adding material into target', which is asymmetric, while combination encompasses all addition processes.
bp
bp
material combination
specimen collection process
drawing blood from a patient for analysis, collecting a piece of a plant for depositing in a herbarium, buying meat from a butcher in order to measure its protein content in an investigation
A planned process with the objective of collecting a specimen.
Note: definition is in specimen creation objective which is defined as an objective to obtain and store a material entity for potential use as an input during an investigation.
Philly2013: A specimen collection can have as part a material entity acquisition, such as ordering from a bank. The distinction is that specimen collection necessarily involves the creation of a specimen role. However ordering cell lines cells from ATCC for use in an investigation is NOT a specimen collection, because the cell lines already have a specimen role.
Philly2013: The specimen_role for the specimen is created during the specimen collection process.
label changed to 'specimen collection process' on 10/27/2014, details see tracker:
http://sourceforge.net/p/obi/obi-terms/716/
Bjoern Peters
specimen collection
5/31/2012: This process is not necessarily an acquisition, as specimens may be collected from materials already in possession
6/9/09: used at workshop
specimen collection process
error corrected data set
A data set that is produced as the output of an error correction data transformation and consists of producing a data set which has had erroneous contributions from the input to the data transformation removed (corrected for).
PERSON: James Malone
PERSON: Monnie McGee
error corrected data set
error correction data transformation
An error correction data transformation is a data transformation that has the objective of error correction, where the aim is to remove (correct for) erroneous contributions from the input to the data transformation.
James Malone
Monnie McGee
EDITORS
error correction data transformation
sample from organism
a material obtained from an organism in order to be a representative of the whole
5/29: This is a helper class for now
we need to work on this: Is taking a urine sample a material separation process? If not, we will need to specify what 'taking a sample from organism' entails. We can argue that the objective to obtain a urine sample from a patient is enough to call it a material separation process, but it could dilute what material separation was supposed to be about.
sample from organism
statistical hypothesis test
"A statistical test provides a mechanism for making quantitative decisions about a process or processes".
A statistical hypothesis test data transformation is a data transformation that has objective statistical hypothesis test.
Alejandra Gonzalez-Beltran
James Malone
Philippe Rocca-Serra
PERSON: James Malone
http://www.itl.nist.gov/div898/handbook/prc/section1/prc13.htm
NHST
Null Hypothesis Statistical Testing
statistical hypothesis testing
statistical hypothesis test
center value
A data item that is produced as the output of a center calculation data transformation and represents the center value of the input data.
PERSON: James Malone
PERSON: Monnie McGee
median
center value
statistical hypothesis test objective
is a data transformation objective where the aim is to estimate statistical significance, in order to support or reject a hypothesis, by means of some data transformation
James Malone
Person:Helen Parkinson
hypothesis test objective
WEB: http://en.wikipedia.org/wiki/Statistical_hypothesis_testing
statistical hypothesis test objective
portioning objective
The objective to obtain multiple aliquots of an enzyme preparation.
A material separation objective aiming to separate material into multiple portions, each of which contains a similar composition of the input material.
portioning objective
average value
A data item that is produced as the output of an averaging data transformation and represents the average value of the input data.
PERSON: James Malone
PERSON: Monnie McGee
arithmetic mean
average value
separation into different composition objective
The objective to obtain cells contained in a sample of blood.
A material separation objective aiming to separate a material entity that has parts of different types, and end with at least one output that is a material with parts of fewer types (modulo impurities).
We should be using has the grain relations or concentrations to distinguish the portioning and other sub-objectives
separation into different composition objective
specimen collection objective
The objective to collect bits of excrement in the rainforest. The objective to obtain a blood sample from a patient.
An objective specification to obtain a material entity for potential use as an input during an investigation.
Bjoern Peters
Bjoern Peters
specimen collection objective
material combination objective
is an objective to obtain an output material that contains several input materials.
PPPB branch
bp
material combination objective
paired-end library
PMID: 19339662. Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res. 2009 Apr;19(4):521-32. Fullwood MJ, Wei CL, Liu ET, Ruan Y.
is a collection of short paired tags that are extracted from the two ends of DNA fragments and covalently linked as ditag constructs
Philippe Rocca-Serra
mate-paired library
paired-end tag (PET) library
adapted from information provided by Solid web site
paired-end library
k-nearest neighbors
A k-nearest neighbors is a data transformation which achieves a class discovery or partitioning objective, in which an input data object with vector y is assigned to a class label based upon the k closest training data set points to y; the class label most common among those k points is the one assigned.
James Malone
k-NN
PERSON: James Malone
k-nearest neighbors
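The assignment rule in the definition above can be sketched in Python (a minimal illustration; names are not from the ontology):

```python
from collections import Counter

def knn_classify(query, training, k=3):
    """Assign `query` the class label most common among the k closest
    training points (Euclidean distance over feature vectors).
    `training` is a list of (feature_vector, label) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(training, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```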
recombinant vector
A recombinant vector is created by a recombinant vector cloning process, and contains nucleic acids that can be amplified. It retains functions of the original cloning vector.
recombinant vector
single fragment library
is a collection of short tags that are extracted from DNA fragments and covalently linked as single tag constructs
Philippe Rocca-Serra
fragment library
single fragment library
cloning vector
A cloning vector is an engineered material that is used as an input material for a recombinant vector cloning process to carry inserted nucleic acids. It contains an origin of replication for a specific destination host organism, encodes for a selectable gene product and contains a cloning site.
cloning vector
Student's t-test
Student's t-test is a data transformation with the objective of a statistical hypothesis test in which the test statistic has a Student's t distribution if the null hypothesis is true. It is applied when the population is assumed to be normally distributed but the sample sizes are small enough that the statistic on which inference is based is not normally distributed because it relies on an uncertain estimate of standard deviation rather than on a precisely known value.
Alejandra Gonzalez-Beltran
James Malone
Philippe Rocca-Serra
t-test
WEB: http://en.wikipedia.org/wiki/T-test
t.test(dependent variable ~ independent variable, data = dataset, var.equal = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/t.test.html
Student's t-test
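The R call above with var.equal = FALSE computes Welch's variant of the t-test. A minimal Python sketch of the Welch statistic and the Welch-Satterthwaite degrees of freedom (illustrative only; p-value lookup against the t distribution is omitted):

```python
import math

def welch_t(sample1, sample2):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees
    of freedom (the unequal-variance case of the two-sample t-test)."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)  # sample variances
    v2 = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2                              # squared std. error
    t = (m1 - m2) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df
```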
material sample role
a role borne by a portion of blood taken to represent all the blood in an organism; the role borne by a population of humans with HIV enrolled in a study taken to represent patients with HIV in general.
A material sample role is a specimen role borne by a material entity that is the output of a material sampling process.
7/13/09: Note that this is a relational role: between the sample taken and the 'sampled' material of which the sample is thought to be representative.
material sample role
material sampling process
A specimen gathering process with the objective to obtain a specimen that is representative of the input material entity
material sampling process
material sample
blood drawn from patient to measure his systemic glucose level. A population of humans with HIV enrolled in a study taken to represent patients with HIV in general.
A material entity that has the material sample role
OBI: workshop
sample population
sample
material sample
independent variable specification
In a study in which gene expression is measured in patients between 8 months to 4 years old that have mild or severe malaria and in which the hypothesis is that gene expression in that age group is a function of disease status, disease status is the independent variable.
a directive information entity that is part of a study design. Independent variables are entities whose values are selected to determine their relationship to an observed phenomenon (the dependent variable). In such an experiment, an attempt is made to find evidence that the values of the independent variable determine the values of the dependent variable (that which is being measured). The independent variable can be changed as required, and its values do not represent a problem requiring explanation in an analysis, but are taken simply as given. The dependent variable, on the other hand, usually cannot be directly controlled.
2/2/2009 Original definition - In the design of experiments, independent variables are those whose values are controlled or selected by the person experimenting (experimenter) to determine its relationship to an observed phenomenon (the dependent variable). In such an experiment, an attempt is made to find evidence that the values of the independent variable determine the values of the dependent variable (that which is being measured). The independent variable can be changed as required, and its values do not represent a problem requiring explanation in an analysis, but are taken simply as given. The dependent variable on the other hand, usually cannot be directly controlled.
In the Philly 2013 workshop the label was chosen to distinguish it from "dependent variable" as used in statistical modelling. See: http://en.wikipedia.org/wiki/Statistical_modeling
an independent variable is a variable which assumes only values set by the operator according to a plan and which are expected to (or are being tested for) influence the ranges of values assumed by one or more dependent variables (also known as 'response variables').
PERSON: Alan Ruttenberg
PERSON: Bjoern Peters
PERSON: Chris Stoeckert
experimental factor
independent variable
Web: http://en.wikipedia.org/wiki/Dependent_and_independent_variables
2009-03-16: work has been done on this term during the OBI workshop winter 2009 and the current definition was considered acceptable for use in OBI. If there is a need to modify this definition please notify OBI.
study factor
explanatory variable
factor
study design independent variable
dependent variable specification
In a study in which gene expression is measured in patients between 8 months to 4 years old that have mild or severe malaria and in which the hypothesis is that gene expression in that age group is a function of disease status, the gene expression is the dependent variable.
dependent variable specification is part of a study design. The dependent variable is the event studied and expected to change when the independent variable varies.
2/2/2009 In the design of experiments, independent variables are those whose values are controlled or selected by the person experimenting (experimenter) to determine its relationship to an observed phenomenon (the dependent variable). In such an experiment, an attempt is made to find evidence that the values of the independent variable determine the values of the dependent variable (that which is being measured). The independent variable can be changed as required, and its values do not represent a problem requiring explanation in an analysis, but are taken simply as given. The dependent variable on the other hand, usually cannot be directly controlled.
In the Philly 2013 workshop the label was chosen to distinguish it from "dependent variable" as used in statistical modelling. See: http://en.wikipedia.org/wiki/Statistical_modeling
PERSON: Alan Ruttenberg
PERSON: Bjoern Peters
PERSON: Chris Stoeckert
dependent variable
WEB: http://en.wikipedia.org/wiki/Dependent_and_independent_variables
2009-03-16: work has been done on this term during the OBI workshop winter 2009 and the current definition was considered acceptable for use in OBI. If there is a need to modify this definition please notify OBI.
response variable
study design dependent variable
survival rate
A measurement datum that represents the percentage of people or animals in a study or treatment group who are alive for a given period of time after diagnosis or initiation of monitoring.
Oliver He
adapted from wikipedia
http://en.wikipedia.org/wiki/Survival_rate
survival rate
multiple testing correction objective
Application of the Bonferroni correction
A multiple testing correction objective is a data transformation objective where the aim is to correct for a set of statistical inferences considered simultaneously
multiple comparison correction objective
http://en.wikipedia.org/wiki/Multiple_Testing_Correction
multiple testing correction objective
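The Bonferroni correction named in the example above is the simplest such correction; a minimal Python sketch (illustrative only, not part of the ontology):

```python
def bonferroni(p_values):
    """Bonferroni correction: multiply each p-value by the number of
    simultaneous tests, capping the adjusted value at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```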
material maintenance objective
An objective specification to maintain some or all of the qualities of a material over time.
PERSON: Bjoern Peters
PERSON: Bjoern Peters
material maintenance objective
primary structure of DNA macromolecule
a quality of a DNA molecule that inheres in its bearer due to the order of its DNA nucleotide residues.
placeholder for SO
BP et al
primary structure of DNA macromolecule
measurement device
A ruler, a microarray scanner, a Geiger counter.
A device in which a measure function inheres.
GROUP:OBI Philly workshop
OBI
measurement device
material maintenance
a process that achieves the objective to maintain some or all of the characteristics of an input material over time
material maintenance
polyA RNA extraction
An RNA extraction process typically involving the use of poly dT oligomers in which the desired output material is polyA RNA.
Person: Chris Stoeckert
Person: Jie Zheng
UPenn Group
polyA RNA extraction
Likelihood-ratio test
The likelihood-ratio test is a data transformation which tests whether there is evidence of the need to move from a simple model to a more complicated one (where the simple model is nested within the complicated one); it tests the goodness of fit between the two models.
date: March 2013
AGB and PRS provided a formal definition that expresses the test in terms of output and input, specifying the nature of the variables, the purpose of the test and the distribution used.
Alejandra Gonzales-Beltran
Philippe Rocca-Serra
Tina Boussard
lrtest()
http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/lmtest/html/lrtest.html
Likelihood-ratio test
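The lrtest() call cited above compares the log-likelihoods of the nested and the more complicated model. A minimal Python sketch of the statistic, with a closed-form chi-squared p-value that is valid only for one degree of freedom (an assumption of this sketch; general df needs an incomplete-gamma function):

```python
import math

def likelihood_ratio_test(loglik_simple, loglik_complex):
    """LR statistic 2 * (llComplex - llSimple); p-value from the
    chi-squared survival function, closed form for df = 1 via erfc."""
    lr = 2.0 * (loglik_complex - loglik_simple)
    p = math.erfc(math.sqrt(lr / 2.0)) if lr > 0 else 1.0
    return lr, p
```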
survival curve
A survival curve is a report graph which is a graphical representation of data where the percentage of survival is plotted as a function of time.
Alejandra Gonzalez-Beltran
PERSON:Chris Stoeckert
PERSON:James Malone
PERSON:Melanie Courtot
Philippe Rocca-Serra
WEB: http://www.graphpad.com/www/book/survive.htm
survival curve
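A common (though not the only) way to produce the percentage-survival-versus-time data plotted in such a curve is the Kaplan-Meier product-limit estimator; a minimal sketch, ignoring tie-breaking conventions between deaths and censorings at the same time:

```python
def kaplan_meier(times):
    """Product-limit survival estimate from (time, event) pairs, where
    event is True for a death and False for a censored observation.
    Returns [(time, S(time))] at each event time."""
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t, event in sorted(times):
        if event:
            surv *= 1.0 - 1.0 / at_risk  # survive this event time
            curve.append((t, surv))
        at_risk -= 1                      # leave the risk set either way
    return curve
```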
flow cytometry assay
Using a flow cytometer to quantitate the percent of CD3 positive cells in a population by labeling them with a FITC tagged anti-CD3 antibody.
A cytometry assay in which an input cell population is put in solution, is passed by a laser, and optical sensors are used to detect scattering of the laser light and/or fluorescence of specific markers to count and characterize the particles in solution.
IEDB
IEDB
flow cytometry assay
labeled specimen
A specimen that has been modified in order to be able to detect it in future experiments
added during call 3/1/2010
OBI group
labeled specimen
study intervention
the part of the execution of an intervention design study which is varied between two or more subjects in the study
PERSON: Bjoern Peters
GROUP: OBI
study intervention
material separation device
flow cytometer
A device with a separation function realized in a planned process
material separation device
categorical measurement datum
A measurement datum that is reported on a categorical scale
Bjoern Peters
nominal measurement datum
Bjoern Peters
categorical measurement datum
processed specimen
A tissue sample that has been sliced and stained for a histology study.
A blood specimen that has been centrifuged to obtain the white blood cells.
A specimen that has been intentionally physically modified.
Bjoern Peters
Bjoern Peters
A tissue sample that has been sliced and stained for a histology study.
processed specimen
categorical label
The labels 'positive' vs. 'negative', or 'left handed', 'right handed', 'ambidexterous', or 'strongly binding', 'weakly binding' , 'not binding', or '+++', '++', '+', '-' etc. form scales of categorical labels.
A label that is part of a categorical datum and that indicates the value of the data item on the categorical scale.
Bjoern Peters
Bjoern Peters
categorical label
in live cell assay
An assay in which a measurement is made by observing entities located in a live cell.
in live cell assay
container
A device that can be used to restrict the location of material entities over time
03/21/2010: Added to allow classification of children (similar to what we want to do for 'measurement device'). Looking at what classifies here, we may want to reconsider whether a contain function assigned to a part of an entity is necessarily also a function of the whole (e.g. is a centrifuge a container because it has test tubes as parts?)
PERSON: Bjoern Peters
container
device
A voltmeter is a measurement device which is intended to perform some measure function.
An autoclave is a device that sterilizes instruments or contaminated waste by applying high temperature and pressure.
A material entity that is designed to perform a function in a scientific investigation, but is not a reagent.
2012-12-17 JAO: In common lab usage, there is a distinction made between devices and reagents that is difficult to model. Therefore we have chosen to specifically exclude reagents from the definition of "device", and are enumerating the types of roles that a reagent can perform.
2013-6-5 MHB: The following clarifications are outcomes of the May 2013 Philly Workshop. Reagents are distinguished from devices that also participate in scientific techniques by the fact that reagents are chemical or biological in nature and necessarily participate in some chemical interaction or reaction during the realization of their experimental role. By contrast, devices do not participate in such chemical reactions/interactions. Note that there are cases where devices use reagent components during their operation, where the reagent-device distinction is less clear. For example:
(1) An HPLC machine is considered a device, but has a column that holds a stationary phase resin as an operational component. This resin qualifies as a device if it participates purely in size exclusion, but bears a reagent role that is realized in the running of a column if it interacts electrostatically or chemically with the evaluant. The container the resin is in (“the column”) considered alone is a device. So the entire column as well as the entire HPLC machine are devices that have a reagent as an operating part.
(2) A pH meter is a device, but its electrode component bears a reagent role in virtue of its interacting directly with the evaluant in execution of an assay.
(3) A gel running box is a device that has a metallic lead as a component that participates in a chemical reaction with the running buffer when a charge is passed through it. This metallic lead is considered to have a reagent role as a component of this device realized in the running of a gel.
In the examples above, a reagent is an operational component of a device, but the device itself does not realize a reagent role (as bearing a reagent role is not transitive across the part_of relation). In this way, the asserted disjointness between a reagent and device holds, as both roles are never realized in the same bearer during execution of an assay.
PERSON: Helen Parkinson
instrument
OBI development call 2012-12-17.
device
sequence data
example of usage: the representation of a nucleotide sequence in FASTA format used for a sequence similarity search.
A measurement datum representing the primary structure of a macromolecule (its sequence), sometimes associated with an indicator of confidence of that measurement.
Person:Chris Stoeckert
GROUP: OBI
sequence data
dose
An organism has been injected with 1 ml of vaccine
A measurement datum that measures the quantity of something that may be administered to an organism or that an organism may be exposed to. Quantities of nutrients, drugs, vaccines and toxins are referred to as doses.
dose
nucleic acid extract
An extract that is the output of an extraction process in which nucleic acid molecules are isolated from a specimen.
PERSON: Jie Zheng
UPenn Group
nucleic acid extract
light emission device
A light source is an optical subsystem that provides light for use in a distant area using a delivery system (e.g., fiber optics)
a device which has a function to emit light.
Person:Helen Parkinson
OBI
light emission device
environmental control device
A growth chamber is an environmental control device.
An environmental control device is a device which has the function to control some aspect of the environment such as temperature, or humidity.
Helen Parkinson
OBI
environmental control device
labeled nucleic acid extract
a labeled specimen that is the output of a labeling process and has grain labeled nucleic acid for detection of the nucleic acid in future experiments.
Person: Jie Zheng
labeled extract
MO_221 labeledExtract
labeled extract
labeled nucleic acid extract
dose response curve
A data item of paired values, one indicating the dose of a material, the other quantitating a measured effect at that dose. The dosing intervals are chosen so that effect values can be interpolated by plotting a curve.
Bjoern Peters; Randi Vita
Philippe Rocca-Serra, Alejandra Gonzalez-Beltran
dose response curve
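The interpolation of effect values between measured doses described in the definition can be sketched as simple linear interpolation over the (dose, effect) pairs (illustrative only; real dose-response fitting typically uses a sigmoidal model):

```python
def interpolate_effect(curve, dose):
    """Linear interpolation of the measured effect at an intermediate
    dose, given (dose, effect) pairs sorted by ascending dose."""
    for (d0, e0), (d1, e1) in zip(curve, curve[1:]):
        if d0 <= dose <= d1:
            frac = (dose - d0) / (d1 - d0)
            return e0 + frac * (e1 - e0)
    raise ValueError("dose outside the measured range")
```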
genetic population background information
genotype information 'C57BL/6J Hnf1a+/-' in this case, C57BL/6J is the genetic population background information
a genetic characteristics information which is a part of genotype information that identifies the population of organisms
proposed and discussed on San Diego OBI workshop, March 2011
Group: OBI group
Group: OBI group
genetic population background information
FWER adjusted p-value
http://ugrad.stat.ubc.ca/R/library/LPE/html/mt.rawp2adjp.html
A quantitative confidence value resulting from a multiple testing error correction method which adjusts the p-value used as input to control for Type I error in the context of multiple pairwise tests
Addition of restriction 'output of null hypothesis testing' and specified output by AGB and PRS while working on STATO
PERS:Philippe Rocca-Serra
adapted from wikipedia (http://en.wikipedia.org/wiki/Familywise_error_rate)
Family-wise type I error rate
FWER adjusted p-value
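The mt.rawp2adjp page linked above includes the Holm step-down procedure among the FWER-controlling adjustments; a minimal Python sketch of Holm adjustment (illustrative only, not part of the ontology):

```python
def holm_adjust(p_values):
    """Holm step-down FWER adjustment: sort p-values ascending, multiply
    the i-th smallest by (m - i), enforce monotonicity, cap at 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, p_values[idx] * (m - rank)))
        adjusted[idx] = running_max
    return adjusted
```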
RNA-seq assay
An assay in which sequencing technology (e.g. Solexa/454) is used to generate RNA sequence, analyze the transcribed regions of the genome, and/or quantitate transcript abundance
PERSON: James Malone
transcription profiling by high throughput sequencing
EFO_0002770 transcription profiling by high throughput sequencing
JZ: should be inferred as 'DNA sequencing'. Will check in the future.
an assay that uses high-throughput sequencing technologies to sequence cDNA in order to get information about a sample's RNA content. RNA-Seq provides researchers with efficient ways to measure transcriptome data experimentally, allowing them to get information such as how different alleles of a gene are expressed, detect post-transcriptional mutations or identify gene fusions.
WEB: http://en.wikipedia.org/wiki/RNA-Seq
RNA-seq assay
genotype information
Genotype information can be: Mus musculus wild type (in this case the genetic population background information is Mus musculus), or C57BL/6J Hnf1a+/- (in this case, C57BL/6J is the genetic population background information and Hnf1a+/- is the allele information)
a genetic characteristics information that is about the genetic material of an organism and minimally includes information about the genetic background and can in addition contain information about specific alleles, genetic modifications, etc.
discussed on San Diego OBI workshop, March 2011
Group: OBI group
Group: OBI group
genotype information
transcription profiling identification objective
A molecular feature identification objective that aims to characterize the abundance of transcripts
Person: Chris Stoeckert, Jie Zheng
Group: Penn Group
transcription profiling identification objective
allele information
genotype information 'C57BL/6J Hnf1a+/-' in this case, Hnf1a+/- is the allele information
a genetic alteration information that is about one of two or more alternative forms of a gene or marker sequence, differing from other alleles at one or more mutational sites based on sequence. Polymorphisms are included in this definition.
discussed on San Diego OBI workshop, March 2011
Person: Chris Stoeckert, Jie Zheng
MO_58 Allele
allele information
genetic alteration information
a genetic characteristics information that is about known changes or the lack thereof from the genetic background, including allele information, duplication, insertion, deletion, etc.
proposed and discussed on San Diego OBI workshop, March 2011
Group: OBI group
Group: OBI group
genetic alteration information
genetic characteristics information
a data item that is about genetic material including polymorphisms, disease alleles, and haplotypes.
Person: Chris Stoeckert, Jie Zheng
MO_66 IndividualGeneticCharacteristics
MO definition:
The genotype of the individual organism from which the biomaterial was derived. Individual genetic characteristics include polymorphisms, disease alleles, and haplotypes.
examples in ArrayExpress
wild_type
MutaMouse (CD2F1 mice with lambda-gt10LacZ integration)
AlfpCre; SNF5 flox/knockout
p53 knock out
C57Bl/6 gp130lox/lox MLC2vCRE/+
fer-15; fem-1
df/df
pat1-114/pat1-114 ade6-M210/ade6-M216 h+/h+ (cells are diploid)
genetic characteristics information
q-value
PMID: 20483222. Comp Biochem Physiol Part D Genomics Proteomics. 2008 Sep;3(3):234-42. Analysis of Sus scrofa liver proteome and identification of proteins differentially expressed between genders, and conventional and genetically enhanced lines.
"After controlling the false discovery rate (FDR</=0.1) using the Storey q value only four proteins (EPHX1, CAT, PAH, ST13) were shown to be differentially expressed between genders (Males/Females) and two proteins (SELENBP2, TAGLN) were differentially expressed between two lines (Transgenic/Conventional pigs)"
A quantitative confidence value that measures the minimum false discovery rate that is incurred when calling that test significant.
To compute q-values, it is necessary to know the p-value produced by a test and possibly set a false discovery rate level.
Addition of restriction 'output of null hypothesis testing' by AGB and PRS while working on STATO
PERS:Philippe Rocca-Serra
FDR adjusted p-value
Adapted from several sources, including
http://en.wikipedia.org/wiki/False_discovery_rate
http://svitsrv25.epfl.ch/R-doc/library/qvalue.html
q
q-value
genotyping design
A study design that classifies an individual or group of individuals on the basis of alleles, haplotypes, SNPs.
Person: Chris Stoeckert, Jie Zheng
MO_560 genotyping_design
genotyping design
specimen from organism
A specimen that derives from an anatomical part or substance arising from an organism. Examples of tissue specimen include tissue, organ, physiological system, blood, or body location (arm).
PERSON: Chris Stoeckert, Jie Zheng
tissue specimen
MO_954 organism_part
specimen from organism
fluorescence detection assay
Using a laser to stimulate a cell culture that was previously labeled with fluorescent antibodies to detect light emission at a different wavelength in order to determine the presence of surface markers the antibodies are specific for.
An assay in which a material's fluorescence is determined.
IEDB
IEDB
fluorescence detection assay
rate measurement datum
The rate of dissociation of a peptide from a complex with an MHC molecule measured by the ratio of bound and unbound peptide per unit of time.
A scalar measurement datum that represents the number of events occurring over a time interval
PERSON: Bjoern Peters, Randi Vita
IEDB
rate measurement datum
DNA sequence data
The part of a FASTA file that contains the letters ACTGGGAA
A sequence data item that is about the primary structure of DNA
OBI call; Bjoern Peters
OBI call; Melanie Courtot
8/29/11 call: This is added after a request from Melanie and Yu. They should review it further. This should be a child of 'sequence data', and as of the current definition will infer there.
DNA sequence data
selection criterion
rats should be aged between 6 and 8 weeks and weigh between 180 and 250 grams
A directive information entity which defines and states a principle or standard by which a selection process may take place.
Person: Philippe Rocca-Serra
selection rule
OBI discussion summarized under the following tracker item : http://sourceforge.net/p/obi/obi-terms/678/
selection criterion
drawing a conclusion
Concluding that the length of the hypotenuse is equal to the square root of the sum of squares of the other two sides in a right-triangle.
Concluding that a gene is upregulated in a tissue sample based on the band intensity in a western blot. Concluding that a patient has an infection based on measurement of an elevated body temperature and reported headache. Concluding that there were problems in an investigation because data from PCR and microarray are conflicting.
A planned process in which new information is inferred from existing information.
drawing a conclusion
assay array
A device made to be used in an analyte assay for immobilization of substances that bind the analyte at regular spatial positions on a surface.
PERSON: Chris Stoeckert, Jie Zheng, Alan Ruttenberg
Penn Group
assay array
conclusion based on data
The conclusion that a gene is upregulated in a tissue sample based on the band intensity in a western blot. The conclusion that a patient has an infection based on measurement of an elevated body temperature and reported headache. The conclusion that there were problems in an investigation because data from PCR and microarray are conflicting.
The following are NOT conclusions based on data: data themselves; results from pure mathematics, e.g. "13 is prime".
An information content entity that is inferred from data.
In the Philly 2013 workshop, we recognized the limitations of "conclusion textual entity", and we introduced this as more general. The need for the 'textual entity' term going forward is up for future debate.
Group:2013 Philly Workshop group
Group:2013 Philly Workshop group
conclusion based on data
cell freezing medium
A processed material that serves as a liquid vehicle for freezing cells for long-term quiescent storage, which contains chemicals needed to sustain cell viability across freeze-thaw cycles.
PERSON: Matthew Brush
cell freezing medium
categorical value specification
A value specification that specifies one category out of a fixed number of nominal categories
PERSON:Bjoern Peters
categorical value specification
1
1
scalar value specification
A value specification that consists of two parts: a numeral and a unit label
PERSON:Bjoern Peters
scalar value specification
value specification
The value of 'positive' in a classification scheme of "positive or negative"; the value of '20g' on the quantitative scale of mass.
An information content entity that specifies a value within a classification scheme or on a quantitative scale.
This term is currently a descendant of 'information content entity', which requires that it 'is about' something. A value specification of '20g' for a measurement data item of the mass of a particular mouse 'is about' the mass of that mouse. However there are cases where a value specification is not clearly about any particular. In the future we may change 'value specification' to remove the 'is about' requirement.
PERSON:Bjoern Peters
value specification
molecular-labeled material
a material entity that is the specified output of an addition of molecular label process that aims to label some molecular target to allow for its detection in a detection of molecular label assay
PERSON:Matthew Brush
OBI developer call, 3-12-12
molecular-labeled material
cytometry assay
An intracellular material detection by flow cytometry assay measuring perforin inside a culture of T cells.
An assay that measures properties of cells.
IEDB
IEDB
cytometry assay
physical store
a freezer. a humidity controlled box.
A container with an environmental control function.
For details see tracker item: http://sourceforge.net/p/obi/obi-terms/793/
Chris Stoeckert
Duke Biobank, OBIB
Biobank
physical store
measurand role
A role borne by a material entity and realized in an assay which achieves the objective to measure the magnitude/concentration/amount of the measurand in the entity bearing evaluant role.
Person: Alan Ruttenberg, Jie Zheng
https://en.wiktionary.org/wiki/measurand
https://github.com/obi-ontology/obi/issues/778
measurand role
organism
animal
fungus
plant
virus
A material entity that is an individual living system, such as animal, plant, bacteria or virus, that is capable of replicating or reproducing, growth and maintenance in the right environment. An organism may be unicellular or made up, like humans, of many billions of cells divided into specialized tissues and organs.
10/21/09: This is a placeholder term, that should ideally be imported from the NCBI taxonomy, but the high level hierarchy there does not suit our needs (includes plasmids and 'other organisms')
13-02-2009:
OBI doesn't take position as to when an organism starts or ends being an organism - e.g. sperm, foetus.
This issue is outside the scope of OBI.
GROUP: OBI Biomaterial Branch
WEB: http://en.wikipedia.org/wiki/Organism
organism
specimen
Biobanking of blood taken and stored in a freezer for potential future investigations stores specimen.
A material entity that has the specimen role.
Note: definition is in specimen creation objective which is defined as an objective to obtain and store a material entity for potential use as an input during an investigation.
PERSON: James Malone
PERSON: Philippe Rocca-Serra
GROUP: OBI Biomaterial Branch
specimen
cultured cell population
A cultured cell population applied in an experiment: "293 cells expressing TrkA were serum-starved for 18 hours and then neurotrophins were added for 10 min before cell harvest." (Lee, Ramee, et al. "Regulation of cell survival by secreted proneurotrophins." Science 294.5548 (2001): 1945-1948).
A cultured cell population maintained in vitro: "Rat cortical neurons from 15 day embryos are grown in dissociated cell culture and maintained in vitro for 8–12 weeks" (Dichter, Marc A. "Rat cortical neurons in cell culture: culture methods, cell morphology, electrophysiology, and synapse formation." Brain Research 149.2 (1978): 279-293).
A processed material comprised of a collection of cultured cells that has been continuously maintained together in culture and shares a common propagation history.
2013-6-5 MHB: This OBI class was formerly called 'cell culture', but label changed and definition updated following CLO alignment efforts in spring 2013, during which the intent of this class was clarified to refer to portions of a culture or line rather than a complete cell culture or line.
PERSON:Matthew Brush
cell culture sample
PERSON:Matthew Brush
The extent of a 'cultured cell population' is restricted only in that all cell members must share a propagation history (ie be derived through a common lineage of passages from an initial culture). In being defined in this way, this class can be used to refer to the populations that researchers actually use in the practice of science - ie are the inputs to culturing, experimentation, and sharing. The cells in such populations will be a relatively uniform population as they have experienced similar selective pressures due to their continuous co-propagation. And this population will also have a single passage number, again owing to their common passaging history. Cultured cell populations represent only a collection of cells (ie do not include media, culture dishes, etc), and include populations of cultured unicellular organisms or cultured multicellular organism cells. They can exist under active culture, stored in a quiescent state for future use, or applied experimentally.
cultured cell population
screening library
PMID: 15615535.J Med Chem. 2004 Dec 30;47(27):6864-74.A screening library for peptide activated G-protein coupled receptors. 1. The test set. [cdna_library, phage display library]
a screening library is a collection of materials engineered to identify qualities of a subset of its members during a screening process
PRS: 22-02-2008: while working on definition of cDNA library and looking at current example of usage, a screening library should be a defined class -> any material library which has input_role in a screening protocol application
change biomaterial to material in definition
PERSON: Bjoern Peters
GROUP: IEDB
7/13/09: Need to clarify if this meets reagent role definition
screening library
data transformation
The application of a clustering protocol to microarray data or the application of a statistical testing method on a primary data set to determine a p-value.
A planned process that produces output data from input data.
Elisabetta Manduchi
Helen Parkinson
James Malone
Melanie Courtot
Philippe Rocca-Serra
Richard Scheuermann
Ryan Brinkman
Tina Hernandez-Boussard
data analysis
data processing
Branch editors
data transformation
differential expression analysis objective
Analyses implemented by the SAM (http://www-stat.stanford.edu/~tibs/SAM), PaGE (www.cbil.upenn.edu/PaGE) or GSEA (www.broad.mit.edu/gsea/) algorithms and software
A differential expression analysis objective is a data transformation objective whose input consists of expression levels of entities (such as transcripts or proteins), or of sets of such expression levels, under two or more conditions and whose output reflects which of these are likely to have different expression across such conditions.
Elisabetta Manduchi
PERSON: Elisabetta Manduchi
differential expression analysis objective
Benjamini and Hochberg false discovery rate correction method
Statistical significance of the 8 most represented biological processes (GO level 4) among E7 6 month upregulated genes following analysis with DAVID software; Benjamini-Hochberg FDR (false discovery rate)
A data transformation in which the Benjamini and Hochberg sequential p-value procedure is applied with the aim of controlling the false discovery rate
2011-03-31: [PRS].
specified input and output of dt which were missing
Helen Parkinson
Philippe Rocca-Serra
Helen Parkinson
Benjamini and Hochberg false discovery rate correction method
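As an illustration only (not part of the ontology), the Benjamini and Hochberg sequential p-value procedure can be sketched in Python; the p-values below are made up:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values.

    Sort p-values ascending, scale the i-th (1-based) by m/i,
    then enforce monotonicity from the largest p-value down.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):        # walk from largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]       # hypothetical p-values
adj = bh_adjust(pvals)
# call a test significant at FDR level 0.05 if its adjusted p-value <= 0.05
significant = [p <= 0.05 for p in adj]
```

Calling the tests with adjusted p-value below a chosen level controls the expected proportion of false discoveries among the rejections, as described above.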
k-means clustering
A k-means clustering is a data transformation which achieves a class discovery or partitioning objective, which takes as input a collection of objects (represented as points in multidimensional space) and which partitions them into a specified number k of clusters. The algorithm attempts to find the centers of natural clusters in the data. The most common form of the algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic data. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters, and the algorithm repeated by alternate applications of these two steps until convergence, which is obtained when the points no longer switch clusters (or alternatively centroids are no longer changed).
Elisabetta Manduchi
James Malone
Philippe Rocca-Serra
WEB: http://en.wikipedia.org/wiki/K-means
k-means clustering
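The two alternating steps in the definition above (assign each point to the closest centroid, then recompute centroids as cluster means) can be sketched as a minimal Python illustration on made-up 2-D points:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means on 2-D points: alternate the assignment and
    centroid-update steps until no point switches cluster."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))    # heuristic initial centroids
    assignment = None
    for _ in range(iters):
        # assignment step: each point joins its closest centroid
        new_assignment = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:       # convergence: no switches
            break
        assignment = new_assignment
        # update step: recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return assignment, centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(pts, 2)   # two well-separated groups of three
```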
hierarchical clustering
A hierarchical clustering is a data transformation which achieves a class discovery objective, which takes a data item as input and builds a hierarchy of clusters. The traditional representation of this hierarchy is a tree (visualized by a dendrogram), with the individual input objects at one end (leaves) and a single cluster containing every object at the other (root).
James Malone
WEB: http://en.wikipedia.org/wiki/Data_clustering#Hierarchical_clustering
hierarchical clustering
average linkage hierarchical clustering
An average linkage hierarchical clustering is an agglomerative hierarchical clustering which generates successive clusters based on a distance measure, where the distance between two clusters is calculated as the average distance between objects from the first cluster and objects from the second cluster.
Elisabetta Manduchi
PERSON: Elisabetta Manduchi
average linkage hierarchical clustering
complete linkage hierarchical clustering
an agglomerative hierarchical clustering which generates successive clusters based on a distance measure, where the distance between two clusters is calculated as the maximum distance between objects from the first cluster and objects from the second cluster.
Elisabetta Manduchi
PERSON: Elisabetta Manduchi
complete linkage hierarchical clustering
single linkage hierarchical clustering
A single linkage hierarchical clustering is an agglomerative hierarchical clustering which generates successive clusters based on a distance measure, where the distance between two clusters is calculated as the minimum distance between objects from the first cluster and objects from the second cluster.
Elisabetta Manduchi
PERSON: Elisabetta Manduchi
single linkage hierarchical clustering
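The three linkage criteria defined above (average, complete, single) differ only in how the distance between two clusters is derived from pairwise object distances; a minimal Python illustration, with made-up example points:

```python
def euclid(p, q):
    """Euclidean distance between two 2-D points."""
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def single_link(A, B):    # minimum pairwise distance between the clusters
    return min(euclid(a, b) for a in A for b in B)

def complete_link(A, B):  # maximum pairwise distance between the clusters
    return max(euclid(a, b) for a in A for b in B)

def average_link(A, B):   # mean of all pairwise distances
    return sum(euclid(a, b) for a in A for b in B) / (len(A) * len(B))

A = [(0, 0), (0, 2)]      # hypothetical cluster 1
B = [(3, 0), (5, 0)]      # hypothetical cluster 2
```

An agglomerative algorithm would repeatedly merge the two clusters with the smallest such distance until a single root cluster remains.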
Benjamini and Yekutieli false discovery rate correction method
The expression set was compared univariately between the stroke patients and controls, gene list was generated using False Discovery Rate correction (Benjamini and Yekutieli)
A data transformation in which the Benjamini and Yekutieli method is applied with the aim of correcting false discovery rate
2011-03-31: [PRS].
specified input and output of dt which were missing
Helen Parkinson
Philippe Rocca-Serra
Helen Parkinson
Benjamini and Yekutieli false discovery rate correction method
dimensionality reduction
A dimensionality reduction is a data partitioning which transforms each input m-dimensional vector (x_1, x_2, ..., x_m) into an output n-dimensional vector (y_1, y_2, ..., y_n), where n is smaller than m.
Elisabetta Manduchi
James Malone
Melanie Courtot
Philippe Rocca-Serra
data projection
PERSON: Elisabetta Manduchi
PERSON: James Malone
PERSON: Melanie Courtot
dimensionality reduction
principal components analysis dimensionality reduction
A principal components analysis dimensionality reduction is a dimensionality reduction achieved by applying principal components analysis and by keeping low-order principal components and excluding higher-order ones.
Elisabetta Manduchi
James Malone
Melanie Courtot
Philippe Rocca-Serra
pca data reduction
PERSON: Elisabetta Manduchi
PERSON: James Malone
PERSON: Melanie Courtot
principal components analysis dimensionality reduction
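As an illustrative sketch (not part of the ontology), keeping only the first principal component of 2-D data can be done in pure Python via the closed-form eigenvector of the 2x2 covariance matrix; the input points are made up:

```python
import math

def pca_1d(points):
    """Project 2-D points onto their first principal component
    (closed-form top eigenvector of the 2x2 covariance matrix)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # covariance matrix entries
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # largest eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # corresponding (unnormalized) eigenvector, then normalize
    if sxy != 0:
        vx, vy = lam - syy, sxy
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # scores: coordinates of each centered point along the component
    return [(p[0] - mx) * vx + (p[1] - my) * vy for p in points]

scores = pca_1d([(0, 0), (1, 1), (2, 2), (3, 3)])   # collinear toy data
```

Here m = 2 dimensions are reduced to n = 1; for collinear points the single retained component captures all of the variance.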
Holm-Bonferroni family-wise error rate correction method
t-tests were used with the type I error adjusted for multiple comparisons, Holm's correction (HOLM 1979), and false discovery rate, http://www.genetics.org/cgi/content/full/172/2/1179
a data transformation that performs more than one hypothesis test simultaneously, using a closed-test procedure that controls the family-wise error rate for all k hypotheses at level α in the strong sense. Objective: multiple testing correction
2011-03-14: [PRS]. Class Label has been changed to address the conflict with the definition
Also added restriction to specify the output to be a FWER adjusted p-value
The 'editor preferred term' should be removed
Person:Helen Parkinson
Philippe Rocca-Serra
WEB: http://en.wikipedia.org/wiki/Holm%E2%80%93Bonferroni_method
Bonferroni adjustment method
Holm-Bonferroni family-wise error rate correction method
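For illustration only (not part of the ontology), the Holm step-down procedure can be sketched in Python as adjusted p-values; the input p-values are made up:

```python
def holm_adjust(pvals):
    """Holm-Bonferroni FWER-adjusted p-values: sort ascending, multiply
    the i-th (1-based) by (m - i + 1), enforce monotone non-decreasing
    order along the sorted list, and cap at 1."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order, start=1):
        running_max = max(running_max, pvals[i] * (m - rank + 1))
        adjusted[i] = min(running_max, 1.0)
    return adjusted

# reject hypotheses with adjusted p <= alpha to control FWER at alpha
adj = holm_adjust([0.01, 0.04, 0.03, 0.005])
```

Unlike the plain Bonferroni adjustment (which multiplies every p-value by m), the step-down scheme is uniformly at least as powerful while still controlling the family-wise error rate.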
family wise error rate correction method
A family wise error rate correction method is a multiple testing procedure that controls the probability of at least one false positive.
2011-03-31: [PRS].
creating a defined class by specifying the necessary output of dt
allows correct classification of FWER dt
Monnie McGee
Philippe Rocca-Serra
FWER correction
Dudoit, Sandrine and van der Laan, Mark J. (2008) Multiple Testing Procedures with Applications to Genomics. New York: Springer , p. 19
family wise error rate correction method
descriptive statistical calculation objective
A descriptive statistical calculation objective is a data transformation objective which concerns any calculation intended to describe a feature of a data set, for example, its center or its variability.
Elisabetta Manduchi
James Malone
Melanie Courtot
Monnie McGee
PERSON: Elisabetta Manduchi
PERSON: James Malone
PERSON: Melanie Courtot
PERSON: Monnie McGee
descriptive statistical calculation objective
survival analysis objective
Kaplan meier data transformation
A data transformation objective that aims to model time-to-event data (where events are e.g. death or disease recurrence); the purpose of survival analysis is to model the underlying distribution of event times and to assess the dependence of the event time on other explanatory variables
PERSON: James Malone
PERSON: Tina Boussard
survival analysis
http://en.wikipedia.org/wiki/Survival_analysis
survival analysis objective
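The Kaplan-Meier product-limit estimator mentioned above can be sketched in Python as a minimal illustration (the event times and censoring flags are made up):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimator of the survival function S(t).
    times: event/censoring times; events: 1 = event observed, 0 = censored.
    Returns (time, S(t)) steps at each distinct event time."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        if d > 0:
            survival *= 1 - d / at_risk   # S(t) = prod over event times of (1 - d_i / n_i)
            curve.append((t, survival))
        at_risk -= sum(1 for ti in times if ti == t)
    return curve

# hypothetical cohort of 5 subjects, one censored at t = 4
curve = kaplan_meier([1, 2, 4, 5, 6], [1, 1, 0, 1, 1])
```

Censored subjects leave the risk set without contributing an event, which is how the estimator handles incomplete follow-up.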
multiple testing correction method
A multiple testing correction method is a procedure in which hypothesis tests are performed simultaneously on M > 1 hypotheses. Multiple testing procedures produce a set of rejected hypotheses that is an estimate of the set of false null hypotheses while controlling a suitably defined Type I error rate
Monnie McGee
multiple testing procedure
PAPER: Dudoit, Sandrine and van der Laan, Mark J. (2008) Multiple Testing Procedures with Applications to Genomics. New York: Springer , p. 9-10.
multiple testing correction method
logarithmic transformation
A logarithmic transformation is a data transformation consisting in the application of the logarithm function with a given base a (where a>0 and a is not equal to 1) to a (one dimensional) positive real number input. The logarithm function with base a can be defined as the inverse of the exponential function with the same base. See e.g. http://en.wikipedia.org/wiki/Logarithm.
Elisabetta Manduchi
WEB: http://en.wikipedia.org/wiki/Logarithm
logarithmic transformation
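A short illustrative sketch (not part of the ontology) of the transformation just defined, applied to made-up positive values:

```python
import math

def log_transform(values, base=10):
    """Apply the logarithm with the given base (base > 0, base != 1)
    to each positive input value."""
    return [math.log(v, base) for v in values]

# log10 maps a wide multiplicative range onto an additive scale
logs = log_transform([1, 10, 100, 1000])   # ≈ [0.0, 1.0, 2.0, 3.0]
```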
regression analysis method
Regression analysis is a descriptive statistics technique that examines the relation of a dependent variable (response variable) to specified independent variables (explanatory variables). Regression analysis can be used as a descriptive method of data analysis (such as curve fitting) without relying on any assumptions about underlying processes generating the data.
Date:2013-11-15
Person: AGB,PRS
Adding restrictions, specifying model + parameter estimation process
change of label from 'regression analysis method' to 'regression analysis'
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Tina Hernandez-Boussard
BOOK: Richard A. Berk, Regression Analysis: A Constructive Critique, Sage Publications (2004) 978-0761929048
regression analysis
regression analysis method
principal component regression
The Principal Component Regression method is a regression analysis method that combines the Principal Component Analysis (PCA) spectral decomposition with an Inverse Least Squares (ILS) regression method to create a quantitative model for complex samples. Unlike quantitation methods based directly on Beer's Law, which attempt to calculate the absorptivity coefficients for the constituents of interest from a direct regression of the constituent concentrations onto the spectroscopic responses, the PCR method regresses the concentrations on the PCA scores.
Tina Hernandez-Boussard
WEB: http://www.thermo.com/com/cda/resources/resources_detail/1,2166,13414,00.html
principal component regression
data visualization
Generation of a heatmap from a microarray dataset
A planned process that creates images, diagrams or animations from the input data.
Elisabetta Manduchi
James Malone
Melanie Courtot
Tina Boussard
data encoding as image
visualization
PERSON: Elisabetta Manduchi
PERSON: James Malone
PERSON: Melanie Courtot
PERSON: Tina Boussard
Possible future hierarchy might include this:
information_encoding
>data_encoding
>>image_encoding
data visualization
mode calculation
A mode calculation is a descriptive statistics calculation in which the mode is calculated which is the most common value in a data set. It is most often used as a measure of center for discrete data.
James Malone
Monnie McGee
PERSON: James Malone
PERSON: Monnie McGee
From Monnie's file comments - need to add center_calculation role but it doesn't exist yet - (editor note added by James Jan 2008)
mode calculation
median calculation
A median calculation is a descriptive statistics calculation in which the midpoint of the data set (the 0.5 quantile) is calculated. First, the observations are sorted in increasing order. For an odd number of observations, the median is the middle value of the sorted data. For an even number of observations, the median is the average of the two middle values.
James Malone
Monnie McGee
PERSON: James Malone
PERSON: Monnie McGee
From Monnie's file comments - need to add center_calculation role but it doesn't exist yet - (editor note added by James Jan 2008)
median calculation
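The procedure in the definition above (sort, then take the middle value or the average of the two middle values) can be written directly; a minimal Python sketch:

```python
def median(values):
    """Median per the definition above: sort ascending; the middle value
    for an odd number of observations, the average of the two middle
    values for an even number."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

m_odd = median([7, 1, 3])      # middle of sorted [1, 3, 7]
m_even = median([7, 1, 3, 5])  # average of 3 and 5
```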
agglomerative hierarchical clustering
An agglomerative hierarchical clustering is a hierarchical clustering which starts with separate clusters and then successively combines these clusters until there is only one cluster remaining.
Elisabetta Manduchi
James Malone
bottom-up hierarchical clustering
PERSON: Elisabetta Manduchi
agglomerative hierarchical clustering
divisive hierarchical clustering
A divisive hierarchical clustering is a hierarchical clustering which starts with a single cluster and then successively splits resulting clusters until only clusters of individual objects remain.
Elisabetta Manduchi
James Malone
top-down hierarchical clustering
PERSON: Elisabetta Manduchi
divisive hierarchical clustering
false discovery rate correction method
A false discovery rate correction method is a data transformation used in multiple hypothesis testing to correct for multiple comparisons. It controls the expected proportion of incorrectly rejected null hypotheses (type I errors) in a list of rejected hypotheses. It is a less conservative procedure than family-wise error rate (FWER) control and has greater power, at the cost of increasing the likelihood of obtaining type I errors.
2011-03-31: [PRS].
creating a defined class by specifying the necessary output of dt
allows correct classification of FDR dt
Monnie McGee
Philippe Rocca-Serra
FDR correction method
Dudoit, Sandrine and van der Laan, Mark J. (2008) Multiple Testing Procedures with Applications to Genomics. New York: Springer , p. 21 and http://www.wikidoc.org/index.php/False_discovery_rate
false discovery rate correction method
data transformation objective
normalize objective
An objective specification to transform input data into output data
Modified definition in 2013 Philly OBI workshop
James Malone
PERSON: James Malone
data transformation objective
data normalization objective
Quantile transformation which has normalization objective can be used for expression microarray assay normalization and it is referred to as "quantile normalization", according to the procedure described e.g. in PMID 12538238.
A normalization objective is a data transformation objective where the aim is to remove systematic sources of variation to put the data on equal footing in order to create a common base for comparisons.
Elisabetta Manduchi
Helen Parkinson
James Malone
PERSON: Elisabetta Manduchi
PERSON: Helen Parkinson
PERSON: James Malone
data normalization objective
correction objective
Type I error correction
A correction objective is a data transformation objective where the aim is to correct for error, noise or other impairments to the input of the data transformation or derived from the data transformation itself
James Malone
PERSON: James Malone
PERSON: Melanie Courtot
correction objective
normalization data transformation
A normalization data transformation is a data transformation that has objective normalization.
James Malone
PERSON: James Malone
normalization data transformation
averaging data transformation
An averaging data transformation is a data transformation that has objective averaging.
James Malone
PERSON: James Malone
averaging data transformation
partitioning data transformation
A partitioning data transformation is a data transformation that has objective partitioning.
James Malone
PERSON: James Malone
partitioning data transformation
partitioning objective
A k-means clustering which has partitioning objective is a data transformation in which the input data is partitioned into k output sets.
A partitioning objective is a data transformation objective where the aim is to generate a collection of disjoint non-empty subsets whose union equals a non-empty input set.
Elisabetta Manduchi
James Malone
PERSON: Elisabetta Manduchi
partitioning objective
class discovery data transformation
A class discovery data transformation (sometimes called unsupervised classification) is a data transformation that has objective class discovery.
James Malone
clustering data transformation
unsupervised classification data transformation
PERSON: James Malone
class discovery data transformation
center calculation objective
A mean calculation which has center calculation objective is a data transformation in which the center of the input data is discovered through the calculation of a mean average.
A center calculation objective is a data transformation objective where the aim is to calculate the center of an input data set.
James Malone
PERSON: James Malone
center calculation objective
class discovery objective
A class discovery objective (sometimes called unsupervised classification) is a data transformation objective where the aim is to organize input data (typically vectors of attributes) into classes, where the number of classes and their specifications are not known a priori. Depending on usage, the class assignment can be definite or probabilistic.
James Malone
clustering objective
discriminant analysis objective
unsupervised classification objective
PERSON: Elisabetta Manduchi
PERSON: James Malone
class discovery objective
center calculation data transformation
A center calculation data transformation is a data transformation that has objective of center calculation.
James Malone
PERSON: James Malone
center calculation data transformation
descriptive statistical calculation data transformation
A descriptive statistical calculation data transformation is a data transformation that has objective descriptive statistical calculation and which concerns any calculation intended to describe a feature of a data set, for example, its center or its variability.
James Malone
PERSON: James Malone
descriptive statistical calculation data transformation
error correction objective
Application of a multiple testing correction method
An error correction objective is a data transformation objective where the aim is to remove (correct for) erroneous contributions arising from the input data, or the transformation itself.
James Malone, Helen Parkinson
PERSON: James Malone
error correction objective
gene list visualization
A data visualization which has input of a gene list and produces an output of a report graph which is capable of rendering data of this type.
James Malone
gene list visualization
survival analysis data transformation
A data transformation which has the objective of performing survival analysis.
James Malone
PERSON: James Malone
survival analysis data transformation
chi square test
The chi-square test is a data transformation with the objective of statistical hypothesis testing, in which the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true, or any in which this is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough.
under negotiation with OBI, hence definition and definition source are missing from this class
PERSON: James Malone
PERSON: Tina Boussard
chi square test
1
1
true
true
2
1
ANOVA
ANOVA, or analysis of variance, is a data transformation in which a statistical test is performed of whether the means of several groups are all equal.
AGB and PRS augmented the class with formal definitions as part of STATO extension
Alejandra Gonzalez-Beltran
James Malone
Philippe Rocca-Serra
Analysis of Variance
stat.anova()
ANOVA
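As an illustrative aside (not part of the ontology annotations), the one-way ANOVA F statistic can be derived from the between-group and within-group sums of squares. The function name and the toy data below are hypothetical:

```python
# Sketch: one-way ANOVA F statistic from first principles, testing
# whether the means of several groups are all equal.
def one_way_anova(groups):
    n = sum(len(g) for g in groups)          # total observations
    k = len(groups)                          # number of groups
    grand = sum(sum(g) for g in groups) / n  # grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f = one_way_anova([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
print(round(f, 4))  # 7.0
```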
observation design
PMID: 12387964. Lancet. 2002 Oct 12;360(9340):1144-9. Deficiency of antibacterial peptides in patients with morbus Kostmann: an observation study.
observation design is a study design in which subjects are monitored in the absence of any active intervention by experimentalists.
Philippe Rocca-Serra
OBI branch derived
observation design
extraction
nucleic acid extraction using phenol chloroform
A material separation in which a desired component of an input material is separated from the remainder.
Currently, the output of material processing is defined as the molecular entity that is the main component of the output material entity, rather than the material entity that has grain molecular entity.
'nucleic acid extract' is the output of 'nucleic acid extraction' and has grain 'nucleic acid'. However, the output of 'nucleic acid extraction' is 'nucleic acid' rather than 'nucleic acid extract'. We are aware of this issue and will work it out in the future.
Person:Bjoern Peters
Philippe Rocca-Serra
extraction
group randomization
PMID: 18349405. Randomization reveals unexpected acute leukemias in Southwest Oncology Group prostate cancer trial. J Clin Oncol. 2008 Mar 20;26(9):1532-6.
A group assignment which relies on chance to assign materials to a group of materials in order to avoid bias in the experimental setup.
Philippe Rocca-Serra
adapted from wikipedia [http://en.wikipedia.org/wiki/Randomization]
group randomization
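As an illustrative aside (not part of the ontology annotations), a minimal sketch of group randomization — the function name and group sizes are hypothetical, and the generator is seeded only so the outcome is reproducible:

```python
# Sketch: randomly assigning subjects to equally sized groups
# to avoid bias in the experimental setup.
import random

def randomize_groups(subjects, n_groups, seed=0):
    rng = random.Random(seed)
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    # Deal the shuffled subjects round-robin into n_groups groups.
    return [shuffled[i::n_groups] for i in range(n_groups)]

groups = randomize_groups(list(range(12)), 3)
print([len(g) for g in groups])  # [4, 4, 4]
```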
nucleic acid hybridization
PMID: 18555787. Quantitative analysis of DNA hybridization in a flowthrough microarray for molecular testing. Anal Biochem. 2008 May 27.
a planned process by which totally or partially complementary, single-stranded nucleic acids are combined into a single molecule called heteroduplex or homoduplex to an extent depending on the amount of complementarity.
Philippe Rocca-Serra
adapted from wikipedia [http://en.wikipedia.org/wiki/Nucleic_acid_hybridization]
hybridization assay
nucleic acid hybridization
flow cell
Biofilm Flow Cell
Apparatus in the fluidic subsystem where the sheath and sample meet. Can be one of several types: jet-in-air, quartz cuvette, or a hybrid of the two. The sample flows through the center of a fluid column of sheath fluid in the flow cell.
Person:John Quinn
flow_cell
http://www.flocyte.com/FRTP/Resources/flow_cytometry_glossary.htm
flow cell
flow cytometer
FACS Calibur
A flow cytometer is an instrument for counting, examining and sorting microscopic particles in suspension. It allows simultaneous multiparametric analysis of the physical and/or chemical characteristics of single cells flowing through an optical and/or electronic detection apparatus, and can be used to quantitatively measure the properties of individual cells in a flowing medium.
John Quinn
http://en.wikipedia.org/wiki/Flow_cytometer
flow cytometer
light source
A light source is an optical subsystem that provides light for use in a distant area using a delivery system (e.g., fiber optics). Light sources may include one of a variety of lamps (e.g., xenon, halogen, mercury). Most light sources are operated from line power, but some may be powered from batteries. They are mostly used in endoscopic, microscopic, and other examination and/or in surgical procedures. The light source is part of the optical subsystem. In a flow cytometer the light source directs high intensity light at particles at the interrogation point. The light source in a flow cytometer is usually a laser.
Elizabeth M. Goralczyk
John Quinn
Olga Tchuvatkina
Practical Flow Cytometry 4th Edition, Howard Shapiro, ISBN-10: 0471411256, ISBN-13: 978-0471411253
light source
obscuration bar
obscuration bar in a flow cytometer
An obscuration bar is an optical subsystem which is a strip of metal or other material that serves to block out direct light from the illuminating beam. The obscuration bar prevents the bright light scattered in the forward directions from burning out the collection device.
Daniel Schober
Flow Cytometry: First Principles, by Alice Longobardi Givan, ISBN-10: 0471382248, ISBN-13: 978-0471382249
John Quinn
obscuration bar
optical filter
720 LP filter, 580/30 BP filter
An optical filter is an optical subsystem that selectively transmits light having certain properties (often, a particular range of wavelengths, that is, range of colours of light), while blocking the remainder. They are commonly used in photography, in many optical instruments, and to colour stage lighting. Optical filters can be arranged to segregate and collect light by wavelength.
John Quinn
http://en.wikipedia.org/wiki/Optical_filter
optical filter
photodetector
A photomultiplier tube, a photo diode
A photodetector is a device used to detect and measure the intensity of radiant energy through photoelectric action. In a cytometer, photodetectors measure either the number of photons of laser light scattered on impact with a cell (for example), or the fluorescence emitted by excitation of a fluorescent dye.
John Quinn
http://einstein.stanford.edu/content/glossary/glossary.html
photodetector
DNA sequencer
ABI 377 DNA Sequencer, ABI 310 DNA Sequencer
A DNA sequencer is an instrument that determines the order of deoxynucleotides in deoxyribonucleic acid sequences.
Trish Whetzel
MO
DNA sequencer
hybridization chamber
Glass Array Hybridization Cassette
A device which is used to maintain constant contact of a liquid on an array. This can be either a glass vial or slide.
Trish Whetzel
MO_563 hybridization_chamber
hybridization chamber
cytometer
A cytometer is an instrument for counting and measuring cells.
Melanie Courtot
http://medical.merriam-webster.com/medical/cytometer
cytometer
microarray
An Affymetrix U133 array is a microarray. Microarrays include 1- and 2-color arrays, custom and commercial arrays (e.g., Affymetrix, Agilent, NimbleGen, Illumina, etc.) for expression profiling, DNA variant detection, protein binding, and other genomic and functional genomic assays.
A processed material that is made to be used in an analyte assay. It consists of a physical immobilisation matrix in which substances that bind the analyte are placed at regular spatial positions.
Daniel Schober
PERSON: Chris Stoeckert
microarray
DNA microarray
Moran G, Stokes C, Thewes S, Hube B, Coleman DC, Sullivan D (2004). "Comparative genomics using Candida albicans DNA microarrays reveals absence and divergence of virulence-associated genes in Candida dubliniensis". Microbiology 150: 3363-3382. doi:10.1099/mic.0.27221-0. PMID 15470115
A DNA microarray is a microarray that is used as a physical 2D immobilisation matrix for DNA sequences. DNA fragments bound to the microarray are used as targets for a hybridising probe sample.
PERSON: Daniel Schober
PERSON: Frank Gibson
DNA Chip
DNA-array
Web:<http://en.wikipedia.org/wiki/DNA_microarray>@2008/03/03
DNA microarray
droplet sorter
A droplet sorter is part of a flow cytometer sorter that converts the carrier fluid stream into individual droplets; these droplets are directed into separate locations for recovery (enriching the original sample for particles of interest based on qualities determined by gating) or disposal.
OBI Instrument branch
OBI Instrument branch
droplet sorter
study design
a matched pairs study design describes criteria by which subjects are identified as pairs which then undergo the same protocols, and the data generated is analyzed by comparing the differences between the paired subjects, which constitute the results of the executed study design.
A plan specification comprised of protocols (which may specify how and what kinds of data will be gathered) that are executed as part of an investigation and is realized during a study design execution.
Editor note: there is at least an implicit restriction on the kind of data transformations that can be done based on the measured data available.
PERSON: Chris Stoeckert
experimental design
Rediscussed at length (MC/JF/BP, 12/9/08). The definition was clarified to differentiate it from protocol.
study design
This statement can actually be inferred from 'plan specification', because 'independent variable specification' is a subclass of 'is part of' some 'plan specification'
repeated measure design
PMID: 10959922. J Biopharm Stat. 2000 Aug;10(3):433-45. Equivalence in test assay method comparisons for the repeated-measure, matched-pair design in medical device studies: statistical considerations.
a study design which uses the same individuals and exposes them to a set of conditions. The effects of order and practice can be confounding factors in such designs
PlanAndPlannedProcess Branch
http://www.holah.karoo.net/experimentaldesigns.htm
repeated measure design
cross over design
PMID: 17601993-Objective: HIV-infected patients with lipodystrophy (HIV-lipodystrophy) are insulin resistant and have elevated plasma free fatty acid (FFA) concentrations. We aimed to explore the mechanisms underlying FFA-induced insulin resistance in patients with HIV-lipodystrophy. Research Design and Methods: Using a randomized placebo-controlled cross-over design, we studied the effects of an overnight acipimox-induced suppression of FFA on glucose and FFA metabolism by using stable isotope labelled tracer techniques during basal conditions and a two-stage euglycemic, hyperinsulinemic clamp (20 mU insulin/m(2)/min; 50 mU insulin/m(2)/min) in nine patients with nondiabetic HIV-lipodystrophy. All patients received antiretroviral therapy. Biopsies from the vastus lateralis muscle were obtained during each stage of the clamp. Results: Acipimox treatment reduced basal FFA rate of appearance by 68.9% (52.6%-79.5%) and decreased plasma FFA concentration by 51.6 % (42.0%-58.9%), (both, P < 0.0001). Endogenous glucose production was not influenced by acipimox. During the clamp the increase in glucose-uptake was significantly greater after acipimox treatment compared to placebo (acipimox: 26.85 (18.09-39.86) vs placebo: 20.30 (13.67-30.13) mumol/kg/min; P < 0.01). Insulin increased phosphorylation of Akt (Thr(308)) and GSK-3beta (Ser(9)), decreased phosphorylation of glycogen synthase (GS) site 3a+b and increased GS-activity (I-form) in skeletal muscle (P < 0.01). Acipimox decreased phosphorylation of GS (site 3a+b) (P < 0.02) and increased GS-activity (P < 0.01) in muscle. Conclusion: The present study provides direct evidence that suppression of lipolysis in patients with HIV-lipodystrophy improves insulin-stimulated peripheral glucose-uptake. The increased glucose-uptake may in part be explained by increased dephosphorylation of GS (site 3a+b) resulting in increased GS activity.
a repeated measure design which ensures that experimental units receive, in sequence, the treatment (or the control) and then, after a specified time interval (aka *wash-out period*), switch to the control (or treatment). In this design, subjects (patients in a human context) serve as their own controls, and randomization may be used to determine the order in which a subject receives the treatment and control
Philippe Rocca-Serra
(source: http://www.sbu.se/Filer/Content0/publikationer/1/literaturesearching_1993/glossary.html)
cross over design
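As an illustrative aside (not part of the ontology annotations), a minimal sketch of a two-period cross-over assignment — the function name and subject identifiers are hypothetical, and a wash-out period between the two periods is assumed but not modelled:

```python
# Sketch: a two-period cross-over design in which each subject receives
# both treatment (T) and control (C), in a randomized order, so that
# every subject serves as its own control.
import random

def crossover_sequences(subjects, seed=0):
    rng = random.Random(seed)
    return {s: rng.choice([("T", "C"), ("C", "T")]) for s in subjects}

seqs = crossover_sequences(["s1", "s2", "s3", "s4"])
# Each sequence contains both the treatment and the control.
assert all(set(seq) == {"T", "C"} for seq in seqs.values())
```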
matched pairs design
PMID: 17288613-ABSTRACT: BACKGROUND: Physicians in Canadian emergency departments (EDs) annually treat 185,000 alert and stable trauma victims who are at risk for cervical spine (C-spine) injury. However, only 0.9% of these patients have suffered a cervical spine fracture. Current use of radiography is not efficient. The Canadian C-Spine Rule is designed to allow physicians to be more selective and accurate in ordering C-spine radiography, and to rapidly clear the C-spine without the need for radiography in many patients. The goal of this phase III study is to evaluate the effectiveness of an active strategy to implement the Canadian C-Spine Rule into physician practice. Specific objectives are to: 1) determine clinical impact, 2) determine sustainability, 3) evaluate performance, and 4) conduct an economic evaluation. METHODS: We propose a matched-pair cluster design study that compares outcomes during three consecutive 12-months before, after, and decay periods at six pairs of intervention and control sites. These 12 hospital ED sites will be stratified as teaching or community hospitals, matched according to baseline C-spine radiography ordering rates, and then allocated within each pair to either intervention or control groups. During the after period at the intervention sites, simple and inexpensive strategies will be employed to actively implement the Canadian C-Spine Rule. The following outcomes will be assessed: 1) measures of clinical impact, 2) performance of the Canadian C-Spine Rule, and 3) economic measures. During the 12-month decay period, implementation strategies will continue, allowing us to evaluate the sustainability of the effect. We estimate a sample size of 4,800 patients in each period in order to have adequate power to evaluate the main outcomes. DISCUSSION: Phase I successfully derived the Canadian C-Spine Rule and phase II confirmed the accuracy and safety of the rule, hence, the potential for physicians to improve care.
What remains unknown is the actual change in clinical behaviors that can be affected by implementation of the Canadian C-Spine Rule, and whether implementation can be achieved with simple and inexpensive measures. We believe that the Canadian C-Spine Rule has the potential to significantly reduce health care costs and improve the efficiency of patient flow in busy Canadian EDs.
A matched pairs design is a study design which uses groups of individuals associated (hence matched) with each other based on a set of criteria, one member of each pair receiving one treatment, the other member receiving the other treatment.
Philippe Rocca-Serra
http://www.holah.karoo.net/experimentaldesigns.htm
matched pairs design
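As an illustrative aside (not part of the ontology annotations), a minimal sketch of forming matched pairs — subjects are ranked on a matching covariate (e.g. disease severity), paired off, and one member of each pair is randomly assigned to treatment. The function name and the toy severity scores are hypothetical:

```python
# Sketch: a matched pairs design. Subjects are sorted by a covariate,
# adjacent subjects form a pair, and assignment within each pair is random.
import random

def matched_pairs(subjects, covariate, seed=0):
    rng = random.Random(seed)
    ranked = sorted(subjects, key=covariate)
    pairs = []
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # chance decides who gets the treatment
        pairs.append({"treatment": pair[0], "control": pair[1]})
    return pairs

severity = {"a": 3, "b": 9, "c": 4, "d": 8}
pairs = matched_pairs(list(severity), severity.get)
print(len(pairs))  # 2
```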
parallel group design
PMID: 17408389-Purpose: Proliferative vitreoretinopathy (PVR) is the most important reason for blindness following retinal detachment. Presently, vitreous tamponades such as gas or silicone oil cannot contact the lower part of the retina. A heavier-than-water tamponade displaces the inflammatory and PVR-stimulating environment from the inferior area of the retina. The Heavy Silicone Oil versus Standard Silicone Oil Study (HSO Study) is designed to answer the question of whether a heavier-than-water tamponade improves the prognosis of eyes with PVR of the lower retina. Methods: The HSO Study is a multicentre, randomized, prospective controlled clinical trial comparing two endotamponades within a two-arm parallel group design. Patients with inferiorly and posteriorly located PVR are randomized to either heavy silicone oil or standard silicone oil as a tamponading agent. Three hundred and fifty consecutive patients are recruited per group. After intraoperative re-attachment, patients are randomized to either standard silicone oil (1000 cSt or 5000 cSt) or Densiron((R)) as a tamponading agent. The main endpoint criteria are complete retinal attachment at 12 months and change of visual acuity (VA) 12 months postoperatively compared with the preoperative VA. Secondary endpoints include complete retinal attachment before endotamponade removal, quality of life analysis and the number of retina affecting re-operation within 1 year of follow-up. Results: The design and early recruitment phase of the study are described. Conclusions: The results of this study will uncover whether or not heavy silicone oil improves the prognosis of eyes with PVR.
A parallel group design, or independent measure design, is a study design which uses a unique set of experimental units for each experimental group; in other words, no two individuals are shared between experimental groups. Subjects of a treatment group receive a unique combination of independent variable values making up a treatment.
Philippe Rocca-Serra
independent measure design
http://www.holah.karoo.net/experimentaldesigns.htm
parallel group design
randomized complete block design
http://www.stats.gla.ac.uk/steps/glossary/anova.html, (A researcher is carrying out a study of the effectiveness of four different skin creams for the treatment of a certain skin disease. He has eighty subjects and plans to divide them into 4 treatment groups of twenty subjects each. Using a randomised blocks design, the subjects are assessed and put in blocks of four according to how severe their skin condition is; the four most severe cases are the first block, the next four most severe cases are the second block, and so on to the twentieth block. The four members of each block are then randomly assigned, one to each of the four treatment groups. http://www.stats.gla.ac.uk/steps/glossary/anova.html#rbd)
A randomized complete block design is a study design which randomly assigns treatments within blocks. The number of units per block equals the number of treatments, so each block receives each treatment exactly once (hence the qualifier 'complete'). The design was originally devised for field trials in agronomy and agriculture; the analysis assumes that there is no interaction between block and treatment, and the method was later adopted in other settings. The randomized complete block design is thus a design in which the subjects are matched according to a variable which the experimenter wishes to control: the subjects are put into groups (blocks) of the same size as the number of treatments, and the members of each block are then randomly assigned to different treatment groups.
Philippe Rocca-Serra
http://www.tufts.edu/~gdallal/ranblock.htm
randomized complete block design
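As an illustrative aside (not part of the ontology annotations), a minimal sketch of a randomized complete block design — block and treatment names are hypothetical, and the generator is seeded for reproducibility:

```python
# Sketch: a randomized complete block design. Within each block, every
# treatment is assigned exactly once, in a random order.
import random

def rcbd(blocks, treatments, seed=0):
    rng = random.Random(seed)
    design = {}
    for block in blocks:
        order = treatments[:]
        rng.shuffle(order)
        design[block] = order
    return design

design = rcbd(["block1", "block2", "block3"], ["A", "B", "C", "D"])
# Each block receives each treatment exactly once ('complete').
assert all(sorted(order) == ["A", "B", "C", "D"] for order in design.values())
```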
2
latin square design
PMID: 17582121-Our objective was to examine the effects of dietary cation-anion difference (DCAD) with different concentrations of dietary crude protein (CP) on performance and acid-base status in early lactation cows. Six lactating Holstein cows averaging 44 d in milk were used in a 6 x 6 Latin square design with a 2 x 3 factorial arrangement of treatments: DCAD of -3, 22, or 47 milliequivalents (Na + K - Cl - S)/100 g of dry matter (DM), and 16 or 19% CP on a DM basis. Linear increases with DCAD occurred in DM intake, milk fat percentage, 4% fat-corrected milk production, milk true protein, milk lactose, and milk solids-not-fat. Milk production itself was unaffected by DCAD. Jugular venous blood pH, base excess and HCO3(-) concentration, and urine pH increased, but jugular venous blood Cl- concentration, urine titratable acidity, and net acid excretion decreased linearly with increasing DCAD. An elevated ratio of coccygeal venous plasma essential AA to nonessential AA with increasing DCAD indicated that N metabolism in the rumen was affected, probably resulting in more microbial protein flowing to the small intestine. Cows fed 16% CP had lower urea N in milk than cows fed 19% CP; the same was true for urea N in coccygeal venous plasma and urine. Dry matter intake, milk production, milk composition, and acid-base status did not differ between the 16 and 19% CP treatments. It was concluded that DCAD affected DM intake and performance of dairy cows in early lactation. Feeding 16% dietary CP to cows in early lactation, compared with 19% CP, maintained lactation performance while reducing urea N excretion in milk and urine.
A Latin square design is a study design which, in its simplest form, allows controlling 2 levels of nuisance variables (also known as blocking variables). The 2 nuisance factors are divided into a tabular grid with the property that each row and each column receives each treatment exactly once.
Philippe Rocca-Serra
Adapted from: http://www.itl.nist.gov/div898/handbook/pri/section3/pri3321.htm and
latin square design
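As an illustrative aside (not part of the ontology annotations), the defining property of a Latin square can be shown with the simplest cyclic construction; the function name is hypothetical:

```python
# Sketch: a cyclic n x n Latin square, in which each row and each
# column receives each treatment exactly once.
def latin_square(treatments):
    n = len(treatments)
    return [[treatments[(i + j) % n] for j in range(n)] for i in range(n)]

square = latin_square(["A", "B", "C", "D"])
for row in square:
    assert sorted(row) == ["A", "B", "C", "D"]      # rows complete
for col in zip(*square):
    assert sorted(col) == ["A", "B", "C", "D"]      # columns complete
```

A real study would additionally randomize the rows, columns and treatment labels of the square before use.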
3
graeco latin square design
PMID: 6846242-Beaton et al (Am J Clin Nutr 1979;32:2546-59) reported on the partitioning of variance in 1-day dietary data for the intake of energy, protein, total carbohydrate, total fat, classes of fatty acids, cholesterol, and alcohol. Using the same food intake data and the expanded National Heart, Lung and Blood Institute food composition data base, these analyses of sources of variance have been expanded to include classes of carbohydrate, vitamin A, vitamin C, thiamin, riboflavin, niacin, calcium, iron, total ash, caffeine, and crude fiber. The analyses relate to observed intakes (replicated six times) of 30 adult males and 30 adult females obtained under a paired Graeco-Latin square design with sequence of interview, interviewer, and day of the week as determinants. Neither sequence nor interviewer made consistent contribution to variance. In females, day of the week had a significant effect for several nutrients. The major partitioning of variance was between interindividual variation (between subjects) and intraindividual variation (within subjects) which included both true day-to-day variation in intake and methodological variation. For all except caffeine, the intraindividual variability of 1-day data was larger than the interindividual variability. For vitamin A, almost all of the variance was associated with day-to-day variability. One day data provide a very inadequate estimate of usual intake of individuals. In the design of nutrition studies it is critical that the intended use of dietary data be a major consideration in deciding on methodology. There is no ideal dietary method. There may be preferred methods for particular purposes.
A Graeco-Latin square design is a study design which extends the Latin square design by superimposing two orthogonal Latin squares, allowing three nuisance variables to be controlled.
Philippe Rocca-Serra
Adapted from: http://www.itl.nist.gov/div898/handbook/pri/section3/pri3321.htm and
only 2 articles in PubMed -> probably irrelevant
Euler square design
orthogonal latin squares design
graeco latin square design
4
hyper graeco latin square design
PRS to do
Philippe Rocca-Serra
Adapted from: http://www.itl.nist.gov/div898/handbook/pri/section3/pri3321.htm and
no example found in PubMed -> not in use in the community
hyper graeco latin square design
1
2
factorial design
PMID: 17582121-Our objective was to examine the effects of dietary cation-anion difference (DCAD) with different concentrations of dietary crude protein (CP) on performance and acid-base status in early lactation cows. Six lactating Holstein cows averaging 44 d in milk were used in a 6 x 6 Latin square design with a 2 x 3 factorial arrangement of treatments: DCAD of -3, 22, or 47 milliequivalents (Na + K - Cl - S)/100 g of dry matter (DM), and 16 or 19% CP on a DM basis. Linear increases with DCAD occurred in DM intake, milk fat percentage, 4% fat-corrected milk production, milk true protein, milk lactose, and milk solids-not-fat. Milk production itself was unaffected by DCAD. Jugular venous blood pH, base excess and HCO3(-) concentration, and urine pH increased, but jugular venous blood Cl- concentration, urine titratable acidity, and net acid excretion decreased linearly with increasing DCAD. An elevated ratio of coccygeal venous plasma essential AA to nonessential AA with increasing DCAD indicated that N metabolism in the rumen was affected, probably resulting in more microbial protein flowing to the small intestine. Cows fed 16% CP had lower urea N in milk than cows fed 19% CP; the same was true for urea N in coccygeal venous plasma and urine. Dry matter intake, milk production, milk composition, and acid-base status did not differ between the 16 and 19% CP treatments. It was concluded that DCAD affected DM intake and performance of dairy cows in early lactation. Feeding 16% dietary CP to cows in early lactation, compared with 19% CP, maintained lactation performance while reducing urea N excretion in milk and urine.
A factorial design is a study design which is used to evaluate two or more factors simultaneously. The treatments are combinations of levels of the factors. The advantages of factorial designs over one-factor-at-a-time experiments are that they are more efficient and that they allow interactions to be detected. In statistics, a factorial design experiment is an experiment whose design consists of two or more factors, each with discrete possible values or levels, and whose experimental units take on all possible combinations of these levels across all such factors. Such an experiment allows studying the effect of each factor on the response variable, as well as the effects of interactions between factors on the response variable.
Philippe Rocca-Serra
http://www.stats.gla.ac.uk/steps/glossary/anova.html#facdes And from wikipedia (01/03/2007): http://en.wikipedia.org/wiki/Factorial_experiment)
factorial design
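As an illustrative aside (not part of the ontology annotations), the treatments of a full factorial design are simply all combinations of factor levels. The factor names and levels below echo the DCAD/crude-protein example cited above but are only illustrative:

```python
# Sketch: enumerating the treatments of a full 3x2 factorial design
# as the Cartesian product of the factor levels.
from itertools import product

factors = {"dcad": [-3, 22, 47], "crude_protein": [16, 19]}
treatments = list(product(*factors.values()))
print(len(treatments))  # 6 = 3 levels x 2 levels
```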
2
2x2 factorial design
PMID: 17561240-The present experiment evaluates the effects of intermittent exposure to a social stimulus on ethanol and water drinking in rats. Four groups of rats were arranged in a 2x2 factorial design with 2 levels of Social procedure (Intermittent Social vs Continuous Social) and 2 levels of sipper Liquid (Ethanol vs Water). Intermittent Social groups received 35 trials per session. Each trial consisted of the insertion of the sipper tube for 10 s followed by lifting of the guillotine door for 15 s. The guillotine door separated the experimental rat from the conspecific rat in the wire mesh cage during the 60 s inter-trial interval. The Continuous Social groups received similar procedures except that the guillotine door was raised during the entire duration of the session. For the Ethanol groups, the concentrations of ethanol in the sipper [3, 4, 6, 8, 10, 12, 14, and 16% (vol/vol)] increased across sessions, while the Water groups received 0% ethanol (water) in the sipper throughout the experiment. Both Social procedures induced more intake of ethanol than water. The Intermittent Social procedure induced more ethanol intake at the two highest ethanol concentration blocks (10-12% and 14-16%) than the Continuous Social procedure, but this effect was not observed with water. Effects of social stimulation on ethanol drinking are discussed.
a factorial design which has 2 experimental factors (aka independent variables) and 2 factor levels per experimental factor
Philippe Rocca-Serra
PMID: 17561240
2x2 factorial design
fractional factorial design
A fractional factorial design is a study design in which only an adequately chosen fraction of the treatment combinations required for the complete factorial experiment is selected to be run.
Philippe Rocca-Serra
http://www.itl.nist.gov/div898/handbook/pri/section3/pri334.htm From ASQC (1983) Glossary & Tables for Statistical Quality Control
fractional factorial design
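As an illustrative aside (not part of the ontology annotations), one standard way to choose the fraction is via a defining relation. The sketch below builds the classic half-fraction of a 2^3 two-level factorial, keeping the runs that satisfy I = ABC:

```python
# Sketch: a 2^(3-1) fractional factorial design. Levels are coded -1/+1;
# the half-fraction keeps the runs whose level product a*b*c equals +1.
from itertools import product

full = list(product([-1, 1], repeat=3))   # the 8 runs of the full 2^3 design
half = [run for run in full if run[0] * run[1] * run[2] == 1]
print(len(full), len(half))  # 8 4
```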
dye swap design
PMID: 17411393-Dye-specific bias effects, commonly observed in the two-color microarray platform, are normally corrected using the dye swap design. This design, however, is relatively expensive and labor-intensive. We propose a self-self hybridization design as an alternative to the dye swap design. In this design, the treated and control samples are labeled with Cy5 and Cy3 (or Cy3 and Cy5), respectively, without dye swap, along with a set of self-self hybridizations on the control sample. We compare this design with the dye swap design through investigation of mouse primary hepatocytes treated with three peroxisome proliferator-activated receptor-alpha (PPARalpha) agonists at three dose levels. Using Agilent's Whole Mouse Genome microarray, differentially expressed genes (DEG) were determined for both the self-self hybridization and dye swap designs. The DEG concordance between the two designs was over 80% across each dose treatment and chemical. Furthermore, 90% of DEG-associated biological pathways were in common between the designs, indicating that biological interpretations would be consistent. The reduced labor and expense for the self-self hybridization design make it an efficient substitute for the dye swap design. For example, in larger toxicogenomic studies, only about half the chips are required for the self-self hybridization design compared to that needed in the dye swap design.
An experimental design type in which the dye label orientations are reversed between paired hybridizations. Exact synonyms: flip dye, dye flip.
Philippe Rocca-Serra on behalf of MO
MO_858
dye swap design
time series design
PMID: 14744830-Microarrays are powerful tools for surveying the expression levels of many thousands of genes simultaneously. They belong to the new genomics technologies which have important applications in the biological, agricultural and pharmaceutical sciences. There are myriad sources of uncertainty in microarray experiments, and rigorous experimental design is essential for fully realizing the potential of these valuable resources. Two questions frequently asked by biologists on the brink of conducting cDNA or two-colour, spotted microarray experiments are 'Which mRNA samples should be competitively hybridized together on the same slide?' and 'How many times should each slide be replicated?' Early experience has shown that whilst the field of classical experimental design has much to offer this emerging multi-disciplinary area, new approaches which accommodate features specific to the microarray context are needed. In this paper, we propose optimal designs for factorial and time course experiments, which are special designs arising quite frequently in microarray experimentation. Our criterion for optimality is statistical efficiency based on a new notion of admissible designs; our approach enables efficient designs to be selected subject to the information available on the effects of most interest to biologists, the number of arrays available for the experiment, and other resource or practical constraints, including limitations on the amount of mRNA probe. We show that our designs are superior to both the popular reference designs, which are highly inefficient, and to designs incorporating all possible direct pairwise comparisons. Moreover, our proposed designs represent a substantial practical improvement over classical experimental designs which work in terms of standard interactions and main effects. 
The latter do not provide a basis for meaningful inference on the effects of most interest to biologists, nor make the most efficient use of valuable and limited resources.
Groups of assays that are related as part of a time series.
PRS-AGB added a formal restriction on independent variable specification about time (March 2013) and made the time series design class a defined class.
Philippe Rocca-Serra on behalf of MO
MO_887
time series design
collecting specimen from organism
taking a sputum sample from a cancer patient, taking the spleen from a killed mouse, collecting a urine sample from a patient
a process with the objective to obtain a material entity that was part of an organism for potential future use in an investigation
PERSON:Bjoern Peters
IEDB
collecting specimen from organism
material component separation
Using a cell sorter to separate a mixture of T cells into two fractions; one with surface receptor CD8 and the other lacking the receptor, or purification
a material processing in which components of an input material become segregated in space
Bjoern Peters
IEDB
material component separation
group assignment
Assigning the 'to be treated with active ingredient' role to an organism during group assignment. The group is those organisms that have the same role in the context of an investigation
group assignment is a process which has an organism as its specified input and during which a role is assigned to that organism
Philippe Rocca-Serra
cohort assignment
study assignment
OBI Plan
group assignment
maintaining cell culture
When harvesting blood from a human, isolating T cells, and then limited dilution cloning of the cells, the maintaining_cell_culture step comprises all steps after the initial dilution and plating of the cells into culture, e.g. placing the culture into an incubator, changing or adding media, and splitting a cell culture
a protocol application in which cells are kept alive in a defined environment outside of an organism. Part of cell_culturing
PlanAndPlannedProcess Branch
OBI branch derived
maintaining cell culture
'establishing cell culture'
a process through which a new type of cell culture or cell line is created, either through the isolation and culture of one or more cells from a fresh source, or the deliberate experimental modification of an existing cell culture (e.g. passaging a primary culture to become a secondary culture or line, or the immortalization or stable genetic modification of an existing culture or line).
PERSON:Matthew Brush
PERSON:Matthew Brush
A 'cell culture' as used here refers to a new lineage of cells in culture deriving from a single biological source. New cultures are established through the initial isolation and culturing of cells from an organismal source, or through changes in an existing cell culture or line that result in a new culture with unique characteristics. This can occur through the passaging/selection of a primary culture into a secondary culture or line, or experimental modifications of an existing cell culture or line such as an immortalization process or other stable genetic modification. This class covers establishment of cultures of either multicellular organism cells or unicellular organisms.
establishing cell culture
addition of molecular label
The addition of phycoerythrin label to an anti-CD8 antibody, to label all antibodies. The addition of anti-CD8-PE to a population of cells, to label the subpopulation of cells that are CD8+.
a material processing technique intended to add a molecular label to some input material entity, to allow detection of the molecular target of this label in a detection of molecular label assay
PERSON:Matthew Brush
labeling
OBI developer call, 3-12-12
addition of molecular label
sequencing assay
The use of the Sanger method of DNA sequencing to determine the order of the nucleotides in a DNA template
the use of a chemical or biochemical means to infer the sequence of a biomaterial
has_output should be sequence of input; we don't have sequence well defined yet
PlanAndPlannedProcess Branch
OBI branch derived
sequencing assay
recombinant vector cloning
a planned process with the objective to insert genetic material into a cloning vector for future replication of the inserted material
pa_branch (Alan, Randi, Kevin, Jay, Bjoern)
molecular cloning
OBI branch derived
recombinant vector cloning
RNA extraction
An RNA extraction is a nucleic acid extraction where the desired output material is RNA
PlanAndPlannedProcess Branch
OBI branch derived
requested by Helen Parkinson for MO
RNA extraction
nucleic acid extraction
Phenol/chloroform extraction to dissolve the protein content, followed by ethanol precipitation of the nucleic acid fraction overnight in the fridge, followed by centrifugation to obtain a nucleic acid pellet.
a material separation to recover the nucleic acid fraction of an input material
PlanAndPlannedProcess Branch
OBI branch derived
requested by Helen Parkinson for MO. Could be defined class
nucleic acid extraction
phage display library
PMID: 15905471. Nucleic Acids Res. 2005 May 19;33(9):e81. Oligonucleotide-assisted cleavage and ligation: a novel directional DNA cloning technology to capture cDNAs. Application in the construction of a human immune antibody phage-display library. [Phage display library encoding fragments of human antibodies. mRNA library encoding for 9-mer peptides]
a phage display library is a collection of materials in which a mixture of genes or gene fragments is expressed and can be individually selected and amplified.
PERSON: Bjoern Peters
PERSON: Philippe Rocca-Serra
display library
WEB: http://www.immuneepitope.org/home.do
PRS: 22022008. class moved under population,
modification of definition and replacement of biomaterials in previous definition with 'material'
addition of has_role restriction
phage display library
material to be added
A mixture of peptides that is being added into a cell culture.
a material that is added to another one in a material combination process
10/26/09: This defined class is used as a 'macro expression' to reduce the size of the IEDB export
2010/02/24 Alan Ruttenberg: I think this might generate confusion as the common use of the term would consider something to be a specimen during the realization of the role, not only if it bears it. However having this class as a probe, or for display, or as a macro might be useful. Ideally we would mark or segregate such classes
IEDB
material to be added
target of material addition
A cell culture into which a mixture of peptides is being added.
A material entity into which another is being added in a material combination process
10/26/09: This defined class is used as a 'macro' to reduce the size of the IEDB export.
IEDB
target of material addition
phenotype
A (combination of) quality(ies) of an organism determined by the interaction of its genetic make-up and environment that differentiates specific instances of a species from other instances of the same species.
phenotype
fluorescence
A luminous flux quality inhering in a bearer by virtue of the bearer's emitting longer wavelength light following the absorption of shorter wavelength radiation; fluorescence is common with aromatic compounds with several rings joined together.
fluorescence
mass
A physical quality that inheres in a bearer by virtue of the proportion of the bearer's amount of matter.
mass
protein
antithrombin III is a protein
An amino acid chain that is produced de novo by ribosome-mediated translation of a genetically-encoded mRNA.
protein
molecular label role
a reagent role inhering in a molecular entity intended to associate with some molecular target to serve as a proxy for the presence, abundance, or location of this target in a detection of molecular label assay.
MHB (9-29-13): 'molecular label role' imported from the Reagent Ontology and replaced OBI:OBI_0000140 (label role)
molecular tracer role
OBI developer call, 3-12-12
molecular label role
molecular label
a molecular reagent intended to associate with some molecular target to serve as a proxy for the presence, abundance, or location of this target in a detection of molecular label assay
molecular tracer
OBI developer call, 3-12-12
molecular label
region
A sequence_feature with an extent greater than zero. A nucleotide region is composed of bases and a polypeptide region is composed of amino acids.
primary structure of sequence macromolecule
sequence
region
digital images may be stored as electronic files in TIFF format on mass memory storage devices
an electronic file is an information content entity which conforms to a specification or format and which is meant to hold data and information in digital form, accessible to software agents
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
digital file
a balanced design is an experimental design where all experimental groups have an equal number of subject observations
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
balanced design
1
a single factor design is a study design which declares exactly 1 independent variable
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
single factor design
x-axis is a cartesian coordinate axis which is orthogonal to the y-axis and the z-axis
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
x-axis
an axis is a line used as a reference line for the measurement of coordinates.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.oxforddictionaries.com/definition/english/axis
axis
y-axis is a cartesian coordinate axis which is orthogonal to the x-axis and the z-axis
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
y-axis
A Cartesian coordinate system is a coordinate system that specifies each point uniquely in a plane by a pair of numerical coordinates, which are the signed distances from the point to two fixed perpendicular directed lines, measured in the same unit of length.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Cartesian_coordinate_system
cartesian coordinate system
In geometry, a coordinate system is a system which uses one or more numbers, or coordinates, to uniquely determine the position of a point or other geometric element on a manifold such as Euclidean space.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Coordinate_system
coordinate system
a cartesian axis is one of the 3 axes in a cartesian coordinate system defining a referential in 3 dimensions. Each axis is orthogonal to the other 2.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
rectangular coordinate axis
adapted from Wolfram Alpha:
https://www.wolframalpha.com/input/?i=cartesian+coordinates&lk=4&num=6&lk=4&num=6
cartesian coordinate axis
z-axis is a cartesian coordinate axis which is orthogonal to the x-axis and the y-axis
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
z-axis
a 2 dimensional cartesian coordinate system is a cartesian coordinate system which defines 2 orthogonal one dimensional axes and which may be used to describe a 2 dimensional spatial region.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
two dimensional cartesian coordinate system
In mathematics, a spherical coordinate system is a coordinate system for three-dimensional space where the position of a point is specified by three numbers: the radial distance of that point from a fixed origin, its polar angle measured from a fixed zenith direction, and the azimuth angle of its orthogonal projection on a reference plane that passes through the origin and is orthogonal to the zenith, measured from a fixed reference direction on that plane.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Spherical_coordinate_system
spherical coordinate system
A cylindrical coordinate system is a three-dimensional coordinate system that specifies point positions by the distance from a chosen reference axis, the direction from the axis relative to a chosen reference direction, and the distance from a chosen reference plane perpendicular to the axis. The latter distance is given as a positive or negative number depending on which side of the reference plane faces the point.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Cylindrical_coordinate_system
cylindrical coordinate system
In mathematics, the polar coordinate system is a two-dimensional coordinate system in which each point on a plane is determined by a distance from a fixed point and an angle from a fixed direction.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Polar_coordinate_system
polar coordinate system
Wilks' lambda distribution (named for Samuel S. Wilks), is a probability distribution used in multivariate hypothesis testing, especially with regard to the likelihood-ratio test and Multivariate analysis of variance. It is a multivariate generalization of the univariate F-distribution, and generalizes the F-distribution in the same way that the Hotelling's T-squared distribution generalizes Student's t-distribution.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
wikipedia:
last accessed: 2013-09-11
http://en.wikipedia.org/wiki/Wilks%27_lambda_distribution
Wilks' lambda distribution
A cartesian spatial coordinate datum chosen as a fixed point of reference in a three dimensional spatial region.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
three dimensional cartesian spatial coordinate origin
normal distribution hypothesis is a goodness of fit hypothesis stating that the distribution computed from the sample population fits a normal distribution.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
normal distribution hypothesis
A cartesian spatial coordinate datum chosen as a fixed point of reference in a two dimensional spatial region.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
two dimensional cartesian spatial coordinate origin
90
a confidence interval which covers 90% of the sampling distribution, meaning that there is a 10% risk of false positive (type I error)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
90% confidence interval
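As an illustration of this definition (not part of STATO), a z-based 90% confidence interval for a sample mean can be sketched in stdlib Python; the helper name `ci90` is ours, and for small samples a t-based interval would normally be preferred:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def ci90(sample):
    """z-based 90% confidence interval for the sample mean; 5% of the
    sampling distribution is excluded in each tail."""
    z = NormalDist().inv_cdf(0.95)   # ~1.645 for a two-sided 90% CI
    m = mean(sample)
    half_width = z * stdev(sample) / sqrt(len(sample))
    return m - half_width, m + half_width

lo, hi = ci90([9.8, 10.1, 10.0, 10.3, 9.9, 10.2])
```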
A one dimensional cartesian coordinate system is a cartesian coordinate system which defines a one dimensional axis and which may be used to describe a one dimensional spatial region, i.e. a straight line. It is defined by a point O, the origin, a unit of length and the orientation for the one dimensional space.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
one dimensional cartesian coordinate system
http://www.stat.duke.edu/courses/Spring98/sta110c/qtable.html
The studentized range (q) distribution is a probability distribution used by the Tukey Honestly Significant Difference test.
The distribution of the statistic
q = (x̄(k) − x̄(1)) / (s/√n),
where random samples of size n have been taken from k independent and identically distributed normal populations, with x̄(1) and x̄(k) being, respectively, the smallest and largest of the k sample means, and s² being the pooled estimate of the common variance. This statistic is particularly used in multiple comparison tests.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
q distribution
A Dictionary of Statistics (2 rev ed.), OUP. ISBN-13: 9780199541454
http://www.oxfordreference.com/view/10.1093/acref/9780199541454.001.0001/acref-9780199541454-e-1588
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Tukey.html
studentized range distribution
a three dimensional cartesian coordinate system is a cartesian coordinate system which defines 3 orthogonal one dimensional axes and which may be used to describe a 3 dimensional spatial region.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
three dimensional cartesian coordinate system
A cartesian spatial coordinate datum chosen as a fixed point of reference in a one dimensional spatial region.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
one dimensional cartesian spatial coordinate origin
A cartesian spatial coordinate datum chosen as a fixed point of reference in a spatial region.
placeholder, more work needed
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
cartesian spatial coordinate origin
linkage between 2 categorical variables test is a statistical test which evaluates whether there is an association between a predictor variable taking discrete values and a response variable also taking discrete values
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
test of association
STATO
test of independence
test of independence between variables
test of association between categorical variables
measure of variation or statistical dispersion is a data item which describes how much a theoretical distribution or dataset is spread.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
measure of dispersion
measure of variation
measure of variation
a measure of central tendency is a data item which attempts to describe a set of data by identifying the value of its centre.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
measure of central tendency
measure of central tendency
Chi-squared statistic is a statistic computed from observations and used to produce a p-value in statistical test when compared to a Chi-Squared distribution.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
Chi-Squared statistic
binary classification (or binomial classification) is a data transformation which aims to cast members of a set into 2 disjoint groups depending on whether each element has a given property/feature or not.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia:
http://en.wikipedia.org/wiki/Binary_classifier
last accessed: 2013-11-21
binomial classification
binary classification
The mode is a data item which corresponds to the most frequently occurring number in a set of numbers.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.sagepub.com/upm-data/47775_ch_3.pdf
mode
scipy.stats.mode(a, axis=0)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html#scipy.stats.mode
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L586
mode
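Besides the `scipy.stats.mode` call noted above, the definition can be illustrated with a minimal stdlib-Python sketch (the helper name `sample_mode` is ours):

```python
from collections import Counter

def sample_mode(values):
    """The most frequently occurring value in a collection
    (ties broken by first occurrence, an arbitrary choice)."""
    return Counter(values).most_common(1)[0][0]

m = sample_mode([2, 3, 3, 5, 3, 2])   # 3 occurs most often
```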
a model parameter is a data item which is part of a model and which is meant to characterize a theoretical or unknown population. A model parameter may be estimated by considering the properties of samples presumably taken from the theoretical population.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
model parameter
the range is a measure of variation which describes the difference between the lowest score and the highest score in a set of numbers (a data set)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.sagepub.com/upm-data/47775_ch_3.pdf
range(..., na.rm = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/base/html/range.html
range
Outliers are deviant scores that have been legitimately gathered and are not due to equipment failures.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.sagepub.com/upm-data/47775_ch_3.pdf
outlier
http://stats.stackexchange.com/questions/50623/r-calculating-mean-and-standard-error-of-mean-for-factors-with-lm-vs-direct
The standard error of the mean (SEM) is a data item denoting the standard deviation of the sample-mean's estimate of a population mean.
It is calculated by dividing the sample standard deviation (i.e., the sample-based estimate of the standard deviation of the population) by the square root of n , the size (number of observations) of the sample.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
SEM
adapted from wikipedia (https://en.wikipedia.org/wiki/Standard_error)
scipy.stats.sem(a, axis=0, ddof=1)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html#scipy.stats.sem
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L1928
standard error of the mean
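The calculation described above (sample standard deviation divided by the square root of n) can be sketched in stdlib Python, alongside the `scipy.stats.sem` call already noted; the helper name `sem` is ours:

```python
from math import sqrt
from statistics import stdev

def sem(sample):
    """Standard error of the mean: the sample standard deviation
    divided by the square root of the number of observations."""
    return stdev(sample) / sqrt(len(sample))

value = sem([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```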
a set of 2 subjects which results from a pairing process which assigns subjects to a set based on a pairing rule/criterion
possibly submit to 'Population and Community Ontology'
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
matched pair of subjects
a statistic is a measurement datum describing a dataset or a variable. It is generated by a calculation on a set of observed data.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Statistic).
statistic
statistic
an MA plot is a scatter plot of the log intensity ratios M = log_2(T/R) versus the average log intensities A = log_2(T*R)/2, where T and R represent the signal intensities in the test and reference channels respectively.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
M vs A plot
http://www.stat.berkeley.edu/users/terry/zarray/Software/SMAcode/html/plot.mva.html
MA plot
plot.mva()
MA plot
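The M and A coordinates in the definition can be computed directly; a minimal stdlib-Python sketch (the helper name `ma_point` is ours, independent of the `plot.mva()` R function cited above):

```python
from math import log2

def ma_point(T, R):
    """One point of an MA plot: M is the log2 ratio of the test (T)
    to reference (R) intensity, A is their average log2 intensity."""
    M = log2(T / R)
    A = (log2(T) + log2(R)) / 2
    return M, A

M, A = ma_point(8.0, 2.0)   # M = log2(4), A = (3 + 1) / 2
```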
1
The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Anderson_Darling_test
ad.test(x) function, where x is a numeric vector
scipy.stats.anderson(x, dist='norm')
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html#scipy.stats.anderson
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/morestats.py#L1017
Anderson-Darling test
true
true
1
1
one-way anova is an analysis of variance where the different groups being compared are associated with the factor levels of only one independent variable. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and equality of variances of the data.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
one factor ANOVA
STATO
http://statland.org/R/R/R1way.htm
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html#scipy.stats.f_oneway
one-way ANOVA
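As an illustration of the computation behind the test (not part of STATO), the one-way ANOVA F statistic is the between-group mean square divided by the within-group mean square; a stdlib-Python sketch (the helper name `f_oneway` echoes scipy's, but this version returns the statistic only, not the p-value):

```python
def f_oneway(*groups):
    """One-way ANOVA F statistic: between-group mean square over
    within-group mean square. The p-value would come from an F
    distribution with k-1 and N-k degrees of freedom."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

F = f_oneway([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0])
```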
true
true
1
2
two-way anova is an analysis of variance where the different groups being compared are associated with the factor levels of exactly 2 independent variables. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and equality of variances of the data.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
two factor ANOVA
STATO
http://courses.statistics.com/software/R/Rtwoway.htm
two-way ANOVA
a block design is a kind of study design which declares a blocking variable (also known as nuisance variable) in order to account for a known source of variation and reduce its impact on the acquisition of the signal
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from several sources including Wikipedia
block design
1
a count of 4 resulting from counting limbs in humans
a count is a data item denoted by an integer and representing the number of instances or occurrences of an entity
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
count
true
true
1
3
Multi-way anova is an analysis of variance where the different groups being compared are associated with the factor levels of more than 2 independent variables. The null hypothesis is an absence of difference between the means calculated for each of the groups. The test assumes normality and equality of variances of the data.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
multiway ANOVA
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581961/
Hardy-Weinberg equilibrium hypothesis is a goodness of fit hypothesis which states that allele and genotype frequencies in a population will remain constant from generation to generation in the absence of other evolutionary influences (non-random mating, mutation, selection, genetic drift, gene flow and meiotic drive).
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Hardy–Weinberg_principle)
Hardy-Weinberg equilibrium hypothesis
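The expected genotype frequencies under this hypothesis follow the familiar p², 2pq, q² proportions; a stdlib-Python sketch for illustration (the helper name `hw_expected` is ours):

```python
def hw_expected(p, n):
    """Expected genotype counts (AA, Aa, aa) under the Hardy-Weinberg
    equilibrium hypothesis, for frequency p of allele A in a sample
    of n diploid individuals."""
    q = 1.0 - p
    return n * p * p, n * 2.0 * p * q, n * q * q

aa, ab, bb = hw_expected(0.6, 100)   # expected counts sum to n
```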
signal to noise ratio is a measurement datum comparing the amount of meaningful, useful or interesting data (the signal) to the amount of irrelevant or false data (the noise). Depending on the field and domain of application, different variables will be used to determine a 'signal to noise ratio'. In statistics, the definition of signal to noise ratio is the ratio of the mean of a measurement to its standard deviation. It thus corresponds to the inverse of the coefficient of variation.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wikipedia:
http://en.wikipedia.org/wiki/Signal-to-noise_ratio#Alternative_definition
last accessed: 2013-10-18
S/N
SNR
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.signaltonoise.html#scipy.stats.signaltonoise
signal to noise ratio
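The statistical definition above (mean over standard deviation) can be sketched in stdlib Python; this uses the population standard deviation, matching the default of the scipy function cited above (the helper name `snr` is ours):

```python
from statistics import mean, pstdev

def snr(sample):
    """Statistical signal-to-noise ratio: the mean of a measurement
    divided by its (population) standard deviation -- the inverse of
    the coefficient of variation."""
    return mean(sample) / pstdev(sample)

value = snr([10.0, 12.0, 8.0, 10.0])   # mean 10, std sqrt(2)
```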
Poisson distribution is a probability distribution used to model the number of events occurring within a given time interval. It is defined by a positive real number (λ) giving the expected number of events per interval, and is evaluated at an integer k representing the observed number of events.
The expected value of a Poisson-distributed random variable is equal to λ and so is its variance.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
dpois(x, lambda, log = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Poisson.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html#scipy.stats.poisson
NIST: http://www.itl.nist.gov/div898/handbook/eda/section3/eda366j.htm
Poisson distribution
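Alongside the `dpois` and `scipy.stats.poisson` references above, the probability mass function itself is short enough to sketch in stdlib Python (the helper name `poisson_pmf` is ours):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing exactly k events in an interval where
    events occur independently at average rate lam (Poisson pmf)."""
    return exp(-lam) * lam ** k / factorial(k)

p0 = poisson_pmf(0, 2.0)   # probability of zero events when lam = 2
```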
true
Z-test is a statistical test which evaluates the null hypothesis that the means of 2 populations are equal and returns a p-value.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://reference.wolfram.com/mathematica/ref/ZTest.html
simple.z.test(x, sigma, conf.level=0.95)
http://www.inside-r.org/packages/cran/UsingR/docs/simple.z.test
Z-test
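For illustration, a one-sample variant of the Z-test (simpler than the two-population form in the definition, but using the same z statistic) can be sketched in stdlib Python; the helper name `z_test` is ours:

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma):
    """Two-sided one-sample Z-test sketch: mu0 is the hypothesized
    population mean, sigma the known population standard deviation."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

z, p = z_test([5.1, 4.9, 5.2, 5.0], mu0=5.0, sigma=0.1)
```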
a false positive rate is a data item which accounts for the proportion of incorrect rejections of a true null hypothesis.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
PRS,AGB adapted from wikipedia and wolfram alpha
significance level
type I error rate
α
false positive rate
the homoskedasticity hypothesis states that all variances under consideration are homogeneous.
definition edited according to the discussion documented in:
https://github.com/ISA-tools/stato/issues/39
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
equality of variance
STATO
homoskedasticity hypothesis
http://www.ncbi.nlm.nih.gov/assembly/model/
chrX:35,000,000-36,000,000.
chromosome coordinate system is a genomic coordinate system which uses the chromosomes of a particular assembly build to define start and end positions. This coordinate system is unstable and will change with each new genome sequence assembly build.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
chromosome coordinate system
a null hypothesis which states that no linkage exists between 2 categorical variables
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
no relationship between the variables
variables are independent
absence of association hypothesis
A null hypothesis is a statistical hypothesis that is tested for possible rejection under the assumption that it is true (usually that observations are the result of chance). The concept was introduced by R. A. Fisher.
The hypothesis contrary to the null hypothesis, usually that the observations are the result of a real effect, is known as the alternative hypothesis.[wolfram alpha]
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://mathworld.wolfram.com/NullHypothesis.html
null hypothesis
goodness of fit hypothesis is a null hypothesis stating that the distribution computed from the sample population fits a theoretical distribution or that a dataset can be correctly explained by a model
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
goodness of fit hypothesis
0
the Student's t distribution is a continuous probability distribution which arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Student's_t-distribution)
t distribution
dt(x, df, ncp, log = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/TDist.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html#scipy.stats.t
Student's t distribution
hypergeometric distribution is a probability distribution that describes the probability of k successes in n draws from a finite population of size N containing K successes without replacement
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Hypergeometric_distribution
dhyper(x, m, n, k, log = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Hypergeometric.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.hypergeom.html#scipy.stats.hypergeom
hypergeometric distribution
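The probability mass function in the definition (k successes in n draws without replacement from N items containing K successes) can be sketched in stdlib Python using binomial coefficients (the helper name `hypergeom_pmf` is ours):

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """Probability of exactly k successes in n draws without
    replacement from a population of N items containing K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

p = hypergeom_pmf(2, N=10, K=4, n=3)   # C(4,2)*C(6,1)/C(10,3)
```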
It is a null hypothesis stating that there are no differences observed between groups of subjects.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
absence of between group difference hypothesis
is a null hypothesis stating that there are no differences observed across a series of measurements made on the same subject.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
absence of within subject difference hypothesis
genomic coordinate datum is a data item which denotes a genomic position expressed using a genomic coordinate system
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
genomic coordinate datum
http://left.subtree.org/2012/04/13/counting-the-number-of-reads-in-a-bam-file/
sequence read count is a data item corresponding to the number of sequence reads generated by a DNA sequencing assay for a given stretch of DNA
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB-PRS, STATO
sequence read count
In statistics, a statement that can be tested.[wolfram alpha]
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
http://mathworld.wolfram.com/Hypothesis.html
hypothesis
Cleveland dot plot is a dot plot which plots points that each belong to one of several categories. They are an alternative to bar charts or pie charts, and look somewhat like a horizontal bar chart where the bars are replaced by dots at the values associated with each category. Compared to (vertical) bar charts and pie charts, Cleveland argues that dot plots allow more accurate interpretation of the graph by readers by making the labels easier to read, reducing non-data ink (or graph clutter) and supporting table look-up.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wikipedia:
http://en.wikipedia.org/wiki/Dot_plot_(statistics)
and
Cleveland, William S. (1993). Visualizing Data. Hobart Press. ISBN 0-9634884-0-6. hdl:2027/mdp.39015026891187.
http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/dotchart.html
dotchart(x, labels = NULL, groups = NULL, gdata = NULL,
cex = par("cex"), pch = 21, gpch = 21, bg = par("bg"),
color = par("fg"), gcolor = par("fg"), lcolor = "gray",
xlim = range(x[is.finite(x)]),
main = NULL, xlab = NULL, ylab = NULL, ...)
Cleveland dot plot
a continuous probability distribution is a probability distribution which is defined by a probability density function
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wikipedia
http://en.wikipedia.org/wiki/Probability_distribution#Continuous_probability_distribution
last accessed:
14/01/2014
continuous probability distribution
Skewness is a data item indicating the degree of asymmetry of a distribution.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://mathworld.wolfram.com/Skewness.html
skewness(x, na.rm = FALSE, type = 3)
http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/e1071/html/skewness.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skew.html#scipy.stats.skew
skewness
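The moment-based form of this measure (third central moment over the cube of the standard deviation, as computed by the scipy function above with its default bias setting) can be sketched in stdlib Python; the helper name `skewness` is ours:

```python
def skewness(sample):
    """Moment-based (population) skewness: the third central moment
    divided by the standard deviation cubed; 0 for a symmetric sample."""
    n = len(sample)
    m = sum(sample) / n
    m2 = sum((x - m) ** 2 for x in sample) / n
    m3 = sum((x - m) ** 3 for x in sample) / n
    return m3 / m2 ** 1.5

s = skewness([1.0, 1.0, 1.0, 5.0])   # right-skewed sample, s > 0
```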
The number of degrees of freedom is a count evaluating the number of values in a calculation that can vary. In statistics, the number of degrees of freedom ν is equal to N-1 in the case of the direct measurement of a quantity estimated by the arithmetic mean of N independent observations.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://stats.stackexchange.com/questions/16921/how-to-understand-degrees-of-freedom
http://www.optique-ingenieur.org/en/courses/OPI_ang_M07_C01/co/Contenu_07.html
the rank of the quadratic form (mathematical definition)
number of degrees of freedom
2
Yates's corrected Chi-Squared test is a statistical test which is used to test the association/linkage/independence of 2 dichotomous variables while introducing a correction for using the continuous Chi-squared distribution for the test.
To reduce the error in approximation, Frank Yates, an English statistician, suggested a correction for continuity that adjusts the formula for Pearson's chi-squared test by subtracting 0.5 from the difference between each observed value and its expected value in a 2 × 2 contingency table. This reduces the chi-squared value obtained and thus increases its p-value.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Yates's_correction_for_continuity) polled in June 2013
Yates's correction for continuity
chisq.test(x, y = NULL, correct = TRUE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html
Yates's corrected Chi-Squared test
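The correction described above (subtracting 0.5 from each |observed − expected| difference in a 2×2 table) can be sketched in stdlib Python; the helper name `yates_chi2` is ours and it returns the statistic only, whereas the `chisq.test` R call above also returns the p-value:

```python
def yates_chi2(a, b, c, d):
    """Yates-corrected chi-squared statistic for the 2x2 contingency
    table [[a, b], [c, d]]. The p-value would come from a chi-squared
    distribution with 1 degree of freedom."""
    n = a + b + c + d
    cells = ((a, a + b, a + c), (b, a + b, b + d),
             (c, c + d, a + c), (d, c + d, b + d))
    stat = 0.0
    for obs, row, col in cells:
        expected = row * col / n          # margin product over total
        stat += (abs(obs - expected) - 0.5) ** 2 / expected
    return stat

stat = yates_chi2(20, 10, 15, 25)
```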
reaction rate is a measurement datum which represents the speed of a chemical reaction turning reactant species into product species, i.e. the number of such conversions occurring over a time interval
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
reaction rate
substrate concentration is a scalar measurement datum which denotes the amount of molecular entity involved in an enzymatic reaction (or catalytic chemical reaction) and whose role in that reaction is as substrate.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
substrate concentration
1
2
2
5
Fisher's exact test is a statistical test used to determine if there are nonrandom associations between two categorical variables.
duplicate with OBI_0200176. so either MIREOT and add metadata and axioms or move from OBI
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://mathworld.wolfram.com/FishersExactTest.html
fisher.test(x) function, where x is a matrix
scipy.stats.fisher_exact(table, alternative='two-sided')
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html#scipy.stats.fisher_exact
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L2485
Fisher's exact test
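The scipy.stats.fisher_exact call cited above can be illustrated on a small 2x2 table (illustrative data; the first return value is the sample odds ratio a*d/(b*c)):

```python
from scipy.stats import fisher_exact

# 2x2 contingency table of two dichotomous variables
table = [[8, 2], [1, 5]]
oddsratio, p = fisher_exact(table, alternative='two-sided')
# sample odds ratio = (8*5)/(2*1) = 20
```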
true
2
1
1
2
2
Cochran-Mantel-Haenszel test for repeated tests of independence is a statistical test which allows the comparison of two groups on a dichotomous/categorical response. It is used when the effect of the explanatory variable on the response variable is influenced by covariates that can be controlled. It is often used in observational studies where random assignment of subjects to different treatments cannot be controlled, but influencing covariates can.
The null hypothesis is that the two nominal variables that are tested within each repetition are independent of each other. So there are 3 variables to consider: two categorical variables to be tested for independence of each other, and the third variable identifies the repeats.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO adapted from wikipedia (http://en.wikipedia.org/wiki/Cochran–Mantel–Haenszel_statistics) and from the Handbook of Biological Statistics (http://udel.edu/~mcdonald/statcmh.html)
CMH test
Mantel–Haenszel test
cmh.test(x,y,z)
Cochran-Mantel-Haenszel test for repeated tests of independence
a rarefaction curve is a graph used for estimating species richness in ecology studies
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
library(vegan)
rarefaction(x, subsample=5, plot=TRUE, color=TRUE, error=FALSE, legend=TRUE, symbol)
http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/vegan/html/vegan-package.html
rarefaction curve
1
1
2
1
The Mann-Whitney U-test is a null hypothesis statistical testing procedure which allows two groups (or conditions or treatments) to be compared without making the assumption that values are normally distributed.
The Mann-Whitney test is the non-parametric equivalent of the t-test for independent samples
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
U test
Wilcoxon rank-sum test
rank-sum test for the comparison of two samples
adapted from http://udel.edu/~mcdonald/statkruskalwallis.html
and from http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U
last accessed [2014-03-04]
Wilcoxon Rank-Sum test
wilcox.test(dependent variable ~ independent variable, data = dataset)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/wilcox.test.html
scipy.stats.mannwhitneyu(x, y, use_continuity=True)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html#scipy.stats.mannwhitneyu
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L4049
scipy.stats.ranksums(x, y)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ranksums.html#scipy.stats.ranksums
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L4103
Mann-Whitney U-test
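The scipy.stats.mannwhitneyu call cited above, sketched on two illustrative independent samples (the alternative is stated explicitly since defaults have varied across scipy versions):

```python
from scipy.stats import mannwhitneyu

# two independent groups; no normality assumption is made
g1 = [1.1, 2.3, 1.9, 3.4, 2.8]
g2 = [4.5, 5.1, 3.9, 6.2, 5.0]
stat, p = mannwhitneyu(g1, g2, alternative='two-sided')
```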
Shapiro-Wilk test is a goodness of fit test which evaluates the null hypothesis that the sample is drawn from a population following a normal distribution
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
S-W test
STATO, adapted from wikipedia (https://en.wikipedia.org/wiki/Shapiro–Wilk_test)
shapiro.test(x) function, where x is a numeric vector
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/shapiro.test.html
scipy.stats.shapiro(x, a=None, reta=False)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html#scipy.stats.shapiro
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/morestats.py#L944
Shapiro-Wilk test
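The scipy.stats.shapiro call cited above returns the W statistic and a p-value for the null hypothesis of normality (a minimal sketch on illustrative data):

```python
from scipy.stats import shapiro

sample = [4.9, 5.1, 5.0, 4.8, 5.2, 5.1, 4.9, 5.0]
w, p = shapiro(sample)
# W lies in (0, 1]; values near 1 are consistent with normality
```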
Levene's test is a null hypothesis statistical test which evaluates the null hypothesis of equality of variance in several populations.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Levene_test
levene.test(x) function, where x is a numeric vector
scipy.stats.levene(*args, **kwds)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html#scipy.stats.levene
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/morestats.py#L1496
Levene's test
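The scipy.stats.levene call cited above takes the groups as separate arguments (a minimal sketch; note that scipy's default `center='median'` is actually the Brown-Forsythe variant):

```python
from scipy.stats import levene

# three illustrative groups tested for equality of variance
a = [8.8, 8.4, 7.9, 8.7, 9.1]
b = [9.9, 9.0, 11.1, 9.6, 8.7]
c = [8.5, 8.4, 8.6, 8.9, 8.1]
stat, p = levene(a, b, c)  # default center='median'
```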
Bartlett's test (see Snedecor and Cochran, 1989) is used to test if k samples are from populations with equal variances. Equal variances across samples is called homoscedasticity or homogeneity of variances. Some statistical tests, for example the analysis of variance, assume that variances are equal across groups or samples. The Bartlett test can be used to verify that assumption.
Bartlett's test is sensitive to departures from normality. That is, if the samples come from non-normal distributions, then Bartlett's test may simply be testing for non-normality. Levene's test and the Brown–Forsythe test are alternatives to the Bartlett test that are less sensitive to departures from normality.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Bartlett_test
bartlett.test(x) function, where x is a numeric vector
scipy.stats.bartlett(*args)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bartlett.html#scipy.stats.bartlett
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/morestats.py#L1450
Bartlett's test
the Brown Forsythe test is a statistical test which evaluates whether the variances of different groups are equal. It relies on computing the median rather than the mean, as used in Levene's test for homoscedasticity.
This test may be used, for instance, to ensure that the conditions of application of ANOVA are met.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wikipedia and Brown, M. B., and A. B. Forsythe. 1974a. The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129-132.
http://www.statmethods.net/stats/anovaAssumptions.html
The hovPlot( ) function in the HH package provides a graphic test of homogeneity of variances based on Brown-Forsythe. In the following example, y is numeric and G is a grouping factor. Note that G must be of type factor.
# Homogeneity of Variance Plot
library(HH)
hov(y~G, data=mydata)
hovPlot(y~G,data=mydata)
Brown Forsythe test
2
Pearson's Chi-Squared test is a statistical null hypothesis test which is used either to evaluate goodness of fit of a dataset to a Chi-Squared distribution or to test independence of 2 categorical variables (i.e. absence of association between those variables).
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Chi2 test for independence
adapted from:
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html
and
http://en.wikipedia.org/wiki/Pearson's_chi-squared_test
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html
chisq.test(x, y = NULL, correct = TRUE,
p = rep(1/length(x), length(x)), rescale.p = FALSE,
simulate.p.value = FALSE, B = 2000)
http://www.inside-r.org/packages/cran/nortest/docs/pearson.test
pearson.test(x) function, where x is a numeric vector
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html#scipy.stats.chi2_contingency
Pearson's Chi square test of independence between categorical variables
2
1
1
a fixed effect model is a statistical model which represents the observed quantities in terms of explanatory variables that are treated as if the quantities were non-random.
PRS: this is a stub and more work is needed to reconcile conflicting definitions
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia:
http://en.wikipedia.org/wiki/Fixed_effects_model
fixed effect model
Kolmogorov-Smirnov test is a goodness of fit test which evaluates the null hypothesis that a sample is drawn from a population that follows a specific continuous probability distribution.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
K-S test
STATO, adapted from wikipedia (https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test)
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
ks.test(dataset, distribution)
scipy.stats.kstwobign
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstwobign.html#scipy.stats.kstwobign
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/_continuous_distns.py
scipy.stats.mstats.ks_twosamp(data1, data2,alternative='two-sided')
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.ks_twosamp.html
source code:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/mstats_basic.py#L821
Kolmogorov-Smirnov test
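The one-sample form corresponding to the ks.test(dataset, distribution) call above is scipy.stats.kstest (a minimal sketch testing illustrative data against the standard normal):

```python
from scipy.stats import kstest

# one-sample K-S test against the standard normal distribution
sample = [0.1, -0.4, 0.3, 0.8, -0.2, 0.0, 0.5, -0.6]
d, p = kstest(sample, 'norm')
# d is the maximum distance between empirical and theoretical CDFs
```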
multinomial probit regression model is a model which attempts to explain the data distribution associated with a *polychotomous* response/dependent variable in terms of values assumed by the predictor/independent variable(s): the link function used in this instance of regression modeling is the probit function.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Multinomial_probit) polled in June 2013
http://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf
multinomial probit regression for analysis of polychotomous dependent variable
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2689604/
effect size estimate is a data item about the direction and strength of the consequences of a causative agent as explored by statistical methods. Those methods produce estimates of the effect size, e.g. confidence interval
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB,PRS
effect size
effect size estimate
an F-test is a statistical test which evaluates whether the computed test statistic follows an F-distribution under the null hypothesis. The F-test is sensitive to departures from normality. F-tests arise when decomposing the variability in a data set in terms of sums of squares.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
F-test
2
a polychotomous variable is a categorical variable which is defined to have minimally 2 categories or possible values
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
http://udel.edu/~mcdonald/statvartypes.html
polychotomous variable
statistical sample size is a count evaluating the number of individual experimental units
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB-PRS
statistical sample size
study group population size
2
1
a case-control study design is an observational study design which assesses the risk of a particular outcome (a trait or a disease) associated with an event (either an exposure or an endogenous factor). A case-control study design therefore declares an exposure variable which is dichotomous in nature (exposed/non-exposed) and an outcome variable, which is also dichotomous (case or control), thus giving the name to the design. During the execution of the design, a case-control study defines a population and counts the events to determine their frequency.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from:
http://www.drcath.net/toolkit/casecontrol.html
case-control study design
2
a dichotomous variable is a categorical variable which is defined to have only 2 categories or possible values
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB-PRS
http://udel.edu/~mcdonald/statvartypes.html
'has part' exactly 1 ('categorical measurement datum'
and ('has category label' exactly 2 'categorical label'))
dichotomous variable
Genome-wide association study is a kind of study whose objective is to detect associations between genetic markers (SNP or otherwise) across the genome and a trait which may be a disease or another phenotype (e.g. a trait of agronomic relevance in animal or plant studies). Genome-wide association studies compare the allele frequencies in 2 populations, one free of the trait used as control, the other one showing the trait used as 'case'. GWAS studies implement a case-control design
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB, PRS
GWAS study
whole genome association study
genome-wide association study
1
2
The Wilcoxon signed rank test is a statistical test which tests the null hypothesis that the median difference between pairs of observations is zero. This is the non-parametric analogue to the paired t-test, and should be used if the distribution of differences between pairs may be non-normally distributed.
The procedure involves a ranking, hence the name. The absolute value of the differences between observations are ranked from smallest to largest, with the smallest difference getting a rank of 1, then next larger difference getting a rank of 2, etc. Ties are given average ranks. The ranks of all differences in one direction are summed, and the ranks of all differences in the other direction are summed. The smaller of these two sums is the test statistic, W (sometimes symbolized Ts). Unlike most test statistics, smaller values of W are less likely under the null hypothesis.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://udel.edu/~mcdonald/statsignedrank.html
signrank()
scipy.stats.wilcoxon(x, y=None, zero_method='wilcox', correction=False)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html#scipy.stats.wilcoxon
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L4103
Wilcoxon signed rank test
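The scipy.stats.wilcoxon call cited above, sketched on illustrative paired before/after measurements (all differences are nonzero, so the default zero_method does not drop observations):

```python
from scipy.stats import wilcoxon

# paired observations on the same subjects
before = [125, 115, 130, 140, 138, 115, 140, 125]
after = [110, 122, 125, 120, 140, 124, 123, 137]
w, p = wilcoxon(before, after)
# w is the smaller of the two signed-rank sums
```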
Information about a calendar date or timestamp indicating day, month, year and time of an event.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
date
1
2
1
1
The Kruskal–Wallis test is a null hypothesis statistical testing procedure which allows multiple (n>=2) groups (or conditions or treatments) to be compared, without making the assumption that values are normally distributed. The Kruskal–Wallis test is the non-parametric equivalent of the independent samples ANOVA.
The Kruskal–Wallis test is most commonly used when there is one nominal variable and one measurement variable, and the measurement variable does not meet the normality assumption of an anova.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
H test
rank-sum test for the comparison of multiple (more than 2) samples.
http://udel.edu/~mcdonald/statkruskalwallis.html
kruskal.test()
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/kruskal.test.html
scipy.stats.mstats.kruskalwallis(*args)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.kruskalwallis.html
source code:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/mstats_basic.py#L800
Kruskal Wallis test
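An equivalent call to the scipy.stats.mstats.kruskalwallis cited above also exists as scipy.stats.kruskal (a minimal sketch with three illustrative groups):

```python
from scipy.stats import kruskal

# one nominal variable (the group) and one measurement variable
g1 = [2.9, 3.0, 2.5, 2.6, 3.2]
g2 = [3.8, 2.7, 4.0, 2.4]
g3 = [2.8, 3.4, 3.7, 2.2, 2.0]
h, p = kruskal(g1, g2, g3)
```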
true
1
true
1
paired t-test is a statistical test which is specifically designed to analyse differences between paired observations in the case of studies realising a repeated measures design with only 2 repeated measurements per subject (before and after treatment, for example)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://udel.edu/~mcdonald/statpaired.html
http://udel.edu/~mcdonald/statsignedrank.html
t-test for dependent means
t-test for repeated measures
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/t.test.html
t.test(dependent variable ~ independent variable, data = dataset, var.equal = FALSE, paired = TRUE)
scipy.stats.ttest_rel(a, b, axis=0)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html#scipy.stats.ttest_rel
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L3389
paired t-test
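The scipy.stats.ttest_rel call cited above, sketched on illustrative paired measurements taken on the same six subjects:

```python
from scipy.stats import ttest_rel

# two measurements per subject (e.g. before/after treatment)
pre = [18.2, 20.1, 17.6, 16.8, 18.8, 19.7]
post = [17.1, 19.0, 16.5, 15.8, 18.0, 18.6]
t, p = ttest_rel(pre, post)
```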
2
1
stratification is a planned process which executes a stratification rule, taking as input a population and assigning its members to mutually exclusive subpopulations based on the values defined by the stratification rule
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
PRS+AGB adapted from wikipedia:
http://en.wikipedia.org/wiki/Stratified_sampling
polled on June 7th,2013
stratifying population
population stratification prior to sampling
A statistical test power analysis is a data transformation which aims to determine the size of a statistical sample required to reach a desired significance level given a particular statistical test
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.statmethods.net/stats/power.html
http://www.statmethods.net/stats/power.html
statistical test power analysis
2
2
http://arxiv.org/pdf/1007.1094.pdf
Hotelling's T2 test is a statistical test which is a generalization of Student's t-test used to assess whether the means of a set of variables remain unchanged when comparing 2 populations. It is a type of multivariate analysis
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
http://svitsrv25.epfl.ch/R-doc/library/rrcov/html/T2.test.html
two sample Hotelling T2 test
1
1
a random effect(s) model, also called a variance components model, is a kind of hierarchical linear model. It assumes that the dataset being analysed consists of a hierarchy of different populations whose differences relate to that hierarchy.
PRS: this is a stub and more work is needed to reconcile conflicting definitions
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
variance components model
adapted from wikipedia:
http://en.wikipedia.org/wiki/Random_effects_model#Qualitative_description
random effect model
2
standardized mean difference is a data item computed by forming the difference between two means, divided by an estimate of the within-group standard deviation.
It is used to provide an estimate of the effect size between two treatments when the predictor (independent) variable is categorical and the response (dependent) variable is continuous
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
SMD
adapted from "Effect size, confidence interval and statistical significance: a practical guide for biologists" Nakagawa and Cuthill
DOI: 10.1111/j.1469-185X.2007.00027.x
adapted from http://htaglossary.net/standardised+mean+difference+(SMD)
Cohen's d statistic
standardized mean difference
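The Cohen's d form of the standardized mean difference described above can be sketched directly from its definition (a minimal illustration using the pooled within-group standard deviation; the helper name `cohens_d` is ours, not from any cited library):

```python
import math

def cohens_d(x, y):
    """Standardized mean difference: difference of group means
    divided by the pooled within-group standard deviation."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # unbiased sample variances of each group
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd

d = cohens_d([1, 2, 3], [2, 3, 4])  # both groups have SD 1, means differ by 1
```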
the multinomial distribution is a probability distribution which gives the probability of any particular combination of numbers of successes for various categories defined in the context of n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from
http://mathworld.wolfram.com/MultinomialDistribution.html
and
http://en.wikipedia.org/wiki/Multinomial_distribution
dmultinom(x, size = NULL, prob, log = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Multinom.html
multinomial distribution
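The dmultinom call above has a scipy counterpart in scipy.stats.multinomial; a minimal sketch evaluating one point of the probability mass function (illustrative counts and probabilities):

```python
from scipy.stats import multinomial

# P(X = [2,1,1]) for n=4 independent trials over k=3 categories
# = 4!/(2!*1!*1!) * 0.5**2 * 0.25 * 0.25 = 0.1875
pmf = multinomial.pmf([2, 1, 1], n=4, p=[0.5, 0.25, 0.25])
```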
A z-score (also known as z-value, standard score, or normal score) is a measure of the divergence of an individual experimental result from the most probable result, the mean. Z is expressed in terms of the number of standard deviations from the mean value.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
https://controls.engin.umich.edu/wiki/index.php/Basic_statistics:_mean,_median,_average,_standard_deviation,_z-scores,_and_p-value#Z-Scores
normal score
standard score
scipy.stats.zscore(a, axis=0, ddof=0)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html#scipy.stats.zscore
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L1977
z-score
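The scipy.stats.zscore call cited above standardizes a sample so that its mean is 0 and its (population, ddof=0) standard deviation is 1 (a minimal sketch on illustrative data):

```python
from scipy.stats import zscore

# each value expressed as number of standard deviations from the mean
z = zscore([1, 2, 3, 4, 5])
mean_z = float(z.mean())  # 0 by construction
std_z = float(z.std())    # 1 with the default ddof=0
```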
log signal intensity ratio is a data item which corresponds to the base-2 logarithm of the ratio between 2 signal intensities, each corresponding to a condition.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia:
http://en.wikipedia.org/wiki/MA_plot
last accessed: 2014-03-13
M-value
log signal intensity ratio
probit regression model is a model which attempts to explain the data distribution associated with a *dichotomous* response/dependent variable in terms of values assumed by the predictor/independent variable(s): the link function used in this instance of regression modeling is the probit function, aka the quantile function, i.e., the inverse cumulative distribution function (CDF) associated with the standard normal distribution.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Probit_model) polled in June 2013
probit regression for analysis of dichotomous dependent variable
a statistical model is an information content entity which is a formalization of relationships between variables in the form of mathematical equations. A statistical model describes how one or more random variables are related to one or more other variables. The model is statistical as the variables are not deterministically but stochastically related.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wikipedia:
http://en.wikipedia.org/wiki/Statistical_model
last accessed: 14/01/2014
statistical model
statistical model
linear regression model is a model which attempts to explain the data distribution associated with a response/dependent variable in terms of values assumed by the predictor/independent variable(s), using a linear function or linear combination of the regression parameters and the predictor/independent variable(s).
linear regression modeling makes a number of assumptions, which include homoskedasticity (constancy of variance)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Linear_regression) polled in June 2013
linear regression for analysis of continuous dependent variable
multinomial logistic regression model is a model which attempts to explain the data distribution associated with a *polychotomous* response/dependent variable in terms of values assumed by the predictor/independent variable(s): the link function used in this instance of regression modeling is the logistic function.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Multinomial_logistic_regression) polled in June 2013
http://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf
multinomial logistic regression for analysis of polychotomous dependent variable
a sequence read is a DNA sequence data which is generated by a DNA sequencer
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
sequence read
a Funnel plot is a scatter plot of treatment effect versus a measure of study size and aims to provide a visual aid to detecting bias or systematic heterogeneity. A symmetric inverted funnel shape arises from a ‘well-behaved’ data set, in which publication bias is unlikely. An asymmetric funnel indicates a relationship between treatment effect and study size.
Known caveats: If high precision studies really are different from low precision studies with respect to effect size (e.g., due to different populations examined) a funnel plot may give a wrong impression of publication bias. The appearance of the funnel plot can change quite dramatically depending on the scale on the y-axis — whether it is the inverse square error or the trial size.
Funnel plot was introduced by Light and Palmer in 1984.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wikipedia:
http://en.wikipedia.org/wiki/Funnel_plot
Funnel plot
variance is a data item about a random variable or probability distribution. It is equivalent to the square of the standard deviation. It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean (expected value). The variance is the second moment of a distribution.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
σ2
var(x, y = NULL, na.rm = FALSE, use)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html
variance
the process of using statistical analysis for interpreting and communicating "what the data say".
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
From "The strength of statistical evidence" by Richard Royall.
https://www.stat.fi/isi99/proceedings/arkisto/varasto/roya0578.pdf
assess statistical evidence
a discrete probability distribution is a probability distribution which is defined by a probability mass function where the random variable can only assume a finite number of values or infinitely countable values
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wikipedia
http://en.wikipedia.org/wiki/Probability_distribution#Discrete_probability_distribution
last accessed:
14/01/2014
discrete probability distribution
ranking is a data transformation which turns a non-ordinal variable into an ordinal variable by sorting the values of the input variable and replacing each value by its position in the sorting result
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
ranking
model parameter estimation is a data transformation that finds parameter values (the model parameter estimates) most compatible with the data as judged by the model.
textual definition modified following contribution by Thomas Nichols:
https://github.com/ISA-tools/stato/issues/18
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
model parameter estimation
http://www.r-bloggers.com/boxplots-beyond-iv-beanplots/
beanplot is a plot in which (one or) multiple batches ("beans") are shown. Each bean consists of a density trace, which is mirrored to form a polygon shape. Next to that, a one-dimensional scatter plot shows all the individual measurements, like in a stripchart.
The name beanplot stems from green beans. The density shape can be seen as the pod of a green bean, while the scatter plot shows the seeds inside the pod.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.jstatsoft.org/v28/c01/paper
http://cran.r-project.org/web/packages/beanplot/index.html
bean plot
the objective of a data transformation which is to evaluate a null hypothesis of absence of linkage between variables.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
association between categorical variables testing objective
a pedigree chart is a graph which plots parent child relations
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO adapted from wikipedia (https://en.wikipedia.org/wiki/Pedigree_chart)
family tree
plot.pedigree {kinship}
http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/kinship/html/plot.pedigree.html
pedigree chart
2
r2 is a correlation coefficient which is computed over the frequencies of 2 dichotomous variables and is used as a measure of linkage disequilibrium and as an input data item to the creation of an LD plot
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
R squared measure of LD
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2580747/
r2 measure of LD
r2 measure of linkage disequilibrium
a stratification rule/criterion is a criterion used to determine population strata so that a stratification process implementing the rule results in any member of the total population being assigned to one and only one stratum
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
adapted from wikipedia:
http://en.wikipedia.org/wiki/Stratified_sampling
polled on June 7th,2013
stratification rule
The dot plot as a representation of a distribution consists of group of data points plotted on a simple scale. Dot plots are used for continuous, quantitative, univariate data. Data points may be labelled if there are few of them.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wikipedia:
Wilkinson, Leland (1999). "Dot plots". The American Statistician (American Statistical Association) 53 (3): 276–281. doi:10.2307/2686111
Wilkinson dot plot
volcano plot is a kind of scatter plot which graphs the negative log of the p-value (significance) on the y-axis versus the log2 of the fold-change between 2 conditions on the x-axis.
It is a popular method for visualizing differential occurrence of variables between 2 conditions.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Volcano_plot_(statistics)
volcanoplot(fit, coef=1, highlight=0, names=fit$genes$ID, ...)
http://rss.acs.unt.edu/Rdoc/library/limma/html/volcanoplot.html
volcano plot
99
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2689604/
a confidence interval which covers 99% of the sampling distribution, meaning that there is a 1% risk of false positive (type I error)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
confidence interval at 1% of type I error rate
STATO
99% confidence interval
Altman box and whisker plot is a variation of the Tukey box and whisker plot which uses Altman's criteria to create the 'whiskers' of the plot.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Altman, D.G. Practical Statistics for Medical Research (Chapman and Hall, 1991).
Altman box and whisker plot
2
2
http://www.biomedcentral.com/1471-2288/11/58#B9
the Breslow-Day test is a statistical test which evaluates whether the odds ratios are homogeneous across N 2x2 contingency tables, for instance several 2x2 contingency tables associated with different strata of a stratified population when evaluating the relationship between exposure and outcome, or associated with the different samples coming from several centres in a multicentric study in a clinical trial context.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia:
http://en.wikipedia.org/wiki/Odds_ratio#Statistical_inference
polled on June 8th,2013
Breslow-Day test
http://www.math.montana.edu/~jimrc/classes/stat524/Rcode/breslowday.test.r
Breslow-Day test for homogeneity of odds ratio
a sphericity test is a null hypothesis statistical testing procedure which posits a null hypothesis of equality of the variances of the differences between levels of the repeated measures factor
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Sphericity#Sphericity_in_statistics)
test of data sphericity
sphericity test
Hotelling T squared distribution is a probability distribution used in multivariate hypothesis testing, which is a univariate distribution proportional to the F-distribution and arises importantly as the distribution of a set of statistics which are natural generalizations of the statistics underlying Student's t-distribution.
In particular, the distribution arises in multivariate statistics in undertaking tests of the differences between the (multivariate) means of different populations, where tests for univariate problems would make use of a t-test.
The distribution is named for Harold Hotelling, who developed it as a generalization of Student's t-distribution.
This distribution is commonly used to describe the sample Mahalanobis distance between two populations.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wikipedia
"http://en.wikipedia.org/wiki/Hotelling's_T-squared_distribution"
last polled: 2013-11-09
Hotelling T2 distribution
A post-hoc analysis is a statistical test carried out following an analysis of variance which ruled out the null hypothesis of absence of difference between groups; it allows identifying which groups differ.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
a posteriori test
adapted from wikipedia: http://en.wikipedia.org/wiki/Post-hoc_analysis
last accessed: 2013-11-15
post-hoc analysis
specificity is a measurement datum qualifying a binary classification test and is computed by subtracting the false positive rate from 1
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
specificity
true negative rate
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2789971/
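The computation in the definition above can be sketched in Python; the function name and the counts below are illustrative, not part of STATO:

```python
def specificity(true_negatives, false_positives):
    """True negative rate: TN / (TN + FP), i.e. 1 minus the false positive rate."""
    return true_negatives / (true_negatives + false_positives)

# illustrative counts from a binary classification test
tn, fp = 90, 10
false_positive_rate = fp / (tn + fp)
assert specificity(tn, fp) == 1 - false_positive_rate
```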
strictly standardized mean difference (SSMD) is a standardized mean difference which corresponds to the ratio of the mean to the standard deviation of the difference between two groups.
SSMD directly measures the magnitude of difference between two groups.
SSMD is widely used in High Content Screen for hit selection and quality control.
When the data is preprocessed using log-transformation as normally done in HTS experiments, SSMD is the mean of log fold change divided by the standard deviation of log fold change with respect to a negative reference.
In other words, SSMD is the average fold change (on the log scale) penalized by the variability of fold change (on the log scale).
For quality control, one index for the quality of an HTS assay is the magnitude of difference between a positive control and a negative reference in an assay plate. For hit selection, the size of effects of a compound (i.e., a small molecule or an siRNA) is represented by the magnitude of difference between the compound and a negative reference. SSMD directly measures the magnitude of difference between two groups. Therefore, SSMD can be used for both quality control and hit selection in HTS experiments.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/SSMD
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2789971/
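A minimal Python sketch of the SSMD described above, assuming two independent groups and the method-of-moments estimate (mean difference over the square root of the summed sample variances); names are illustrative:

```python
import math

def ssmd(group_a, group_b):
    """Strictly standardized mean difference for two independent groups:
    (mean_a - mean_b) / sqrt(var_a + var_b), using sample variances."""
    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    return (mean(group_a) - mean(group_b)) / math.sqrt(var(group_a) + var(group_b))
```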
strictly standardized mean difference
2
Tarone's test for homogeneity of odds ratio is a statistical test which evaluates the null hypothesis that odds ratios are homogeneous.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Tarone, R. E. ‘On heterogeneity tests based on efficient scores’, Biometrika, 72, 91-95 (1985).
> library("metafor")
> calcTaronesTest <- function(mylist,referencerow=2)
http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/en/latest/src/biomedicalstats.html#calculating-the-mantel-haenszel-odds-ratio-when-there-is-a-stratifying-variable
Tarone's test for homogeneity of odds ratio
2
a homoskedasticity test is a statistical test which aims at evaluating whether the variances of several random samples are similar
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
equivariance test
homoskedasticity test
1
1
2
a 2x2 contingency table is a contingency table built for 2 dichotomous variables (i.e. 2 categorical variables, each with only 2 possible outcomes). It is the simplest of contingency tables.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
2x2 contingency table
xtabs(formula = ~., data = parent.frame(), subset, sparse = FALSE,
na.action, exclude = c(NA, NaN), drop.unused.levels = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/xtabs.html
flat contingency tables:
ftable(x, ...)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ftable.html
2 by 2 contingency table
pairing patients by age, pairing animals by body weight range
a subject pairing is a planned process which executes a pairing rule and results in the creation of sets of 2 subjects meeting the pairing criteria
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
subject pairing
2
a contingency table is a data item which displays the (multivariate) frequency distribution of the possible values of categorical variables.
The first row of the table corresponds to categories of one categorical variable, the first column of the table corresponds to categories of the other categorical variable, and the cells corresponding to each combination of categories are filled with the observed occurrences in the sample being considered.
The table also contains marginal totals (marginal sums) and the grand total of the occurrences.
The term contingency table was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation", part of the Drapers' Company Research Memoirs Biometric Series I published in 1904.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Contingency_table)
xtabs(formula = ~., data = parent.frame(), subset, sparse = FALSE,
na.action, exclude = c(NA, NaN), drop.unused.levels = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/xtabs.html
flat contingency tables:
ftable(x, ...)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ftable.html
contingency table
acute toxicity study is an investigation which uses interventions organized according to a factorial design and a parallel group design to observe the effect of administering high doses of xenobiotics in animal models or cellular models
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
acute toxicity study
acute toxicity study
2
2
-1
1
The correlation coefficient of two variables in a data sample is their covariance divided by the product of their individual standard deviations. It is a normalized measurement of how the two are linearly related.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
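The definition above can be sketched directly in Python; the function name is illustrative, not part of STATO:

```python
import math

def correlation_coefficient(x, y):
    """Sample Pearson r: the covariance of x and y divided by the product
    of their standard deviations (the n-1 factors cancel)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```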
r
r statistics
correlation coefficient
2
A Bayesian model selection is a data transformation which is based on Bayesian statistics and computes Bayes factors in order to evaluate which model best explains the data.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia
Bayesian model selection
for example, model parameter estimates could be produced using regression analysis (used in a model estimation process), which attempts to express the response variable as a function of predictor variables and model parameters.
a model parameter estimate is a data item which results from a model parameter estimation process and which provides a numerical value about a model parameter.
textual definition modified following contribution by Thomas Nichols:
https://github.com/ISA-tools/stato/issues/18
Alejandra Gonzalez Beltran
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
model parameter estimate
the geometric distribution is a negative binomial distribution where r is 1.
It is useful for modeling the runs of consecutive successes (or failures) in repeated independent trials of a system.
The geometric distribution models the number of failures before the first success in an independent succession of trials where each trial results in success or failure.
The geometric distribution with prob = p has density
p(x) = p (1-p)^x
for x = 0, 1, 2, …, 0 < p ≤ 1.
If an element of x is not integer, the result of dgeom is zero, with a warning.
The quantile is defined as the smallest value x such that F(x) ≥ p, where F is the distribution function.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Geometric.html
http://www.mathworks.co.uk/help/stats/geometric-distribution.html
dgeom(x, prob, log = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Geometric.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.geom.html#scipy.stats.geom
geometric distribution
a null hypothesis stating that there are differences observed between groups of subjects
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
presence of between group difference hypothesis
Linkage Disequilibrium plot is a graph which represents pairwise linkage disequilibrium measures between SNPs as a heatmap
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from R documentation (http://cran.r-project.org/web/packages/LDheatmap/index.html)
LD plot
Linkage Disequilibrium plot
LD plot
1
1
1
The Cochran-Armitage test is a statistical test used in categorical data analysis when the aim is to assess for the presence of an association between a dichotomous variable (variable with two categories) and a polychotomous variable (a variable with k categories).
The two-level variable represents the response, and the other represents an explanatory variable with ordered levels. The null hypothesis is the hypothesis of no trend, which means that the binomial proportion is the same for all levels of the explanatory variable.
For example, doses of a treatment can be ordered as 'low', 'medium', and 'high', and we may suspect that the treatment benefit cannot become smaller as the dose increases. The trend test is often used as a genotype-based test for case-control genetic association studies.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
CATT
http://en.wikipedia.org/wiki/Cochran%E2%80%93Armitage_test_for_trend
Cochran-Armitage test for trend
binomial logistic regression model is a model which attempts to explain the data distribution associated with a *dichotomous* response/dependent variable in terms of the values assumed by the independent variable(s), using a function of the predictor/independent variable(s): the function used in this instance of regression modelling is the logistic function.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Logistic_regression) polled in June 2013
binomial logistic regression for analysis of dichotomous dependent variable
a minimum value is a data item which denotes the smallest value found in a dataset or resulting from a calculation.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
minimum value
maximum value is a data item which denotes the largest value found in a dataset or resulting from a calculation.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
maximum value
a quartile is a quantile which splits ranked data into four sections, each containing 25% of the data, so the first quartile delineates the lowest 25% of the data, the second quartile delineates 50% of the data and the third quartile, 75% of the data
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Quartile)
quartile
1
2
http://arxiv.org/pdf/1007.1094.pdf
The one-sample Hotelling’s T2 is the multivariate extension of the common one-sample or paired Student’s t-test. In a one-sample t-test, the mean response is compared against a specific value. Hotelling’s one-sample T2 is used when the number of response variables is two or more, although it can be used when there is only one response variable. T2 makes the usual assumption that the data are approximately multivariate normal; randomization tests which do not rely on this assumption are available and should be used whenever exact results are required.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from:
https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Hotellings_One-Sample_T2.pdf
http://svitsrv25.epfl.ch/R-doc/library/rrcov/html/T2.test.html
one sample Hotelling T2 test
a violin plot is a plot combining the features of a box plot and a kernel density plot. The violin plot is therefore similar to a box plot, but it incorporates in the display the probability density of the data at different values.
Typically violin plots will include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Violin_plot
and
Hintze, J. L. and R. D. Nelson (1998). Violin plots: a box plot-density trace synergism. The American Statistician, 52(2):181-4.
http://www.inside-r.org/packages/cran/vioplot/docs/vioplot
vioplot( x, ..., range=1.5, h, ylim, names, horizontal=FALSE,
col="magenta", border="black", lty=1, lwd=1, rectCol="black",
colMed="white", pchMed=19, at, add=FALSE, wex=1,
drawRect=TRUE)
violin plot
2
meta-analysis is a data transformation which uses the effect size estimates from several independent quantitative scientific studies addressing the same question in order to assess the consistency of findings.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia:
http://en.wikipedia.org/wiki/Metaanalysis
last accessed: 2013-11-15
meta analysis
the Scheffe test is a data transformation which evaluates all possible contrasts, adjusting the significance levels to account for multiple comparisons. The test is therefore conservative. Confidence intervals can be constructed for the corresponding linear combinations. It was developed by the American statistician Henry Scheffe in 1959.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Scheffé's_method)
http://www.inside-r.org/packages/cran/agricolae/docs/scheffe.test
Scheffe test
the LSD test is a statistical test for multiple comparisons of treatments by means of least significant difference following an ANOVA analysis
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
R
LSD test
http://rss.acs.unt.edu/Rdoc/library/agricolae/html/LSD.test.html
Least significant difference test
a null hypothesis which states that an association exists between 2 categorical variables
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
presence of association hypothesis
Stacked bar chart is a bar chart which is used to compare overall quantities across items while showing the contribution of each category to the total amount. Stacked bar charts can be used for highlighting the total, as they visually aggregate all of the categories in a group while indicating a part-to-whole relationship. The downside is that it becomes harder to compare the sizes of the individual categories.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from
http://www2.le.ac.uk/offices/ld/resources/numeracy/bar-charts
and
http://blog.visual.ly/how-groups-stack-up-when-to-use-grouped-vs-stacked-column-charts/
[last accessed: 2014-03-04]
barplot(height....)
set the argument beside = FALSE
http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/barplot.html
stacked bar chart
2
The exponential distribution (a.k.a. negative exponential distribution) is the probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Exponential.html
dexp(x, rate = 1, log = FALSE)
pexp(q, rate = 1, lower.tail = TRUE, log.p = FALSE)
qexp(p, rate = 1, lower.tail = TRUE, log.p = FALSE)
rexp(n, rate = 1)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html#scipy.stats.expon
exponential distribution
variable distribution is a data item which denotes the distribution of the data points making up a variable. A variable distribution may be compared to a known probability distribution using a goodness of fit test, or by plotting a quantile-quantile plot for visual assessment of the fit.
TODO: Probably need to drop it
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
data distribution
data distribution
the role played by an entity that is part of a study group, as defined by an experimental design and realized in data analysis and data interpretation
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
experimental unit role
trimmed mean or truncated mean is a measure of central tendency which involves calculating the mean after discarding given parts of a probability distribution or sample at the high and low end, typically discarding an equal amount at both ends
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia [last accessed 2014-03-04]
http://en.wikipedia.org/wiki/Truncated_mean
truncated mean
scipy.stats.tmean(a, limits=None, inclusive=(True, True))
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.tmean.html#scipy.stats.tmean
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L684
trimmed mean
The interquartile range is a data item which corresponds to the difference between the upper quartile (3rd quartile) and lower quartile (1st quartile).
The interquartile range contains the second quartile or median.
The interquartile range is a data item providing a measure of data dispersion
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO adapted from wikipedia, wolfram alpha and oxford dictionary of statistics
IQR(x, na.rm = FALSE, type = 7)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/IQR.html
inter quartile range
a pie chart is a graph in which a circle is divided into sectors illustrating numerical proportion, meaning that the arc length of each sector (and consequently its central angle and area) is proportional to the quantity it represents.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wikipedia, last accessed [2014-03-05]
http://en.wikipedia.org/wiki/Pie_chart
pie(x, labels = names(x), edges = 200, radius = 0.8,
clockwise = FALSE, init.angle = if(clockwise) 90 else 0,
density = NULL, angle = 45, col = NULL, border = NULL,
lty = NULL, main = NULL, ...)
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/pie.html
pie chart
A bar chart is appropriate to represent counts of data.
the bar chart is a graph resulting from plotting rectangular bars with lengths proportional to the values that they represent.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia (http://en.wikipedia.org/wiki/Bar_chart) polled in June 2013
bar plot
barplot(height, ...)
http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/barplot.html
bar chart
the first quartile is a quartile which delineates the lowest 25% of the data
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
first quartile
a real time quantitative pcr plot is a line graph which plots the signal fluorescence intensity as a function of the number of PCR cycles
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
real time quantitative pcr plot
Fold change is a number describing how much a quantity changes going from an initial to a final value or one condition to another condition
30/04/2014
- removed restriction:
'is about' exactly 2 'study group population'
- need more discussion for the relationship of fold change to study group populations for particular examples.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Fold_change
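The quantity described above reduces to a simple ratio; a minimal Python sketch (function names are illustrative, not part of STATO), including the log2 form commonly reported for expression data:

```python
import math

def fold_change(initial, final):
    """How many times larger the final value is relative to the initial value."""
    return final / initial

def log2_fold_change(initial, final):
    """Fold change on the log2 scale, as commonly reported for expression data."""
    return math.log2(final / initial)
```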
fold change
the third quartile is a quartile which delineates the lowest 75% of the data
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
third quartile
Spear box and whisker plot is a variation of the Tukey box and whisker plot which uses the criteria of Spear to create the 'whiskers' of the plot.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Spear, M.E. Charting Statistics (McGraw-Hill, 1952)
Spear box and whisker plot
expected fragments per kilobase of transcript per million fragments mapped is a metric used to report transcript expression as measured by RNA-Seq using a paired-end library. The calculated value results from 2 types of normalization: one to take into account the difference in read counts associated with transcript length (at equal abundance, longer transcripts will have more reads than shorter transcripts), hence the 'per kilobase of transcript'; and one to take into account different sequencing depths across distinct sequencing runs, hence the 'per million fragments mapped'. The metric is specifically produced by the Cufflinks software.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
FPKM
adapted from:
http://seqanswers.com/forums/showthread.php?t=3254
and from
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
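The two normalizations described above can be sketched as follows; this is an illustrative formulation of the length and depth scaling, not the Cufflinks implementation itself:

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million fragments mapped:
    normalize the raw fragment count by transcript length (in kilobases)
    and by sequencing depth (in millions of mapped fragments)."""
    length_kb = transcript_length_bp / 1_000
    depth_millions = total_mapped_fragments / 1_000_000
    return fragment_count / (length_kb * depth_millions)
```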
fragments per kilobase of transcript per million fragments mapped
homogeneity testing objective is the objective of a data transformation to test a null hypothesis that two or more sub-groups of a population share the same distribution of a single categorical variable.
For example, do people of different countries have the same proportion of smokers to non-smokers?
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
homogeneity test objective
A forest plot is a graph designed to illustrate the relative strength of treatment effects in multiple quantitative scientific studies addressing the same question.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Forest_plot
metaplot(mn, se, nn=NULL, labels=NULL, conf.level=0.95,
xlab="Odds ratio", ylab="Study Reference",xlim=NULL,
summn=NULL, sumse=NULL, sumnn=NULL, summlabel="Summary",
logeffect=FALSE, lwd=2, boxsize=1,
zero=as.numeric(logeffect), colors=meta.colors(),
xaxt="s", logticks=TRUE, ...)
http://rss.acs.unt.edu/Rdoc/library/rmeta/html/metaplot.html
Forest plot
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/confint.html
confidence interval calculation is a data transformation which determines a confidence interval for a given statistical parameter
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
confidence interval calculation
t-statistic is a statistic computed from observations and used to produce a p-value in a statistical test when compared to a Student's t distribution.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
T
t-statistic
the beta distribution is a continuous probability distribution defined on the interval [0, 1], parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wikipedia:
http://en.wikipedia.org/wiki/Beta_distribution
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Beta.html
dbeta(x, shape1, shape2, ncp = 0, log = FALSE)
pbeta(q, shape1, shape2, ncp = 0, lower.tail = TRUE, log.p = FALSE)
qbeta(p, shape1, shape2, ncp = 0, lower.tail = TRUE, log.p = FALSE)
rbeta(n, shape1, shape2, ncp = 0)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html#scipy.stats.beta
beta distribution
Kurtosis is a data item which denotes the degree of peakedness of a distribution. It is defined as a normalized form of the fourth central moment of a distribution.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://mathworld.wolfram.com/Kurtosis.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kurtosis.html#scipy.stats.kurtosis
kurtosis
1
1
ANCOVA or analysis of covariance is a data transformation which evaluates whether population means of a dependent variable are equal across levels of a categorical independent variable while controlling for the effects of other continuous variables, known as covariates. Therefore, when performing ANCOVA, we are adjusting the dependent variable means to what they would be if all groups were equal on the covariates.
It augments the ANOVA model with one or more additional quantitative variables, called covariates, which are related to the response variable. The covariates are included to reduce the variance in the error terms and provide more precise measurement of the treatment effects. ANCOVA is used to test the main and interaction effects of the factors, while controlling for the effects of the covariates.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia
ANCOVA
1
0
standard normal distribution is a normal distribution with variance = 1 and mean=0
we need to formally set value for mean and variance
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
dnorm(x, mean = 0, sd = 1, log = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Normal.html
standard normal distribution
Hardy-Weinberg equilibrium test is a statistical test which aims to evaluate whether a population's allele proportions are stable or not. It is used as a means of quality control to evaluate the possibility of genotyping error or population structure.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO: adapted from wikipedia (http://en.wikipedia.org/wiki/Hardy–Weinberg_principle)
> library(HardyWeinberg)
> x <- c(298,489,213)
> HW.test <- HWChisq(x,verbose=TRUE)
http://cran.r-project.org/web/packages/HardyWeinberg/index.html
Hardy-Weinberg equilibrium testing
2
4
Odds ratio is a ratio that measures effect size, that is the strength of association between 2 dichotomous variables, one describing an exposure and one describing an outcome.
It represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure (the probability of the event occurring divided by the probability of the event not occurring). The odds ratio describes the strength of association or non-independence between two binary data values by forming the ratio of the odds for the first group and the odds for the second group. Odds ratios are used when one wants to compare the odds of something occurring between two different groups.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/
http://www.stats.org/stories/2008/odds_ratios_april4_2008.html
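For a 2x2 contingency table the ratio described above reduces to the cross-product (a*d)/(b*c); a minimal Python sketch with illustrative names and counts:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 contingency table with cells
    a (exposed, outcome), b (exposed, no outcome),
    c (unexposed, outcome), d (unexposed, no outcome):
    the odds of the outcome under exposure (a/b) divided by
    the odds without exposure (c/d), equivalently (a*d)/(b*c)."""
    return (a / b) / (c / d)
```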
OR
odds ratio
sphericity testing objective is a statistical objective of a data transformation which aims to test whether a null hypothesis of sphericity holds.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
sphericity testing objective
sphericity testing objective
A ratio is a data item which is formed from two numbers r and s and is written r/s, where r is the numerator and s is the denominator. The ratio of r to s is equivalent to the quotient r/s.
review formal definition as both numerator and denominator should be of the same type, not just some data item
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wolfram Alpha:
https://www.wolframalpha.com/share/clip?f=d41d8cd98f00b204e9800998ecf8427efdcsig76g7
ratio
1
2
1
a 2 by n contingency table is a contingency table built for one dichotomous variable (a categorical variable with only 2 outcomes) and one polychotomous variable (a categorical variable with at least 2 outcomes)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
xtabs(formula = ~., data = parent.frame(), subset, sparse = FALSE,
na.action, exclude = c(NA, NaN), drop.unused.levels = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/xtabs.html
flat contingency tables:
ftable(x, ...)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ftable.html
2 by n contingency table
Lineweaver-Burk plot is a graph which is the graphical representation of the Lineweaver–Burk equation of enzyme kinetics, described by Hans Lineweaver and Dean Burk in 1934. The plot provides a useful graphical method for analysis of the Michaelis–Menten equation.
It was widely used to determine important terms in enzymology and enzyme kinetics, as the x-intercept of the graph represents −1/Km and the y-intercept is equivalent to the inverse of Vmax.
TODO: create 'inverse function' and replace 'data transformation' in the assertions
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
double reciprocal plot
Lineweaver-Burk plot
2
1
Tukey Honestly Significant Difference (HSD) test is a statistical test used following an ANOVA test yielding a statistically significant p-value in order to determine which means are different, to a given level of significance. The Tukey HSD test relies on the q-distribution.
The procedure is conservative, meaning that if sample sizes (the sizes of different study groups) are equal, the risk of a Type I error is exactly α, and if sample sizes are unequal it’s less than α.
IMPORTANT: do not confuse the Tukey HSD test with the Tukey mean difference test (Bland-Altman test)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.tc3.edu/instruct/sbrown/stat/anova1.htm#ANOVAprereq
Tukey's honestly significant difference
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/TukeyHSD.html
Tukey HSD for Post-Hoc Analysis
average log signal intensity is a data item which corresponds to the sum of 2 distinct logarithm base 2 transformed signal intensities, each corresponding to a distinct condition of signal acquisition, divided by 2.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia:
http://en.wikipedia.org/wiki/MA_plot
last accessed: 2014-03-13
A-value
average log signal intensity
1
A mixed model is a statistical model containing both fixed effects and random effects. These models are useful in a wide variety of disciplines in the physical, biological and social sciences. They are particularly useful in settings where repeated measurements are made on the same statistical units (longitudinal study), or where measurements are made on clusters of related statistical units. Because of their advantage in dealing with missing values, mixed effects models are often preferred over more traditional approaches such as repeated measures ANOVA.
PRS: this is a stub and more work is needed to reconcile conflicting definitions
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia
mixed effect model
Threshold cycle (or Ct or Cq) is a count which is defined as the fractional PCR cycle number at which the reporter fluorescence is greater than the threshold in the context of the RT-qPCR assay. The Ct is a basic principle of real time PCR and is an essential component in producing accurate and reproducible data.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Cq
Ct
http://www.ncbi.nlm.nih.gov/genome/probe/doc/TechQPCR.shtml
threshold cycle
a goodness of fit statistical test is a statistical test which aims to evaluate whether a sample distribution can be considered equivalent to a theoretical distribution used as input
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
goodness of fit statistical test
a cartesian product is a data transformation which operates on n sets to produce the set of all possible ordered n-tuples where each element of the tuple comes from one of the sets
Alejandra Gonzalez-Beltran
Orlaith Burke
PERSON: Philippe Rocca-Serra
adapted from math wolfram (http://mathworld.wolfram.com/CartesianProduct.html)
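The operation above is available directly in Python's standard library; a minimal sketch with illustrative input sets:

```python
from itertools import product

# Cartesian product of n sets: all ordered n-tuples,
# taking one element from each input set (rightmost varies fastest).
pairs = list(product(["a", "b"], [1, 2]))
```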
cartesian product
a study group population is a population whose individual members realize (may be expressed as) a combination of inclusion rule value specifications, or result from a sampling process (e.g. recruitment followed by randomization to group), on which a number of measurements will be carried out, and which may be used as input to statistical tests and statistical inference.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
statistical sample
study group population
self explanatory
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
cartesian product 2 sets
A non-negative integer defining how many combinations of factor levels (or treatments in the statistical sense) are to be used in a study.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
number of factor level combinations
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2689604/
A confidence interval is a data item which defines a range of values in which a measurement or trial falls, corresponding to a given probability.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://mathworld.wolfram.com/ConfidenceInterval.html
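A minimal Python sketch of a confidence interval for a population mean, assuming a normal approximation with the 1.96 critical value for 95% coverage (for small samples a Student's t critical value would be more appropriate); names are illustrative:

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate confidence interval for a population mean:
    sample mean plus or minus z standard errors."""
    m = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return (m - z * standard_error, m + z * standard_error)
```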
confidence interval
a genomic coordinate system is a coordinate system to describe position of sequence on a genomic scaffold (assembly of chromosome, contig....)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
ensembl, ucsc
genomic coordinate system
a statistical test which makes no assumption about the underlying data distribution
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
non-parametric test
Mauchly's test for sphericity is a statistical test which evaluates whether the variances of the differences between all combinations of the groups are equal, a property known as 'sphericity' in the context of repeated measures. It is used, for instance, prior to repeated measures ANOVA.
The test works by assessing whether a Wishart-distributed covariance matrix (or a transformation thereof) is proportional to a given matrix.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB-PRS, adapted from wikipedia (http://en.wikipedia.org/wiki/Mauchly's_sphericity_test)
polled on June 10th, 2013
and from R manual:
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/mauchly.test.html
Mauchly's test for sphericity
mauchly.test(object, ...)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/mauchly.test.html
Mauchly's test for sphericity
the statistical test power is a data item which is about a statistical test and is obtained by subtracting the false negative rate (type II error rate) from 1. The power of a statistical test is the probability that it will correctly lead to the rejection of a false null hypothesis (Greene 2000). The statistical power is the ability of a test to detect an effect, if the effect actually exists (High 2000).
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia (http://en.wikipedia.org/wiki/Statistical_power), polled June 10th, 2013
statistical test power
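Since power = 1 − β, it can be computed once the type II error rate is known. The sketch below illustrates this for a one-sided one-sample z-test with known standard deviation; the effect size, sample size, and the 1.6449 critical value (α = 0.05, one-sided) are illustrative assumptions, not values from STATO:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Power of a one-sided one-sample z-test:
# H0: mu = 0 vs H1: mu = effect, known sigma, n observations.
effect, sigma, n = 0.5, 1.0, 30
z_alpha = 1.6449  # critical value for alpha = 0.05 (one-sided)
power = phi(effect * sqrt(n) / sigma - z_alpha)
beta = 1 - power  # type II error rate: power and beta always sum to 1
```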
2
Spearman's rank correlation coefficient is a correlation coefficient which is a nonparametric measure of statistical dependence between two ranked variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.
Spearman's coefficient may be used when the conditions for computing Pearson's correlation are not met (e.g. linearity, normality of the 2 continuous variables), but it may require a ranking transformation of the variables
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Spearman's rho
http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
cor(x, y = NULL, use = "everything",method = c("spearman"))
scipy.stats.spearmanr(a, b=None, axis=0)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html#scipy.stats.spearmanr
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L2643
Spearman's rank correlation coefficient
within subject comparison statistical test is a kind of statistical test which evaluates whether a change occurs within one experimental unit over time following a treatment or an event
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
within subject comparison statistical test
a cohort is a study group population whose members are human beings who meet inclusion criteria and are followed under a longitudinal design
possibly submit to 'Population and Community Ontology'
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
cohort
the F-distribution is a continuous probability distribution which arises in the testing of whether two observed samples have the same variance.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Fisher distribution
Snedecor Fisher distribution
http://mathworld.wolfram.com/F-Distribution.html
df(x, df1, df2, ncp, log = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Fdist.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html#scipy.stats.f
F-distribution
RPKM is a kind of count which numbers the sequence reads found per kilobase of transcript, normalized to millions of mapped sequence reads. RPKM is a metric generated by the ERANGE software tool, as reported by Mortazavi et al. in 2008.
The metric has been enhanced and replaced by FPKM to better take splice variants into account. FPKM uses a statistical model to perform the computation.
Alejandra Gonzalez Beltran
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
RPKM
http://www.nature.com/nmeth/journal/v5/n7/full/nmeth.1226.html
reads per kilobase of transcript per million fragments mapped
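The RPKM normalization described above can be sketched directly from its definition — reads divided by transcript length in kilobases and by total mapped reads in millions. The counts below are illustrative assumptions:

```python
# RPKM sketch: reads per kilobase of transcript per million mapped reads.
# All numbers are illustrative, not taken from the source.
reads_on_transcript = 500
transcript_length_bp = 2000        # a 2 kb transcript
total_mapped_reads = 10_000_000    # 10 million mapped reads

rpkm = reads_on_transcript / (
    (transcript_length_bp / 1_000) * (total_mapped_reads / 1_000_000)
)
# 500 / (2 kb * 10 million) = 25.0
```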
a planned process which establishes and states the different hypotheses to be evaluated during a null hypothesis statistical test
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
specifying null and alternate hypothesis
An alternative hypothesis is a hypothesis defined in a statistical test that is the opposite of the null hypothesis.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
alternative hypothesis
PMID:12892658
"Two formulas for computation of the area under the curve represent measures of total hormone concentration versus time-dependent change."
area under curve is a measurement datum which corresponds to the surface defined by the x-axis and bounded by the line graph in a 2-dimensional plot, obtained by integration (integral calculus). The interpretation of this measurement datum depends on the variables plotted in the graph
PRS: submit 'integral calculus' as a kind of data transformation in OBI:DT branch
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
area under curve
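For sampled data, the integration mentioned above is commonly approximated numerically; a minimal sketch using the trapezoidal rule (the sample points are illustrative):

```python
# Area under a curve from sampled (x, y) points via the trapezoidal rule,
# a simple numerical stand-in for the integration described above.
def auc_trapezoid(xs, ys):
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(xs) - 1)
    )

# Illustrative data: y = 2x sampled on [0, 3]; the exact area is 9.
area = auc_trapezoid([0, 1, 2, 3], [0, 2, 4, 6])
```

The trapezoidal rule is exact for piecewise-linear data, which is why the illustrative result matches the analytic integral here.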
is a data item formed by dividing the fluorescence intensity obtained in one channel by that obtained in the other channel, typically in 2-color microarray data where imaging is done for the Cy3 and Cy5 dyes.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
channel1/channel2 fluorescence intensity ratio
channel1/channel2 fluorescence intensity ratio
odds ratio homogeneity hypothesis is a null hypothesis stating that all odds ratios are homogeneous, that is, remain within the same range.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
odds ratio homogeneity hypothesis
odds ratio homogeneity hypothesis
2
a tetrachoric correlation coefficient is a polychoric correlation coefficient for 2 dichotomous variables used as proxy for correlation between 2 continuous latent variables.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from:
http://www.rasch.org/rmt/rmt193c.htm
and
http://en.wikipedia.org/wiki/Polychoric_correlation
tetrachoric correlation coefficient
discretization is a process converting a continuous variable into a polychotomous variable by applying a set of discretization rules
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB,PRS adapted from wikipedia (http://en.wikipedia.org/wiki/Discretization)
http://cran.r-project.org/web/packages/discretization/index.html
continuous variable discretization
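A minimal sketch of such a discretization rule set, mapping a continuous value onto labeled categories via cut points (the thresholds and labels below are illustrative assumptions):

```python
from bisect import bisect_right

# Discretization sketch: a continuous value is mapped into one of a small
# number of categories using cut points (the "discretization rules").
cut_points = [18.5, 25.0, 30.0]                 # illustrative thresholds
labels = ["under", "normal", "over", "obese"]   # one more label than cuts

def discretize(value):
    # bisect_right finds which interval the value falls into
    return labels[bisect_right(cut_points, value)]

categories = [discretize(v) for v in (17.0, 22.0, 27.5, 31.0)]
```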
50
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2689604/
a confidence interval which covers 50% of the sampling distribution, meaning that there is a 50% risk of false positive (type I error)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
confidence interval at 50% of type I error rate
STATO
50% confidence interval
probit regression model is a model which attempts to explain the data distribution associated with an *ordinal* response/dependent variable in terms of the values assumed by the independent variable(s), using a function of the predictor/independent variable(s): the function used in this instance of regression modeling is the ordered probit function.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Probit_model) polled in June 2013
ordered probit regression for analysis of ordinal dependent variable
a stratum population is a population resulting from a population stratification prior to a sampling process, which aims to produce homogeneous subpopulations from a heterogeneous population by applying one or more stratification criteria
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
stratum population
a null hypothesis which states that a given matrix is proportional to a Wishart-distributed covariance matrix
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
hypothesis of sphericity
sphericity hypothesis
Model fitting is a data transformation process which evaluates whether a model appropriately represents a dataset. A model fitting process tests the goodness of fit of the model to the data.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
model fitting
a real time pcr standard curve is a line graph which plots the fluorescence intensity signal as a function of the concentration of a reference sample, and is used to determine the relative abundance of test samples
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from:
http://www.sigmaaldrich.com/content/dam/sigma-aldrich/docs/Sigma/General_Information/qpcr_technical_guide.pdf
and
http://www.lifetechnologies.com/uk/en/home/life-science/pcr/real-time-pcr/qpcr-education/absolute-vs-relative-quantification-for-qpcr.html
RT-PCR standard curve
the false negative rate is a data item which denotes the proportion of missed detections among elements known to meet the detection criteria
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO, adapted from
type II error rate
β
false negative rate
a random variable (or aleatory variable or stochastic variable), in probability and statistics, is a variable whose value is subject to variation due to chance (i.e. randomness, in a mathematical sense)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
aleatory variable
stochastic variable
wikipedia:
http://en.wikipedia.org/wiki/Random_variable
random variable
3
graeco-latin square design is a study design which, in its simpler form, allows controlling 3 levels of nuisance variables (also known as blocking variables). The 3 nuisance factors are divided into a tabular grid with the property that each row and each column receives each treatment exactly once.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
graeco-latin square design
group assignment based on blocking variable specification is a kind of group assignment process which takes into account the levels assumed by a blocking variable to allocate subjects or experimental units to a treatment group
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
group assignment based on blocking variable specification
A testing objective to assess whether the sample used in a statistical test actually follows a given theoretical distribution (e.g. a normal distribution).
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
goodness of fit testing objective
A probability distribution is an information content entity that specifies the probability of the value of a random variable.
For a discrete random variable, a mathematical formula that gives the probability of each value of the variable.
For a continuous random variable, a curve described by a mathematical formula which specifies, by way of areas under the curve, the probability that the variable falls within a particular interval.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
probability distribution
It is a testing objective to ensure the variances of the different groups used in a statistical test are similar (i.e. not too different).
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
homoscedasticity testing objective
STATO
equal variance testing objective
0
a normal distribution is a continuous probability distribution defined by the probability density function given here:
http://mathworld.wolfram.com/NormalDistribution.html
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Gaussian distribution
http://mathworld.wolfram.com/NormalDistribution.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm
normal distribution
ordinal variable is a categorical variable where the discrete possible values are ordered or correspond to an implicit ranking
Alejandra Gonzalez-Beltan
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
ranked variable
http://udel.edu/~mcdonald/statvartypes.html
ordinal variable
Chi-square probability distribution with k degrees of freedom is a theoretical probability distribution which corresponds to the distribution of a sum of the squares of k independent standard normal random variables.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
dchisq(x, df, ncp = 0, log = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Chisquare.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html#scipy.stats.chi2
Chi-square probability distribution
the expected value (or expectation, mathematical expectation, EV, mean, or the first moment) of a random variable is a data item which corresponds to the weighted average of all possible values that this random variable can take on. The weights used in computing this average correspond to the probabilities in case of a discrete random variable, or densities in case of a continuous random variable. From a rigorous theoretical standpoint, the expected value is the integral of the random variable with respect to its probability measure.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
first moment
mean
μ
http://en.wikipedia.org/wiki/Expected_value
expected value
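The weighted-average definition above can be sketched for a discrete random variable; the fair six-sided die used below is an illustrative example, not part of the ontology:

```python
# Expected value of a discrete random variable: the probability-weighted
# average of its possible values. Example: a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6          # uniform probabilities summing to 1

expected = sum(v * p for v, p in zip(values, probs))
# weighted average for a fair die is 3.5
```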
95
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2689604/
a confidence interval which covers 95% of the sampling distribution, meaning that there is a 5% risk of false positive (type I error). If the number of observations made is large enough, the sampling distribution can be assumed to be normal, which entails that 95% of the sampling distribution falls within roughly 2 (more precisely, 1.96) standard deviations from the mean.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
confidence interval at 5% of type I error rate
STATO
95% confidence interval
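The normal-approximation interval described above (mean ± 1.96 standard errors) can be sketched directly; the sample mean, standard deviation, and size below are illustrative assumptions:

```python
from math import sqrt

# Normal-approximation 95% confidence interval for a mean:
# mean +/- 1.96 standard errors. All numbers are illustrative.
mean, sd, n = 10.0, 2.0, 100
se = sd / sqrt(n)                     # standard error of the mean
lo, hi = mean - 1.96 * se, mean + 1.96 * se
```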
number of PCR cycle is a count which enumerates how many iterations of 'denaturation, annealing, extension' rounds (or cycles) are performed during a polymerase chain reaction (PCR) or an assay relying on PCR.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from various sources including:
http://www.ncbi.nlm.nih.gov/genome/probe/doc/TechQPCR.shtml
number of PCR cycle
sensitivity is a measurement datum qualifying a binary classification test and is computed by subtracting the false negative rate from 1
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
recall
sensitivity
adapted from:
http://en.wikipedia.org/wiki/Sensitivity_and_specificity
and
http://mathworld.wolfram.com/Sensitivity.html
true positive rate
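The relation sensitivity = 1 − false negative rate can be sketched from confusion-matrix counts (the counts below are illustrative assumptions):

```python
# Sensitivity (true positive rate) of a binary classification test:
# 1 minus the false negative rate. Counts are illustrative.
true_positives = 90
false_negatives = 10

false_negative_rate = false_negatives / (true_positives + false_negatives)
sensitivity = 1 - false_negative_rate
# equivalently TP / (TP + FN)
```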
a residual is a data item which is the output of an error estimate or model fitting process and which is an observable estimate of the unobservable error
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
residual
A genetic association study is a kind of study whose objective is to detect associations between phenotypes, between a phenotype and a genetic polymorphism or between two genetic polymorphisms.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Genetic_association
genetic association study
the coefficient of variation is a normalized measure of dispersion of a probability distribution or frequency distribution.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Coefficient_of_variation
last accessed: 2013-10-18
scipy.stats.variation(a, axis=0)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.variation.html#scipy.stats.variation
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L951
coefficient of variation
The standard deviation of a random variable, statistical population, data set, or probability distribution is a measure of variation which quantifies the typical distance between the points of the data set and its mean. It corresponds to the square root of the variance.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
σ
http://en.wikipedia.org/wiki/Standard_deviation
sd(x, na.rm = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/sd.html
standard deviation
high content screening is a kind of investigation which uses standardized cellular assays to test the effect of substances (RNAi or small molecules) held in libraries on a cellular phenotype. It relies on microscopy imaging and/or flow cytometry, with robotic handling to ensure fast, high-throughput operation.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
high throughput screening
adapted from:
http://en.wikipedia.org/wiki/High-content_screening
high-content screening
high throughput screening is a kind of investigation which uses standardized assays (cell based, enzymatic or chemometric) to test the effect of substances (RNAi or small molecules) held in libraries on a very specific and measurable outcome (e.g. fluorescence intensity). It relies on robotic handling to ensure fast, high-throughput assay performance, data acquisition and hit selection.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB,PRS
high throughput screening
2
Kendall's correlation coefficient is a correlation coefficient between 2 ordinal variables (natively ordinal or following a ranking procedure) and may be used when the conditions for computing Pearson's correlation are not met (e.g. linearity, normality of the 2 continuous variables)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Kendall rank correlation coefficient
Kendall's tau (τ) coefficient
STATO, adapted from wikipedia (http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient), polled in june 2013
and from:
http://stamash.org/pearsons-correlation-coefficient/
http://stamash.org/kendalls-tau-correlation/
cor(x, y = NULL, use = "everything",method = c("kendall"))
scipy.stats.kendalltau(x, y, initial_lexsort=True)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html#scipy.stats.kendalltau
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L2827
Kendall's correlation coefficient
2
Q-Q plot or quantile-quantile plot is the output of a graphical method for comparing two probability distributions by plotting their quantiles against each other
PRS,AGB: need to add the notion of quantile
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
quantile-quantile plot
qqplot(x, y, plot.it = TRUE, xlab = deparse(substitute(x)),
ylab = deparse(substitute(y)), ...)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/qqnorm.html
Q-Q plot
statistical error is a data item denoting the amount by which an observation differs from the expected value, the latter being based on the whole statistical population from which the statistical unit was chosen randomly
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia:
http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics
last accessed: 18-11-2013
disturbance
statistical error
A box and whisker plot is appropriate to represent the characteristics of a distribution.
a box plot is a graph which plots datasets relying on their quartiles and the interquartile range to create the box and the whiskers.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Tukey box and whisker plot
box plot
Tukey, J. W. "Box-and-Whisker Plots." §2C in Exploratory Data Analysis. Reading, MA: Addison-Wesley, pp. 39-43, 1977.
boxplot
boxplot(x, ...)
http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/boxplot.html
box and whisker plot
(Rn +) − (Rn −), where Rn + = (emission intensity of reporter dye)/(emission intensity of passive reference dye) in PCR with template and Rn − = (emission intensity of reporter dye)/(emission intensity of passive reference dye) in PCR without template or early cycles of a real-time reaction. Ct = threshold cycle, i.e., cycle at which a statistically significant increase in ΔRn is first detected
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://jcm.asm.org/content/38/7/2516.figures-only
ΔRn
4
Relative risk is a measurement datum which denotes the risk of an 'event' relative to an 'exposure'. Relative risk is calculated by forming the ratio of the probability of the event occurring in the exposed group versus the probability of this event occurring in the non-exposed group.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
risk ratio
relative risk
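The ratio construction described above can be sketched from event counts in an exposed and a non-exposed group (all counts below are illustrative assumptions):

```python
# Relative risk sketch: probability of the event among the exposed divided
# by the probability among the non-exposed. Counts are illustrative.
exposed_events, exposed_total = 30, 100
unexposed_events, unexposed_total = 10, 100

relative_risk = (exposed_events / exposed_total) / (
    unexposed_events / unexposed_total
)
# 0.30 / 0.10: the event is about 3x as likely in the exposed group
```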
2
2
Woolf's test is a statistical test which evaluates the null hypothesis that odds ratios are the same across all strata of the population under investigation
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://people.umass.edu/biep640w/pdf/4.%20%20Categorical%20Data%20Analysis%202012.pdf
woolf_test(x) where x is 2 x 2 x k contingency table
http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/vcd/html/woolf_test.html
Woolf's test
odds ratio homogeneity test is a statistical test which aims to evaluate whether the null hypothesis of consistent odds ratios across different strata of a population holds
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
odds ratio homogeneity test
https://onlinecourses.science.psu.edu/stat503/node/19
Often in medical studies, the blocking factor used is the type of institution. This provides a very useful blocking factor, hopefully removing institutionally related factors such as size of the institution, types of populations served, hospitals versus clinics, etc., that would influence the overall results of the experiment.
a blocking variable is an independent variable which is used in a blocking process, as part of an experiment, with the purpose of maximizing the signal coming from the main variable.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
nuisance variable
https://onlinecourses.science.psu.edu/stat503/node/18
blocking variable
a DNA microarray hybridization is an assay relying on nucleic acid hybridization, which uses a DNA microarray device and a nucleic acid as input. It precedes a data acquisition process.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
DNA microarray hybridization
group comparison objective is a data transformation objective which aims to determine whether 2 or more study groups differ with respect to the signal of a response variable
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
group comparison objective
"Time to solve an anagram problem" is continuous since it could take 2 minutes, 2.13 minutes etc. to finish a problem
A continuous variable is one for which, within the limits the variable ranges, any value is possible.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://davidmlane.com/hyperstat/A97418.html
http://udel.edu/~mcdonald/statvartypes.html
continuous variable
a categorical variable is a variable which can only assume a finite number of values and casts observations into a small number of categories
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
discrete variable
nominal variable
qualitative factor
http://udel.edu/~mcdonald/statvartypes.html
https://onlinecourses.science.psu.edu/stat503/node/7
categorical variable
the objective of a data transformation which is to test whether a null hypothesis of absence of within-subject difference holds.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
within subject comparison objective
The allele frequency is a data item which denotes the incidence of a gene variant in a population. It is calculated as a ratio, by dividing the number of copies of a particular allele by the number of copies of all alleles at the genetic place (locus) in a population.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.nature.com/scitable/definition/allele-frequency-298
allele frequency
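The ratio in the definition above can be sketched directly; the counts below are illustrative assumptions (e.g. 50 diploid individuals carry 100 allele copies at a locus):

```python
# Allele frequency sketch: copies of one particular allele divided by
# the total number of allele copies at the locus. Counts are illustrative.
copies_of_allele_a = 30
total_allele_copies = 100   # e.g. 50 diploid individuals

allele_frequency = copies_of_allele_a / total_allele_copies
```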
the objective of a data transformation which is to test whether a null hypothesis of absence of difference between groups holds.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
between group comparison objective
a manhattan plot for gwas is a kind of scatter plot used to facilitate presentation of genome-wide association study (GWAS) data. Genomic coordinates are displayed along the X-axis, with the negative logarithm of the association P-value for each single nucleotide polymorphism displayed on the Y-axis.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Manhattan_plot
plotGrandLinear(obj, ..., facets, space.skip = 0.01, geom = NULL,
cutoff = NULL, cutoff.color = "red", cutoff.size = 1,
legend = FALSE, xlim, ylim, xlab, ylab, main)
http://www.tengfei.name/ggbio/docs/man/plotGrandLinear.html
manhattan plot for gwas
A domestic group, or a number of domestic groups linked through descent (demonstrated or stipulated) from a common ancestor, marriage, or adoption.
import from Population and Community Ontology:
http://www.ontobee.org/browser/rdf.php?o=PCO&iri=http://purl.obolibrary.org/obo/PCO_0000020
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://purl.obolibrary.org/obo/PCO_0000020
family
a variable is a data item which can assume any of a set of values, either as determined by an agent or as randomly occurring through observation.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
adapted from wolfram-alpha (http://www.wolframalpha.com/input/?i=variable) definition 2.
and from Oxford English Dictionary:
http://www.oed.com/view/Entry/221514?redirectedFrom=variable#eid, definition B,1
variable
true
true
1
repeated measure ANOVA is a kind of ANOVA specifically developed for non-independent observations, as found when repeated measurements are made on the same experimental unit.
repeated measure ANOVA is sensitive to departure from normality (evaluation using Bartlett's test), more so in the case of unbalanced groups (i.e. different sizes of sample populations).
Departure from sphericity (evaluation using Mauchly's test) used to be an issue which is now handled robustly by modern tools such as R's lme4 or nlme, which accommodate dependence assumptions other than sphericity.
discussion in https://github.com/ISA-tools/stato/issues/28
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Thomas Nichols
ANOVA for correlated samples
adapted from wikipedia and https://statistics.laerd.com/statistical-guides/repeated-measures-anova-statistical-guide-3.php
http://www.ats.ucla.edu/stat/sas/library/repeated_ut.htm
ANOVA for correlated samples
http://cran.r-project.org/doc/contrib/Lemon-kickstart/kr_repms.html
repeated measure ANOVA
2
1
The Newman–Keuls or Student–Newman–Keuls (SNK) method is a stepwise multiple comparisons procedure used to identify sample means that are significantly different from each other. It was named after Student (1927), D. Newman, and M. Keuls. This procedure is often used as a post-hoc test whenever a significant difference between three or more sample means has been revealed by an analysis of variance (ANOVA). The Newman–Keuls method is similar to Tukey's range test as both procedures use Studentized range statistics. Compared to Tukey's range test, the Newman–Keuls method is more powerful but less conservative.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia:
last accessed: 2013-11-15
SNK.test(y, trt, DFerror, MSerror, alpha = 0.05, group=TRUE, main = NULL)
http://artax.karlin.mff.cuni.cz/r-help/library/agricolae/html/SNK.test.html
Newman-Keuls test post-hoc analysis
Bernoulli distribution is a binomial distribution where the number of trials is equal to 1.
notation: B(1,p)
The mean is p
The variance is p*q, where q = 1 − p
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Bernoulli distribution
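The B(1, p) facts above (mean p, variance p·q) can be sketched directly; the value of p is an illustrative assumption:

```python
# Bernoulli distribution B(1, p): a single trial succeeding with
# probability p. Mean is p; variance is p*q with q = 1 - p.
p = 0.3          # illustrative success probability
q = 1 - p
mean = p
variance = p * q

def pmf(k):
    """Probability mass function: P(X=1) = p, P(X=0) = q."""
    return p if k == 1 else q
```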
Galbraith (Radial) plot is a scatter plot which can be used in the meta-analytic context to examine the data for heterogeneity. For a fixed-effects model, the plot shows the inverse of the standard errors on the horizontal axis against the individual observed effect sizes or outcomes standardized by their corresponding standard errors on the vertical axis.
Radial plots were introduced by Rex Galbraith (1988a, 1988b, 1994).
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Galbraith, Rex (1988). "Graphical display of estimates having differing standard errors". Technometrics (Technometrics, Vol. 30, No. 3) 30 (3): 271–281. doi:10.2307/1270081. JSTOR 1270081
radial Galbraith plot
http://www.inside-r.org/packages/cran/Luminescence/docs/plot_RadialPlot
plot_RadialPlot(data, na.exclude = TRUE, negatives = "remove",
log.z = TRUE, central.value, centrality = "mean.weighted",
plot.ratio, bar.col, grid.col, legend.text, summary = FALSE,
stats, line, line.col, line.label, output = FALSE, ...)
http://www.metafor-project.org/doku.php/plots:radial_plot
Galbraith plot
http://isogenic.info/html/9__treatments.html#factorial
a factor level combination is one of the possible sets of factor levels resulting from the cartesian product of the sets of factors and their levels, as defined in a factorial design
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
treatment combination
STATO
factor level combination
http://isogenic.info/html/9__treatments.html#factorial
A factor level is a data item which corresponds to one of the values assumed by a factor or independent variable manipulated and set by the experimentalist. In the context of factorial design, a factor level is assumed to be, or treated as, a category in a categorical variable
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
treatment
AGB-PRS
https://onlinecourses.science.psu.edu/stat503/node/7
factor level
Bayes factor is a ratio between the 2 probabilities of observing the data according to 2 distinct models. It is used in Bayesian model selection to evaluate which model best explains the data. If K &lt; 1, the model used in the denominator term is supported; if K &gt; 1, the model used in the numerator term is supported.
The Bayes factor is about the plausibility of 2 different models
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from Wikipedia
last accessed 2013-11-13
Bayes factor
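The ratio-of-model-probabilities construction can be sketched directly; the two likelihood values below are illustrative assumptions, not computed from real data:

```python
# Bayes factor sketch: ratio of the probabilities of the observed data
# under two competing models. Both likelihoods are illustrative numbers.
p_data_given_m1 = 0.08   # P(D | model 1), numerator
p_data_given_m2 = 0.02   # P(D | model 2), denominator

K = p_data_given_m1 / p_data_given_m2
# K > 1 supports the numerator model; K < 1 supports the denominator model
supported = "model 1" if K > 1 else "model 2"
```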
grouped bar chart is a kind of bar chart which juxtaposes the discrete values for each of the possible values of a given categorical variable, thus providing within-group comparison. Grouped bar charts are good for comparing each element within a category, and comparing elements across categories. However, the grouping can make it harder to tell the difference between the totals of each group.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from
http://www2.le.ac.uk/offices/ld/resources/numeracy/bar-charts
and
http://blog.visual.ly/how-groups-stack-up-when-to-use-grouped-vs-stacked-column-charts/
[last accessed: 2014-03-04]
barplot(height....)
set argument " beside = TRUE "
http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/barplot.html
grouped bar chart
A gamma distribution is a general type of continuous statistical distribution (related to the beta distribution) that arises naturally in processes for which the waiting times between Poisson-distributed events are relevant. Gamma distributions have two free parameters: shape, denoted k, and scale, denoted theta.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://mathworld.wolfram.com/GammaDistribution.html
dgamma(x, shape, rate = 1, scale = 1/rate, log = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/GammaDist.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gamma.html#scipy.stats.gamma
Gamma distribution
2
true
2
polychoric correlation coefficient is a correlation coefficient which is computed over 2 ordinal variables to characterize, by proxy, an association between 2 latent variables which are assumed to be continuous and normally distributed.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from:
http://www.rasch.org/rmt/rmt193c.htm
and
http://en.wikipedia.org/wiki/Polychoric_correlation
http://cran.r-project.org/web/packages/polycor/
polychor(x, y, ML = FALSE, control = list(), std.err = FALSE, maxcor=.9999)
polychoric correlation coefficient
1
a full factorial design is a factorial design which ensures that all possible factor level combinations are defined and used, so that all between-group differences can be explored
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
full factorial design
permutation numbering is a data transformation which counts the number of possible permutations of the elements of a set of size n, each element occurring exactly once. This number is n factorial (n!).
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
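A minimal stdlib sketch (an editorial addition, not part of the STATO source) of the count described above:

```python
from math import factorial
from itertools import permutations

# The number of permutations of a set of n distinct elements is n!
n = 5
n_permutations = factorial(n)

# Cross-check by enumerating the permutations explicitly.
assert n_permutations == len(list(permutations(range(n))))
print(n_permutations)  # 120
```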
permutation numbering
The Michaelis constant (Km) is the substrate concentration at which the reaction rate is at half-maximum, and is an inverse measure of the substrate's affinity for the enzyme: a small Km indicates high affinity, meaning that the rate will approach the maximum rate more quickly. The value of Km is dependent on both the enzyme and the substrate, as well as conditions such as temperature and pH.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia:
http://en.wikipedia.org/wiki/Michaelis–Menten_constant
last accessed: 22-11-2013
half maximal reaction rate substrate concentration (Km)
Michaelis-Menten constant
A population of two parents and a child.
possibly submit to 'Population and Community Ontology'
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
child-parent trio
parent-child trio
parents-child trio
child-parents trio
receiver operational characteristics curve is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold (aka cut-off point) is varied by plotting sensitivity vs (1 − specificity)
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Receiver_operating_characteristic
roc.from.table(table, graph = TRUE, add = FALSE, title = FALSE,
line.col = "red", auc.coords = NULL, ...)
http://rss.acs.unt.edu/Rdoc/library/epicalc/html/roc.html
receiver operational characteristics curve
2
The transmission disequilibrium test is a statistical test for genetic linkage between genetic marker and a trait in families. The test is robust to population structure.
TODO: need to modify restrictions to include family and trio
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
TDT
STATO, adapted from Wikipedia (http://en.wikipedia.org/wiki/Transmission_disequilibrium_test), accessed June 2013
transmission disequilibrium test
The binomial distribution is a discrete probability distribution which describes the probability of k successes in n draws with replacement from a finite population of size N.
The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N.
The binomial distribution gives the discrete probability distribution of obtaining exactly k successes out of n Bernoulli trials (where the result of each Bernoulli trial is true with probability p and false with probability q = 1 - p)
notation: B(n,p)
The mean is n*p
The variance is n*p*q
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Binomial_distribution
dbinom(x, size, prob, log = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Binomial.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html#scipy.stats.binom
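A plain-Python sketch (an editorial addition, with illustrative parameter values) of the pmf and the stated mean n*p:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p): C(n, k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
# The pmf sums to 1, and the mean equals n * p.
mean = sum(k * pk for k, pk in enumerate(pmf))
print(round(mean, 6))  # 3.0, i.e. n * p
```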
binomial distribution
hit selection is a planned process which, in screening processes such as high-throughput screening, leads to the identification of perturbing agents which cause the typical signal generated by a standardized assay to significantly differ from the negative control. The selection itself results from meeting or exceeding a selection threshold (for instance, 6 sigma from the mean, or an SSMD value beyond 5 when compared to positive controls or below -5 when compared to negative controls).
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
AGB, PRS
adapted from:
http://en.wikipedia.org/wiki/SSMD
adapted from:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2789971/
hit selection
TODO
pairing rule is a rule which specifies the criteria for deciding how to associate any 2 entities.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
pairing rule
between group comparison statistical test is a statistical test which aims to detect differences between the means computed for each of the study group populations
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
between group comparison statistical test
2
true
The Pearson's correlation coefficient is a correlation coefficient which evaluates two continuous variables for association strength in a data sample. It assumes that both variables are normally distributed and linearity exists.
The coefficient is calculated by dividing their covariance by the product of their individual standard deviations. It is a normalized measurement of how the two variables are linearly related.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Pearson product-moment correlation coefficient
Pearson's r
r statistics
STATO, adapted from
http://www.r-tutor.com/elementary-statistics/numerical-measures/correlation-coefficient
and from:
http://stamash.org/pearsons-correlation-coefficient/
http://stamash.org/kendalls-tau-correlation/
cor(x, y = NULL, use = "everything",method = c("pearson"))
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html
scipy.stats.pearsonr(x, y)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html#scipy.stats.pearsonr
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L2427
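A plain-Python sketch (an editorial addition; the sample data is illustrative only) of the formula in the definition, covariance divided by the product of the standard deviations:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: cov(x, y) / (sd(x) * sd(y)); the 1/n factors cancel."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sdx = sqrt(sum((a - mx) ** 2 for a in x))
    sdy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sdx * sdy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # perfectly linear in x
print(round(pearson_r(x, y), 6))  # 1.0
```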
Pearson's correlation coefficient
F statistic is a statistic computed from observations and used to produce a p-value in a statistical test when compared to an F distribution. The F statistic is the ratio of two scaled sums of squares reflecting different sources of variability.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
F
F-statistic
negative binomial probability distribution is a discrete probability distribution of the number of successes in a sequence of Bernoulli trials before a specified (non-random) number of failures (denoted r) occur. The negative binomial distribution, also known as the Pascal distribution or Pólya distribution, gives the probability of r-1 successes and x failures in x+r-1 trials, and success on the (x+r)th trial.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Pascal distribution
Pólya distribution
http://mathworld.wolfram.com/NegativeBinomialDistribution.html
dnbinom(x, size, prob, mu, log = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/NegBinomial.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.nbinom.html#scipy.stats.nbinom
negative binomial distribution
Breusch-Pagan test is a statistical test which computes a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the response (fitted values), or with a linear combination of predictors.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Breusch, T. S. and Pagan, A. R. (1979) A simple test for heteroscedasticity and random coefficient variation. Econometrica 47, 1287--1294.
and adapted from:
http://www.inside-r.org/packages/cran/car/docs/ncvTest
last accessed [2014-03-15]
http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/lmtest/html/bptest.html
bptest(formula, varformula = NULL, studentize = TRUE, data = list())
or
http://www.inside-r.org/packages/cran/car/docs/ncvTest
Breusch-Pagan test
http://www.ncbi.nlm.nih.gov/pubmed/?term=17182697
Bioinformatics. 2007 Feb 15;23(4):401-7.
Enrichment or depletion of a GO category within a class of genes: which test?
Rivals I1, Personnaz L, Taing L, Potier MC.
hypergeometric test is a null hypothesis test which evaluates if a random variable follows a hypergeometric distribution. It is a test of goodness of fit to that distribution. The test is suited to situations involving sampling from a finite set without replacement, for instance testing for enrichment or depletion of elements (e.g. GO categories, genes).
Added following a term request by Chris Mungall:
https://github.com/ISA-tools/stato/issues/6
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
lower.tail
logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Hypergeometric.html
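A stdlib sketch (an editorial addition; the counts are illustrative) of the upper-tail enrichment p-value built from the hypergeometric pmf:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): draw n without replacement from N items, K of which are 'successes'."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def enrichment_pvalue(k, N, K, n):
    """Upper-tail p-value P(X >= k), as used in over-representation tests."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(n, K) + 1))

# Illustrative numbers: 100 genes, 10 in a GO category, 20 sampled, 5 category hits.
p = enrichment_pvalue(5, 100, 10, 20)
print(round(p, 4))
```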
hypergeometric test
0
a one-tailed test is a statistical test which, assuming an unskewed probability distribution, allocates all of the significance level to evaluate only one hypothesis to explain a difference.
The one-tailed test provides more power to detect an effect in one direction by not testing the effect in the other direction.
a one-tailed test should be preceded by a two-tailed test in order to avoid missing an alternative effect explaining an observed difference.
Added following a term request by Chris Mungall:
https://github.com/ISA-tools/stato/issues/6
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
one sided test
adapted from:
http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests.htm
one tailed test
0
For example, we may wish to compare the mean of a sample to a given value x using a t-test. Our null hypothesis is that the mean is equal to x. A two-tailed test will test both if the mean is significantly greater than x and if the mean is significantly less than x. The mean is considered significantly different from x if the test statistic is in the top 2.5% or bottom 2.5% of its probability distribution, resulting in a p-value less than 0.05.
a two tailed test is a statistical test which assesses the null hypothesis of absence of difference, assuming a symmetric (not skewed) underlying probability distribution, by allocating half of the selected significance level to each direction of change which could explain a difference (for example, a difference can be an excess or a loss).
Added following a term request by Chris Mungall:
https://github.com/ISA-tools/stato/issues/6
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
two sided test
adapted from:
http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests.htm
two tailed test
A null hypothesis which states that no difference exists between 2 or more groups being considered.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
absence of difference hypothesis
let's consider an experiment evaluating 2 compounds (aspirin & ibuprofen) at 3 distinct dose levels (low, medium, high) and 4 time points post exposure (0h, 6h, 12h, 24h). Assuming the treatments are applied only once (no replication), the number of observations in a full factorial design is 2 x 3 x 4 = 24, so the design matrix would have 24 rows and 3 columns (one per factor, i.e. independent variable).
a design matrix is an information content entity which denotes a study design. The design matrix is an n by m matrix where n, the number of rows, corresponds to the number of observations (4 rows if quadruplicates) and where m, the number of columns, corresponds to the number of independent variables. Each element in the matrix corresponds to a discretized value representing one of the factor levels for a given factor.
A design matrix can be used as input to statistical modeling or statistical analysis.
The design matrix contains data on the independent variables (also called explanatory variables) in statistical models which attempt to explain observed data on a response variable (often called a dependent variable) in terms of the explanatory variables. The theory relating to such models makes substantial use of matrix manipulations involving the design matrix: see for example linear regression. A notable feature of the concept of a design matrix is that it is able to represent a number of different experimental designs and statistical models, e.g., ANOVA, ANCOVA, and linear regression
Added following a term request by Nolan Nichols: https://github.com/ISA-tools/stato/issues/9
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
model matrix
adapted from:
Design of Experiments: Principles and Applications
edited by Lennart Eriksson, 1999-2008 Umetrics. ISBN-13:978-91-973730-4-3
and
http://en.wikipedia.org/wiki/Design_matrix
[last accessed: 22-05-2014]
model.matrix(object, data = environment(object),
contrasts.arg = NULL, xlev = NULL, ...)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/model.matrix.html
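The 24-row full factorial example above can be sketched with the standard library (an editorial addition; the integer level codes are illustrative):

```python
from itertools import product

# Factor levels from the example: 2 compounds x 3 doses x 4 time points.
compounds = [0, 1]         # aspirin, ibuprofen
doses = [0, 1, 2]          # low, medium, high
timepoints = [0, 1, 2, 3]  # 0h, 6h, 12h, 24h

# Each row of the design matrix is one factor-level combination.
design_matrix = [list(row) for row in product(compounds, doses, timepoints)]
print(len(design_matrix), len(design_matrix[0]))  # 24 rows, 3 columns
```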
design matrix
A contrast is the weighted sum of group means; the c_j coefficients represent the assigned weights of the means and must sum to 0 (two contrasts are orthogonal when the products of their corresponding coefficients also sum to 0)
Term request by Nolan Nichols via https://github.com/ISA-tools/stato/issues/9
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://en.wikipedia.org/wiki/Contrast_%28statistics%29
contrasts(x, contrasts = TRUE, sparse = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/contrasts.html
contrast
a quantile is a data item which corresponds to specific elements x in the range of a variate X.
the k-th n-tile P_k is that value of x, say x_k, which corresponds to a cumulative frequency of Nk/n (Kenney and Keeping 1962). If n=4, the quantity is called a quartile, and if n=100, it is called a percentile.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Evans, M.; Hastings, N.; and Peacock, B. Statistical Distributions, 3rd ed. New York: Wiley, 2000.
http://mathworld.wolfram.com/Quantile.html
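A stdlib illustration (an editorial addition; the data values are made up) using statistics.quantiles, which returns the n-1 cut points dividing the data into n groups:

```python
import statistics

data = list(range(1, 12))  # 1..11, illustrative
quartiles = statistics.quantiles(data, n=4)   # 3 cut points -> 4 groups
deciles = statistics.quantiles(data, n=10)    # 9 cut points -> 10 groups
print(quartiles)  # [3.0, 6.0, 9.0]
```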
quantile
a decile is a quantile where n=10, which splits data into sections of 10% each, so the first decile delineates 10% of the data, the second decile delineates 20% of the data, and the ninth decile 90% of the data
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
decile
a percentile is a quantile where n=100, which splits data into sections of 1% each, so the first percentile delineates 1% of the data, the second percentile delineates 2% of the data, and the 99th percentile 99% of the data
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
percentile
absence of negative difference hypothesis is a hypothesis which assumes that a difference significantly less than a threshold does not exist.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
absence of negative difference hypothesis
absence of positive difference hypothesis is a hypothesis which assumes that a difference significantly greater than a threshold does not exist.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
absence of positive difference hypothesis
absence of enrichment hypothesis is a hypothesis which assumes that the representation of an element significantly greater than a threshold does not exist.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
absence of over representation hypothesis
STATO
absence of enrichment hypothesis
absence of depletion hypothesis is a hypothesis which assumes that the representation of an element significantly less than a threshold does not exist.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
absence of under representation hypothesis
STATO
absence of depletion hypothesis
a binomial test is a statistical hypothesis test which evaluates whether the observations made about a Bernoulli experiment (observations falling into 2 categories) deviate significantly from a theoretically expected distribution (the binomial distribution). It is a goodness of fit test.
Added following a term request by Chris Mungall:
https://github.com/ISA-tools/stato/issues/6
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from:
http://en.wikipedia.org/wiki/Binomial_test
binomial test
binom.test(x, n, p = 0.5,
alternative = c("two.sided", "less", "greater"),
conf.level = 0.95)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/binom.test.html
scipy.stats.binom_test(x, n=None, p=0.5)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom_test.html#scipy.stats.binom_test
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/morestats.py#L1605
exact binomial test
Evaluation of statistical inference on empirical resting state fMRI.
IEEE Trans Biomed Eng. 2014 Apr;61(4):1091-9. doi: 10.1109/TBME.2013.2294013.
http://www.ncbi.nlm.nih.gov/pubmed/24658234
Statistical inference is the process of deducing properties of an underlying probability distribution by analysis of data.
Added following a term request by Nolan Nichols:
https://github.com/ISA-tools/stato/issues/12
Definition changed according to discussions in https://github.com/ISA-tools/stato/issues/55
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
The definition cited from Wikipedia is from: Upton, G., Cook, I. (2008) Oxford Dictionary of Statistics, OUP. ISBN 978-0-19-954145-4
https://en.wikipedia.org/wiki/Statistical_inference
statistical inference
A ratio where the numerator and denominator are expressed in the same unit.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
dimensionless ratio
dimensionless ratio
2
2
The covariance is a measurement data item about the strength of correlation between a set (2 or more) of random variables.
The covariance is obtained by forming:
cov(X,Y) = E([X-E(X)][Y-E(Y)]), where E(X) and E(Y) are the expected values (means) of variables X and Y respectively.
covariance is symmetric so cov(X,Y)=cov(Y,X).
The covariance is useful when looking at the variance of the sum of the 2 random variables, since:
var(X+Y) = var(X) + var(Y) + 2cov(X,Y)
The covariance cov(X,Y) is used to obtain the correlation coefficient cor(X,Y) by normalizing (dividing) cov(X,Y) by the product of the standard deviations of X and Y.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from:
http://mathworld.wolfram.com/Covariance.html
covariance
cov(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))
from:
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html
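A quick numeric check (an editorial addition; the sample data is illustrative) of the identity var(X+Y) = var(X) + var(Y) + 2cov(X,Y), using population-style (divide by n) moments:

```python
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    """Population covariance: E([X - E(X)][Y - E(Y)])."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def var(v):
    return cov(v, v)

x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 1.0, 5.0, 2.0]
s = [a + b for a, b in zip(x, y)]
lhs = var(s)
rhs = var(x) + var(y) + 2 * cov(x, y)
print(abs(lhs - rhs) < 1e-9)  # True: the identity holds
```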
covariance
one sample t-test is a kind of Student's t-test which evaluates if a given sample can be reasonably assumed to be taken from the population.
The test compares the sample statistic (m) to the population parameter (M).
The one sample t-test is the small sample analog of the z test, which is suitable for large samples.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from various sources, including:
Practical Statistics for Medical Research by D.Altman.
ISBN: 0-412-27630-5
http://www.psychology.emory.edu/clinical/bliwise/Tutorials/TOM/meanstests/tone.htm
one sample t-test
t.test(x = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/t.test.html
scipy.stats.ttest_1samp(a, popmean, axis=0)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html#scipy.stats.ttest_1samp
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L3194
one sample t-test
1
true
2
two sample t-test is a null hypothesis statistical test which is used to reject or accept the hypothesis of absence of difference between the means of 2 randomly sampled populations.
It uses a t-distribution for the test and assumes that the variables in the population are normally distributed and with equal variances.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
two sample t-test
adapted from:
http://en.wikipedia.org/wiki/Student's_t-test#Independent_.28unpaired.29_samples
and from:
http://www.psychology.emory.edu/clinical/bliwise/Tutorials/TOM/meanstests/tind.htm
t-test for independent means assuming equal variance
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = TRUE,
conf.level = 0.95, ...)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/t.test.html
scipy.stats.ttest_ind(a, b, axis=0, equal_var=True)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L3271
two sample t-test with equal variance
2
1
true
2
Welch t-test is a two sample t-test used when the variances of the 2 populations/samples are thought to be unequal (homoskedasticity hypothesis not verified). In this version of the two-sample t-test, the denominator used to form the t-statistic does not rely on a 'pooled variance' estimate.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Welsh t-test
Welch, B. L. (1947). "The generalization of "Student's" problem when several different population variances are involved". Biometrika 34 (1–2): 28–35. doi:10.1093/biomet/34.1-2.28
adapted from wikipedia:
http://en.wikipedia.org/wiki/Welch's_t_test
last accessed: 2014-05-06
t-test for independent means assuming unequal variance
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/t.test.html
scipy.stats.ttest_ind(a, b, axis=0, equal_var=False)
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind
source:
https://github.com/scipy/scipy/blob/v0.15.1/scipy/stats/stats.py#L3271
two sample t-test with unequal variance
A Helmert contrast is a contrast in which the coefficients for the Helmert regressors compare each level with the average of the “preceding” ones
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.clayford.net/statistics/tag/helmert-contrasts/
An R and S-Plus Companion to Applied Regression. John Fox
ISBN-13: 978-0761922803
contr.helmert(n, contrasts = TRUE, sparse = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/contrast.html
Helmert contrast
a polynomial contrast is a contrast which uses orthogonal polynomial coefficients to test for trends (linear, quadratic, cubic, and so on) across the ordered levels of a factor.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
contr.poly(n, scores = 1:n, contrasts = TRUE, sparse = FALSE)
from:
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/contrast.html
polynomial contrast
treatment contrast is a contrast in which the "first" level of a categorical variable (aka the baseline) is included in the intercept of a linear model and each subsequent level has a coefficient that represents its difference from the baseline.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from multiple sources:
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Fidh_idd_genlin_emmeans.htm
http://www.clayford.net/statistics/tag/helmert-contrasts/
http://www.aliquote.org/articles/tech/contrasts.html
contr.treatment(n, base = 1, contrasts = TRUE, sparse = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/contrast.html
treatment contrast
the sum contrast is a contrast in which each coefficient compares the corresponding level of the factor to the average of the other levels
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.clayford.net/statistics/tag/helmert-contrasts/
An R and S-Plus Companion to Applied Regression. John Fox
ISBN-13: 978-0761922803
contr.sum(n, contrasts = TRUE, sparse = FALSE)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/contrast.html
sum contrast
2
Pearson's Chi-Squared test for goodness of fit is a statistical null hypothesis test which is used to evaluate whether an observed frequency distribution differs from a theoretically expected distribution; the test statistic follows a Chi-Squared distribution under the null hypothesis.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Chi2 test for goodness of fit
adapted from:
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html
and
http://en.wikipedia.org/wiki/Pearson's_chi-squared_test
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html
chisq.test(x = NULL, correct = FALSE,
p = rep(1/length(x), length(x)), rescale.p = FALSE,
simulate.p.value = FALSE, B = 2000)
Pearson's Chi square test of goodness of fit
1
2
2
Barnard's test is an exact statistical test used to determine if there are nonrandom associations between two categorical variables. It was developed by Barnard in 1945 and is, in most cases, more powerful than Fisher's exact test.
duplicate with OBI_0200176. so either MIREOT and add metadata and axioms or move from OBI
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Barnard's_test
and
G A Barnard (1945) "A New Test for 2X2 Tables", Nature, 156, 177 & 783.
Barnard's test
barnardw.test(n1, n2, n3, n4, dp = 0.001, verbose = FALSE)
from
http://www.inside-r.org/packages/cran/Barnard/docs/barnardw.test
Barnard's test
a central composite design is a study design which contains an embedded factorial or fractional factorial design with center points that is augmented with a group of so-called 'star points' that allow estimation of curvature.
A CCD design with k factors has 2k star points.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.itl.nist.gov/div898/handbook/pri/section3/pri3361.htm
Box-Wilson Central Composite Design
ccd(basis, generators, blocks = "Block", n0 = 4, alpha = "orthogonal",
wbreps = 1, bbreps = 1, randomize = TRUE, inscribed = FALSE, coding)
http://artax.karlin.mff.cuni.cz/r-help/library/rsm/html/ccd.html
central composite design
The Box-Behnken design is an independent quadratic design in that it does not contain an embedded factorial or fractional factorial design. In this design the treatment combinations are at the midpoints of edges of the process space and at the center. These designs are rotatable (or near rotatable) and require 3 levels of each factor. The designs have limited capability for orthogonal blocking compared to the central composite designs.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.itl.nist.gov/div898/handbook/pri/section3/pri3362.htm
bbd(k, n0 = 4, block = (k == 4 | k == 5), randomize = TRUE, coding)
from:
http://artax.karlin.mff.cuni.cz/r-help/library/rsm/html/bbd.html
Box–Behnkens design
Plackett-Burman design is a type of study design optimizing multifactorial experiments characterized by their parsimony and economy with the run number a multiple of 4 (rather than a power of 2).
Plackett-Burman design is often used for screening experiments where the main effect is often heavily confounded with two-factor interactions.
This type of design is very useful for economically detecting large main effects, assuming all interactions are negligible when compared with the few important main effects.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.itl.nist.gov/div898/handbook/pri/section3/pri335.htm
pb(nruns, nfactors = nruns - 1, factor.names = if (nfactors <= 50)
Letters[1:nfactors] else paste("F", 1:nfactors, sep = ""),
default.levels = c(-1, 1), ncenter=0, center.distribute=NULL,
boxtyssedal = TRUE, n12.taguchi = FALSE,
replications = 1, repeat.only = FALSE,
randomize = TRUE, seed = NULL, oldver = FALSE, ...)
from:
http://www.inside-r.org/packages/cran/FrF2/docs/pb
Plackett-Burman design
upper confidence limit is a data item which is the largest value bounding a confidence interval
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
upper confidence limit
lower confidence limit is a data item which is the lowest value bounding a confidence interval
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
lower confidence limit
root-mean-square standardized effect is a data item which denotes effect size in the context of analysis of variance and corresponds to the square root of the arithmetic average of p standardized effects (effects normalized to be expressed in standard deviation units).
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Ψ
http://www.statpower.net/Steiger%20Biblio/Steiger04.pdf
RMSSE
root-mean-square standardized effect
Eta-squared is a biased estimator of the variance explained by the model in the population (it estimates only the effect size in the sample). Eta-squared describes the ratio of variance explained in the dependent variable by a predictor while controlling for other predictors, making it analogous to the r2. This estimate shares the weakness with r2 that each additional variable will automatically increase the value of η2. In addition, it measures the variance explained of the sample, not the population, meaning that it will always overestimate the effect size, although the bias grows smaller as the sample grows larger.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Effect_size#Eta-squared.2C_.CE.B72
η2
eta-squared
omega-squared is an effect size estimate for variance explained which is less biased than the eta-squared coefficient.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from:
http://en.wikipedia.org/wiki/Effect_size#Omega-squared.2C_.CF.892
ω2
omega-squared
Hedges's g is an estimator of effect size, which is similar to Cohen's d and is a measure based on a standardized difference. However, the denominator, corresponding to a pooled standard deviation, is computed differently from Cohen's d coefficient, by applying a correction factor (which involves a Gamma function).
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from :
http://en.wikipedia.org/wiki/Effect_size#Cohen.27s_d
and
http://blog.stata.com/tag/cohens-d/
Hedges's g
Glass's delta is an estimator of effect size which is similar to Cohen's d but where the denominator corresponds only to the standard deviation of the control group (or second group). It is considered less biased than Cohen's d for estimating effect sizes based on means and distances between means.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from :
http://en.wikipedia.org/wiki/Effect_size#Cohen.27s_d
and
http://blog.stata.com/tag/cohens-d/
Glass's delta
0
Probability distribution estimated empirically on the data without assumptions on the shape of the probability distribution.
Camille Maumet
Karl Helmer
Philippe Rocca-Serra
Thomas Nichols
Initially discussed at https://github.com/incf-nidash/nidm/pull/191
non-parametric distribution
http://artax.karlin.mff.cuni.cz/r-help/library/nparcomp/html/weight.matrix.html
a contrast weight is a coefficient which multiplies a group mean as part of a linear combination defining a contrast as a weighted sum of group means, giving a 'weight' to a specific group mean, hence the name.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Thomas Nichols
adapted from wikipedia:
http://en.wikipedia.org/wiki/Contrast_%28statistics%29
contrast coefficient
contrast weight
[1,0,0]
a contrast weight matrix is an information content entity which holds a set of contrast weights, the coefficients used in the weighted sum of means defining a contrast
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
STATO
contrast weights
contrast weight matrix
contrast weight estimate is a model parameter estimate which results from computation on the data and is used as input to a model fitting process
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
STATO
contrast weight estimate
http://www.ncbi.nlm.nih.gov/pubmed/7791040
The Akaike information criterion (AIC) is a measure of the relative quality of a statistical model for a given set of data. As such, AIC provides a means for model selection. AIC is defined as:
AIC = 2K - 2log(L)
where K is the number of predictors and L is the maximized likelihood value.
AIC deals with the trade-off between the goodness of fit of the model and the complexity of the model. It is founded on information theory: it offers a relative estimate of the information lost when a given model is used to represent the process that generates the data. AIC does not provide a test of a model in the sense of testing a null hypothesis; i.e., AIC says nothing about the quality of a model in an absolute sense. If all the candidate models fit poorly, AIC will give no warning of that.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Akaike_information_criterion
and
http://users.ecs.soton.ac.uk/jn2/teaching/aic.pdf
AIC
AIC(object, ..., k = 2)
from:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/AIC.html
Akaike information criterion
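The formula above is direct to compute once a model's maximised log-likelihood is available. A minimal Python sketch (illustrative only; the function name is ours):

```python
def aic(log_likelihood, k):
    """AIC = 2K - 2*log(L), with K the number of predictors and
    log_likelihood the maximised log-likelihood log(L)."""
    return 2 * k - 2 * log_likelihood

# Lower AIC is better: a 3-parameter model must improve the
# log-likelihood by at least one unit over a 2-parameter model.
print(aic(-100.0, 2))  # 204.0
print(aic(-99.5, 3))   # 205.0
```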
http://www.ncbi.nlm.nih.gov/pubmed/19761098
corrected Akaike information criterion is a modified version of the Akaike information criterion.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
CAIC
corrected Akaike information criterion
http://www.ncbi.nlm.nih.gov/pubmed/7791040
Bayesian information criterion, or Schwarz's Bayesian information criterion, is a criterion for model selection among a finite set of models. It is based, in part, on the likelihood function and is closely related to the Akaike information criterion (AIC).
Given any two estimated models, the model with the lower value of BIC is the one to be preferred. The BIC is an increasing function of sigma_e^2 and an increasing function of k. That is, unexplained variation in the dependent variable and the number of explanatory variables increase the value of BIC. Hence, lower BIC implies either fewer explanatory variables, better fit, or both.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Schwarz, Gideon E. (1978). "Estimating the dimension of a model". Annals of Statistics 6 (2): 461–464. doi:10.1214/aos/1176344136.
http://en.wikipedia.org/wiki/Bayesian_information_criterion
BIC
SBIC
Schwarz's Bayesian information criterion
Bayesian information criterion
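Under the usual formulation, BIC = k·ln(n) − 2·ln(L) for k parameters and n observations, so the penalty grows with sample size (unlike AIC's fixed 2k). A hedged sketch (the function name is ours):

```python
import math

def bic(log_likelihood, k, n):
    """BIC = k*ln(n) - 2*ln(L): k parameters, n observations,
    log_likelihood the maximised log-likelihood ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

# The model with the lower BIC value is preferred.
print(bic(-100.0, 2, 100))
```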
2
A statistical model selection is a data transformation which computes a relative quality value in order to evaluate and select the model that best explains the data.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
statistical model selection
0
0
Probability distribution which has no skew, i.e. its skewness = 0
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
STATO
symmetric distribution
Probability distribution estimated empirically from all acquired data
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Thomas Nichols
http://reference.wolfram.com/language/ref/EmpiricalDistribution.html
empirical distribution
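The empirical distribution can be characterised by its empirical cumulative distribution function, F(x) = (fraction of observations ≤ x). A minimal sketch (illustrative only; names are ours):

```python
def ecdf(sample):
    """Empirical CDF of a sample: F(x) = fraction of observations <= x."""
    data = sorted(sample)
    n = len(data)
    def F(x):
        # count of data points not exceeding x (bisect would also work)
        return sum(1 for v in data if v <= x) / n
    return F

F = ecdf([1, 2, 2, 3])
print(F(2))  # 0.75
```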
Probability distribution estimated empirically on the data following a binning process
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Thomas Nichols
histogram distribution
Probability distribution estimated using a smooth kernel function to avoid making assumptions about the distribution of the data. The kernel density estimator is the estimated probability density function (pdf) of the random variable.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Thomas Nichols
http://uk.mathworks.com/help/stats/kernel-distribution.html
and
http://reference.wolfram.com/language/ref/SmoothKernelDistribution.html
smooth kernel distribution
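A smooth kernel density estimate places a kernel at each data point and averages; with a Gaussian kernel, f(x) = (1/(n·h)) Σᵢ K((x − xᵢ)/h). A minimal sketch assuming a Gaussian kernel (names are ours):

```python
import math

def gaussian_kde(sample, bandwidth):
    """Kernel density estimate with a Gaussian kernel K (standard
    normal pdf): f(x) = (1/(n*h)) * sum_i K((x - x_i)/h)."""
    n = len(sample)
    def K(u):
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    def f(x):
        return sum(K((x - xi) / bandwidth) for xi in sample) / (n * bandwidth)
    return f

f = gaussian_kde([0.0, 1.0, 2.0], bandwidth=0.5)
```

The bandwidth h controls smoothness; the resulting f integrates to 1 like any density.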
kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Thomas Nichols
https://en.wikipedia.org/wiki/Kernel_density_estimation
https://reference.wolfram.com/language/ref/KernelMixtureDistribution.html
kernel mixture distribution
Mixture distribution is the probability distribution of a random variable that is derived from a collection of other random variables as follows: first, a random variable is selected by chance from the collection according to given probabilities of selection, and then the value of the selected random variable is realized.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Thomas Nichols
http://en.wikipedia.org/wiki/Mixture_distribution
mixture distribution
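The two-step recipe in the definition (select a component by its probability, then realise the selected variable) can be sketched directly (illustrative only; names are ours):

```python
import random

def sample_mixture(components, weights, rng=random):
    """One draw from a mixture: first select a component at random
    according to the given selection probabilities, then realise a
    value of the selected random variable."""
    chosen = rng.choices(components, weights=weights, k=1)[0]
    return chosen()

# 70/30 mixture of two normals with means 0 and 5
rng = random.Random(0)
components = [lambda: rng.gauss(0.0, 1.0), lambda: rng.gauss(5.0, 1.0)]
value = sample_mixture(components, [0.7, 0.3], rng)
```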
Probability distribution estimated empirically from censored lifetime data
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Thomas Nichols
http://reference.wolfram.com/language/ref/SurvivalDistribution.html
survival distribution
best linear unbiased prediction is a data transformation which predicts <TDB> under the assumption that the variable(s) under consideration have a random effect
Philippe Rocca-Serra
Henderson C. R., 1984 Applications of Linear Models in Animal Breeding. University of Guelph, Guelph, Ontario, Canada.
ftp://tech.obihiro.ac.jp/suzuki/Henderson.pdf
BLUP
best linear unbiased predictor of the random effect
conditional mode of the random effect
best linear unbiased predictor
breeding value estimation is a data transformation process aiming at computing breeding value estimates of an organism given a set of genomic (SNP) observations, pedigree information and/or phenotypic observations.
Philippe Rocca-Serra
breeding value estimation
breeding value estimation using genotype data is a data transformation process aiming at computing breeding value estimates of an organism given a set of genomic (SNP) observations.
Philippe Rocca-Serra
breeding value estimation using genotype data
breeding value estimation using pedigree data is a data transformation process aiming at computing breeding value estimates of an organism given a set of pedigree information.
Philippe Rocca-Serra
breeding value estimation using pedigree data
breeding value estimation using phenotypic data is a data transformation process aiming at computing breeding value estimates of an organism given a set of phenotypic observations.
Philippe Rocca-Serra
breeding value estimation using phenotypic data
Philippe Rocca-Serra
genomic selection objective
a dataset which is made up of genotypic information, that is, presenting allele information at specific loci for a set of individuals of an organism.
Philippe Rocca-Serra
genotype data set
a covariance structure is a data item which is part of a regression model and which indicates a pattern in the covariance matrix. The nature of the covariance structure is specified before the regression analysis; various covariance structures may be tested and evaluated using information criteria to help choose the most suitable model
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://www3.nd.edu/~kyuan/courses/sem/readpapers/benter.pdf
covariance structure
Given two sets of locations, the Matern covariance function computes the Matern cross-covariance matrix for covariances among all pairings.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
Matern covariance function
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
https://www.rdocumentation.org/packages/fields/versions/2.3/topics/matern.cov
matern.cov
Matern function anisotropic covariance structure
The rational quadratic covariance function is used in spatial statistics, geostatistics, machine learning, image analysis, and other fields where multivariate statistical analysis is conducted on metric spaces. It is commonly used to define the statistical covariance between measurements made at two points that are d units distant from each other. Since the covariance only depends on distances between points, it is stationary. If the distance is Euclidean distance, the rational quadratic covariance function is also isotropic.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
https://en.wikipedia.org/wiki/Rational_quadratic_covariance_function
http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corRatio.html
rational quadratic anisotropic covariance structure
spatial linear geometric anisotropic covariance structure is a type of covariance structure characterized by its anisotropy, i.e., the variation of properties can differ in directions x and y, which in this case gives linear features.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
SP(LINGA)
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corLin.html
spatial linear geometric anisotropic covariance structure
spatial spherical geometric anisotropic covariance structure is a type of covariance structure characterized by its anisotropy, i.e., the variation of properties can differ in directions x and y, which in this case gives spherical features.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
SP(SPHGA)
http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corSpher.html
spatial spherical geometric anisotropic covariance structure
spatial gaussian geometric anisotropic covariance structure is a type of covariance structure characterized by its anisotropy, i.e., the variation of properties can differ in directions x and y, which in this case gives gaussian features.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
SP(GAUGA)
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corGaus.html
spatial gaussian geometric anisotropic covariance structure
spatial exponential geometric anisotropic covariance structure is a type of covariance structure characterized by its anisotropy, i.e., the variation of properties can differ in directions x and y, which in this case gives exponential features.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
SP(EXPGA)
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
spatial exponential geometric anisotropic covariance structure
spatial exponential anisotropic covariance structure is a type of covariance structure characterized by its anisotropy, i.e., the variation of properties can differ in directions x and y, which in this case gives exponential features.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
SP(EXPA)(c-list)
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
Sacks et al. (1989)
http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corExp.html
spatial exponential anisotropic covariance structure
the banded heterogeneous Toeplitz covariance structure is a type of covariance structure which is often used to analyze and interpret repeated measures designs.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
TOEPH(q)
banded heterogeneous Toeplitz covariance structure
This covariance structure has heterogeneous variances and heterogeneous correlations between elements. The correlation between adjacent elements is homogeneous across pairs of adjacent elements. The correlation between elements separated by a third element is again homogeneous, and so on.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
TOEPH
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
as well as:
https://www.ibm.com/support/knowledgecenter/en/SSLVMB_23.0.0/spss/advanced/covariance_structures.html
heterogeneous Toeplitz covariance structure
A banded Toeplitz structure, defined by parameter q, can be viewed as a moving-average structure with order q-1.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
TOEP(q)
banded Toeplitz covariance structure
The Toeplitz covariance structure has homogeneous variances and heterogeneous correlations between elements. The correlation between adjacent elements is homogeneous across pairs of adjacent elements. The correlation between elements separated by a third element is again homogeneous, and so on.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
https://www.ibm.com/support/knowledgecenter/en/SSLVMB_23.0.0/spss/advanced/covariance_structures.html
TOEP
Toeplitz covariance structure
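The banded pattern described above (one variance, one correlation per lag) can be made concrete with a small construction sketch (illustrative only; names are ours):

```python
def toeplitz_cov(variance, correlations):
    """Toeplitz covariance matrix: a constant variance on the diagonal
    and a single correlation per off-diagonal band (lag 1, lag 2, ...)."""
    n = len(correlations) + 1
    rho = [1.0] + list(correlations)          # rho[d] = correlation at lag d
    return [[variance * rho[abs(i - j)] for j in range(n)] for i in range(n)]

# 4x4 TOEP with variance 2 and band correlations 0.5, 0.3, 0.1
m = toeplitz_cov(2.0, [0.5, 0.3, 0.1])
```

Every entry depends only on |i − j|, which is exactly the Toeplitz property.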
a form of covariance structure used to provide a basis for analysis in the context of repeated measures datasets (longitudinal, time series)
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
HF
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
Huynh and Feldt 1970
Huynh-Feldt covariance structure
factor-analytic structure is a covariance structure which is specified for q factors
equal diagonal factor-analytic covariance structure is a type of factor analytic covariance structure specified for q factors, which includes a diagonal component for repeated measures.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
FA1(q)
equal diagonal Factor Analytic covariance structure
no diagonal factor-analytic covariance structure is a type of factor analytic covariance structure specified for q factors, which does not include a diagonal component for repeated measures.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
FA0(q)
no diagonal Factor Analytic covariance structure
factor-analytic structure is a type of heterogeneous covariance structure which is specified for q factors
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
adapted from:
Heterogeneous Variance: Covariance Structures for Repeated Measures
Russell D. Wolfinger. Journal of Agricultural, Biological, and Environmental Statistics
Vol. 1, No. 2 (Jun., 1996), pp. 205-230.
https://doi.org/10.2307/1400366
and
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
Jennrich and Schluchter 1986
FA(q)
Factor Analytic covariance structure
compound symmetry covariance structure is a covariance structure in which all the variances are equal and all the covariances are equal.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
CS
http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corCompSymm.html
compound symmetry covariance structure
heterogeneous compound symmetry structure is a compound symmetry covariance structure which has a different variance parameter for each diagonal element, and it uses the square roots of these parameters in the off-diagonal entries.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
CSH
heterogeneous compound symmetry covariance structure
first order autoregressive moving average covariance structure is a type of covariance structure which is used in the context of time series analysis
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
ARMA(1,1)
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corARMA.html
first order autoregressive moving average covariance structure
first order autoregressive covariance structure is a covariance structure where correlations among errors decline exponentially with distance
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
AR(1)
http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/corAR1.html
first order autoregressive covariance structure
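Under AR(1), Cov(eᵢ, eⱼ) = σ²·ρ^|i−j|, which is the exponential decline with distance mentioned above. A minimal construction sketch (names are ours):

```python
def ar1_cov(variance, rho, n):
    """AR(1) covariance matrix: Cov(e_i, e_j) = sigma^2 * rho^|i-j|,
    so correlation among errors declines exponentially with distance."""
    return [[variance * rho ** abs(i - j) for j in range(n)] for i in range(n)]

m = ar1_cov(1.0, 0.5, 4)  # lag-2 covariance is 0.25
```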
This is a heterogeneous first-order autoregressive structure: the variances along the main diagonal may differ, while the covariances decline exponentially with the distance between elements.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
ARH(1)
heterogeneous first-order autoregressive covariance structure
Ante-dependence covariance structure is a covariance structure which specifies that the covariance between two time points is a function of the product of variances at both points (hence allowing heterogeneity of error variance across measures to affect the correlation) and the product of the correlations at the distances up to the one chosen.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm
ANTE(1)
Ante-dependence covariance structure
Mallows' Cp is a data item which compares the precision and bias of the full model to models with a subset of the predictors, thus helping to choose between multiple regression models.
Mallows' Cp is a function of the number of parameters used in the model, relying on the residual sum of squares to compute a score.
The smaller Cp is, the better the model fit is.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/goodness-of-fit-statistics/what-is-mallows-cp/
http://ugrad.stat.ubc.ca/R/library/locfit/html/cp.html
Mallows' Cp
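One common formulation (our own illustrative choice here) is Cp = SSEₚ / MSE_full − n + 2p, where p counts the parameters of the subset model including the intercept; a sketch:

```python
def mallows_cp(sse_subset, mse_full, n, p):
    """Mallows' Cp = SSE_p / MSE_full - n + 2p, for a subset model with
    p parameters (intercept included) fitted to n observations.
    Values of Cp close to p suggest a low-bias subset model."""
    return sse_subset / mse_full - n + 2 * p

# For the full model itself SSE_full = MSE_full * (n - p_full), so Cp = p_full
print(mallows_cp(2.0 * (20 - 4), 2.0, n=20, p=4))  # 4.0
```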
repeated measure analysis is a kind of data transformation which deals with signals measured on the same experimental units at different times and, possibly, under different conditions over a period of time. Data produced by longitudinal studies qualify for such analysis. Since measurements are made on the same experimental units a number of times, they are likely to be correlated. Repeated measure analysis usually takes into consideration the possibility of correlation over time. It does so by specifying a covariance structure in the analysis
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from https://ciser.cornell.edu/sasdoc/saspdf/analyst/chap16.pdf
repeated measure analysis
repeated measure analysis
the ordinary least squares estimation is a model parameter estimation for a linear regression model whose errors are uncorrelated and equal in variance. It is the best linear unbiased estimator (BLUE) under these assumptions, and the uniformly minimum-variance unbiased estimator (UMVUE) with the addition of a Gaussian assumption.
Alejandra Gonzalez-Beltran
Camille Maumet
Philippe Rocca-Serra
Tom Nichols
http://en.wikipedia.org/wiki/Ordinary_least_squares and Tom Nichols
OLS estimation
https://stat.ethz.ch/R-manual/R-patched/library/stats/html/lm.html
ordinary least squares estimation
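For the simple one-predictor case the OLS solution has a closed form, b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. A minimal sketch (illustrative only; names are ours):

```python
from statistics import mean

def ols_simple(x, y):
    """OLS fit of y = b0 + b1*x: b1 = Sxy / Sxx, b0 = ybar - b1*xbar."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

b0, b1 = ols_simple([0, 1, 2, 3], [1, 3, 5, 7])  # exact line y = 1 + 2x
```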
the weighted least squares estimation is a model parameter estimation for a linear regression model with errors that are independent but have heterogeneous variance. It is difficult to use in practice, as weights must be set based on the variance, which is usually unknown. If the true variance is known, it is the best linear unbiased estimator (BLUE) under these assumptions, and the uniformly minimum-variance unbiased estimator (UMVUE) with the addition of a Gaussian assumption.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
http://en.wikipedia.org/wiki/Least_squares#Weighted_least_squares and Tom Nichols
WLS estimation
https://stat.ethz.ch/R-manual/R-patched/library/stats/html/lm.html
weighted least squares estimation
the generalized least squares estimation is a model parameter estimation for a linear regression model with errors that are dependent and (possibly) have heterogeneous variance. It is difficult to use in practice, as the covariance matrix of the errors must be known to "whiten" the data and model. If the true covariance is known, it is the best linear unbiased estimator (BLUE) under these assumptions, and the uniformly minimum-variance unbiased estimator (UMVUE) with the addition of a Gaussian assumption.
Philippe Rocca-Serra
Tom Nichols
http://en.wikipedia.org/wiki/Generalized_least_squares and Tom Nichols
GLS estimation
http://stat.ethz.ch/R-manual/R-devel/library/nlme/html/gls.html
generalized least squares estimation
the iteratively reweighted least squares estimation is a model parameter estimation which is a practical implementation of Weighted Least Squares, where the heterogeneous variances of the errors are estimated from the residuals of the regression model, providing an estimate for the weights. Each successive estimate of the weights improves the estimation of the regression parameters, which in turn are used to compute residuals and update the weights
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
Tom Nichols
iteratively reweighted least squares estimation
the feasible generalized least squares estimation is a model parameter estimation which is a practical implementation of Generalised Least Squares, where the covariance of the errors is estimated from the residuals of the regression model, providing the information needed to whiten the data and model. Each successive estimate of the whitening matrix improves the estimation of the regression parameters, which in turn are used to compute residuals and update the whitening matrix.
Alejandra Gonzalez-Beltran
Camille Maumet
Orlaith Burke
Philippe Rocca-Serra
Tom Nichols
Tom Nichols
feasible generalized least squares estimation
used as an unbiased estimator of the variance for a regression model
a residual mean square is a data item which is obtained by dividing the sum of squared residuals (SSR) by the number of degrees of freedom
Alejandra Gonzalez-Beltran
Camille Maumet
Philippe Rocca-Serra
Thomas Nichols
http://en.wikipedia.org/wiki/Mean_squared_error#Regression
http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-statistics/understanding-mean-squares/
https://github.com/ISA-tools/stato/issues/35
MSE
error mean square
residual mean square
Z-statistic is a statistic computed from observations and used to produce a p-value when compared to a Standard Normal Distribution in a statistical test called the Z-test.
Alejandra Gonzalez-Beltran
Camille Maumet
Philippe Rocca-Serra
Thomas Nichols
http://en.wikipedia.org/wiki/Z-test
Z-statistic
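For a one-sample Z-test with known population standard deviation, Z = (x̄ − μ₀) / (σ/√n). A minimal sketch (illustrative only; names are ours):

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (xbar - mu0) / (sigma / sqrt(n)); compared against the
    standard normal distribution to produce a p-value in a Z-test."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

print(z_statistic(103.0, 100.0, 15.0, 25))  # 1.0
```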
Deviance is an indicator of fit which can be estimated by computing -2 times the log-likelihood ratio of the fitted model compared to a saturated (full) model.
It is a generalization of the idea of using the sum of squares of residuals in ordinary least squares to cases where model-fitting is achieved by maximum likelihood.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Deviance_%28statistics%29
deviance
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/deviance.html
deviance
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682718/
The deviance information criterion (DIC) is a hierarchical modeling generalization of the AIC (Akaike information criterion) and BIC (Bayesian information criterion, also known as the Schwarz criterion). It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. Like AIC and BIC it is an asymptotic approximation as the sample size becomes large. It is only valid when the posterior distribution is approximately multivariate normal.
The deviance information criterion was published in 2002 by Spiegelhalter et al.
Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde, 2002. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, B, 64, 583-639.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://en.wikipedia.org/wiki/Deviance_information_criterion
DIC
http://artax.karlin.mff.cuni.cz/r-help/library/SpatialExtremes/html/DIC.html
deviance information criterion
The focused information criterion is a measurement data item which aims at facilitating model selection. It was published in 2003 by Claeskens, G. and Hjort, N.L. (2003). "The focused information criterion".
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Journal of the American Statistical Association, volume 98, pp. 879–899. doi:10.1198/016214503000000819
FIC
focused information criterion
a data transformation that finds a contrast value (the contrast estimate) by computing the weighted sum of model parameter estimates using a set of contrast weights.
Alejandra Gonzalez-Beltran
Camille Maumet
Philippe Rocca-Serra
Thomas Nichols
https://github.com/ISA-tools/stato/pull/37
contrast estimation
estimate of a contrast obtained by computing the weighted sum of model parameter estimates using a set of contrast weights.
Alejandra Gonzalez-Beltran
Camille Maumet
Philippe Rocca-Serra
Thomas Nichols
https://github.com/ISA-tools/stato/pull/37
contrast estimate
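The weighted sum described above is a single dot product between the contrast weights and the parameter estimates; a minimal sketch (names are ours):

```python
def contrast_estimate(weights, parameter_estimates):
    """Contrast estimate: the weighted sum of model parameter estimates,
    e.g. weights [1, -1, 0] for the difference of the first two group means."""
    return sum(w * b for w, b in zip(weights, parameter_estimates))

# difference of the first two group means: 10 - 8 = 2
print(contrast_estimate([1, -1, 0], [10.0, 8.0, 11.0]))  # 2.0
```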
an estimate of the standard deviation of a contrast estimate sampling distribution.
Alejandra Gonzalez-Beltran
Camille Maumet
Philippe Rocca-Serra
Thomas Nichols
https://github.com/ISA-tools/stato/pull/37
standard error of a contrast estimate
A scree plot is a graphical display of the variance of each component in the dataset, used to determine how many components should be retained in order to explain a high percentage of the variation in the data.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
http://www.stats.gla.ac.uk/glossary/?q=node/451
Cattell scree test plot
screeplot(x, npcs = min(10, length(x$sdev)),
type = c("barplot", "lines"),
main = deparse(substitute(x)), ...)
from:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/screeplot.html
scree plot
A scatterplot matrix contains all the pairwise scatter plots of a set of variables on a single page in a matrix format.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Adapted from http://itl.nist.gov/div898/handbook/eda/section3/eda33qb.htm
scatterplot matrix
The alpha distribution is a continuous probability distribution whose density function is as defined at: https://docs.scipy.org/doc/scipy-1.0.0/reference/tutorial/stats/continuous_alpha.html
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
https://docs.scipy.org/doc/scipy-1.0.0/reference/tutorial/stats/continuous_alpha.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.alpha.html#scipy.stats.alpha
alpha distribution
a power-law probability distribution is a probability distribution whose density function (or mass function in the discrete case) has the form
p(x) = L(x) . x^{-alpha}
where alpha is a parameter >1 and L(x) is a slowly varying function.
adapted from wikipedia and wolfram alpha:
https://en.wikipedia.org/wiki/Power_law#Power-law_probability_distributions
last accessed: 2015-11-03
https://cran.r-project.org/web/packages/poweRlaw/index.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.powerlaw.html#scipy.stats.powerlaw
power law distribution
A regression model is a statistical model used in a type of analysis known as regression analysis, whereby a function with a set of unknown parameters is used to determine the relation between a response variable and one or more independent variables.
Philippe Rocca-Serra
adapted from wikipedia:
https://en.wikipedia.org/wiki/Regression_analysis#Regression_models
last accessed: 2015-11-03
regression model
The Pareto distribution is a continuous probability distribution, which is defined by the following probability density function (1) and distribution function (2)
(1): P(x)=(ab^a)/(x^(a+1))
(2): D(x)=1-(b/x)^a
defined over the interval x>=b.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from:
http://mathworld.wolfram.com/ParetoDistribution.html
last accessed: 2015-11-04
http://artax.karlin.mff.cuni.cz/r-help/library/LaplacesDemon/html/dist.Pareto.html
last accessed: 2015-11-04
Usage
>dpareto(x, alpha, log=FALSE)
>ppareto(q, alpha)
>qpareto(p, alpha)
>rpareto(n, alpha)
Arguments
x,q
These are each a vector of quantiles.
p
This is a vector of probabilities.
n
This is the number of observations, which must be a positive integer that has length 1.
alpha
This is the shape parameter alpha, which must be positive.
log
Logical. If log=TRUE, then the logarithm of the density or result is returned.
Pareto type-I probability distribution
the Pareto type-II probability distribution is a continuous probability distribution which is defined by a probability density function characterized by 2 parameters, alpha and lambda, both real, strictly positive numbers. alpha is known as the shape parameter while lambda is known as the scale parameter.
the function defines the probability density of a continuous random variable according to the following:
p(x) = {\alpha \over \lambda} \left[{1+ {x \over \lambda}}\right]^{-(\alpha+1)}, \qquad x \geq 0,
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Lomax_distribution
Lomax distribution
http://www.inside-r.org/packages/cran/actuar/docs/Pareto
dpareto(x, shape, scale, log = FALSE)
Pareto type-II probability distribution
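The density above translates directly into code; a minimal sketch (illustrative only; the function name is ours, and `lam` stands in for lambda):

```python
def lomax_pdf(x, alpha, lam):
    """Pareto type-II (Lomax) density:
    p(x) = (alpha/lambda) * (1 + x/lambda)^-(alpha+1) for x >= 0,
    with shape alpha > 0 and scale lambda > 0."""
    if x < 0:
        return 0.0
    return (alpha / lam) * (1.0 + x / lam) ** -(alpha + 1)

print(lomax_pdf(0.0, 2.0, 1.0))  # 2.0 -- density at the origin is alpha/lambda
```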
The Pareto(III) distribution is a continuous probability distribution which is described by a cumulative distribution function of the following form:
F(x) = 1 − [1 + ((x − mu)/sigma)^(1/gamma)]^(−1)
for x > mu, sigma > 0, gamma > 0 and s = 1, where
mu is the location parameter,
sigma is the scale parameter,
gamma is the inequality parameter,
s is the shape parameter, fixed at the value 1.
The Pareto III distribution corresponds to a Pareto type IV distribution whose shape parameter has a value of 1.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia:
https://en.wikipedia.org/wiki/Pareto_distribution#Pareto_types_I.E2.80.93IV
last accessed: 2015-11-04
Pareto type-III probability distribution
The Pareto(IV) distribution is a continuous probability distribution which is described by a cumulative distribution function of the following form:
F(y) = 1 − [1 + ((y − a)/b)^(1/g)]^(−s)
for y > a, b > 0, g > 0 and s > 0, where
a is the location parameter,
b is the scale parameter,
g is the inequality parameter,
s is the shape parameter.
The distribution is used in actuarial science, economics, finance and telecommunications, but not restricted to those fields.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
https://cran.r-project.org/web/packages/VGAM/VGAM.pdf
page 517
dparetoIV(x, location = 0, scale = 1, inequality = 1, shape = 1, log = FALSE)
https://cran.r-project.org/web/packages/VGAM/VGAM.pdf
Pareto type-IV probability distribution
The geometric mean of two numbers, say 2 and 8, is just the square root of their product; that is sqrt(2 x 8)=4.
The geometric mean is defined as the nth root of the product of n numbers, i.e., for a set of numbers \{x_i\}_{i=1}^N, the geometric mean is defined as \left(\prod_{i=1}^N x_i\right)^{1/N}.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from:
https://en.wikipedia.org/wiki/Mean#Geometric_mean_.28GM.29
https://en.wikipedia.org/wiki/Geometric_mean
http://personality-project.org/r/html/geometric.mean.html
Usage: >geometric.mean(x,na.rm=TRUE)
Arguments: x , a vector or data.frame
http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.mstats.gmean.html
geometric mean
The harmonic mean is a kind of mean which is calculated by dividing the total number of observations by the sum of the reciprocals of the numbers in a series.
Harmonic Mean = N/(1/a1 + 1/a2 + 1/a3 + 1/a4 + ... + 1/aN)
where a(i) = individual score and N = sample size (number of scores)
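The formula can be sketched directly in Python (the function name is illustrative, not part of the cited R or SciPy packages):

```python
def harmonic_mean(xs):
    # N divided by the sum of the reciprocals of the N scores
    return len(xs) / sum(1.0 / x for x in xs)

print(harmonic_mean([1, 2, 4]))  # 3 / (1 + 0.5 + 0.25) ≈ 1.714
```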
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from wikipedia and
https://www.easycalculation.com/statistics/learn-harmonic-mean.php
last accessed: 2015-11-04
https://en.wikipedia.org/wiki/Harmonic_mean
https://en.wikipedia.org/wiki/Mean#Harmonic_mean_.28HM.29
http://personality-project.org/r/html/harmonic.mean.html
Usage: > harmonic.mean(x,na.rm=TRUE)
Arguments:
x, a vector, matrix, or data.frame
na.rm, na.rm=TRUE remove NA values before processing
http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.mstats.hmean.html
harmonic mean
The weighted arithmetic mean is a measure of central tendency that is the sum of the products of each observed value and their respective non-negative weights, divided by the sum of the weights, such that the contribution of each observed value to the mean may differ according to its respective weight. It is defined by the formula: A = sum(vi*wi)/sum(wi), where 'i' ranges from 1 to n, 'vi' is the value of each observation, and 'wi' is the value of the respective weight for each observed value.
The weighted arithmetic mean is a kind of mean similar to an ordinary arithmetic mean (the most common type of average), except that instead of each of the data points contributing equally to the final average, some data points are weighted, meaning they contribute more than others.
The weighted arithmetic mean is often used if one wants to combine average values from samples of the same population with different sample sizes.
Alejandra Gonzalez-Beltran
Matthew Diller
Orlaith Burke
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Weighted_arithmetic_mean
https://docs.scipy.org/doc/numpy/reference/generated/numpy.average.html
np.average(range(1,11), weights=range(10,0,-1))
https://github.com/ISA-tools/stato/issues/59
weighted arithmetic mean
The interquartile mean (IQM) (or midmean) is a statistical measure of central tendency based on the truncated mean of the interquartile range.
In the calculation of the IQM, only the data in the second and third quartiles are used (as in the interquartile range), and the lowest 25% and the highest 25% of the scores are discarded. The cut-off points are the first and third quartiles, hence the name of the IQM.
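A minimal Python sketch of the calculation, assuming for simplicity that the sample size is divisible by 4 (general definitions weight the fractional boundary observations; the function name is illustrative):

```python
def interquartile_mean(xs):
    # discard the lowest 25% and highest 25%, then average the middle half
    xs = sorted(xs)
    n = len(xs)
    assert n % 4 == 0, "this sketch only handles n divisible by 4"
    quarter = n // 4
    middle = xs[quarter:n - quarter]
    return sum(middle) / len(middle)

# middle half of the sorted data is [5, 6, 6, 7, 7, 8]
print(interquartile_mean([5, 8, 4, 38, 8, 6, 9, 7, 7, 3, 1, 6]))  # -> 6.5
```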
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
IQM
https://en.wikipedia.org/wiki/Mean#Interquartile_mean
interquartile mean
The root mean square (abbreviated RMS or rms), also known as the quadratic mean, is a statistical measure of central tendency defined as the square root of the mean of the squares of a sample.
(To find the root mean square of a set of numbers, square all the numbers in the set, find the arithmetic mean of the squares, and then take the square root of the result. This is the root mean square.)
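The three steps just described translate directly into a short Python sketch (function name illustrative):

```python
import math

def root_mean_square(xs):
    # square each value, average the squares, take the square root
    return math.sqrt(sum(x * x for x in xs) / len(xs))

print(root_mean_square([1, 2, 3, 4]))  # sqrt((1+4+9+16)/4) = sqrt(7.5) ≈ 2.739
```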
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
RMS
root mean square
https://en.wikipedia.org/wiki/Root_mean_square
and
http://www.mathwords.com/r/root_mean_square.htm
last accessed: 2015-11-04
quadratic mean
the sample mean of a sample of size n is an arithmetic mean computed over the n observations made on a statistical sample.
The sample mean, denoted x̄ and read "x-bar," is simply the average of the n data points x1, x2, ..., xn:
x̄ = (x1 + x2 + ⋯ + xn)/n = (1/n) ∑_{i=1}^{n} xi
The sample mean summarizes the "location" or "center" of the data.
the sample mean is a measure of location of the observations made on the sample and provides an unbiased estimate of the population mean
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from:
https://onlinecourses.science.psu.edu/stat414/node/66
and
http://mathworld.wolfram.com/SampleMean.html
last accessed: 2015-11-05
sample mean
the population mean or distribution mean is a parameter of a probability distribution or population indicative of the data location (central tendency). For a continuous probability distribution, the population mean is computed using the probability density function; for discrete probability distributions, a probability mass function is used instead.
A population mean can be estimated by computing a sample mean
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from:
http://mathworld.wolfram.com/PopulationMean.html
last accessed: 2015-11-05
population mean
A covariance structure where no restrictions are made on the covariance between any pair of measurements.
Alejandra Gonzalez-Beltran
Camille Maumet
Philippe Rocca-Serra
Thomas Nichols
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mixed_sect019.htm#statug.mixed.mixedcovstruct
unstructured covariance structure
2
Yuen's t-test is a two-sample t-test for populations of unequal variance which provides a more robust t-test procedure under normal and long-tailed distributions.
The test computes a t statistic using 'trimmed means' rather than 'arithmetic means', together with winsorized variances.
Philippe Rocca-Serra
Yuen-Welch's t-test
Biometrika (1974) 61 (1): 165-170
10.1093/biomet/61.1.165
http://finzi.psych.upenn.edu/library/DescTools/html/YuenTTest.html
Yuen t-Test with trimmed means
The Fagan nomogram is a graph plotting pre-test probabilities, likelihood ratios and post-test probabilities on 3 parallel axes. The plot was first proposed by Fagan in 1975 as a way to visualize Bayes' theorem, where
P(D) is the probability that the patient has the disease before the test. P(D|T) is the probability that the patient has the disease after the test result. P(T|D) is the probability of the test result if the patient has the disease, and P(T|D̄) is the probability of the test result if the patient does not have the disease. With this terminology the usefulness of both positive and negative test results can be assessed. A line drawn from P(D) on the right through the ratio of P(T|D) to P(T|D̄) gives P(D|T) on the left of the nomogram.
Philippe Rocca-Serra
http://www.ncbi.nlm.nih.gov/pubmed/1143310
N Engl J Med 1975; 293:257July 31, 1975
DOI: 10.1056/NEJM197507312930513
Fagan nomogram
Two-Step Fagan Nomogram, which adds two extra axes around the likelihood-ratio axis, representing sensitivity and specificity, to calculate negative and positive likelihood ratios in the same nomogram
Philippe Rocca-Serra
http://www.ncbi.nlm.nih.gov/pubmed/23468201
2 step Fagan nomogram
the likelihood ratio is a ratio which is formed by dividing the post-test odds by the pre-test odds in the context of a Bayesian formulation
Philippe Rocca-Serra
likelihood ratio
the likelihood ratio of negative results is a ratio which is formed by dividing the difference between 1 and the sensitivity of the test by the specificity of the test. This can also be expressed as dividing the probability that a person who has the disease tests negative by the probability that a person who does not have the disease tests negative.
Philippe Rocca-Serra
likelihood ratio for negative results
adapted from Wikipedia:
https://en.wikipedia.org/wiki/Likelihood_ratios_in_diagnostic_testing
last accessed: May 2016
negative likelihood ratio
the likelihood ratio of positive results is a ratio which is formed by dividing the sensitivity of the test by the difference between 1 and the specificity of the test. This can also be expressed as dividing the probability of the test giving a positive result when testing an affected subject by the probability of the test giving a positive result when a subject is not affected.
Philippe Rocca-Serra
likelihood ratio for positive results
adapted from Wikipedia:
https://en.wikipedia.org/wiki/Likelihood_ratios_in_diagnostic_testing
last accessed: May 2016
positive likelihood ratio
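Both likelihood ratios defined above follow directly from a test's sensitivity and specificity; a minimal Python sketch with hypothetical values (the function name is illustrative):

```python
def likelihood_ratios(sensitivity, specificity):
    # LR+ = sensitivity / (1 - specificity)
    # LR- = (1 - sensitivity) / specificity
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg

# e.g. a hypothetical test with 90% sensitivity and 80% specificity
print(likelihood_ratios(0.90, 0.80))  # roughly (4.5, 0.125)
```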
prevalence is a ratio formed by the number of subjects diagnosed with a disease divided by the total population size.
Philippe Rocca-Serra
adapted from: https://www.health.ny.gov/diseases/chronic/basicstat.htm
prevalence
Incidence is the ratio of the number of new cases of a disease divided by the number of persons at risk for the disease.
Philippe Rocca-Serra
adapted from: https://www.health.ny.gov/diseases/chronic/basicstat.htm
incidence
mortality is a ratio formed by the number of deaths due to a disease divided by the total population size.
Philippe Rocca-Serra
adapted from: https://www.health.ny.gov/diseases/chronic/basicstat.htm
mortality
in the context of binary classification, accuracy is defined as the proportion of true results (both true positives and true negatives) to the total number of cases examined (the sum of true positive, true negative, false positive and false negative).
It can be understood as a measure of the proximity of measurement results to the true value.
Philippe Rocca-Serra
Rand accuracy
Rand index
adapted from wikipedia:
https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification
last accessed: May 2016
accuracy
precision or positive predictive value is defined as the proportion of the true positives against all the positive results (both true positives and false positives)
Philippe Rocca-Serra
positive predictive value
adapted from wikipedia:
https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification
last accessed: May 2016
precision
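Both accuracy and precision, as defined above, follow directly from the four counts of the binary-classification confusion matrix; a minimal Python sketch (function names and counts are illustrative):

```python
def accuracy(tp, tn, fp, fn):
    # proportion of true results among all cases examined
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # true positives among all positive calls (positive predictive value)
    return tp / (tp + fp)

print(accuracy(40, 45, 5, 10))   # 85 / 100 = 0.85
print(precision(40, 5))          # 40 / 45 ≈ 0.889
```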
The probability of a patient having the target disorder before a diagnostic test result is known
Philippe Rocca-Serra
http://www.cebm.net/pre-test-probability/
pretest probability
a measure of heterogeneity in meta-analysis is a data item which aims to describe the variation in study outcomes between studies.
Philippe Rocca-Serra
adapted from http://www.statsdirect.com/help/default.htm#meta_analysis/heterogeneity.htm
last accessed: May 2016
measure of heterogeneity
Cochran's Q statistic is a measure of heterogeneity across studies computed by summing the squared deviations of each study's estimate from the overall meta-analytic estimate, weighting each study's contribution in the same manner as in the meta-analysis.
Philippe Rocca-Serra
Cochran WG. The combination of estimates from different experiments. Biometrics 1954;10: 101-29.
https://doi.org/10.2307/3001666
http://www.inside-r.org/packages/cran/RVAideMemoire/docs/cochran.qtest
Cochran's Q statistic
The quantity called I2 describes the percentage of total variation across studies that is due to heterogeneity rather than chance. I2 can be readily calculated from basic results obtained from a typical meta-analysis as I2 = 100% × (Q − df)/Q, where Q is Cochran's heterogeneity statistic and df the degrees of freedom. Negative values of I2 are set equal to zero so that I2 lies between 0% and 100%. A value of 0% indicates no observed heterogeneity, and larger values show increasing heterogeneity.
Unlike Cochran's Q, it does not inherently depend upon the number of studies considered. A confidence interval for I² is constructed using either i) the iterative non-central chi-squared distribution method of Hedges and Piggott (2001); or ii) the test-based method of Higgins and Thompson (2002). The non-central chi-square method is currently the method of choice (Higgins, personal communication, 2006) – it is computed if the 'exact' option is selected.
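Q and I2 can be sketched together in Python, assuming inverse-variance weighting as in a fixed-effect meta-analysis; the study estimates and variances below are hypothetical, and the function names are illustrative:

```python
def cochran_q(estimates, variances):
    # fixed-effect weights are the inverse variances; Q is the weighted
    # sum of squared deviations from the pooled estimate
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))

def i_squared(q, df):
    # I2 = 100% x (Q - df) / Q, truncated at zero
    return max(0.0, 100.0 * (q - df) / q) if q > 0 else 0.0

estimates = [0.30, 0.10, 0.50, 0.20]    # hypothetical study effect sizes
variances = [0.01, 0.02, 0.015, 0.025]  # hypothetical within-study variances
q = cochran_q(estimates, variances)
print(q, i_squared(q, len(estimates) - 1))
```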
Philippe Rocca-Serra
I2
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC192859/
and
http://www.statsdirect.com/help/default.htm#meta_analysis/heterogeneity.htm
I-squared
Tau-squared is an estimate of the between-study variance in a random-effects meta-analysis. The square root of this number (i.e. tau) is the estimated standard deviation of underlying effects across studies.
Philippe Rocca-Serra
http://handbook.cochrane.org/chapter_9/9_5_4_incorporating_heterogeneity_into_random_effects_models.htm
http://www.inside-r.org/packages/cran/meta/docs/metacor
Tau squared
The L’Abbé plot was introduced in 1987 in the context of meta-analyses of clinical trials with dichotomous (binary) outcomes, as a plot of observed risks in the treatment group against observed risks in the control group.
Another formulation is that it plots the event rate in the experimental (intervention) group against the event rate in the control group, as an aid to exploring the heterogeneity of effect estimates within a meta-analysis.
It is a diagram used in meta-analysis that compares the risks observed in the experimental and control arms of clinical trials. Each trial is located in the space of the diagram, where the sizes of the circles indicate the sizes of the trials. Trials in which the experimental treatment had a higher risk than the control will be in the upper left of the plot. If the risk in both groups is the same, the circle will fall on the line of equality. If the control treatment has a higher risk than the experimental treatment, then the point will be in the lower right of the plot. It is often used as an indicator of heterogeneity and hence as an indicator of the likelihood that results from different trials can be validly combined. Named after Kristin L'Abbé.
Philippe Rocca-Serra
10.1002/jrsm.6
Graphical displays for meta-analysis: An overview with suggestions for practice
http://www.ncbi.nlm.nih.gov/pubmed/3300460
and
http://www.dictionarycentral.com/definition/l-abb-plot.html
http://www.inside-r.org/packages/cran/meta/docs/labbe.metabin
L'Abbe plot
the proportion of individuals in a population with the outcome of interest
Philippe Rocca-Serra
adapted from:
http://handbook.cochrane.org/chapter_9/9_2_2_4_measure_of_absolute_effect_the_risk_difference.htm
observed risk
The risk difference is the difference between the observed risks (proportions of individuals with the outcome of interest) in the two groups.
The risk difference is straightforward to interpret: it describes the actual difference in the observed risk of events between experimental and control interventions.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
http://handbook.cochrane.org/chapter_9/9_2_2_4_measure_of_absolute_effect_the_risk_difference.htm
risk difference
The Sidik-Jonkman estimator is a data item computed to estimate the heterogeneity parameter (an estimate of between-study variance) in a random-effects model for meta-analysis.
Philippe Rocca-Serra
http://www.ncbi.nlm.nih.gov/pubmed/16955539
http://www.inside-r.org/packages/cran/meta/docs/metacor
metacor(cor, n, studlab,
data=NULL, subset=NULL,
sm=.settings$smcor,
level=.settings$level, level.comb=.settings$level.comb,
comb.fixed=.settings$comb.fixed, comb.random=.settings$comb.random,
hakn=FALSE,
method.tau="SJ",
tau.common=.settings$tau.common,
prediction=.settings$prediction, level.predict=.settings$level.predict,
method.bias=.settings$method.bias,
backtransf=.settings$backtransf,
title=.settings$title, complab=.settings$complab, outclab="",
byvar, bylab, print.byvar=.settings$print.byvar,
keepdata=.settings$keepdata
)
Sidik-Jonkman estimator
The Hunter-Schmidt estimator is a data item computed to estimate the heterogeneity parameter (an estimate of between-study variance) in a random-effects model for meta-analysis.
Philippe Rocca-Serra
Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. by John E. Hunter, Frank L. Schmidt
doi:10.2307/2289738
http://www.inside-r.org/packages/cran/meta/docs/metacor
metacor(cor, n, studlab,
data=NULL, subset=NULL,
sm=.settings$smcor,
level=.settings$level, level.comb=.settings$level.comb,
comb.fixed=.settings$comb.fixed, comb.random=.settings$comb.random,
hakn=FALSE,
method.tau="HS",
tau.common=.settings$tau.common,
prediction=.settings$prediction, level.predict=.settings$level.predict,
method.bias=.settings$method.bias,
backtransf=.settings$backtransf,
title=.settings$title, complab=.settings$complab, outclab="",
byvar, bylab, print.byvar=.settings$print.byvar,
keepdata=.settings$keepdata
)
Hunter-Schmidt estimator
restricted maximum likelihood estimation is a kind of maximum likelihood estimation data transformation which estimates the variance components of random effects in univariate and multivariate meta-analysis. In contrast to 'maximum likelihood estimation', REML can produce unbiased estimates of variance and covariance parameters.
Philippe Rocca-Serra
https://doi.org/10.1093/biomet/58.3.545
REML
reml(y, v, x, data, RE.constraints = NULL, RE.startvalues = 0.1,
RE.lbound = 1e-10, intervals.type = c("z", "LB"),
model.name="Variance component with REML",
suppressWarnings = TRUE, silent = TRUE, run = TRUE, ...)
https://www.rdocumentation.org/packages/metaSEM/versions/1.0.0/topics/reml
restricted maximum likelihood estimation
maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model, given observations. MLE attempts to find the parameter values that maximize the likelihood function, given the observations.
The method of maximum likelihood is based on the likelihood function L(θ; x). We are given a statistical model, i.e. a family of distributions {f(·; θ) : θ ∈ Θ}, where θ denotes the (possibly multi-dimensional) parameter of the model. The method of maximum likelihood finds the values of the model parameter θ that maximize the likelihood function L(θ; x).
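The idea can be illustrated with a Bernoulli model, for which the closed-form MLE is k/n; the coarse grid search below is purely illustrative of "maximizing the likelihood", not an efficient estimation procedure:

```python
import math

# Given Bernoulli observations x, find p maximizing
# L(p; x) = p^k (1-p)^(n-k); we work with the log-likelihood.

def log_likelihood(p, data):
    k = sum(data)
    n = len(data)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

data = [1, 0, 1, 1, 0, 1, 1, 1]           # 6 successes in 8 trials
grid = [i / 1000 for i in range(1, 1000)]  # candidate values of p
p_hat = max(grid, key=lambda p: log_likelihood(p, data))
print(p_hat)  # -> 0.75, matching the closed-form MLE k/n = 6/8
```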
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
http://stat.ethz.ch/R-manual/R-devel/library/stats4/html/mle.html
maximum likelihood estimation
The DerSimonian-Laird estimator is a data item computed to estimate the heterogeneity parameter (an estimate of between-study variance) in a random-effects model for meta-analysis. The estimator is used in a simple noniterative procedure for characterizing the distribution of treatment effects in a series of studies.
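The noniterative computation can be sketched in Python, assuming inverse-variance weights (the function name is illustrative, and the estimates/variances in the test below are hypothetical):

```python
def dersimonian_laird_tau2(estimates, variances):
    # DL estimate of between-study variance:
    # tau^2 = max(0, (Q - df) / C), with C = sum(w) - sum(w^2)/sum(w)
    w = [1.0 / v for v in variances]
    pooled = sum(wi * e for wi, e in zip(w, estimates)) / sum(w)
    q = sum(wi * (e - pooled) ** 2 for wi, e in zip(w, estimates))
    df = len(estimates) - 1
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    return max(0.0, (q - df) / c)
```

With identical study estimates the between-study variance collapses to zero, as expected for homogeneous studies.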
Philippe Rocca-Serra
doi:10.1016/j.cct.2006.04.004
http://www.ncbi.nlm.nih.gov/pubmed/3802833
doi:10.1016/0197-2456(86)90046-2
http://www.inside-r.org/packages/cran/meta/docs/metacor
metacor(cor, n, studlab,
data=NULL, subset=NULL,
sm=.settings$smcor,
level=.settings$level, level.comb=.settings$level.comb,
comb.fixed=.settings$comb.fixed, comb.random=.settings$comb.random,
hakn=FALSE,
method.tau="DL",
tau.common=.settings$tau.common,
prediction=.settings$prediction, level.predict=.settings$level.predict,
method.bias=.settings$method.bias,
backtransf=.settings$backtransf,
title=.settings$title, complab=.settings$complab, outclab="",
byvar, bylab, print.byvar=.settings$print.byvar,
keepdata=.settings$keepdata
)
DerSimonian-Laird estimator
a random-effects meta-analysis procedure defined by Hartung and Knapp and by Sidik and Jonkman which performs better than the DerSimonian and Laird approach, especially when there is heterogeneity and the number of studies in the meta-analysis is small.
Philippe Rocca-Serra
HKSJ method
doi:10.1186/1471-2288-14-25
http://www.inside-r.org/packages/cran/meta/docs/metacor
metacor(cor, n, studlab,
data=NULL, subset=NULL,
sm=.settings$smcor,
level=.settings$level, level.comb=.settings$level.comb,
comb.fixed=.settings$comb.fixed, comb.random=.settings$comb.random,
hakn=TRUE,
method.tau="HS",
tau.common=.settings$tau.common,
prediction=.settings$prediction, level.predict=.settings$level.predict,
method.bias=.settings$method.bias,
backtransf=.settings$backtransf,
title=.settings$title, complab=.settings$complab, outclab="",
byvar, bylab, print.byvar=.settings$print.byvar,
keepdata=.settings$keepdata
)
meta analysis by Hartung-Knapp-Sidik-Jonkman method
a meta-analysis which relies on the computation of the DerSimonian and Laird estimator as a measure of heterogeneity over a set of studies.
Philippe Rocca-Serra
http://www.inside-r.org/packages/cran/meta/docs/metacor
metacor(cor, n, studlab,
data=NULL, subset=NULL,
sm=.settings$smcor,
level=.settings$level, level.comb=.settings$level.comb,
comb.fixed=.settings$comb.fixed, comb.random=.settings$comb.random,
hakn=FALSE,
method.tau="DL",
tau.common=.settings$tau.common,
prediction=.settings$prediction, level.predict=.settings$level.predict,
method.bias=.settings$method.bias,
backtransf=.settings$backtransf,
title=.settings$title, complab=.settings$complab, outclab="",
byvar, bylab, print.byvar=.settings$print.byvar,
keepdata=.settings$keepdata
)
meta analysis by DerSimonian and Laird method
a meta-analysis which relies on the computation of the Hunter and Schmidt estimator as a measure of heterogeneity over a set of studies, by considering the weighted mean of the raw correlation coefficients. Hunter and Schmidt developed what are commonly termed validity generalization procedures (Schmidt and Hunter, 1977). These involve correcting the effect sizes in the meta-analysis for sampling error, measurement error and range restriction.
Philippe Rocca-Serra
Hunter JE, Schmidt FL. Methods of Meta-analysis: correcting error and bias in research findings. Newbury Park, CA: Sage 1990.
http://www.inside-r.org/packages/cran/meta/docs/metacor
metacor(cor, n, studlab,
data=NULL, subset=NULL,
sm=.settings$smcor,
level=.settings$level, level.comb=.settings$level.comb,
comb.fixed=.settings$comb.fixed, comb.random=.settings$comb.random,
hakn=FALSE,
method.tau="HS",
tau.common=.settings$tau.common,
prediction=.settings$prediction, level.predict=.settings$level.predict,
method.bias=.settings$method.bias,
backtransf=.settings$backtransf,
title=.settings$title, complab=.settings$complab, outclab="",
byvar, bylab, print.byvar=.settings$print.byvar,
keepdata=.settings$keepdata
)
meta analysis by Hunter-Schmidt method
McNemar's test is a statistical test used on paired nominal data. It is applied to 2 × 2 contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (that is, whether there is "marginal homogeneity"). It is named after Quinn McNemar, who introduced it in 1947.
An application of the test in genetics is the transmission disequilibrium test for detecting linkage disequilibrium
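The McNemar statistic depends only on the discordant-pair counts, conventionally b and c, of the 2 × 2 table; a minimal Python sketch (the continuity correction shown matches the `correct = TRUE` default of R's mcnemar.test; the function name is illustrative):

```python
def mcnemar_statistic(b, c, correct=True):
    # b and c are the discordant-pair counts from the 2x2 table;
    # the statistic is compared with a chi-squared distribution, df = 1
    num = (abs(b - c) - 1) ** 2 if correct else (b - c) ** 2
    return num / (b + c)

print(mcnemar_statistic(25, 10))  # (|25-10|-1)^2 / 35 = 196/35 = 5.6
```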
Philippe Rocca-Serra
McNemar's Chi-squared Test for Count Data
test of the marginal homogeneity of a contingency table
within-subjects chi-squared test
adapted from Wikipedia:
https://en.wikipedia.org/wiki/McNemar%27s_test
last accessed: May 2016
https://www.ncbi.nlm.nih.gov/pubmed/20254758
mcnemar.test(x, y = NULL, correct = TRUE)
from:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/mcnemar.test.html
McNemar test
Cochran's Q test is a statistical test used for unreplicated randomized block design experiments with a binary response variable and paired data.
In the analysis of two-way randomized block designs where the response variable can take only two possible outcomes (coded as 0 and 1), Cochran's Q test is a non-parametric statistical test to verify whether k treatments have identical effects.
Philippe Rocca-Serra
adapted from:
http://www.inside-r.org/packages/cran/CVST/docs/cochranq.test
and
https://en.wikipedia.org/wiki/Cochran%27s_Q_test
last accessed: May 2016
cochran.qtest(formula, data, alpha = 0.05, p.method = "fdr")
from:
http://www.inside-r.org/packages/cran/RVAideMemoire/docs/cochran.qtest
Cochran's q test for heterogeneity
a probability distribution scale parameter is a measure of variation which is set by the operator when selecting a parametric probability distribution and which defines how spread out the distribution is. The larger the value of the scale parameter, the more spread out the distribution.
user request:
https://github.com/ISA-tools/stato/issues/47
Philippe Rocca-Serra
adapted from Wikipedia:
https://en.wikipedia.org/wiki/Scale_parameter.
last accessed: 2016/11/11
scale
statistical dispersion
probability distribution scale parameter
a probability distribution shape parameter is a data item which is set by the operator when selecting a parametric probability distribution and which determines the profile (shape) of the distribution plot, as opposed to its location or size.
user request:
https://github.com/ISA-tools/stato/issues/47
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
adapted from Wikipedia:
https://en.wikipedia.org/wiki/Shape_parameter
last accessed: 2016-11-11
shape
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/GammaDist.html
probability distribution shape parameter
a scale estimator is a measurement datum (a statistic) which is calculated to approximate the actual scale parameter of a probability distribution from observed data.
user request:
https://github.com/ISA-tools/stato/issues/47
Philippe Rocca-Serra
adapted from Wikipedia:
https://en.wikipedia.org/wiki/Scale_parameter
last accessed: 2016/11/11
https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/gam.scale.html
scale estimator
a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. Likewise, if Y has a normal distribution, then X = exp(Y) has a log-normal distribution. A random variable which is log-normally distributed takes only positive real values. The distribution is occasionally referred to as the Galton distribution or Galton's distribution, after Francis Galton.
user request:
https://github.com/ISA-tools/stato/issues/47
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
adapted from Wikipedia:
https://en.wikipedia.org/wiki/Log-normal_distribution
last accessed: 2016/11/11
dlnorm(x, meanlog = 0, sdlog = 1, log = FALSE)
plnorm(q, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
qlnorm(p, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
rlnorm(n, meanlog = 0, sdlog = 1)
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Lognormal.html
log normal distribution
outlier detection testing objective is a statistical objective of a data transformation which aims to test a null hypothesis that an observation is not an outlier.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
outlier detection testing objective
outlier detection testing objective
Dixon test is a statistical test used to detect outliers in a univariate data set assumed to come from a normally distributed population.
Dixon's test is a statistical test intended to identify outliers in a data set associated with a univariate random variable whose underlying distribution is assumed to be normal.
Philippe Rocca-Serra
Dixon test
Q test
Robert B. Dean and Wilfrid J. Dixon (1951) "Simplified Statistics for Small Numbers of Observations". Anal. Chem., 1951, 23 (4), 636–638.
adapted from Wikipedia:
https://en.wikipedia.org/wiki/Dixon%27s_Q_test
last accessed: 2016-11-19
dixon.outliers(data)
from:
http://finzi.psych.upenn.edu/library/referenceIntervals/html/dixon.outliers.html
Dixon Q test
1
Grubbs' test is a statistical test used to detect one outlier in a univariate data set assumed to come from a normally distributed population.
Grubbs' test is a statistical test intended to identify one (and only one) outlier in a data set associated with a univariate random variable whose underlying distribution is assumed to be normal.
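The Grubbs (maximum normed residual) statistic itself is straightforward to compute; a minimal Python sketch with a hypothetical sample (the decision step, comparing G with a t-distribution-based critical value, is omitted):

```python
import math

def grubbs_statistic(xs):
    # G = max |x_i - mean| / s, the maximum normed residual,
    # with s the sample standard deviation (n - 1 denominator)
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return max(abs(x - mean) for x in xs) / s

# the value 30 sits far from the bulk of this hypothetical sample
print(grubbs_statistic([8, 9, 10, 11, 12, 30]))  # ≈ 2.01
```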
Philippe Rocca-Serra
maximum normed residual test
adapted from Wikipedia:
https://en.wikipedia.org/wiki/Grubbs'_test_for_outliers
last accessed: 2016-11-19
rgrubbs.test(x, alpha = 0.05)
from:
http://finzi.psych.upenn.edu/library/OutlierDM/html/rgrubbs.test.html
Grubbs' test
The Tietjen-Moore test for outliers is a statistical test used to detect outliers and corresponds to a generalization of Grubbs' test, thus allowing detection of more than one outlier in a univariate data set assumed to come from a normally distributed population.
If testing for a single outlier, the Tietjen-Moore test is equivalent to the Grubbs' test.
The Tietjen-Moore test is a statistical test intended to identify outliers in a data set associated with a univariate random variable whose underlying distribution is assumed to be normal.
This test is a generalization of Grubbs' test in the sense that it allows testing for the presence of more than one single outlier.
Philippe Rocca-Serra
Tietjen-Moore test
adapted from NIST:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h2.htm
last accessed: 2016-11-19
FindOutliersTietjenMooreTest(dataSeries,k,alpha=0.05)
from:
https://rdrr.io/rforge/climtrends/man/findOutliers.Tietjen.Moore.test.html
Tietjen-Moore test for outliers
The Extreme Studentized Deviate Test is a statistical test used to detect outliers in a univariate data set assumed to come from a normally distributed population.
The ESD test differs from Grubbs' test and the Tietjen-Moore test in the sense that it contains a built-in correction for multiple testing.
Philippe Rocca-Serra
ESD test for outliers
generalized ESD test for outliers
Rosner, Bernard (May 1983), Percentage Points for a Generalized ESD Many-Outlier Procedure,Technometrics, 25(2), pp. 165-172.
STATO
adapted from NIST:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
last accessed: 2016-11-19
rgrubbs.test(x, alpha = 0.05)
from:
http://finzi.psych.upenn.edu/library/OutlierDM/html/rgrubbs.test.html
generalized extreme studentized deviate test
1
2
a split-plot design is a kind of factorial design which is used when running a full factorial completely randomized design is impractical, either for cost or practical reasons (e.g. equipment, fields); in other words, when a restricted randomization has to be applied. A split-plot design is used whenever practitioners fix the level of a 'hard to change' factor and run all the combinations of the other factors. The hard to change factor is also referred to as the 'whole plot' factor, while the remaining factors are referred to as 'split plot' factors.
Performing a split-plot design therefore means fixing one factor level, and then applying the treatments formed by the cartesian product of the levels of the other factors. A minimum of 2 factors is required, with one being applied before the other(s).
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from:
http://www.minitab.com/uploadedFiles/Content/News/Published_Articles/recognize_split_plot_experiment.pdf
adapted from wikipedia:
https://en.wikipedia.org/wiki/Restricted_randomization
last accessed: 14.12.2016
https://pdfs.semanticscholar.org/bb4b/d979610388c76bb81568f14a886304ce4662.pdf
split-plot design
2
3
a split split plot design is a study design where restricted randomization affects 2 study factors (and not 1, as in a split-plot design). Such a design is only possible if at least 3 independent variables are present.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
adapted from https://onlinecourses.science.psu.edu/stat503/node/72
last accessed 2016/12/15
split split plot design
Restricted randomization is a kind of randomization which is used, or occurs, when hard to change factors exist in a study design. In other words, when complete randomization is not possible, a case of restricted randomization exists, for instance in the case of a split-plot design.
Restricted randomization allows intuitively poor allocations of treatments to experimental units to be avoided, while retaining the theoretical benefits of randomization.
Restricted randomization can also result from an unplanned event and is then something that should be avoided. The randomizeR R package can be used to detect such events and assess the quality of the randomization process.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
Adapted from Wikipedia:
https://en.wikipedia.org/wiki/Restricted_randomization
last accessed: 2016/12/15
restricted randomization
a 'whole plot number' is a data item used to count and identify the actual piece of land (in the case of real field-based trials) used in a split-plot design experiment and receiving treatments corresponding to the levels of a factor whose randomization is restricted (these factors are known as 'hard to change' factors).
In the case of non-field-based trials, the 'whole plot' is a metaphor.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
adapted from http://people.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDFbigbook-R/R-part012.pdf
last accessed: 2016/12/15
whole plot number
a 'sub-plot number' is a data item used to count and identify the actual piece of land located within a 'whole plot', in the case of real field-based trials using a split-plot design, and receiving completely randomized treatments corresponding to the factor level combinations of the remaining factors declared in the experiment.
In the case of a 'split-split plot design', sub-plots also receive treatments corresponding to a factor whose randomization is restricted. In such a configuration, each 'sub-plot' is itself divided into 'sub sub-plots', which then receive the remainder of the treatments in a completely randomized fashion.
In the case of non-field-based trials, the notion of a 'sub-plot' is a metaphor.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
adapted from http://people.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDFbigbook-R/R-part012.pdf
last accessed: 2016/12/15
sub-plot number
a 'sub sub-plot number' is a data item used to count and identify the actual piece of land located within a 'sub-plot', in the case of real field-based trials using a split-split-plot design, and receiving completely randomized treatments corresponding to the factor level combinations of the remaining factors declared in the experiment.
In the case of a 'split-split plot design', sub-plots also receive treatments corresponding to a factor whose randomization is restricted. In such a configuration, each 'sub-plot' is itself divided into 'sub sub-plots', which then receive the remainder of the treatments in a completely randomized fashion.
In the case of non-field-based trials, the notion of a 'sub sub-plot' is a metaphor.
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
adapted from http://people.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDFbigbook-R/R-part012.pdf
last accessed: 2016/12/15
sub sub-plot number
"Wilks' lambda is a test statistic used in multivariate analysis of variance (MANOVA) to test whether there are differences between the means of identified groups of subjects on a combination of dependent variables."
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
http://www.blackwellpublishing.com/specialarticles/jcn_9_381.pdf
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/summary.manova.html
## S3 method for class 'manova'
summary(object,
test = c("Pillai", "Wilks", "Hotelling-Lawley", "Roy"),
intercept = FALSE, tol = 1e-7, ...)
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mstats.f_value_wilks_lambda.html#scipy.stats.mstats.f_value_wilks_lambda
scipy.stats.mstats.f_value_wilks_lambda(ER, EF, dfnum, dfden, a, b)
Wilks' Lambda test
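Wilks' lambda can be computed directly from the hypothesis and error sums-of-squares-and-cross-products (SSCP) matrices as Λ = det(E) / det(E + H). A minimal numpy sketch, using illustrative toy matrices rather than real data:

```python
import numpy as np

def wilks_lambda(E, H):
    """Wilks' lambda: det(E) / det(E + H), where E is the within-group
    (error) SSCP matrix and H the hypothesis (between-group) SSCP matrix."""
    return np.linalg.det(E) / np.linalg.det(E + H)

# Toy 2x2 SSCP matrices (illustrative values only)
E = np.array([[10.0, 2.0], [2.0, 8.0]])
H = np.array([[6.0, 1.0], [1.0, 4.0]])

lam = wilks_lambda(E, H)  # 0 < lambda <= 1; small values suggest group differences
```

Values of Λ close to 1 support the null hypothesis of equal group mean vectors; small values indicate differences between the groups on the combination of dependent variables.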
"Pillai proposed the trace test for the following three tests: (a) equality of mean vectors of l p‐variate normal distributions with the common but unknown covariance matrix, (b) independence between two sets of variates distributed jointly as a normal distribution with unknown mean vector, and (c) equality of covariance matrices of two p‐variate normal distributions with unknown mean vectors."
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://doi.org/10.1002/0470011815.b2a13067
Pillai's trace test
"The Lawley–Hotelling trace is used to test the equality of mean vectors of k p‐variate normal distributions with common but unknown covariance matrix. The explicit form of the null distribution of $T_0^2$ is the F distribution. The asymptotic null distribution is the chi‐square distribution. The power function of the test is described and its power is compared with the likelihood ratio test."
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://doi.org/10.1002/0470011815.b2a13035
Lawley–Hotelling Trace
Hotelling-Lawley Trace test
"Roy's maximum root test finds the maximum characteristic root or eigenvalue statistic for testing: equality of k p-variate normal distributions with the same covariance matrix; independence between two sets of variables jointly distributed as a normal distribution; equality of covariance matrices of two p-variate normal distributions; and whether the covariance matrix of a p-variate normal distribution with unknown mean vector equals a specified matrix."
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Oxford Dictionary of Statistical Terms
https://onlinecourses.science.psu.edu/stat505/node/163
Roy’s Maximum Root test
"The multivariate analysis of variance, or MANOVA, is a procedure for comparing multivariate sample means. As a multivariate procedure, it is used when there are two or more dependent variables, and is typically followed by significance tests involving individual dependent variables separately.
It helps to answer:
1. Do changes in the independent variable(s) have significant effects on the dependent variables?
2. What are the relationships among the dependent variables?
3. What are the relationships among the independent variables?"
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Multivariate_analysis_of_variance
MANOVA
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/manova.html
multivariate analysis of variance
In a Bayesian statistics context, a credible interval is an interval of a posterior distribution such that the density at any point inside the interval is greater than the density at any point outside, and the area under the curve for that interval is equal to a prespecified probability level. For any probability level there is generally only one such interval, which is also often known as the highest posterior density region. Unlike the usual confidence interval associated with frequentist inference, here the intervals specify the range within which parameters lie with a certain probability.
The Bayesian counterpart of the confidence interval used in frequentist statistics.
Philippe Rocca-Serra
Bayesian credibility interval
Adapted from Wikipedia:
https://en.wikipedia.org/wiki/Credible_interval
and from the Cambridge Dictionary of Statistics, fourth edition, ISBN-13 978-0-511-78827-7
last accessed: 2017-07-01
HPD
region of highest posterior density
credible interval
In a Bayesian statistics context, a 95% credible interval is a credible interval which, given the data, includes the true parameter with probability 95%.
Philippe Rocca-Serra
Bayesian credibility interval at 95%
Wikipedia:
https://en.wikipedia.org/wiki/Credible_interval
last accessed: 2017-07-01
95% credible interval
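For a posterior represented by Monte Carlo draws, a 95% equal-tailed credible interval (a simpler alternative to the highest posterior density region) can be read off the sample quantiles. A stdlib-only sketch, with a simulated standard-normal posterior standing in for real MCMC output:

```python
import random

random.seed(42)
# Simulated posterior draws; a real analysis would use MCMC output.
draws = sorted(random.gauss(0.0, 1.0) for _ in range(100_000))

def equal_tailed_interval(sorted_draws, level=0.95):
    """Cut (1 - level)/2 probability from each tail of the posterior sample."""
    n = len(sorted_draws)
    alpha = (1.0 - level) / 2.0
    return sorted_draws[int(alpha * n)], sorted_draws[int((1.0 - alpha) * n) - 1]

lo, hi = equal_tailed_interval(draws, 0.95)
# For a standard normal posterior this approaches (-1.96, 1.96)
```

For a symmetric, unimodal posterior the equal-tailed interval and the highest posterior density interval coincide; for skewed posteriors they differ.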
"In clinical trials, it gives an idea of how much difference there is between the averages of the experimental and control groups."
"The mean difference, or difference in means, measures the absolute difference between the mean value in two different groups."
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
http://www.statisticshowto.com/mean-difference/
MD
difference in means
mean difference
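As a concrete illustration with made-up measurements, the mean difference is simply the difference between the two group means:

```python
from statistics import mean

# Hypothetical outcome measurements (illustrative values only)
experimental = [12.1, 11.4, 13.0, 12.6, 11.9]
control      = [10.2, 10.8,  9.9, 10.5, 10.6]

# Mean difference (MD): difference between the group means
md = mean(experimental) - mean(control)
```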
In a Bayesian statistics context, a 99% credible interval is a credible interval which, given the data, includes the true parameter with probability 99%.
Philippe Rocca-Serra
Bayesian credible interval at 99%
Wikipedia:
https://en.wikipedia.org/wiki/Credible_interval
last accessed: 2017-07-01
99% credible interval
group sequential design is a study design used in clinical trial settings in which interim analyses of the data are conducted after groups of patients are recruited. After each interim analysis, the trial may stop early if the evidence so far shows the new treatment is particularly effective or ineffective. Such designs are ethical and cost-effective, and so are of great interest in practice.
Philippe Rocca-Serra
adapted from https://www.jstatsoft.org/article/view/v066i02/v66i02.pdf
https://cran.r-project.org/web/packages/gsDesign/index.html
group sequential design
interim analysis is a data transformation used to analyze studies implementing a group-sequential design, in order to evaluate and interpret the accumulating information during a clinical trial. It refers to analysis of data conducted before full data collection has been completed. Clinical trials are unusual in that enrollment of patients is a continual process staggered in time. This means that if a treatment is particularly beneficial or harmful compared to the concurrent placebo group while the study is on-going, the investigators are ethically obliged to assess that difference using the data at hand and to give deliberate consideration to terminating the study earlier than planned.
Philippe Rocca-Serra
adapted from https://onlinecourses.science.psu.edu/stat509/node/75
and from wikipedia:
https://en.wikipedia.org/wiki/Interim_analysis
last accessed: 2017-10-9
interim analysis
the O'Brien-Fleming boundary analysis is a kind of interim analysis method introduced by O'Brien and Fleming to account for the inflation of the type I error rate caused by repeated hypothesis testing of accumulating data.
Like all frequentist methods of this type, it focuses on controlling the type I error rate, as the repeated hypothesis testing of accumulating data increases the type I error rate of a clinical trial.
O'Brien-Fleming boundary analysis
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3856440/
The Pocock boundary analysis gives a p-value threshold for each interim analysis which guides the data monitoring committee on whether to stop the trial. The boundary used depends on the number of interim analyses.
The Pocock boundary is simple to use in that the p-value threshold is the same at each interim analysis. The disadvantages are that the number of interim analyses must be fixed at the start and it is not possible under this scheme to add analyses after the trial has started. Another disadvantage is that investigators and readers frequently do not understand how the p-values are reported: for example, if there are five interim analyses planned, but the trial is stopped after the third interim analysis because the p-value was 0.01, then the overall p-value for the trial is still reported as <0.05 and not as 0.01.
Like all frequentist methods of this type, it focuses on controlling the type I error rate, as the repeated hypothesis testing of accumulating data increases the type I error rate of a clinical trial.
Philippe Rocca-Serra
https://doi.org/10.1093%2Fbiomet%2F64.2.191
and Wikipedia:
https://en.wikipedia.org/wiki/Pocock_boundary
Pocock boundary analysis
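The Pocock scheme amounts to looking up a single per-look threshold and comparing each interim p-value against it. A sketch in Python; the threshold values are the published nominal significance levels from Pocock (1977) for an overall two-sided alpha of 0.05, and the function name is illustrative:

```python
# Nominal per-look p-value thresholds (Pocock, 1977) for an overall
# two-sided alpha of 0.05, keyed by the planned number of analyses.
# The same threshold is applied at every interim analysis.
POCOCK_THRESHOLD = {1: 0.05, 2: 0.0294, 3: 0.0221, 4: 0.0182, 5: 0.0158}

def stop_early(p_value, planned_analyses):
    """True if the trial may stop at this interim analysis under Pocock's scheme."""
    return p_value <= POCOCK_THRESHOLD[planned_analyses]

# With 5 planned analyses, p = 0.01 at an interim look crosses the boundary
decision = stop_early(0.01, planned_analyses=5)  # True
```

This makes the reporting caveat above concrete: a trial with 5 planned looks stopped at p = 0.01 is reported as significant at the overall 0.05 level, not at 0.01.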
The Haybittle–Peto boundary analysis is an interim analysis where a rule for deciding when to stop a clinical trial prematurely is defined. It is named for John Haybittle and Richard Peto.
The Haybittle–Peto boundary is one such stopping rule: it states that if an interim analysis shows a probability equal to or less than 0.001 of observing a difference between the treatments as extreme or more extreme, given that the null hypothesis is true, then the trial should be stopped early. The final analysis is still evaluated at the normal level of significance (usually 0.05). The main advantage of the Haybittle–Peto boundary is that the same threshold is used at every interim analysis, unlike the O'Brien–Fleming boundary, which changes at every analysis. Also, using the Haybittle–Peto boundary means that the final analysis is performed using a 0.05 level of significance as normal, which makes it easier for investigators and readers to understand. The main argument against the Haybittle–Peto boundary is that some investigators believe it is too conservative and makes it too difficult to stop a trial.
Like all frequentist methods of this type, it focuses on controlling the type I error rate, as the repeated hypothesis testing of accumulating data increases the type I error rate of a clinical trial.
Philippe Rocca-Serra
10.1259/0007-1285-44-526-793
and adapted from Wikipedia:
https://en.wikipedia.org/wiki/Haybittle%E2%80%93Peto_boundary
Haybittle-Peto boundary analysis
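The rule itself is simple enough to state in a few lines of Python; this is an illustrative sketch of the decision rule described above, not an implementation from any specific package:

```python
def haybittle_peto_stop(p_value, is_final_analysis,
                        interim_threshold=0.001, final_alpha=0.05):
    """Haybittle-Peto rule: stop early only if an interim p-value is
    <= 0.001; the final analysis uses the usual alpha (here 0.05)."""
    threshold = final_alpha if is_final_analysis else interim_threshold
    return p_value <= threshold

# p = 0.01 at an interim look does NOT cross the 0.001 boundary...
early = haybittle_peto_stop(0.01, is_final_analysis=False)   # False
# ...but the same p-value is significant at the final analysis
final = haybittle_peto_stop(0.01, is_final_analysis=True)    # True
```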
A linear mixed model is a mixed model containing both fixed effects and random effects, in which factors and covariates are assumed to have a linear relationship to the dependent variable. These models are useful in a wide variety of disciplines in the physical, biological and social sciences. They are particularly useful in settings where repeated measurements are made on the same statistical units (longitudinal study), or where measurements are made on clusters of related statistical units. Because of their advantage in dealing with missing values, mixed effects models are often preferred over more traditional approaches such as repeated measures ANOVA.
Fixed-effects factors are generally considered to be the variables whose values of interest are all represented in the data file.
Random-effects factors are variables whose values correspond to unwanted variation. They are useful when trying to understand variability in the dependent variable which was not anticipated and exceeds what was expected.
Linear mixed models also allow analysts to specify interactions between factors, and allow the evaluation of the various linear effects that a particular combination of factor levels may have on a response variable.
Finally, linear mixed models allow analysts to specify variance components in order to describe the relation between various random effect levels.
Hanna Cwiek
Pawel Krajewski
Philippe Rocca-Serra
LMM
adapted from Wikipedia:
https://en.wikipedia.org/wiki/Mixed_model
linear mixed model
An empirical measure is a random measure arising from a particular realization of a (usually finite) sequence of random variables.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Empirical_measure
empirical measure
A model term is a data item set in a statistical model formula to apportion a source of variation.
Alejandra Gonzalez-Beltran
Hanna Cwiek
Philippe Rocca-Serra
STATO
statistical model term
model term
the model random effect term is a model term which aims to account for the unwanted variability in the data associated with a range of independent variables which are not the primary interest in the dataset. It is also known as the variance component of the model.
Alejandra Gonzalez-Beltran
Hanna Cwiek
Philippe Rocca-Serra
variance component
model random effect term
a model fixed effect term is a model term which accounts for variation explained by an independent variable and its levels.
Alejandra Gonzalez-Beltran
Hanna Cwiek
Philippe Rocca-Serra
model fixed effect term
a model interaction effect term is a model term which accounts for variation explained by the combined effects of the factor levels of more than one (usually 2) independent variables.
Alejandra Gonzalez-Beltran
Hanna Cwiek
Philippe Rocca-Serra
model interaction effect term
a model error term is a model term which accounts for residual variation not explained by the other components (fixed and random effect terms)
Alejandra Gonzalez-Beltran
Hanna Cwiek
Philippe Rocca-Serra
model error term
a statistic estimator is a data item which is computed from a dataset to provide an approximate value (an estimate) for a 'statistical parameter' (a characteristic/parameter of the true underlying distribution) of a real population.
Hanna Cwiek
Tom Nichols
Philippe Rocca-Serra
STATO
statistic estimator
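As a minimal illustration (toy numbers only): the sample mean and the Bessel-corrected sample variance are standard estimators of the corresponding population parameters:

```python
from statistics import mean, variance

# Hypothetical measurements drawn from a real population
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]

# The sample mean estimates the population mean; the (n - 1)-denominator
# sample variance is an unbiased estimator of the population variance.
mean_hat = mean(sample)
var_hat = variance(sample)   # Bessel-corrected, divides by n - 1
```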
An estimate of the number of degrees of freedom.
Alejandra Gonzalez-Beltran
Hanna Cwiek
Philippe Rocca-Serra
STATO
degree of freedom approximation
The Kenward-Roger method's fundamental idea is to calculate the approximate mean and variance of the statistic and then match moments with an F distribution to obtain the denominator degrees of freedom.
Alejandra Gonzalez-Beltran
Hanna Cwiek
Philippe Rocca-Serra
https://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_glimmix_details40.htm
https://www.jstatsoft.org/article/view/v059i09
https://www.ncbi.nlm.nih.gov/pubmed/9333350
Kenward-Roger
https://www.rdocumentation.org/packages/lmerTest/versions/2.0-36/topics/anova-methods
library(lme4)
library(pbkrtest)
fm1 <- lmer(Reaction ~ Days + (Days| Subject), sleepstudy)
get_Lb_ddf(fm1, lme4::fixef(fm1))
Kenward-Roger degree of freedom approximation
Satterthwaite degree of freedom approximation is a type of degree of freedom approximation which is used to estimate an “effective degrees of freedom” for a probability distribution formed from several independent normal distributions where only estimates of the variance are known. It was originally developed by statistician Franklin E. Satterthwaite.
Alejandra Gonzalez-Beltran
Hanna Cwiek
Philippe Rocca-Serra
satterthwaite
Satterthwaite, F. E. (1946), "An Approximate Distribution of Estimates of Variance Components.", Biometrics Bulletin, 2: 110–114, doi:10.2307/3002019
Welch-Satterthwaite
https://www.rdocumentation.org/packages/metRology/versions/0.9-23-2/topics/welch.satterthwaite
Satterthwaite degree of freedom approximation
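For the two-sample case, the Welch-Satterthwaite approximation can be written directly from the formula df ≈ (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1)]. A stdlib-only sketch:

```python
def welch_satterthwaite_df(s1_sq, n1, s2_sq, n2):
    """Welch-Satterthwaite approximation of the effective degrees of
    freedom for the difference of two sample means, given sample
    variances s1_sq, s2_sq and sample sizes n1, n2."""
    num = (s1_sq / n1 + s2_sq / n2) ** 2
    den = (s1_sq / n1) ** 2 / (n1 - 1) + (s2_sq / n2) ** 2 / (n2 - 1)
    return num / den

# Equal variances and equal sizes recover the pooled df = n1 + n2 - 2
df = welch_satterthwaite_df(4.0, 10, 4.0, 10)  # 18.0
```

With unequal variances or sizes the approximation falls between min(n1, n2) - 1 and n1 + n2 - 2.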
a data transformation to determine the number of degrees of freedom
Alejandra Gonzalez-Beltran
Hanna Cwiek
Philippe Rocca-Serra
between-within
https://www.ncbi.nlm.nih.gov/pubmed/25899170
between-within denominator degrees of freedom approximation
RR-BLUP is a data transformation used in the context of estimating breeding value using a Bayesian ridge regression. It can be obtained from the Bayes B procedure by setting the π parameter to zero and assuming that all the markers have the same variance.
term request by Guillaume Bauchet, cassavabase.org, Cornell University
Philippe Rocca-Serra
Comparison Between Linear and Non-parametric Regression Models for Genome-Enabled Prediction in Wheat
Paulino Pérez-Rodríguez, Daniel Gianola, Juan Manuel González-Camacho, José Crossa, Yann Manès and Susanne Dreisigacker
G3: GENES, GENOMES, GENETICS December 1, 2012 vol. 2 no. 12 1595-1605; https://doi.org/10.1534/g3.112.003665
RRBLUP
ridge regression best linear unbiased predictor
a data transformation which calculates predictions of breeding values using an animal model and a relationship matrix calculated from the genomic/genetic markers (G matrix), in contrast to using pedigree information as in BLUP, also known as ABLUP
Philippe Rocca-Serra
adapted from:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3382275/
and from:
https://www.rdocumentation.org/packages/pedigree/versions/1.4/topics/gblup
GBLUP
gblup(formula, data, M, lambda)
where:
formula: formula of the model, do not include the random effect due to animal (generally ID).
data: data.frame with columns corresponding to ID and the columns mentioned in the formula.
M: Matrix of marker genotypes, usually the count of one of the two SNP alleles at each markers (0, 1, or 2).
lambda : Variance ratio (σ2e/σ2a)
https://www.rdocumentation.org/packages/pedigree/versions/1.4/topics/gblup
genomic best linear unbiased prediction
a data transformation which calculates estimates of genomic estimated breeding values (GEBVs) using an animal or plant model utilizing trait-specific marker information.
Philippe Rocca-Serra
from:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0012648
TABLUP
trait-specific relationship matrix best linear unbiased prediction
Bayes A is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model and sets the prior probability π that a SNP has zero effect to zero (i.e. π = 0, so every SNP is assumed to contribute to the trait)
Philippe Rocca-Serra
A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2637029/
doi: 10.1186/1297-9686-41-2.
and:
Prediction of total genetic value using genome-wide dense marker maps.
Meuwissen TH, Hayes BJ, Goddard ME.
Genetics. 2001 Apr;157(4):1819-29.
PMID: 11290733
Bayes A
Bayes B is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model, sets the prior probability π that a SNP has zero effect to a fixed value (i.e. π > 0) and uses a mixture distribution.
Philippe Rocca-Serra
A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2637029/
doi: 10.1186/1297-9686-41-2.
and:
Prediction of total genetic value using genome-wide dense marker maps.
Meuwissen TH, Hayes BJ, Goddard ME.
Genetics. 2001 Apr;157(4):1819-29.
PMID: 11290733
Bayes B
the estimated breeding value of an organism is a data item computed to estimate the true breeding value, defined as the genetic merit of an organism, half of which will be passed on to its progeny. While the exact breeding value cannot be known, for performance traits it is possible to make good estimates. These estimates are called Estimated Breeding Values (EBVs). EBVs are expressed in the units of measurement for each particular trait. These estimates are outputs of various estimation methods which differ depending on the underlying assumptions (equal variance of marker effects, all markers contributing to the trait), the mathematical methods used (Bayesian or non-Bayesian) and the genetic inheritance models being considered (additive, dominant, epistatic) selected by the analysts.
Philippe Rocca-Serra
adapted from:
http://abri.une.edu.au/online/pages/understanding_ebvs_char.htm
EBV
estimated breeding value
An additive genetic model is a data item which refers to the contributions to the final phenotype from more than one gene, or from alleles of a single gene (in heterozygotes), that combine in such a way that the sum of their effects in unison is equal to the sum of their effects individually.
Philippe Rocca-Serra
additive genetic inheritance model
An additive dominant genetic model is a data item which refers to the contributions to the final phenotype from more than one gene, or from alleles of a single gene (in heterozygotes), that combine additively together with dominance effects (of alleles at a single locus).
Philippe Rocca-Serra
additive dominant genetic inheritance model
An additive dominant and epistatic genetic model is a data item which refers to the contributions to the final phenotype from more than one gene, or from alleles of a single gene (in heterozygotes), that combine additively together with dominance effects (of alleles at a single locus) and epistatic effects (of alleles at different loci).
Philippe Rocca-Serra
additive dominant genetic and epistatic inheritance model
Dunn's Multiple Comparison Test is a post hoc (i.e. run after an ANOVA) non-parametric test (a "distribution free" test that does not assume the data come from a particular distribution). It is one of the least powerful of the multiple comparison tests and can be a very conservative test, especially for larger numbers of comparisons. Dunn's test is an alternative to the Tukey test when you only want to test for differences in a small subset of all possible pairs; for larger numbers of pairwise comparisons, use Tukey's instead. Use Dunn's when you choose to test a specific number of comparisons before you run the ANOVA and when you are not comparing to controls. If you are comparing to a control group, use the Dunnett test instead.
Philippe Rocca-Serra
Dunn, O.J. (1961) Multiple comparisons among means. JASA, 56: 54-64
Dunn, Olive Jean (1964). "Multiple comparisons using rank sums". Technometrics. 6 (3): 241–252. doi:10.2307/1266041
and adapted from:
http://www.statisticshowto.com/dunns-test/
Dunn's test
## Default S3 method:
dunnTest(x, g, method = dunn.test::p.adjustment.methods[c(4, 2:3, 5:8, 1)],
         two.sided = TRUE, altp = two.sided, ...)
from:
http://www.rforge.net/doc/packages/FSA/dunnTest.html
Dunn’s multiple comparison test
The Conover-Iman test for stochastic dominance is a statistical test for multiple group comparisons which reports the results of multiple pairwise comparisons after a Kruskal-Wallis test for stochastic dominance among k groups (Kruskal and Wallis, 1952). The interpretation of stochastic dominance requires an assumption that the CDF of one group does not cross the CDF of the other.
The null hypothesis for each pairwise comparison is that the probability of observing a randomly selected value from the first group that is larger than a randomly selected value from the second group equals one half; this null hypothesis corresponds to that of the Wilcoxon-Mann-Whitney rank-sum test.
Like the rank-sum test, if the data can be assumed to be continuous, and the distributions are assumed identical except for a difference in location, the Conover-Iman test may be understood as a test for median difference. conover.test accounts for tied ranks.
The Conover-Iman test is strictly valid if and only if the corresponding Kruskal-Wallis null hypothesis is rejected.
Philippe Rocca-Serra
Conover, W. J. and Iman, R. L. (1979). On multiple-comparisons procedures. Technical Report LA-7677-MS, Los Alamos Scientific Laboratory.
posthoc.kruskal.conover.test(x, ...)
# S3 method for default
posthoc.kruskal.conover.test(x, g, p.adjust.method = p.adjust.methods, ...)
# S3 method for formula
posthoc.kruskal.conover.test(formula, data, subset, na.action,
                             p.adjust.method = p.adjust.methods, ...)
https://www.rdocumentation.org/packages/PMCMR/versions/4.2/topics/posthoc.kruskal.conover.test
conover.test makes k(k-1)/2 multiple pairwise comparisons based on Conover-Iman t-test-statistic of the rank differences.
Conover-Iman test of multiple comparisons using rank sums
application to breeding value estimation and genomic selection
https://www.ncbi.nlm.nih.gov/pubmed/20122298
Bayesian LASSO is a data transformation in which the regression parameters have independent Laplace (i.e., double-exponential) priors, allowing the Lasso estimates of linear regression parameters to be interpreted as Bayesian posterior mode estimates within a Bayesian framework.
Philippe Rocca-Serra
https://www.tandfonline.com/doi/abs/10.1198/016214508000000337
Bayes LASSO
Bayesian least absolute shrinkage and selection operator
a genotype matrix is a kind of genomic relationship matrix in its rawest form; it simply corresponds to a matrix of individuals' genotypes for a given set of markers or genomic positions. Columns are SNPs or markers; rows are individuals. Each cell contains a genotype expressed, if the genome is diploid, as a pair of characters chosen from A, T, G, C, where the dominant variant is uppercase and the recessive variant is lowercase.
Philippe Rocca-Serra
http://articles.extension.org/pages/68019/genomic-relationships-and-gblup
genotype matrix
the MAF matrix is a genomic relationship matrix which is obtained from the genotype matrix by counting the number of minor alleles at each locus.
Philippe Rocca-Serra
http://articles.extension.org/pages/68019/genomic-relationships-and-gblup
MAF matrix
gene content matrix
matrix of minor allele count
MAF matrix
the M matrix is a genomic relationship matrix which is obtained by subtracting 1 from every value of the MAF matrix (gene content matrix). The values of the M matrix are only -1, 0 or 1, which makes computation easier.
M = MAF-1
Philippe Rocca-Serra
http://articles.extension.org/pages/68019/genomic-relationships-and-gblup
Efficient Methods to Compute Genomic Predictions. J. Dairy Sci. 91:4414-4423. 2008
P. M. VanRaden.
10.3168/jds.2007-0980
deviation of 1 from the gene content matrix
M = MAF - 1
M matrix
P matrix is a kind of genomic relationship matrix which contains allele frequencies expressed as a difference from 0.5 and multiplied by 2.
Philippe Rocca-Serra
http://articles.extension.org/pages/68019/genomic-relationships-and-gblup
Efficient Methods to Compute Genomic Predictions. J. Dairy Sci. 91:4414-4423. 2008
P. M. VanRaden.
10.3168/jds.2007-0980
P matrix
the Z matrix is a genomic relationship matrix which is obtained by subtracting the P matrix from the M matrix. It is also known as the incidence matrix for the markers.
Philippe Rocca-Serra
http://articles.extension.org/pages/68019/genomic-relationships-and-gblup
Efficient Methods to Compute Genomic Predictions. J. Dairy Sci. 91:4414-4423. 2008
P. M. VanRaden.
10.3168/jds.2007-0980
incidence matrix for genotyping markers
Z matrix
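The MAF → M → P → Z pipeline described in the entries above can be sketched in numpy with a toy minor-allele-count matrix (all values illustrative). A useful check is that the columns of Z are centred, i.e. average to zero:

```python
import numpy as np

# Toy gene content (MAF) matrix: rows = individuals, columns = markers,
# entries are minor-allele counts in {0, 1, 2}.
MAF = np.array([[0, 1, 2],
                [1, 1, 0],
                [2, 0, 1],
                [1, 0, 1]], dtype=float)

# M matrix: subtract 1 from every entry, so values are -1, 0 or 1.
M = MAF - 1

# P matrix: allele frequencies expressed as a difference from 0.5 and
# multiplied by 2, replicated down each column.
p = MAF.mean(axis=0) / 2.0            # allele frequency per marker
P = np.tile(2.0 * (p - 0.5), (MAF.shape[0], 1))

# Z matrix (marker incidence matrix): Z = M - P.
Z = M - P
```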
The degree of freedom numerator is the number of degrees of freedom that the estimate of variance used in the numerator is based on. It is one of the parameters for the F-distribution used to compute probabilities in analysis of variance.
term request:
https://github.com/ISA-tools/stato/issues/71
Hanna Cwiek
Philippe Rocca-Serra
df1
num df
numerator degrees of freedom
augmented design is a kind of experimental design in which the goal is to compare existing (control) treatments with new treatments subject to an experimental constraint of "limited replication". To understand limited replication, consider experiments that may only allow a single representation of each new treatment; this limitation is often due to the cost associated with the experiment, limited resources, or a limited number of new units that can be used in the experiment. In contrast, the existing treatments are referred to as checks and are generally replicated multiple times. With an augmented design one can estimate the following:
a) Differences between checks and new treatments,
b) Differences among new treatments,
c) Differences among check treatments, and
d) Differences among new and check treatments combined.
Philippe Rocca-Serra
Federer, W.T. 1956. Augmented (or hoonuiaku) designs. Hawaiian Planters’ Record LV(2): 191–208)
http://rna.genomics.purdue.edu/
an example of dataset representing an augmented design:
https://www.rdocumentation.org/packages/sommer/versions/3.2/topics/augment
augmented experimental design
a probability distribution location parameter is a data item which is set by the operator when selecting a parametric probability distribution and which shifts the location of the distribution without changing its shape or scale.
https://github.com/ISA-tools/stato/issues/50
Philippe Rocca-Serra
adapted from:
https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#shifting-and-scaling
norm.stats(loc=3, scale=4, moments="mv"), where loc is the location parameter indicating how much shift is applied
probability distribution location parameter
the Weibull probability distribution is a continuous probability distribution which is used to model time to failure, time to repair and material strength in materials science. In biomedicine, the Weibull probability distribution is used in determining 'hazard functions'.
The 'location parameter' of the Weibull probability distribution can be used to define a failure-free zone.
If the quantity X is a "time-to-failure", the Weibull distribution gives a distribution for which the failure rate is proportional to a power of time. The shape parameter, k, is that power plus one, and so this parameter can be interpreted directly as follows:
A value of k < 1 indicates that the failure rate decreases over time. This happens if there is significant "infant mortality", or defective items failing early, with the failure rate decreasing over time as the defective items are weeded out of the population. In the context of the diffusion of innovations, this means negative word of mouth: the hazard function is a monotonically decreasing function of the proportion of adopters;
A value of k = 1 indicates that the failure rate is constant over time. This might suggest random external events are causing mortality, or failure. The Weibull distribution reduces to an exponential distribution;
A value of k > 1 indicates that the failure rate increases with time. This happens if there is an "aging" process, or parts that are more likely to fail as time goes on. In the context of the diffusion of innovations, this means positive word of mouth: the hazard function is a monotonically increasing function of the proportion of adopters. The function is first concave, then convex with an inflexion point at (e^{1/k} - 1)/e^{1/k}, k > 1.
Philippe Rocca-Serra
Weibull distribution
adapted from :
https://en.wikipedia.org/wiki/Weibull_distribution
and from
http://www.engineeredsoftware.com/nasa/weibull.htm
Weibull probability distribution
pweibull(q, shape, scale = 1, lower.tail = TRUE, log.p = FALSE)
from:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Weibull.html
scipy.stats.weibull_min
from:
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.weibull_min.html#scipy.stats.weibull_min
Weibull probability distribution
statistical sampling is a planned process which aims at assembling a population of observation units (samples) in as unbiased a manner as possible in order to obtain or infer information about the actual population from which these samples have been drawn.
Philippe Rocca-Serra
STATO
statistical sampling
Simple random sampling is a statistical sampling process which creates a sample of size n entirely by chance. In such a process, each unit has the same probability of being selected. Depending on the size of the population being sampled, the sampling process may be done with or without replacement.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Simple_random_sample
random sampling
srswor(n,N)
from
https://www.rdocumentation.org/packages/sampling/versions/2.8/topics/srswor
[simple random sampling without replacement]
Sampling R package
srswr(n,N)
from:
https://www.rdocumentation.org/packages/sampling/versions/2.8/topics/srswr
[simple random sampling with replacement]
Sampling R package
simple random sampling
It is a sampling process used, among other things, in ecological studies when studying how things change in a given environment.
line intercept sampling is a sampling process by which an element in a spatial region is included in a sample if it is intersected by a line chosen by the operator.
Philippe Rocca-Serra
LIS
LIS sampling
Lee Kaiser, Biometrics, Vol. 39, No. 4 (Dec., 1983), pp. 965-976
http://www.jstor.org/stable/2531331
line intercept sampling
line intercept sampling
Quadrat sampling is a classic tool for the study of ecology, especially biodiversity. In general, a series of squares (quadrats) of a set size are placed in a habitat of interest and the species within those quadrats are identified and recorded. Passive quadrat sampling (done without removing the organisms found within the quadrat) can be either done by hand, with researchers carefully sorting through each individual quadrat or, more efficiently, can be done by taking a photograph of the quadrat for future analysis.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
http://www.coml.org/investigating/observing/quadrat_sampling.html
quadrat sampling
Cluster sampling is a sampling plan used when mutually homogeneous yet internally heterogeneous groupings are evident in a statistical population.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Cluster_sampling
cluster(data, clustername, size, method=c("srswor","srswr","poisson",
"systematic"),pik,description=FALSE)
from:
https://www.rdocumentation.org/packages/sampling/versions/2.8/topics/cluster
Sampling R package
cluster sampling
Probability proportional to size ('PPS') sampling is a sampling method in which the selection probability for each element is set to be proportional to its size measure, up to a maximum of 1. In a simple PPS design, these selection probabilities can then be used as the basis for Poisson sampling. However, this has the drawback of variable sample size, and different portions of the population may still be over- or under-represented due to chance variation in selections.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Sampling_(statistics)#Probability-proportional-to-size_sampling
PPS sampling
probability-proportional-to-size sampling
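As an illustration of the PPS idea described above, here is a minimal Python sketch in which size-proportional inclusion probabilities feed a simple Poisson sampling design; the size measures and target sample size are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical size measures for six population units
sizes = np.array([10, 40, 5, 25, 15, 5], dtype=float)
n = 2  # target (expected) sample size

# Selection probability proportional to size, capped at 1
pi = np.minimum(n * sizes / sizes.sum(), 1.0)

# Poisson sampling: include each unit independently with probability pi
included = rng.random(len(sizes)) < pi
sample = np.flatnonzero(included)
```

Because Poisson sampling draws each unit independently, the realised sample size varies from draw to draw, which is exactly the drawback the definition mentions.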
stratified sampling is a statistical sampling method which divides the population into homogeneous subpopulations, which are then sampled using random or systematic sampling methods.
Philippe Rocca-Serra
adapted from wikipedia:
https://en.wikipedia.org/wiki/Stratified_sampling
stratified sampling
strata(data, stratanames=NULL, size, method=c("srswor","srswr","poisson",
"systematic"), pik,description=FALSE)
From:
https://www.rdocumentation.org/packages/sampling/versions/2.8/topics/strata
R Sampling Package
stratified sampling
systematic sampling is a process for collecting samples and assembling a statistical sample using a system or method (e.g. unequal probabilities, without replacement, fixed sample size), as opposed to random sampling.
Philippe Rocca-Serra
Madow, W.G. (1949), On the theory of systematic sampling, II, Annals of Mathematical Statistics, 20, 333-354.
from:
https://www.rdocumentation.org/packages/sampling/versions/2.8/topics/UPsystematic
Sampling R package
systematic sampling
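A minimal Python sketch of the classic fixed-interval variant of systematic sampling (every k-th unit after a random start); the population and sample size are hypothetical:

```python
import random


def systematic_sample(population, n):
    """Draw a systematic sample of size n: pick a random start,
    then take every k-th unit, where k = len(population) // n."""
    k = len(population) // n
    start = random.randrange(k)
    return population[start::k][:n]


random.seed(1)
units = list(range(100))
sample = systematic_sample(units, 10)
```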
Quota sampling is a method for selecting survey participants that is a non-probabilistic version of stratified sampling.
In quota sampling, a population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60. This means that individuals can put a demand on who they want to sample (targeting).
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Quota_sampling
quota sampling
Panel sampling is the method of first selecting a group of participants through a random sampling method and then asking that group for (potentially the same) information several times over a period of time. Therefore, each participant is interviewed at two or more time points; each period of data collection is called a "wave". The method was developed by sociologist Paul Lazarsfeld in 1938 as a means of studying political campaigns.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Sampling_(statistics)#Panel_sampling
panel sampling
Snowball sampling (or chain sampling, chain-referral sampling, referral sampling) is a non-probability sampling technique where existing study subjects recruit future subjects from among their acquaintances. Thus the sample group is said to grow like a rolling snowball.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
from:
https://en.wikipedia.org/wiki/Snowball_sampling
chain sampling
referral sampling
snowball sampling
chain-referral sampling
The voluntary sampling method is a type of non-probability sampling. A voluntary sample is made up of people who self-select into the survey. Often, these subjects have a strong interest in the main topic of the survey. Volunteers may be invited through advertisements on social media sites.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Sampling_(statistics)#Voluntary_Sampling
voluntary sampling
Convenience sampling (also known as grab sampling, accidental sampling, or opportunity sampling) is a type of non-probability sampling that involves the sample being drawn from that part of the population that is close to hand. This type of sampling is most useful for pilot testing.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
from wikipedia:
https://en.wikipedia.org/wiki/Convenience_sampling
accidental sampling
grab sampling
opportunity sampling
convenience sampling
Brewer's sampling is a statistical sampling method which was proposed by Brewer in 1975 and uses an unequal probability sampling technique.
Philippe Rocca-Serra
Brewer, Kenneth RW (1975). A Simple Procedure For Sampling πpswor1. Australian Journal of Statistics, 17(3), 166-172.
UPbrewer(pik,eps=1e-06)
from:
https://www.rdocumentation.org/packages/sampling/versions/2.8/topics/UPbrewer
Sampling R package
Brewer sampling
In imbalanced datasets, where the sampling ratio does not follow the population statistics, one can resample the dataset in a conservative manner called minimax sampling. Minimax sampling has its origin in Anderson's minimax ratio, whose value is proved to be 0.5: in a binary classification, the class-sample sizes should be chosen equally. This ratio can be proved to be the minimax ratio only under the assumption of an LDA classifier with Gaussian distributions. The notion of minimax sampling has recently been developed for a general class of classification rules, called class-wise smart classifiers.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Sampling_(statistics)#Minimax_sampling
minimax sampling
complete randomization is a group randomization where experimental units are randomly assigned to the entire set of groups defined by the experimental treatments.
Philippe Rocca-Serra
STATO
crPar(N, K = 2, ratio = rep(1, K), groups = LETTERS[1:K])
from:
https://www.rdocumentation.org/packages/randomizeR/versions/1.4/topics/crPar
complete randomization
Data imputation is a data transformation process whereby missing data is replaced with an estimated value for the missing element. The substituted values are intended to create a data record that does not fail edits. Various methods may be used to produce these substituted values.
Philippe Rocca-Serra
adapted from wikipedia and from the OECD glossary of statistical terms
https://stats.oecd.org/glossary/detail.asp?ID=3406
data imputation
last observation carried forward data imputation is a type of data imputation which uses a very simple, self-explanatory method for substituting a missing value for an observation. It should be noted that this method gives a biased estimate of the treatment effect and underestimates the variability of the estimated result, and should be used cautiously.
Philippe Rocca-Serra
adapted from Wikipedia:
https://en.wikipedia.org/wiki/Analysis_of_clinical_trials#Last_observation_carried_forward
last observation carried forward data imputation
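A minimal Python sketch of the carry-forward rule, using None to mark missing values; the series is hypothetical:

```python
def locf(values):
    """Last observation carried forward: replace each missing entry (None)
    with the most recent observed value; leading missing entries stay missing."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


series = [2.1, None, None, 3.4, None, 5.0]
imputed = locf(series)  # [2.1, 2.1, 2.1, 3.4, 3.4, 5.0]
```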
regression data imputation is a type of data imputation where missing values are replaced with values predicted by a regression model fitted to the observed data.
Philippe Rocca-Serra
regression data imputation
substitution by the mean data imputation is a type of data imputation where missing values are replaced with the value of the variable mean.
Philippe Rocca-Serra
substitution by the mean data imputation
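A minimal Python sketch of mean substitution on a hypothetical series, with None marking missing values:

```python
def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


data = [4.0, None, 6.0, 8.0, None]
imputed = mean_impute(data)  # mean of observed values is 6.0
```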
https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
multivariate imputation with chained equations (MICE) is a type of data imputation which uses an algorithm devised by Stef van Buuren and Karin Groothuis-Oudshoorn.
Philippe Rocca-Serra
MICE: Multivariate Imputation by Chained Equations in R by Stef van Buuren and Karin Groothuis-Oudshoorn. Journal of Statistical Software,
http://www.stefvanbuuren.nl/publications/mice%20in%20r%20-%20draft.pdf
MICE
library(mice)
miceMod <- mice(BostonHousing[, !names(BostonHousing) %in% "medv"], method="rf") # perform mice imputation, based on random forests.
multivariate imputation with chained equations
k-nearest neighbour imputation is a data imputation which uses the k-nearest neighbour algorithm to compute a substitution value for the missing values. For every observation to be imputed, it identifies the 'k' closest observations based on the Euclidean distance and computes the weighted average (weighted based on distance) of these 'k' observations.
Philippe Rocca-Serra
adapted from:
http://r-statistics.co/Missing-Value-Treatment-With-R.html
kNN data imputation
library(DMwR)
knnOutput <- knnImputation(BostonHousing[, !names(BostonHousing) %in% "medv"]) # perform knn imputation.
anyNA(knnOutput)
from:
http://r-statistics.co/Missing-Value-Treatment-With-R.html
k-nearest neighbour data imputation
Matthews Correlation Coefficient (or MCC) is a correlation coefficient which is a measure of the quality of binary (two-class) classifications, introduced by biochemist Brian W. Matthews in 1975.
Philippe Rocca-Serra
adapted from wikipedia:
https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
and from:
https://doi.org/10.1016/0005-2795(75)90109-9
MCC
mcc(preds = NULL, actuals = NULL, TP = NULL, FP = NULL, TN = NULL, FN = NULL)
from:
https://www.rdocumentation.org/packages/mltools/versions/0.3.4/topics/mcc
Matthews correlation coefficient
a covariance matrix is a square matrix that contains the variances and covariances associated with several variables. The diagonal elements of the matrix contain the variances of the variables and the off-diagonal elements contain the covariances between all possible pairs of variables.
Philippe Rocca-Serra
dispersion matrix
variance-covariance matrix
covariance matrix
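A small NumPy illustration of the structure described above (variances on the diagonal, covariances off-diagonal), using hypothetical data for three variables:

```python
import numpy as np

# Three variables observed on five samples (rows = variables)
x = np.array([[2.0, 4.0, 6.0, 8.0, 10.0],
              [1.0, 3.0, 2.0, 5.0, 4.0],
              [9.0, 7.0, 8.0, 5.0, 6.0]])

cov = np.cov(x)           # 3x3 variance-covariance matrix
variances = np.diag(cov)  # the diagonal holds the variances
```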
https://www.r-bloggers.com/setup-up-the-inverse-of-additive-relationship-matrix-in-r/
the numerator relationship matrix is the matrix of *expected* additive genetic relationships between individuals. This matrix was originally used by Henderson (Henderson, C.R. 1976. A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics 32:69-83.) to account for covariances between random effects, and therefore to use information from relatives in estimation of breeding value. Among the properties of the NRM matrix (also known as the A matrix): it is symmetric, and the diagonal values correspond to 1 + the inbreeding coefficient of an individual.
Philippe Rocca-Serra
A matrix
adapted from:
https://jvanderw.une.edu.au/Genetic_properties_of_the_animal_model.pdf
and from:
Henderson, C.R. 1976. A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics 32:69-83.
https://doi.org/10.2307/2529339
NRM
https://rdrr.io/cran/sommer/man/A.mat.html
numerator relationship matrix
The degree of freedom denominator is the number of degrees of freedom that the estimate of variance used in the denominator is based on. It is one of the parameters for the F-distribution used to compute probabilities in analysis of variance.
term request:
https://github.com/ISA-tools/stato/issues/71
Hanna Cwiek
Philippe Rocca-Serra
den df
df2
denominator degrees of freedom
A matrix of relationships among a group of individuals, which can be used to predict breeding values, to manage inbreeding and in genetic conservation. It can be calculated from the pedigree, but it is also possible to calculate the relationship matrix from genotypes at genetic markers such as single-nucleotide polymorphisms (SNPs). Elements of the genomic relationship matrix are estimates of the realized proportion of the genome that two individuals share, whereas the pedigree-derived relationship matrix is the expectation of this proportion.
Philippe Rocca-Serra
https://doi.org/10.3168/jds.2007-0980
https://www.ncbi.nlm.nih.gov/pubmed/22059574
realized genomic relationship matrix
relationship matrix
https://cran.r-project.org/web/packages/snpReady/snpReady.pdf
G matrix
a scaled t distribution is a kind of Student's t distribution which is shifted by 'mean' and scaled by standard deviation 'sd'.
Philippe Rocca-Serra
R documentation
t.scaled(x, df, mean = 0, sd = 1, ncp, log = FALSE)
from:
https://www.rdocumentation.org/packages/metRology/versions/0.9-23-2/topics/Scaled%20t%20distribution
scaled t distribution
a Bayesian model is a statistical model where inference is based on using Bayes theorem to obtain a posterior distribution for a quantity (or quantities) of interest for some model (such as parameter values) based on some prior distribution for the relevant unknown parameters and the likelihood from the model.
Philippe Rocca-Serra
adapted from several sources:
Oxford Dictionary of Statistics: 10.1093/acref/9780199541454.001.0001
http://www.scholarpedia.org/article/Bayesian_statistics
https://stats.stackexchange.com/questions/129017/what-exactly-is-a-bayesian-model
Bayesian model
a prior probability distribution is a probability distribution used as input to a Bayesian model to represent a priori knowledge about a model parameter. Along with the acquired/observed data, it is used to compute a posterior distribution according to the Bayes theorem.
Philippe Rocca-Serra
Oxford Dictionary of Statistics: 10.1093/acref/9780199541454.001.0001
prior probability distribution
a posterior probability distribution is a probability distribution computed in a Bayesian model approach given a prior distribution and a set of events/observations.
Philippe Rocca-Serra
Oxford Dictionary of Statistics: 10.1093/acref/9780199541454.001.0001
posterior probability distribution
Bayes C pi is a data transformation used to compute estimated breeding values using a Bayesian model and which assesses the SNP effect using Markov chain Monte Carlo methods. Bayes C pi treats the prior probability π that a SNP has zero effect as unknown.
The method was devised to address shortcomings of the Bayes A and Bayes B approaches.
Philippe Rocca-Serra
adapted from:
BMC Bioinformatics. 2011 May 23;12:186. doi: 10.1186/1471-2105-12-186.
Extension of the bayesian alphabet for genomic selection.
Habier D, Fernando RL, Kizilkaya K, Garrick DJ.
but also
https://cran.r-project.org/web/packages/gdmp/gdmp.pdf
and
https://jvanderw.une.edu.au/RFSlides.pdf
Bayes C pi
Bayes C pi
genetic inheritance model is a data item defining the assumptions that a breeding value estimation method takes into account when running its calculations.
Philippe Rocca-Serra
STATO
genetic inheritance model
sampling from a probability distribution is a data transformation which aims at obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult.
Philippe Rocca-Serra
STATO
sampling from a probability distribution
Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution, when direct sampling is difficult.
Philippe Rocca-Serra
adapted from wikipedia:
https://en.wikipedia.org/wiki/Gibbs_sampling
Geman, S.; Geman, D. (1984). "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images". IEEE Transactions on Pattern Analysis and Machine Intelligence. 6 (6): 721–741. doi:10.1109/TPAMI.1984.4767596
Gibbs sampling
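As a sketch of the alternating-conditionals idea, the Python code below samples a standard bivariate normal with correlation rho, whose full conditionals are the univariate normals x | y ~ N(rho*y, 1 - rho^2) and y | x ~ N(rho*x, 1 - rho^2); all parameter values (rho, sample counts, burn-in) are illustrative:

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho,
    alternating draws from the two full conditionals."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho ** 2)  # conditional standard deviation
    x = y = 0.0
    draws = []
    for i in range(n_samples + burn_in):
        x = rng.gauss(rho * y, sd)  # draw x | y
        y = rng.gauss(rho * x, sd)  # draw y | x
        if i >= burn_in:
            draws.append((x, y))
    return draws


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
```

The empirical correlation of the draws should be close to the target rho once the burn-in is discarded.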
the Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult.
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm
Hastings, W.K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications". Biometrika. 57 (1): 97–109. doi:10.1093/biomet/57.1.97
Metropolis–Hastings sampling
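A minimal random-walk Metropolis-Hastings sketch in Python, targeting a standard normal through its log-density; the step size and sample count are illustrative:

```python
import math
import random


def metropolis_hastings(log_target, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings: propose x' = x + N(0, step^2)
    and accept with probability min(1, target(x') / target(x))."""
    rng = random.Random(seed)
    x = 0.0
    draws = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        delta = log_target(proposal) - log_target(x)
        if delta >= 0 or rng.random() < math.exp(delta):
            x = proposal  # accept; otherwise the chain repeats the state
        draws.append(x)
    return draws


# Target: standard normal density, known only up to a constant
samples = metropolis_hastings(lambda v: -0.5 * v * v, n_samples=20000)
```

Only the ratio of target densities is needed, which is why the method works when the normalising constant of the distribution is unknown.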
a continuous multivariate probability distribution is a continuous probability distribution which describes the possible values, and corresponding probabilities, of two or more (usually three or more) associated random variables.
Philippe Rocca-Serra
http://www.oxfordreference.com/view/10.1093/acref/9780199541454.001.0001/acref-9780199541454-e-1095?rskey=355sBu&result=1
continuous multivariate probability distribution
a discrete multivariate probability distribution is a discrete probability distribution which describes the possible values, and corresponding probabilities, of two or more (usually three or more) associated random variables.
Philippe Rocca-Serra
http://www.oxfordreference.com/view/10.1093/acref/9780199541454.001.0001/acref-9780199541454-e-1095?rskey=355sBu&result=1
discrete multivariate probability distribution
A data transformation that produces a reproducing kernel Hilbert space (or RKHS), which is a Hilbert space of functions in which point evaluation is a continuous linear functional.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space
RKHS
https://www.rdocumentation.org/packages/KGode/versions/1.0.1/topics/rkhs
rkhs
reproducing kernel Hilbert space procedure
a state space model is a kind of statistical model which describes the probabilistic dependence between the latent state variable and the observed measurement. The state or the measurement can be either continuous or discrete. The term "state space" originated in the 1960s in the area of control engineering (Kalman, 1960). SSM provides a general framework for analyzing deterministic and stochastic dynamical systems that are measured or observed through a stochastic process.
Philippe Rocca-Serra
http://www.scholarpedia.org/article/State_space_model
SSM
state space model
genomic estimated breeding value (GEBV) is an estimated breeding value derived from information in an organism's DNA (genotype). GEBV is calculated differently from conventional estimated breeding values, using advanced modelling techniques to deal with high-dimensionality data.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
adapted from:
https://businesswales.gov.wales/farmingconnect/posts/genomic-breeding-values
GEBV
genomic estimated breeding value
In a planned experiment where the covariance (genotype x environment) can be controlled and held at 0, the heritability is defined as the ratio of the variance of the genotypic values to the variance of the phenotypic values.
H2 = Var(G)/Var(P)
H2 is the broad-sense heritability. This reflects all the genetic contributions to a population's phenotypic variance including additive, dominant, and epistatic (multi-genic interactions), as well as maternal and paternal effects, where individuals are directly affected by their parents' phenotype, for example, milk production in mammals.
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Heritability
H2
broad sense heritability
heritability
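The ratio H2 = Var(G)/Var(P) can be illustrated with simulated data in Python, assuming independent genotypic and environmental contributions to the phenotype; all variance values are hypothetical:

```python
import random
import statistics

random.seed(7)
# Simulated genotypic values and independent environmental deviations
g = [random.gauss(0.0, 2.0) for _ in range(10000)]  # Var(G) = 4
e = [random.gauss(0.0, 1.0) for _ in range(10000)]  # Var(E) = 1
p = [gi + ei for gi, ei in zip(g, e)]               # phenotype P = G + E

# Broad-sense heritability: H2 = Var(G) / Var(P); expected 4/5 = 0.8
H2 = statistics.variance(g) / statistics.variance(p)
```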
A particularly important component of the genetic variance is the additive variance, Var(A), which is the variance due to the average effects (additive effects) of the alleles. Since each parent passes a single allele per locus to each offspring, parent-offspring resemblance depends upon the average effect of single alleles. Additive variance represents, therefore, the genetic component of variance responsible for parent-offspring resemblance. The additive genetic portion of the phenotypic variance is known as Narrow-sense heritability and is defined as:
h2 = Var(A)/Var(P)
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Heritability
h2
narrow sense heritability
Bayes R is a data transformation used in the context of estimating breeding value, which relies on a Bayesian model to compute 'genomic estimated breeding values'. In contrast to Bayes B methods, the new method assumes that the true SNP effects are derived from a series of normal distributions, the first with zero variance, up to one with a variance of approximately 1% of the genetic variance.
Philippe Rocca-Serra
Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, et al. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci. 2012;95:4114–29
doi: 10.3168/jds.2011-5019.
https://www.ncbi.nlm.nih.gov/pubmed/22720968
Bayes R
0
The double exponential distribution (a.k.a. Laplace distribution) is the distribution of differences between two independent variates with identical exponential distributions (Abramowitz and Stegun 1972, p. 930).
Philippe Rocca-Serra
http://mathworld.wolfram.com/LaplaceDistribution.html
double exponential probability distribution
dLaplace(x, mu = 0, b = 1, params = list(mu, b), ...)
https://www.rdocumentation.org/packages/ExtDist/versions/0.6-3/topics/Laplace
https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.laplace.html
Laplace probability distribution
Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution.
Philippe Rocca-Serra
adapted from wikipedia:
https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
bootstrap
sampling distribution estimation by bootstrapping
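A minimal Python sketch of the resampling idea: the standard error of the sample mean is estimated from the spread of the statistic across bootstrap replicates drawn with replacement; the data values are hypothetical:

```python
import random
import statistics


def bootstrap_se(data, stat=statistics.mean, n_boot=2000, seed=0):
    """Estimate the standard error of a statistic by resampling the
    observed data with replacement and measuring the spread of the
    statistic across the bootstrap replicates."""
    rng = random.Random(seed)
    reps = [stat([rng.choice(data) for _ in data]) for _ in range(n_boot)]
    return statistics.stdev(reps)


data = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 4.7]
se_mean = bootstrap_se(data)
```

For the mean, the bootstrap estimate should land close to the analytic value s/sqrt(n).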
random forest procedure is a type of data transformation used in classification and statistical learning using regression. The random forest procedure is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset (it operates by constructing a multitude of decision trees at training time) and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default). The random forest procedure outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Philippe Rocca-Serra
adapted from:
https://en.wikipedia.org/wiki/Random_forest
and http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
random forest
# S3 method for formula
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
# S3 method for default
randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
mtry=if (!is.null(y) && !is.factor(y))
max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
replace=TRUE, classwt=NULL, cutoff, strata,
sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
maxnodes = NULL,
importance=FALSE, localImp=FALSE, nPerm=1,
proximity, oob.prox=proximity,
norm.votes=TRUE, do.trace=FALSE,
keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
keep.inbag=FALSE, ...)
# S3 method for randomForest
print(x, ...)
from https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest
sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
from:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
sklearn.ensemble.RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)
from:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
random forest procedure
log likelihood is a data item which corresponds to the natural logarithm of the likelihood.
log likelihood is a data item commonly used to provide a measure of accuracy of a model.
Philippe Rocca-Serra
adapted from wikipedia
logLik(object, ...)
## S3 method for class 'lm'
logLik(object, REML = FALSE, ...)
from:
https://stat.ethz.ch/R-manual/R-patched/library/stats/html/logLik.html
log likelihood
A data transformation process in which the Holm p-value procedure is applied with the aim of correcting the false discovery rate.
Philippe Rocca-Serra
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70. http://www.jstor.org/stable/4615733.
Holm fdr
p.adjust(p, method = "holm", n = length(p))
from:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/p.adjust.html
Holm false discovery rate correction
A data transformation process in which the Hommel p-value procedure is applied with the aim of correcting the false discovery rate.
Philippe Rocca-Serra
Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi: 10.2307/2336190
Hommel fdr
p.adjust(p, method = "hommel", n = length(p))
from:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/p.adjust.html
Hommel false discovery rate correction
number of cross-validation segments is a count which is used as an input parameter in a cross-validation procedure to evaluate a statistical model.
Philippe Rocca-Serra
number of cross-validation segments
number of predictive components is a count used as input to the principal component analysis (PCA).
Philippe Rocca-Serra
number of predictive components
number of orthogonal components is a count used as input to the orthogonal partial least squares discriminant analysis (OPLS-DA).
Philippe Rocca-Serra
number of orthogonal components
A statistical model term testing is a data transformation that accounts for the evaluation of a component of a statistical model or model term.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
STATO
statistical model term testing
the Wald test is a statistical test which computes a Wald chi-squared statistic for 1 or more coefficients, given their variance-covariance matrix.
The Wald test (also called the Wald Chi-Squared Test) is a way to find out if explanatory variables in a model are significant. “Significant” means that they add something to the model; variables that add nothing can be deleted without affecting the model in any meaningful way
Philippe Rocca-Serra
Wald test for a term in a regression model:
regTermTest(model, test.terms, null=NULL,df=Inf, method=c("Wald","LRT"))
from:
http://r-survey.r-forge.r-project.org/survey/html/regTermTest.html
wald.test(Sigma, b, Terms = NULL, L = NULL, H0 = NULL, df = NULL, verbose = FALSE)
from:
https://www.rdocumentation.org/packages/aod/versions/1.3/topics/wald.test
Wald test
the Rao-Scott test is a statistical test which tests the hypothesis that all coefficients associated with a particular regression term are zero (or have some other specified values). The LRT uses a linear combination of chi-squared distributions.
Philippe Rocca-Serra
Lagrange multiplier test
Rao-Scott test
Rao, JNK, Scott, AJ (1984) "On Chi-squared Tests For Multiway Contingency Tables with Proportions Estimated From Survey Data" Annals of Statistics 12:46-60.
Rao score test
regTermTest(model, test.terms, null=NULL,df=Inf, method=c("Wald","LRT"))
from:
http://r-survey.r-forge.r-project.org/survey/html/regTermTest.html
Rao's score test
the frequency (i.e., the proportion) of possible confidence intervals that contain the true value of their corresponding parameter. In other words, if confidence intervals are constructed using a given confidence level in an infinite number of independent experiments, the proportion of those intervals that contain the true value of the parameter will match the confidence level.
A probability measure of the reliability of an inferential statistical test that has been applied to sample data and which is provided along with the confidence interval for the output statistic.
Philippe Rocca-Serra
adapted from wikipedia:
https://en.wikipedia.org/wiki/Confidence_interval
and from:
http://www.oxfordreference.com/view/10.1093/acref/9780191792236.001.0001/acref-9780191792236-e-103?rskey=tQCMI6&result=1
confidence level
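The coverage interpretation above can be illustrated by simulation in Python: construct many nominal 95% intervals for a known mean and count how often they contain it; all parameter values below are hypothetical:

```python
import random
import statistics

random.seed(3)
mu, sigma, n, z = 10.0, 2.0, 30, 1.96  # z critical value for a 95% interval
trials = 2000
covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    m = statistics.mean(sample)
    half = z * statistics.stdev(sample) / n ** 0.5  # half-width of the CI
    covered += (m - half) <= mu <= (m + half)
coverage = covered / trials  # should be close to the 0.95 confidence level
```

Using z = 1.96 with the sample standard deviation gives slightly less than nominal coverage at n = 30; the t critical value would be exact.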
http://davidmlane.com/hyperstat/A121160.html
It is a measure of how precise an estimate of the statistical parameter is.
Standard error is the estimated standard deviation of an estimate. It measures the uncertainty associated with the estimate. Compared with the standard deviations of the underlying distribution, which are usually unknown, standard errors can be calculated from observed data.
Philippe Rocca-Serra
adapted from wikipedia and from SAGE research method article
http://methods.sagepub.com/reference/encyc-of-research-design/n435.xml
standard error of estimate
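For the common case of the standard error of a sample mean, the estimate is s/sqrt(n), calculable from observed data as the definition notes; the observations below are hypothetical:

```python
# Standard error of the mean: the estimated standard deviation of the
# sample mean, computed from observed data as s / sqrt(n).
import math

data = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9]  # hypothetical observations
n = len(data)
mean = sum(data) / n
# sample standard deviation (n - 1 denominator)
s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
se = s / math.sqrt(n)  # uncertainty of the mean as an estimate
```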
Biplots are a type of exploratory graph used in statistics, a generalization of the simple two-variable scatterplot. A biplot is constructed by using the singular value decomposition (SVD) to obtain a low-rank approximation to a transformed version of the data matrix X, whose n rows are the samples (also called the cases, or objects), and whose p columns are the variables. The biplot was introduced by K. Ruben Gabriel (1971).
Philippe Rocca-Serra
Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika, 58(3), 453–467.
adapted from:
https://en.wikipedia.org/wiki/Biplot
Last accessed: 04/07/2018
biplot(x, y, var.axes = TRUE, col, cex = rep(par("cex"), 2),
xlabs = NULL, ylabs = NULL, expand = 1,
xlim = NULL, ylim = NULL, arrow.len = 0.1,
main = NULL, sub = NULL, xlab = NULL, ylab = NULL, ...)
from:
http://stat.ethz.ch/R-manual/R-devel/library/stats/html/biplot.html
last accessed: 04/07/2018
biplot
The coefficient of determination is a data item measuring the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
In the case of a linear regression model, the coefficient of determination r2 is the quotient of the variances of the fitted values and observed values of the dependent variable.
r2
> eruption.lm = lm(eruptions ~ waiting, data=faithful)
> summary(eruption.lm)$r.squared
from:
http://www.r-tutor.com/elementary-statistics/simple-linear-regression/coefficient-determination
coefficient of determination
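Both characterisations above (1 − SSres/SStot and the quotient of variances) can be verified on a small hypothetical dataset, complementing the R example given:

```python
# r-squared two ways for a simple least-squares linear regression:
# as 1 - SSres/SStot and as var(fitted)/var(observed).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical data
n = len(x)
mx, my = sum(x) / n, sum(y) / n
beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
alpha = my - beta * mx
fitted = [alpha + beta * a for a in x]
ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))
ss_tot = sum((b - my) ** 2 for b in y)
r2 = 1 - ss_res / ss_tot
# quotient of variances: identical for OLS with an intercept
r2_var = sum((f - my) ** 2 for f in fitted) / ss_tot
```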
a regression coefficient is a data item generated by a type of data transformation called a regression, which aims to model a response variable by expressing the predictor variables as part of a function in which each variable term is multiplied by a number. A regression coefficient is one such number.
Philippe Rocca-Serra
regression coefficient
An eigenvalue is a data item resulting from a data transformation known as eigenvalue decomposition. It also corresponds to a process of matrix diagonalization or any equivalent operation, i.e. transforming the underlying system of equations into a special set of coordinate axes in which the matrix takes this canonical form. Each eigenvalue is paired with a corresponding so-called eigenvector.
Philippe Rocca-Serra
adapted from:
http://mathworld.wolfram.com/Eigenvalue.html
last accessed: 04/07/2018
eigen(x, symmetric, only.values = FALSE, EISPACK = FALSE)
https://stat.ethz.ch/R-manual/R-devel/library/base/html/eigen.html
eigenvalue
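A minimal numerical illustration of the definition above, using numpy rather than the R eigen() signature listed; the matrix and values are hypothetical:

```python
# Eigenvalue decomposition of a small symmetric matrix; each eigenvalue
# is paired with an eigenvector satisfying A v = lambda v.
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # hypothetical symmetric matrix
eigenvalues, eigenvectors = np.linalg.eigh(A)  # ascending eigenvalues
```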
Factor analysis is a dimension reduction data transformation that is used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. Factor analysis is related to principal component analysis (PCA), but the two are not identical. Both PCA and factor analysis aim to reduce the dimensionality of a set of data, but the approaches taken to do so are different for the two techniques. Factor analysis is clearly designed with the objective to identify certain unobservable factors from the observed variables, whereas PCA does not directly address this objective; at best, PCA provides an approximation to the required factors.
term request from Ralf Weber and Gavin Lloyd, University of Birmingham
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
adapted from Wikipedia:
https://en.wikipedia.org/wiki/Factor_analysis
last accessed: 04/07/2018
Cattell, R. B. (1952). Factor analysis. New York: Harper
factanal(x, factors, data = NULL, covmat = NULL, n.obs = NA,
subset, na.action, start = NULL,
scores = c("none", "regression", "Bartlett"),
rotation = "varimax", control = NULL, ...)
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/factanal.html
factor analysis
In factor analysis, factor loadings express the relationship of each variable to the underlying factor.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Factor_analysis
https://www.theanalysisfactor.com/factor-analysis-1-introduction/
factor loadings
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/loadings.html
loadings
The score indicates how sensitive a likelihood function L(θ; X) is to its parameter θ. Explicitly, the score for θ is the gradient of the log-likelihood with respect to θ.
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
https://en.wikipedia.org/wiki/Score_(statistics)
https://www.rdocumentation.org/packages/bnlearn/versions/4.3/topics/score
efficient score
informant
score function
score
score
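As a sketch of the definition above: for a normal likelihood with known sigma, the score for the mean is sum(x_i − mu)/sigma², which vanishes at the maximum-likelihood estimate (the data below are hypothetical):

```python
# Score for the mean of a normal distribution with known sigma:
# d/dmu log L = sum(x_i - mu) / sigma**2; zero at the MLE (sample mean).
data = [1.2, 0.8, 1.5, 0.9, 1.1]  # hypothetical sample
sigma = 1.0

def score(mu):
    """Gradient of the log-likelihood with respect to mu."""
    return sum(x - mu for x in data) / sigma ** 2

mle = sum(data) / len(data)  # sample mean maximizes the likelihood
```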
The selectivity ratio (SR) is defined as the ratio of the explained variance v_expl,i to the residual variance v_res,i for variable i on the target projection (TP) component, in the context of Partial Least Squares analysis.
Philippe Rocca-Serra
https://onlinelibrary.wiley.com/doi/pdf/10.1002/cem.1289
selectivity ratio
selectivity ratio
Partial least squares regression (PLS regression) is a data transformation that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models. Partial least squares Discriminant Analysis (PLS-DA) is a variant used when the Y is categorical.
PLS is used to find the fundamental relations between two matrices (X and Y), i.e. a latent variable approach to modeling the covariance structures in these two spaces. A PLS model will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS regression is particularly suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among X values. By contrast, standard regression will fail in these cases (unless it is regularized).
Partial least squares was introduced by the Swedish statistician Herman O. A. Wold, who then developed it with his son, Svante Wold. An alternative term for PLS (and more correct according to Svante Wold) is projection to latent structures, but the term partial least squares is still dominant in many areas. Although the original applications were in the social sciences, PLS regression is today most widely used in chemometrics and related areas. It is also used in bioinformatics, sensometrics, neuroscience and anthropology.
term request from Ralf Weber and Gavin Lloyd, University of Birmingham
Philippe Rocca-Serra
PLS
adapted from wikipedia:
last accessed on 24/07/2018 from:
https://en.wikipedia.org/wiki/Partial_least_squares_regression
Partial least squares analysis
https://rpubs.com/omicsdata/pls
http://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.PLSRegression.html
Partial Least Square regression
a version of PLS used for classification, where the input y-block contains group labels (a categorical variable) rather than a continuous variable
term request from Ralf Weber and Gavin Lloyd, University of Birmingham
Philippe Rocca-Serra
PLS-DA
adapted from wikipedia
partial least squares discriminant analysis
# S3 method for default
plsda(x, y, ncomp = 2, probMethod = "softmax",
prior = NULL, ...)
https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/plsda
http://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.PLSRegression.html
Partial Least Square Discriminant Analysis
The arithmetic mean is defined as the sum of the numerical values of each and every observation divided by the total number of observations. The arithmetic mean A is defined by the formula
A = sum[A_i] / n, where i ranges from 1 to n and A_i represents the value of individual observations.
The arithmetic mean is significantly affected by extreme values and outliers. A better measure of central tendency is the median (http://purl.obolibrary.org/obo/STATO_0000574).
replaced OBI import following addition of restrictions and use in STATO. however the xref to OBI is kept as class metadata
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
STATO
arithmetic mean
http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
average value
http://purl.obolibrary.org/obo/OBI_0000679
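The formula and the note about sensitivity to outliers can be illustrated with a short sketch (hypothetical observations):

```python
# Arithmetic mean A = sum(A_i) / n, and its sensitivity to an outlier
# compared with the median.
import statistics

obs = [2.0, 3.0, 4.0, 5.0, 6.0]
mean_clean = sum(obs) / len(obs)  # 4.0

with_outlier = obs + [100.0]      # one extreme value added
mean_out = sum(with_outlier) / len(with_outlier)   # pulled far upward
median_out = statistics.median(with_outlier)       # barely moves
```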
The median is that value of the variate which divides the total frequency into two halves. The median is a measure of central tendency of data. It is obtained by arranging the observations in order from smallest to largest value. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values.
PRS and AGB added restriction about 'measure of central tendency' and quartile, june 2013 on 'centre value' OBI class.
replaced OBI import following addition of restrictions and use in STATO. however the xref to OBI is kept as class metadata
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
second quartile
A Dictionary of Statistical Terms, 5th edition, prepared for the International Statistical Institute by F.H.C. Marriott. Published for the International Statistical Institute by Longman Scientific and Technical.
and
Wolfram Alpha
median
center value
http://purl.obolibrary.org/obo/OBI_0000674
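The odd/even rule above can be sketched directly (a minimal illustrative implementation, not one referenced in the metadata):

```python
# Median by sorting: middle value for odd n, average of the two middle
# values for even n.
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2
```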
a data transformation which finds principal components by applying the nonlinear iterative partial least squares (NIPALS) algorithm
term request from Ralf Weber and Gavin Lloyd, University of Birmingham
Philippe Rocca-Serra
NIPALS
https://cran.r-project.org/web/packages/nipals/vignettes/nipals_algorithm.pdf
nipals(x, ncomp = min(nrow(x), ncol(x)), center = TRUE, scale = TRUE,
maxiter = 500, tol = 1e-06, startcol = 0, fitted = FALSE,
force.na = FALSE, gramschmidt = TRUE, verbose = FALSE)
https://www.rdocumentation.org/packages/mixOmics/versions/6.3.2/topics/nipals
nonlinear iterative Partial Least Squares
A novel algorithm for partial least squares (PLS) regression, SIMPLS, is proposed which calculates the PLS factors directly as linear combinations of the original variables. The PLS factors are determined such as to maximize a covariance criterion, while obeying certain orthogonality and normalization restrictions. This approach follows that of other traditional multivariate methods. The construction of deflated data matrices as in the nonlinear iterative partial least squares (NIPALS)-PLS algorithm is avoided. For univariate y SIMPLS is equivalent to PLS1 and closely related to existing bidiagonalization algorithms. This follows from an analysis of PLS1 regression in terms of Krylov sequences. For multivariate Y there is a slight difference between the SIMPLS approach and NIPALS-PLS2. In practice the SIMPLS algorithm appears to be fast and easy to interpret as it does not involve a breakdown of the data sets.
The acronym SIMPLS comes from 'straightforward implementation of a statistically inspired modification of the PLS method'
term request from Ralf Weber and Gavin Lloyd, University of Birmingham
Philippe Rocca-Serra
SIMPLS: An alternative approach to partial least squares regression
Sijmen de Jong
https://doi.org/10.1016/0169-7439(93)85002-X
simpls(X, Y, ncomp, stripped = FALSE, ...)
https://www.rdocumentation.org/packages/cocorresp/versions/0.3-0/topics/simpls
SIMPLS
a partial least squares regression applied when there is only one variable in Y (the matrix of response variables), or when it is desirable to model and optimize the performance of each of the variables in Y separately. This case is usually referred to as PLS1 regression (J = 1).
term request from Ralf Weber and Gavin Lloyd, University of Birmingham
Philippe Rocca-Serra
A comparison of nine PLS1 algorithms, Martin Andersson.
https://doi.org/10.1002/cem.1248
plsreg1(x, y, nc = 2, cv = FALSE)
https://www.rdocumentation.org/packages/plspm/versions/0.2-2/topics/plsreg1
PLS1
a partial least squares regression applied to a multivariate response variable.
term request from Ralf Weber and Gavin Lloyd, University of Birmingham
Philippe Rocca-Serra
a partial least squares regression applied when the Y matrix of response variables is truly multivariate (J > 1)
https://doi.org/10.1016/0003-2670(86)80028-9
plsreg2(X, Y, nc = 2)
https://www.rdocumentation.org/packages/plspm/versions/0.2-2/topics/plsreg2
PLS2
improved kernel PLS is a data transformation which implements a very fast kernel algorithm for updating PLS models in a recursive manner and for exponentially discounting past data.
term request from Ralf Weber and Gavin Lloyd, University of Birmingham
Philippe Rocca-Serra
https://onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291099-128X%28199701%2911%3A1%3C73%3A%3AAID-CEM435%3E3.0.CO%3B2-%23
kernelpls.fit(X, Y, ncomp, stripped = FALSE, ...)
https://www.rdocumentation.org/packages/pls/versions/2.6-0/topics/kernelpls.fit
improved Kernel PLS
variable importance in projection is a measure computed as part of a partial least squares regression which accumulates the importance of each variable j, as reflected by the weights w, from each component.
term request from Ralf Weber and Gavin Lloyd, University of Birmingham
Philippe Rocca-Serra
VIP
https://doi.org/10.1016/j.chemolab.2012.07.010
and
S. Wold, E. Johansson, M. Cocchi
PLS: Partial Least Squares Projections to Latent Structures, 3D QSAR in drug design, 1 (1993), pp. 523-550
vip(object)
https://www.rdocumentation.org/packages/mixOmics/versions/6.3.2/topics/vip
variable importance in projection
a data transformation which computes the singular value decomposition of a rectangular matrix.
The singular-value decomposition is very general in the sense that it can be applied to any m × n matrix whereas eigenvalue decomposition can only be applied to certain classes of square matrices.
term request from Ralf Weber and Gavin Lloyd, University of Birmingham
Philippe Rocca-Serra
adapted from wikipedia:
https://en.wikipedia.org/wiki/Singular-value_decomposition
last accessed: 24/08/2018
svd(x, nu = min(n, p), nv = min(n, p), LINPACK = FALSE)
https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/svd
numpy.linalg.svd(a, full_matrices=True, compute_uv=True)
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.svd.html
singular value decomposition
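A small worked example of the decomposition, following the numpy.linalg.svd signature listed above; the input matrix is hypothetical:

```python
# Singular value decomposition of a rectangular (3 x 2) matrix and
# exact reconstruction from the factors X = U diag(s) V^T.
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])  # hypothetical rectangular matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_rebuilt = U @ np.diag(s) @ Vt  # should recover X
```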
best linear unbiased estimator
Philippe Rocca-Serra
Henderson C. R., 1984 Applications of Linear Models in Animal Breeding. University of Guelph, Guelph, Ontario, Canada.
ftp://tech.obihiro.ac.jp/suzuki/Henderson.pdf
BLUE
best linear unbiased estimator of the fixed effect
best linear unbiased estimator
https://www.ncbi.nlm.nih.gov/pubmed/24033541
"An experiment was conducted to investigate the effect of a prebiotic on performance of partridge. The experiment was carried out with a total of eighty-day-old male Chukar partridge (Alectoris chukar chukar) chicks in a completely randomized design. The dietary treatments consisted of a control and an experimental treatment, and each treatment was replicated four times with 10 chicks per replicate."
a completely randomized design is a type of design of experiment where the observation units receive treatments (independent variable levels) entirely at random. In other words, the observation units are randomly assigned to treatments.
Completely randomized designs differ from randomized complete block designs and should not be confused with them: in the latter, a blocking variable is first used to assign experimental units to blocks, and only then are the members of each block randomly assigned to different treatment groups.
term request by Hanna Cwiek, https://github.com/ISA-tools/stato/issues/61
Philippe Rocca-Serra
adapted from http://www.stat.yale.edu/Courses/1997-98/101/expdes.htm and from
A Dictionary of Statistics (3 ed.) , Graham Upton and Ian Cook
Publisher: Oxford University Press
Print Publication Date: 2014
Print ISBN-13: 9780199679188
http://animsci.agrenv.mcgill.ca/StatisticalMethodsII/R/crd/index.html
completely randomized design
the Wald statistic is a statistic used in a Wald test, a test of significance of a regression coefficient; it is based on the asymptotic normality property of maximum likelihood estimates, and is computed as:
W = b * 1/Var(b) * b
In this formula, b stands for the parameter estimates, and Var(b) stands for the asymptotic variance of the parameter estimates. The Wald statistic is tested against the Chi-square distribution in the Wald test.
term request from Hanna Cwiek:
https://github.com/ISA-tools/stato/issues/67
Philippe Rocca-Serra
adapted from wikipedia and http://documentation.statsoft.com/STATISTICAHelp.aspx?path=glossary/GlossaryTwo/W/WaldStatistic
Wald statistic
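The formula above can be evaluated directly for a single coefficient; the estimate and variance below are hypothetical, and the p-value uses the 1-df chi-squared tail expressed through the normal CDF:

```python
# Wald statistic W = b * (1/Var(b)) * b for one coefficient, with a
# p-value from the chi-squared distribution with 1 degree of freedom.
import math

b = 0.62       # hypothetical parameter estimate
var_b = 0.04   # hypothetical asymptotic variance of the estimate
W = b * (1.0 / var_b) * b
# For 1 df: P(chi2 >= W) = 2 * (1 - Phi(sqrt(W))) = erfc(sqrt(W / 2))
p_value = math.erfc(math.sqrt(W / 2.0))
```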
degree of freedom calculation is a data transformation which is part of a statistical test and which aims to determine or estimate the number of degrees of freedom in a system.
term request from Hanna Cwiek:
https://github.com/ISA-tools/stato/issues/68
Philippe Rocca-Serra
STATO
degree of freedom calculation
a restricted randomized design is a kind of study design which uses randomization to allocate observation units to treatments, but where intuitively poor allocations of treatments to experimental units are avoided, while retaining the theoretical benefits of randomization. This is often the case when so-called 'hard to change' factors are used in an experimental design.
Philippe Rocca-Serra
adapted from wikipedia
restricted randomized design
https://www.nature.com/articles/srep35323/tables/1
the percentage of variance is an output of Principal Component Analysis obtained by dividing an eigenvalue by the sum of all eigenvalues. This produces a "percentage of variance" for each eigenvector.
Philippe Rocca-Serra
adapted from:
https://stats.stackexchange.com/questions/31908/what-is-percentage-of-variance-in-pca
PoV
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
percentage of variance
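The ratio described above amounts to a one-liner; the eigenvalues below are hypothetical:

```python
# Percentage of variance per component: each eigenvalue divided by the
# sum of all eigenvalues, expressed as a percentage.
eigenvalues = [4.0, 3.0, 2.0, 1.0]  # hypothetical PCA eigenvalues
total = sum(eigenvalues)
pov = [100.0 * e / total for e in eigenvalues]  # sums to 100%
```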
the scaled identity covariance structure is a type of covariance structure which has constant variance. The assumption is that there is no correlation between any elements.
Hanna Cwiek
Philippe Rocca-Serra
Tom Nichols
adapted from https://www.ibm.com/support/knowledgecenter/en/SSLVMB_24.0.0/spss/advanced/covariance_structures.html
scaled identity covariance structure
material anatomical entity
Anatomical entity that has mass.
material anatomical entity
anatomical cluster
Anatomical group that has its parts adjacent to one another.
anatomical cluster
length unit
A unit which is a standard measure of the distance between two points.
length unit
mass unit
A unit which is a standard measure of the amount of matter/energy of a physical object.
mass unit
time unit
A unit which is a standard measure of the dimension in which events occur in sequence.
time unit
PLS weight is an information content entity which is generated when performing a Partial Least Squares analysis
term request from Ralf Weber and Gavin Lloyd, University of Birmingham
Philippe Rocca-Serra
PLS weight
a dataset which is made up of pedigree information, that is, presenting ancestry or lineage information for a set of individuals of an organism.
Philippe Rocca-Serra
pedigree data set
this is experimental, do not use for markup, no STATO ID assigned
Philippe Rocca-Serra
response variable explained by fixed effect of predictor variable hypothesis
this is experimental, do not use for markup, no STATO ID assigned
Philippe Rocca-Serra
response variable explained by interaction effect of predictor variables hypothesis
this is experimental, do not use for markup, no STATO ID assigned
Philippe Rocca-Serra
response variable explained by random effect of predictor variable hypothesis
variance component estimate
example to be eventually removed
Class has all its metadata, but is either not guaranteed to be in its final location in the asserted IS_A hierarchy or refers to another class that is not complete.
metadata complete
term created to ease viewing/sort terms for development purpose, and will not be included in a release
PERSON:Alan Ruttenberg
organizational term
Class has undergone final review, is ready for use, and will be included in the next release. Any class lacking "ready_for_release" should be considered likely to change place in hierarchy, have its definition refined, or be obsoleted in the next release. Those classes deemed "ready_for_release" will also be derived from a chain of ancestor classes that are also "ready_for_release."
ready for release
Class is being worked on; however, the metadata (including definition) are not complete or sufficiently clear to the branch editors.
metadata incomplete
Nothing done yet beyond assigning a unique class ID and proposing a preferred term.
uncurated
All definitions, placement in the asserted IS_A hierarchy and required minimal metadata are complete. The class is awaiting a final review by someone other than the term editor.
pending final vetting
Terms with this status should eventually be replaced with a term from another ontology.
Alan Ruttenberg
group:OBI
to be replaced with external ontology term
A term that is metadata complete, has been reviewed, and problems have been identified that require discussion before release. Such a term requires editor note(s) to identify the outstanding issues.
Alan Ruttenberg
group:OBI
requires discussion
## Elucidation
This is used when the statement/axiom is assumed to hold true 'eternally'
## How to interpret (informal)
First the "atemporal" FOL is derived from the OWL using the standard
interpretation. This axiom is temporalized by embedding the axiom
within a for-all-times quantified sentence. The t argument is added to
all instantiation predicates and predicates that use this relation.
## Example
Class: nucleus
SubClassOf: part_of some cell
forall t :
forall n :
instance_of(n,Nucleus,t)
implies
exists c :
instance_of(c,Cell,t)
part_of(n,c,t)
## Notes
This interpretation is *not* the same as an at-all-times relation
axiom holds for all times
a false positive rate whose value is 5 per cent
Following discussion with OBCS, deprecation of class STATO_0000043 and creation of an instance
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
5 % false positive rate
a false positive rate whose value is 1 per cent
Following discussion with OBCS, deprecation of class STATO_0000281 and creation of an instance
Alejandra Gonzalez-Beltran
Orlaith Burke
Philippe Rocca-Serra
STATO
1 % false positive rate
en
Ontology for Biomedical Investigations
Advisors for this project come from the IFOMIS group, Saarbruecken and from the Co-ODE group in Manchester
Alan Ruttenberg
Allyson Lister
Barry Smith
Bill Bug
Bjoern Peters
Carlo Torniai
Chris Mungall
Chris Stoeckert
Chris Taylor
Christian Bolling
Cristian Cocos
Daniel Rubin
Daniel Schober
Dawn Field
Dirk Derom
Elisabetta Manduchi
Eric Deutsch
Frank Gibson
Gilberto Fragoso
Helen C. Causton
Helen Parkinson
Holger Stenzhorn
James A. Overton
James Malone
Jay Greenbaum
Jeffrey Grethe
Jennifer Fostel
Jessica Turner
Jie Zheng
Joe White
John Westbrook
Kevin Clancy
Larisa Soldatova
Lawrence Hunter
Liju Fan
Luisa Montecchi
Matthew Brush
Matthew Pocock
Melanie Courtot
Melissa Haendel
Mervi Heiskanen
Monnie McGee
Norman Morrison
Philip Lord
Philippe Rocca-Serra
Pierre Grenon
Richard Bruskiewich
Richard Scheuermann
Robert Stevens
Ryan R. Brinkman
Stefan Wiemann
Susanna-Assunta Sansone
Tanya Gray
Tina Hernandez-Boussard
Trish Whetzel
Yongqun He
2009-07-31
The Ontology for Biomedical Investigations (OBI) is built in a collaborative, international effort and will serve as a resource for annotating biomedical investigations, including the study design, protocols and instrumentation used, the data generated and the types of analysis performed on the data. This ontology arose from the Functional Genomics Investigation Ontology (FuGO) and will contain both terms that are common to all biomedical investigations, including functional genomics investigations, and those that are more domain specific.
OWL-DL
An ontology for the annotation of biomedical and functional genomics experiments.
http://creativecommons.org/licenses/by/4.0/
Please cite the OBI consortium http://purl.obolibrary.org/obo/obi where traditional citation is called for. However it is adequate that individual terms be attributed simply by use of the identifying PURL for the term, in projects that refer to them.
2018-05-23