Developmental Therapeutics Program (DTP)

Last Updated: 10/01/15

Methodology

Screening Procedures
Special Concentration Parameters
The Mean Graph
The COMPARE Algorithm
Building the COMPARE Database
Performing a COMPARE Analysis
Average Difference Method
Correlation Coefficient Method

Screening Procedures

The NCI screening procedures were described (9) as were the origins and processing of the cell lines (9, 10, 11, 12). Briefly, cell suspensions that were diluted according to the particular cell type and the expected target cell density (5,000-40,000 cells per well based on cell growth characteristics) were added by pipet (100 µL) into 96-well microtiter plates. Inoculates were allowed a preincubation period of 24 h at 37 °C for stabilization. Dilutions at twice the intended test concentration were added at time zero in 100-µL aliquots to the microtiter plate wells. Usually, test compounds were evaluated at five 10-fold dilutions. In routine testing, the highest well concentration is 10^-4 M, but for the standard agents the highest well concentration used depended on the agent. Incubations lasted for 48 h in 5% CO2 atmosphere and 100% humidity. The cells were assayed by using the sulforhodamine B assay (13, 14). A plate reader was used to read the optical densities, and a microcomputer processed the optical densities into the special concentration parameters defined later.
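As a rough illustration of the dosing scheme, the sketch below lists the five well concentrations for a routine test and the corresponding twofold-concentrated dilutions that are added to the wells. The function names and default arguments are hypothetical and are not part of the NCI software.

```python
def well_concentrations(hiconc=1e-4, n_doses=5, fold=10.0):
    """Final well concentrations (M): n_doses serial dilutions below hiconc."""
    return [hiconc / fold ** i for i in range(n_doses)]


def added_dilution(final_conc, added_ul=100.0, well_ul=100.0):
    """Concentration of the aliquot that gives final_conc after mixing.

    Adding 100 uL to a well that already holds 100 uL halves the
    concentration, so the aliquot must be twice the intended value.
    """
    return final_conc * (added_ul + well_ul) / added_ul


for conc in well_concentrations():
    print(f"well: {conc:.0e} M   added dilution: {added_dilution(conc):.0e} M")
```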

Special Concentration Parameters GI50, TGI, and LC50

The NCI renamed the IC50 value, the concentration that causes 50% growth inhibition, the GI50 value to emphasize the correction for the cell count at time zero; thus, GI50 is the concentration of test drug where 100 × (T - T0)/(C - T0) = 50 (3, 9). The optical density of the test well after a 48-h period of exposure to test drug is T, the optical density at time zero is T0, and the control optical density is C. The “50” is called the GI50PRCNT, a T/C-like parameter that can have values from +100 to -100. The GI50 measures the growth inhibitory power of the test agent. The TGI is the concentration of test drug where 100 × (T - T0)/(C - T0) = 0. Thus, the TGI signifies a cytostatic effect. The LC50, which signifies a cytotoxic effect, is the concentration of drug where 100 × (T - T0)/T0 = -50. The control optical density is not used in the calculation of LC50.
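The two percentage expressions defined above can be written as a minimal sketch. The function names and the optical density values are made up for illustration.

```python
def growth_percent(t, t0, c):
    """100 * (T - T0) / (C - T0): equals 50 at the GI50 and 0 at the TGI."""
    return 100.0 * (t - t0) / (c - t0)


def kill_percent(t, t0):
    """100 * (T - T0) / T0: equals -50 at the LC50; the control C is not used."""
    return 100.0 * (t - t0) / t0


# Hypothetical optical densities: time zero 0.5, untreated control 1.5.
t0, c = 0.5, 1.5
print(growth_percent(1.0, t0, c))   # 50.0  (the GI50 condition)
print(growth_percent(0.5, t0, c))   # 0.0   (the TGI condition)
print(kill_percent(0.25, t0))       # -50.0 (the LC50 condition)
```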

These concentration parameters are interpolated values. One uses the concentrations giving GI50PRCNT values above and below the reference values (e.g., 50 for GI50) to make interpolations on the concentration axis. Currently, about 45% of the GI50 records in the database are “approximated”. In 42% of the records, the GI50PRCNT for a given cell line does not go to 50 or below. For mean graph (see the discussion later) and COMPARE purposes, the value assumed for the GI50 in such a case is the highest concentration tested (HICONC). Similar approximations are made when the GI50 cannot be calculated because the GI50PRCNT does not go as high as 50 or above (3% of total). In this case, the lowest concentration tested is used for the GI50. Corresponding approximations are made for the TGI and for the LC50.
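The interpolation and the two approximation rules can be sketched as follows. The text does not say how the interpolation is scaled, so this sketch assumes linear interpolation on a log10 concentration axis. The function name, the status labels, and the example data are hypothetical.

```python
import math


def estimate_parameter(concs, percents, reference=50.0):
    """Return (concentration, status) for a reference level such as 50 for GI50.

    concs must be in ascending order; percents are the matching values
    (GI50PRCNT in the case of the GI50).
    """
    if min(percents) > reference:
        # The response never goes to the reference or below: use HICONC.
        return concs[-1], "approximated (HICONC)"
    if max(percents) < reference:
        # The response never goes as high as the reference: use the lowest dose.
        return concs[0], "approximated (lowest concentration)"
    points = list(zip(concs, percents))
    for (c1, p1), (c2, p2) in zip(points, points[1:]):
        if p1 >= reference >= p2:
            if p1 == p2:
                return c1, "interpolated"
            fraction = (p1 - reference) / (p1 - p2)
            log_c = math.log10(c1) + fraction * (math.log10(c2) - math.log10(c1))
            return 10.0 ** log_c, "interpolated"
    raise ValueError("no downward crossing of the reference level was found")


doses = [1e-8, 1e-7, 1e-6, 1e-5, 1e-4]
print(estimate_parameter(doses, [100, 90, 70, 30, -20]))   # crossing between 1e-6 and 1e-5
print(estimate_parameter(doses, [100, 98, 95, 90, 80]))    # never reaches 50: HICONC
print(estimate_parameter(doses, [40, 30, 10, -20, -60]))   # always below 50: lowest dose
```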

We use these “approximated” GI50 (TGI and LC50) values in the mean graph and in COMPARE because they represent valuable information, even though that information is less exact than measured values would be if they were available. In an extreme case where a compound is essentially inert and the GI50s are all represented by the HICONC approximation, the mean graph becomes a flat vertical line (the mean line) and COMPARE has no pattern to correlate. The opposite extreme case is where a compound is so potent that the lowest concentration tested is used to approximate all of the GI50s. In this case, the mean graph is also a flat vertical line and COMPARE has nothing to correlate. The difference between the two extreme cases is in the retests that are done. The inert compound would not be retested. The potent compound would be retested at a more appropriate concentration range.

Between the two extremes are examples with few or many approximated GI50s. These can give good results in COMPARE, but the presence of the approximated GI50 requires an additional strategy in the database preparation. The strategy is to treat the data for a given compound in groups defined by the range of concentrations used in the experiment. These ranges are conveniently labeled according to the HICONC. Thus, if multiple tests of a compound are present in the database, only those experiments with the same HICONC are averaged.

This strategy results in compounds having more than one entry in the database. There are differences in the “approximated” GI50 content of the averaged data, and the averages are calculated from different experiments. Therefore, one should expect that the COMPARE-generated correlation coefficients may be different for the same compound tested at different HICONCs. Moreover, at run time the COMPARE user has the option to choose any one of the HICONC sets for the probe pattern averaging, or the user may choose to average all seed data regardless of the HICONC. The consequences of these options and strategies will be apparent later in the examples provided under Applications of COMPARE. For instance, the probe may find itself several times in the COMPARE list at less than 1.00 correlation coefficient.

The Mean Graph

The discussion of COMPARE presented in this chapter requires an understanding of the mean graph, a means of presenting the in vitro test results developed by the staff of DTP (2, 4, 5) to emphasize differential effects of test compounds on various human tumor cell lines.

A “mean graph” is a pattern created by plotting positive and negative values generated from a set of GI50, TGI, or LC50 values. The positive and negative values are plotted along a vertical line that represents the mean response of all the cell lines in the panel to the test agent. Positive values project to the right of the vertical line and represent cellular sensitivities to the test agent that exceed the mean. Negative values project to the left and represent cell line sensitivities to the test agent that are less than the average value.

The positive and negative values, called deltas, are generated from the GI50 data (or TGI or LC50 data) by a three-step calculation. The GI50 value for each cell line tested against a test compound is converted to its log10 GI50 value. These log10 GI50 values are averaged. Each log10 GI50 value is subtracted from the average to create the delta. Thus, a bar projecting 3 units to the right denotes that the GI50 (or TGI or LC50) for that cell line occurs at a concentration 1000 times less than the average concentration required for all the cell lines used in the experiment.
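The three-step delta calculation can be sketched in a few lines. The function name and the cell line labels are placeholders, and the GI50 values are invented.

```python
import math


def mean_graph_deltas(gi50_by_cell_line):
    """Return {cell line: delta}, where delta = mean(log10 GI50) - log10 GI50."""
    logs = {cell: math.log10(value) for cell, value in gi50_by_cell_line.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {cell: mean_log - log_value for cell, log_value in logs.items()}


# Hypothetical GI50 values in mol/L for three cell lines.
deltas = mean_graph_deltas({"line_A": 1e-8, "line_B": 1e-5, "line_C": 1e-5})
for cell, delta in deltas.items():
    print(cell, round(delta, 3))
```

Here the mean log10 GI50 is -6, so line_A (log10 GI50 of -8) gets a delta of about +2: its bar projects 2 units to the right, meaning it responds at a concentration 100 times lower than the panel average.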

The complete presentation and organization of the mean graph data were intended to optimize subpanel specific effects, for the listing of cell lines is by disease type, but this presentation of the data is only incidental to the COMPARE concept.

Experience with a wide variety of test compounds has led to the conclusion that a presentation of the mean graphs at all three special concentrations, the GI50, TGI, and LC50, is most useful (3).

The COMPARE Algorithm

COMPARE analyses are rank-ordered lists of compounds. Every compound from one of several specially prepared databases is ranked for similarity of its in vitro cell growth pattern to the in vitro cell growth pattern of a selected seed or probe compound. To derive COMPARE rankings, a scalar index of similarity between the seed compound cell growth pattern and the pattern for each of the COMPARE database compounds must be created. Two indexes of similarity, the average difference between deltas and the correlation coefficient, are described later, but others are possible (8). The average difference method (ADM) was developed and reported (4) before the correlation coefficient method (CCM), and the database preparation procedure for the ADM gives a database usable by either method. Therefore, the databases described later are those required by the ADM. The CCM can use these databases, but the mathematical characteristics of this method make the computation of deltas unnecessary.

Individual cell growth patterns can be represented by delta values. The number that is depicted in the mean graph as the length and direction of a bar on the graph is called delta. The ADM utilizes the same description of in vitro data as the mean graph (for important details, see the discussion of “approximated” GI50, TGI, and LC50).

Building the COMPARE Database

To facilitate routine use of COMPARE, several types of COMPARE databases are precalculated with values and stored in SAS data sets (SAS Institute, Inc.). These COMPARE databases are automatically updated (usually each week) as additional compounds are tested. The following detailed description illustrates the processes for collecting data for an ADM database.

The main Oracle database table that contains the GI50 data of the in vitro anticancer screen was searched by using Structured Query Language (SQL), the search language for Oracle and many other relational database management systems. The resulting file was output as an SAS data set ready for further processing by SAS programs. Each test was represented in the SAS data set by six variables: NSC, Panelnbr, Cellnbr, HICONC, GI50, and Discrete status. The NSC identifies which compound was tested. The Panelnbr and Cellnbr together describe the cell line used in the test. HICONC defines the highest dose used in the five-dose assay. The Discrete status denotes whether the compound is confidential or not. The log10 of the GI50 was taken. The database was then sorted by NSC, HICONC, and cell line (Panelnbr and Cellnbr), and the log10 GI50 values were averaged for the same NSC, HICONC, Panelnbr, and Cellnbr. The averaged log10 GI50 values were named M_GI50. For preparation of a database suitable for the CCM (but not the ADM), it is only necessary to sort the database at this stage by cell line (e.g., by Panelnbr and Cellnbr). The database would then be ready for use.

For preparation of a database suitable for both the ADM and the CCM, deltas have to be computed. For each NSC at a particular HICONC, the mean of the M_GI50 values was calculated and named MeanGI50. By subtracting the M_GI50 from the corresponding MeanGI50, the deltas were calculated. As a quality-control measure, sets of cell line deltas for a compound where there were fewer than 35 deltas were excluded. The number 35 was arbitrary, but experience suggested that sets with too few delta values sometimes gave spurious COMPARE matches. It would certainly be better to determine the cutoff value statistically, but this has not been done. Finally, for preparation of this data set for the eventual merging with data sets derived from seed data, this data set was sorted by cell line order (Panelnbr and Cellnbr). The database was ready for use.
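The database preparation described above was done with SQL and SAS. The following Python sketch only mirrors the described steps under simple assumptions: records arrive as plain tuples, the Discrete status is omitted, and the NSC numbers and GI50 values are invented. The function name is hypothetical.

```python
import math
from collections import defaultdict


def build_adm_database(records, min_deltas=35):
    """records: iterable of (nsc, hiconc, panelnbr, cellnbr, gi50) tuples.

    Returns {(nsc, hiconc): {(panelnbr, cellnbr): delta}}.
    """
    # Average log10 GI50 for the same NSC, HICONC, Panelnbr, Cellnbr (M_GI50).
    grouped = defaultdict(list)
    for nsc, hiconc, panelnbr, cellnbr, gi50 in records:
        grouped[(nsc, hiconc, panelnbr, cellnbr)].append(math.log10(gi50))

    m_gi50 = defaultdict(dict)
    for (nsc, hiconc, panelnbr, cellnbr), logs in grouped.items():
        m_gi50[(nsc, hiconc)][(panelnbr, cellnbr)] = sum(logs) / len(logs)

    database = {}
    for key, cells in m_gi50.items():
        if len(cells) < min_deltas:
            continue  # quality control: too few deltas for this compound set
        mean_gi50 = sum(cells.values()) / len(cells)  # MeanGI50
        # delta = MeanGI50 - M_GI50, stored in cell line order
        database[key] = {cell: mean_gi50 - cells[cell] for cell in sorted(cells)}
    return database


# Tiny made-up example; min_deltas is lowered so the demo keeps a result.
demo = [
    (1, 1e-4, 1, 1, 1e-6),
    (1, 1e-4, 1, 1, 1e-8),   # repeat test, same HICONC: averaged with the row above
    (1, 1e-4, 1, 2, 1e-5),
    (1, 1e-5, 1, 1, 1e-6),   # different HICONC: kept as a separate entry
]
print(build_adm_database(demo, min_deltas=2))
```

Because the grouping key includes HICONC, one NSC can appear more than once in the result, which matches the multiple database entries per compound described earlier.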

Performing a COMPARE Analysis

To run COMPARE analyses, one must first select from a menu a COMPARE program appropriate to the analysis desired. The selection determines which database will be analyzed. The current options are the standard agent database with 171 compounds, the synthetic compound database with >40,000 compounds and growing (this includes synthetic compounds and natural products of known structure), the natural product crude extract database with >20,000 screened extracts and growing, and other special-purpose databases.

The selection also determines which method of comparison will be used (CCM or ADM), what type of seed may be entered (standard agent or any screened compound or extract), and the level of the analysis (GI50, TGI, LC50, or all three simultaneously). The analyst then enters the NSC number of the desired seed compound and decides if all, some, or just one of the experiments performed on the seed should be averaged. Once the choices are made, the analyst executes the program. The execution begins with the collection, at the time of analysis, of the seed data containing GI50 (or TGI or LC50 or all three) values directly from the master Oracle database. The seed data collected from the master Oracle database is converted to an SAS data set. If more than one experiment was collected for the seed, the data are then averaged as described previously for building the database. For the ADM, the log10GI50 values from the seed must be converted to a set of delta values just as deltas were calculated for the COMPARE database.

In the next step, pairs of delta values are created (by using an SAS MERGE data step). Each pair consists of the delta value from the seed for a particular cell line and the delta value of a database compound for the same cell line. For example, the delta value calculated for HL-60 data from the seed is paired with the HL-60 delta value calculated for each database compound. For inclusion in the similarity index calculation, both methods (ADM and CCM) require that the delta value (or optionally the M_GI50 for the CCM) be present in both the seed and the database compound. Thus, if HL-60 was not successfully tested against the seed, no use would be made of any HL-60 data present in the database compounds. If HL-60 data are available for the seed, they will be used only in those cases where HL-60 data are also present in the database compounds. Thus, the seed data determine the maximum number of pairs of delta values that will be used to calculate the index of similarity of each database compound. Variations in the number of cell lines tested against individual database compounds will determine if the number of pairs in the seed data or some lesser number of pairs is used for the particular similarity index calculations.
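The pairing rule amounts to keeping only the cell lines present on both sides. The original uses an SAS MERGE data step; the sketch below is a plain Python equivalent of that rule. Apart from HL-60, the cell line labels are placeholders, and all delta values are invented.

```python
def pair_deltas(seed, compound):
    """Return [(cell line, seed delta, compound delta)] for shared cell lines only."""
    return [(cell, seed[cell], compound[cell])
            for cell in sorted(seed) if cell in compound]


seed = {"HL-60": 0.8, "cell_B": -0.3, "cell_C": -0.5}
compound = {"HL-60": 0.6, "cell_C": -0.6, "cell_D": 0.0}

# cell_B is missing from the compound and cell_D from the seed, so neither is used.
print(pair_deltas(seed, compound))
```

The number of pairs can never exceed the number of cell lines in the seed, and it drops below that whenever a database compound lacks data for one of them.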

Average Difference Method

The first step in calculating this index of similarity is to take the difference between paired deltas. These differences are then averaged for each database compound, and the compounds are sorted by their average difference. The compound with the smallest average difference is the most similar to the seed compound.
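A minimal sketch of the average difference ranking follows. It assumes that absolute differences are averaged; the text does not state this, but signed differences between two mean-centered patterns would largely cancel. The function names, compound names, and delta values are hypothetical.

```python
def average_difference(seed, compound):
    """Mean absolute difference between paired deltas, or None if no pairs exist."""
    shared = [cell for cell in seed if cell in compound]
    if not shared:
        return None
    return sum(abs(seed[cell] - compound[cell]) for cell in shared) / len(shared)


def rank_by_average_difference(seed, database):
    """Return [(average difference, compound)] sorted with the most similar first."""
    scored = []
    for name, deltas in database.items():
        score = average_difference(seed, deltas)
        if score is not None:
            scored.append((score, name))
    return sorted(scored)


seed = {"a": 1.0, "b": -1.0}
database = {
    "X": {"a": 1.0, "b": -1.0},   # same pattern as the seed
    "Y": {"a": -1.0, "b": 1.0},   # opposite pattern
    "Z": {"a": 0.5, "c": 0.0},    # shares only cell line "a" with the seed
}
print(rank_by_average_difference(seed, database))
```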

Correlation Coefficient Method

A pairwise correlation coefficient (PCC) with the seed is calculated for each compound in the database. Those compounds with the highest correlation coefficient are most similar to the seed.

We use a commercial statistical package procedure (the SAS procedure PROC CORR) to obtain PCCs. The PCC provides an excellent index of similarity as judged by the many successful examples provided under Standard Agents and the Standard Agent Database, but other types of correlation coefficients could potentially work as well.

Analysts are usually interested in finding only those compounds in the database that are most similar to the seed. Thus, the list is truncated to 100 of the most similar compounds. Common chemical names, if they are on file, are added automatically to the truncated answer list.
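The correlation ranking and the truncation of the answer list can be sketched as below. PROC CORR computes Pearson correlations by default, so this sketch assumes a Pearson coefficient over the cell lines shared with the seed. The function names, compound names, and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a flat pattern gives nothing to correlate
    return sxy / math.sqrt(sxx * syy)


def rank_by_correlation(seed, database, top_n=100):
    """Return up to top_n (correlation, compound) pairs, highest correlation first."""
    scored = []
    for name, values in database.items():
        shared = [cell for cell in seed if cell in values]
        r = pearson([seed[c] for c in shared], [values[c] for c in shared])
        if r is not None:
            scored.append((r, name))
    scored.sort(key=lambda item: -item[0])
    return scored[:top_n]


seed = {"a": 1.0, "b": 2.0, "c": 4.0}
database = {
    "same": {"a": 1.0, "b": 2.0, "c": 4.0},
    "shifted": {"a": 11.0, "b": 12.0, "c": 14.0},
    "opposite": {"a": -1.0, "b": -2.0, "c": -4.0},
    "flat": {"a": 0.0, "b": 0.0, "c": 0.0},
}
for r, name in rank_by_correlation(seed, database):
    print(name, round(r, 3))
```

The "shifted" entry correlates with the seed as well as "same" does, because the coefficient ignores a constant offset. This is why the CCM can work on M_GI50 values directly without computing deltas.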

Surprisingly similar, but totally independent, work (97, 98) was published essentially concurrently with the early publications of mean graph and COMPARE in an entirely different area of application.
