Cases

1. Global registration system

2. Chemistry and bioassay data integration

3. Lead optimization system based on Pipeline Pilot

4. Pharmaceutical data normalization system based on Pipeline Pilot

5. A platform for predicting and analyzing high-throughput protein interactions

6. A platform for analyzing and mining data from gene-chips


1. Global registration system

A global chemical company needed an enterprise system that could support high-performance compound registration by scientists across multiple sites. Registered entity types included simple compounds, mixtures, plants, seeds, insects, spores, containers, and test samples. Registration processes, including batch registrations, were driven by externally defined workflows. The system also provided monitoring, querying, and reporting functionality.
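As a rough sketch of the registration flow (the interfaces and names below are hypothetical stand-ins for the jBPM- and Drools-backed services; the actual entity model and workflow ids were site-specific):

    import java.util.List;
    import java.util.Map;

    enum EntityType { COMPOUND, MIXTURE, PLANT, SEED, INSECT, SPORE, CONTAINER, TEST_SAMPLE }

    interface WorkflowEngine {                        // backed by jBPM in the real system
        String startProcess(String processId, Map<String, Object> params);
    }

    interface RuleService {                           // backed by Drools in the real system
        List<String> validate(EntityType type, Map<String, Object> fields);
    }

    final class RegistrationService {
        private final WorkflowEngine workflows;
        private final RuleService rules;

        RegistrationService(WorkflowEngine workflows, RuleService rules) {
            this.workflows = workflows;
            this.rules = rules;
        }

        /** Registers one entity; batch registration loops over this per record. */
        String register(EntityType type, Map<String, Object> fields) {
            List<String> violations = rules.validate(type, fields);
            if (!violations.isEmpty()) {
                throw new IllegalArgumentException("Validation failed: " + violations);
            }
            // The workflow id per entity type comes from external configuration.
            return workflows.startProcess("register-" + type.name().toLowerCase(), fields);
        }
    }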


Technologies:

  • JBoss (J2EE) platform, SSO (single sign-on), MQ, load balancing
  • Pipeline Pilot server cluster as compute engine layer
  • jBPM workflow management, Drools rules engine
  • jQuery for front-end JavaScript support



2. Chemistry and bioassay data integration

A scientific data-integration system for structure/activity analysis was developed for a biotech company to help scientists perform real-time bioactivity analysis on compounds drawn from disparate data sources. Compounds were analyzed at the scaffold, structural-classification, and compound/lot/batch levels against both in vivo and in vitro data. Raw biology data were retrieved and aggregated on the fly according to business rules.
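The on-the-fly aggregation pattern can be sketched as below; the record type and the geometric-mean rule are illustrative assumptions, not the customer's actual schema or business rules:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    record AssayResult(String compoundId, String scaffoldId, double ic50Nm) {}

    final class Aggregator {
        /** Geometric-mean IC50 per scaffold, computed at query time from raw rows. */
        static Map<String, Double> geoMeanByScaffold(List<AssayResult> raw) {
            return raw.stream().collect(Collectors.groupingBy(
                AssayResult::scaffoldId,
                Collectors.collectingAndThen(
                    Collectors.averagingDouble(r -> Math.log(r.ic50Nm())),
                    Math::exp)));
        }
    }

In a design like this, each business rule maps to a different reduction over the same raw rows, which keeps the aggregation pluggable.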


Technologies:

  • JEE platform, Oracle Accord cartridge, distributed cache, ExtJS, BIRT (for reporting)



3. Lead optimization system based on Pipeline Pilot


The R&D department of a pharmaceutical company needed to speed up its computer-aided lead optimization procedure by integrating several existing software packages. On the Pipeline Pilot platform, we developed a system that encapsulates the whole lead optimization process. Given the initial lead structures and their activity measurements, the system builds QSAR models, assembles a virtual library of compounds, prioritizes candidates that comply with user-specified profiles, and generates a final report.
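As an illustration of the prioritization step only, here is a minimal sketch assuming a simple weighted-sum scoring scheme; the production weighting schemes and QSAR models (built via Pipeline Pilot and R) were more elaborate:

    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    record Candidate(String smiles, Map<String, Double> predictedProperties) {}

    final class Prioritizer {
        /** Ranks candidates, best first, by a weighted sum of predicted properties. */
        static List<Candidate> rank(List<Candidate> library, Map<String, Double> weights) {
            Comparator<Candidate> byScore = Comparator.comparingDouble(
                c -> weights.entrySet().stream()
                    .mapToDouble(w -> w.getValue()
                        * c.predictedProperties().getOrDefault(w.getKey(), 0.0))
                    .sum());
            return library.stream().sorted(byScore.reversed()).toList();
        }
    }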


Functions:

  • With reasonable default settings, beginners can use the system and obtain meaningful results; advanced users may change the settings to better suit their needs
  • A user-defined library of functional groups may be provided to increase the structural diversity of the virtual library
  • Complex weighting schemes can be used to guide the direction of the optimization
  • It can be embedded in a higher-level algorithm to allow multiple rounds of lead optimization

Technologies:

  • Pipeline Pilot platform and its built-in components
  • Perl and Pipeline Pilot scripts were used to encapsulate third-party programs
  • R was used to generate statistical models



4. Pharmaceutical data normalization system based on Pipeline Pilot


A large biotech lab had communication problems because its data were saved in many places, came from different sources, and were presented in various formats. Errors were frequently introduced when data were transferred. To solve the problem, we developed a Pipeline Pilot-based data-management system for the lab. The system presents users with a unified, easy-to-use interface that hides details such as communicating with different databases, normalizing data, and confirming compound structures. With this system, users can obtain high-quality, clean data much more efficiently.
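The unit-normalization idea can be sketched as follows; the units and factors are illustrative only, and in the real system users could register their own domain-specific converters:

    import java.util.Map;

    final class UnitNormalizer {
        // Conversion factors into the canonical unit (here: nM for concentrations).
        private final Map<String, Double> toCanonical = Map.of(
            "nM", 1.0, "uM", 1_000.0, "mM", 1_000_000.0);

        double normalize(double value, String unit) {
            Double factor = toCanonical.get(unit);
            if (factor == null) {
                throw new IllegalArgumentException("Unknown unit: " + unit);
            }
            return value * factor;
        }
    }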


Functions:

  • All data management can be done through a single interface
  • Built-in unit converters can be used to normalize data; users may customize converters to meet their domain-specific requirements
  • Compound structures are checked against public resources, and alternative structures are suggested when available

Technologies:

  • Pipeline Pilot platform and its built-in components
  • SOAP-based web services
  • Perl was used to parse data and to make Pipeline Pilot components
  • SQL was used for database operations



5. A platform for predicting and analyzing high-throughput protein interactions


Using artificial-intelligence algorithms, this platform was developed to help users screen candidate interacting proteins by analyzing known samples. Given user-specified criteria, it can also generate a list of predicted candidate genes and mark up and score active regions on protein molecules.
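The production scoring ran as C code parallelized with MPI on a cluster; the following is only a single-machine Java analogue of the same partition-and-score pattern, with a placeholder where the trained model would be:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    final class ParallelScorer {
        static double[] scoreAll(List<String> regions) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
            double[] scores = new double[regions.size()];
            try {
                List<Future<?>> tasks = new ArrayList<>();
                for (int i = 0; i < regions.size(); i++) {
                    final int idx = i;
                    tasks.add(pool.submit(() -> { scores[idx] = score(regions.get(idx)); }));
                }
                for (Future<?> t : tasks) t.get();   // wait for every partition
            } finally {
                pool.shutdown();
            }
            return scores;
        }

        private static double score(String region) {
            return region.length();   // placeholder for the trained model's score
        }
    }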


Functions:

  • Users can specify the range of candidate genes and proteins
  • Active regions and hot spots on candidate proteins are marked based on predicted scores
  • A built-in sample database contains information on the most commonly used protein targets
  • The prediction subsystem can improve itself, gradually increasing its predictive accuracy by learning from past results

Technologies:

  • MPI (Message Passing Interface) was used to implement the parallel algorithms running on high-performance clusters
  • C was chosen as the implementation language for performance



6. A platform for analyzing and mining data from gene-chips


The primary goal of this platform was to give users a tool to efficiently process and analyze large volumes of gene-chip data without worrying about data collection, normalization, or parameter setting and optimization. The system used a variety of data-analysis and data-mining methods, and final results could be presented to users in an easy-to-understand form.
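The analysis modules themselves were written in R; purely to illustrate the kind of processing the platform automated, here is a Java sketch of one common normalization step, a log2 transform with per-chip median centering (an assumption about the method, not a description of the shipped code):

    import java.util.Arrays;

    final class ChipNormalizer {
        /** Returns log2-transformed intensities centered on the chip's median. */
        static double[] log2MedianCenter(double[] intensities) {
            double[] logs = Arrays.stream(intensities)
                .map(v -> Math.log(v) / Math.log(2)).toArray();
            double[] sorted = logs.clone();
            Arrays.sort(sorted);
            double median = sorted.length % 2 == 1
                ? sorted[sorted.length / 2]
                : (sorted[sorted.length / 2 - 1] + sorted[sorted.length / 2]) / 2;
            return Arrays.stream(logs).map(v -> v - median).toArray();
        }
    }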


Functions:

  • Multiple models were used to perform data analysis and mining
  • Easy configuration allows users to present results in user-designed formats
  • Many advanced options and settings for experienced users

Technologies:

  • Client/Server architecture using R as the implementation language
  • The system runs on Linux and Windows. Perl scripts were used to call functional modules in R
