Community Modeling Environment (CME): Proposal Section C.3

C.3. Architecture and Approach

Our goal is to develop an integrated environment in which a broad user community encompassing geoscientists, civil and structural engineers, educators, city planners, and disaster response teams can have access to powerful physics-based simulation techniques for seismic hazard analysis. To achieve this goal, the environment must provide a means for describing, configuring, instantiating, and executing complex computational pathways that result from the composition of various earthquake simulation models. The proposed architecture, illustrated in Figure 1, brings together research from several distinct computer science disciplines, with each area addressing one of the requirements (R1-R4) stated earlier:
The resulting environment will consist of an integrated set of services, knowledge bases, databases and tools which together implement the SCEC community modeling environment. In the following sections, we describe the contributions of each computer science technology in more detail.

C.3.a. Knowledge Representation and Reasoning: Managing Community Models

Earthquake models are very complex due to the complexity of geologic faults, the non-linearity of the underlying physics, uncertainty about initial conditions, computational complexity of the requisite numerical methods, and many other factors. Model implementations have to address additional complexities such as the representation of large and complex data sets, representation and storage of simulation outputs, handling of large sets of input and modeling parameters, execution requirements on high-performance computing infrastructure, etc. For almost all simulation models and codes in use today, these characteristics are completely opaque and implicit, and are at best communicated via textual publications or "word of mouth".

To achieve the vision of a virtual collaboratory where distributed and heterogeneous community model components can be rapidly located, assembled, configured, and run to solve an analysis task at hand, we need software tools to support users in the various stages of this process. These tools need to "know" all the pertinent characteristics of model components to facilitate tasks such as automated configuration, constraint checking, input/output translation, execution planning, etc.
To support the representation of complex and heterogeneous model characteristics as well as the necessary reasoning with these characteristics and associated constraints, we propose to use a knowledge-based approach that builds upon (1) the creation and use of shared ontologies, (2) modeling and inference techniques from the area of knowledge representation and reasoning (KR&R), and (3) ontology translation technology. The heart of our approach is the creation of a curated knowledge base for earthquake physics, an activity that parallels knowledge sharing efforts in other scientific disciplines, such as UMLS and GO [16, 17, 18]. Building on existing efforts within SCEC to develop consensus community models, geoscientists and knowledge engineers will collaborate to develop a knowledge base (KB) that will contain (1) terminology about the domain of earthquake science represented as ontologies of relevant terms [19, 20, 21], (2) procedural knowledge models representing pathway templates and idealized descriptions of executable simulation code, and (3) detailed models of the Unified Structural Representation that will capture the surface and subsurface structure of Southern California, including the fault systems and seismic velocity models that will make possible the physics-based simulations in Pathways 2-4. The structure of the proposed KB is shown in Figure 4.

In our approach, each community model component will be annotated with a set of logical descriptions represented in a formal KR language to describe the pertinent characteristics of the component or its profile. Such a profile will represent things such as the type of the model, simulation technique used, modeling assumptions, parameters and their ranges, parameter dependencies, input data requirements, characteristics of the output, data formats used, compute power requirements, etc. Note that the representation of these characteristics requires the use of an adequately expressive KR language.
For example, we might need to express for some model that "to simulate ground motions with frequencies greater than 0.5 Hz, a computer with more than 3 Gbytes of RAM is required." We will also need adequate logical reasoning power to allow the system to conclude that a 16-processor computer with 500 Mbytes of RAM per processor will satisfy this requirement if the model has been appropriately parallelized and can be run on the particular platform.

The knowledge base will also contain pathway templates that describe, for example, what combinations of simulation models can be used to obtain intensity measures of a certain type or accuracy. These pathway templates will express the constraints that must be met when individual simulations are combined to configure a specific pathway. The templates are used by interactive knowledge acquisition tools to generate a specification of the required data, software, and resources for use by the Grid execution environment.
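As an illustration only, the memory requirement quoted above can be written as a small check. The sketch below is plain Python, not PowerLoom syntax. The class and field names, and the reading of "3 Gbytes" as 3072 Mbytes, are assumptions made for this example. It also ignores the separate question of whether the code can run on the particular platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """Hypothetical description of a compute resource."""
    processors: int
    ram_mb_per_processor: int

    def total_ram_mb(self) -> int:
        return self.processors * self.ram_mb_per_processor


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical slice of a component profile (memory rule only)."""
    name: str
    parallelized: bool
    high_freq_threshold_hz: float  # rule applies above this frequency
    high_freq_min_ram_mb: int      # RAM that must be exceeded when it applies


def satisfies_ram_requirement(model: ModelProfile,
                              platform: Platform,
                              max_frequency_hz: float) -> bool:
    """Return True if the platform meets the model's memory rule."""
    if max_frequency_hz <= model.high_freq_threshold_hz:
        return True  # the high-frequency rule does not apply
    # A parallelized model can draw on the memory of all processors;
    # otherwise only one processor's memory counts.
    if model.parallelized:
        usable_mb = platform.total_ram_mb()
    else:
        usable_mb = platform.ram_mb_per_processor
    return usable_mb > model.high_freq_min_ram_mb


cluster = Platform(processors=16, ram_mb_per_processor=500)
parallel_model = ModelProfile("ground-motion", True, 0.5, 3 * 1024)
serial_model = ModelProfile("ground-motion", False, 0.5, 3 * 1024)
```

With these definitions, `satisfies_ram_requirement(parallel_model, cluster, 1.0)` holds because 16 x 500 Mbytes exceeds 3072 Mbytes, while the same call with `serial_model` fails. A real profile would state such rules declaratively in the KR language so that the reasoner, rather than hand-written code, draws the conclusion.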
To be able to compare meaningfully the profiles of components developed by different and potentially widely distributed groups, the descriptions of these profiles must draw from the same ontology or representational vocabulary [22]. For example, the previous section mentioned that, in the context of Pathway 1, the fault type used by the earthquake forecasting model must match the fault type controlling the attenuation relationships used to predict intensity measures. This can only be done if both models use the same (or compatible) terms to describe such fault-type restrictions. A major thrust of our work will be the development of such a shared ontology adequate for the area of earthquake science. This ontology needs to cover relevant physical phenomena and events such as types of terrain and faults, types of earthquakes and their occurrence, temporal and spatial relationships, etc. It also needs to cover pertinent aspects of the models describing or simulating such physical phenomena, as well as aspects of their implementation in a particular piece of computer code and their execution on some computer platform. The resulting ontology will not only support the various knowledge-based tools, but it will also drive consensus building among the scientific community -- a very desirable result.

To create and manipulate KBs, we will build upon the PowerLoom KR&R system developed at ISI to provide the necessary representation and reasoning services. PowerLoom is designed for high expressivity (it uses a representation language based on first-order predicate calculus) but without sacrificing scalability to large knowledge bases. PowerLoom has been successfully applied within the context of various DARPA research programs [23, 24] and, to date, has been distributed to over 100 sites and universities world-wide.
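The fault-type matching mentioned above depends on both models drawing their terms from one vocabulary. A minimal sketch of that idea follows, assuming a toy is-a hierarchy. The term names and the `compatible` helper are hypothetical and are not part of PowerLoom or of any SCEC ontology.

```python
# Hypothetical shared vocabulary: child term -> more general parent term.
IS_A = {
    "strike-slip-fault": "fault",
    "reverse-fault": "fault",
    "normal-fault": "fault",
    "thrust-fault": "reverse-fault",
}


def ancestors(term):
    """Return the term followed by every more general term above it."""
    chain = [term]
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def is_subsumed_by(specific, general):
    """True if `specific` is the same as, or a kind of, `general`."""
    return general in ancestors(specific)


def compatible(forecast_fault_type, attenuation_fault_types):
    """True if the forecast's fault type falls under any fault type
    that the attenuation relationship is declared to cover."""
    return any(is_subsumed_by(forecast_fault_type, covered)
               for covered in attenuation_fault_types)
```

Here a forecast that uses `thrust-fault` is accepted by an attenuation relationship declared for `reverse-fault`, because both terms come from the same hierarchy. If the two models used private vocabularies, the check could not be made without a translation step.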
As illustrated in Figure 4, KBs are integrated into the modeling environment via network-enabled knowledge servers, which respond to knowledge and inferencing requests formatted as XML over HTTP connections. As part of this project, we will integrate the knowledge servers into the Grid environment described below, providing for knowledge service discovery via Grid information services and secure knowledge inquiry via the Grid Security Infrastructure. Access to the PowerLoom knowledge-base and inference services will be central to many of the functions performed by other components. For example, its subsumption reasoner can check whether the input constraints of one component subsume (or are logically entailed by) the output description of another component, which will be used by the Pathway Assembly tool (described in §C.3.d) to check whether two components are compatible. Descriptions of available compute and storage capabilities will be used by Grid services to generate adequate execution plans. Finally, digital libraries can use access to the ontology to organize collections and facilitate search and navigation.

The distributed and heterogeneous nature of the community modeling environment will make it necessary to deal with issues of translation. For example, it will often be required to translate data formats, change resolution, perform unit conversions, etc., to match the output of one component to the input requirements of another. Some of these translations can be directly supported by the KR&R system; others will need to rely on special-purpose translation components, e.g., if very large data sets or outputs need to be translated. Such translators can become community components themselves and be leveraged in pathway configuration tasks. More difficult translation issues arise from ontology evolution, independent ontology development, or sometimes simply lack of consensus.
While ontologies are intended to provide a stable, universally agreed upon vocabulary, experience shows that they often need to change or evolve to handle new situations or capture newly found consensus or understanding. Moreover, communities sometimes cannot agree and develop different ontologies covering the same subject area. This can cause significant maintenance or translation problems for logical descriptions (such as our model annotations) based on different ontology versions. We will build upon our OntoMorph ontology translation system [25] to handle some of these translation and maintenance problems. For example, whenever the ontology evolves, we can provide a set of translation mappings to automatically translate descriptions based on an older version of the ontology into its new format. This mechanism can also be used to export the ontology into a different representation syntax for use by other systems. For example, we can generate Topic Maps [26] to support digital library systems or translate into a different KR language for use with different KR&R technology.

The domain of earthquake science provides a variety of interesting KR&R research challenges, such as having to represent and reason at widely different temporal and spatial scales and resolutions and having to deal with a number of difficult translation problems. There is also the need to reason with partial, approximate, and incomplete information; e.g., to handle only partially satisfiable constraints, or to weigh different soft constraints against each other. This is necessary, since the model components developed by different groups of scientists are heterogeneous and will not necessarily be designed to fit together smoothly, nor might their requirements be completely or correctly specified; however, such assumptions are commonly made by traditional configuration approaches [27, 28].

C.3.b. The Grid: High Performance Computing for Pathway Execution

Grids [29] are an emerging technology for creating and maintaining virtual organizations -- multi-organizational collaborations that share distributed resources to solve problems of common interest [30]. In many ways, SCEC is a prototypical virtual organization. SCEC participants are drawn from many different organizations, and the computers, earthquake catalogs, simulation codes, and other resources used by SCEC to address problems of geophysical modeling are geographically distributed. Actual simulation code may be stored in repositories under the control of the author, while the computers being used to execute these codes may be located within a national facility such as the San Diego Supercomputer Center, or be part of the soon-to-be-deployed Distributed Terascale Facility. Input data may come from one of the existing earthquake catalogs, or may be the result of a previous simulation run. Modeling in this environment requires the ability to pull all of these disparate resources into a single integrated computation. Grids provide these mechanisms.

At its most basic, Grid infrastructure furnishes the fundamental services needed to locate, access, and manage shared resources. These services include:
These services are among those that are provided by the Globus Grid toolkit [37]. We will use Globus as the infrastructure on which we build the execution environment of our integrated modeling environment. Globus services are being widely used by a number of different Grid-oriented projects in the US, Europe, and Asia. Much as TCP/IP and DNS provide the basic services on which higher-level capabilities such as the World Wide Web can be built, Globus services are designed to provide the low-level mechanisms on which higher-level, domain-specific services are constructed. Within the scope of this proposal, we will not be developing new Globus services, but rather creating higher-level functionality that builds on the core Grid infrastructure defined by Globus.

Figure 5 illustrates the basic structure of the execution environment. Within this component of the modeling framework, instantiated computational pathways, specifying the names and configurations of simulation models to be run, must be mapped to an execution plan, which specifies the sequence of execution steps to be taken, the storage systems or collections containing input data, the executables to be used for each simulation model, the computer to be used to execute the model, and the storage system on which to place output data. We will develop a domain-specific scheduler to perform this task. To simplify the process, the scheduler will convert the computational pathway into a scripted execution plan, using a set of pre-defined computational structures (such as parameter sweeps, processing pipelines, and parallel execution). Information about model components available via the knowledge base will be used to help guide scheduling decisions. Our initial scheduler will be simple, using application-level scheduling heuristics [38, 39, 40] that exploit the structure of the execution templates.
More sophisticated mappings can be obtained by querying the knowledge service to determine characteristics of the various simulation models, along with information about the computational environment. In later years of the project, we propose to create execution schedules by applying planning techniques such as those used by the knowledge acquisition tools to instantiate computational pathways. Near the middle of the project, we will integrate pathway instantiation and execution scheduling so that we can better guide pathway instantiation based on resource availability and perform replanning in case of resource failure or unplanned unavailability.
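The mapping from an instantiated pathway to a scripted execution plan can be pictured with a short sketch. It uses two of the pre-defined structures named above, a processing pipeline and a parameter sweep. Everything here is hypothetical: the `Step` and `Task` records, the host names, and the round-robin host choice, which merely stands in for the application-level heuristics the scheduler would actually use. No Globus calls are made.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Step:
    """One simulation model in a pathway (hypothetical record)."""
    model: str
    executable: str
    params: dict  # parameter name -> list of values to sweep over


@dataclass(frozen=True)
class Task:
    """One entry of the execution plan (hypothetical record)."""
    stage: int        # pipeline position; stage n runs after stage n-1
    model: str
    executable: str
    params: dict      # one concrete binding of the swept parameters
    host: str


def expand_sweep(params):
    """Turn {name: [values]} into a list of concrete {name: value} bindings."""
    names = sorted(params)
    value_lists = [params[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def build_plan(pathway, hosts):
    """Expand each pipeline stage into tasks and assign hosts round-robin."""
    plan = []
    next_host = 0
    for stage, step in enumerate(pathway):
        for binding in expand_sweep(step.params):
            host = hosts[next_host % len(hosts)]
            next_host += 1
            plan.append(Task(stage, step.model, step.executable, binding, host))
    return plan


pathway = [
    Step("earthquake-forecast", "forecast.exe", {}),
    Step("ground-motion", "gm.exe", {"max_freq_hz": [0.5, 1.0]}),
]
plan = build_plan(pathway, ["hostA", "hostB"])
```

For this two-stage pathway the plan contains three tasks: one forecast run, then two ground-motion runs that may execute in parallel because they share a stage. A knowledge-guided scheduler would replace the round-robin line with a choice informed by model profiles and resource descriptions.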
The second major component of the execution environment is the execution manager. The execution manager is responsible for obtaining an execution plan from the scheduler and then interacting with the underlying Grid services to carry out the plan. The execution manager will actively monitor the progress of the execution plan, dealing with events such as resource failure. Initial versions of the execution manager will employ simple failure recovery strategies, such as restart. In the second half of the project, the execution manager will be augmented to go back to the scheduler and re-plan the execution schedule to work around resource failure.

C.3.c. Knowledge-Based Collection Management: Simulation Code and Data Repositories

The purpose of performing a simulation is to produce data for further analysis. Given the distributed nature of the SCEC environment and the volume and number of data products that will be produced, we need a mechanism for managing and discovering information that lives in existing catalogs, such as IRIS, as well as new data products that are generated within the simulation environment. Digital libraries are currently being used to manage digital objects for a wide range of application domains [41, 42, 43, 44], and we plan to apply this technology to the problem of managing seismic simulation data within the context of our proposed simulation environment. Specifically, we will use the SDSC Storage Resource Broker to provide mechanisms for building distributed collections of simulation output, extensible metadata catalogs for managing collection attributes, and interoperability mechanisms for accessing data stored in archives, file systems, and databases.

A number of interesting research problems arise in the integration of digital library technology with Grids, knowledge bases, and knowledge acquisition, as we propose to do.
Specifically, we need to maintain sufficient information about the construction and execution of specific computational pathways to support queries based on how the information was produced, so that identical or similar data analyses can be reproduced. This will require coupling information produced by the execution manager and the Pathway Assembly tool (described below) into the managed collection. Another important issue is how to manage interoperability for data and metadata exchange between the multiple data collections accessed by the community models. The research issues include the development of mediation interfaces to the storage resources and information catalogs accessed by the community models, and the integration of the ability to manage collections of simulation output with the Grid execution environment.

Finally, we observe that a current area of active research in the digital library community is the incorporation of knowledge into a digital library by mapping domain concepts to the digital library metadata attributes [45]. The goal is to support concept-based queries against the collection in order to identify relevant data sets that have specified relationships described within the knowledge base. We will explore use of the ISO 13250 Topic Map standard as a potential syntax for representing the mapping between collection attributes and ontologies. As discussed in §C.3.a, we propose to use OntoMorph to translate between the knowledge representation used within the knowledge base and topic maps. This will provide us with a mechanism by which the collection management and knowledge bases can be integrated.

Another important research topic is the characterization of completeness and closure for the mapping between the knowledge ontologies and the simulation collections. Completeness is the identification of the necessary set of attributes required to represent each of the concepts expressed within the ontology.
Without a complete set of attributes, it will not be possible to apply the constraints specified for each pathway. Closure is the identification of the necessary relationships needed to describe the inherent knowledge within the simulation collection. If relationships imposed by a particular choice of input data or simulation algorithms are not captured, then artifacts or anomalies may be introduced into the collection when further processing is done. The development of mechanisms to specify completeness and closure is domain-dependent. Through close interaction between SCEC geoscientists and IT researchers, we will continually evaluate the SCEC Community Modeling Environment for consistency relative to the completeness and closure properties.

C.3.d. Interactive Knowledge Acquisition: A Pathway Assembly Tool

In the previous sections, we described technology that defines the modeling and simulation infrastructure. In this section, we turn our attention to how the end-user interacts with this infrastructure and describe a Pathway Assembly tool that will enable unsophisticated users to compose computational pathways involving complex simulations. Targeted users might include building designers and engineers, emergency preparedness officials, or insurance companies. Since these users are not programmers or knowledge engineers, the tool needs to provide an easy-to-use acquisition interface that guides them to (1) select from a library of simulation models those that will address their requirements, (2) set up the input parameters required by the simulation code, taking into account constraints specified by the model developers, and (3) coordinate interacting constraints across models, as described in §C.2.c. The major components of this Pathway Assembly tool are shown in Figure 6 and described in the rest of this section.
For the work on this proposal, we will cast the composition of the computational pathway as a planning problem [46, 47]. The user will be prompted for requirements in terms of the desired intensity measures and accuracy of the results of the simulation, as well as the computational resources available. These will be turned into planning goals and resources. Pathway templates will be considered plan decomposition fragments that the user weaves together and instantiates. The Pathway Assembly tool will also include a suite of Pathway Construction Strategies that capture typical requirements and dataflow interactions among models within a pathway. Inputs and outputs of the individual components and their associated constraints will be expressed as required preconditions and expected effects. For example, this will enable the tool to detect that one of the inputs required by a ground-motion model is an earthquake forecast model, which can only be generated by an earthquake forecasting simulation. In contrast with other planning tools, our system will be able to exploit knowledge-rich structures that describe tasks and goals [48, 49, 50, 51]. The Pathway Assembly tool will be developed as a plan-authoring environment in which the tool helps the user by checking that the preconditions, effects, goals, and resources are adequately handled [52, 53]. This planning framework fits well with the paradigm of the Grid execution environment with respect to handling resources, scheduling, and bringing up the need for replanning when the authored plan is not feasible.

We will build on previous research on the EXPECT architecture for knowledge analysis and interactive acquisition [54, 55, 56, 57], which has been used to acquire planning knowledge.
The central thesis of EXPECT is that if the tool understands how individual pieces of knowledge relate to each other, then it can understand how new knowledge fits and what additional knowledge is missing and, as a result, guide users in adding it. To this end, EXPECT analyzes a knowledge base and automatically derives Interdependency Models between problem-solving knowledge (procedures to achieve goals and tasks), ontologies (object models), and factual knowledge (data). Interdependency Models can be used to plan acquisition dialogues that organize the interaction with users into meaningful topics and sequences of related questions. EXPECT has been used to develop an interactive tool for acquisition of planning constraints, ranging from simple value preferences and bounds to complex procedural constraints. For example, geophysicists can express simple constraints such as "compute the Coulomb stress change for all faults parallel to the San Andreas Fault", or more complex constraints such as "compute the maximum Coulomb stress change seen on all faults with a strike within 30 degrees of the local strike of the San Andreas fault, and a dip steeper than 60 degrees, knowing that only through-going faults breaking the entire seismogenic zone contribute significantly to the hazard change." EXPECT has been used for other planning domains including air-campaign planning, special operations, and logistics. EXPECT's approach to knowledge acquisition has been evaluated successfully in diverse domains with users such as Army officers, project assistants, and non-CS students. Building on the ideas in EXPECT, we recently developed KANAL, a prototype tool for checking the correctness of manually developed process models and plans by exploiting Interdependency Models [58]. 
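The precondition-and-effect checking that the Pathway Assembly tool is meant to perform can be sketched in a few lines. This is an illustration under stated assumptions and not EXPECT or KANAL code. The `Component` record, the data-product names, and `check_pathway` are hypothetical, and preconditions are reduced to the presence of named data products.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical component profile reduced to inputs and outputs."""
    name: str
    requires: frozenset  # preconditions: data products that must exist
    produces: frozenset  # effects: data products made available


def check_pathway(steps, initial_data, goals):
    """Walk the steps in order and report unmet preconditions and goals."""
    available = set(initial_data)
    problems = []
    for step in steps:
        for item in sorted(step.requires - available):
            problems.append(f"{step.name}: missing input '{item}'")
        available |= step.produces
    for goal in sorted(set(goals) - available):
        problems.append(f"goal '{goal}' is not produced by any step")
    return problems


forecasting = Component(
    "earthquake-forecasting",
    frozenset({"fault-model"}),
    frozenset({"earthquake-forecast"}),
)
ground_motion = Component(
    "ground-motion",
    frozenset({"earthquake-forecast", "velocity-model"}),
    frozenset({"intensity-measure"}),
)
```

Calling `check_pathway([ground_motion], {"velocity-model"}, {"intensity-measure"})` reports that the ground-motion step lacks an earthquake forecast, which is the kind of gap the tool would point out to the user. Adding the forecasting step first, with a fault model as initial data, yields no problems.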
The main research problems to be addressed by this tool are: (1) how to enable users who are not experts in programming or knowledge engineering to formulate computational pathways that are ready for the execution environment; (2) how to hide from users the details of the models of the software components and their constraints, while enabling them to understand enough to decide which alternative simulation models to use; and (3) how to design plan-authoring tools that take into account scheduling, execution failures, and replanning issues brought up by the Grid execution environment. Although many issues related to software modeling, reuse, and configuration may arise during this work, the tool that we are proposing to build will be highly focused on earthquake applications. Similarly focused research projects for configuring software have been successful for applications such as image processing, engineering design, and planning interplanetary science missions [59, 60, 61, 62]. We believe that the knowledge-rich community models will enable us to develop more sophisticated and capable environments that will make the pathway-assembly process accessible to end-users.

C.3.e. Integration Benefits

The synergistic collaborations among different computer science disciplines will enable advances in each of them that would not be possible without these collaborations:
In addition to integration across information technology efforts, there must also be a tight coupling between information technology activities and the earthquake science. To help ensure that this takes place, we have structured the effort so that each major institution has participants in both information technology and earthquake science. We have also included funds in the budget to support an Annual SCEC/IT Workshop that brings larger groups within each field together, as well as smaller tutorial workshops to cross-train students in computer science and geoscience.
Section C.4: Milestones and Schedule
Created in the SCEC |
Last modified: January 11 2008 14:41 |
© 2010