University of Southampton

An Architecture for Management of Large, Distributed, Scientific Data

Volume 1 of 1

Mark Papiani

Doctor of Philosophy

Faculty of Engineering and Applied Science
Department of Electronics and Computer Science

This thesis was submitted in May 2000.

University of Southampton
ABSTRACT
Faculty of Engineering and Applied Science
Electronics and Computer Science
Doctor of Philosophy
An Architecture for Management of Large, Distributed, Scientific Data
Mark Papiani

This thesis describes research into Web-based management of non-traditional data. Three prototype systems are discussed, GBIS, DBbrowse and EASIA, each of which provided examples of new ideas in this area.

In 1994/1995, when most Web pages consisted of static HTML files, GBIS (the Graphical Benchmark Information Service) [181] [117] demonstrated the benefits of interactive, dynamic Web pages for visualisation of scientific data. GBIS also highlighted problems with storing the underlying data in a filesystem, which initiated an investigation into the use of databases as the underlying source for dynamic Web pages.

In 1996/1997 this research investigated automatic generation of generic Web interfaces to databases to facilitate rapid deployment of interactive Web-based applications by developers with little Web development experience. A prototype system, DBbrowse, demonstrates the results [180] [75]. DBbrowse can generate Web interfaces to object-relational databases with intuitive query capabilities. DBbrowse also demonstrates a method for browsing databases to further support users with little database experience.

In 1999 concepts from GBIS and DBbrowse were used as the starting point for examining new architectures for archiving scientific datasets.
Data from numerical simulations generated by the UK Turbulence Consortium was used as a case study. Due to the large datasets produced, new Web-based mechanisms were required for storage, searching, retrieval and manipulation of simulation results in the hundreds of gigabytes range. A prototype architecture and user interface, EASIA (Extensible Architecture for Scientific Data Archives) [182] [183], is described.

EASIA demonstrates several new concepts for active digital libraries of scientific data. Result files are archived in-place, thereby avoiding costs associated with transmitting results to a centralised site. The method used shows that a database can meet the apparently divergent requirements of storing both the relatively small simulation result metadata, and the large, distributed result files, in a unified, secure way. EASIA also shows that separation of user interface specification from user interface processing can simplify the extensibility of such systems.

EASIA archives not only data in a distributed fashion, but also applications. These are loosely coupled to the archived datasets via a user interface specification file that uses a vocabulary defined by a markup language. Archived applications can provide reusable dynamic server-side post-processing operations. This can reduce bandwidth requirements for requested data through server-side data reduction. The architecture allows post-processing to be performed directly without the cost of having to rematerialise to files, and it also reduces access bottlenecks and processor loading at individual sites.

Table of Contents

Table of Contents
List of Tables
List of Figures
Acknowledgements
Author's Declaration
1 Introduction
  1.1 Outline of Research Areas
  1.2 Structure of this Thesis
2 Database and Web Developments
  2.1 Database Developments
    2.1.1 Object-Relational and Object-Oriented Database Technology
    2.1.2 SQL:1999
    2.1.3 Parallel Databases
    2.1.4 Java Database Access
    2.1.5 Microsoft's Data Access Strategy
  2.2 Web Developments
    2.2.1 The Common Gateway Interface
    2.2.2 Web Server Extensions
    2.2.3 FastCGI
    2.2.4 Java and Java Applets
    2.2.5 Java Servlets
    2.2.6 Java Server Pages and Active Server Pages
    2.2.7 Distributed Object Technologies
    2.2.8 XML and Dynamic HTML
  2.3 Multi-tier Web/Database Connectivity
  2.4 Summary
3 The Graphical Benchmark Information Service
  3.1 Introduction
  3.2 GBIS Overview
  3.3 GBIS Implementation
  3.4 GBIS Result File Structure
  3.5 Updating the Results Database to include additional Machines and Manufacturers
  3.6 Conclusions
4 Automatically Generating Web Interfaces to Relational Databases
  4.1 Introduction
  4.2 Providing Web Access to the Database
  4.3 Automatically Generating the User Interface and SQL Queries
  4.4 Providing Database Browsing via Dynamic Hypertext Links Derived from Referential Integrity Constraints
  4.5 Example of a Database Browsing Session
  4.6 Conclusions
5 An Architecture for Management of Large, Distributed, Scientific Data
  5.1 Introduction
  5.2 System Architecture and User Interface
    5.2.1 System Architecture
    5.2.2 XML Specification of the User Interface
    5.2.3 Searching and Browsing Data
    5.2.4 Interface Customisation through XUIS Modification
    5.2.5 Suitable Processing of Data Files Prior to Retrieval: Operations
    5.2.6 Code Upload for Server-side Execution
    5.2.7 Administration Features
  5.3 Implementation and Design Decisions
    5.3.1 Experimental Bandwidth Measurements
    5.3.2 SQL Management of External Data: The New DATALINK Type
    5.3.3 Java Servlets and JavaScript
  5.4 Conclusions
6 Related Work
  6.1 Related Work on User Interfaces to Databases
    6.1.1 Introduction
    6.1.2 Stand-alone Graphical Query Interfaces to Databases
    6.1.3 Web-based User Interfaces to Databases
  6.2 Related Work on Web-based Management of Scientific Data
  6.3 Discussion
7 Summary
  7.1 Contributions to the Field
    7.1.1 GBIS
    7.1.2 DBbrowse
    7.1.3 EASIA
  7.2 Future Work
    7.2.1 Gathering Operation Statistics and Caching Results
    7.2.2 Providing a Multidatabase Capability
    7.2.3 Can Codes other than Java be Uploaded for Execution?
    7.2.4 Runtime Monitoring of Post-Processing Operations
    7.2.5 XML as a Scientific Data Standard
    7.2.6 Other Enhancements to the EASIA Architecture
  7.3 Concluding Remarks
Appendix A: Publications and Presentations
Appendix B: Client/Server Ping Benchmark Results
References

List of Tables

Table 1: Experimental bandwidth measurements for file transfer between two UK universities

List of Figures

Figure 1: A 2-tier architecture using a Java Applet and JDBC for database access.
Figure 2: A 3-tier architecture using a Java Applet, CORBA and JDBC
Figure 3: A 3-tier architecture using HTML/HTTP, Java Servlets and JDBC
Figure 4: Graph showing results of the Multigrid Benchmark.
Figure 5: Graph showing results of the LU Simulated CFD Application Benchmark.
Figure 6: GBIS manufacturer list page.
Figure 7: GBIS machine list page.
Figure 8: GBIS change defaults page.
Figure 9: Example contents of a GBIS result data file.
Figure 10: Interconnection strategy for providing Web accesses to a database.
Figure 11: Employee Activity database schema and relationships between entities.
Figure 12: Selecting tables of interest.
Figure 13: Selecting columns and specifying conditions.
Figure 14: Results from querying the DEPARTMENT table.
Figure 15: Browsing to find all employees in department number D11.
Figure 16: Browsing to show project activities for each employee.
Figure 17: Browsing to inline full project details.
Figure 18: Refining a query during the browsing stage.
Figure 19: Query results after refinement.
Figure 20: Displaying the SQL that generated the result.
Figure 21: System architecture.
Figure 22: Login screen.
Figure 23: Table selection screen.
Figure 24: Searching the archive.
Figure 25: Result from querying the SIMULATION table.
Figure 26: Sample database schema for UK Turbulence Consortium.
Figure 27: CLOB browsing.
Figure 28: DATALINK browsing.
Figure 29: Customised display of results from a query on the SIMULATION table.
Figure 30: Result table showing operations available for post-processing datasets.
Figure 31: Operation description and parameter input form.
Figure 32: Output from operation execution.
Figure 33: NCSA's SDB [243] has been specified as an operation in the XUIS and invoked on a dataset managed within the EASIA architecture.
Figure 34: User administration screen.
Figure 35: Security mechanism employed for uploaded post-processing codes.
Figure 36: The client/server ping benchmark.
Figure 37: Client/server ping benchmark results.

Acknowledgements

The UK Turbulence Consortium provided data for the EASIA research prototype. IBM's DB2 Scholars programme provided DB2 licenses.

Thanks to Tony Hey for employing me as a research assistant for 5 years. Thanks to Ed Zaluska for putting me in touch with Tony after reading my initial speculative employment enquiry to the University. Thanks to Denis Nicole for putting me in touch with the UK Turbulence Consortium. Thanks to David Walker and Kirk Martinez for acting as external and internal examiner for my viva.

I would like to thank my parents Rolando and Jenny for all their support during my on/off 34-year reign as a student! Thanks to my sisters Sandra and Lisa for their support and for helping me to buy birthday presents.

Thanks to the Lads in Bournemouth (Brett Colley, Dayle Colley, Andy Foote and Paul Brady) for dragging me out at weekends and accepting partial responsibility for this thesis taking so long. Thanks to Dave and Fleur (and Callum) who were with me at the start of my University days back in 1987. Thanks for your friendship over the years, and please be patient - I promise to ring soon.

My colleague at Southampton University (and partner in crime/gym), Alistair Dunlop, made the 5 years at Southampton the most fun I have ever had in a job. Thanks to Jasmin Wason for working with me after Alistair had left the University for the lure of industry. Thanks also to Jasmin for helping get my viva together.
Finally, special love and thanks to the special people who had to put up with me during this project - Sara Gibbs and Tanya Smith.

Author's Declaration

This thesis is almost entirely my own work, with the following caveats. The DBbrowse prototype of Chapter 4 was conceived and implemented in collaboration with my colleague Dr Alistair Dunlop at the University of Southampton. Jasmin Wason implemented some of the software for the EASIA prototype of Chapter 5 under my direction.

1 Introduction

This thesis describes research (during the period 1994 to 2000) into Web-based management of non-traditional data. In this thesis traditional data is defined as simple datatypes including integers, floating-point types, characters, dates, times and timestamps (effectively, the datatypes associated with the traditional relational data model; see Chapter 2). Non-traditional data is characterised by complex multimedia datatypes including text, audio, image and video, as well as binary files used for other purposes such as multidimensional scientific data (effectively, the datatypes associated with newer object-oriented and object-relational data models; see Chapter 2).

The Internet (particularly the Web) is having a dramatic effect on all walks of life, from commerce to education and research to leisure. Over the last few years the nature of the Web has been changing from a file-based, textual, static, insecure environment with dumb browsers to a database-based, multimedia, dynamic, secure environment with smart browsers.

Three prototype systems are discussed in this thesis, GBIS (the Graphical Benchmark Information Service) [181] [117], DBbrowse [180] [75] and EASIA (Extensible Architecture for Scientific Data Archives) [182] [183], each of which has provided exemplars of new ideas for non-traditional data management in the fast evolving Web environment.

1.1 Outline of Research Areas

This thesis describes research into Web-based management of non-traditional data concentrating on the following areas.

The importance of dynamic, interactive Web-based visualisation for scientific data repositories.

GBIS was an early system (1994/95) employing CGI (Common Gateway Interface) scripting [48] combined with standard application programs, to provide Web-based management of scientific data. GBIS was designed to manage non-traditional data in the form of textual output files from multiprocessor benchmark results. At a time when most Web pages consisted of static HTML (Hypertext Markup Language) [122] files, GBIS demonstrated dynamic Web pages for visualisation of scientific data. GBIS employed some of the first technologies available for dynamic Web pages with user interaction (such as the CGI and associated scripting using the Bourne Shell and PERL).

User interfaces that integrate the Web and object-relational databases.

At the ACM SIGMOD Conference in 1996, Manber suggested that one of the main lessons to be gained from the success of the Web was the importance of browsing [147]. He went on to say that an important step would be to find a way to browse even relational databases. The early part of this research involved a survey of database user interfaces. Existing techniques for browsing databases were studied.
Existing methods for Web/database connectivity were studied in detail. At the time, most existing systems required programming effort. One aim was to find an automated technique for connecting databases to the Web and for searching and browsing the data via a Web-based user interface. DBbrowse (1996/1997) was the result of research into automatic generation of generic Web interfaces to object-relational databases, to facilitate rapid deployment of interactive Web-based applications by developers with little Web development experience. DBbrowse demonstrated a novel method for browsing databases using the Web.

Architectures for active digital archives that can manage large, distributed scientific data.

The Internet allows for fast, effective scientific collaboration on a scale that has previously been impossible. It is now possible to transfer research results, in the form of scientific papers, result files or metadata describing experiments, in seconds or minutes to worldwide locations. Advances in computing technology, such as larger, cheaper storage and faster processing, have affected the type of data that can be manipulated, allowing, for example, much larger raw result data to be generated and exchanged.

Additional motivation for this research came from the Caltech Workshop on Interfaces to Scientific Data Archives [235], which identified an urgent need for infrastructures that could manage and federate active libraries of scientific data. Hawick and Coddington [112] define active data archives as follows: "An active data archive can be defined as one where much of the data is generated on-demand, as value-added data products or services, derived from existing data holdings." They also state that the information explosion has led to a very real and practical need for systems to manage and interface to scientific archives.

Treinish [217] [218] presents ideas for interactive archives for scientific data. He believes that the capability to produce data is growing much faster than the ability to manage data. In his view, typical storage and communications protocols are not suitable for current archives, and there must be a fundamental change from static archives that provide bulk data access to dynamic, interactive systems. Data volumes are too large for practical examination, and the starting point for locating relevant data should consist of searching metadata that provides abstractions of the archive. Treinish believes that browsing is extremely important for data selection. He suggests that large datasets can be represented by much smaller visual representations that allow browsing to identify features of interest. He concludes that future research should include, first, experimentation with compression techniques to aid visual browsing and, second, the design of architectures for interactive archives. These architectures could include the integration of metadata, data servers, visual browsing and existing data analysis tools.

The final architecture and prototype implementation described in this thesis, EASIA, is an architecture for an active digital archive that can reduce bandwidth requirements for Web-based management of large, scientific datasets. The first aim for this architecture was to extend the automated Web/database connectivity techniques developed in DBbrowse to the types of scientific data management problems described above. Browsing techniques from DBbrowse could be used to facilitate navigation through metadata associated with scientific datasets to identify potential datasets of interest. Beyond this, EASIA provides features that meet Treinish's requirements for the architecture of future interactive scientific archives, namely integration of metadata, data servers, and existing data analysis applications.

Scientific data management introduces new problems associated with storage and retrieval of large, often unformatted, files in an environment where bandwidth is limited. EASIA provides new mechanisms for Web-based storage, searching, retrieval and manipulation of scientific datasets in the hundreds of gigabytes range. EASIA demonstrates several new concepts for active digital libraries of scientific data. EASIA archives data in a distributed fashion so that large datasets can be archived at (or close to) the sites where they are generated, in order to eliminate the costs associated with transfer to a centralised repository. EASIA also archives applications. Archived applications can provide reusable dynamic server-side post-processing operations that can reduce bandwidth requirements for requested data. Post-processing can also be achieved by allowing users to upload code to be run securely on the file servers hosting the datasets.

Research and development of the GBIS, DBbrowse and EASIA prototypes has required comprehensive knowledge of Web and database technologies, and of related work. This thesis therefore contains significant critical review in these areas.

Finally, it is worth noting that although successive prototypes use different technologies, owing to the emergence of improved solutions in this fast changing field, wherever possible the prototypes have been implemented using commodity components, technologies and open standards.

1.2 Structure of this Thesis

The rest of this thesis is structured as follows:

Chapter 2: Database and Web Developments
This chapter describes the state-of-the-art in database and Web technologies. The purpose of this research was to provide a critical survey of available technologies and to assess the best ways to implement the architectures developed during this research.

Chapter 3: The Graphical Benchmark Information Service
This chapter provides a detailed description of the research surrounding GBIS.

Chapter 4: Automatically Generating Web Interfaces to Relational Databases
This chapter provides a detailed description of the research surrounding DBbrowse.

Chapter 5: An Architecture for Management of Large, Distributed, Scientific Data
This chapter provides a detailed description of the research surrounding EASIA.

Chapter 6: Related Work
This chapter reviews related work on user interfaces to databases. Web-based interfaces to databases are compared to DBbrowse, particularly in terms of techniques for data browsing. The second part of this chapter discusses related research in the area of Web-based management of scientific data, providing comparisons with the EASIA architecture.

Chapter 7: Summary
This chapter provides a summary, ideas for future related research and closing remarks.

2 Database and Web Developments

This chapter provides an overview of developments in database and Web technology over the last few years. This survey was carried out to understand the current state-of-the-art in these fields and then to assess the suitability of different technologies for implementation of the systems to be developed during this research. The chapter is split into three main sections covering database developments, Web developments, and multi-tier Web/database connectivity.

2.1 Database Developments

This section describes why object-relational database technology was preferred to object-oriented database technology.
Sections on SQL:1999, parallel databases and Java database access follow. Object-relational databases and SQL:1999 were used in the DBbrowse and EASIA architectures, and Java database access mechanisms were used in the implementation of EASIA. A separate section on Microsoft's data access strategy is included for completeness, since Microsoft promotes different technologies to those being used by most of the other database vendors.

2.1.1 Object-Relational and Object-Oriented Database Technology

Reasons for Choosing Object-Relational Technology for this Research

A primary requirement for the database technology selected for this research was that it could support both traditional data types and non-traditional data, such as the large binary data files often used for scientific datasets. Object-oriented database management systems (OODBs) and object-relational database management systems (ORDBs) both provide this capability. ORDBs were chosen for this research for a number of reasons that are described below.

First, ORDBs support the standardised Structured Query Language (SQL) [59] as well as providing metadata, which defines, amongst other things, the database schema. These two features make it possible to build generic, schema-driven, dynamic interfaces to databases. This was a requirement for both the DBbrowse (Chapter 4) and EASIA (Chapter 5) prototypes.

Second, ORDBs support a security model as a fundamental function. This feature is lacking in OODBs.

Third, parallel versions exist for all the major ORDBs, allowing migration to high performance parallel architectures if necessary.

Fourth, despite around ten years of marketing and attempts to standardise their features, OODBs remain a niche technology with a correspondingly increased risk associated with their usage. All of the market leaders associated with relational database management systems (RDBs) now offer ORDB products.
These include IBM DB2 Universal Database [65] [42], Oracle8 [174], Informix Dynamic Server [128], Sybase Adaptive Server [214] and Microsoft SQL Server 7 [206]. With all the major database vendors supporting object-relational (OR) technology it seems likely that this will remain the dominant database technology.

The Progression of Object-Oriented Database Standardisation

Carey [30] discusses the progress that OODBs have made and why their impact has not lived up to expectations. He concludes that we are on the verge of an era where ORDBs will begin taking over the enterprise.

Carey gives some of the reasons for the early success of RDBs (the foundation of all ORDBs) and contrasts these with the progression of OODBs. In the early days of RDBs there was a single, clearly defined data model based on sets of tuples with simple attributes. Similarly, SQL emerged early on as the query language for RDBs. Development of OODBs has been very different, with no initial agreement on the details of the data model and no query model or language.

In the early 1990s a consortium of OODB vendors formed the Object Data Management Group (ODMG, formerly known as the Object Database Management Group) [168] to address these problems. An Object Database Standard, ODMG-93, was first released in 1993, with Release 1.2 following in 1996 [37]. This contains several chapters which define: the ODMG object model (which is an extension of the Object Management Group (OMG) [170] data model); an object definition language (ODL); an object query language (OQL); and bindings to C++ and Smalltalk for all functionality, including object definition, manipulation, and query.
Release 2.0 of the standard, ODMG-97, also included a Java language binding [38] (refer to Section 2.2.4 for information on Java). (Release 3.0 of the Standard was published this year [39]. Unfortunately, as with the previous version, this standard is available for purchase in book format only and is not available for free download. This probably hinders widespread knowledge of the standard.)

Although the standard has been out in some form for about 6 years, there are still differences between many of the OODB products in terms of their programming interfaces and query support. Many vendors conform only to parts of the specification, corresponding to individual chapters in the standard. Variations in the level of support for standards across different OODBs made this technology unsuitable for underpinning the standards-based, vendor-independent prototype scientific data archiving systems that were investigated during this research.

The lack of sufficient standardisation between OODBs is not solely an implementation issue. The standard itself is still not comprehensive in some areas. For example, the Java binding does not yet support certain features of the ODMG Data Model such as extents, keys, relationships and access to metadata (see for example [43]). A further barrier to portability amongst Java bindings from different vendors is the fact that the mechanism for identifying persistence-capable classes is not specified. A Java OODB application can contain both persistent and transient objects of the same class (this is known as orthogonal persistence, where persistence is independent of class). The standard states that a transient object belonging to a persistence-capable class can be made persistent if it is bound to a name within the database or if it is referenced by another persistent object. This is known as persistence by reachability.
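
Persistence by reachability, as just described, can be illustrated with a minimal sketch. This is a toy in-memory model written in Python, not the ODMG Java binding or any vendor's API; the names `ToyObjectStore`, `Department`, `Employee` and `PERSISTENCE_CAPABLE` are hypothetical and chosen only for illustration. An object counts as persistent if it is bound to a name in the store, or if it can be reached by following references from such a named object.

```python
class Department:
    def __init__(self, name):
        self.name = name
        self.employees = []  # references to Employee objects


class Employee:
    def __init__(self, name):
        self.name = name


# Hypothetical registry of persistence-capable classes. How this set is
# declared is exactly what the standard leaves unspecified.
PERSISTENCE_CAPABLE = (Department, Employee)


def _references(obj):
    """Yield persistence-capable objects directly referenced by obj."""
    for value in vars(obj).values():
        items = value if isinstance(value, (list, tuple, set)) else [value]
        for item in items:
            if isinstance(item, PERSISTENCE_CAPABLE):
                yield item


class ToyObjectStore:
    """Toy stand-in for an object database (illustrative only)."""

    def __init__(self):
        self._roots = {}

    def bind(self, name, obj):
        # Binding a name makes obj a persistent root.
        self._roots[name] = obj

    def is_persistent(self, obj):
        # Traverse everything reachable from the named roots.
        seen = set()
        stack = list(self._roots.values())
        while stack:
            current = stack.pop()
            if id(current) in seen:
                continue
            seen.add(id(current))
            stack.extend(_references(current))
        return id(obj) in seen


store = ToyObjectStore()
dept = Department("D11")
alice = Employee("alice")
bob = Employee("bob")        # same class as alice, but never referenced
dept.employees.append(alice)
store.bind("research", dept)
```

In this sketch `dept` is persistent because it is named, `alice` is persistent because `dept` references her, and `bob` stays transient even though he belongs to the same class, which is the orthogonal persistence property mentioned above.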
As an example of the divergence in how persistence-capable classes are identified, the OODB from POET [188] currently uses a configuration file to indicate persistence-capable classes. This works in conjunction with a pre-processor for the Java source files, which extracts required information from the files and stores it in a dictionary, and then calls the standard Java compiler. The Objectivity/DB OODB [169], on the other hand, determines persistence-capability by requiring that such classes inherit from a specified superclass.

OODBs are designed to add persistence to objects within an object-oriented (OO) programming language. Despite an OODB standard, language differences make it very difficult to port OODB applications between languages. Even for non-database applications written in the same language, portability between compilers from different vendors is sometimes an issue, which is magnified once the OODB environment is added.

Architectural differences associated with page servers (dumb server with intelligent fat clients) versus object servers (thin clients with an intelligent server) also complicate portability issues amongst OODBs [43].

Object-Relational: Combining the Benefits of Object-Oriented and Relational Technology

Kim [139] is an advocate of OR technology. He reviews the promises of OODBs, examines the reality of these systems, and concludes by discussing how their promises may be fulfilled through unification with relational technology.

Kim begins by highlighting the following advantages that OODBs have over RDBs.

Within the relational model, the responsibility for defining and displaying the structure of the data typically lies with the application. The application must impose the object structure on the data from the flat generalised relational structure. OODBs, on the other hand, can directly model data as structured objects.
For example, hierarchical data (or complex nested data) must be represented as tuples in multiple relations in an RDB. OODBs allow the data type of an attribute to be a primitive type or an arbitrary user-defined type (UDT). Row (or composite) types and multi-valued collection attributes (sets, bags, arrays and lists) are also available. This nested object representation allows hierarchical data to be naturally represented. This could, for example, be used for an engineering bill of materials where an item list for an assembly may contain several sub-assemblies. The ability to represent this structure as nested objects avoids the need for tuples in multiple tables, which involves expensive joins for retrieval. Reference types can allow simpler query paths that avoid complex value-based joins, using pointer-based navigation instead.

RDBs offer a set of primitive, built-in data types with no means of adding UDTs. OODBs allow complex unstructured data to be stored as UDTs. Furthermore, new data types may be created as new classes, possibly even as subclasses of existing classes, inheriting their attributes and methods.

Although RDBs offer stored procedures (programs written in a procedural language and stored in the database for later loading and execution), these are not encapsulated with data. Further, since RDBs do not have an inheritance mechanism, the stored procedures cannot automatically be reused. OODBs overcome these problems and have the potential to reduce the difficulty of designing large complex databases and applications. Inheritance and encapsulation make database design and application program reuse possible.

However, most OODBs still lack basic database features that the users of RDBs have come to expect. These features include a full non-procedural query language, views, dynamic schema changes and parameterised performance tuning. Added to this, the robustness, scalability and fault tolerance of OODBs do not match those of more mature RDBs.
Kim recommends combining features from the OO and relational models in support of the OR model. This combination makes it possible to support UDTs, dynamic schema changes, SQL, triggers, constraints, automatic query optimisation, and views as a unit of authorisation.

Current OR products support features such as UDTs, user-defined functions (UDFs), triggers and enhanced integrity constraints. UDTs allow the set of built-in types to be extended with new data types such as text, image, audio, video, time series, line, point, polygon, etc. There are variations in the capabilities of UDTs amongst the different vendors. The most basic form of UDT is a renamed or distinct base type, which can be used to add stronger typing. Beyond this, UDTs can represent row types or full abstract data types (ADTs) that encapsulate arbitrarily complex structures and attributes. UDFs can apply to both base types and UDTs to define methods by which applications create, access and manipulate data. Vendors are beginning to market UDTs in type extension packages; for example, Informix's Datablades and IBM's Relational Extenders provide packages for spatial and time-series data.

The features and advantages of the OR model made it the natural choice for the EASIA scientific data archive (Chapter 5). Ferreira et al. [92] also support OR technology for scientific data management. They state that an important subset of scientific applications falls into the "complex data with queries" category (as defined in Stonebraker's classification matrix for DBMS applications [211]) and can therefore be supported by object-relational database management systems.

Support for Object-Oriented Databases

Celko and Celko [40] and Bloom [20] provide an alternative viewpoint to that given so far. Both are supportive of OODB technology.
Bloom believes that the trend towards increasingly complex OO applications, particularly multi-tier distributed object systems, will require more database functionality and that RDBs will increasingly give way to OODBs.

Celko and Celko believe that both models have a place. They state that rather than trying to fit data to a database model, a database model should be chosen according to the type of data and expected access patterns. For simple data, RDBs provide a proven high-performance solution for both simple queries (such as those associated with transaction processing systems) and complex queries (such as those associated with data mining). For complex data, or data involving complex relationships, OODBs provide a better solution. Internet-based multimedia solutions tend to fall into this second category, suggesting an increasing demand for OODBs.

OODB vendors have been particularly quick to enhance their existing products, or to produce new product ranges, with XML (Extensible Markup Language, see Section 2.2.8) capabilities. They argue that OODBs are a natural match for structured, nested, richly linked information and are therefore capable of storing XML data in native form as objects rather than having to disassemble the data into tabular form [120] [205]. (In fact, the market-leading OODB vendor, until now known as Object Design, has recently rebranded itself as Excelon [88] after one of its XML products, and is concentrating on dynamic XML-based business-to-business (B2B) commerce.)

Despite these claims, during this research ORDBs proved entirely adequate for representing complex data from numerical simulations. This data was stored in native binary formats in non-traditional data types.
Advantages of the OO data model might have been more apparent if the scientific datasets needed to be stored at a lower level of granularity, for example, in order to make individual elements of a multi-dimensional array directly accessible through a standard database query.

However, a complex representation of relationships between data items, using for example nested objects, carries with it its own set of disadvantages. For example, once an object is nested inside another object, a mechanism for separating these objects may be required to maintain efficient delivery when the whole object is not required. Also, if an existing nested class is needed in a new class definition, then the nested class definition needs to be repeated in the new class. Often, the solution to both of these problems is to separate objects, and to store links between them in the form of references. However, this leads increasingly to a one-to-one correspondence between classes in OODBs and tables in ORDBs. Furthermore, Ensor and Stevenson [85] report that they have rarely (if ever) experienced performance problems with the use of foreign key (see Section 4.4) links to navigate a parent-child relationship. Indeed, if the actual value of the foreign key is required then a reference-based model exhibits the disadvantage that the reference must be navigated to obtain that value.

Additional Resources

The book by Ullman and Widom [220] covers the latest database standards (for OO and OR technology) including OQL, ODL, SQL2, and SQL3 (now SQL:1999, see Section 2.1.2) with explanations of how to design databases for both models using ODL and entity-relationship modelling [44].

Currently ORDB products do not provide complete object capabilities in terms of encapsulation, inheritance, polymorphism, object IDs and pointer-based navigation.
However, the future direction of ORDBs is to achieve many of the benefits of the object model. The emerging SQL:1999 standard, described in the next section, incorporates these features. For ORDBs that do support the majority of the new object features, such as row types, inheritance, references, path expressions and UDTs, the BUCKY (Benchmark of Universal or Complex Kwery Ynterfaces) Object-Relational Benchmark [31] was designed to test the performance of these new features. The benchmark provides an OR version of BUCKY and a semantically equivalent relational schema and queries so that the performance of the OO features can be compared with traditional RDB functionality.

SQL:1999

The prototype systems created during this research benefited in particular from several new OR features that form part of the emerging SQL:1999 Standard (formerly known as SQL3, see for example [79]). For example, DBbrowse and EASIA both made extensive use of non-traditional, Large OBject (LOB) data types described in Part 2 of the Standard. EASIA also uses JDBC, which is discussed in Part 10 of the Standard (being the technology upon which SQLJ is layered, see Section 2.1.4.2). EASIA also used SQL/MED, described in Part 9 of the Standard, at the foundation of its architecture (see Section 5.3.2).

SQL:1999 enhances SQL2 (also known as SQL-92) [59] into a computationally complete language for the definition and management of persistent, complex objects. SQL:1999 includes generalisation and specialisation hierarchies, multiple inheritance, user-defined data types, triggers and assertions, support for knowledge-based systems, recursive query expressions, and additional data administration tools.
It also includes the specification of ADTs, object identifiers, methods, inheritance, polymorphism, encapsulation, and all of the other facilities normally associated with object data management.

In 1993, the ANSI and ISO development committees decided to split future SQL development into a multi-part standard. Currently there are 9 parts:

Part 1: SQL/Framework (ANSI/ISO/IEC 9075-1-1999) - A non-technical description of how the document is structured.
Part 2: SQL/Foundation (ANSI/ISO/IEC 9075-2-1999) - The core specification, including all of the new ADT features.
Part 3: SQL/CLI (Call Level Interface) (ANSI/ISO/IEC 9075-3-1999)
Part 4: SQL/PSM (Persistent Stored Modules) (ANSI/ISO/IEC 9075-4-1999) - The stored procedures specification, including computational completeness.
Part 5: SQL/Bindings (ANSI/ISO/IEC 9075-5-1999) - The Dynamic SQL and Embedded SQL bindings taken from SQL-92.
Part 6: SQL/Transaction - An SQL specialisation of the popular XA Interface developed by X/Open.
Part 7: SQL/Temporal - Adds time-related capabilities to the SQL standards.
Part 9: SQL/MED - Management of External Data (see Section 5.3.2).
Part 10: SQL/OLB - Object Language Bindings (see Section 2.1.4.2).

Part 8 existed at one time under the informal name SQL/Object, but its material was incorporated into Part 2. ISO also accepted a recommendation to cancel the project under which Part 6 was being developed. The rationale for the cancellation was that the working draft had not been changed since about 1995 and nobody seemed to be interested in publication of the material in question any more.

In the USA, the entirety of SQL:1999 is being processed as both an ANSI domestic project (the X3H2 committee covers Database and includes SQL) and as an ISO project (ISO/IEC JTC1/SC21/WG3 DBL).
The ISO standards lifecycle requires that every proposal for a standard starts life as a Working Draft (WD), progresses to Committee Draft (CD), then to Final Committee Draft (FCD), followed by Draft International Standard (DIS), and finally International Standard. Eisenberg and Melton [81] report on the status of each part of SQL:1999 as at March 2000. Parts 1 to 5 are International Standards, with Part 10 expected to reach International Standard in 2000 (currently at DIS ballot stage), Part 9 in 2001 (currently at FCD ballot stage) and Part 7 in 2003. For the next generation of the SQL standard (after SQL:1999), Part 5 has been eliminated by merging its contents into SQL/Foundation, and a new Part 11, SQL/Schemata, has been created to hold the Information and Definition Schema specifications that were removed from SQL/Foundation [81].

In addition to the SQL:1999 work, a number of related projects are being pursued, including SQL/MM. Approved in early 1993, this is a new ISO/IEC international standardisation project (within WG3) for development of an SQL class library for multimedia applications. This multi-part standard will specify packages of SQL ADT definitions using the facilities for ADT specification and invocation provided in the emerging SQL:1999 specification. SQL/MM intends to standardise class libraries for science and engineering, full-text and document processing, and methods for the management of multimedia objects such as image, sound, animation, music, and video.

The object management features of SQL:1999 incorporate many of the features of OODBs. In view of this, a merger group [150] was formed with participation from ANSI X3H2 and the ODMG, with the intent of merging the ODMG's OQL query language with SQL:1999. OQL would then form a read-only subset of SQL:1999, since OQL does not include INSERT, UPDATE and DELETE, preferring to implement these through method invocation.
Obtaining status information on the SQL Standards, and copies of the Standards themselves, can be difficult. In the past the standards were available for purchase only, with SQL-92 costing around $295. Now the first 5 Parts of SQL:1999 (those that have reached the level of International Standard) are available for electronic download from both the ANSI Electronic Standards Store [12] and the NCITS Standards Store [161], at a price of $20 per part. There is, however, no official Web site that details the status of the emerging Parts of the SQL Standard. A document repository for ISO/IEC JTC1/SC21/WG3 is available at [130], although it is very difficult to navigate. Books are also beginning to emerge. Gulutzan and Pelzer [106] provide coverage of the first 5 parts of the standard and Fortier [97] provides information on SQL/Foundation. The Web site at [207] is outdated but still contains some useful information.

Parallel Databases

The availability of powerful, relatively inexpensive commodity CPU chips and other computer components now means that multi-processor computing offers mainframe or better-than-mainframe performance at a much lower cost than traditional hardware. As such, few large-scale data-processing projects are undertaken without first evaluating parallel technology. For database vendors, it is essential that their database management systems exploit multi-processor hardware platforms if they are to survive in the commercial database market place. Indeed, all the major vendors have parallel relational database products, including Oracle, IBM, Informix, Sybase, Tandem and Teradata. A comprehensive features-based comparison of parallel database management systems and hardware platforms for these systems is available in a report by Bloor Research carried out in 1995 [166].
Parallel database systems are often categorised by the way they share hardware resources such as memory or disk. Three categories can be distinguished: shared-memory (SM), shared-disk (SD) and shared-nothing (SN). In SM systems all processors share all disks and all memory. In SD systems, each processor has its own private memory but all processors have access to all disks. Oracle 7 is an example of a DBMS that uses an SD environment. In SN systems each processor has its own private memory and disks. The majority of commercial parallel databases fit into this category.

There have been many debates as to which architecture is most suited to parallel databases. In 1992 DeWitt and Gray [67] asserted that the shared-nothing architecture had emerged as the consensus for parallel and distributed system architecture. In an earlier paper Stonebraker [210] concluded that SN systems would have no apparent disadvantages when compared to alternative systems. Baru et al. [17] also expound the virtues of the SN architecture, perhaps not surprisingly, since they were the team responsible for IBM's DB2 Parallel Edition (now DB2 Universal Database Extended Enterprise Edition).

Arguments in favour of the SN architecture include a theoretically lower hardware cost due to commodity components and the ability to scale up to higher numbers of processors. Disadvantages include data skew, where data is not balanced across disks (and processors), and the need for distributed deadlock detection and a multiphase commit protocol. Complex software is required to split SQL statements into many subtasks to be executed on different processors and then to merge the results. Where possible this approach uses function shipping, that is, operations are performed where the data reside to reduce inter-processor communications. This architecture has an availability problem in the case of a disk or processor failure.
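The split-and-merge processing and function shipping described above for SN systems can be illustrated with a small Python sketch. It is not modelled on any particular product: the node_partitions data and the local_aggregate and merge functions are hypothetical, and plain Python lists stand in for the private disks of each node. Each node computes a partial (count, sum) over its own rows, so that only two numbers per node need to cross the interconnect, and a coordinator merges the partial results into a global average.

```python
def local_aggregate(rows):
    """Runs where the data resides: partial (count, sum) for one node."""
    return len(rows), sum(rows)


def merge(partials):
    """Coordinator step: combine per-node partials into a global average."""
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_sum / total_count if total_count else None


# Hypothetical values of one column, spread over the disks of three nodes.
node_partitions = [[4, 8], [15, 16, 23], [42]]

partials = [local_aggregate(rows) for rows in node_partitions]
average = merge(partials)
print(partials, average)
```

Shipping (count, sum) pairs rather than per-node averages is a deliberate choice in this sketch: averaging the three per-node averages would give the single-row node the same weight as the three-row node and would produce the wrong answer whenever the data is skewed.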
In practice, multiply attached disks and replication are used to mitigate this availability problem in SN systems, much the same as in an SD environment.

Rahm [191] and Valduriez [224] [223] have more recently advocated the benefits of shared-disk and shared-something architectures respectively.

The software required to provide parallel database processing is considerably less complex for SM due to the global memory address space. SM provides easy load balancing. Fast inter-processor communications are possible since these are carried out in memory. Potential disadvantages include a memory bottleneck, especially as the speed and the number of processors increase. Maintaining availability in the case of a memory fault is also more of a problem than with SN and SD. Hardware cost is potentially higher due to the need to link each processor to each memory module.

SD provides the possibility of easier load balancing and less communications overhead than SN, and nodes can more easily be partitioned for different functions, e.g. for complex query or transaction processing. Software is more complex than for SM due to the need for coordinated global locking and two-phase commit. Access to shared disk can become a bottleneck due to limited bus capacity.

Norman and Thanisch [165] propose that developments in technology now mean that distinctions based on hardware architecture are no longer so relevant when comparing performance of parallel architectures. Important factors include the way processes and threads are organised to cooperate in transaction and complex parallel query organisation and the sophistication of the optimiser. Nearly all of the commercial parallel database products now run on both Symmetric Multiprocessor (SMP) platforms, in which processors share memory, and Massively-Parallel Processor (MPP) platforms, in which processors have private memory.
This is a necessity since the trend in hardware design is to combine the two architectures so that an MPP platform can include multiple SMP nodes. Both MPP and SMP parallel relational database products are based on three simple techniques:

Table partitioning - this involves distributing the rows of a table across multiple disk drives. Three basic methods are used: hash, range and round robin.
Pipelining - operators are overlapped so that the results of one operator are incrementally sent to the next operator in the execution plan.
Partitioned execution - relational operators are replicated to increase the I/O bandwidth available through partitioned tables.

In a 1996 Object-Relational summit presentation, DeWitt [66] described extending parallelisation of RDBs to include ORDBs as a challenging area for current database research. The techniques used to parallelise RDBs are not adequate for parallelising ORDBs. Problem areas include row-valued attributes and collection attributes, which can lead to skewed data distributions, and storing and retrieving multimedia data, e.g. partitioning individual images. It remains to be seen how commercial DBMSs will meet these challenges.

Parallel databases provide transparent parallelism from the user's point of view. Applications that run on sequential DBMSs can run unmodified on parallel versions of the products. For efficient performance, database administrator effort is needed to adjust performance tuning parameters (such as memory allocation) and to decide on the best data partitioning strategy.

Whilst a parallel database was used as the underlying database for the DBbrowse prototype (described in Section 4.2), the capabilities of parallel databases were not exploited for any significant part of this research. Both the DBbrowse and EASIA prototypes were designed to support relatively simple queries that might be associated with access to metadata relevant to archived scientific data.
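Returning briefly to the three table partitioning methods listed earlier in this section (hash, range and round robin), each reduces to a small function that maps a row to a disk. The Python sketch below is a minimal illustration and does not describe how any commercial product assigns rows; the function names, the choice of CRC32 as the hash function and the example bounds are all assumptions made for the example.

```python
import zlib
from bisect import bisect_right


def hash_partition(key, n_disks):
    """Hash partitioning: a deterministic hash of the key picks the disk."""
    return zlib.crc32(str(key).encode("utf-8")) % n_disks


def range_partition(key, upper_bounds):
    """Range partitioning: disk i holds keys strictly below upper_bounds[i];
    keys at or above the last bound go to one extra, final disk."""
    return bisect_right(upper_bounds, key)


def round_robin_partition(row_number, n_disks):
    """Round robin: rows are dealt to the disks in turn, ignoring the key."""
    return row_number % n_disks


keys = [5, 120, 250, 99, 100]
print([range_partition(k, [100, 200]) for k in keys])    # [0, 1, 2, 0, 1]
print([round_robin_partition(i, 3) for i in range(5)])   # [0, 1, 2, 0, 1]
print([hash_partition(k, 3) for k in keys])
```

In this sketch round robin spreads rows evenly but gives no way of locating a row from its key, hash partitioning sends an exact-match lookup to a single disk, and range partitioning additionally keeps neighbouring key values together at the risk of an uneven distribution.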
EASIA stores the actual scientific datasets external to the database, which frees the database of the resource-intensive post-processing of scientific data (refer to Section 5.2.5).

Java Database Access

This section focuses on JDBC and SQLJ. Respectively, these provide dynamic and static SQL interfaces to relational databases from within the Java programming language. JDBC is used in the EASIA prototype due to the ad-hoc (i.e. dynamic) nature of the queries, posed by the scientific users, aimed at locating datasets of interest. EASIA also requires JDBC to discover schema information about the database at runtime.

Java Database Connectivity

The Java Database Connectivity (referred to as JDBC, although according to Sun this is a trademarked name, not an acronym [137]) specification [232] is supported by all the major database vendors and allows open database connectivity directly from within Java. JDBC was added to Java version 1.0.2 in 1996. It consists of an API (found in the java.sql package of the standard Java API [132]) that contains a few implemented classes and many database-neutral interface classes that specify behaviour without any implementation. Database vendors or other third parties provide the actual implementation of these interfaces in the form of JDBC drivers. Usually two initial statements in a JDBC application are used, firstly, to register a JDBC driver and, secondly, to open a database connection (using the DriverManager class from the JDBC API) with the previously registered driver. These two statements are often the only two that need to be changed to run the application against a DBMS from a different vendor. The JDBC API is similar in concept to Microsoft's Open Database Connectivity (ODBC) [172].
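The pattern described above, in which opening the connection is the only vendor-specific step and the schema is then discovered at runtime, has close analogues in other call-level interfaces. The sketch below uses Python's standard sqlite3 module rather than JDBC, purely so that it is self-contained and runnable; the simulation table and its columns are hypothetical and are not EASIA's schema. In JDBC the corresponding steps would go through DriverManager.getConnection and the DatabaseMetaData interface.

```python
import sqlite3

# Opening the connection is the only driver-specific step in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE simulation (sim_id INTEGER PRIMARY KEY, title TEXT, result BLOB)"
)


def discover_schema(connection):
    """Return {table_name: [(column_name, declared_type), ...]} at runtime."""
    schema = {}
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        columns = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


schema = discover_schema(conn)
print(schema)
```

A generic interface generator can iterate over such a structure to build query forms without any table or column names being compiled into the application.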
The JDBC standard is based on the X/Open SQL CLI (Call Level Interface) [62], the same basis as ODBC. Applications talking ODBC to relational servers have reduced the need for writing embedded SQL. A CLI application does not require precompilation or binding but instead uses a standard set of functions to execute SQL statements and related services at runtime. Traditionally, precompilers have been specific to a particular database product. This requires source code to be written and compiled for each database product. Also, embedded SQL applications have to be bound to a specific database before use. The CLI allows for portable applications that are independent of the database product and can be distributed in binary form. ODBC is, however, not appropriate for direct use from Java since it is a C interface. Indirect usage of ODBC from Java, using calls from Java to native C code, has many disadvantages in the areas of portability, security, implementation and robustness. JDBC is designed to reduce these problems: it allows applications to be written that are independent not only of the database product but also of the machine.

Whilst ODBC and JDBC are designed to provide database independence, true portability still resides with the application designer. A driver must support at least ANSI SQL-92 (also known as SQL2) Entry Level to be called JDBC Compliant. This gives applications that want wide portability a lowest common denominator. However, JDBC allows any query string to be passed to the underlying database, so that an application can use any database-specific commands available, at the expense of reduced portability. Currently, the JDBC specification also requires that selected semantics from the ANSI SQL-92 Transitional Level must be supported by drivers written for databases that support the SQL-92 Transitional Level.
In view of this, JDBC supports a DBMS-independent escape syntax for stored procedures, scalar functions, dates, times and outer joins. A driver must convert the escape syntax into a DBMS-specific syntax. The escape syntax is generally different to the SQL-92 syntax for the same functionality. In cases where all of the target DBMSs for an application support SQL-92 syntax, the application designer can use this syntax.

Finally, on the subject of portability, there are a number of JDBC metadata interfaces that provide information on the functionality of the target database. The application designer can use this metadata to provide different execution paths for databases that support different levels of SQL compliance. For example, the application designer can query the metadata to find out if the target database supports some form of outer join, and implement this in a different way if the database does not.

JDBC drivers may also support the JDBC Standard Extension API [233]. This includes support for the Java Naming and Directory Interface (JNDI), connection pooling, distributed transactions and rowsets. The JNDI can be used in addition to the JDBC driver manager to manage data sources and connections, which allows the application to be independent of a particular JDBC driver and JDBC URL. A rowset encapsulates a set of rows and may or may not keep an open database connection. The specification discusses several different types of rowsets. A disconnected rowset allows off-line updates to be performed and propagated to the underlying database using an optimistic concurrency control algorithm. Rowsets also add support for the Java Beans component model. A rowset object is a Java Bean and may be serialised. It is therefore a suitable container for tabular data that can be passed between different components of a distributed application.

JDBC can be used for any database system for which a driver exists.
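The optimistic concurrency control used when a disconnected rowset propagates off-line updates can be sketched generically. The example below is an assumption-laden illustration and not the algorithm given in the JDBC Standard Extension specification: it uses Python's standard sqlite3 module, a hypothetical dataset table, and hypothetical helper functions fetch_disconnected and propagate. Each cached row remembers the value that was originally read, and an update is applied only if the database still holds that value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO dataset VALUES (1, 'channel flow')")
conn.commit()


def fetch_disconnected(connection):
    """Copy rows into memory, remembering the value originally read."""
    rows = connection.execute("SELECT id, title FROM dataset").fetchall()
    return {row_id: {"original": title, "current": title} for row_id, title in rows}


def propagate(connection, cached):
    """Write back changed rows; return the ids whose update hit a conflict."""
    conflicts = []
    for row_id, row in cached.items():
        if row["current"] == row["original"]:
            continue  # nothing was edited off-line
        cursor = connection.execute(
            "UPDATE dataset SET title = ? WHERE id = ? AND title = ?",
            (row["current"], row_id, row["original"]),
        )
        if cursor.rowcount == 0:  # the row changed after it was read
            conflicts.append(row_id)
    connection.commit()
    return conflicts


cached = fetch_disconnected(conn)
cached[1]["current"] = "turbulent channel flow"   # off-line edit
first = propagate(conn, cached)

# A second, stale copy still believes the title is 'channel flow'.
stale = {1: {"original": "channel flow", "current": "pipe flow"}}
second = propagate(conn, stale)
print(first, second)
```

No locks are held while the data is disconnected; a conflict is detected only at propagation time, which is what makes the scheme optimistic.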
Driver support is not restricted to RDBs but includes ORDBs and even non-relational technology such as IBM's IMS. At the present time there are different ways to implement drivers, each of which fits into one of four categories [136]:

1. The JDBC-ODBC Bridge provides JDBC access via most ODBC drivers. Note that some ODBC binary code and in many cases database client code must be loaded on each client machine that uses this driver, so this kind of driver is most appropriate on a corporate network, or for application server code written in Java in a 3-tier architecture.
2. A native-API partly-Java driver converts JDBC calls into calls on the proprietary client API. Note that, like the bridge driver, this style of driver requires that some binary code be loaded on each client machine.
3. A net-protocol all-Java driver translates JDBC calls into a DBMS-independent net protocol, which is then translated to a DBMS protocol by a server. This net server middleware is able to connect its all-Java clients to many different databases. The specific protocol used depends on the vendor. In general, this is the most flexible JDBC alternative. It is likely that all vendors of this solution will provide products suitable for Intranet use. In order for these products to also support Internet access they must handle the additional requirements for security, access through firewalls, etc., which the Web imposes.
4. A native-protocol all-Java driver converts JDBC calls directly into the network protocol used by the DBMS. This allows a direct call from the client machine to the DBMS server and is a practical solution for Intranet access. Since many of these protocols are proprietary, the database vendors themselves will be the primary source for this style of driver. Several database vendors have these in progress.

IBM's DB2 JDBC driver was used in the development of the EASIA prototype. IBM provides both type 2 and type 3 JDBC drivers for DB2.
For Java applications the DB2 Client software must be installed on the client, making this a category 2 implementation (this driver was used in EASIA). For Java Applets, no DB2 code is installed on the client and a category 3 implementation is applicable. A DB2 server (or client) must be installed on the Web server machine along with DB2's JDBC Applet Server. When an Applet is downloaded, additional class files associated with DB2's JDBC driver are downloaded with it. The Applet calls the JDBC API to connect to DB2, and the driver establishes communications with the database via the DB2 JDBC Applet Server on the Web server machine. (Although advertised as a type 3 driver, this is not strictly the case since the Applet Server can only connect to IBM's DB2 database. Type 4 classification is not strictly applicable either, since JDBC calls are not passed directly to the database server but to the Applet Server.)

SQLJ

JDBC is becoming ubiquitous for relational and object-relational database access from Java. JDBC is primarily a dynamic SQL interface and does not require pre-compilation or binding to a particular database in advance. Dynamic SQL query plans are determined at run-time. This can be advantageous for DBMSs subject to frequent updates since the latest database statistics can be used for query plan optimisation. However, for DBMSs in which database statistics do not change significantly, static SQL can provide a performance advantage because query plans are determined ahead of execution, during pre-compilation.

JDBC can, however, offer some of the potential performance advantages of static SQL through prepared SQL statements. An SQL statement containing host variables can be prepared before execution and can then be executed multiple times, with different values for the host variables.
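The idea of one statement text with placeholders, executed repeatedly with different values, can be shown with a minimal sketch. It uses Python's standard sqlite3 module instead of JDBC so that it runs without an external DBMS; the run table and its values are hypothetical. The ? placeholders play the role of host variables, and the sqlite3 module keeps a per-connection cache of compiled statements keyed by their SQL text, so repeated executions of the same text can reuse the compiled statement. In JDBC the equivalent object would be a PreparedStatement obtained from Connection.prepareStatement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE run (id INTEGER PRIMARY KEY, reynolds REAL)")

# The same INSERT text is executed three times with different values.
conn.executemany(
    "INSERT INTO run (id, reynolds) VALUES (?, ?)",
    [(1, 180.0), (2, 395.0), (3, 590.0)],
)

# One query text with a placeholder, re-executed for each threshold.
QUERY = "SELECT id FROM run WHERE reynolds >= ? ORDER BY id"
results = {
    limit: [row[0] for row in conn.execute(QUERY, (limit,))]
    for limit in (100.0, 400.0)
}
print(results)
```

Passing values through placeholders, rather than building a new SQL string for each value, is what allows the database interface to recognise the statement as the same one on each execution.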
Preparing a statement can give a performance advantage because the database will compute the execution plan for the query only once, when the statement is prepared, and subsequent executions of the query will use the same plan. JDBC method calls to prepared SQL statements can only be executed at run-time. Therefore, although the execution plan will only be computed once, this computation will still occur at run-time. True static SQL allows the embedded SQL statements to be pre-compiled and the program to be bound to a particular database ahead of execution. The optimised query plan can therefore be determined at this time. Static SQL offers several other advantages to the code developer:

Syntax checking - of SQL statements.
Type checking - to ensure that data exchanged between the host language and SQL have compatible types.
Schema checking - to ensure that the SQL statements are compatible with the target database schema.

ANSI and ISO standards exist for Embedded SQL within the C, COBOL, FORTRAN and Ada languages amongst others, and in April 1997 Oracle, IBM and Tandem jointly proposed SQLJ, Embedded SQL for Java (known at the time as JSQL). SQLJ Part 0 (Embedded SQL for Java) has been accepted as an ANSI standard and will form Part 10 of SQL:1999, known as SQL/OLB [61] (see also Section 2.1.2). Although SQLJ is aimed at providing a static SQL binding from Java, the standard layers SQLJ upon JDBC such that an SQL/OLB compliant implementation also provides access to JDBC features. SQLJ Part 1 (Java Stored Routines) and Part 2 (Java Data Types) are also undergoing standardisation, though not as part of the SQL standard. For further details, see for example [208] [80] and [81].

Microsoft's Data Access Strategy

The previous section mentioned that most commercial database vendors provide JDBC drivers for their products. One notable exception is Microsoft.
(However, some third parties provide JDBC drivers for Microsoft databases, and the JDBC-ODBC Bridge can be used to access a Microsoft DBMS.) Microsoft has a strategy, known as Universal Data Access (UDA) [222], for providing access to database and non-database information across the enterprise. UDA provides data access services to the Windows Distributed interNet Application (DNA) Architecture [236], which is Microsoft's overall strategy for building scalable, distributed, multi-tier, Internet-based client/server applications. UDA provides access to a variety of information sources, including relational and non-relational, and a programming interface that can be used with many (Windows-based) languages and tools since it is based on Microsoft's COM (Component Object Model) component technology [56] [50]. UDA is implemented through the following technologies: OLE DB [173] (a system level interface), ActiveX Data Objects (ADO) [5] (an application level interface that is easier to use than OLE DB and which can be used by any language or tool that can use COM) and ODBC [172]. Microsoft uses yet another acronym, MDAC (Microsoft Data Access Components), to describe the packaged release of these technologies. Whilst ODBC has been very successful for Microsoft, it was designed for relational databases. OLE DB, on the other hand, defines a collection of COM interfaces for accessing relational data, ISAM/VSAM mainframe data, hierarchical databases, email, file system stores, and more. ADO will eventually replace ODBC. However, currently an OLE DB/ODBC bridge is available for databases that do not have a native OLE DB driver. Microsoft has produced a paper comparing ADO with JDBC [7].
The paper (not surprisingly) criticises JDBC in a number of areas, including the fact that JDBC is a low-level API really only suitable for relational data sources. OLE DB, on the other hand, can access many different data sources. Also, OLE DB can be used from many different languages and tools since it is based on COM components. Whilst other vendors are building universal databases with new datatypes to centralise non-traditional datatypes, Microsoft is building OLE DB components to interface to the data in its original form. A useful overview of Microsoft's data access strategy is available in [144]. OLE DB was rejected as an implementation technology for this research as it is vendor and platform specific (requiring an underlying COM-based architecture). Also, although JDBC is a low-level API it is very easy to use. During the course of this research JDBC proved to be extremely versatile for connecting to different vendors' databases on different platforms.

Web Developments

The three prototype systems described in this research all provide Web-based management of non-traditional data. Each system demonstrated new mechanisms and architectures. These three systems chart a progression in the usage of increasingly sophisticated technologies for creating dynamic/interactive Web pages and for providing Web/database integration. The previous section discussed the variety of available database technologies. This section discusses the Web technologies that were available, and describes why different technologies were used to implement the prototypes. Initially, data for Web pages was stored in conventional data files containing static links to other files. There are still many Web sites that are constructed in this way. However, increasingly Web sites now produce dynamic Web pages in response to users' requests.
There is a huge effort from both industrial and research institutions devoted to developing tools and techniques for dynamic Web-based client/server systems. Often Web pages are no longer based on conventional files but are built from data extracted from database management systems. This section discusses the technologies that have enabled the transition from the static Web to the new dynamic, interactive Web.

The Common Gateway Interface

Initially the Web consisted of pre-written HTML pages, containing text, images and fixed links to additional pages. The Common Gateway Interface (CGI) transformed the static Web by providing one of the earliest techniques for generating dynamic Web page content. With CGI, a Web server passes certain Hypertext Transfer Protocol (HTTP) [93] requests to an external program residing on the Web server. The output of this program is then returned as an HTML page to the client's browser. In addition to providing dynamic content, CGI also allowed for user interaction via HTML forms. CGI is still the most widely used mechanism for server-side processing [121]. This is because the CGI approach has a number of benefits including ease of implementation, portability of server software, the use of standard Web browsers as clients and the existence of a wealth of existing tools and sample code. A CGI program receives input from the client's browser, via interaction with the Web server, by reading environment variables and/or standard input, and provides HTML page output, via interaction with the Web server, by writing to standard output. This simplicity allows CGI programs to be written in any language, although PERL has become the predominant choice [124].
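The CGI input/output convention described above can be sketched in a few lines. The example below is an illustration only, written in Java for consistency with the rest of this chapter rather than in PERL or shell script as used by the prototypes. It reads the standard QUERY_STRING environment variable and writes a Content-Type header, a blank line and an HTML page to standard output. The class name HelloCgi and the form field "name" are hypothetical, and the Charset overload of URLDecoder.decode assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class HelloCgi {

    /** Splits a query string such as "a=1&b=2" into decoded name/value pairs. */
    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            String[] kv = pair.split("=", 2);
            String name = URLDecoder.decode(kv[0], StandardCharsets.UTF_8);
            String value = kv.length > 1
                    ? URLDecoder.decode(kv[1], StandardCharsets.UTF_8) : "";
            params.put(name, value);
        }
        return params;
    }

    /** Escapes characters that are significant in HTML. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Builds the complete CGI response: header, blank line, HTML body. */
    static String respond(String queryString) {
        String name = parseQuery(queryString).getOrDefault("name", "world");
        return "Content-Type: text/html\r\n\r\n"
                + "<html><body><p>Hello, " + escapeHtml(name) + "</p></body></html>\n";
    }

    public static void main(String[] args) {
        // The Web server places the request's query string in this variable.
        String query = System.getenv("QUERY_STRING");
        System.out.print(respond(query == null ? "" : query));
    }
}
```

Because the Web server starts the program afresh for every request, nothing in this sketch survives from one request to the next.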
The simplicity of CGI leads to a number of well-known limitations (see for example [124] [74]):
Sessions Problem: In most client/server systems the client stays connected to the server through multiple transmissions. However, with CGI, once a request is handled the CGI program terminates, closing down any communications channel with the server. The underlying HTTP protocol is stateless and is not designed to maintain state between multiple requests from the same client. Crude mechanisms can be employed to maintain state, including hidden variables in HTML forms and Netscape's client state cookies [184].
Server Load/Scalability: The CGI mechanism starts a new process every time a request is made that accesses a CGI program. For CGI programs written in PERL, this performance degradation was magnified in initial CGI implementations since each request also had to start a new PERL interpreter. A further consequence of server-side processing is that work that might usefully be done by the client (such as form validation) has to be emulated by CGI programs on the server, further increasing the workload on the server.
Slow: Web browsers cannot send requests to the server asynchronously, performing other work while the request is processed. Clients wait for the server response, which includes the time for the server to start the CGI process.
Limited Presentation Capabilities: The presentation of the user interface and results from queries are restricted by the limitations of HTML.
Both the GBIS and DBbrowse prototypes were implemented using CGI technology, largely because this was the emerging mechanism for implementing interactive HTML pages at the time. GBIS used UNIX shell scripting for implementing the CGI programs, whilst DBbrowse was written using PERL CGI programs.
DBbrowse opened a new database connection for each query, even for subsequent queries submitted by the same user. This leads to performance degradation. Connecting to a database is typically very slow, often requiring several seconds to log in a user [175]. Repeated database login also limits the client/server functionality, precluding session-oriented database applications. DBbrowse did not allow the use of an SQL cursor to retrieve a subset of the rows in the answer table, with user interaction to request further rows from the database. This problem is inherited from the statelessness of the HTTP protocol. Lack of state also led to repeated queries for metadata in DBbrowse. The EASIA prototype does not use the CGI mechanism. Instead EASIA uses Java Servlets (see Section 2.2.5) to remove the process-per-request overhead associated with CGI, to maintain state between requests and to maintain database connections for a complete client session. EASIA also uses some basic Dynamic HTML (DHTML) features to allow some client-side processing and more advanced presentation. Before discussing the reasons behind choosing Java Servlets for implementing EASIA, a few other alternatives to the CGI mechanism are discussed. Many technologies have been developed to overcome the limitations of CGI, but in turn these can exhibit other disadvantages.

Web Server Extensions

To overcome the performance problems associated with process creation using CGI, Web server vendors provided APIs to allow extensions to the Web server itself. Microsoft's Web server extension API is known as ISAPI [129] (Internet Server Application Programming Interface), Netscape's is known as NSAPI [167], whilst Apache provides the Apache API [13].
Applications built using these APIs run in the same process as the Web server so that communication between the application and the Web server is very fast. Additionally, applications remain in memory once loaded. However, there are a number of problems associated with Web server extension APIs. Firstly, they are proprietary, tying the application to a particular Web server vendor. Also, development with these technologies is complex. ISAPI, for example, is only accessible from C++. Wizards are available to help create the framework for ISAPI code; however, building on the framework is not simple [189]. To simplify development, whilst improving the performance of CGI programs, a number of products have been created on top of the Web server APIs. A module called mod_perl [186] is available for the Apache Web Server, which embeds a PERL interpreter in the server so that the interpreter only has to be started once, at initialisation time. In addition, each PERL CGI program is only compiled once and then kept in memory to be used each time the program is run. This module is only available for the Apache Web Server. ActiveState's PerlEx [4] also improves the performance of PERL CGI programs running on several popular Web servers (including those from Microsoft, Netscape and O'Reilly). It uses the Web server's native API to accomplish this. PerlEx is, however, only available for Web servers running on the Windows NT platform. Due to Web-server and language portability issues, Web server extensions were not considered a suitable technology for implementation of the prototype systems created during this research.

FastCGI

Open Market's FastCGI [90] is another attempt to deal with the performance limitations of CGI. It consists of a specification [26] along with freely available source and object code for extending Web server products.
FastCGI uses persistent CGI processes that are reused to handle multiple requests, removing the overhead of creating a new process for each CGI request. Although processes are reused there is still at least one process for each FastCGI program, and in order to handle multiple concurrent requests for the same program, a pool of processes is required. Another problem with FastCGI is that it is not implemented for some of the most popular Web servers, including Microsoft's Internet Information Server [124].

Java and Java Applets

In 1995 Sun Microsystems launched the Java programming language. Described as "write once, run anywhere", Java achieves portability by compiling source code to byte codes for a virtual machine (specified by Sun Microsystems). The byte code is then interpreted on any platform using a platform-specific Java Runtime Environment (JRE). Before discussing the implications of this for Web development a brief note on the ownership of Java follows. Java is owned by Sun Microsystems and is licensed to third parties. In April 1999 [135], Sun Microsystems proposed to formally standardise Java technology through ECMA [77], an internationally recognised standards developing body with strong ties to ISO. The proposed submission would consist of the Java 2 Platform, Standard Edition (J2SE), Version 1.2.2 specifications. These consist of the following technology specifications:
Java Language Specification (with Clarifications and Amendments) [105].
Java Virtual Machine Specification [145].
Java 2 Platform API Specification [132].
However, on December 7, 1999 Sun issued a press release announcing the withdrawal of the proposal from ECMA [213], which stated, "Sun is withdrawing from the process in order to protect the integrity of the Java technology and the investment made in it by the worldwide community using Java technology." The press release goes on to state that Sun is committed to maintaining compatibility across implementations of the Java platforms and that it encourages the community to compete on implementation, not on standards. It also states that Sun noted that ECMA has formal rules governing patent protections but that there are currently no formal protections for copyrights or other intellectual property. Currently, therefore, Java remains a de facto standard.

Two fundamental categories of Java programs are Java applications and Java Applets. Java applications are standalone programs that can be run using a JRE and which have similar properties to programs written in other languages. Java Applets, on the other hand, are designed to be downloaded and run inside a Web browser using an embedded JRE. Java Applets introduced a fundamentally new concept to dynamic Web pages. Instead of generating the pages server-side it is now possible to download executable content, in the form of Applets, to be run within the user's Web browser. Java Applets therefore provide an alternative philosophy to the proprietary and non-standard CGI workarounds discussed in the previous sections. A Java Applet can be downloaded and run on the client. A Java Applet can make its own network connections using Java sockets, or can employ technologies such as JDBC and/or distributed object technologies (such as CORBA, see Section 2.2.7) for communicating with server-side databases and application logic. These technologies can eliminate the bottleneck imposed by the CGI on the server and provide session-oriented communications.
Additionally, sophisticated graphics are available via Java's Abstract Window Toolkit (AWT) class library. Due to the security implications of allowing downloaded executable content to be run on a user's machine, Java Applets do, however, run in a sandboxed environment. Generally speaking, Web browsers restrict Java Applets by:
Preventing an Applet from running any external executable program.
Preventing an Applet from reading or writing to the local file system.
Preventing an Applet from communicating with any server other than the host from which it was downloaded (the originating host).
Ensuring that an Applet attaches an Applet warning message to any window that it creates (to prevent, for example, the user from inadvertently typing a password into what appears to be a window from a local standalone application, but whose contents might, in fact, be sent to the originating host).
Preventing an Applet from accessing any local information except for the Java and operating system version, and the characters used to separate files, paths and lines.
In addition, a bytecode verifier ensures that all class files obey the rules of the Java language, which enforces, for example, memory protection (this is to prevent damage from Java class bytecodes constructed by hand or by a non-compliant compiler). It is now possible to relax some of these restrictions through available options in current Web browsers. Also, digital signatures provide authentication of the provider of Java Applet classes, thereby allowing a user to grant very specific extended privileges to individual classes or signers. During the construction of the DBbrowse and EASIA prototypes Java Applets were evaluated as a suitable implementation technology. However, Java Applets were rejected for a number of reasons. Applets have restrictions associated with network access to multiple hosts, and access to the local environment as discussed above.
These restrictions were too limiting for the EASIA application, which needs to connect to multiple file server hosts, and which is required to allow users to save results to the client machine. Paepke et al. review technical challenges faced during construction of the Stanford Digital Library [178]. They state that Java security managers and their interaction with browsers were a constant source of trouble. The speed at which Java is developing leads to constant revisions and incompatibilities. It was difficult to build robust Java Applet software for an environment where users will have browsers from different vendors and at different release levels. Applets really need to be tested on all possible client platforms, and even then new versions of browsers may introduce new incompatibilities [178] [124]. The Java Abstract Window Toolkit (AWT), the basis for graphical user interfaces in Java, is, according to Hunter and Crawford [124], the most error-prone and inconsistently implemented portion of the Java language. Although Swing now provides an all-Java class library for user interface components, which may prove to be more robust, Swing support in users' browsers cannot be relied on. Java GUIs often proved to be slow in execution and slow to download. This is an important consideration when there is no control over the client machines of the users. Although Java Applets were not used for the prototypes, Java was used server-side. The portability of Java code and the extensive APIs (or class libraries, for example JDBC) provided great benefits. Java Servlets were used to implement server-side logic in the EASIA prototype. Servlets are discussed next.

Java Servlets

Version 1.0 of the Java Servlet API was released in mid 1997.
A Java Servlet [63] is a dynamically loaded Java class that extends the capabilities of a Web server. Once loaded a Servlet remains in memory, and requests are handled by separate threads within the Web server process. This provides much better performance than CGI, which creates a new process for each request. Orfali and Harkey [175] found that Servlets performed over an order of magnitude better than CGI in a ping benchmark. Servlets are written in Java and, unlike the previously discussed non-standard CGI workarounds, are supported by all major Web servers. Servlets were unavailable when the GBIS and DBbrowse prototypes were implemented. However, Servlet technology was used to implement EASIA. The decision to use Servlets allowed the system to be portable across operating systems and Web servers, and allowed access to the full range of Java APIs such as JDBC and the security APIs (see the discussion of code upload for server-side execution in Section 5.2.6). Servlets also provided a mechanism to maintain state for the duration of a user's login session and to invalidate a user's session and log a user out after a period of inactivity (by storing a stateful last modified variable). Since Servlets are a server-side technology they do not provide a means of enhancing the user interface presented in the client's Web browser. EASIA still uses HTML forms as the main user interface technology, enhanced with some browser-independent DHTML.

Java Server Pages and Active Server Pages

Sun has also released a technology called JavaServer Pages (JSP) [134]. This is very similar to Microsoft's Active Server Pages (ASP). One main difference is that JSP uses Java as the language that is embedded within the HTML pages.
However, unlike Java Applets, which execute client-side, the embedded Java code is executed server-side (as with ASP) and a standard Web page is returned to the client. Also, JSP is designed to work with different Web servers and on different platforms. Sun state that JSP technology was designed to try to provide an industry-wide solution for creating Web pages with dynamic content. JSP is designed to be simpler than Sun's Servlet technology, thereby reducing the level of expertise required to build applications. JSP, like ASP, can separate application logic from Web page content. So, for example, the appearance of a page can be changed without modification to the application logic. A Servlet application, on the other hand, requires that the entire Servlet be edited and recompiled for such a change. Behind the scenes, JSP pages are converted to Servlets and compiled the first time that they are requested, to improve performance for subsequent requests. Since JSP pages are compiled into Servlets they benefit from Java security and memory management facilities. Finally, JSP pages can interact with JavaBeans components to perform complex processing. JSP was not employed during the implementation of EASIA. This was partly due to the fact that EASIA was started before JSP became widely available and partly because it proved to be straightforward to implement EASIA using Servlets directly. ASP [3] technology was developed to run transparently on top of ISAPI. ASP can provide memory-resident, multi-threaded server-side applications whilst offering simpler development than ISAPI applications. ASP is a similar technology to JSP, allowing Web pages to include embedded scripting instructions (using, for example, Microsoft's JScript and VBScript languages) along with other HTML tags.
When Microsoft's IIS (Internet Information Server) Web server first gets a request for a particular ASP page it compiles the script in the Web page and loads the compiled code into memory. The script then performs some server-side processing and writes a standard HTML page back to the client. Underneath the covers, ASP uses a Microsoft-supplied ISAPI DLL that runs within the same memory space as the IIS Web server to process the Web pages. Originally ASP simply provided server-side scripting. Now, however, ASP is one of Microsoft's mainstream technologies for Web-based application development. It is now tightly integrated with other Microsoft technologies (see for example [189]) such as ADO, COM and MTS (Microsoft Transaction Server) [151]. Third parties have ported ASP technology to other platforms. As with other Microsoft technologies, ASP was not used for any prototypes in this thesis as it generally restricts the architecture to the Microsoft Windows platform. Some attempts have been made to port ASP to other operating systems; for example, Chili!soft's Chili!ASP [58] provides a comprehensive commercial product range which allows ASP to operate on Web servers other than IIS and on alternative operating systems to Windows NT.

Distributed Object Technologies

At the same time that Applets were investigated as a possible implementation technology for client-side processing, distributed object technologies were evaluated as a means for the clients to communicate with server-based parts of the applications. Distributed objects have the potential to allow Java Applets to communicate with the server using a higher level of abstraction than alternative mechanisms such as a raw socket connection employing a user-defined protocol, or HTTP (and CGI) based communications.
The result of the evaluation was to reject distributed object technologies as a possible implementation technology for EASIA, largely due to the previously discussed problems associated with the client-side Java Applets. However, this section provides a brief description of some of the issues presented by the distributed object technologies. The need for better client/server performance in the Web environment led to a resurgence in the popularity of distributed object technologies, particularly in combination with Java. During the evaluation of these technologies a simple client/server ping benchmark was run, which showed distributed object technologies to perform two orders of magnitude better than the CGI mechanism, with a performance level within the same order of magnitude as raw socket programming. The details of the experiment are given in Appendix B. The major database and middleware vendors have all been active in the distributed object arena. The principal distributed object technologies are CORBA (Common Object Request Broker Architecture) from the OMG [53], DCOM (Distributed Component Object Model) from Microsoft [27] and RMI (Remote Method Invocation) from Sun [133]. Their promise is to provide an infrastructure for distributed computing which allows the invocation of methods on objects just as if they were part of the local application, whereas the objects can actually be located anywhere on a network. That is, they are intended to provide local/remote transparency so that the developers do not have to worry about factors such as transports, server locations, object activation, target operating systems, etc.
A brief overview of the features of these technologies follows (concentrating on CORBA as this formed the bulk of the evaluation) along with a general discussion of why they have not succeeded in the Web environment and a mention of the new direction these technologies are taking in middleware component architectures.

CORBA

CORBA is an open standard overseen by the OMG, a consortium of over 700 companies within the computer industry. CORBA's architecture is built around three key building blocks:
OMG Interface Definition Language (IDL)
The Object Request Broker (ORB)
The standard Internet Inter-ORB Protocol (IIOP)
The key to CORBA is that CORBA objects have well defined interfaces that can be expressed in OMG Interface Definition Language (IDL). IDL is a simple declarative language used to define object types by specifying their interfaces. IDL syntax is similar to C++ or Java, but it does not contain any programming constructs. An interface definition consists of:
operations - method signatures
parameters - arguments of operations - in (client to server), out (server to client), inout (both ways)
attributes - instance or class variables - 'get' and 'set' methods must be supplied for each attribute (get only for read-only attributes)
exceptions - exceptions that operations may raise
An IDL interface can also specify inheritance from parent interfaces, and typed events that it emits. An important feature of CORBA is that it is language neutral. Clients and servers can be written in a number of different languages, including Java, C, C++, Smalltalk and ADA, and bindings for other languages such as COBOL are in the process of being specified. Clients written in one language are able to communicate with servers written in a different language. It is possible to implement the same interface in multiple objects. IDL is used to map CORBA objects into particular programming languages through the use of IDL compilers.
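To make the elements of an interface definition more concrete, the sketch below shows a small, hypothetical IDL interface (in comments) alongside a hand-written Java interface of the general shape that the IDL-to-Java mapping produces. The names ResultArchive, NoSuchRun and InMemoryArchive are invented for illustration and do not come from the prototypes. The local IntHolder class is a stand-in for org.omg.CORBA.IntHolder so that the example compiles without a CORBA library, and the exception is simplified (a real mapping derives user exceptions from org.omg.CORBA.UserException).

```java
// Hypothetical IDL, shown for comparison only:
//
//   exception NoSuchRun { string reason; };
//   interface ResultArchive {
//     readonly attribute long runCount;
//     double meanVelocity(in long runId, out long samples) raises (NoSuchRun);
//   };

/** Stand-in for org.omg.CORBA.IntHolder: carries an 'out' parameter back. */
final class IntHolder {
    int value;
}

/** IDL exception mapped to a Java checked exception (simplified). */
final class NoSuchRun extends Exception {
    NoSuchRun(String reason) {
        super(reason);
    }
}

/** Java interface corresponding to the IDL interface above. */
interface ResultArchive {
    // readonly attribute: only an accessor is generated.
    int runCount();

    // 'in' parameters map to plain arguments, 'out' parameters to holders.
    double meanVelocity(int runId, IntHolder samples) throws NoSuchRun;
}

/** A trivial object implementation, used only to exercise the interface. */
final class InMemoryArchive implements ResultArchive {
    private final double[][] runs;

    InMemoryArchive(double[][] runs) {
        this.runs = runs;
    }

    @Override
    public int runCount() {
        return runs.length;
    }

    @Override
    public double meanVelocity(int runId, IntHolder samples) throws NoSuchRun {
        if (runId < 0 || runId >= runs.length) {
            throw new NoSuchRun("no run with id " + runId);
        }
        double sum = 0.0;
        for (double v : runs[runId]) {
            sum += v;
        }
        samples.value = runs[runId].length; // returned via the 'out' parameter
        return sum / runs[runId].length;
    }
}
```

The IDL type long maps to a Java int, and an out parameter needs a holder object because Java passes primitive arguments by value.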
An IDL compiler for the Java language automatically produces a stub class for the client, a skeleton class for the server and a Java interface class that corresponds to the IDL. The stub and skeleton classes glue the actual client and server code to the ORB. The ORB is the middleware that handles the client/server interaction. An API is defined to allow client/server object interaction with the ORB. The stub class provides local proxy objects that the client can invoke methods on. The methods in the stub proxy object in turn invoke operations on the real object implementation, via the skeleton on the server. Once the client stub has been instantiated on the client using an object reference to a CORBA server object, standard Java code is used to invoke methods on that object. During a client request the stub automatically builds a block of information that identifies the object and method to be used, and which contains the parameters to be sent. This block of information is packaged in a device-independent manner. This is known as parameter marshalling. The skeleton class at the server unmarshalls the parameters. It directs the operation request to the appropriate method of the correct object implementation. The skeleton class then captures the return value or exception and sends this, in marshalled form, back to the stub on the client. Two classes usually have to be written for the server side of the request. These are referred to as the server and servant. A servant object is an instance of the object implementation, i.e. code that implements the methods specified in the IDL interface. The programmer will add the body of the operations defined in the IDL (and Java interface) and also constructors for the object. A server object can be started manually (at the command line) and it then instantiates a servant object (or multiple servant objects, possibly of different types). 
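The roles of stub, skeleton and servant described above can be sketched without an ORB. The following is a simplified illustration and not CORBA code: the Adder interface, the class names and the wire format are all hypothetical, the request is passed as an in-memory byte array where a real ORB would use the network, and only int parameters are handled. DataOutputStream is used because it writes values in a fixed, platform-independent byte order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

/** Java interface of the kind an IDL compiler might generate (hypothetical). */
interface Adder {
    int add(int a, int b);
}

/** Servant: the object implementation living on the server. */
final class AdderServant implements Adder {
    @Override
    public int add(int a, int b) {
        return a + b;
    }
}

/** Skeleton: unmarshals a request, calls the servant, marshals the reply. */
final class AdderSkeleton {
    private final Adder servant;

    AdderSkeleton(Adder servant) {
        this.servant = servant;
    }

    byte[] dispatch(byte[] request) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(request));
        String operation = in.readUTF();
        int argc = in.readInt();
        int[] args = new int[argc];
        for (int i = 0; i < argc; i++) {
            args[i] = in.readInt();
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        if (operation.equals("add") && argc == 2) {
            out.writeBoolean(true);                       // normal return
            out.writeInt(servant.add(args[0], args[1]));
        } else {
            out.writeBoolean(false);                      // exceptional return
            out.writeUTF("unknown operation: " + operation);
        }
        return buf.toByteArray();
    }
}

/** Stub: a local proxy that marshals each call and hands it to the skeleton. */
final class AdderStub {
    static Adder connect(AdderSkeleton skeleton) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object[] params = (args == null) ? new Object[0] : args;
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(method.getName());               // which operation
            out.writeInt(params.length);
            for (Object p : params) {
                out.writeInt((Integer) p);                // this sketch handles int only
            }
            // Direct call stands in for the ORB and the network transport.
            byte[] reply = skeleton.dispatch(buf.toByteArray());
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(reply));
            if (in.readBoolean()) {
                return in.readInt();
            }
            throw new IllegalStateException(in.readUTF());
        };
        return (Adder) Proxy.newProxyInstance(
                Adder.class.getClassLoader(), new Class<?>[] {Adder.class}, handler);
    }
}
```

With these classes, AdderStub.connect(new AdderSkeleton(new AdderServant())).add(2, 40) returns 42, and the calling code is ordinary Java that is unaware of the marshalling underneath.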
The server code also initialises the ORB environment and registers the available servant objects with the ORB. The CORBA 1.1 specification (introduced in 1991) concentrated on producing portable object applications by defining CORBA IDL and the CORBA API. However, CORBA implementations from different vendors were not interoperable. CORBA 2.0 [53] (first adopted in 1994) includes the specification of inter-ORB interoperability, known as the General Inter-ORB Protocol (GIOP). This protocol defines the message format for invoking operations on CORBA objects. One mandatory Inter-ORB protocol (IOP) that must be implemented by CORBA 2.0 compliant ORBs is the Internet Inter-ORB Protocol (IIOP), which uses TCP/IP as its transport protocol. CORBA 2.0 also allows an object reference to be used by a client using any compliant ORB, through the use of Interoperable Object References (IORs). In the Web environment CORBA clients are subject to security restrictions. Applets that invoke operations on CORBA objects are limited to opening network connections back to the host from which they were downloaded. Another restriction occurs in the case of firewalls which do not permit TCP/IP based IIOP communications to cross them. A method often used to overcome these problems is known as HTTP Tunnelling. IIOP messages are placed in an HTTP wrapper which enables them to pass through firewalls. The messages are sent back to the originating host of the Applet and a daemon process on the host machine forwards the CORBA request to the machine that hosts the required object. This completes a brief overview of some of the important features of CORBA. This overview provides details that allow comparisons with DCOM and RMI in the next two sections, followed by a discussion of why these technologies have not superseded HTTP as the protocol for Web communications, and how they might, instead, succeed in middleware component architectures.
For more detailed comparisons of these distributed object technologies, see for example [45] [175] [187] [192] [215], and for excellent links to information on CORBA in general, refer to [41].

DCOM

DCOM is Microsoft's distributed object technology, which competes with CORBA. DCOM extends Microsoft's COM from the desktop to the network, allowing objects to communicate over the Internet. It is best to consider COM and DCOM as a single technology that provides a range of services for component interaction, from services promoting component integration on a single platform, to component interaction across networks. In fact, COM and its DCOM extensions are merged into a single runtime. This single runtime provides both local and remote access. COM is both a specification and an implementation: the specification defines a binary standard for implementing objects, and the implementation is a dynamic link library that includes API function calls. These can be used to instantiate an object and give it a unique ID. COM specifies how the objects can be instantiated and how they can communicate locally using the predefined interfaces that they implement. DCOM uses object-oriented remote procedure calls to extend COM, and is built on top of the Open Software Foundation (OSF) Distributed Computing Environment (DCE) Remote Procedure Call (RPC) [177]. DCOM, like CORBA, uses an interface definition language. However, the two languages are not the same. A DCOM IDL file contains interface definitions, which are divided into an interface header and an interface body. The interface header contains details applicable to the whole interface. The interface body contains items such as function prototypes and pre-processor directives. CORBA IDL is far more succinct and self-explanatory than DCOM IDL.
A DCOM Interface is a group of related functions held in an array of function pointers known as a virtual table or vtable. The table points to the implementations of the interface functions. A client receives a pointer to the interface, but cannot instantiate an instance or create a unique DCOM object and hence cannot maintain state between connections. A client can only reconnect to an interface pointer of the same class. Hence DCOM has transient stateless objects compared with CORBA's persistent objects and object references.

DCOM is primarily a Windows-based technology. Language bindings are available for all Microsoft development environments including Visual C++, Visual Basic and Visual J++.

In contrast to CORBA, which is an open standard overseen by the Object Management Group (a consortium of over 700 companies from within the computer industry), DCOM is proprietary. In 1996 Microsoft announced that it would hand over its object technology specification to the Active Group [2]. The Active Group consisted of software vendors and system vendors and was to be directed by a steering committee. The aim of the group was to determine a process to transfer ActiveX specifications and appropriate technology to a standards body. However, Microsoft has subsequently and quietly abandoned its 1996 commitment to hand over ActiveX to an independent body, and the Active Group has vanished along with this promise.

COM/DCOM has, however, been ported to a number of other operating systems by Microsoft and other organisations (see for example [51] [52] [86]), although support for many of the technologies built on top of COM/DCOM is very limited on these other platforms.
A report by the OMG states that Microsoft Windows is always likely to be the reference platform for DCOM and that non-Windows versions will always have second-class status [54]. The report quotes Bob Muglia, the Vice-President of Microsoft's Developer Tools Division, as saying, "Microsoft unapologetically will make sure that ActiveX works best on Windows." (ActiveX is a core Microsoft Internet technology based on COM/DCOM.)

RMI

RMI is a set of Java classes that Sun first included in JDK1.1. When introduced, RMI was a non-CORBA compliant ORB restricted to use only from within the Java language. This allows RMI to be optimised for Java usage. However, it prevents RMI from being directly incorporated into existing software written in other languages. It may also be a problem in applications where the execution time of interpreted Java byte code does not provide sufficient performance. The main ways in which RMI differed from CORBA when it was first introduced are as follows:

No IDL -- RMI uses Java interfaces, directly, to specify the interfaces of remote objects. These interface definitions contain the method signatures of all remote methods. On the server side a class is created to implement this interface. An RMI compiler is used to generate client stub and server skeleton classes using the byte code of this interface implementation class. This class can then be instantiated by a server program to create remote objects.

Objects can be passed by value -- Since RMI uses Java interfaces and not IDL to define interfaces, it is possible to include method invocations that reference local objects. Clients and servers can therefore pass one another objects for which no stub and skeleton classes exist.
Additionally, it is not possible to simply pass an object reference, as would be the case for an object passed as a parameter to a local method, because object references are memory locations of objects which are valid only within the local Java virtual machine. RMI solves this problem by passing the actual value of the object instead of just a reference. There is therefore no further connection to the original object.

URL-based object names -- RMI allows remote object references to be stored using a URL-based naming scheme, which can be very useful in the Internet environment.

Dynamic stub downloading -- Clients can download stub classes from the server, dynamically as required, if they have a reference to a remote object and do not have the stub locally.

Over the last couple of years the differences between RMI and CORBA have become fewer. RMI can now be implemented on top of IIOP, thereby allowing a level of interaction between RMI and CORBA objects. Also, the CORBA specification now supports objects by value.

Distributed Objects Have Not Succeeded in the Web Environment

Despite the promise of distributed objects they have not succeeded in the Web environment. HTTP is the protocol of the Web, not IIOP. There are a number of reasons for this lack of penetration on the Web. First, these technologies are complex to use, and the idea that applications can be built without regard to whether objects are local or remote is currently false (see for example [227]). Second, they require significant runtime support to operate properly. Third, these technologies were not designed explicitly for the Web, with consequences such as their inability to traverse firewalls without specific workarounds. Fourth, the mechanism by which they tightly couple applications to interfaces is too rigid to cope with change. For example, CORBA IDL is used to produce binary programs for use as stubs and skeletons at the client and server.
A change to an IDL-defined interface requires regeneration of the stub and skeleton and consequential changes (code update, recompilation, redeployment) to the programs that use them. If these technologies are being used to bridge applications between different organisations, then exact agreement is required on the interfaces being used.

The rest of this section reviews a number of books and papers that add weight to the above argument.

In 1997, Orfali and Harkey [175] asserted that the Web was on the verge of a distributed object revolution, with ubiquitous deployment of Java and CORBA. In their "Object Web" environment, clients would dynamically discover and use objects made available through CORBA's trading service. Clearly this Object Web has not materialised. The authors also quoted Marc Andreessen, the co-founder of Netscape, making the following prediction in 1996: "The next shift catalyzed by the web will be the adoption of enterprise systems based on distributed objects and IIOP (Internet Inter-ORB protocol). We expect to distribute 20 million IIOP clients over the next 12 months and millions of IIOP-based servers over the next couple of years." If these distribution targets have been achieved then it is solely due to CORBA software being bundled with Netscape products. However, in terms of actual deployment, the numbers are orders of magnitude out.

Tallman and Kain [215] quote Gary Voth, Microsoft Group Manager for Marketing, as saying in 1998, "There are only 50,000 or 60,000 deployments of CORBA around the world." What Voth did not mention is that there are even fewer enterprise solutions using COM. Tallman and Kain state that they tried to find references for COM projects but were unable to identify much comparable experience. Whilst significant COM development occurs at the desktop and departmental level, it is currently an unproven enterprise technology.
Box [22] compares the suitability of Java, CORBA, COM/DCOM and XML as technologies for enabling component software that can provide collaboration and cooperation amongst software development organisations. He believes that it is unlikely that Java, CORBA or DCOM will dominate the Internet. Box states, "Ironically, while Microsoft and the Object Management Group (OMG) were arguing over whether the Internet would be run on DCOM or CORBA, the Hypertext Transfer Protocol (HTTP) took over as the dominant Internet protocol."

A news report in Computing [91] mentioned the heavy bias towards XML in an early preview of Microsoft's Developer Studio 7 and made the following statement concerning DCOM: "In essence, Microsoft is replacing the DCOM RPC messaging technology with an XML/HTTP technology that allows for remote method invocation".

CommerceNet's eCo System initiative [103] originally used CORBA to define business services. However, this was abandoned in favour of XML. The authors suggest that while CORBA appears workable within organisations that control APIs, it is not practical for inter-enterprise integration. Business documents defined in XML provide a simpler, human-readable, intuitive way for businesses to interoperate. Businesses already exchange information via documents on which they largely agree, whereas programming APIs for business system interfaces almost certainly differ.

The role that XML has assumed in the Web environment is discussed in more detail in Section 2.2.8, but first a brief description of emerging middleware component architectures follows, as these could be the environment in which distributed objects will succeed.
Server-side Component Architectures

Before leaving the discussion of distributed object technologies it is worth noting that new middleware component architectures have arisen in the last couple of years that may provide a lifeline for these technologies. Therefore, although distributed objects have not proliferated in the Web environment, they may still succeed as the communication mechanism of choice within organisations.

Sun's Enterprise Java Beans (EJB) architecture [148] (based on CORBA and RMI) and the Microsoft Transaction Server (MTS) architecture [151] [216] (based on COM/DCOM) promise "plug and play" enterprise computing features (where enterprise computing effectively translates as distributed computing). (Note that EJB is an entirely different concept to JavaBeans technology [110], which allows visual program composition without the need to access source code. Programs built from JavaBeans are designed to be run in a single Java virtual machine.)

The idea is to provide a level of abstraction that can virtually free the developer of any middleware expertise when building scalable, transactional, distributed applications. Rather than writing to middleware APIs (as was previously the case with CORBA and DCOM applications), components gain middleware services implicitly and transparently. These transparent services include transactions, persistence, security, state management, component lifecycle, threading, resource-sharing and more.

EJB and MTS also fit within wider enterprise computing strategies from Sun and Microsoft, embodied in Java Enterprise Edition [202] and Windows Distributed interNet Application (DNA) architecture [236] respectively.
Java Enterprise Edition incorporates EJB, RMI, JDBC, Java Naming and Directory Interface (JNDI), Java Transaction API (JTA), Java Transaction Service (JTS), Java Messaging Service (JMS), Servlets, JSP, Java IDL, JavaMail, Connectors and XML.

The Windows DNA architecture incorporates many Microsoft technologies, centred on COM, for building scalable, distributed, multi-tier Internet based client/server applications. The DNA approach involves plugging together COM components for developing all tiers of distributed applications, including user interface and navigation, business logic, and storage. Developers can combine many different COM-aware languages (C++, Java, Visual Basic, JScript, VBScript), tools and services to create applications. DNA services include, amongst other things, transactions (MTS), component management, Dynamic HTML, Web browser (Internet Explorer (IE)) and server (Internet Information Server (IIS)), scripting (JScript, VBScript), message queuing (Microsoft Message Queue Server (MSMQ)), security, directory services (Active Directory), data access (UDA, SQLServer), systems management, and user interfaces. This list appears to be fairly flexible in different Microsoft publications, but in general it is fair to say that it incorporates most of Microsoft's core technologies.

Further comparison of EJB and MTS is available in [193], and Microsoft's view of the advantages it sees in MTS over EJB is available in [55]. It is too early to say if either of these strategies will be successful, but the usual Microsoft "Windows-only" caveats apply to MTS/DNA. Chang and Harkey [43] provide words of scepticism aimed at the transparency that middleware component architectures may provide (from very complex underlying technologies and interactions), encapsulated in their statement, "It will be quite a feat to figure out the reason when something goes wrong. It will likely be a feat to explain when something goes right in terms of expected results and performance."

XML and Dynamic HTML

Currently XML (Extensible Markup Language) is an important and rapidly expanding technology area associated with the development of the Web. XML is discussed in this section along with DHTML, the DOM (Document Object Model) and JavaScript, since there is significant overlap between these technologies. For example, two of the components of DHTML are the DOM and JavaScript. Also, the DOM specification defines a JavaScript binding that can be used to manipulate XML (and HTML) documents.

In this research XML was used in the EASIA architecture, first as a means of specifying the content of the user interface without specifying its representation, and second, to define interfaces to external server-side applications that can be incorporated into the EASIA scientific dataset post-processing capabilities. JavaScript is used in the EASIA user interface to provide a more dynamic feel than can be achieved with HTML forms alone.

XML

Until now HTML has been the language of the Web. HTML is a fixed markup language that is, in fact, an application of the Standard Generalized Markup Language (SGML) [127] [57]. However, HTML has a number of limitations. XML was designed to provide a new markup language for the Web without the limitations exhibited by HTML.

XML arrived in late 1996 and finally reached maturity as a World Wide Web Consortium (W3C) [238] Recommendation in February 1998 [23]. (The W3C is an international industry consortium founded in October 1994 to guide the development of the Web by providing a repository of information, specifications, reference code implementations, prototypes and sample applications, amongst other things. A Recommendation is the highest level a specification can attain within the W3C.)

XML is a subset of SGML.
It is a metalanguage for defining other markup languages, which can be used for describing and exchanging structured data in the Internet environment. XML provides most of the functionality of SGML but without the complexity. Essentially XML documents are text-based and resemble HTML documents with user-defined elements. An element consists of an opening and closing tag (indicated by angle brackets similar to HTML's standard tags) which surrounds content. XML removes the limitations of HTML by providing:

Extensibility -- The ability to define custom tags and attributes. XML tags can use meaningful names to describe their content, thereby providing self-describing documents. Text-based XML documents with meaningful tags can facilitate data reuse across different platforms and applications.

Structure -- XML can describe data in a nested structure.

Description -- XML supports a metadata description (a schema) for the structured data. Currently the mechanism for this is to associate a Document Type Definition (DTD) with the XML document.

Validation -- If a DTD is included, XML supports verification to ensure the data is valid (according to the supplied DTD) and well-formed.

An XML document is well-formed if:

It contains one or more elements.

It has exactly one root element that has a unique opening and closing tag that surrounds the whole document.

All other elements within the document are nested with no overlap between elements.

A well-formed XML document may additionally be described as valid if it conforms to a DTD. DTDs are expressed in Extended Backus-Naur Form (EBNF) notation (see for example, the XML Specification [23]).

Although DTDs are currently the recognised way to associate a schema with an XML document (in line with the current XML specification), there are a number of efforts underway to develop alternative schemas for XML to overcome some of the problems associated with DTDs.
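The well-formedness conditions listed above can be illustrated with a minimal sketch. It assumes Python's standard `xml.dom.minidom` parser, which checks well-formedness but does not validate against a DTD; the element names and the helper `is_well_formed` are hypothetical and are not taken from any system described in this thesis.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        # Raised by the underlying expat parser for any well-formedness error.
        return False


# Hypothetical documents; tag names are illustrative only.
GOOD = "<result><title>Channel flow</title><size units='GB'>12</size></result>"
OVERLAP = "<result><title>Channel flow<size></title></size></result>"  # elements overlap
TWO_ROOTS = "<result/><result/>"  # more than one root element

for name, doc in [("GOOD", GOOD), ("OVERLAP", OVERLAP), ("TWO_ROOTS", TWO_ROOTS)]:
    print(name, is_well_formed(doc))
```

Checking validity against a DTD would need a validating parser, which this sketch deliberately leaves out.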
DTDs can be difficult to write and are limited in their descriptive power. DTDs cannot specify data types, default element content, or relationships within the data. DTDs also require separate parsers to XML, and different authoring tools. The main contender is now XML Schema [240] [241], a two-part draft specification for a schema language, which provides a superset of the capabilities found in DTDs. This working draft draws heavily on a number of other proposals.

Since XML consists of user-defined elements, an application usually needs to parse an XML document, and specialised APIs have been created for this purpose. There are two major types of XML parsers: tree-based and event-based. A tree-based XML parser compiles an XML document into an internal tree structure and then provides an API to navigate that tree. The DOM (Section 2.2.8.2) specifies such an API. Event-based parsers report parsing events (such as start and end tags) directly to an application via callbacks without building an internal tree. Whilst this is a lower-level API, it has the advantage that it is not as memory hungry as a tree-based API, which builds an in-memory parse tree. SAX (the Simple API for XML) [203] provides a standard interface for event-based parsing. Most parsers will also validate an XML document against a specified DTD.

Having described the origins of XML and given a brief overview of some of its features, the following two sections discuss two diverse applications of XML: first, as a mechanism for presentation and content management on the Web, and second, as a data standard and enabling technology for document-driven distributed computing.

XML for Presentation and Content Management

XML was originally seen as a way to overcome the shortcomings in HTML by providing a markup language with user-defined tags that could separate presentation from content.
One application of XML is therefore as a markup language for Web pages. However, XML does not provide any information about how data should be displayed. One mechanism to associate display information is to write a server-side application that parses and transforms a requested XML document to an HTML format prior to it being served to the client. Such an application can be custom built or can be based on generic style sheet processors. Style sheet processors are available that can be run server-side or can be embedded within Web browsers, such that they can be sent an XML document (which includes a link to a style sheet) directly. In this case the style sheet contains the rules that define how a document should appear, and the Web browser can process the style sheet along with the XML document it is retrieving in order to display it. The two style sheet languages that are currently being used for this purpose are Extensible Style Language (XSL) [89] [14] and Cascading Style Sheets (currently a two level specification, CSS1 [33], CSS2 [34], with a third level in progress).

Separation of content from presentation, in this way, has a number of advantages. Different style sheets can be used to display the same XML document in different ways to different users, or in different formats to different devices, for example, to provide simpler presentation for low-resolution screens, such as those found on hand-held computers. Also, the same style sheet can be used to display different XML documents in the same style. It is also beneficial for content management, since fragments of XML content can be retrieved from multiple sources and rendered into a single document. Conversely, a single fragment of XML content can appear as a component of many different Web pages.

XML can also enhance searching and indexing mechanisms used for Web pages.
Pages containing XML markup facilitate metadata discovery through their use of machine recognisable tags (and possibly complete, standardised XML vocabularies used by particular industries or groups).

Despite this initial focus for XML, a major current focus is its role as a vendor, platform and application independent data format that can be used to connect autonomous, heterogeneous applications. This is the subject of the next section.

XML as a Data Standard and as an Enabling Technology for Document-Driven Distributed Computing

Phipps [185] discusses how XML can complete the picture for a paradigm shift in computing. Historically, computing solutions have consisted of complex systems containing mutually dependent hardware, operating systems, software packages, network software and data formatting amongst other things. However, it is now possible to provide a simpler framework, which breaks the traditional dependencies. Phipps suggests that there are four parts to a modern computer solution and also defines the technologies for each part:

Network -- indisputably TCP/IP is the solution.

Desktop -- a space to load solutions, probably browsers, but the key feature is that solutions can be instantiated without requiring additional installed software or proprietary operating system features.

Programs -- in the Web environment Java is now established as the de facto standard for code development.

Data -- until now there has been no obvious generic data format. XML fills the final gap by providing an open data formatting system.

Whilst Java provides a platform independent application development language, XML provides an application independent data standard. XML is often described as self-documenting because it consists of named tags and an optional schema that defines the language represented by these tags. Currently such schemas are constructed as DTDs. A DTD defines, amongst other things, valid elements, attributes and rules for their use.
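As a small illustration of the self-documenting property, the following sketch walks an XML document and reports each element's name, attributes and text without the program knowing the vocabulary in advance. It assumes Python's standard `xml.etree.ElementTree` module; the document, its tag names and the helper `describe` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag and attribute names are illustrative only.
SAMPLE = """<simulation>
  <title>Channel flow</title>
  <grid points="256"/>
  <result units="GB">12</result>
</simulation>"""


def describe(xml_text):
    """Return (tag, attributes, text) for every element, in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter():  # iter() yields the root first, then descendants
        text = (elem.text or "").strip()
        rows.append((elem.tag, dict(elem.attrib), text))
    return rows


for tag, attrs, text in describe(SAMPLE):
    print(tag, attrs, text)
```

The receiving program learns the names and structure of the data from the document itself. What those names mean is still a matter of agreement between the parties, which is the role a DTD or shared vocabulary plays.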
XML can specify new markup languages for a particular purpose, sometimes referred to as vocabularies. Industry-specific XML vocabularies are beginning to proliferate (see for example, the repositories at OASIS [176] and XML.org [242]).

As touched on in Section 2.2.7.4, XML could become an enabling technology for document-centric computing consisting of loosely coupled heterogeneous applications on the Web. One of the reasons for the success of the Web is that it is based on a simple stateless protocol. XML could be used as the syntax for request-response message exchange between applications. In this scenario XML DTDs are used to define interfaces between services. XML parsers are used to marshal data, which is sent to a server via an HTTP POST method, and an XML message is returned to the client. Services can be made available by exposing XML DTDs or vocabularies. Since the underlying protocol is HTTP, messages are not blocked by firewalls. Furthermore, this framework benefits from the work that has been, and continues to be, dedicated to optimising the performance, scalability, and reliability of HTTP servers. The loose coupling of client and server also makes it possible to complete requests even if a client uses an old version of a DTD.

XML is a mechanism for representing data and does not provide a transport protocol. As mentioned above, the simplest way to transport XML between services is to use the HTTP protocol. A number of proposals are being put forward for standardising XML remote procedure call mechanisms to remote processes or objects. These include the Simple Object Access Protocol (SOAP) [204] and XML-RPC [239]. These proposals each describe basic object invocation mechanisms with varying features, all of which use HTTP as the transport and XML for message syntax.
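The marshalling step in such an exchange can be sketched as follows. This is only an illustration, assuming Python's standard `xmlrpc.client` module to build and read an XML-RPC request body; the method name `archive.resultSize` and its parameters are hypothetical, and no HTTP request is sent.

```python
import xmlrpc.client


def build_request(method, *args):
    """Marshal a method name and its arguments into an XML-RPC request body."""
    # dumps() expects the parameters as a tuple, which *args already is.
    return xmlrpc.client.dumps(args, methodname=method)


def read_request(xml_text):
    """Unmarshal a request body back into (method name, parameter tuple)."""
    params, method = xmlrpc.client.loads(xml_text)
    return method, params


# Hypothetical call: the method name and arguments are illustrative only.
request_xml = build_request("archive.resultSize", "run42", 3)
print(request_xml)

method, params = read_request(request_xml)
print(method, params)
```

In a real exchange the text held in `request_xml` would form the body of an HTTP POST, and the server would return a response document marshalled in the same way.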
These XML-based protocols do not require the complex run-time support of distributed object technologies such as CORBA and DCOM, and they provide the major benefit of being Web-native and thus able to pass through firewalls that accept HTTP requests. Conversely, these protocols do not currently support such features as metadata discovery and objects-by-reference (the latter would require bi-directional HTTP and distributed garbage collection).

The Document Object Model

The DOM specification [71] from the W3C defines an interface that allows programs and scripts to dynamically access and update the content, structure and style of HTML and XML documents. The Document Object Model provides a standard set of objects for representing HTML and XML documents, a standard model of how these objects can be combined, and a standard interface for accessing and manipulating them. The DOM is designed to be language and implementation independent, and as such the core of the specification consists of interfaces defined using OMG IDL [170]. However, the specification also contains language bindings for Java and JavaScript. Programmers can use the DOM to build documents, navigate document structure, and add, modify, or delete elements and content.

The W3C Document Object Model Working Group carries out work on the DOM. Work by this group covers:

Modelling new parts of XML: the DOM is an API to an XML document. As new features are added to XML, the DOM API should model these. Namespaces is an example.

CSS Object Model: an object model for modifying and attaching a CSS style sheet to a document.

Event model: a model for allowing user and application events.

Traversal interfaces: an interface for selectively processing parts of the document according to user-specified criteria.

Content Models and Validation: an object model for modifying and attaching a Content Model to a document.
Load and Save interfaces: loading XML source documents into a DOM representation and saving a DOM representation as an XML document.

Views and Formatting Object Model: physical characteristics and state of the presentation.

The DOM is a multi-level specification, with Level 1 [71] a full recommendation and Level 2 [72] currently a candidate recommendation of the W3C. A list of the current and envisaged DOM specifications along with timescales is available at [70]. These include:

Level 0. Functionality equivalent to that evident in Netscape Navigator 3.0 and Microsoft Internet Explorer 3.0 is referred to as Level 0. The model builds on this existing technology.

Level 1. This concentrates on the actual core, HTML, and XML document models. It contains functionality for document navigation and manipulation.

Level 2. This level, which is at Candidate Recommendation stage, includes a style sheet object model, and defines functionality for manipulating the style information attached to a document. It also enables traversals on the document, defines an event model and provides support for XML namespaces [160].

Level 3. This level will address document loading and saving, as well as content models (such as DTDs and schemas) with document validation support. In addition, it will also address document views and formatting, key events and event groups.

Further levels. These may specify some interface with the possibly underlying window system, including some ways to prompt the user. They may also contain a query language interface, and address multithreading and synchronisation, security, and repository.

The DOM is u