1. 1) Information Integration Framework

    1. (1-1) High-Availability Scheme for Distributed Stream Processing

      Distributed stream processing engines (DSPEs) have recently been studied to meet the needs of continuous query processing. Because a DSPE is built on the cooperation of several stream processing engines (SPEs), a single node failure can cause the whole system to fail. High-availability schemes such as Active Standby (AS), Upstream Backup (UB), and Passive Standby have been proposed to address this issue. Active Standby is a processing-pairs model: the secondary receives processing results from the upstream node and processes them in parallel with the primary. While the primary is alive, the secondary does not send its processing results downstream; it logs them in its output queue. After a stop-failure of the primary, the secondary starts sending processing results from the head of its output queue downstream and takes over from the primary. In Upstream Backup, the upstream primary acts as a backup SPE, and each secondary merely waits while its primary is alive. If the primary stops, the upstream primary resends all processing results, and the secondary reprocesses them to recover the primary's state. In AS, the upstream node must always send processing results to the primary and the secondary in parallel, so AS's bandwidth overhead is the same as that between primaries. However, the recovery time of AS is shorter than that of UB, because the secondary always keeps its state up to date, just as the primary does. In UB, the upstream node sends processing results to the secondary only after the primary has stopped, so the bandwidth overhead of UB is zero before a failure. However, the recovery time of UB is longer than that of AS, because the secondary must rebuild the primary's state from its initial state.

      We have proposed a new high-availability scheme, Semi-Active Standby (SAS). SAS has a tunable parameter called the batch size. By changing the batch size, SAS can adjust the trade-off between bandwidth overhead and recovery time. SAS behaves like UB until the size of the output queue reaches the batch size; once it does, all data in the queue are sent to the downstream secondary. With a smaller batch size, SAS shows larger bandwidth overhead and shorter recovery time; with a larger batch size, it shows smaller bandwidth overhead and longer recovery time. In addition to SAS, we have proposed Adaptive Semi-Active Standby (A-SAS), which adaptively balances bandwidth overhead and recovery time through cost-model-based automatic tuning of the batch size.
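      The batch-size trade-off can be illustrated with a toy simulation. This is a hypothetical sketch, not the actual SAS implementation: it simply counts primary-to-secondary transfers (a proxy for bandwidth overhead) and the largest backlog the secondary falls behind by (a proxy for recovery time).

```python
def sas_run(num_tuples, batch_size):
    """Simulate a primary buffering results and forwarding them to the
    secondary in batches of `batch_size` tuples.

    Returns (forwards, max_backlog):
      forwards    -- number of primary->secondary transfers
                     (a proxy for bandwidth/messaging overhead)
      max_backlog -- largest number of tuples the secondary lags behind
                     (a proxy for recovery time)
    """
    queue = 0
    forwards = 0
    max_backlog = 0
    for _ in range(num_tuples):
        queue += 1                     # log the result in the output queue
        max_backlog = max(max_backlog, queue)
        if queue >= batch_size:        # queue reached the batch size:
            forwards += 1              # ship the whole batch downstream
            queue = 0
    return forwards, max_backlog

# A smaller batch size means more transfers but a smaller backlog,
# i.e. higher bandwidth overhead and shorter recovery time.
print(sas_run(1000, 1))    # batch size 1 behaves like Active Standby
print(sas_run(1000, 100))  # a large batch size is closer to Upstream Backup
```

      With batch size 1 every tuple is forwarded (1000 transfers, backlog 1); with batch size 100 only 10 transfers occur, but the secondary can be up to 100 tuples behind at failure time.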

    2. (1-2) R-Tree for Update-Intensive Applications


      With the rapid advances in positioning systems - such as Global Positioning System (GPS) and Radio-Frequency Identification (RFID) - and mobile computing technologies, managing up-to-date information about the locations of massive moving objects has become a critical area of research.

      In this work, we have proposed an R-tree-based index structure, called the Rsb-tree (R-tree with semi-bulk loading), for efficiently managing frequent updates from massive numbers of moving objects. The basic concepts of the Rsb-tree are threefold. 1) Update buffering: newly incoming updates are buffered in an in-memory buffer to reduce disk I/O. 2) Semi-bulk loading: by bulk-inserting buffered updates into the disk-resident R-tree, the total I/O cost is reduced drastically. 3) Deferring I/O-intensive operations: I/O-intensive operations (split, removal, overflow treatment, and underflow treatment) are needed to maintain a proper R-tree structure; we defer these operations by exploiting the in-memory buffer as much as possible.

      The in-memory buffer consists of an Object Registry (OR), a Histogram, and a Destruction List. The Object Registry is a set of object tuples hashed on oid. The Histogram is a two-dimensional histogram in which each cell holds the list of oids of the OR entries falling in that cell, supporting spatial search over the OR. The Destruction List is a set of object tuples hashed on oid, used to defer the deletion of obsolete entries. All I/O-intensive operations, such as insert, delete, split, and removal, are buffered or deferred as much as possible: inserts, deletes, and updates are handled entirely in the memory buffer. A flush is invoked when room must be made for a newly incoming update; FlushAll writes every OR entry into the R-tree, while FlushCell writes only the entries of the fullest histogram cell. Overflow and underflow treatments are deferred by relaxing the minimum-entry threshold of R-tree nodes. Garbage cleaning is performed once every K updates and visits every leaf node in a DFS manner.
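      The interplay of the Object Registry, the Histogram, and FlushCell can be sketched as follows. The class, the grid resolution, and the data are all invented for illustration; the actual Rsb-tree operates on a disk-resident R-tree rather than a Python list.

```python
from collections import defaultdict

GRID = 4  # hypothetical histogram resolution (GRID x GRID cells)

class RsbBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.registry = {}                 # Object Registry: oid -> (x, y)
        self.histogram = defaultdict(set)  # cell -> oids of OR entries in it
        self.flushed = []                  # stands in for the disk-resident R-tree

    def _cell(self, x, y):
        return (int(x * GRID), int(y * GRID))  # map a [0,1)^2 point to a cell

    def update(self, oid, x, y):
        if oid in self.registry:           # old position is replaced in memory
            ox, oy = self.registry[oid]
            self.histogram[self._cell(ox, oy)].discard(oid)
        elif len(self.registry) >= self.capacity:
            self.flush_cell()              # make room for the new update
        self.registry[oid] = (x, y)
        self.histogram[self._cell(x, y)].add(oid)

    def flush_cell(self):
        # FlushCell: bulk-write the fullest histogram cell, turning many
        # single-entry inserts into one semi-bulk insertion.
        cell = max(self.histogram, key=lambda c: len(self.histogram[c]))
        for oid in self.histogram.pop(cell):
            self.flushed.append((oid, self.registry.pop(oid)))

buf = RsbBuffer(capacity=3)
buf.update(1, 0.1, 0.1)
buf.update(2, 0.15, 0.12)  # lands in the same cell as object 1
buf.update(3, 0.9, 0.9)
buf.update(4, 0.8, 0.2)    # buffer full: the cell holding objects 1 and 2 is flushed
print(len(buf.flushed), sorted(buf.registry))
```

      The fourth update triggers FlushCell: the two objects sharing a cell are written out together, while the other entries stay buffered.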

      With a reasonable memory overhead (typically only 1 percent of the whole data set), extensive experimental evaluation revealed that the proposed approach far outperforms previous approaches in terms of update and query performance under a variety of realistic settings.

    3. (1-3) Transactional Stream Processing

      Figure: Stream Join

      A recent trend in data stream processing is the use of advanced continuous queries (CQs) that reference non-streaming resources, such as relational data in databases and machine learning models. Since non-streaming resources may be shared among multiple systems, they can be updated by those systems during CQ execution. As a consequence, CQs may reference resources inconsistently, leading to a wide range of problems, from inappropriate results to fatal system failures. We addressed this inconsistency problem by introducing the concept of transaction processing into data stream processing. We introduced the CQ-derived transaction, a concept that derives read-only transactions from CQs, and showed that the inconsistency problem is solved by ensuring serializability between the derived transactions and the resource-updating transactions. To ensure serializability, we proposed three CQ-processing strategies based on concurrency-control techniques: a two-phase lock strategy, a snapshot strategy, and an optimistic strategy. An experimental study showed that our CQ-processing strategies guarantee proper results, with performance comparable to that of a conventional strategy that can produce improper results.
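      The flavor of the snapshot strategy can be sketched with a toy multiversion store: a CQ-derived (read-only) transaction reads the resource versions that were current when it started, so a concurrent update cannot make a single window's results inconsistent. All names below are illustrative, not the actual system's API.

```python
class SnapshotStore:
    def __init__(self):
        self.clock = 0
        self.versions = {}   # key -> list of (commit_time, value)

    def write(self, key, value):
        self.clock += 1      # committing an update creates a new version
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot(self):
        ts = self.clock      # a derived transaction remembers its start time

        def read(key):
            # return the newest version committed at or before `ts`
            return max((v for v in self.versions.get(key, []) if v[0] <= ts),
                       key=lambda v: v[0])[1]
        return read

store = SnapshotStore()
store.write("model", "v1")
read = store.snapshot()      # a CQ window starts: derive a read-only transaction
store.write("model", "v2")   # a concurrent resource update commits
print(read("model"))         # every read in the window sees the same snapshot
```

      The update to "v2" is invisible inside the running window; a transaction derived for the next window would see "v2".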

    4. (1-4) Privacy-preserving Data Management on Clouds

      To address the privacy requirements of Database-as-a-Service (DBaaS) users, we have proposed privacy-preserving data management techniques. In the proposed system, a server hosts data encrypted by searchable encryption, and encrypted queries are processed on the server so that the service provider can read neither the data nor the queries. To handle the trade-off between query performance and the strength of the privacy guarantees, we focus on the Oblivious Index Traversal framework proposed by Hu et al. This framework uses indexes such as the B+-tree and the R-tree, which are widely used in database management systems, and its query algorithm traverses the index without decrypting node values by using an oblivious technique such as GT-SCOT.

      However, the security of that framework is limited because the order of records is revealed by the leaf nodes of the index. In our proposed framework, therefore, the index is separated into the index structure and the node entries; only the node entries are stored on the server, while the index structure is sent to the client. With this separation, the order of records is not revealed to the DBaaS service provider. In addition, we investigated index structures that are efficient in terms of the number of client-server communication rounds needed to traverse the index, and we applied a binary search algorithm for processing range queries using an oblivious transfer protocol. In our experiments, the query processing time of our proposed framework was 300 times faster than that of the existing framework using a B-tree index.

  2. 2) Data Mining and Knowledge Discovery

    1. (2-1) Outlier Detection


      In recent years many new techniques for collecting data have resulted in an increase in the availability of uncertain data. For example, many new hardware technologies such as sensors generate data which is imprecise, many scientific measurement techniques are inherently imprecise, and in many applications such as privacy-preserving data mining, the data is modified by adding perturbations to it.

      In this work, we focus on outlier detection over uncertain data. We adopt the distance-based approach because it is the simplest and most basic one, and it can be used as preprocessing before applying more sophisticated, application-dependent outlier detection techniques. We focus on uncertainty in attribute values, modelled by a Gaussian probability density function. Specifically, three problems are addressed: 1) distance-based outlier detection on uncertain static data, 2) top-k outlier detection on uncertain static data, and 3) continuous outlier detection on uncertain data streams.

      For the first problem, we gave a novel definition of distance-based outliers on uncertain data. Since the probability computation is expensive, a cell-based approach was proposed to speed up the outlier detection process: it identifies and prunes cells containing only inliers, and detects cells containing only outliers, based on bounds on the outlier score. We also proposed an approximate outlier detection approach using bounded Gaussian uncertainty. The basic idea is that the bounded Gaussian distribution is a good approximation of the Gaussian distribution and can increase detection efficiency at a small loss of accuracy.
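      The kind of probability the cell-based method bounds can be checked naively by Monte Carlo sampling: model each object as a Gaussian around its reported position and estimate the chance that it has at least k neighbours within distance d. All parameters and data below are hypothetical, and the exact outlier definition in the paper may differ.

```python
import math, random

random.seed(0)

def prob_enough_neighbours(obj, others, d, k, sigma=0.1, trials=10000):
    """Estimate P(object has >= k neighbours within distance d) when every
    point's true position is Gaussian around its reported position."""
    hits = 0
    for _ in range(trials):
        # sample a possible true position for every uncertain object
        ox = obj[0] + random.gauss(0, sigma)
        oy = obj[1] + random.gauss(0, sigma)
        n = 0
        for (px, py) in others:
            qx = px + random.gauss(0, sigma)
            qy = py + random.gauss(0, sigma)
            if math.hypot(ox - qx, oy - qy) <= d:
                n += 1
        hits += (n >= k)
    return hits / trials

cluster = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05)]
# an object inside the cluster almost surely has enough neighbours ...
p_in = prob_enough_neighbours((0.02, 0.02), cluster, d=0.5, k=2)
# ... while a far-away object almost surely does not (a likely outlier)
p_out = prob_enough_neighbours((5.0, 5.0), cluster, d=0.5, k=2)
print(p_in > 0.9, p_out < 0.1)
```

      Sampling like this is exactly the expensive computation the cell-based pruning avoids for most objects.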

      In addition, we presented an efficient top-k distance-based outlier detection approach for the second problem, and continuous outlier detection approach on uncertain time series data streams.

    2. (2-2) Web Page Ranking Based on Social Bookmark Analysis


      Web services called social bookmarks (SBMs) are attracting attention. Web pages are registered on an SBM according to users' interests, preferences, convenience, and a variety of individual review criteria. If a page is bookmarked by many users, it is likely to be informative and to have reliable content. A bookmark from an SBM user to a web page can thus be regarded as a vote, and the number of bookmarks a web page receives in an SBM is a barometer of its value.

      We proposed a ranking method called S-BITS that uses the bookmark relationships between bookmarkers and web pages. S-BITS ranks web pages based on a bipartite graph between bookmarked web pages and bookmarking users, in the manner of the HITS algorithm: it estimates a web page's value based on who bookmarks the page, and a bookmarker's value based on what pages that person bookmarks. When a bookmarker is knowledgeable and/or trustworthy, a page bookmarked by that person gets a high score. In our experiments, S-BITS yielded better results than an existing web search engine and SBRank, another bookmark-based page ranking method.
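      The mutual-reinforcement idea behind S-BITS can be sketched with a HITS-style iteration on the user-page bipartite graph. The tiny bookmark graph and the plain power iteration below are made up for illustration; the actual S-BITS scoring rules may differ.

```python
def sbits_like(bookmarks, iterations=50):
    """bookmarks: list of (user, page) pairs. Returns (page_score, user_score)."""
    users = {u for u, _ in bookmarks}
    pages = {p for _, p in bookmarks}
    user_score = {u: 1.0 for u in users}
    page_score = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a page is valuable if valuable bookmarkers bookmark it
        page_score = {p: sum(user_score[u] for u, q in bookmarks if q == p)
                      for p in pages}
        # a bookmarker is valuable if they bookmark valuable pages
        user_score = {u: sum(page_score[p] for v, p in bookmarks if v == u)
                      for u in users}
        norm = sum(page_score.values())      # normalise to avoid overflow
        page_score = {p: s / norm for p, s in page_score.items()}
        norm = sum(user_score.values())
        user_score = {u: s / norm for u, s in user_score.items()}
    return page_score, user_score

bm = [("alice", "p1"), ("alice", "p2"), ("bob", "p1"), ("carol", "p3")]
pages, users = sbits_like(bm)
best = max(pages, key=pages.get)
print(best)  # p1, bookmarked by the most (and best-connected) users
```

      Page p1 wins not only because it has two bookmarks, but because its bookmarkers themselves bookmark high-scoring pages.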

      In SBMs, many users neither update nor delete bookmarks pointing to obsolete web pages, and there are indeed many obsolete pages, such as old news and press releases that are no longer worth a look. To judge the current value of a web page based on an SBM, we should look at how, and how frequently, the page has been bookmarked up to now. One way to observe the degree of obsolescence is to compare a page's bookmarking rates between the past and the present. Based on this idea, we proposed S-BITS*, an improvement of S-BITS. We estimate the "activation level" of each page, which indicates how attractive the page is on a particular date; by analyzing the time-series variation of social bookmarking frequency, we estimate the activation level of a web page using a Hidden Markov Model (HMM). S-BITS* incorporates this activation-level concept. Our experiments showed that S-BITS* yields better rankings, suggesting that our estimation of activation levels is effective and matches the human sense of worth.

    3. (2-3) Microblog Analysis

      Recently, microblogs such as Twitter and Sina Weibo have been attracting considerable attention as a new type of information source on the Internet. Their distinguishing characteristic is that many users exchange short messages in real time, while they share features with other (ordinary) social media, such as SNSs. For this reason, extracting useful information from microblogs is becoming increasingly important. To address this problem, we conducted the following studies:

      TURank: Twitter User Ranking Based on User-Tweet Graph Analysis: In Twitter, the number of users is so large that it is hard for users to find useful accounts. In related work, the number of followers or the follow graph (the Twitter social graph) has been used to quantify users' authority. However, these measures are insufficient because they do not take users' posts into account, while content analysis over short messages often yields noisy and unreliable results. In this work, we proposed TURank (Twitter User Rank), an algorithm that measures Twitter users' authority scores by considering both the Twitter social graph and how tweets actually flow among users, taking retweets into account. More precisely, we construct a graph consisting of 1) Twitter users, 2) posts, 3) follow/followed edges, 4) post/posted edges, and 5) retweet/retweeted edges, and apply ObjectRank, a graph-analysis method for graphs containing heterogeneous nodes and edges, to rank the users with respect to all of these aspects. The experimental results showed that the proposed scheme ranks users more accurately than competing methods.

      Figure: Twitter Social Graph

      Tag-based User Topic Discovery Using Twitter Lists: There is a growing need to find useful information on Twitter, where an enormous amount of information is transmitted in real time. Twitter users play an important role as information sources, transmitting information based on their interests and preferences. Therefore, to identify useful information, it is important to know what topics users tend to transmit information about. In this work, we proposed a method to estimate user topics by tagging users with Twitter lists. Twitter lists are an official feature for creating and sharing a "user list"; the list members (users included in the list) often relate to a topic described by the list name. Experimental results show the effectiveness of the proposed method.

      Figure: Tag Generation from Twitter Lists

      Recommending Fresh URLs Using Twitter Lists: Recommender systems for social media have attracted considerable attention due to social media's inherent features, such as the huge amount of information, social networks, and real-time nature. In microblogs, one of the most popular forms of social media, most URLs posted by users are fresh (i.e., posted shortly after creation). Hence, recommending URLs in microblogs to appropriate users is important, because it lets users obtain such fresh URLs immediately. In this work, we proposed a URL recommender system using Twitter lists, an official feature for grouping users into a named list. Since the members of a list (i.e., users included in the list) are expected to have similar characteristics, we utilize this feature to capture user interests. Experimental results showed that our proposed method achieves higher effectiveness than methods based on the follow network, which does not express user interests explicitly.

      Landmark-Based User Location Inference in Social Media: Location profiles of user accounts in social media can be utilized in various applications, such as disaster warnings and location-aware recommendations. In this work, we proposed a scheme to infer users' home locations in social media. Most existing studies assume that connected users (i.e., friends) in social graphs are located in close proximity. Although this assumption holds for some fraction of connected pairs, connected pairs sometimes live far apart. To address this issue, we introduced the novel concept of local landmarks, defined as users who have many friends living in a small region. Local landmarks have desirable features for inferring users' home locations: they provide strong clues, and the locations of numerous users can be inferred from a small number of landmarks. Based on this concept, we proposed a landmark mixture model (LMM) to infer users' locations. Experimental results on a large-scale Twitter dataset showed that our method improves the accuracy of the state-of-the-art method by about 27%.

      Figure: Performance of the Proposed Scheme

    4. (2-4) GPU-based Acceleration of Data Mining

      GPGPU (general-purpose computing on graphics processing units) has been gaining much attention as a new computational platform. The idea is to exploit GPUs not only for graphics processing but also for general-purpose computation. GPGPU now covers a diverse range of applications, such as physical simulation, audio processing, video processing, and cryptography. We have exploited GPUs to accelerate various data mining tasks.

      Efficient Probabilistic Latent Semantic Indexing Using a Graphics Processing Unit: In this work, we proposed a scheme to accelerate Probabilistic Latent Semantic Indexing (PLSI), an automated document indexing method based on a statistical latent semantic model, by exploiting the high parallelism of the GPU. Our proposal is composed of three techniques: the first accelerates the Expectation-Maximization (EM) computation by casting it as GPU matrix-vector multiplication; the second uses the same principle but additionally exploits the sparseness of word-document co-occurrences; and the third uses concurrent kernel execution, available on the NVIDIA Fermi architecture, to speed up the process further. We compared the performance of the proposed scheme with a non-parallelized implementation; the results showed that our method can be more than 100 times faster than the CPU-based implementation in our environment. By exploiting the sparseness of the data, we can not only process more documents and words on the GPU, but also keep more data in device memory, avoiding the massive host-device data transfers that would degrade execution performance.

      Figure: Computing SPMF on GPU

      GPU Acceleration of Probabilistic Frequent Itemset Mining from Uncertain Databases: Uncertain databases have been widely developed to deal with vast amounts of data containing uncertainty. To extract valuable information from uncertain databases, several methods for frequent itemset mining, one of the major data mining techniques, have been proposed. However, their performance is unsatisfactory because handling uncertainty incurs high processing costs. To address this problem, we proposed a method for frequent itemset mining from uncertain databases using the GPU. The main idea is to speed up probability computations by making the best use of the GPU's high parallelism and low-latency memory. We also employed a bitstring-manipulation algorithm and data-parallel primitives to improve performance in other parts of the method. Extensive experiments showed that our proposed method is up to two orders of magnitude faster than existing methods. Moreover, we proposed multi-GPU methods for GPU-based clusters: we first developed methods for a single node with multiple GPUs and then extended them to a GPU cluster. The proposed methods reduce data dependencies by distributing candidate probabilistic frequent itemsets among GPUs, and they also address load balancing, an important issue for achieving scalability. Experiments showed that the single-node methods achieve near-linear speedups, and the methods on an eight-node GPU cluster achieve up to a 7.1-times speedup.
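      The probability computation at the core of probabilistic frequent itemset mining can be written down as a small dynamic program: given the existence probability of an itemset in each uncertain transaction, compute the probability that its support reaches the minimum support (a Poisson binomial tail). The pure-Python reference below illustrates the computation that the GPU method parallelises; it is a sketch, not the GPU kernel itself.

```python
def frequentness_probability(probs, minsup):
    """P(support >= minsup), where probs[i] is the probability that
    transaction i contains the itemset."""
    # dp[k] = probability that exactly k of the transactions seen so far
    # contain the itemset
    dp = [1.0] + [0.0] * len(probs)
    for p in probs:
        new = [0.0] * (len(probs) + 1)
        for k in range(len(dp)):
            if dp[k] == 0.0:
                continue
            new[k] += dp[k] * (1 - p)      # transaction lacks the itemset
            new[k + 1] += dp[k] * p        # transaction contains the itemset
        dp = new
    return sum(dp[minsup:])

# With three certain transactions, support >= 2 is guaranteed:
print(frequentness_probability([1.0, 1.0, 1.0], 2))   # -> 1.0
# With three fair coin flips, P(support >= 2) = 3/8 + 1/8 = 0.5:
print(frequentness_probability([0.5, 0.5, 0.5], 2))   # -> 0.5
```

      The DP is quadratic in the number of transactions, which is why evaluating it for many candidate itemsets benefits so much from GPU parallelism.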

    5. (2-5) Software Repositories Mining

      Figure: Support for Naming Identifiers

      Figure: Logical Coupling Detection

      Throughout the software development process, various data are recorded in repositories. We have proposed methods to support software development by leveraging these data.

      One example is support for naming identifiers in programs. Identifier names and their corresponding entities (i.e., naming cases) are collected from a large amount of source code, and an association-rule database is built from these naming cases. When a software developer is struggling to choose an identifier name, a ranked list of candidate names is provided using the rule database.
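      A toy version of this idea might look as follows: mine simple "context -> name" rules from collected naming cases and rank candidate names by confidence. The rule format and the data are invented for illustration; the actual method mines richer association rules from real source code.

```python
from collections import Counter, defaultdict

# naming cases: (type of the entity, identifier name) pairs harvested
# from a corpus of source code (hypothetical sample data)
cases = [
    ("int", "count"), ("int", "count"), ("int", "size"),
    ("String", "name"), ("String", "name"), ("String", "message"),
]

rules = defaultdict(Counter)
for entity_type, name in cases:
    rules[entity_type][name] += 1   # rule: entity type -> identifier name

def suggest(entity_type, top=3):
    """Return candidate names for a new identifier, best first,
    each with its confidence (relative frequency)."""
    total = sum(rules[entity_type].values())
    return [(name, n / total)
            for name, n in rules[entity_type].most_common(top)]

print(suggest("int"))     # "count" ranks above "size"
```

      A developer declaring a new `int` would be offered "count" first, because it is the most frequent name for that context in the corpus.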

      Another example is the detection of logical couplings across software repositories. A logical coupling is a relationship between entities, such as classes or files, indicating that the entities tend to be changed at the same time. Logical couplings are used for evaluating software modularity and for preventing regressions. Existing methods can detect logical couplings between entities stored in a single VCS repository, but not between entities stored in different repositories. We proposed a novel heuristic method to detect logical couplings across repositories. In our evaluation experiments, the proposed method detected couplings between an application and a library it uses, and between forked software projects.

  3. 3) XML and Web Programming

    XML (Extensible Markup Language) is a standardized format for semi-structured data and documents, and a growing amount of information is being accumulated in XML in various application domains. In addition, RDF (Resource Description Framework) is gaining popularity with the emergence of LOD (Linked Open Data). With the aim of investigating techniques for managing massive XML/RDF resources, we have conducted the following studies.

    1. (3-1) Online Analytical Processing of XML Data

      Given a set of XML data, it is desirable to extract valuable information from it by analytical processing, rather than by simple query retrieval. For this purpose, we developed a technique called XML-OLAP (Online Analytical Processing), which enables multidimensional analytical processing over XML data. More precisely, we proposed a novel formal definition of the XML data cube and related operators, along with efficient algorithms and indexing structures dedicated to the model and operators, to achieve interactive analysis of large XML datasets.

    2. (3-2) Parallel Processing of XML Data using PC-Clusters/Multi-core Processors


      For a large collection of XML data, queries must be processed as fast as possible. Meanwhile, owing to the commoditization of PC clusters and multi-core CPUs, it is becoming increasingly important to develop parallel processing methods for database systems, including XML databases, to achieve better query performance. In this work, we addressed the problem of parallel XML query processing using PC clusters and multi-core CPUs.

      First, we proposed a data partitioning scheme for XML data based on a grid metadata model (GMX) that provides a conceptual view for partitioning XML data, specifically for holistic twig-join processing (Fig. 11. XML Data Partitioning in GMX.). The proposed model adopts a cost model and provides a set of partition refinement methods for workload balancing. It significantly reduces workload variance on the cluster system, duplicates XML data where necessary to avoid data dependencies among cluster nodes, and exploits both inter-query and intra-query parallelism. Our experiments confirmed that the proposed partitioning method achieves better workload balance, which in turn yields better parallel speedup.

      Second, we proposed a parallel TwigStack algorithm for shared-memory multi-core systems to achieve scalable query performance on large XML data. Our scheme has the following features. First, we perform on-the-fly partitioning of the input streams of XML nodes for subsequent parallel execution, ensuring that the query solutions in a partition can be obtained by the TwigStack algorithm independently of other partitions. Second, we propose a scheme for estimating the optimal partition size for a given system configuration by taking the L2 cache size into account. Finally, we introduce a partition-prefetching technique to alleviate the overhead of on-the-fly partitioning. The experimental results demonstrated that our parallel algorithm works effectively and efficiently, with speedup scaling up to the number of available CPU cores.

    3. (3-3) A Framework for Faceted-navigation of XML Data

      With the growth of XML data, quick methods for finding desired information in massive XML datasets are required. However, conventional query languages for XML data, such as XPath and XQuery, require users to be familiar with both the language and the structure of the XML data, which is a barrier for many (non-expert) users.

      In this work, we proposed a framework for faceted navigation over XML data, in which users can navigate a collection of XML data through a simple query interface without issuing complicated queries. General faceted navigation schemes are used to browse objects (or records) with multiple properties; however, because XML is semi-structured in nature, applying faceted navigation to XML data is not straightforward. More precisely, we formulate faceted navigation over XML data by giving definitions of class, property, object, and facet in XML data, and then formulate typical user interactions in faceted navigation as operations over these concepts. We also built a system based on the proposed framework. Finally, experimental evaluations using the prototype system showed the effectiveness of our proposed scheme.

      Figure: Faceted Navigation over XML Data

    4. (3-4) Energy-efficient XML Stream Processing

      The rapid growth of energy consumption by computer systems has drawn considerable public attention in relation to global environmental phenomena such as global warming and carbon dioxide emissions. Consequently, there is a strong demand for computer systems that reduce energy consumption while maintaining high computational power. In this work, we proposed a scheme for energy-efficient XML stream processing through element-skipping parsing. Although parsing is one of the most computationally heavy parts of XML stream processing, existing techniques do not focus on its efficiency. To address this problem, our scheme performs element-skipping parsing: the parsing of XML elements that do not contribute to the final query result is skipped through coordination between the XML parser and the query processor. We showed that the scheme is effective in reducing both the execution time and the energy consumption of a stream-based XML query processor.
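      A much-simplified illustration of element skipping: the handler below tracks a "skip depth" and does no per-event work inside subtrees that cannot contribute to the query result. (In the actual scheme the parser itself skips those elements; this sketch only skips the downstream processing, and the document and query are invented.)

```python
import xml.parsers.expat

def count_matching(xml_text, wanted):
    """Count `wanted` elements under the root, skipping all processing
    inside other top-level subtrees. Returns (hits, processed_events)."""
    state = {"skip": 0, "hits": 0, "events": 0}

    def start(tag, attrs):
        if state["skip"]:
            state["skip"] += 1          # inside a skipped subtree: no work
            return
        state["events"] += 1            # events we actually processed
        if tag == wanted:
            state["hits"] += 1
        elif tag != "root":
            state["skip"] = 1           # irrelevant subtree: start skipping

    def end(tag):
        if state["skip"]:
            state["skip"] -= 1

    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start
    p.EndElementHandler = end
    p.Parse(xml_text, True)
    return state["hits"], state["events"]

doc = "<root><a><x/><x/></a><b/><b/><a><x/></a></root>"
print(count_matching(doc, "b"))  # two <b> hits; the <a> subtrees are skipped
```

      Of the ten elements in the document, only five start events are processed; the rest fall inside skipped subtrees, which is the source of the time and energy savings.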

    5. (3-5) RDF/LOD Data Processing

      Figure: Space-efficient Representation of RDF Data Using RDF Packages

      LOD (Linked Open Data) has been diffusing rapidly in step with the Open Data movement. Consequently, there is a global trend for governments, industries, and institutes to publish their data in RDF (Resource Description Framework) format via SPARQL endpoints. To better manage such massive RDF information resources, we conducted the following studies.

      Efficient Reasoning and Querying over Large-Scale RDF Data: When querying RDF and RDF Schema (RDFS) data, it is common, for performance reasons, to derive all triples according to the RDFS entailment rules before query processing. An undesirable drawback of this approach is that RDFS reasoning generates a large number of triples, so materializing the RDFS closure requires a considerable amount of storage space. In this work, we proposed RDF packages, a time- and space-efficient format for RDF data. In an RDF package, a set of triples of the same class, or of triples sharing the same predicate, is grouped into a dedicated node named a Package. Using packages, we can represent any metadata expressible in RDF. An important feature of RDF packages is that RDFS reasoning can apply the same entailment rules without modification, allowing us to use existing RDFS reasoners. We discussed the model of RDF packages and its rules, the transformation between RDF and RDF packages, and an implementation of RDF packages using an existing RDF framework. Finally, we evaluated the proposed scheme in terms of triple count, reasoning speed, and querying speed.
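      The space saving comes from factoring out what grouped triples share. The schematic sketch below groups triples by predicate into package-like records, so the shared predicate is stored once per group instead of once per triple; it is a plain-Python illustration of the grouping idea, not the actual RDF package format.

```python
from collections import defaultdict

triples = [
    ("alice", "rdf:type", "Person"),
    ("bob",   "rdf:type", "Person"),
    ("carol", "rdf:type", "Person"),
    ("alice", "knows",    "bob"),
]

def to_packages(triples):
    """Group triples sharing a predicate under one package-like record."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[p].append((s, o))
    # one package per predicate: the predicate is stored once per package
    return [{"predicate": p, "members": members}
            for p, members in groups.items()]

packages = to_packages(triples)
for pkg in packages:
    print(pkg["predicate"], len(pkg["members"]))
# the four triples collapse into two packages
```

      Reconstructing the original triples is a simple expansion of each package, which is what lets standard entailment rules be applied unchanged.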

      An ETL Framework for Online Analytical Processing of Linked Data: In this work, we proposed an ETL framework for the online analytical processing of Linked Data. More and more data are published online in machine-readable formats, and Linked Data is one of the major ways to do so. Such data often contain numerical values as well as text, and there is strong demand for analytical processing over them using existing OLAP systems. Our framework streamlines the ETL process from Linked Data to multidimensional data models for OLAP. Unlike related approaches, it does not require any dedicated RDF vocabulary for multidimensional analysis; instead, we exploit the relationships among entities, inherent hierarchical structures, and external references to derive multidimensional schemas and semantic hierarchies. This allows us to perform OLAP analysis over general Linked Data even when they are not described with OLAP-specific RDF vocabularies. Our experiments demonstrated that the scheme can be applied to existing datasets.

    6. (3-6) Privacy-preserving Database Querying


      Privacy-preservation techniques are gaining considerable public attention due to the global diffusion of the Internet and its services, such as SNSs. In particular, in cloud-computing environments, where computational resources, including database contents, are hosted by a cloud service provider, protecting data privacy is crucial.

      Encryption can provide strong security for sensitive data against inside and outside attacks. This is especially true in the Database-as-a-Service model, where confidentiality and privacy are important issues for the client. However, existing encryption approaches are vulnerable to statistical attack because each value is always encrypted to the same fixed value. We presented a novel database encryption scheme called MV-OPES (Multivalued Order-Preserving Encryption Scheme), which allows privacy-preserving queries over encrypted databases with an improved security level. Our idea is to encrypt a value to multiple different values, preventing statistical attacks, while preserving the order of the integer values so that comparison operations can be applied directly to encrypted data (Fig. 14. Multi-valued Order-preserving Encryption.). Using calculated distances (ranges), we proposed a novel method that allows inequality joins between relations over encrypted values. We also presented techniques to offload query execution to the database server as much as possible, making better use of server resources in a database-outsourcing environment. Our scheme can easily be integrated with current database systems, as it is designed to work with existing indexing structures, and it is robust against statistical attacks and the estimation of true values. Our experiments showed that MV-OPES achieves security for sensitive data with reasonable overhead, establishing the practicability of the scheme.
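      The one-to-many, order-preserving idea can be conveyed by a toy scheme: give each plaintext integer its own disjoint interval of ciphertext space and encrypt by picking a fresh random value inside that interval. The interval layout and width below are hypothetical and omit the keyed, randomized interval boundaries that give MV-OPES its actual security; this is only a structural sketch.

```python
import random

WIDTH = 1000  # ciphertext interval width per plaintext value (hypothetical)

def encrypt(value, rng):
    """Map `value` to a random ciphertext in its own interval
    [value * WIDTH, (value + 1) * WIDTH)."""
    low = value * WIDTH
    return low + rng.randrange(WIDTH)

rng = random.Random(42)
cts = [encrypt(7, rng) for _ in range(5)]  # same plaintext, many ciphertexts
c3 = encrypt(8, rng)

print(len(set(cts)) > 1)          # one-to-many: repeated values differ
print(all(c < c3 for c in cts))   # order preserved across ciphertexts
```

      Because equal plaintexts no longer share a ciphertext, simple frequency analysis fails, yet every ciphertext of 7 still compares below every ciphertext of 8, so range predicates run directly on the encrypted column.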

  4. 4) Database in Science Domains

    As a group in the Center for Computational Sciences (CCS), University of Tsukuba, we actively collaborate with other research divisions/groups in CCS, as well as with those in other research institutes, such as AIST and ISAS/JAXA.
    1. (4-1) Faceted-Navigation System for QCDml Ensemble XML Data

      The International Lattice Data Grid (ILDG) is a data grid for sharing and exchanging lattice QCD gauge configurations among researchers in particle physics. To help domain researchers find desired data, we have developed a faceted navigation system for QCDml ensemble XML data, an XML-based metadata format for such datasets. A faceted navigation system allows users to search for desired information in an exploratory way, enabling them to browse a set of XML data without using specialized query languages such as XPath and XQuery. However, designing a faceted navigation interface for XML data is not straightforward due to the flexible, tree-like nature of XML.

      In this work, we designed and implemented a dedicated faceted navigation system for QCDml on top of an XML database. The interface is designed with the domain experts' usability in mind. We also paid attention to the system's performance: faceted navigation is, in general, computationally expensive because it requires aggregate computation for each available facet. To alleviate this, we use a relational database system as the engine to speed up the aggregate computation. The implemented faceted navigation system has been made available on the Web.


    2. (4-2) X-ray Outburst Detection from X-ray Astronomy Data

      An X-ray outburst is a phenomenon in which an X-ray object emits strong X-rays over a short period of time. Prof. Ebisawa (ISAS/JAXA) and his group have found that some X-ray objects exhibit light curves with similar shapes. To exhaustively survey such similar light curves in massive X-ray astronomy data, we applied similarity search techniques for time-series data.
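      A minimal form of such a similarity search slides a query pattern over a long time series and reports windows whose z-normalised Euclidean distance falls below a threshold. The light curve, the burst shape, and the threshold below are synthetic, for illustration only; the actual survey uses indexing to avoid scanning every window.

```python
import math

def znorm(xs):
    """Z-normalise a window so the match is invariant to brightness offset
    and scale (constant windows are left as zeros)."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs)) or 1.0
    return [(x - m) / sd for x in xs]

def find_similar(series, query, threshold):
    """Return start indices of windows z-normalised-close to `query`."""
    q = znorm(query)
    hits = []
    for i in range(len(series) - len(query) + 1):
        w = znorm(series[i:i + len(query)])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, w)))
        if dist <= threshold:
            hits.append(i)
    return hits

# a flat light curve with two outburst-like spikes of the same shape
series = [0.0] * 20
for start in (3, 12):
    for j, v in enumerate([1.0, 4.0, 2.0, 0.5]):  # rise-and-decay burst
        series[start + j] += v

print(find_similar(series, [1.0, 4.0, 2.0, 0.5], 0.01))  # -> [3, 12]
```

      Z-normalising each window means two outbursts of the same shape match even if their absolute fluxes or baselines differ, which is the property needed when comparing light curves of different objects.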

      Figure: Detected Light Curves