2008.07.18 Friday 18:08
The Stanford Data Stream Management System
Speaker: Jennifer Widom
The talk went on endlessly...
○Data Stream
・Continuous, unbounded, rapid, time-varying streams of data elements
・Occur in a variety of modern applications
Network monitoring and traffic engineering
Sensor networks, RFID tags
Telecom call records
Financial applications
Web logs and click-streams
Manufacturing processes
○Using Conventional DBMS
・Data stream as relation inserts, continuous queries as triggers or materialized views
・Problems with this approach
Inserts are typically batched, with high overhead
Expressiveness: simple conditions (triggers), no built-in notion of sequence (views)
No notion of approximation or resource allocation
Current systems don't scale to large numbers of triggers
Views don't provide streamed results
○STREAM System Challenges
・Must cope with:
Stream rates that may be high, variable, bursty
Stream data that may be unpredictable, variable
Continuous query loads that may be high, variable
○STREAM System Features
・Aggressive sharing of state and computation among registered queries
・Careful resource allocation and use
・Continuous self-monitoring and reoptimization
・Graceful approximation as necessary
○CQL Example Query1
Two streams, contrived for ease of examples:
Orders(orderID, customer, cost)
Fulfillments(orderID, clerk)
Total cost of orders fulfilled over the last day by clerk "Sue" for customer "Joe"
Select Sum(O.cost)
From Orders O, Fulfillments F [Range 1 Day]
Where O.orderID = F.orderID And F.clerk = "Sue" And O.customer = "Joe"
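To get a feel for what [Range 1 Day] means here, the following is a rough Python sketch of how Query1 could be evaluated by hand. It is not how STREAM implements it. The class name `Query1`, its methods, integer-second timestamps, in-order arrival, and the inclusive window boundary are all assumptions made for illustration. Orders has no window in the query, so the sketch keeps every order; only Fulfillments is restricted to the last day.

```python
from collections import deque

DAY = 24 * 60 * 60  # window length in seconds (assumed time unit)


class Query1:
    """Hypothetical evaluator for Sum(O.cost) with clerk "Sue" and customer "Joe"."""

    def __init__(self, range_seconds=DAY):
        self.range_seconds = range_seconds
        self.orders = {}       # orderID -> (customer, cost); Orders is unwindowed
        self.window = deque()  # (timestamp, orderID, clerk), oldest first

    def on_order(self, order_id, customer, cost):
        self.orders[order_id] = (customer, cost)

    def on_fulfillment(self, ts, order_id, clerk):
        # Assumes fulfillments arrive in timestamp order.
        self.window.append((ts, order_id, clerk))

    def result(self, now):
        # Drop fulfillments older than the window (boundary choice is an assumption).
        while self.window and self.window[0][0] < now - self.range_seconds:
            self.window.popleft()
        total = 0  # unlike SQL Sum, this returns 0 rather than NULL when nothing matches
        for _, order_id, clerk in self.window:
            order = self.orders.get(order_id)
            if clerk == "Sue" and order is not None and order[0] == "Joe":
                total += order[1]
        return total
```

Each call to `result(now)` re-scans the window, which is wasteful but keeps the window semantics easy to see; an incremental version would adjust the sum as tuples enter and leave.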
○CQL Example Query2
Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost
Select F.clerk, Max(O.cost)
From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample
Where O.orderID = F.orderID
Group By F.clerk
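A small Python sketch of the partitioned window may help: [Partition By clerk Rows 5] keeps a separate 5-row window per clerk. The function name `max_cost_per_clerk`, its parameters, and the choice to sample before windowing are assumptions for illustration, not STREAM's implementation. The random sample is only approximately 10% on any finite input.

```python
import random
from collections import defaultdict, deque


def max_cost_per_clerk(orders, fulfillments, rows=5, sample_rate=0.10, rng=None):
    """orders: dict orderID -> cost.
    fulfillments: iterable of (orderID, clerk) in arrival order.
    Returns dict clerk -> max cost over that clerk's last `rows` sampled fulfillments."""
    rng = rng or random.Random()
    # One bounded window per clerk; deque(maxlen=rows) evicts the oldest row itself.
    recent = defaultdict(lambda: deque(maxlen=rows))
    for order_id, clerk in fulfillments:
        if rng.random() < sample_rate:  # keep roughly sample_rate of the stream
            recent[clerk].append(order_id)
    result = {}
    for clerk, order_ids in recent.items():
        # Join with Orders on orderID; fulfillments with no matching order are skipped.
        costs = [orders[o] for o in order_ids if o in orders]
        if costs:
            result[clerk] = max(costs)
    return result
```

Passing `sample_rate=1.0` disables sampling, which makes the window behaviour easy to check by hand.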
○CQL Example Query4
Relation CurPrice(stock,price)
Select stock, Avg(price)
From Istream(CurPrice) [Range 1 Day]
Group By stock
・Average price over last day for each stock
・Istream provides history of CurPrice
・Window on history (back to relation), group and aggregate
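The relation-to-stream-to-relation round trip can be sketched in Python. This assumes the usual reading of Istream: at each time step it emits the tuples that are in the relation now but were not in it at the previous step. The names `istream` and `avg_price_last_range`, the snapshot representation, and the inclusive window boundary are assumptions for illustration.

```python
from collections import defaultdict


def istream(snapshots):
    """snapshots: list of (timestamp, set of tuples) in time order.
    Yields (timestamp, tuple) for each tuple newly present at that timestamp."""
    previous = set()
    for ts, relation in snapshots:
        for row in sorted(relation - previous):
            yield ts, row
        previous = relation


def avg_price_last_range(snapshots, now, range_len):
    """Average price per stock over inserted tuples with now - range_len <= ts <= now."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, (stock, price) in istream(snapshots):
        if now - range_len <= ts <= now:  # window applied to the history stream
            sums[stock] += price
            counts[stock] += 1
    return {stock: sums[stock] / counts[stock] for stock in sums}
```

For example, if CurPrice holds one row per stock and a price change replaces that row, every price change shows up as one insert in the history, and the window then averages those inserts.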
○Memory Overhead in Query Processing
・Queues+State
・Continuous queries keep state indefinitely
・Online requirements suggest using memory rather than disk
But we realize this assumption is shaky
○Reducing Memory Overhead
1. Exploit constraints on streams to reduce state
2. Enable state sharing within and across queries
3. Specialized operator scheduling to reduce queue sizes
○Operator Scheduling
・Global scheduler invokes run method of query plan operators with "timeslice" parameter
・Many possible scheduling objectives: minimize latency, memory use, computation, inaccuracy, starvation
1. Round-robin
2. Minimize queue sizes
3. Minimize combination of queue sizes and latency
4. Parallel versions of above
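As a toy illustration of a global scheduler calling each operator's run method with a timeslice, here is a Python sketch. The `Operator` class, measuring the timeslice as a number of tuples, and the `longest_queue_first` heuristic are assumptions of mine; the greedy heuristic is only a naive stand-in for "minimize queue sizes", not the algorithm STREAM uses.

```python
from collections import deque


class Operator:
    """Hypothetical query plan operator with one input queue."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        self.queue = deque()
        self.output = []

    def run(self, timeslice):
        """Process at most `timeslice` queued tuples; return how many were processed."""
        done = 0
        while self.queue and done < timeslice:
            self.output.append(self.fn(self.queue.popleft()))
            done += 1
        return done


def round_robin(operators, timeslice):
    """Visit operators in a fixed cycle until every queue is empty."""
    order = []
    while any(op.queue for op in operators):
        for op in operators:
            if op.run(timeslice):
                order.append(op.name)
    return order


def longest_queue_first(operators, timeslice):
    """Greedy heuristic: always run the operator with the largest queue."""
    order = []
    while any(op.queue for op in operators):
        op = max(operators, key=lambda o: len(o.queue))
        op.run(timeslice)
        order.append(op.name)
    return order
```

Both functions return the order in which operators were run, so the two policies can be compared on the same input queues.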