questions on aggregation of data sources
Hi, I would like to know if the Astoria team could comment on how aggregation of Astoria feeds will work. I'm more interested in the scenario where two data services are seen as one dynamically. I'm not as interested in synchronization or replication. (Replication requires political consent from both sides, and in my experience with DBs it may be more challenging.) Is it SSE?
In practice, suppose there are 2 Astoria data sources and I want a client to consume and view them as 1 source. How can that be done elegantly with minimal plumbing?
How is Astoria going to maintain the identity of entities from disparate sources? Do the keys we see encoded in URIs have the notion of namespaces to qualify the entities uniquely across data sources?
Finally, a request: please keep and enhance the support for RDF!
And please add a SPARQL (data) adapter.
Regards,
Gustavo Frederico
I am interested in hearing what the Astoria team has to say. Personally, I dislike the SPARQL syntax and I find the default Astoria syntax to be clear, concise, and powerful enough for what I need.
I am trying to figure out what kind of purpose would be involved in aggregating two Astoria services that both have the same schema (they would have to have the same schema in order for them to be able to be aggregated). Basically if you know you've got the same schema, you've got the same mapping, so you have the same underlying model... I've encountered this same issue once before where I had two databases with the same data in two places. My first go-round of design involved exposing each database (let's call 'em West Coast and East Coast, for simplicity's sake) as a service. I then wrote a wrapper service that basically did unions of each coast data and then performed the appropriate filtering and such.
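For illustration, here is a minimal Python sketch of that kind of union-then-filter wrapper. In-memory lists stand in for the two coast services. The names `fetch_west`, `fetch_east` and `union_orders` are hypothetical and none of this is Astoria API; each row is simply tagged with the source it came from.

```python
from typing import Callable, Iterable, Iterator

Order = dict  # one order row, e.g. {"id": 1, "customer": "ACME", "total": 120.0}

# Hypothetical stand-ins for the West Coast and East Coast services.
def fetch_west() -> list[Order]:
    return [{"id": 1, "customer": "ACME", "total": 120.0},
            {"id": 2, "customer": "Initech", "total": 40.0}]

def fetch_east() -> list[Order]:
    return [{"id": 1, "customer": "Globex", "total": 300.0}]

def union_orders(sources: dict[str, Callable[[], Iterable[Order]]],
                 predicate: Callable[[Order], bool] = lambda o: True) -> Iterator[Order]:
    """Yield rows from every source, tagged with the source name, then filtered."""
    for name, fetch in sources.items():
        for row in fetch():
            tagged = {**row, "source": name}  # remember where the row came from
            if predicate(tagged):
                yield tagged

# Union both coasts, keeping only orders of at least 100.
big_orders = list(union_orders({"west": fetch_west, "east": fetch_east},
                               lambda o: o["total"] >= 100))
```

Note that in this sketch the filter only runs after every row has been fetched from both sources, so the wrapper pays the full transfer cost even when few rows match.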
There are two ways I would go about this. If you write the downstream aggregator as a service consumer and publisher, then you run into the issue the original poster mentioned: how do you maintain the identities across multiple services, etc? Unfortunately, the issue of maintaining identity across two databases is beyond what Astoria can do. In my case I used GUIDs for the identities instead of autonumbering fields. This made it so that an order from the west coast database and an order from the east coast database could never collide no matter how transactionally heavy the DBs were.
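A small Python sketch of why GUID keys help when two independently keyed stores are merged. Plain dictionaries stand in for the two databases and the order labels are made up. Strictly speaking, random GUIDs make collisions astronomically unlikely rather than impossible.

```python
import uuid

# Autonumbered keys: each database counts from 1 independently.
west_auto = {1: "west order A", 2: "west order B"}
east_auto = {1: "east order A", 2: "east order B"}
merged_auto = {**west_auto, **east_auto}  # east rows overwrite west rows with the same key

# GUID keys: generated independently on each side, no coordination needed.
west_guid = {uuid.uuid4(): "west order A", uuid.uuid4(): "west order B"}
east_guid = {uuid.uuid4(): "east order A", uuid.uuid4(): "east order B"}
merged_guid = {**west_guid, **east_guid}

print(len(merged_auto))  # 2: two west orders were overwritten by key collisions
print(len(merged_guid))  # 4: every order survives the merge
```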
In your scenario, I think, assuming both your back-end data stores are SQL Server 2005, rather than exposing them both as individual Astoria services, you can get more bang for your buck by doing server-side aggregation. Use some SQL Server magic to expose the combination of both databases as a single connection string. From that connection string, you then allow your Astoria service to connect. This way, SQL handles the combination of rows, SQL does all the unions on the server (rather than you wasting bandwidth by sucking down two entire tablesets from two remote locations only to do a union that might result in only a single row of data for the end user), and you get your single Astoria service that can still do all CRUD (Create, Retrieve, Update, Delete) operations.
In short, rather than using Astoria to aggregate, use SQL Server to aggregate and just connect your entity model to the aggregation point. This way, you can even dynamically add multiple aggregation sources (note that aggregation is also a function of replication in SQL Server...) without your Astoria service ever having to be modified. Take the classic "branch office" example. You've got an "orders" database in NYC and Los Angeles. You're aggregating these into a single point that can then be consumed by Astoria/EDM. Now you bring up your Hong Kong office. This is a trivial SQL administration task in this architecture and requires no additional coding on your part, and your bandwidth consumption will still be fixed (still doing server-side filtering instead of sucking entire tables to do aggregation-point-filtering at the Astoria level).
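To make the "single aggregation point" idea concrete, here is a rough Python sketch that uses SQLite as a stand-in. This is an assumption for illustration only: it is not SQL Server and not its linked-server feature, and the `orders` columns and the `all_orders` view name are made up. The point is just that the consumer queries one view and the filter runs inside the database engine.

```python
import sqlite3

# One connection plays the role of the aggregation point.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS nyc")
conn.execute("ATTACH DATABASE ':memory:' AS la")

# Each office has its own orders table with the same (hypothetical) schema.
for office in ("nyc", "la"):
    conn.execute(
        f"CREATE TABLE {office}.orders (id TEXT PRIMARY KEY, customer TEXT, total REAL)"
    )

conn.execute("INSERT INTO nyc.orders VALUES ('n-1', 'ACME', 120.0)")
conn.execute("INSERT INTO la.orders VALUES ('l-1', 'Globex', 300.0)")
conn.execute("INSERT INTO la.orders VALUES ('l-2', 'Initech', 40.0)")

# A TEMP view may span attached databases; consumers only ever see this view.
conn.execute("""
    CREATE TEMP VIEW all_orders AS
    SELECT 'nyc' AS office, id, customer, total FROM nyc.orders
    UNION ALL
    SELECT 'la' AS office, id, customer, total FROM la.orders
""")

# The filter is evaluated by the database engine, not by the caller.
rows = conn.execute(
    "SELECT office, id FROM all_orders WHERE total >= ? ORDER BY id", (100,)
).fetchall()
print(rows)  # [('la', 'l-1'), ('nyc', 'n-1')]
```

In this sketch, bringing up a Hong Kong office would mean attaching one more database and redefining the view; the consumer's query against `all_orders` would stay the same.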
If this made sense, fantastic. If not, let's blame it on my lack of coffee today.
Thanks for your insights, Kevin.
As for SPARQL, I couldn't quite understand what you view as a drawback in SPARQL. Astoria's "syntax" is really a URL encoding scheme. SPARQL on the other hand is much closer to SQL. 'Clearness' and 'powerfulness' are somewhat subjective. We could compare the 'conciseness'. But I don't think that's the point. The point is that SPARQL implementations, from the architectural point of view, integrate well with RDF data. And I would say this criterion is much more important than syntax.
As for the uniformity of schemas, I don't see the necessity for the underlying model to be exactly the same. The more useful scenarios are when there are foreign keys in the different data sources.
I do not assume the underlying databases are SQL Server. For that matter, I don't even assume there is a database binding. But for the sake of simplicity, even if you assume there are 2 SQL Server databases, I still foresee drawbacks and challenges in aggregating data below Astoria in certain scenarios.
1. Database ownership. Often databases are owned by different groups. Or different companies.
2. Extensibility. What if the company decides to migrate one database to SAP?
3. Not really SOA. Integrating databases directly may bypass business logic, security rules, etc.
4. Caching problems. You may have an external system updating the database 'behind the back' of the original system that made a 'closed world assumption'.
5. This resembles the pre-browser debate: "Why do I need a browser if I can browse CERN's data in some client system or internal system?"
6. Not the ideal tool for the job. I know this is dangerous ground, but all of a sudden the database grew from providing reliable data storage to an application server. Is the database an ideal place to host services? I know these are data services, but still... Having said that, if you told me tomorrow that Katmai will host (native) Astoria services I would be very glad!
I suppose EDM would be a good place to aggregate and provide a higher level conceptual model. But again that's not the point.
I like the services part of Astoria, and the standard web protocols it employs. In the future, companies will have multiple internal (Astoria) data services, or may integrate with external services. It is perfectly reasonable for there not to be a detailed answer at this stage. I'm just looking for an assessment of the candidate alternatives.
Cheers,
Gustavo
What I wouldn't want from SPARQL is to lose the ability to clearly define what I want on the URL, which I really like. If I have to resort to SQL-like syntax, then Astoria stops being so appealing to me.
I agree with all your points as well... I think a service that aggregates Astoria services would be pretty fantastic... I think my point was that Astoria may not necessarily be the tool best suited to Astoria aggregation. Maybe some tool that just sits on top of WCF and is an enhancement to the Astoria client library would be a better fit.
Here are some thoughts on the topic.
SPARQL and query in general:
My take on this is that the URI syntax should remain "resource oriented", meaning that it has not been designed to be a general purpose query language. It's definitely far from the expressivity of SQL (or SPARQL for that matter), and that is by design. I think that the URI syntax should not go much further than it is now (I'll elaborate about this shortly, watch the Astoria Team blog for details); if we find ourselves in need of a fully-featured query language, I think that the way to address the requirement would be to have an optional, closed-by-default entry point that can receive and execute SQL. I don't see a point in inventing a new query language that can do all that SQL can do but is just syntactically different. Just to be clear, in my opinion in most cases you do not want to have an open SQL channel in a web-facing system...but if the scenario becomes relevant, that's what I would suggest doing.
As for SPARQL in particular, well...it would be interesting in the context of RDF, but I'm not sure it would fit well with the other formats. Independently of how it works, one key thing to understand is that so far in Astoria we have treated RDF more as a "serialization" or "presentation" format rather than the underlying data model (which is EDM). So executing RDF queries would be challenging both from the functional perspective (needs some form of mapping of the data models) and technically.
Integrating data sources:
This largely depends on the nature of the data to be integrated and the software that manages that data. If you have several data sources, all in relational databases, then maybe using SQL Server's distributed query ("linked servers") support is the answer. If you have data coming from data sources of varying nature, then using a middle-tier piece of software would make sense. Whether that is Astoria or not depends a lot on the scenario. The way we are architecting Astoria will allow you to bring your own data sources (I also need to elaborate on this...will also post to the team blog), so in some cases you'll be able to use it for integration, although it has not been the primary design point for the system.
In both cases it's still early in the cycle and we're learning from customers about requirements and expectations, so this is a great discussion thread.
Pablo Castro
Technical Lead
Microsoft Corporation
http://blogs.msdn.com/pablo