How can you model this without a circular reference?
This might be in the wrong forum; sorry. I have a really tough data modeling problem, and I'm dumbfounded. I hope there's somebody with an idea.
The problem really deals with aquatic habitat and vegetation, but I'll frame it in a more familiar context.
It's a very common scenario. An employee works for a company. The employee is assigned to a project.
--
Company --> Project --> Client
--
| |
| |
-
Employee --> Consultant
-
If you add this "consultant" relation between employee and project, isn't that a circular reference? Doesn't that violate like 4th normal form or something?
Is there anyone with any ideas?
Thanks!
[991 byte] By [
missle] at [2007-12-25]
There are a bunch of ways to go on this. Are we talking about database design or object model? It doesn't really matter in the abstract I guess...
You've defined your entities just fine from what I can see but the magic isn't in the entities. The real work happens in the relationships between entities and I think that's what you may be missing. The best description I can give is the classic: "Is-A v/s Has-A". Here goes my 101 explanation:
Company - HAS A - Project
Project - HAS A - Client
Company - HAS A - Employee (semantics aside...)
Consultant - HAS A - Project
Employee - IS A - Consultant
HAS A implies an ownership or property relationship. So the Company object would have a Project object property or Projects collection. Project would then have a Client object property etc.
IS A implies an inheritance relationship. So a Consultant object would inherit from the Employee object because an Employee could end up fulfilling the role of Consultant. Vice President would also inherit from Employee for the same reason but the base Employee functions would always be available. Also, does a Consultant have a Project or does a Project have a Consultant? Perhaps there is a context that you must come in with depending upon the situation. e.g. I am a Consultant and I have many Projects but each Project has one or more Consultants. Depending upon which direction you are viewing from, you could have a Project with a collection of Consultants or a Consultant with a collection of Projects. Both are correct depending upon what question you are planning to ask of the model.
The language is important here because you may decide that Client - HAS A - Project and this would dramatically change the way you created your model.
Hope this helped a bit.
This is a great explanation, Jim, especially your last point. I can't see where the fourth normal form is violated and it looks fine, but it really depends on (as Jim states) the intended relationships -- especially the cardinality. For example, can a project be associated with more than one company? Can more than one consultant work on a project? I'm wondering if there is a concrete way to express "IS A" and "HAS A" in terms of database PK/FK assignments? Does "IS A" mean having a FK to the parent in the PK?
There shouldn't be a problem. An important thing to look at in this schema is the direction of the arrows.
You could write it down like this:
- Employee refers to CompanyId
- Project refers to CompanyId
- Project refers to ClientId
- Consultant refers to EmployeeId
- Consultant refers to ProjectId
So Company breaks any circular reference as it doesn't refer to anything else.
I thought about this a little more and I don't think I was clear. Circular references themselves aren't automatically a problem, nor do they necessarily break normalization. Think about a "department" table with a column "parent" which has a FK constraint referring to the same table's PK columns. Tables that simply reference each other don't violate normalization so long as the relationships are 1:1.
What I was trying to get at with applying the "IS A" vs. "HAS A" concepts from object-oriented programming is that you need to decide how those concepts map to PK/FK declarations. I would think "IS A" means (in this case) that the PK of the consultant table is an FK to the employee table.
Thanks for all the great feedback. Maybe my original post was not clear enough. I am looking just from a data model standpoint rather than OO. I wish I could find a way to create a better ERD in text, but you seemed to have documented the relationships we are talking about.
Here is the problem, at least from an academic standpoint, which is what I am concerned with. Using the relationships in the database, I can navigate from employee to company to project to consultant and then to employee again.
I think that I should not be able to do that. I think that any type of a closed loop in the ERD indicates a logical modeling error. It indicates some sort of redundant relationship. So, my question is, am I right about that?
ASIDE: I actually am a mathematician, but I seem to be not fully understanding the relational algebra. Like, when you do a projection join on those relationships, you get bad data or something. The only "out" I can think of is that while walking those relationships, you lose some sort of specificity. That is, going from company to project, you lose the specificity of which employee you are talking about. But, it seems more likely that the answer lies in the relational algebra.
tough one...
To your question, going from Company to Project has no bearing on Employee. Its a different question. IF Consultant is a kind of Employee, there is no need to navigate between them. Projects to not get Employees, they get Consultants. When you look at a Consultant, you are esentially looking at a flavor or Employee. All Employee information should be available to you.
I'm not a biologist and I know zip about marine plant life but if you have a flower, you get petals. Its what makes it a flower. If you have a petunia, you get specific kinds of petals and other properties that make that flower a petunia. So when you look at a petunia, are you looking at a petunia or a flower? The answer is "yes", you're looking at both. When you look at a Mexican petunia, are you looking at a flower, a petunia or a Mexican petunia? You get the point.
So here's what I'd do for the database. Think of it as pseudo-ORM.
Tables (each with TableNameID as the PK)
Company
Project
Client
Employee
Consultant
ProjectConsultant
Company.CompanyID = Project.CompanyID
- Company has many Projects
- Project has only one Company
Company.CompanyID = Employee.CompanyID
- Company has many Employees
- Employee has only one Company
Project.ClientID = Client.ClientID
- Project is for only on Client
- Client may have many Projects for a Company
Employee.EmployeeID = Consultant.EmployeeID (1 - 1 relationship)
- Employee is a Consultant because he works on Projects
- Consultant is an Employee becuase Company signs his check
Consultant.ConsultantID = ProjectConsultant.ConsultantID
- Consultant is on many Projects through time
- ProjectConsultant contains a list of Project/Consultant mappings
Project.ProjectID = ProjectConsultant.ProjectID
- Project has many Consultants
- ProjectConsultant contains a list of Project/Consultant mappings
Notes:
With this normalization approach you can also store critical information about the assignment of a Project to a Consultant like the date of assignment, duration, rate index or role). Information in this table is informative only in the context of a Consultant's assignment to a Project.
You will store in Consultant only the information related to an Employee acting as a Consultant. Likewise, Employee contains only generic information related to all employees. This is referred to as a horizontal partition to your tables. One cannot exist without the other in the context of Projects.
You could theoretically make EmployeeID the primary key of the Consultant table. It doesn't matter in any way I can think of and makes the navigation a little easier.
P.S. I googled the petunia thing...
If you actually have a 1:1 relationship between Consultant and Employee, you should at least consider putting them in a single table and add a field like EmployeeType that can indicate an employee as a consultant. The reason for this is exactly what axshon explained about the flowers. A consultant is an employee, so unless you have a good reason for storing it in another table, you should store it in the employee table. The fact that it is a consultant is just another peace of data.
The reason for this is scalability / flexibility. This applies even more when you move to things like storing species, because it is just inpractical to make a new table for every singe one you encounter, and even more inpractical to program for that datamodel.
Hey Jonathan,
Great point! I probably didn't emphasize that enough. I agree that in many (most?) cases you DO want to just store that data in a single table but an app I worked on recently had some pretty compelling reasons for horizontal partitioning;
A CRM system for a manufacturer needs prospects and customers. Prospects have a load of information, much of it required, related to demographics and marketing. Customers have a different set of data, much of it also required, related to sales. Both customers and prospects share common data, which goes in the Manufacturer table. Prospects have a ManufacturerProspect table while Customers a ManufacturerCustomer table. There is a 1:1 relationship between Manufacturer and the Prospect as well as Manufacturer and the Customer tables. Because its a CRM system, there are a LOT of columns in all these tables and there are significant quantities of import jobs that do not run against object models to get fresh data into the system. Without separate tables, we could not enforce a major portion of the data integrity for imports becuase a required field in a customer import would fail when you tried to import a prospect. We could have introduced an object model that enforced these rules to import against but that would have prevented a general-knowledge data guy with a huge learning curve that he would probably have ignored anyway. We could also have created temporary tables for imports and had stored procedures for pulling the data but documentation for that kind of thing tends to get lost when a VP says you can't leave tonight until we get this D&B data loaded...
In the end, the best business reason I know of for horizontal partitioning (row size limits being a physical reason) is to control your business logic closer to the database. This is really another discussion and the subject of many blogs.
This is a good example of when it should be done. I know a lot of people would say that a prospect is not a customer and a customer is not a prospect and that's the reason for storing them in to separate tables, but this is incorrect. You could have a customer for one product being a prospect for another product. This way a customer can be prospect and a prospect can be a customer.
Might be a little off-topic, but can be handy in trying to grasp the concept in this case.
I generally agree with Jonathan. However, if you have any columns in your customer table that only apply to prospects, then you should split them into separate tables according to the first normal form (I think). You are supposed to segregate sets of related data even if they share an identical primary key. Let me know if recall this incorrectly.