I wanted to take a moment to add my two cents. Though I certainly believe estimating should be more science than art, I look at estimates from a different perspective. As a disclosure, I’m not the one doing the estimating on this project, so I’m not going to say I agree or disagree with any one technique. Depending on your situation, one estimating technique may provide more accurate results than another.
What I would like to add, from my perspective, is the need for expert judgment. If you are an expert in a given estimating technique and it gives you the results you and your customer(s) need, does that not validate it as an acceptable estimating choice?
If the estimating technique does not produce the desired results, wouldn’t it fail the metaphorical sniff test?
Recently, I questioned a vendor’s estimate based on a different technique. I used a parametric estimate to see if the vendor’s estimate would pass or fail my sniff test.
What exactly is a parametric estimate?
An estimating technique that uses a statistical relationship between historical data and other variables to calculate an estimate for activity parameters, such as scope, cost, budget, and duration. Source: PMBOK Page 439
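To make the definition concrete, here is a minimal sketch of a parametric estimate for defect counts. It assumes the historical relationship is a simple average defect density (defects per thousand lines of code); the history values, the build size, and the function names are all made up for illustration and do not come from the project discussed here.

```python
from statistics import mean

# Hypothetical history: (size in KLOC, defects found) for past builds.
history = [(10, 52), (20, 95), (15, 80), (25, 123)]


def defects_per_kloc(history):
    """Average defect density across past builds."""
    return mean(defects / kloc for kloc, defects in history)


def parametric_estimate(size_kloc, history):
    """Estimate defects for a new build as size times historical density."""
    return size_kloc * defects_per_kloc(history)


rate = defects_per_kloc(history)
estimate = parametric_estimate(30, history)  # hypothetical 30 KLOC build
print(f"Historical density: {rate:.2f} defects/KLOC")
print(f"Estimated defects for 30 KLOC: {estimate:.0f}")
```

A real parametric model could use a regression or several variables instead of a single ratio; the point is only that the estimate is driven by historical data and a measurable parameter.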
So, why did the vendor’s estimate not pass my sniff test? As part of standard estimating practice, software vendors should include time for fixing bugs. While reviewing a recent status report, I noticed the vendor reporting that half as many bugs had been discovered in the current build as had been estimated. When asked about this, the vendor was very excited to confirm that they had indeed found half as many defects in the code as they originally estimated, and predicted a cost savings of several hundred thousand dollars to the project. Going into the current build, I knew what the standard deviation was and had considered the possible variance. The actual defect count fell well below that range.
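The sniff test described above can be sketched as a simple deviation check. This is only an illustration: the estimate of 200 defects, the standard deviation of 25, and the two-standard-deviation tolerance are assumed values, not figures from the vendor's report.

```python
def sniff_test(actual, estimate, std_dev, tolerance=2.0):
    """Return (passes, z), where z is how many standard deviations
    the actual value sits from the estimate."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    z = (actual - estimate) / std_dev
    return abs(z) <= tolerance, z


# Hypothetical figures: 200 defects estimated, standard deviation of 25,
# and only half as many defects actually found.
passes, z = sniff_test(actual=100, estimate=200, std_dev=25)
print(f"z = {z:.1f}, passes sniff test: {passes}")
```

A result that lands far outside the expected range does not say which explanation is correct, only that the number deserves a second look.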
So, why were they discovering so few bugs? At first glance, I can see two possible reasons: either quality in development improved, or quality in testing worsened. Either way, you get the same initial result of fewer defects identified.
We’ll know the true answer once initial user acceptance testing begins. Had there been no baseline to compare the actuals against, I might not have given it a second thought.
Graphic source via Flickr: pump