I may have realized what I was missing. These "documents" are generated on the fly, so I guess you wouldn't be able to reference a static table of document contents. I was led to this by running the following query on roughly the first 15k rows:
SELECT DocumentId, HomeCommunityId, RepositoryId, count(AggregationId)
FROM HS_IHE_ATNA_Repository.Document
GROUP BY DocumentId, HomeCommunityId, RepositoryId
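Counting how many documents share each entry count gives this distribution (left: number of entries for the same document; right: number of documents with that many entries):
| Entries per document | Documents |
|---|---|
| 1 | 9804 |
| 2 | 1746 |
| 3 | 475 |
| 4 | 163 |
| 5 | 2 |
| 6 | 24 |
| 7 | 0 |
| 8 | 2 |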
That is, only 2 documents had 8 entries, and the vast majority had only a single entry. Still, if we extrapolate this out, that's a lot of duplicated data in my opinion. If my math is correct, about 21% of the document records in this sample are exact duplicates.
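For anyone who wants to check that percentage, here is a quick sketch of the arithmetic in Python. It assumes that a "duplicate" means every row beyond the first for a given DocumentId / HomeCommunityId / RepositoryId combination; the names `distribution` and `duplicate_share` are just illustrative and not part of any InterSystems API.

```python
# Distribution reported above: entries per document -> number of documents.
distribution = {1: 9804, 2: 1746, 3: 475, 4: 163, 5: 2, 6: 24, 7: 0, 8: 2}


def duplicate_share(dist):
    """Return (total_rows, distinct_documents, redundant_rows, share).

    Assumption: the first row for each document is the "original" and
    every additional row for the same document is a redundant copy.
    """
    distinct = sum(dist.values())
    total = sum(entries * docs for entries, docs in dist.items())
    redundant = total - distinct
    return total, distinct, redundant, redundant / total


total, distinct, redundant, share = duplicate_share(distribution)
print(f"{total} rows, {distinct} distinct documents, "
      f"{redundant} redundant rows ({share:.1%})")
```

Under that assumption the sample works out to 15,543 rows covering 12,216 distinct documents, so 3,327 rows (about 21.4%) are redundant, which agrees with the 21% figure.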


