When compiling hosting company stats, what are the sane limits that should determine size? The Index currently has 1,506,486 feeds. A large percentage of them are garbage/old.

A lot of Spreaker garbage also.

So, what should a realistic hosting stats page include? Only feeds that have published a new episode in the last X days? That doesn't seem right since some podcasts are evergreen.

Maybe podcasts that have more than X number of episodes?

Advice welcome.

I'd like to give some realistic stats alongside the raw ones.
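
A minimal sketch of the two filters floated above (a recency cutoff and a minimum episode count), reported next to the raw number. The `Feed` record shape, the `summarize` helper, and both thresholds are assumptions for illustration, not the Index's actual schema or rules. The "either" count keeps a feed if it passes at least one filter, so an evergreen show with many episodes is not dropped just for being old.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Feed:
    # Hypothetical record shape; not the Index's actual schema.
    url: str
    episode_count: int
    last_published: datetime


def summarize(feeds, now, min_episodes=3, max_age_days=365):
    """Return the raw feed count next to counts under each filter.

    Both thresholds are placeholders for the "X" values in the question.
    """
    cutoff = now - timedelta(days=max_age_days)
    enough_episodes = [f for f in feeds if f.episode_count >= min_episodes]
    recent = [f for f in feeds if f.last_published >= cutoff]
    return {
        "raw": len(feeds),
        "min_episodes": len(enough_episodes),
        "recent": len(recent),
        # Feeds passing at least one filter (assumes feed URLs are unique).
        "either": len({f.url for f in enough_episodes + recent}),
    }
```

With this shape, a one-episode feed that was abandoned years ago fails both filters, while a long-running but retired show still counts under "either".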

@dave This is what I debate with Podcast Industry Insights.

It's up to you to define your own labels. I consider it a "valid podcast" if it has downloadable media via an RSS feed and it's listed in a podcast catalog.

But I'm working on some new taxonomy and have joined the Podcast Taxonomy project to help guide industry-standard labels for stuff like this.

Even "active" and "inactive" are up for debate, too. That's why I changed some of my labeling.

@dave I would label that "inactive."

BTW, "podfade" probably can't be measured, since it's more about intention and communication (or lack thereof). So I think it's very important to avoid derogatory terms to describe inactive podcasts.

Except "podflash." That one's easy. :)

@theDanielJLewis When I say "junk" and "garbage" I just mean stuff like one-episode Anchor feeds where someone was just messing around on a lark and it's now in the catalog forever. Or clone/fraud feeds.

@dave Yeah. That's where we need a good term that includes new podcasts, ongoing podcasts, and retired but timeless podcasts.

Everything in podcast pollution is, technically, still a podcast. But probably not something anyone would want to follow.

@dave To say, "It's not a podcast until you reach N episodes," is like telling a baby they're not a human until they reach a certain age.

@theDanielJLewis @dave well you are not an adult in most countries until you reach 18. The analogy holds.

@dave In a way, "active" could still describe a podcast like Serial. Even though it's not current/ongoing/publishing, it's still getting lots of followers. But a third party can't measure that.

@dave I've thought about terms like "living," "sustained/maintained," "matured," "seasoned," "relevant," and more.

@theDanielJLewis I like "sustained" and "relevant". Those are good terms.

@dave Or redefine "valid." And all the podcast pollution would be "invalid."

@dave @theDanielJLewis I think "fresh" is another word that can be used too. Serial, as of this week, is fresh again, but it wasn't for a while. A podcast with an episode within the last 3 months is fresh.

@dave You could simply say, "hosted podcast feeds," which is far more accurate, and also would prevent Blubrry from being incorrectly ranked low. Unless you're looking at the enclosure URLs.

@theDanielJLewis I'm looking at both enclosure URLs and feed URLs to make it accurate. I plan to open source the compilation code when I feel good about its accuracy. Blubrry was my test case because they are so wrongly reported. If I can get them right then I feel good about the logic.

@dave Libsyn will be slightly undercounted, too, if looking at only feed URLs, since many people use Libsyn with PowerPress.
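
A rough sketch of why checking enclosure URLs as well as feed URLs changes the count, assuming a hypothetical suffix table and hypothetical helpers `host_for_url` and `guess_host`. This is not the compilation code discussed in the thread, and the two table entries are illustrative only. It does not try to unwrap tracking-prefix redirects in enclosure URLs.

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative suffix-to-host table; these entries are assumptions,
# not the real matching rules.
HOST_SUFFIXES = {
    "blubrry.com": "Blubrry",
    "libsyn.com": "Libsyn",
}


def host_for_url(url):
    """Map a URL's hostname to a hosting company, or None if unrecognized."""
    hostname = (urlparse(url).hostname or "").lower()
    for suffix, host in HOST_SUFFIXES.items():
        if hostname == suffix or hostname.endswith("." + suffix):
            return host
    return None


def guess_host(feed_url, enclosure_urls):
    """Prefer the most common enclosure host; fall back to the feed URL."""
    votes = Counter(h for h in map(host_for_url, enclosure_urls) if h)
    if votes:
        return votes.most_common(1)[0][0]
    return host_for_url(feed_url)
```

Under these assumptions, a self-hosted WordPress feed whose media files sit on a `libsyn.com` subdomain is attributed to Libsyn, even though the feed URL alone would leave it unattributed.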

@theDanielJLewis BTW, this is super ugly code. I'm just prototyping things.

@theDanielJLewis @jamescridland Yes, I'm using some of those patterns. This is mostly for our use internally, to make some sense out of what we have, and so that I can speak intelligently about it rather than being super accurate to the public. That's your job. 😉

@dave Whew. I was worried you were pushing me out of a job. 😉

@theDanielJLewis I really have zero interest in "stats" other than to help me dedup this mountain of feeds. Also, the question comes up often about why we have a million and a half more feeds than Apple. And, I'd like to know how much of that is "junk".

@dave @theDanielJLewis
Is this what they lovingly call skip logic? Or some people call AI 😄

@dave you have access to a stat, though I don't know if you use it yet: what is being requested/fetched from the index? Surely something that is active is fetched more than something dead (though if it is in someone's podcatcher it may get fetched till the end of time). In isolation it might be noisy, but in aggregation I think you'll find a signal.
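
One way to look for that signal, sketched under the assumption that request logs can be reduced to a plain list of feed IDs. The log shape and the helpers `fetch_counts` and `never_requested` are hypothetical. A single feed's count is noisy, so the sketch only aggregates counts and surfaces feeds nobody asked for in the window.

```python
from collections import Counter


def fetch_counts(requested_feed_ids):
    """Count how many times each feed ID appears in a request log."""
    return Counter(requested_feed_ids)


def never_requested(all_feed_ids, counts):
    """Feed IDs in the catalog that no client asked for in the log window."""
    # Counter returns 0 for missing keys, so unseen feeds compare equal to 0.
    return sorted(fid for fid in all_feed_ids if counts[fid] == 0)
```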

