If each aggregator agent gets a slice of the database, how many feeds do you think each can reasonably handle in a timely manner without chewing up bandwidth and processor? 10k maybe?
I'm thinking that each node would cycle through their list every 20 minutes. That'd be roughly 72 checks per day, per feed.
If the average feed size is 150k, that puts the total download burden around 1 GB per day. That seems reasonable to me if someone knows what they are doing and chooses to participate.
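Quick sanity check on those numbers: 10k feeds × 72 checks/day × 150 KB would be roughly 100 GB/day if every check pulled the full body, so the ~1 GB/day figure presumably assumes most checks transfer almost nothing (e.g., conditional GETs answered with 304 Not Modified). A rough sketch of the arithmetic, with the change rate per check as an assumed parameter:

```python
FEEDS_PER_NODE = 10_000
CHECKS_PER_DAY = 24 * 60 // 20          # one pass every 20 minutes -> 72
AVG_FEED_BYTES = 150 * 1024             # ~150k average feed size

def daily_download_gb(change_rate):
    """Rough daily download per node, assuming only the fraction of
    checks that hit a changed feed transfer the full body, and the
    rest are answered with a tiny 304."""
    full_bodies = FEEDS_PER_NODE * CHECKS_PER_DAY * change_rate * AVG_FEED_BYTES
    return full_bodies / 1024**3
```

At a ~1% change rate per check, `daily_download_gb(0.01)` lands right around 1 GB/day, which matches the estimate above.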
@dave is each aggregator downloading the entire database every time?
How are slices/jobs being decided and communicated to the client?
Could the slices be proactively created ahead of time by the server and handed out as individual databases to reduce bandwidth?
Maybe that's what you had in mind all along; just making sure I understand the base assumptions.
@agates Yes we are thinking the same. I don’t want a full DB download. I’d rather have the utility server that builds the weekly DB dump also slice it up into 10k chunks and upload the chunks to object storage in a predetermined URL scheme.
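A predetermined URL scheme could be as simple as deriving the path from the ISO week of the dump plus the chunk index, so agents can compute chunk URLs without asking the server. A minimal sketch; the host name and path layout here are placeholders, not a decided convention:

```python
from datetime import date

# Hypothetical bucket host -- an assumption for illustration only.
BASE = "https://chunks.example.org"

def chunk_url(chunk_index, day=None):
    """Deterministic URL for one 10k-feed chunk of a weekly DB dump."""
    day = day or date.today()
    year, week, _ = day.isocalendar()
    return f"{BASE}/{year}/{week:02d}/chunk-{chunk_index:05d}.db"
```

Because the scheme is deterministic, any agent that knows the current week and its chunk index can fetch its slice straight from object storage.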
My big question is then how to assign those chunks.
The chunks will not represent the entire database since there are many feeds in the database that are “inactive” or abandoned. Probably only about 1/3rd deserve to be polled.
@dave I've implemented something similar related to file migration/validation (had to do it for 2 petabytes of files of varying sizes).
The simplest way is to turn "chunks" into tasks/jobs that get assigned when a worker asks for new work.
No need to plan assignment ahead of time, nodes will keep up with the work they can handle. Some might even handle more work than others if they choose to.
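The pull model described above can be sketched in a few lines: the server keeps a queue of unclaimed chunks and hands out the next one, under a fresh lease ID, whenever a worker asks. This is an illustration of the pattern, not a concrete API:

```python
import uuid
from collections import deque

class ChunkDispatcher:
    """Minimal pull-based assignment: workers ask for work and the
    server hands out the next unclaimed chunk. Nothing is planned
    ahead of time, so faster nodes naturally take on more chunks."""

    def __init__(self, chunk_ids):
        self.pending = deque(chunk_ids)
        self.assigned = {}   # lease_id -> chunk_id

    def request_work(self):
        """Return (lease_id, chunk_id), or None when nothing is left."""
        if not self.pending:
            return None
        chunk = self.pending.popleft()
        lease = str(uuid.uuid4())
        self.assigned[lease] = chunk
        return lease, chunk
```

A BOINC-style redundancy scheme would just hand the same chunk out under several leases before retiring it.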
@dave Systems like BOINC often hand out the same job to 3+ systems also for the purpose of validation.
@agates What about a renewal interval where after a certain number of days the agent wipes their local DB and pulls a fresh chunk? Maybe every 7 days.
I want it to be a lazy process to keep from effectively DDoS’ing the hosts. If done right, I think it can work without much extra traffic to them, especially when you factor in us becoming a push-only hub, which will let other aggregators back off and lower their own traffic.
@dave So every week each agent would get a new 10k (or configurable) slice? Yeah that would work pretty well. Keeps up with new records while discarding old ones over time.
I would still have a sort of chunk UUID that expires after a week, so the time period is tracked on both the agent and the server. There would be some level of history (maybe keep 6 months of chunk history), with history and updates tied to an agent's API key to track bad agents or whatever.
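The expiry and retention numbers above (7-day leases, ~6 months of history) can be sketched directly; the data layout here is just an illustration, not a decided design:

```python
from datetime import datetime, timedelta

LEASE_TTL = timedelta(days=7)        # agent wipes its local DB and re-pulls weekly
HISTORY_TTL = timedelta(days=182)    # keep roughly 6 months of chunk history

def lease_expired(issued_at, now):
    """True once an agent's chunk UUID has aged out and it should
    pull a fresh chunk."""
    return now - issued_at >= LEASE_TTL

def prune_history(history, now):
    """history: list of (api_key, chunk_uuid, issued_at) tuples,
    so misbehaving agents can be traced by API key. Drop entries
    older than the retention window."""
    return [entry for entry in history if now - entry[2] < HISTORY_TTL]
```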
Torrents have static content that is checksummed so the client can determine what it needs to download from other peers/seeds. The database would be perpetually changing, at minimum from new adds, but theoretically I could see chapter updates or other metadata being changed after the fact (I'm not entirely sure what is in the DB at this point). I believe IPFS is displacing torrents for static content, with the DB linking to content.
am I understanding that correctly?
@dave Crazy idea here.
What if the entire distributed feed parsing system was run over the distributed IPFS log and/or key-value store (OrbitDB) and clients could utilize that to subscribe to new episodes?
@dave It's one of the OrbitDB datastore types -- all built on top of IPFS pubsub.
I have to say it's kind of blowing my mind, might even be a route for 100% decentralized chat...
Intended for all stakeholders of podcasting who are interested in improving the ecosystem