WHEW! I think I got it working again! Thank God!
I'm going to test my feed-scraping program on a more powerful node to see if MongoDB can keep up better. If I could get this down to a single machine that can process everything in one day, I guess I wouldn't mind that monthly expense. It would probably be cheaper than spinning up multiple machines once a week and having them take several days to process.
Can you help me understand something?
I have a Node script processing podcast feeds. I use the Async module to limit it to only 100 feeds at a time; the work itself is all asynchronous.
This is using all my server's resources (primarily the 2 CPUs) but getting the job done.
But this is still single-threaded, right?
Is there really any benefit to going multi-process? For example, 2 Node processes, each with the Async limit set to 50? Would it make a difference?
I'm in over my head sometimes.
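For anyone curious what that Async limit is doing, here's a rough plain-Promise sketch of the same idea (no Async module needed); `worker` is a stand-in for whatever I actually do per feed:

```javascript
// Sketch of a concurrency limit like async.eachLimit: process a list of
// items with at most `limit` jobs in flight at once.
async function eachLimit(items, limit, worker) {
  const queue = items.slice();
  // Start `limit` runners; each pulls the next item until the queue empties.
  const runners = Array.from({ length: limit }, async () => {
    while (queue.length > 0) {
      await worker(queue.shift());
    }
  });
  await Promise.all(runners);
}

// Example with a fake async job standing in for a feed fetch:
const feeds = Array.from({ length: 10 }, (_, i) => `feed-${i}`);
const done = [];
eachLimit(feeds, 3, async (feed) => {
  await new Promise((r) => setTimeout(r, 10)); // simulate async I/O
  done.push(feed);
}).then(() => console.log(done.length)); // prints 10
```

Note that all of this still runs on one thread; the limit only caps how many async operations are in flight at once, which is why the event loop stays single-threaded no matter how high the limit goes.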
Maybe I spoke too soon. I tried FXP (fast-xml-parser) on a much larger dataset overnight, and it took just as long as xml2js did. But I'll retest.
And on my production server with a larger data set, it went from 28 seconds to 19!
@dave The discussion is a bit overwhelming now to find this quick answer. But are we now saying transcripts (large data) should be _in_ the RSS, but chapters (small data) should be external?
I'm pretty happy about this idea:
In short, a standard and privacy-respecting way for podcast apps to report playback data to podcast-hosting providers.
@dave For the sake of simplicity, let's call what I do to get all the Apple Podcasts listings "scraping."
I'm wondering if maybe there's an easy way I could pass some of the data on to you. As you can see at https://mypodcastreviews.com/stats/, there are usually a few thousand podcasts added every weekday.
@adam I don't know if it's Mastodon or only this server, but search doesn't seem to work.
Anyone have a recommendation for a "throw any format at it" date parser?
@dave, are you saving pubDate fields as a string, or converting it to a date object?
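For what it's worth, here's roughly how I'd normalize pubDate before saving, assuming Node's built-in Date handles the usual RFC 2822 feed dates (a dedicated "throw anything at it" parser would cover the weirder formats):

```javascript
// Hedged sketch: convert a feed's pubDate string to a real Date before
// saving, falling back to null when it can't be parsed. Storing a Date
// (not a string) lets MongoDB sort and range-query it natively.
function parsePubDate(raw) {
  if (!raw) return null;
  const d = new Date(raw.trim());
  return Number.isNaN(d.getTime()) ? null : d;
}

console.log(parsePubDate('Tue, 02 Mar 2021 15:30:00 GMT') instanceof Date); // true
console.log(parsePubDate('not a date')); // null
```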
I'm actively learning Git stuff in a course (via "Code with Mosh"), so I'll overcome the fear of doing something wrong with Git contributions soon.
(Also having trouble keeping up while being a full-time single dad to a toddler while also running my own business helping podcasters. But I'm so glad there's so much activity on this!)
@dave I dusted off my own feed-scraper for Podcast Industry Insights and I'm working on performance and validation stuff on it. How many server nodes are you now using to scrape feeds?
Intended for all stakeholders of podcasting who are interested in improving the ecosystem.