I'm in panic mode tonight: I tried to upgrade my engine's software, ran into problems, and now can't access my 210 GB database!

The DB is on Vultr block storage, so restoring a 3-month-old server snapshot won't affect it. But I pray I didn't mess up anything in the data itself!

I'm going to test my feed-scraping program on a more powerful node to see if MongoDB can keep up better. If I could get this down to a single machine that processes everything in one day, I wouldn't mind the monthly expense. It would probably be cheaper than spinning up multiple machines once a week that take several days to process.

Can you help me understand something?

I have a Node script processing podcast feeds. I use the Async module to limit it to only 100 feeds at a time, and the work itself is asynchronous.

This is using all my server's resources (primarily the 2 CPUs) but getting the job done.

But this is still single-threaded, right?

Is there really any benefit to switching to multi-threaded? For example, 2 processes each with the Async limit set to 50? Would it make a difference?
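For anyone wondering what the limiting above looks like in practice, here's a minimal sketch of the idea behind `async.eachLimit(feeds, 100, worker)`, done with plain built-in Promises so it runs without any dependencies. The feed list and worker here are placeholders, not my actual scraper code:

```javascript
// Sketch: run async jobs over a list with at most `limit` in flight
// at once — the same idea as async.eachLimit(items, limit, worker).
async function eachLimit(items, limit, worker) {
  const queue = [...items];
  // Start `limit` runners; each pulls the next item until the queue is empty.
  const runners = Array.from({ length: limit }, async () => {
    while (queue.length > 0) {
      const item = queue.shift();
      await worker(item);
    }
  });
  await Promise.all(runners);
}

// Example: "process" 10 fake feeds, at most 3 at a time.
const processed = [];
eachLimit([...Array(10).keys()], 3, async (n) => {
  processed.push(n);
}).then(() => console.log(processed.length)); // 10
```

Either way, a single Node process runs the JavaScript on one thread; the concurrency limit just controls how many I/O operations are outstanding, which is why splitting into 2 processes of 50 mostly helps if the bottleneck is CPU (like parsing) rather than network.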

I'm in over my head sometimes.

Maybe I spoke too soon. I tried FXP on a much larger dataset overnight, and it took just as long as xml2js did. But I'll retest.

And on my production server with a larger data set, it went from 28 seconds to 19!

Whoa. FXP is actually faster than xml2js!

On my local machine, testing with only 11 RSS feeds, xml2js was averaging 1.4 s, but FXP is about 0.95 s! That's about two-thirds of the time!

Node's "xml2js" module had a "normalize" option that would lowercase all tag names, since some feeds don't capitalize tags consistently. (For example, "pubDate" vs. "pubdate.")

I can't figure out how to do that same kind of thing with "Fast XML Parser."
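The normalization itself is just lowercasing tag names. Fast XML Parser (v4) appears to accept a `transformTagName` hook that could do the same thing as xml2js's `normalize` — worth verifying against its docs, so the option usage below is an assumption and is commented out; only the lowercasing function is live code:

```javascript
// The normalization xml2js's "normalize" option did: lowercase tag names.
const lowercaseTag = (tagName) => tagName.toLowerCase();

// Presumed usage with fast-xml-parser v4 (unverified assumption):
// const { XMLParser } = require('fast-xml-parser');
// const parser = new XMLParser({ transformTagName: lowercaseTag });
// parser.parse('<pubDate>Fri, 13 Nov 2020 15:00:00 GMT</pubDate>');
// // keys would then come back as "pubdate" regardless of feed capitalization

console.log(lowercaseTag('pubDate')); // "pubdate"
```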

@dave The discussion has grown a bit overwhelming for finding this quick answer. But are we now saying transcripts (large data) should be _in_ the RSS, but chapters (small data) should be external?

I'm pretty happy about this idea:

github.com/Podcastindex-org/po

In short, a standard and privacy-respecting way for podcast apps to report playback data to podcast-hosting providers.

@dave For the sake of simplicity, let's call what I do to get all the Apple Podcasts listings "scraping."

I'm wondering if maybe there's an easy way I could pass some of the data on to you. As you can see at mypodcastreviews.com/stats/, there are usually a few thousand podcasts added every weekday.

@adam

PodcastIndex.social is basically becoming the podcast-dev Twitter. I like it!

I'm fighting too much with xml2js, so I think I'll switch to fast-xml-parser.

I hope that will also improve my system's performance since it's taking 10 servers 3 days to process 1.6 million podcast feeds!

@adam I don't know if it's Mastodon or only this server, but search doesn't seem to work.

I'm worried the accusations of being a robot are true!

I failed the "I'm not a robot" test about 30 times before I was able to log in to something.

Anyone have a recommendation for a "throw any format at it" date parser?
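In the meantime, a sketch of what I'm falling back on: no "any format" parser here, just Node's built-in `Date`, which already handles the RFC 2822 pubDate strings most feeds use plus ISO 8601, with a `null` fallback for anything it can't read. The inputs are made up for illustration:

```javascript
// Guarded date parsing: built-in Date plus a validity check,
// since `new Date('garbage')` returns an Invalid Date, not an error.
function parsePubDate(value) {
  const d = new Date(value);
  return Number.isNaN(d.getTime()) ? null : d;
}

console.log(parsePubDate('Fri, 13 Nov 2020 15:00:00 GMT') instanceof Date); // true
console.log(parsePubDate('not a date')); // null
```

A dedicated "throw anything at it" library would still be needed for the truly weird formats some feeds emit.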

@dave, are you saving pubDate fields as a string, or converting it to a date object?

@adam @dave "He hasn't contributed any code, yet." Is that the Podcast Index way of doing a "douchebag callout"? 😃

I'm actively learning Git stuff in a course (via "Code with Mosh"), so I'll overcome the fear of doing something wrong with Git contributions soon.

(I'm also having trouble keeping up while being a full-time single dad to a toddler and running my own business helping podcasters. But I'm so glad there's so much activity on this!)

@dave I dusted off my own feed-scraper for Podcast Industry Insights and I'm working on performance and validation stuff on it. How many server nodes are you now using to scrape feeds?

@dave @adam The new site looks great! But the logo is still different. Did you officially unslant it?

PodcastIndex Social

Intended for all stakeholders of podcasting who are interested in improving the ecosystem