RSS-Spider

Development, Ideas, Issues, problems, ßetas and what not…

Built for Speed… new Indexing engine goes online…

Filed under: Development — Dave at 10:22 pm on Saturday, February 11, 2006

Over the past two weeks there was a major drop in the speed at which searches were being returned. The MySQL database hit a line in the sand somewhere, and once it was crossed, search speed suffered. At the time of this writing the database has over 5 million articles pulled from various RSS feeds. The FULLTEXT search has collapsed, and simple one-word searches like Cleveland were taking 300+ seconds to return. Frankly I’m still amazed at the number of page views we were getting during this time, but looking at the search log I can see many people came in from the same IP address 4 or 5 times within a minute looking for the same thing. This says to me that they thought the site was slow or didn’t accept their query, so they clicked search again only to have to wait 4+ minutes! BAH!

As of 2006-02-11 20:01:02 RSS-Spider is powered by a new full-text index server called Sphinx. Searches that once took 300+ seconds now take under a second! Sphinx was simple to install and I’m seriously impressed with the overall speed gain!!! See http://www.shodan.ru/projects/sphinx for more information!

Quick link to major company news

Filed under: What Not... — Dave at 10:27 pm on Tuesday, January 31, 2006

Since I’m always online either programming this site or screwing around in my Ameritrade account, I’ve created a “cheat” page so I can quickly pull up anything that might be in the database about a whole slew of companies. I’ll be adding more as I get through the chapters in the 100 best companies to invest in in 2006. But for right now there are 480+ companies at http://www.rss-spider.com/company_list.php

Database backup & purge GONE WILD!

Filed under: Issues, Problems — Dave at 6:30 pm on Sunday, January 29, 2006

Last night we experienced an 8 hour outage during our weekly database backup & purge. The database of headlines has grown so massive that it took way too long to purge out the old & “spammy” headlines. The last time we did this purge was back in December, and we only experienced a 15 minute outage. I was up till 6 am waiting for everything to finish. If you visited the site during this time you would have been directed over to the my.rss-spider.com page, which was incorrectly listed as our forwarding page. My.RSS-Spider.com is a project we’re working on which is in its infancy. In a nutshell, it will allow users to have their own webspace to aggregate RSS feeds from RSS Spider. So instead of coming and searching for something every time, you can simply go to your my.RSS-Spider.com home page and view all the new feeds matching your search criteria. Again, this is something that’s way down the road; however, we are planning to have a Beta release in late March.

What was Hot Yesterday!

Filed under: Betas, Development — Dave at 6:23 pm on Sunday, January 29, 2006

Yesterday marked the launch of the Hot Words section of RSS-Spider. What this section does is mash up all the posts from any given day, sort all the words from that day and count the number of times each word appears. From there it takes the top 100 words and ranks them in font size order. So on January 25th the system processed all the documents in the database that had a pubdate of January 24 (this date comes from the RSS feed that the spider pulled) and found that in the top 100 terms used on that day Alito, Bush, Iraq and War all came up…

Clicking on any one of these terms will pull up all the articles stored in the database for that day with that term.
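
For the curious, here is a rough Python sketch of the counting and font-sizing idea described above. It is not the code running on the site. The function name hot_words, the pixel range, and the plain letters-only word split are all assumptions made for illustration, and no stop-word filtering is attempted.

```python
import re
from collections import Counter


def hot_words(posts, top_n=100, min_px=10, max_px=36):
    """Return (word, count, font_px) tuples for the top_n most frequent words.

    posts is a list of plain-text strings (headline plus body) that all share
    one pubdate. The font size scaling is a hypothetical linear mapping.
    """
    counts = Counter()
    for text in posts:
        # Lowercase and keep runs of letters only (an assumed tokenizer).
        counts.update(re.findall(r"[a-z]+", text.lower()))

    top = counts.most_common(top_n)
    if not top:
        return []

    highest = top[0][1]
    lowest = top[-1][1]
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts tie

    # Scale each count linearly between min_px and max_px.
    return [
        (word, count, min_px + round((count - lowest) * (max_px - min_px) / span))
        for word, count in top
    ]
```

Feeding it the posts for a single day gives back the words in frequency order, each with a font size that could drive a tag-cloud style page.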

Currently we are only processing English language feeds, but our next step is to add a Hot Words section for German users.

What’s hot! (Yesterday)

Filed under: Betas — Dave at 1:56 pm on Sunday, January 15, 2006

If you want to see what items are hot in the RSS feeds we’re polling, check out the WhatsHotYesterday.php link under the Beta section. What we’re doing here is taking all the feeds we’ve spidered that have a post date of yesterday (whatever that might be), mashing the headlines and bodies together, sorting out all the words, then figuring out which words appear the most.

So far we’ve seen Alito pop up a few times on our Friday test, as well as Bush. Guitar seems to be a big one. Right now it’s only going to return blogs where the language is EN or English.
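
The selection step (yesterday's post date, English only) might look something like the Python sketch below. The dictionary keys headline, body, pubdate and language are hypothetical stand-ins and not the real database columns, and this is only an illustration of the filter, not the site's code.

```python
from datetime import date, timedelta

# Language labels treated as English, per the "EN or English" rule above.
ENGLISH_LABELS = {"en", "english"}


def yesterdays_english_text(posts, today=None):
    """Return headline + body text for English posts dated yesterday.

    posts is a list of dicts with assumed keys: headline, body,
    pubdate (a datetime.date) and language.
    """
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    return [
        post["headline"] + " " + post["body"]
        for post in posts
        if post["pubdate"] == yesterday
        and post["language"].strip().lower() in ENGLISH_LABELS
    ]
```

The resulting list of strings is what would then get mashed together and counted.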

Soon we should have a What’s Hot Yesterday database covering every pubdate in our database.
