Once you have an elasticsearch server running, you'll want to bootstrap it with feed and story indexes:

    ./manage.py index_feeds

Stories will be indexed automatically.

If you need to move search servers and want to just delete everything in the search database, you need to reset the MUserSearch table. Run:

    make shell

    >>> from apps.search.models import MUserSearch
    >>> MUserSearch.remove_all()

If feeds aren't fetching, check that the `tasked_feeds` queue is empty. You can drain it by running the following from `make shell`:
```
Feed.drain_task_feeds()
```
This happens when a deploy on the task servers hits faults and the task servers lose their
connection without giving the tasked feeds back to the queue. Feeds that fall through this
crack are automatically fixed after 24 hours, but if many feeds fall through due to a bad
deploy or electrical failure, you'll want to accelerate that check by just draining the
tasked feeds pool, adding those feeds back into the queue. This command is idempotent.
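
To illustrate why running the drain more than once is harmless, here is a minimal in-memory sketch of the idea. It is not NewsBlur's implementation: the real `Feed.drain_task_feeds()` works against the live queues, while this model assumes both pools are plain Python sets, and the function name `drain_tasked_feeds` and the feed ids are hypothetical.

```python
def drain_tasked_feeds(tasked_feeds: set, queued_feeds: set) -> int:
    """Move every feed id from the tasked pool back into the queue.

    Hypothetical model: both pools are plain Python sets of feed ids.
    Returns the number of feed ids that were moved.
    """
    moved = len(tasked_feeds)
    queued_feeds |= tasked_feeds  # set union: re-queuing a feed twice is harmless
    tasked_feeds.clear()          # nothing is left marked as in flight
    return moved


# Feeds 2 and 3 were handed to task servers that died mid-deploy.
tasked = {2, 3}
queued = {1}

first = drain_tasked_feeds(tasked, queued)
second = drain_tasked_feeds(tasked, queued)  # second run finds nothing to move

print(first, second, sorted(queued))
```

Because the queue is modeled as a set and the tasked pool is emptied on the first pass, the second call changes nothing, which is the property "idempotent" refers to above.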
You got the downtime message either through email or SMS. This is the order of operations for determining what's wrong.
0a. If downtime goes over 5 minutes, go to Twitter and say you're handling it. Be transparent about what it is; NewsBlur's followers are largely technical. Also, the 502 page points users to Twitter for status updates.
0b. Ensure you have `secrets-newsblur/configs/hosts` installed in your `/etc/hosts` so server hostnames work.
1. Check www.newsblur.com to confirm it's down.

   If you don't get a 502 page, then NewsBlur isn't even reachable and you just need to contact the hosting provider and yell at them.

2. Check which servers can't be reached on the HAProxy stats page. Basic auth can be found in secrets/configs/haproxy.conf. Search the secrets repo for "gimmiestats".

   Typically it'll be mongo, but any of the redis or postgres servers can be unreachable due to acts of god. Otherwise, a frequent cause is lack of disk space. There are monitors on every DB server watching for disk space, emailing me when they're running low, but it still happens.

3. Check Sentry and see if the answer is at the top of the list.

   This will show if a database (redis, mongo, postgres) can't be found.

4. Check the various databases:

   a. If a Redis server (db_redis, db_redis_story, db_redis_pubsub) can't connect, redis is probably down.

      SSH into the offending server (or just check both the `db_redis` and `db_redis_story` servers) and check if `redis` is running. You can often `tail -f -n 100 /var/log/redis.log` to find out if background saving was being SIG(TERM|INT)'ed. When redis goes down, it's always because it's consuming too much memory. That shouldn't happen, so check the [munin graphs](http://db_redis/munin/). Boot it with `sudo /etc/init.d/redis start`.

   b. If mongo (db_mongo) can't connect, mongo is probably down.

      This is rare and usually signifies hardware failure. SSH into `db_mongo` and check logs with `tail -f -n 100 /var/log/mongodb/mongodb.log`. Start mongo with `sudo /etc/init.d/mongodb start`, then promote the next largest mongodb server: promote one of the secondaries to primary, kill the offending primary machine, and rebuild it (preferably at a higher size). I recommend waiting a day to rebuild it so that you get a different machine. Don't forget to lodge a support ticket with the hosting provider so they know to check the machine.

      If it's the db_mongo_analytics machine, there are no backups or secondaries of the data (because it's ephemeral and used for, you guessed it, analytics). You can easily provision a new mongodb server and point to that machine.

      If mongo is out of space, which happens, note that the servers need to be re-synced every 2-3 months to compress the data bloat. Simply `rm -fr /var/lib/mongodb/*` and restart mongo. It will re-sync.

      If both secondaries are down, then the primary mongo will go down. You'll need a secondary mongo in the sync state at the very least before the primary will accept reads. It shouldn't take long to get into that state, but you'll need a mongodb machine set up. You can immediately reuse the non-working secondary if disk space is the only issue.

   c. If postgresql (db_pgsql) can't connect, postgres is probably down.

      This is the rarest of the rare and has in fact never happened. It would mean machine failure. If you can salvage the db data, move it to another machine. Worst case, you have nightly backups in S3. The fabfile.py has commands to assist in restoring from backup (the backup file just needs to be local).

5. Point to a new/different machine:

   a. Confirm the IP address of the new machine with `fab list_do`.

   b. Change `secrets-newsblur/configs/hosts` to reflect the new machine.

   c. Copy the new `hosts` file to all machines with `fab all setup_hosts`.

   d. Changes should be instant, but you can also bounce every machine with `fab web deploy` and `fab task celery`.

   e. Monitor `utils/tlnb.py` and `utils/tlnbt.py` for lots of reading and feed fetching.

6. If feeds aren't fetching, check that the `tasked_feeds` queue is empty. You can drain it by running:
```
Feed.drain_task_feeds()
```
This happens when a deploy on the task servers hits faults and the task servers lose their
connection without giving the tasked feeds back to the queue. Feeds that fall through this
crack are automatically fixed after 24 hours, but if many feeds fall through due to a bad
deploy or electrical failure, you'll want to accelerate that check by just draining the
tasked feeds pool, adding those feeds back into the queue. This command is idempotent.
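
The first checks in the list above (is the site serving a 502 or nothing at all, and which database hosts accept connections) can be roughed out in a few lines of standard-library Python. This is only a sketch under stated assumptions: the function names are hypothetical, the ports are the stock defaults for redis, mongo, and postgres rather than values taken from NewsBlur's configuration, and the hostnames only resolve once the hosts file from step 0b is installed.

```python
import socket
import urllib.error
import urllib.request

# Assumption: each database listens on its stock default port.
DB_PORTS = {
    "db_redis": 6379,
    "db_redis_story": 6379,
    "db_redis_pubsub": 6379,
    "db_mongo": 27017,
    "db_pgsql": 5432,
}


def classify_status(status):
    """Map an HTTP status (or None for no response) to the next step."""
    if status is None:
        return "unreachable: contact the hosting provider"
    if status == 502:
        return "502: check the HAProxy stats page, Sentry, and the databases"
    return f"responding with HTTP {status}"


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if nothing answers."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # an error page such as a 502 still proves reachability
    except (urllib.error.URLError, OSError):
        return None      # DNS failure, refused connection, or timeout


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def triage(url="https://www.newsblur.com"):
    """Print the site status, then the reachability of each database host."""
    print(classify_status(fetch_status(url)))
    for host, port in DB_PORTS.items():
        print(host, "reachable" if port_open(host, port) else "NOT reachable")
```

Calling `triage()` from a machine with the hosts file installed prints one line for the site and one per database host. An open port only shows that something is listening, so the log and disk space checks described above still apply.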

When the new redis server is connected to the primary redis server:

    make celery_stop
    make maintenance_on
    apd -l db-redis-story2 -t replicaofnoone
    aps -l db-redis-story,db-redis-story2 -t consul
    make maintenance_off
    make task