Posts Tagged ‘big data’

Cleaning up /tmp under HDFS

Monday, October 8th, 2018

A script to wipe out /tmp under HDFS (originally posted here). It can be run from a crontab to periodically delete files older than XXX days.

  #!/bin/bash

  usage="Usage: cleanup_tmp.sh [days]"

  # The retention period in days is required as the first argument
  if [ ! "$1" ]; then
    echo "$usage"
    exit 1
  fi

  now=$(date +%s)

  # Walk the directories under /tmp/hive/hive/ and work out their age in days
  hadoop fs -ls /tmp/hive/hive/ | grep "^d" | while read f; do
    dir_date=$(echo "$f" | awk '{print $6}')
    difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))

    if [ "$difference" -gt "$1" ]; then
      # Dry run: only list the directories that are older than the threshold
      hadoop fs -ls $(echo "$f" | awk '{print $8}')
      # Uncomment to actually delete them:
      # hadoop fs -rm -r $(echo "$f" | awk '{print $8}')
    fi
  done

By default the script runs in a “dry” mode, listing the files that are older than XXX days. Once you’re comfortable with the output, comment out the line containing ‘fs -ls’ and uncomment the one with ‘fs -rm’.
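To schedule the cleanup from cron, an entry along these lines could be used (the script path, schedule and 7-day retention are hypothetical, not taken from this post):

  # Run nightly at 03:00 and clean anything older than 7 days (placeholder path and values)
  0 3 * * * /usr/local/bin/cleanup_tmp.sh 7 >> /var/log/cleanup_tmp.log 2>&1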

If you get Java memory errors while executing the script, make sure to export the HADOOP_CLIENT_OPTS variable before calling the script:

  export HADOOP_CLIENT_OPTS="-XX:-UseGCOverheadLimit -Xmx4096m"

Moving /var/log to a different drive under CentOS 7

Saturday, July 14th, 2018

A quick how-to with the instructions to move the /var/log partition to a different drive. Done on CentOS 7.5.1804 with all partitions managed by LVM; /var/log is moved to a USB key.

The tricky part with /var/log is that there is always something being written to it, and although a simple archive/restore might work, you risk losing changes made between the moment you create the archive and the moment you restore it. Depending on how big /var/log is, that could be minutes or hours of data. The procedure below assumes an outage since it is performed offline, but it ensures that no data is lost.
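To give an idea of the kind of steps involved, here is a rough sketch only, assuming the USB key shows up as /dev/sdb1 and will carry an XFS filesystem; the actual, LVM-aware procedure follows in the full post:

  # Rough sketch, not the exact procedure from this post
  systemctl isolate rescue.target              # go single-user so nothing keeps writing to /var/log
  mkfs.xfs /dev/sdb1                           # format the USB key (assumed device name)
  mount /dev/sdb1 /mnt
  rsync -aAX /var/log/ /mnt/                   # copy logs, preserving ACLs and SELinux contexts
  umount /mnt
  mv /var/log /var/log.old                     # keep the old copy until the move is verified
  mkdir /var/log
  echo '/dev/sdb1 /var/log xfs defaults 0 2' >> /etc/fstab
  mount /var/log
  restorecon -R /var/log                       # fix SELinux labels on the new filesystem
  systemctl isolate multi-user.target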

(more…)

Unattended installation of CentOS 7 with Kickstart

Sunday, March 13th, 2016

While setting up my first Hadoop cluster I was faced with the dilemma of how to perform installations of CentOS 7 on multiple servers at once. If you have 20 data nodes to deploy, anything you choose to automate the installation will greatly reduce the deployment time, but most importantly, it will eliminate the possibility of human error (a typo, for example).

Initially, I started looking in the disk cloning direction. Since all my data nodes are identical, I was thinking of preparing one data node server, taking a dd image of its system drive, placing it on an NFS share, then booting each server and re-imaging its system drive from that image. Clonezilla and DRBL seem to be the perfect pair for such a scenario, and although you will spend some time configuring, testing and tuning them, it was still worth looking into.
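Roughly, the cloning idea boils down to something like this (illustrative commands only; the device name and NFS path are placeholders, and both steps would be run from live/rescue media):

  # On the prepared data node: image the system drive onto the NFS share
  dd if=/dev/sda bs=4M | gzip > /mnt/nfs/datanode.img.gz

  # On each new node: restore the image from the share onto the system drive
  gunzip -c /mnt/nfs/datanode.img.gz | dd of=/dev/sda bs=4M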

Then I realized that even if I managed to establish the setup above, I would still have to deal with manual post-installation tweaks, like regenerating SSH keys and probably adjusting MAC addresses. On top of that, transferring a raw dd image (in my case it was ~30GB) might take longer than the initial installation itself. Therefore I ended up using the Kickstart method. I’m pretty sure there are more efficient solutions, and if you happen to know one I’d love to hear your comments.
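For a flavour of what the Kickstart approach looks like, here is a minimal illustrative ks.cfg (all values are placeholders, not the file used in this post):

  # Minimal illustrative ks.cfg for CentOS 7 (placeholder values)
  install
  url --url="http://192.168.1.10/centos7/"     # HTTP server hosting the CentOS 7 install tree
  text
  lang en_US.UTF-8
  keyboard us
  timezone UTC --utc
  rootpw --iscrypted <crypted-password-hash>
  network --bootproto=dhcp --activate
  zerombr
  clearpart --all --initlabel
  autopart --type=lvm
  reboot

  %packages
  @core
  %end

The installer is then pointed at the file with the inst.ks= boot parameter, for example inst.ks=http://192.168.1.10/ks.cfg.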

(more…)