Grep across multiple files in Hadoop Filesystem

Asked by 时光说笑, 2020-12-30 02:01

I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.

I can see the files I wish to search like this:
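
Presumably something along the lines of the listing used in the answers below:

    hadoop fs -ls /apps/hdmi-technology/b_dps/real-time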

5 Answers
  • 2020-12-30 02:22

    This is a Hadoop "filesystem", not a POSIX one, so try this:

    # take the 8th column of -ls output (the file path), then stream each
    # file and print its name if it contains the pattern
    hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
    while read f
    do
      hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
    done
    

    This should work, but it is serial and so may be slow. If your cluster can take the heat, we can parallelize:

    hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
      xargs -n 1 -I ^ -P 10 bash -c \
      "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"
    

    Notice the -P 10 option to xargs: this is how many files we will download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whichever is the limiting factor in your configuration.
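
    If you run this kind of search often, the same pipeline can be wrapped in a small shell function (hdfs_grep_files is just a hypothetical name here; the pattern, directory, and parallelism are passed as parameters):

    # Usage: hdfs_grep_files <pattern> <hdfs_dir> [parallelism]
    hdfs_grep_files() {
      local pattern=$1 dir=$2 jobs=${3:-10}
      # list paths, then cat-and-grep them in parallel, printing names of matching files
      hadoop fs -ls "$dir" | awk '{print $8}' | \
        xargs -n 1 -I ^ -P "$jobs" bash -c \
        "hadoop fs -cat ^ | grep -q $pattern && echo ^"
    }

    hdfs_grep_files bcd4bc3e1380a56108f486a4fffbc8dc /apps/hdmi-technology/b_dps/real-time 10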

    EDIT: Given that you're on SunOS (which is slightly brain-dead), try this:

    hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | while read f; do hadoop fs -cat $f | grep bcd4bc3e1380a56108f486a4fffbc8dc >/dev/null && echo $f; done
    
  • 2020-12-30 02:26
    This searches by file name, not by file contents: the first form matches names that contain the string, the second an exact name. To search the contents instead, see the sketch below.

    hadoop fs -find /apps/hdmi-technology/b_dps/real-time -name "*bcd4bc3e1380a56108f486a4fffbc8dc*"

    hadoop fs -find /apps/hdmi-technology/b_dps/real-time -name "bcd4bc3e1380a56108f486a4fffbc8dc"
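
    A sketch of combining -find with cat and grep to search contents (adjust the -name pattern to whatever the files are actually called):

    hadoop fs -find /apps/hdmi-technology/b_dps/real-time -name "*" | \
    while read f
    do
      # cat fails (silently, due to 2>/dev/null) on directories, so only regular files can match
      hadoop fs -cat "$f" 2>/dev/null | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
    done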
    
  • 2020-12-30 02:26

    To find all files with a given extension (for example .log) recursively under an HDFS path:

    hadoop fs -find hdfs_loc_path -name "*.log"
    
  • 2020-12-30 02:30

    You are looking to apply the grep command to an HDFS folder:

    hdfs dfs -cat /user/coupons/input/201807160000/* | grep -c null
    

    Here cat streams every file in the folder (via the * glob) and grep -c counts the matching lines.
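
    If you want the count per file instead of one total, a small variation on the same idea is possible (a sketch; same directory as above):

    hdfs dfs -ls /user/coupons/input/201807160000/ | awk '{print $8}' | \
    while read f
    do
      # print "<count> <path>" for every file in the folder
      printf '%s\t%s\n' "$(hdfs dfs -cat "$f" | grep -c null)" "$f"
    done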

  • 2020-12-30 02:31

    Using hadoop fs -cat (or the more generic hadoop fs -text) might be feasible if you just have two 1 GB files. For 100 files, though, I would use the streaming API, because it can be used for ad-hoc queries without resorting to a full-fledged MapReduce job. E.g. in your case, create a script get_filename_for_pattern.sh:

    #!/bin/bash
    # print the name of the current input file if its contents match the pattern
    grep -q "$1" && echo "$mapreduce_map_input_file"
    cat >/dev/null # ignore the rest of the input
    

    Note that you have to read the whole input in order to avoid getting java.io.IOException: Stream closed exceptions.

    Then issue the commands

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar\
     -Dstream.non.zero.exit.is.failure=false\
     -files get_filename_for_pattern.sh\
     -numReduceTasks 1\
     -mapper "get_filename_for_pattern.sh bcd4bc3e1380a56108f486a4fffbc8dc"\
     -reducer "uniq"\
     -input /apps/hdmi-technology/b_dps/real-time/*\
     -output /tmp/files_matching_bcd4bc3e1380a56108f486a4fffbc8dc
    hadoop fs -cat /tmp/files_matching_bcd4bc3e1380a56108f486a4fffbc8dc/*
    

    In newer distributions, mapred streaming should work instead of hadoop jar $HADOOP_HOME/hadoop-streaming.jar. In the latter case you have to set $HADOOP_HOME correctly so that the jar can be found (or provide the full path directly).

    For simpler queries you don't even need a script but can just provide the command to the -mapper parameter directly. But for anything slightly complex it's preferable to use a script, because getting the escaping right can be a chore.

    If you don't need a reduce phase, provide the symbolic NONE parameter to the respective -reducer option (or just use -numReduceTasks 0). But in your case it's useful to have a reduce phase in order to have the output consolidated into a single file.
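
    For illustration, a map-only variant with an inline mapper (no script, no reduce phase) might look roughly like this; it prints the matching lines rather than the file names, and the output path is just an example:

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
     -Dstream.non.zero.exit.is.failure=false \
     -numReduceTasks 0 \
     -mapper "grep bcd4bc3e1380a56108f486a4fffbc8dc" \
     -input /apps/hdmi-technology/b_dps/real-time/* \
     -output /tmp/lines_matching_bcd4bc3e1380a56108f486a4fffbc8dc
    hadoop fs -cat /tmp/lines_matching_bcd4bc3e1380a56108f486a4fffbc8dc/*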
