Question
I'm trying to write a script in Databricks that will select a file based on certain characters in its name, or just on the datestamp in the name.
For example, a file name looks like this:
LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31
I have created the following code in Databricks:
import datetime
now1 = datetime.datetime.now()
now = now1.strftime("%Y-%m-%d")
Using the above code, I tried to select the file with the following:
'LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv' % now
However, if you look closely you will notice that there is a space between the datestamp and the timestamp, i.e. between 22 and 06:
LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31
It is this space that prevents my code above from working: the formatted name contains only the date, so it never equals the full file name.
I don't think Databricks supports wildcards, so the following won't work:
'LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv' % now
Someone once suggested truncating the timestamp.
Can someone let me know:
A. Whether truncating will solve this problem
B. Whether there is a way to make my code 'LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv' % now select the whole file
Bear in mind that I definitely need to select based on the current date. I just want to be able to use my code to select the file.
Answer 1:
You can list the file names with dbutils and test each one in an if statement: if now in filename. So instead of reading files by a specific pattern directly, you get a list of files and then copy the specific files that match your required pattern.
The following code works in a Databricks Python notebook:
1. Writing three files to the filesystem:
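The matching step itself needs nothing from Databricks, so it can be tried on a plain list of names first. The following is a minimal sketch under that assumption: the helper files_for_date and the sample names are hypothetical, modelled on the file name in the question with an assumed .csv extension.

```python
def files_for_date(names, date_str):
    """Return the names that contain date_str anywhere in them."""
    return [name for name in names if date_str in name]


# Hypothetical sample names; only the datestamp differs between them.
names = [
    "LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31.csv",
    "LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-21 06-07-31.csv",
]

# The space and the time after the date do not matter for a substring test.
matches = files_for_date(names, "2018-12-22")
print(matches)
```

Because the test is a substring check, the time portion after the space never has to be known or removed.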
data = """
{"a":1, "b":2, "c":3}
{"a":{, b:3}
{"a":5, "b":6, "c":7}
"""
dbutils.fs.put("/mnt/adls2/demo/files/file1-2018-12-22 06-07-31.json", data, True)
dbutils.fs.put("/mnt/adls2/demo/files/file2-2018-02-03 06-07-31.json", data, True)
dbutils.fs.put("/mnt/adls2/demo/files/file3-2019-01-03 06-07-31.json", data, True)
2. Reading the file names as a list:
files = dbutils.fs.ls("/mnt/adls2/demo/files/")
3. Getting the current date:
import datetime
now = datetime.datetime.now().strftime("%Y-%m-%d")
print(now)
Output: 2019-01-03
4. Copying the files whose names contain the current date:
for f in files:
    # f.name is the bare file name, f.path the full path returned by dbutils.fs.ls
    if now in f.name:
        dbutils.fs.cp(f.path, '/mnt/adls2/demo/target/' + f.name)
        print('copied ' + f.name)
    else:
        print('not copied ' + f.name)
Output:
not copied file1-2018-12-22 06-07-31.json
not copied file2-2018-02-03 06-07-31.json
copied file3-2019-01-03 06-07-31.json
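The question also asks about wildcards and about truncating the timestamp. Both can be done on the file name in plain Python once the names have been listed. This is only a sketch, not part of the original answer: PREFIX, matches_date and file_date are hypothetical names, and the .csv extension is assumed from the question's pattern.

```python
import datetime
import fnmatch

# Assumed fixed prefix, taken from the file name in the question.
PREFIX = "LCMS_MRD_Delta_LoyaltyAccount_1992_"


def matches_date(name, day):
    """Wildcard match: prefix + date + anything + .csv."""
    pattern = "%s%s*.csv" % (PREFIX, day.strftime("%Y-%m-%d"))
    return fnmatch.fnmatchcase(name, pattern)


def file_date(name):
    """Truncation: keep only the 10-character date that follows the prefix."""
    stamp = name[len(PREFIX):len(PREFIX) + 10]
    return datetime.datetime.strptime(stamp, "%Y-%m-%d").date()


name = "LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31.csv"
day = datetime.date(2018, 12, 22)
print(matches_date(name, day))
print(file_date(name) == day)
```

Either check could replace the if now in ... test in step 4. The wildcard version is stricter because it also pins the prefix and the extension.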
Source: https://stackoverflow.com/questions/54007074/how-to-truncate-and-or-use-wildcards-with-databrick