data-cleaning

How to extract emojis and flags from strings in Python?

久未见 posted on 2019-12-11 04:47:39
Question:

import emoji

def emoji_lis(string):
    _entities = []
    for pos, c in enumerate(string):
        if c in emoji.UNICODE_EMOJI:
            print("Matched!!", c, c.encode('ascii', "backslashreplace"))
            _entities.append({
                "location": pos,
                "emoji": c
            })
    return _entities

emoji_lis("👧🏿 مدیحہ🇵🇰 así, se 😌 ds 💕👭")

Matched!! 👧 \U0001f467
Matched!! 🏿 \U0001f3ff
Matched!! 😌 \U0001f60c
Matched!! 💕 \U0001f495
Matched!! 👭 \U0001f46d

My code works for all other emojis, but how can I detect country flags such as 🇵🇰?

Answer 1: Here is an article
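A flag such as 🇵🇰 is encoded as two consecutive regional indicator symbols (U+1F1E6 to U+1F1FF), so a loop that tests one character at a time never sees the pair as a unit. The following is a minimal standard-library sketch of one way to pick flags out of a string; it is not taken from the truncated answer above, and the names `find_flags` and `flag_to_country_code` are made up for illustration.

```python
import re

# A flag is two regional indicator symbols in a row (U+1F1E6..U+1F1FF).
FLAG_RE = re.compile("[\U0001F1E6-\U0001F1FF]{2}")


def find_flags(text):
    """Return the position and text of every regional-indicator pair."""
    return [
        {"location": match.start(), "emoji": match.group()}
        for match in FLAG_RE.finditer(text)
    ]


def flag_to_country_code(flag):
    """Map each regional indicator back to its ASCII letter (hypothetical helper)."""
    return "".join(chr(ord(ch) - 0x1F1E6 + ord("A")) for ch in flag)


if __name__ == "__main__":
    sample = "👧🏿 مدیحہ🇵🇰 así, se 😌 ds 💕👭"
    for hit in find_flags(sample):
        print(hit["location"], hit["emoji"], flag_to_country_code(hit["emoji"]))
```

The pattern matches any pair of regional indicators, so it does not check that the pair is an assigned country code; that check would need a lookup table.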

How to use R to check data consistency (make sure no contradiction between case and value)?

一世执手 posted on 2019-12-11 03:14:07
Question: Let's say I have:

Person  Movie    Rating
Sally   Titanic  4
Bill    Titanic  4
Rob     Titanic  4
Sue     Cars     8
Alex    Cars     **9**
Bob     Cars     8

As you can see, there is a contradiction for Alex. All rows for the same movie should have the same rating, but there was a data entry error for Alex. How can I use R to solve this? I've been thinking about it for a while, but I can't figure it out. Do I have to just do it manually in Excel or something? Is there a command in R that will return all the cases where there are data
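The question asks for R, but the check itself is language-independent: group the rows by movie and report any movie that carries more than one distinct rating. Here is a minimal sketch of that idea in plain Python, using the table from the question; the function names are hypothetical, and treating the most common rating as the correct one is an assumption.

```python
from collections import Counter, defaultdict

rows = [
    ("Sally", "Titanic", 4),
    ("Bill", "Titanic", 4),
    ("Rob", "Titanic", 4),
    ("Sue", "Cars", 8),
    ("Alex", "Cars", 9),
    ("Bob", "Cars", 8),
]


def inconsistent_movies(rows):
    """Return {movie: sorted distinct ratings} for movies with conflicting ratings."""
    ratings = defaultdict(set)
    for _person, movie, rating in rows:
        ratings[movie].add(rating)
    return {m: sorted(r) for m, r in ratings.items() if len(r) > 1}


def suspect_rows(rows):
    """Return rows whose rating differs from the most common rating for that movie.

    Assumption: the majority rating is the correct one.
    """
    counts = defaultdict(Counter)
    for _person, movie, rating in rows:
        counts[movie][rating] += 1
    majority = {m: c.most_common(1)[0][0] for m, c in counts.items()}
    return [row for row in rows if row[2] != majority[row[1]]]


if __name__ == "__main__":
    print(inconsistent_movies(rows))
    print(suspect_rows(rows))
```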

Drop variable in panel data in R conditional based on a defined number of consecutive observations

我只是一个虾纸丫 posted on 2019-12-10 22:23:13
Question: I am quite new to R. My problem is as follows: I have a set of panel data organised as time series like this (only part is shown):

Week_Starting  Team A  Team B  Team C  Team D
2010-01-02     1       2       3       4
2010-01-09     2       40      1       5
2010-01-16     15      <NA>    4       11
2010-01-23     25      <NA>    7       18
2010-01-30     38      <NA>    9       29
2010-02-06     <NA>    <NA>    12      34
2010-02-13     <NA>    <NA>    16      40
2010-02-20     <NA>    <NA>    20      <NA>
2010-02-27     <NA>    <NA>    15      28
2010-03-06     <NA>    <NA>    20      <NA>
2010-03-13     <NA>    <NA>    24      <NA>
2010-03-20     <NA>    <NA>    24      <NA>
2010-03-27     <NA>

R - simple Record Linkage - the next step ?

浪尽此生 posted on 2019-12-09 13:44:29
Question: I am trying to do some simple direct linkage with the library('RecordLinkage'). I only have one vector:

tv3 = c("TOURDEFRANCE", 'TOURDEFRANCE', "TOURDE FRANCE", "TOURDE FRANZ", "GET FRESH")

The function that I need is compare.dedup from the library('RecordLinkage'), and I get:

compare.dedup(as.data.frame(tv3))$pairs

$pairs
  id1 id2 tv3 is_match
1   1   2   1       NA
2   1   3   0       NA
3   1   4   0       NA
4   1   5   0       NA
5   2   3   0       NA
....

I have trouble finding documentation for the next step. How do I then compare and find my

Cleaning data scraped using Scrapy

你说的曾经没有我的故事 posted on 2019-12-06 14:46:21
Question: I have recently started using Scrapy and am trying to clean some data I have scraped and want to export to CSV, namely the following three examples:

Example 1 – removing certain text
Example 2 – removing/replacing unwanted characters
Example 3 – splitting comma separated text

Example 1 data looks like: Text I want,Text I don’t want

Using the following code:

'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract()

Example 2 data looks like: Â - but I want to change this to £

Using

Python Pandas — Forward filling entire rows with value of one previous column

扶醉桌前 posted on 2019-12-05 16:03:14
New to pandas development. How do I forward fill a DataFrame with the value contained in one previously seen column? Self-contained example:

import pandas as pd
import numpy as np

O = [1, np.nan, 5, np.nan]
H = [5, np.nan, 5, np.nan]
L = [1, np.nan, 2, np.nan]
C = [5, np.nan, 2, np.nan]
timestamps = ["2017-07-23 03:13:00", "2017-07-23 03:14:00", "2017-07-23 03:15:00", "2017-07-23 03:16:00"]
dict = {'Open': O, 'High': H, 'Low': L, 'Close': C}
df = pd.DataFrame(index=timestamps, data=dict)
ohlc = df[['Open', 'High', 'Low', 'Close']]

This yields the following DataFrame:

print(ohlc)

Open High Low
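The question is cut off before the desired output, so the sketch below rests on an assumption drawn from the title: every empty row should be filled, in all four columns, with the most recent `Close` value. Under that assumption, one possible approach is to forward-fill `Close` on its own and then use the result to fill the gaps in every column.

```python
import numpy as np
import pandas as pd

timestamps = [
    "2017-07-23 03:13:00",
    "2017-07-23 03:14:00",
    "2017-07-23 03:15:00",
    "2017-07-23 03:16:00",
]
df = pd.DataFrame(
    {
        "Open": [1, np.nan, 5, np.nan],
        "High": [5, np.nan, 5, np.nan],
        "Low": [1, np.nan, 2, np.nan],
        "Close": [5, np.nan, 2, np.nan],
    },
    index=timestamps,
)

# Carry the last observed Close forward into the empty rows.
last_close = df["Close"].ffill()

# Fill the remaining NaNs in every column with that carried-forward Close
# (fillna with a Series aligns on the index).
filled = df.apply(lambda col: col.fillna(last_close))

print(filled)
```

Rows that already hold data are left untouched, because `fillna` only writes into missing cells.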

Cleaning data scraped using Scrapy

跟風遠走 posted on 2019-12-04 20:23:12
I have recently started using Scrapy and am trying to clean some data I have scraped and want to export to CSV, namely the following three examples:

Example 1 – removing certain text
Example 2 – removing/replacing unwanted characters
Example 3 – splitting comma separated text

Example 1 data looks like: Text I want,Text I don’t want

Using the following code:

'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract()

Example 2 data looks like: Â - but I want to change this to £

Using the following code:

' Scraped 2': response.xpath('//html/body/div/div/section/div/form/div/div/em/text(
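The three clean-up steps listed in the question can be sketched as plain string helpers, independent of Scrapy itself. This is only an illustration: the function names are hypothetical, and it assumes that the stray "Â" is the usual mis-decoded prefix of "£" (so "Â£" becomes "£", and a bare "Â" is replaced as the question literally asks).

```python
def keep_first_field(raw):
    """Example 1: keep only the text before the first comma."""
    return raw.split(",", 1)[0].strip()


def fix_pound_sign(raw):
    """Example 2: turn the stray 'Â' into '£'.

    Assumption: 'Â£' is a mis-decoded '£'; a bare 'Â' is replaced too,
    as the question literally asks.
    """
    return raw.replace("Â£", "£").replace("Â", "£")


def split_fields(raw):
    """Example 3: split comma-separated text into trimmed, non-empty parts."""
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    print(keep_first_field("Text I want,Text I don’t want"))
    print(fix_pound_sign("Â£10"))
    print(split_fields("red, green ,blue"))
```

Since `.extract()` returns a list of strings, each helper would be applied per element, for example `[keep_first_field(s) for s in extracted]`.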

Splitting a single column into multiple observation using R

一曲冷凌霜 posted on 2019-12-04 03:09:10
Question: I am working on HCUP data, which has a range of values in one single column that needs to be split into multiple rows. Below is the HCUP data frame for reference:

code         label
61000-61003  excision of CNS
0169T-0169T  ventricular shunt

The desired output should be:

code   label
61000  excision of CNS
61001  excision of CNS
61002  excision of CNS
61003  excision of CNS
0169T  ventricular shunt

My approach to this problem uses the package splitstackshape and this code:

library(data.table)
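The question is about R, but the expansion logic can be sketched in a few lines of plain Python. The sketch assumes that every code is a run of digits with an optional letter suffix, and that both ends of a range share the same suffix and digit width; `expand_code_range` is a hypothetical name.

```python
import re

CODE_RE = re.compile(r"(\d+)([A-Za-z]*)")


def expand_code_range(code_range):
    """Expand 'START-END' into every code in between, inclusive.

    Assumption: codes are digits plus an optional letter suffix, and both
    ends of the range use the same suffix.
    """
    start, end = code_range.split("-")
    m_start, m_end = CODE_RE.fullmatch(start), CODE_RE.fullmatch(end)
    if not m_start or not m_end or m_start.group(2) != m_end.group(2):
        raise ValueError(f"unsupported code range: {code_range!r}")
    width = len(m_start.group(1))  # preserve leading zeros such as '0169'
    suffix = m_start.group(2)
    low, high = int(m_start.group(1)), int(m_end.group(1))
    return [f"{n:0{width}d}{suffix}" for n in range(low, high + 1)]


if __name__ == "__main__":
    table = [("61000-61003", "excision of CNS"), ("0169T-0169T", "ventricular shunt")]
    for code_range, label in table:
        for code in expand_code_range(code_range):
            print(code, label)
```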

How to clean and re-code check-all-that-apply responses in R survey data?

送分小仙女□ posted on 2019-12-04 01:52:53
Question: I've got survey data with some multiple-response questions like this:

HS18 Why is it difficult to get medical care in South Africa? (Select all that apply)
1 Too expensive
2 No transportation to the hospital/clinic
3 Hospital/clinic is too far away
4 Hospital/clinic staff do not speak my language
5 Hospital/clinic staff do not like foreigners
6 Wait time too long
7 Cannot take time off of work
8 None of these. I have no problem accessing medical care

where multiple responses were entered with

R - simple Record Linkage - the next step ?

拈花ヽ惹草 posted on 2019-12-03 20:08:25
I am trying to do some simple direct linkage with the library('RecordLinkage'). I only have one vector:

tv3 = c("TOURDEFRANCE", 'TOURDEFRANCE', "TOURDE FRANCE", "TOURDE FRANZ", "GET FRESH")

The function that I need is compare.dedup from the library('RecordLinkage'), and I get:

compare.dedup(as.data.frame(tv3))$pairs

$pairs
  id1 id2 tv3 is_match
1   1   2   1       NA
2   1   3   0       NA
3   1   4   0       NA
4   1   5   0       NA
5   2   3   0       NA
....

I have trouble finding documentation for the next step. How do I then compare and find my similar pairs? I found the distance function jarowinkler(), but it returns only pairs. Basically, you can only