data-cleaning

Select Pandas rows with regex match

Submitted by 天涯浪子 on 2019-12-13 02:47:07
Question: I have the following data frame, and an input list of values. I want to match each item from the input list against the Symbol and Synonym columns in the data frame and extract only those rows where the input value appears in either the Symbol column or the Synonym column (note that the Synonym values are separated by a '|' symbol). In the output data frame I need an additional column, Input_symbol, denoting the matching value. So here the desired output should look like
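A minimal pandas sketch of one way to do this (the frame, the synonym strings, and the input list below are made-up stand-ins for the question's data): split the pipe-separated synonyms into one row per synonym with `explode`, stack that against the Symbol column itself, and keep the rows whose value is in the input list.

```python
import pandas as pd

df = pd.DataFrame({
    "Symbol": ["A1BG", "A2M", "NAT1"],
    "Synonym": ["A1B|ABG|GAB", "A2MD|CPAMD5", "AAC1|MNAT|NAT-1"],
})
inputs = ["ABG", "NAT1", "CPAMD5"]

# One candidate row per synonym, plus one per symbol, each carrying
# the candidate value in the new Input_symbol column.
by_synonym = df.assign(Input_symbol=df["Synonym"].str.split("|")).explode("Input_symbol")
by_symbol = df.assign(Input_symbol=df["Symbol"])
candidates = pd.concat([by_symbol, by_synonym])

# Keep only candidates whose Input_symbol is in the input list.
result = candidates[candidates["Input_symbol"].isin(inputs)].drop_duplicates()
print(result)
```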

R: Cleaning up a wide and untidy dataframe

Submitted by ☆樱花仙子☆ on 2019-12-12 19:16:11
Question: I have a data frame that looks like: d<-data.frame(id=(1:9), grp_id=(c(rep(1,3), rep(2,3), rep(3,3))), a=rep(NA, 9), b=c("No", rep(NA, 3), "Yes", rep(NA, 4)), c=c(rep(NA,2), "No", rep(NA,6)), d=c(rep(NA,3), "Yes", rep(NA,2), "No", rep(NA,2)), e=c(rep(NA, 7), "No", NA), f=c(NA, "No", rep(NA,3), "No", rep(NA,2), "No")) >d id grp_id a b c d e f 1 1 1 NA No <NA> <NA> <NA> <NA> 2 2 1 NA <NA> <NA> <NA> <NA> No 3 3 1 NA <NA> No <NA> <NA> <NA> 4 4 2 NA <NA> <NA> Yes <NA> <NA> 5 5 2 NA Yes <NA> <NA>
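The question is in R, but the tidying idea carries over directly: melt the wide answer columns into long form, then drop the all-NA cells, leaving one (id, question, answer) row per response. A pandas sketch, re-creating the question's `d` by hand:

```python
import pandas as pd

# Re-creation of the question's data frame d (None plays the role of NA).
df = pd.DataFrame({
    "id": range(1, 10),
    "grp_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "a": [None] * 9,
    "b": ["No", None, None, None, "Yes", None, None, None, None],
    "c": [None, None, "No", None, None, None, None, None, None],
    "d": [None, None, None, "Yes", None, None, "No", None, None],
    "e": [None] * 7 + ["No", None],
    "f": [None, "No", None, None, None, "No", None, None, "No"],
})

# Wide -> long, then discard the empty cells: each id keeps only the
# question column it actually answered.
tidy = (df.melt(id_vars=["id", "grp_id"], var_name="question", value_name="answer")
          .dropna(subset=["answer"])
          .sort_values("id")
          .reset_index(drop=True))
print(tidy)
```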

Removing all “H” within the strings, EXCEPT the ones including “CH”

Submitted by 旧巷老猫 on 2019-12-12 17:05:46
Question: I am trying to remove all "H" within the strings, EXCEPT the ones that are part of "CH", in the following example: strings <- c("Cash","Wishes","Chain","Chip","Check") I found that the code below removes only "H": data <- gsub("H", "", strings) Answer 1: You can do this with a negative look-behind: gsub("(?<!c)h", "", strings, perl=TRUE, ignore.case = TRUE) Source: https://stackoverflow.com/questions/47538826/removing-all-h-within-the-strings-except-the-ones-including-ch
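The same negative look-behind works unchanged in Python's `re` module, for anyone doing this cleaning step in pandas rather than R (the strings are the ones from the question):

```python
import re

strings = ["Cash", "Wishes", "Chain", "Chip", "Check"]

# Remove every "h" NOT preceded by "c", case-insensitively, so the
# "h" in "Ch"/"ch" survives while all other h's are dropped.
cleaned = [re.sub(r"(?<!c)h", "", s, flags=re.IGNORECASE) for s in strings]
print(cleaned)
```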

Clean one column from long and big data set

Submitted by 拜拜、爱过 on 2019-12-12 04:59:30
Question: I am trying to clean only one column from some long, big data sets. The data has 18 columns and more than 10k rows across about 100 CSV files, of which I want to clean only one column. Input (only a few fields from the long list): userLocation, userTimezone, Coordinates, India, Hawaii, {u'type': u'Point', u'coordinates': [73.8567, 18.5203]} California, USA , New Delhi, Ft. Sam Houston,Mountain Time (US & Canada),{u'type': u'Point', u'coordinates': [86.99643, 23.68088]} Kathmandu,Nepal, Kathmandu, {u
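The coordinate cells in that sample are Python-repr dicts (note the `u''` prefixes), so one safe way to clean that single column is `ast.literal_eval`, which parses the literal without `eval()`. A sketch on one cell from the question's sample:

```python
import ast

cell = "{u'type': u'Point', u'coordinates': [73.8567, 18.5203]}"

# literal_eval safely parses Python literals (dicts, lists, numbers,
# strings); u'' prefixes are still valid string literals in Python 3.
point = ast.literal_eval(cell)
lon, lat = point["coordinates"]
print(lon, lat)
```

Applied per-row (e.g. via `Series.apply` with a try/except for the non-dict cells), this cleans the one column while leaving the other 17 untouched.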

Separate keywords and @ mentions from dataset

Submitted by 我怕爱的太早我们不能终老 on 2019-12-12 03:28:35
Question: I have a huge set of data with several columns and about 10k rows in more than 100 CSV files. For now I am concerned with only one column in message format, from which I want to extract two parameters. I searched extensively and found two solutions that seem close, but not close enough to solve the question here. ONE & TWO Input: column name "Text"; every message is a separate row in a CSV. "Let's Bounce!😉 #[message_1] Loving the energy & Microphonic Mayhem while…"
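Two small regexes cover both parameters; a sketch on a made-up message in the question's format (the `#[...]` keyword style is taken from the sample, the mention names are invented):

```python
import re

text = "Let's Bounce!😉 #[message_1] Loving the energy with @DJ_Spinner and @crew"

# Keywords appear as #[keyword]: capture everything between the brackets.
hashtags = re.findall(r"#\[([^\]]+)\]", text)

# Mentions are @ followed by word characters (letters, digits, underscore).
mentions = re.findall(r"@(\w+)", text)
print(hashtags, mentions)
```

Run over the "Text" column with `Series.str.findall`, this yields both parameter lists per row.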

R; DPLYR: Convert a list of dataframes into a single organized dataframe

Submitted by 走远了吗. on 2019-12-11 15:46:21
Question: I have a list with multiple entries; an example entry looks like: > head(gene_sets[[1]]) patient Diagnosis Eigen_gene ENSG00000080824 ENSG00000166165 ENSG00000211459 ENSG00000198763 ENSG00000198938 ENSG00000198886 1 689_120604 AD -0.5606425 50137 38263 309298 528233 523420 730537 2 412_120503 AD 0.9454632 44536 23333 404316 730342 765963 1168123 3 706_120605 AD 0.6061834 16647 22021 409498 614314 762878 1171747 4 486_120515 AD 0.8164779 21871 9836 518046 697051 613621 1217262 5 469_120514 AD
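dplyr's `bind_rows(.id = ...)` has a direct pandas analogue: `pd.concat` with `keys`, which stacks the list of frames and records which list element each row came from. A sketch on tiny stand-in frames built from the question's sample values:

```python
import pandas as pd

# Stand-ins for two entries of gene_sets (only a few columns shown).
gene_sets = [
    pd.DataFrame({"patient": ["689_120604"], "Diagnosis": ["AD"], "Eigen_gene": [-0.5606425]}),
    pd.DataFrame({"patient": ["412_120503"], "Diagnosis": ["AD"], "Eigen_gene": [0.9454632]}),
]

# keys= tags each frame's rows with its list position; reset_index
# turns that tag into an ordinary set_id column.
combined = (pd.concat(gene_sets, keys=range(len(gene_sets)), names=["set_id"])
              .reset_index(level="set_id"))
print(combined)
```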

I have some problems with data-cleaning

Submitted by 左心房为你撑大大i on 2019-12-11 14:09:50
Question: I have scraped a table from a Wikipedia page and am going to clean the data next. I have transformed the data into Pandas format and now have some problems cleaning it. Here is the code I executed to scrape the table from the Wikipedia page: import requests import pandas as pd website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text from bs4 import BeautifulSoup soup = BeautifulSoup(website_url,'lxml') print(soup.prettify()) My_table =
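For this particular postal-code table, the usual cleaning steps after scraping are to drop the "Not assigned" boroughs and merge neighbourhoods that share a postcode. A sketch on a hand-built stand-in for the parsed table (in the question it would come from BeautifulSoup or `pd.read_html` on the live page):

```python
import pandas as pd

# Stand-in for the scraped Wikipedia table.
df = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Regent Park", "Harbourfront"],
})

# Drop unassigned boroughs, then join neighbourhoods that share a
# postcode into one comma-separated row.
clean = (df[df["Borough"] != "Not assigned"]
         .groupby(["Postcode", "Borough"], as_index=False)["Neighbourhood"]
         .agg(", ".join))
print(clean)
```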

Force date as new line on reading non-delimited text file

Submitted by 南笙酒味 on 2019-12-11 11:00:03
Question: I am trying to read in and work with a horribly formatted debug log. There are no consistent delimiters, and it does not appear that line breaks are encoded either. What I'd like to do is read in and parse the data so there is a new line for each date (YYYY-MM-DD format). I am trying to work within the tidyverse but cannot seem to get something that will parse the file correctly. Is there a way to force lines to be delimited by a date pattern? None of these work: library(tidyverse) Log_File <- read
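The question asks about the tidyverse, but the underlying trick is language-agnostic: split on a zero-width look-ahead so each date begins a new record while staying attached to it. A Python sketch with a made-up one-line log:

```python
import re

raw = ("2019-01-05 startup ok some debug text 2019-01-06 cache miss "
       "retrying 2019-01-07 shutdown")

# (?=...) is zero-width, so the split happens *before* each
# YYYY-MM-DD date and the date is kept at the start of its chunk.
lines = [chunk.strip()
         for chunk in re.split(r"(?=\d{4}-\d{2}-\d{2})", raw)
         if chunk.strip()]
print(lines)
```

The same look-ahead idea works in R via `strsplit(raw, "(?=\\d{4}-\\d{2}-\\d{2})", perl = TRUE)`.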

Reshape Data Frame Based on Corresponding Column's Identifier R

Submitted by 自作多情 on 2019-12-11 08:43:48
Question: I'm trying to reshape a two-column data frame by collapsing the rows that share a ticker symbol in column 2 onto their own unique row, while turning the contents of column 1, the fields of data that correspond to those tickers, into their own columns. See for example a small sample, since it's a data frame with 500 tickers and 4 fields: test22 Ticker Current SharePrice $6.57 MFM Current NAV $7.11 MFM Current Premium/Discount -7.59% MFM 52WkAvg SharePrice $6.55
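This is a long-to-wide pivot. A pandas sketch on a stand-in frame (the column names "field"/"value" and the second ticker are invented; the question's data would be shaped the same way):

```python
import pandas as pd

# Stand-in for the question's two-column frame: field labels in one
# column, values (including the ticker itself) in the other.
test22 = pd.DataFrame({
    "field": ["Ticker", "Current SharePrice", "Current NAV",
              "Ticker", "Current SharePrice", "Current NAV"],
    "value": ["MFM", "$6.57", "$7.11", "NAD", "$12.10", "$13.02"],
})

# Tag each row with its ticker (the value on the "Ticker" rows,
# forward-filled down its block), then pivot field/value pairs wide.
test22["ticker"] = test22["value"].where(test22["field"] == "Ticker").ffill()
wide = (test22[test22["field"] != "Ticker"]
        .pivot(index="ticker", columns="field", values="value")
        .reset_index())
print(wide)
```

In R the equivalent last step is `tidyr::pivot_wider(names_from = field, values_from = value)`.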

R - identify which columns contain currency data $

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-11 07:04:03
Question: I have a very large dataset with some columns formatted as currency, some numeric, some character. When reading in the data, all currency columns are identified as factors and I need to convert them to numeric. The dataset is too wide to identify the columns manually. I am trying to find a programmatic way to identify whether a column contains currency data (e.g. starts with '$') and then pass that list of columns to be cleaned. name <- c('john','carl', 'hank') salary <- c('$23,456.33','$45,677.43','
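The detect-then-convert idea translates directly to pandas: treat a column as currency if every value starts with "$", then strip the "$" and commas and cast. A sketch on made-up data in the question's shape (the third salary is invented, since the sample is truncated):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "carl", "hank"],
    "salary": ["$23,456.33", "$45,677.43", "$67,889.53"],
    "age": [25, 31, 42],
})

# A string column "contains currency" if all its values start with "$".
currency_cols = [c for c in df.columns
                 if df[c].dtype == object and df[c].str.startswith("$").all()]

# Strip "$" and thousands separators, then convert to numeric.
for c in currency_cols:
    df[c] = df[c].str.replace("[$,]", "", regex=True).astype(float)
print(currency_cols, df["salary"].tolist())
```

In R the same test could be `sapply(df, function(x) all(startsWith(as.character(x), "$")))` followed by `gsub` and `as.numeric`.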