Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing accents errors in excel #14

Open
PaulaSpinola opened this issue Aug 22, 2019 · 4 comments
Open

Fixing accents errors in excel #14

PaulaSpinola opened this issue Aug 22, 2019 · 4 comments
Labels
bug Something isn't working general General stuff (e.g. topic to discuss, book recommendation, best sushi in town)

Comments

@PaulaSpinola
Copy link

I need to download some text data in Portuguese which comes with accents (e.g. ã, í, é, ...). The website gives me the option to download the file either in excel or pdf. Although I want the data in excel, it reads the accents with error whereas pdf doesn't (the files are attached). Would anyone know how to fix such errors in excel?
Many thanks
Paula

certificados_cnrm_AM.xls.xlsx
certificados_cnrm_AM.pdf

@alhenry alhenry transferred this issue from ucl-ihi/Practicals Aug 22, 2019
@alhenry
Copy link
Member

alhenry commented Aug 22, 2019

Hi Paula,

This may not the best way to do it, but I think I manage to open it with:

  1. Download the .pdf version
  2. Open in Adobe Acrobat (I use Adobe Acrobat Pro DC ver 19.12.20036 in MacOS X)
  3. Select File > Export to > spread sheet > Format: Excel Workbook
  4. Select Settings
  5. Set Language to "Brazilian Portuguese"

Screenshot 2019-08-22 15 39 49

Results here:
certificados_cnrm_AM.xlsx

Is this what you want?

@alhenry alhenry added bug Something isn't working general General stuff (e.g. topic to discuss, book recommendation, best sushi in town) labels Aug 22, 2019
@PaulaSpinola
Copy link
Author

PaulaSpinola commented Aug 23, 2019

Many thanks Albert.
I already tried that but then another problem comes up. Some individual level data which should be shown in a unique row are broken into two rows (e.g. rows 8 & 9, rows 17 & 18, rows 35 & 36). I am actually trying to write a VBA code to correct that, but am struggling on that (I might write another post on it if that is ok).
I also try the other way around of what you suggested. Download the excel file from the system after making sure the Portuguese language is active in excel spreadsheets. It didn't work either.

image

@alhenry
Copy link
Member

alhenry commented Aug 23, 2019

Hmmm, I don't know VBA, but if the problem now is just the "broken rows", this can be fixed using other programming languages as I think the main reason for this is due to incorrect page break in column (2 and 6). Here is an example using R:

library(readxl)
library(dplyr)
library(stringr)

file <- "https://github.com/ucl-ihi/CodeClub/files/3530647/certificados_cnrm_AM.xlsx"
tmp <- tempfile(fileext = ".xlsx")
httr::GET(url = file, httr::write_disk(tmp))

df <- read_excel(tmp, skip = 1)

df_neat <- df %>%
  # remove column 3 & 13 (importing issue)
  select(-3, -13) %>% 
  # concatenate value column 2 & 6 with the next row's value
  # if the next row's value in column 1 (ID) is missing (NA)
  mutate_at(vars(2,6),
            ~if_else(is.na(lead(df[[1]])),
                     paste(., lead(.)), .)) %>% 
  # remove duplicated whitespace and \n character
  mutate_if(is.character, str_squish) %>% 
  # remove rows with missing column 1 (ID) value 
  filter(!is.na(.[[1]]))

# Fix column name
old_cols <- colnames(df) %>% str_squish
new_cols <- c(old_cols[1],
              str_split(old_cols[3], " ")[[1]],
              old_cols[11], old_cols[13])
colnames(df_neat) <- new_cols

writexl::write_xlsx(df_neat, "OUTFILE.xlsx")

Results:
OUTFILE.xlsx

You may need to reformat some of the columns (e.g. change Date column format to "Short Date) in excel to get the output that you want.

@danlewer
Copy link

danlewer commented Sep 2, 2019

It might be easier to work with the PDF - there are various ways to extract tables from a PDF - e.g. https://pdftables.com - or lots of advice if you Google!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working general General stuff (e.g. topic to discuss, book recommendation, best sushi in town)
Projects
None yet
Development

No branches or pull requests

3 participants