Sun's Study Blog

2025/01/09 Collection of English Sentences

Published papers that seem to refute your hypothesis or data are not etched in stone. The “For Authors” section of a website will have some nitty-gritty information. Factors such as limitations of a methodology and other limits to generalizability (selection bias, unaddressed, or unappreciated confounders) A newcomer to the field should get a crash course in the field from this section. Essential dat..

English 2025.01.09

[AWS] Reading parquet files in S3 with Python

Goal: I want to easily read parquet files stored in S3 from a personal computer rather than from a server such as AWS ec2. I want to manage the account credentials (access key, secret key) in code instead of saving them with aws configure. pyarrow and s3fs did not work for me, so I wanted another method I could use quickly. Code: import boto3 import io import pandas as pd ... s3_config = { "aws_access_key_id": "{ACCESS KEY}", "aws_secret_access_key": "{SECRET KEY}", "region_name": "{MY REGION}" } ... def pd_read_s3_parquet(key, bucke..
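
Since the excerpt above is cut off, here is a minimal sketch of the general idea under my own assumptions: download the object with a boto3 client built from an explicit config dict, wrap the bytes in a buffer, and hand it to pandas. The function names `make_s3_client`, `fetch_s3_object`, and `read_s3_parquet` are hypothetical and are not the original post's code. `pd.read_parquet` still needs a parquet engine (pyarrow or fastparquet) to be installed.

```python
import io


def make_s3_client(s3_config):
    """Build an S3 client from explicit credentials instead of `aws configure`.

    `s3_config` is assumed to hold aws_access_key_id, aws_secret_access_key
    and region_name, as in the post's config dict.
    """
    import boto3  # third-party; imported lazily so the helpers below load without it

    return boto3.client("s3", **s3_config)


def fetch_s3_object(s3_client, bucket, key):
    """Download one object fully into memory and return it as a seekable buffer."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return io.BytesIO(response["Body"].read())


def read_s3_parquet(s3_client, bucket, key, **read_kwargs):
    """Read a single parquet object from S3 into a pandas DataFrame."""
    import pandas as pd  # needs a parquet engine such as pyarrow or fastparquet

    return pd.read_parquet(fetch_s3_object(s3_client, bucket, key), **read_kwargs)
```

Usage would look like `df = read_s3_parquet(make_s3_client(s3_config), "my-bucket", "path/to/file.parquet")`, where the bucket and key are placeholders. The whole object is held in memory, so this suits files that fit comfortably in RAM.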

Misc 2022.02.03

[Hadoop] Distcp from HDFS to S3

Goal: When uploading data from on-premise hdfs to s3, moving it with get and put is cumbersome. Method: hadoop distcp -Dfs.s3a.access.key=$AWS_ACCESS_KEY -Dfs.s3a.secret.key=$AWS_SECRET_KEY -Dfs.s3a.endpoint=$AWS_END_POINT $HDFS_SOURCE_PATH s3a://$S3_DEST_PATH For fs.s3a.endpoint, refer to the AWS service endpoints (e.g. s3.ap-northeast-2.amazonaws.com).

Misc 2022.01.17

[Spark] Handling special characters when saving json -> parquet

Goal: With parquet, an attribute name that contains any of the characters " ,;{}()\n\t=" has to be renamed. The error message looks like this. 'Attribute name "my column" contains invalid character(s) among " ,;{}()\\n\\t=". Please use alias to rename it.;' ... pyspark.sql.utils.AnalysisException: 'Attribute name "some-ar ray" contains invalid character(s) among " ,;{}()\\n\\t=". Please use alias to rename it.;' When saving json as parquet, if the keys contain these characters..
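
The rest of the post is truncated here, so the following is only a sketch of one possible rename step, not necessarily the fix the post describes. It replaces each character from the error message's list with an underscore. The helper name `sanitize_column_name` and the choice of "_" as the replacement are my assumptions.

```python
import re

# The characters the error message lists: space , ; { } ( ) newline tab =
_INVALID_PARQUET_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_column_name(name, replacement="_"):
    """Replace every character parquet rejects in an attribute name."""
    return _INVALID_PARQUET_CHARS.sub(replacement, name)


def sanitize_column_names(names, replacement="_"):
    """Sanitize a list of names, e.g. the `columns` list of a DataFrame."""
    return [sanitize_column_name(n, replacement) for n in names]
```

With a PySpark DataFrame, the top-level columns could then be renamed with `df = df.toDF(*sanitize_column_names(df.columns))`. This only touches top-level columns; keys nested inside structs would need a recursive walk over the schema, and two different names can collide after replacement.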

Misc 2021.10.07

[Spark] Resolving schema conflicts when saving json -> parquet

Goal: I want to save files stored as json to parquet with an added partition. If the json contains an empty array, inferSchema represents it with an array(string) schema. In that case, an error occurs when another file has a different kind of value, such as a struct, inside the array. A pre-defined schema would have been ideal, but I did not have one. The error message I ran into is below. java.lang.ClassCastException: optional binary element (UTF8) is not a group at org.apache.parquet.schema.Type.asGroupType(Type.java:207) at org.apache.s..

Misc 2021.10.06

[Postgresql] Checking a table's DDL

Goal: Postgresql has no show create statement. Let's look at how to check the ddl. Steps: In postgresql, pg_dump has to be used to check the ddl. Enter the command below to view it. pg_dump -h ${hostname} -U ${username} -t '${schemaname}.${tablename}' --schema-only ${dbname} If you need more options, run pg_dump --help to see them.

Misc 2021.09.14

[Postgresql] Installing Postgresql13 from rpm

Goal: The postgre I installed with yum did not match the remote version, so tasks such as pg_dump were not possible. Since I also wanted a local test environment, I decided to install postgresql 13.3. Steps: I needed version 13.3 specifically. Go to the Postgresql download site and choose the os you want. I needed a linux-based install, so I chose linux >> Red Hat / Rocky/CentOS. Installing with yum did not work for me, so I used Direct RPM download at the very bottom. Select the red "direct download". On the next screen, I chose POSTGRESQL13 >> RHEL/CentOS/Oracle Linux 7 - x86_64..

Misc 2021.08.27

[Spark] Parsing JSON strings

Goal: When using the json data source in spark, the json records must be separated by newlines to be recognized correctly. In the data I work with, the newlines were missing, as shown below. {"a":"b"}{"a":"c"}{"a":"d"}{"b":"e"}... Here is a simple Python function that comes in handy in such cases. Content: import re import json def json_splitter(input_json): r = re.split('(\{.*?\})(?= *\{)', input_json) accumulator = '' res = [] for subs in r: accumulator += subs try: json_dict = json.loads(accumulator)..
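
The function above is cut off in this excerpt. As an alternative sketch (a different technique from the regex split, and not the post's own code), the standard library's `json.JSONDecoder.raw_decode` can walk through a string of back-to-back JSON values, because it returns both the parsed value and the index where it stopped. The name `split_concatenated_json` is hypothetical.

```python
import json


def split_concatenated_json(text):
    """Parse JSON values that are concatenated without newline separators.

    Returns the parsed values in order. Raises json.JSONDecodeError if any
    part of the input is not valid JSON.
    """
    decoder = json.JSONDecoder()
    results = []
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        value, pos = decoder.raw_decode(text, pos)
        results.append(value)
    return results
```

Each parsed value can then be written back out with `json.dumps` on its own line, which gives the newline-delimited form that the spark json data source expects.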

Misc 2021.08.26

[Gradle] Could not create service of type ChecksumService using BuildSessionScopeServices.createChecksumService().

Symptom: The following error message appeared during a Gradle build. Gradle could not start your build. > Could not create service of type ChecksumService using BuildSessionScopeServices.createChecksumService(). > Cannot lock checksums cache ** as it has already been locked by this process. Fix: The problem may be caused by several gradle daemons running on the same server. Use ps -ef | grep gradle to find the gradle processes, stop them with kill, and then try again.

Misc 2021.08.25

[Postgresql] Installation

Goal: Install postgre in order to use postgresql and its cli. Steps: I did this in an aws ec2 environment. It should work in any linux-based environment where yum is available. I installed postgresql as follows. # install postgresql and the environment it needs sudo yum install -y postgresql postgresql-server postgresql-devel postgresql-contrib postgresql-docs # set postgresql up to run sudo service postgresql initdb sudo systemctl start postgresql sudo systemctl enable postgr..

Misc 2021.08.24


All posts: 52
